DOESciDAC ReviewOffice of Science
HARDWARE
Specialized Hardware for Supercomputing
What kind of computer do you imagine when you hear the terms "supercomputing" or "high-performance computing?" A Cray XT3/4/5? An IBM BlueGene? Or a number of rack-mounted IBM/Intel/AMD servers with Infiniband or some other fast network? Certainly, these machines do many large simulations, and from such simulations you can easily find numerous beautiful computer graphics. However, these big machines are not the only way to do large scientific calculations. The GRAPE and GRAPE-DR hardware (figures 1 and 2), developed at the Center for Computational Astrophysics, National Astronomical Observatory of Japan, are alternatives to typical supercomputing architecture.
 
Transistor Usage and Power Consumption
Accelerator hardware is one alternative to the most widely used supercomputer architectures. The latest example is the IBM Roadrunner system ("Science-Based Prediction at LANL" SciDAC Review 4, Summer 2007, p33), which started operation in June 2008. It consists of approximately 13,000 Cell BE processors, originally developed for the Sony PS3 game console, with double-precision enhancement (PowerXCell 8i). Two Cell processors are mounted on a blade, and two blades are connected to a dual-socket, dual-core Opteron blade through a PCI-Express interface. Thus, conceptually, each of the four cores of an Opteron blade controls one Cell processor. This combination of one Opteron blade and two Cell blades is called TriBlade. The Cell processor itself consists of one IBM PowerPC processor core and eight simpler processors—synergistic processor elements, each with two double-precision fused multiply and add units (FMAs). With a 3.2 Hz clock rate, one Cell processor achieves a theoretical peak speed of 102.4 gigaflops (Gflops) for double-precision operation. The entire Roadrunner system consists of 3,060 TriBlades, connected with an x4 DDR Infiniband network.
The GRAPE and GRAPE-DR hardware, developed at the Center for Computational Astrophysics, National Astronomical Observatory of Japan, are alternatives to typical supercomputing architecture.
With three different processors in one computational node, it is not an easy task to port existing applications to this machine. Developing a new application would be hard, too. What is the reason for such complex hardware? The simple answer is cost and power consumption.
Reportedly, the acquisition cost of the Roadrunner system is nearly $100 million. Thus, per-blade cost is around $10,000, for a per-blade peak performance of approximately 150 Gflops. For a similar price, an x86-based server with approximately 80 Gflops of theoretical peak (but not more) could be bought. That configuration yields a price-performance ratio of about a factor of two better. Power consumption shows a similar difference. According to the Green500 list for June 2008, Roadrunner gives 437 megaflops (Mflops) per watt, while the latest Intel Xeon-based systems give approximately 250 Mflops/W.
Source: J. Makino Illustration: A. Tovey
Figure 1. The interaction calculation pipeline of a GRAPE chip. A GRAPE-6 chip contains six identical pipelines, resulting in more than 300 arithmetic units in a chip.
These differences in cost and power consumption come primarily from the contrast in the basic designs of the processors. The PowerXCell 8i processor has 16 double-precision FMAs on a single 212-mm2 chip based on 65 nm technology. On the other hand, an Intel "Harpertown" quad-core Xeon processor has eight double-precision FMAs on a dual-die package and a total silicon area of 212 mm2 based on 45 nm technology. Because the silicon area needed for one transistor is approximately two times larger for 65 nm technology compared to 45 nm technology, the Cell processor has four times more floating point units per transistor compared to the latest Xeon chip. This factor-of-four difference in the efficiency of transistor usage resulted in the factor-of-two advantage in cost and power consumption. The cost discrepancy is only a factor of two and not four because of the technology used.
Can we do better than the Cell processor? There are several processors with many more floating-point arithmetic units. The first example is a graphic processing unit (GPU). The latest GPU chip from NVIDIA, GTX280, has 240 single-precision FMAs on a die and 30 double-precision FMAs. The clock speed is lower than that of the Cell. However, the number of single-precision floating-point arithmetic units is about a factor of eight larger, and single-precision performance is more than two times better than that of a Cell processor. The double-precision performance, on the other hand, is roughly the same. The die size of the GTX280 is approximately 600 mm2, or three times that of the Cell processor, and the power consumption is nearly doubled. The GTX280 processor is based on 65 nm technology.
Another example is the ClearSpeed CSX700 chip announced in July 2008. With 192 double-precision FMAs operating at 250 MHz clock speed, the CSX700 offers a theoretical peak performance of 96 Gflops. Although this speed is similar to the Cell or the GTX280, the CSX700 runs at a clock speed less than one-tenth of the Cell and has 12 times more arithmetic units. It is fabricated with IBM 90 nm technology, with approximately half the transistors as the Cell processor (256M). Thus, the efficiency in transistor use is 24 times better than that of the Cell and almost 100 times better than that of Intel Xeon processors.
J. Makino
Figure 2. The GRAPE-6 chip.
This disparity in transistor usage yields a very large difference in power consumption. The CSX700 processor consumes around 25 W of power while in operation, which is better than Cell by a factor of four. The reason why this factor is not 24 is partly because the process technology is older (a factor of two). Another factor of two or three comes from the more efficient use of the transistors themselves. Because a fair fraction of transistors of the CSX700 chips are actually used for logic for floating-point multipliers and adders, the fraction of transistors that switch in one clock cycle is higher for CSX700 than for Cell or Xeon. This is a rather small factor, even though the efficiency in the transistor usage is different by almost a factor of 100. This means that in Xeon or Cell processors, more than 99% of the transistors are used for things other than the logic for floating-point arithmetic units, and yet they consume significant power, much more than the floating-point arithmetic units.
From a historical point of view, the extremely low efficiency of general-purpose processors is the result of the evolution of microprocessors in the past 20 years. In 1989, Intel began shipping the 80860 processor, which could perform one double-precision multiplication in every two clock cycles and contained approximately a million transistors. Thus, the transistor efficiency of the Intel 80860 processor is almost the same as that of the CSX700 processor. The decrease of the transistor efficiency of general-purpose microprocessors in the last 20 years is essentially due to the so-called "memory wall," which is the growing disparity between processor speed and the speed of memory outside the processor. The 80860 processor had an external 64 bit data bus that operated with the same speed as the core logic of the processor. Roughly speaking, the memory could provide one datum per one floating-point operation. A present-day Intel central processing unit (CPU) also has an external 64 bit data bus, which operates typically with 50% of the core clock, even though the number of floating-point units increased from one to eight.
In many scientific simulations, it is necessary to solve N-body problems numerically. The gravitational N-body problem is one such example, which describes the evolution of many astronomical objects, from the solar system to the entire Universe.
Thus, memory bandwidth, normalized by the performance of floating-point arithmetic units, has degraded by a factor of 20 or so. In other words, in a processor with a large number of floating point units, this factor becomes close to 200. In the case of the ClearSpeed CSX700, this factor approaches 200. For GPUs, this factor is similar to that of x86 CPUs. The designers of GPUs devoted a large fraction of the silicon area and power to the external memory interface.
Whether or not large memory bandwidth is needed depends entirely on the application or, actually, on the specific numerical method used.
 
The Gravitational N-Body Problem
While general-purpose processors have become less and less efficient in transistor usage, researchers at the Center for Computational Astrophysics have been developing a series of special-purpose computers—called GRAPEs, or GRAvity PipEs—for gravitational N-body problems. In many scientific simulations, it is necessary to solve N-body problems numerically. The gravitational N-body problem is one such example, which describes the evolution of many astronomical objects, from the solar system to the entire Universe. In some cases, it is important to treat non-gravitational effects such as the hydrodynamical interaction, radiation, and magnetic fields, but gravity is the primary driving force that shapes the Universe.
T. Takeda and T. Saito, 4D2U Project, National Astronomical Observatory of Japan
Figure 3. An image of a simulated galaxy. Calculation was done on GRAPE-5 hardware. The number of particles used is about 2 x 106.
To solve the gravitational N-body problem, the gravitational force on each body (particle) in the system must be calculated from all other particles in the system (figure 3). There are many ways to do this. For example, if relatively low accuracy is acceptable, one method is the Barnes-Hut tree algorithm, or if the system is nearly uniform and the range of timescales of particles is not too large, the fast multipole method (FMM) is an option.
T. Ishiyama, University of Tokyo
Figure 4. The formation of dark-matter halos calculated using parallel tree code on a Cray XT4 in the National Astronomical Observatory of Japan. The total number of particles is 1,6003, and the total number of timesteps is 15,844.
An unusual feature of gravitational interaction is its long-range attractive force with no characteristic length scale (figure 4). This feature is the reason there are self-gravitating systems with a wide range of mass and size, from solar systems to clusters of galaxies. This wide range makes the numerical integration of gravitational N-body problem difficult, because a wide range in space resolution and time resolution is needed. Consider our galaxy. It is believed that more than 80% of the total mass of our galaxy is dark matter, with a radius of 100 to 200 kiloparsecs (kpc); the distance from the Sun to the center of our galaxy is approximately 8 kpc. Most stars are in the thin disk with a characteristic radius of approximately 5 kpc. These stars orbit around the center of the galaxy with a typical orbital period of 100 million years. However, near the center of the galaxy there is an immense central black hole with a mass 3.6 x 106 times the mass of our Sun, and stars orbit around this black hole with a 10 year orbital period. Some of the stars in our galaxy are in star clusters, whose mass is in the range of 103 to 107 times the solar mass. Their size is typically 10 parsecs (pc), and their orbital timescale is roughly one million years.
The gravitational N-body problem is not a multi-scale problem but a continuous-scale problem. There is no clear interface between small-scale physics and large-scale physics.
A real difficulty emerges from the fact that in any timescale between 100 million years and 10 years, there are stars with that timescale, because the gravity is attractive and has power-law dependence to the distance. In this sense, the gravitational N-body problem is not a multi-scale problem but a continuous-scale problem. There is no clear interface between small-scale physics and large-scale physics.
Thus, to simulate the formation and evolution of our galaxy and to model the growth of the central black hole and evolution of stars around it at the same time, a numerical scheme is needed that can handle a wide range of timescales.
Unfortunately, most of the numerical schemes that allow fast evaluation of gravitational interactions between particles have limitations in time resolution or space resolution or both. Here, time resolution means the ability to efficiently evaluate the forces on a small fraction of all particles that require small timesteps. The fast Fourier transform (FFT) is a typical example. It can evaluate the potential field for M equispaced grid points with O(M log M) computational cost. Thus, if all particles in the system have similar timescales and spatial resolution, FFT can be used to solve the Poisson equation. However, it is not practical to try to use FFT to obtain the gravitational field of a 100 kpc galaxy while resolving star clusters with 0.1 pc grid size. There are efforts to use adaptive mesh refinement techniques, but particle-based methods like simple direct summation or tree/FMM schemes can handle a wide range in spatial resolution without any difficulty.
Time resolution is another issue. Traditionally, a method called the individual timestep algorithm has been used. In this scheme, a particle with index i has its own time ti and timestep Δti, and particles with the smallest next time, ti + Δti are integrated. In this method, forces from other particles are calculated using their predicted positions. Some high-order predictor polynomial is needed for all particles, and usually the variable-timestep variation of the Adams–Moulton predictor-corrector scheme is used. Historically, a four-step, fourth-order scheme had been used, but currently a scheme that uses the third-order Hermite polynomial is widely used. With this Hermite scheme, it is necessary to calculate the first time derivative of the gravitational force, which means the calculation cost of one pairwise interaction is about a factor of two larger than the force itself. However, the Hermite interpolation formula allows longer timesteps, and therefore it is worthwhile to use the Hermite scheme. Unfortunately, there is no known way to combine the fast methods like tree or FMM with the individual timestep algorithm, which can efficiently handle a system with a wide range in the timesteps of particles. Thus, for many systems that require a wide range in timesteps, the direct method, which calculates the forces from all other particles directly, is still the method of choice.
 
GRAPE
A series of computers specialized for astrophysical N-body simulations have been developed, and the basic idea is shown in figure 5. The system consists of a host computer and special-purpose hardware that handles the calculation of gravitational interactions between particles. The host computer performs other calculations such as the time integration of particles, I/O, diagnostics, and so on. This structure can accelerate both the tree algorithm and the direct summation with individual timesteps.
Source: J. Makino Illustration: A. Tovey
Figure 5. The basic structure of a GRAPE system.
The system's most important advantage is the special-purpose part, which is extremely efficient in transistor usage. For example, the GRAPE-6 chip (figures 1 and 2, p55), which was completed in 1999, contains more than 250 arithmetic units (total of multipliers and adders) and six units to calculate x-2.5. Based on 250 nm technology, it contains less than 2% of the transistors of the latest Intel Xeon chip with 45 nm technology. Even so, compared to the 16 units (eight adders and eight multipliers) of the Intel Xeon chip, a GRAPE-6 chip contains more than 20 times more arithmetic units and is almost 1,000 times more efficient in transistor usage. Even after 10 years, this factor of 1,000 in efficiency still means a significant advantage in power consumption. A GRAPE-6 chip consumes about 13 W and provides 30 Gflops, compared to 95 W and 48 Gflops for 45 nm Xeon chips. GRAPE-6 is about five times more power efficient. Compared to the x86 processors of 1999, a GRAPE-6 chip was more than 20 times faster and more than 50 times more power efficient.
One reason the efficiency of general-purpose computers has been degrading for 20 years is the memory wall. In the case of GRAPE processors, the memory wall is not an issue, because the bandwidth requirement is reduced to a very small amount by three mechanisms. First, the pipeline processor requires only three or four words (position and mass of the particle that exerts the force) to perform several tens of operations. Second, multiple pipelines can share one input data, calculating the force from one particle to several particles. Third, one pipeline can further reduce the bandwidth requirement by calculating the forces on multiple particles using multithreading. A GRAPE-6 chip contains six pipelines, and each calculates forces to eight particles. The bandwidth required is just 720 MB s-1.
The development of the GRAPE systems has continued not just because they are fast; they can also solve problems not possible with other supercomputers.
Thus, GRAPE processors have been fairly successful. In 2002 a large parallel machine consisting of 2,048 GRAPE-6 chips was completed (figure 6). It offered a peak performance of 64 teraflops (Tflops), faster than the Earth Simulator's 40 Tflops, for less than 1% of the development cost and less than 1% of the power consumption. A 2,048-chip GRAPE-6 system consists of five racks, while the Earth simulator consists of 640 racks of computing nodes and another 130 racks for the network.
 
Science on GRAPE
The development of the GRAPE systems has continued not just because they are fast; they can also solve problems not possible with other supercomputers, including the formation of the moon and terrestrial planets, the evolution of compact star clusters and binary and triple black holes in the center of a galaxy, the structure of dark-matter halos, and the statistics of substructures in dark halos. The following recent results were obtained either using GRAPE hardware or programs optimized initially for GRAPE hardware and then ported to general-purpose parallel computers.
J. Makino
Figure 6. The 64 Tflops GRAPE-6 system, consisting of 64 processor boards (eight boards in one subrack and one or two subracks in standard 19 inch racks) and its host computers (16 PCs with AMD K7 processors).
An example of the simulation of large-scale structure formation using the parallel tree method is shown in figure 4 (p57). Tomoaki Ishiyama calculated the structure formation through gravitational instability in a cubic volume of 43 megaparsecs (Mpc). In this calculation, the resolution required is approximately 500 pc, or around 1/10,000 of the size of the simulation box, and more than 20,000 substructures were resolved.
Around 140 galaxy-sized objects (halos) are formed, and two are depicted in figure 4 (p57). Although these two have almost the same size and mass, the distribution of matter within these halos is very different. In the left box, the halo is rather smooth and there are only a few substructures, while in the right box, the halo contains a number of substructures. The left one is formed in a relatively low-density area (see the left-middle box) and the right one in a high-density region. These differences might be related to the difference of galaxies in the field (low-density area) and in clusters of galaxies (high-density region).
Figure 3 (p56) shows a simulated galaxy, formed from initial density fluctuations much in the same way as in figure 4 (p57). However, here the calculations include baryons (hydrogen, helium, and heavier elements), while the previous one included only dark matter. Roughly speaking, here the mass resolution is about 1,000 times the mass of the Sun, and space resolution goes down to a few parsecs.
T. Takeda and T. Saito, 4D2U Project, National Astronomical Observatory of Japan
Figure 7. A simulation of a galaxy merger. After two galaxies approach to the nearest distance (top right panel), a massive gas filament forms (lower left) and breaks up (lower right). Fragments of filament eventually become massive star clusters. The calculation was done on a parallel GRAPE-7 system.
Figure 7 shows a simulation of two equal-sized galaxies merging. In total, 15 million particles are used to model these galaxies. The formation and fragmentation of a gas filament during the merger event was first found in this calculation. The calculations for figures 3 and 7 were done by Takayuki Saitoh.
 
Problems with Specialized Architecture
GRAPE has achieved much better cost performance than general-purpose computers, such as x86 processors made with technology 5-10 years more advanced. One reason for this success is GRAPE's architecture. Practically all the transistors are used as arithmetic units, without being limited by the memory wall problem. Another reason is the arithmetic units can be optimized to their specific uses in the pipeline. For example, in the case of GRAPE-6, the subtraction of two positions was performed in 64 bit fixed-point format, not in floating-point format. Final accumulation was also done in fixed point. Also, most arithmetic operations to calculate the pairwise interactions were done in single precision. These optimizations made it possible to pack more than 200 arithmetic units into a single chip with less than 10 million transistors.
But there was a problem. As the silicon semiconductor technology advanced, the initial cost of designing and fabricating custom chips skyrocketed. In 1990, when design began on the GRAPE-3 custom chip, the cost for such a chip was approximately $100,000. The cost was somewhat higher for GRAPE-4, started in 1992, because the chip was larger. In 1997, when design began of the GRAPE-6 chip, it cost more than $1 million; and for GRAPE-6's successor in 2003, the initial cost was approximately $4 million. Roughly speaking, initial cost has been increasing as n0.7, where n is the number of transistors that can fit into a chip.
GRAPE has achieved much better cost performance than general-purpose computers. One reason for this success is GRAPE's architecture.
The total budgets for the GRAPE-4 and GRAPE-6 projects were $3 million and $4 million. Thus, a similar budget was not enough to just design the processor chip. At least $10 million was needed to develop a fairly large machine based on custom processors. On the other hand, it was not practical to ask for $10 million just to do astrophysical many-body simulations. Alternative approaches had to be explored.
One approach was to ask for that amount of money. If $10 million was obtained, it could be used to build special-purpose hardware for astrophysical N-body problems. Using the technology available in 2004, the performance of the chip would be roughly 30 times more than that of the GRAPE-6 chip, or around 1 Tflops, with power consumption of approximately 20 W. With 4,000 chips or so, the machine would be the first machine to achieve multiple Pflops performance for real applications. In fact, the MDGRAPE-3 project, using 13 nm technology, achieved a peak performance of 1 Pflops in 2006, two years before the first Pflops general-purpose computer, the IBM Roadrunner.
Another approach was to use field-programmable gate array (FPGA) chips. An FPGA chip consists of a number of programmable logic blocks and also programmable interconnections. A logic block is essentially a small lookup table with multiple inputs, augmented with one flip-flop and sometimes full-adder or more additional circuits. The lookup table can express any combinatorial logic for input data, and with flip-flop it can be part of a sequential logic. An interconnection network is used to make larger and more complex logic by connecting logic blocks. The design of recent FPGA chips has become much more complex, with large functional units such as memory blocks and multiplier blocks (typically 18 x 18 bits).
It was not practical to ask for $10 million just to do astrophysical many-body simulations. Alternative approaches had to be explored.
Because of the need for programmability, the size of the circuit that can fit into an FPGA chip is much smaller than for a custom large-scale integration (LSI) chip, and the speed of the circuit is also slower. However, FPGA chips are mass-produced by several LSI companies and easily purchased, without paying an initial cost. Thus, a machine can be constructed with a budget much smaller than that needed for the design of a custom LSI.
A fairly large FPGA chip based on 65 nm technology can integrate around 150 single-precision multipliers and may achieve a speed two times faster than a GRAPE-6 chip, or approximately 5% of the speed of a custom chip based on 90 nm technology. Roughly speaking, FPGAs are less efficient than custom pipeline chips by a factor of 20. Also, large FPGA chips are much more expensive than microprocessors, which means the advantage of FPGAs over commercial microprocessors is not large, only a factor of three or so.
FPGAs can have a large advantage over microprocessors for some applications. In particular, if the necessary word length is short, the relative advantage can be very large. For example, gravitational force calculations can have less accuracy than single-precision floating-point arithmetic. For many problems, the pairwise interaction does not need 24 bit relative accuracy, and accuracy of less than 10 bits is enough. For such cases, the performance of FPGAs can be improved by a large factor, making them much better alternatives to microprocessors.
 
GRAPE-DR
The option chosen was still different and rather similar to the change of the design of GPUs. Figure 8 shows the basic idea of the current approach, the GRAPE-DR processor. Officially, GRAPE-DR is an acronym for Greatly Reduced Array of Processor Elements with Data Reduction. GRAPE-DR consists of a large number of very simple processors (processing elements or PEs), all operating in the single instruction, multiple data (SIMD) fashion. The PEs are organized quite differently from traditional SIMD processors. In the GRAPE-DR architecture, processors are organized into a two-level tree structure, and the processors in the same low-level groups—called broadcast blocks (BBs)—share single memory units. Each memory unit can broadcast its data to the BB processors, and the same data can be broadcast from the external input port to all BBs. Finally, when the data is read from the processor, arbitrary reduction operations, such as floating-point addition, logical operations, and integer operations can be applied to the data from all BBs. Unlike traditional SIMD processors such as Goodyear MPP, TMC CM-2, or Maspar MP-1, PEs in the GRAPE-DR architecture only have small on-chip memories.
Source: J. Makino Illustration: A. Tovey
Figure 8. The structure of the GRAPE-DR processor.
This processor's specialized architecture is used for the calculation of interactions between particles. When the forces on m particles from other n particles are calculated, there is potentially mn parallelism. However, if one particle is assigned to each processor to calculate the force on it, only m processors can be used. In order to make use of mn parallelism, let n processors calculate the forces from different n particles to one particle, and take the summation of the forces afterwards. The reduction tree takes this summation.
In this architecture, almost all transistors can be used to implement PEs, because the communication network, including the reduction tree, encompasses a rather small area compared to the PEs. The reduction tree requires NBB tree nodes, where NBB is the number of BBs. Since NBB is much smaller than the total number of PEs, the reduction tree does not require much area.
Within one PE, a floating-point multiplier, a floating-point adder, an integer arithmetic unit, and a local register file are needed. In the first implementation of the GRAPE-DR processor, there was one floating-point multiplier with a 25 x 50 bit integer multiplier, which can perform a single-precision multiplication per clock cycle, or one double-precision multiplication per two clock cycles. The floating-point adder can perform one addition of either single- or double-precision words per clock cycle, and the integer ALU can perform operations on 72 bit integers. For the local register file, a three-port (two read ports and one write port) 32 word register file, a 256 word single-port memory, and an auxiliary register were used. Figure 9 (p61) shows the layout of one PE, with the floating-point units consuming significant area.
Source: J. Makino (left) and Alchip Technologies (right) Illustration: A. Tovey
Figure 9. The GRAPE-DR processor element (PE). The left panel shows the logical design, and the right panel shows the physical layout. The black rectangular box is the local memory. Red, orange, green, and blue areas are the register file, FMUL, FADD, and ALU logics, respectively.
A single chip, approximately 300 mm2 in size, held 512 PEs. Figure 10 shows the layout of the chip. Each small white square is one PE, and each of the 16 identical blocks had 32 PEs. Each cycle, a PE can do either two single-precision operations or one double-precision operation, or roughly 1 mm2 per double-precision operation. The latest Intel Xeon chips have 200 mm2 for eight FMAs, or around 14 mm2 per double-precision operation. Taking into account the technological difference of a factor of four, GRAPE-DR is approximately 60 times more efficient in transistor use. Compared to GRAPE-6, however, this means GRAPE-DR is about a factor of 10 less efficient. Also, if compared to microprocessors with technology two or three generations ahead, the relative advantage of a GRAPE-DR chip is around a factor of 10—large, but not very large.
Alchip Technologies (left) and K. Hiraki, GRAPE-DR Project, University of Tokyo (right)
Figure 10. The GRAPE-DR chip layout (left) and chip photo (right).
A single GRAPE-DR board, with four GRAPE-DR processor chips, achieves a theoretical peak performance of one double-precision Tflop, with power consumption around 240 W. Two of these boards can be hosted by a single PC. In that case, the total power consumption would be around 700 W for 2 Tflops, and the theoretical peak performance per watt is about 3 Gflops/W. Actual performance for real applications would be somewhat lower, near the range of 1-2 Gflops/W. This number is 5-10 times better than the current Intel Xeon chips. Figures 11 and 12 (p64) show the processor board.
Source: J. Makino Illustration: A. Tovey
Figure 11. The structure of the GRAPE-DR processor board with four GRAPE-DR chips.
J. Makino
Figure 12. A photograph of the GRAPE-DR processor board with four GRAPE-DR chips.

 
Programming
Parallel programming is difficult, and programming heterogeneous multiprocessors is even more difficult. So how can hardware like GRAPE-DR, FPGAs, or general-purpose computing be programmed on GPUs?
This view of programming architectures may be different from others, because of two decades of work on the GRAPE hardware. The special-purpose part of GRAPE-3 used parallel pipelines, and the number of pipelines in GRAPE-6 was as large as 12,288. Moreover, each GRAPE-6 pipeline processor calculates the force on eight particles using virtual pipelines. Thus, the number of pipeline processors in GRAPE-6 system is as large as 100,000.
In this programming model, details of the parallelization such as the number of physical processors, the number of BBs in one chip, and how the data are distributed over multiple processors and BBs are all hidden from the application programmer.
The massive parallelism in the GRAPE-6 architecture is achieved mostly by hardware. One GRAPE-6 unit consists of up to 128 processor chips, all calculating the forces on the same set of 48 particles from a different set of particles. The units are connected to a single host computer through a broadcast data network and a reduction tree network. The summation of forces calculated on different chips is performed by the reduction tree network made of FPGA chips. The use of GRAPE-6 hardware from the host computer is accomplished through calls to several library functions that completely hide the hardware details (for example, the number of chips) from the application program.
For GRAPE-DR, the same approach was used. In many particle-based simulations such as classical molecular dynamics or astrophysical N-body simulations, the most expensive part of a calculation is the evaluation of particle-particle interactions. In general, it has the form

where xi denotes the variables associated with particle i, g(xi, xj) is some generalized force from particle j to particle i, and fi is the total force on particle i.
The PEs execute the code to evaluate function g. Because GRAPE-DR does not have hardware pipelines for force calculation, the code for that part needs to be captured. However, in many cases the evaluation of function g really requires a small piece of code, which can be written using either assembly language or a simple compiler. Software has been designed that generates a set of library functions from that description. GRAPE-DR hardware can be used by calling these functions from application programs running on the host computer. Figure 13 (p65) shows the gravitational force calculation function written in a prototype compiler language developed by Nakasato. This example code is for the gravitational interaction between particles written for this compiler. Here, keywords LMEM, BMEM, and RMEM denote that variables of the specified class are part of xi, xj, or the return value g in the above equation, respectively.
J. Makino
Figure 13. An example of a simple compiler language description for particle-particle interaction.
In this programming model, details of the parallelization such as the number of physical processors, the number of BBs in one chip, and how the data are distributed over multiple processors and BBs are all hidden from the application programmer. In most parallel computers, this kind of abstraction comes with a penalty of low execution efficiency. However, in the case of GRAPE-DR hardware, the structure of the hardware itself is optimized to the efficient execution of this kind of calculation of interaction between particles, with the networks for data broadcast and data reduction over multiple BBs, and the overhead for the parallelization is quite small.
The GRAPE-DR architecture and this programming model support two additional important types of calculations. First is the case of trivial parallelism, where all PEs receive different data from the host computer, perform the same operations, and return the results to the host computer. This arrangement means xj is not needed. The main limitation is the small amount of memory available to each PE, but GRAPE-DR hardware can be used for many important applications. An example is the evaluation of two-electron integrals for many quantum chemistry applications.
Another important application is dense-matrix calculations. For any dense-matrix calculations, the basic operation is matrix multiplication C = AB. The GRAPE-DR chip can perform the matrix multiplication with nearly 100% efficiency, and the four-chip processor board is designed so that it can achieve again nearly 100% efficiency.
Algorithms that require fast communication between PEs or large bandwidth to the main memory are not suited to GRAPE-DR. Examples are single FFT on large data and computational fluid dynamics with relatively simple numerical schemes. Roughly speaking, a global FFT of three-dimensional data with a one-dimensional size of around 1,024 requires the communication bandwidth with distant processors should be around one word per 100 or so operations. Thus, a 1 Tflop chip requires a communication speed of approximately 100 GB s-1, which is clearly not practical. This kind of balance between computation and communication is not possible with most parallel computers, except for expensive vector-parallel machines such as NEC SX-8.
The development of semiconductor technology and computer architecture in the last two decades made the approach of special-purpose hardware quite effective.
Other approaches should be investigated for FFT. For example, if FFT is used to solve the Poisson equation, a variety of methods could be used: FMM, a combination of FMM and FFT (short range by FMM and long range by FFT), purely iterative solvers, and so on. For a uniform grid, FFT might require the least number of operations but the largest amount of communications. Thus, other methods are certainly better if the communication bandwidth is limited; even if sufficient communication bandwidth is available, for non-uniform grids something other than FFT should be applied.
 
Future Directions
GRAPE special-purpose computers have supported astrophysical N-body simulations. The development of semiconductor technology and computer architecture in the last two decades made the approach of special-purpose hardware quite effective, because the efficiency of transistor usage of general-purpose microprocessors has been decreasing exponentially. Because this decrease is due to the memory wall problem, special- purpose design is about 1,000 times more efficient if a numerical method can be used that requires relatively small memory bandwidth compared to computing speed. In practice, the differences in the available technology and the development cycle make the difference slightly smaller but still significant.
However, the evolution of semiconductor technology in recent years has made the design of special- purpose computers for a narrow range of applications not cost-effective, simply because the initial development cost has become unrealistically high.
Because of the development of the numerical algorithms and application software optimized to one type of accelerator hardware, it is relatively easy to modify it for other types of accelerator hardware.
With a programmable processor, the application area is increased. The GRAPE-DR processor is, as of summer 2008, the fastest single-chip processor for double-precision arithmetic. But the cost to make the processor programmable is fairly large, close to a factor of 10. This factor of 10 is large enough to significantly reduce the advantage of the GRAPE-DR processor over other approaches, in particular GPGPUs. Though GRAPE-DR still has a large advantage in power consumption (and therefore operating cost), the price tag of the hardware is lower for GPUs because of much larger production quantity.
On the other hand, the amount of money spent for GRAPE-DR hardware development is tiny, compared to what national laboratories in the United States, European Union, Japan, and several other countries have spent on a single machine. For example, the Earth Simulator cost $400 million, and even the relatively inexpensive Roadrunner system cost more than $100 million. Japan's next-generation supercomputer system will reportedly cost more than $1 billion. The total budget to build a roughly 1 Pflop GRAPE-DR system is approximately $12 million. Thus, even with high initial development cost of LSI chips, it is possible to develop approximately 100 different systems, each with specialized processors, for roughly the same cost as a single large computer. Spending more than $100 million for a single large computer with a relatively small number of applications is only justified if the design of the processor achieves efficiency better than 1/50 of that of the optimal special-purpose design. The factor of 50 is important because this amount is necessary to make a specialized design—which would use older technology and be more expensive due to low production quantity—actually competitive.
Efficiency is not the performance normalized by the theoretical peak performance of the processor but that multiplied by the fraction of the transistors on the processor chip used for floating-point arithmetic units. For example, a typical general-purpose microprocessor with eight double-precision FMAs made with 45 nm technology uses less than 1% of its transistors for logic for floating-point operations. Thus, even if one application achieves 100% efficiency, overall efficiency of the combination of processor and application is less than 1%.
For users and people developing application software, the central question is: in which hardware should money and time be invested? Because of the development of the numerical algorithms and application software optimized to one type of accelerator hardware, it is relatively easy to modify it for other types of accelerator hardware. Therefore, as long as the application developers understand the differences between several accelerator architectures and design the algorithm and application so that it can work on multiple platforms, any architecture may be used, and the investment on the software development will not be lost. In fact, the algorithms and applications developed for GRAPE systems are quite efficient when run on GPUs, SIMD hardware on x86 processors, and FPGA-based systems.
 
Contributors Junichiro Makino
 
Further Reading Makino, J. and M. Taiji. 1998. Scientific Simulations with Special-Purpose Computers—The GRAPE Systems. John Wiley and Sons, Chichester.

Hut, P. and J. Makino. 1999. Astrophysics on the grape family of special-purpose computers. Science 283: 501-505.

Hut, P. and J. Makino (eds). 2003. Astrophysical Supercomputing using Particle Simulations. Astronomical Society of the Pacific, San Francisco.