DOESciDAC ReviewOffice of Science
HARDWARE
The Advantages of First-Generation HETEROGENEOUS COMPUTING on the Cray XT5h
Heterogeneous computing—the use of different computers connected by a network to solve a problem in parallel—offers the promise of significantly improving performance by running different parts of an application with different computational requirements on specialized pieces of hardware.

The ultimate vision of heterogeneous computing is that fine-grain parallel code structures—several layers of "building blocks" that include individual loops—would be targeted by a compiler for execution on different pieces of hardware or on single hardware that is re-configurable. For example, applications may contain sections of code that benefit from fast serial processing, powerful vector processing, or abundant thread-level processing. With hardware and software innovations, future systems will achieve that vision, but systems are available today to take a meaningful step in that direction.
Scientists can realize benefits from heterogeneous systems such as that of the Cray XT5h™ (figure 1) to improve the performance and throughput of heterogeneous application workloads. For the set of applications and scale considered in this article, the vector portion of the XT5h system has a processor-to-processor performance advantage of 2.5x to 9x over that of the microprocessor portion of the XT5h system. Given the system price per processor for vector and microprocessors, it is possible to calculate sustained price-performance (SPP, in units of Mflops/$) for all codes. This allows applications to be sorted into two buckets: those that have higher SPP on microprocessors and those that have higher SPP on vector processors.
Figure 1. Artist's rendering of the Cray XT5h.
Given user-specified target aggregate operations for the two application types in a workload, we can calculate the optimal sizes for the microprocessor and vector portions of the system and model the increased throughput that results from selecting a heterogeneous system over a pure microprocessor-based system.

A Look at Heterogeneous Applications
Current high-performance computing (HPC) applications achieve high performance by executing their work in parallel. While there are many approaches to developing parallel applications, HPC applications use a method called domain decomposition, by which an application's global workload is sliced into pieces that are then distributed to a number of processors for computation. Domain decomposition is further classified into two categories: weak scaling decompositions, in which the size of the computational domain of each processor remains constant as the number of processors working on the problem increases, thus increasing the global size (or resolution) of the computation; and strong scaling decompositions, in which the global size (or resolution) of the computation remains fixed, and the size of the computational domain of each processor decreases as the number of processors working on the global problem increases. With these definitions in mind, we can categorize today's high-performance computing applications as highly scalable and runtime sensitive.
Scientists can realize benefits from heterogeneous systems such as the Cray XT5h to improve the performance and throughput of heterogeneous application workloads.
Highly scalable applications usually benefit from larger computer systems and use weak scaling to increase resolution and thus study shorter-wavelength phenomena and demonstrate convergence of results. A classic example of a highly scalable application is turbulence, where the aim is to understand processes in which large-scale turbulence is dissipated at shorter and shorter wavelengths. However, as grid spacing decreases, many algorithms require more (but smaller) time steps or more iterations to maintain well-behaved numerical solutions. Thus, to achieve a fixed simulated time, the total number of operations done by each processor, as well as the "wall clock time" to completion, both increase. This eventually leads to strong scaling, where the goal is to reduce runtime by utilizing more processors on a fixed-resolution grid.
The primary goal of the second class of HPC applications, those that are runtime sensitive, is to reduce runtime to levels that allow rapid turnaround of results for scientific advancement or to meet real-time constraints. A classic example of a runtime-sensitive application is a climate code run to focus on the effects of adding new physical processes to the model (full carbon cycle, more chemical species, and so on), rather than on increasing resolution. This is particularly useful because runtimes on existing systems can be many weeks or months, whereas runtimes of a few days are more desirable. With strong scaling, communication bottlenecks can occur and limit scalability, as an increasing fraction of time is spent communicating to neighbors rather than doing operations. And some applications have only one-dimensional parallel decompositions, so this fundamentally limits the number of processors that can be utilized. For these reasons, runtime-sensitive applications benefit from more powerful processors rather than from higher numbers of processors.
Through the use of more powerful processors, heterogeneous hardware components can be viewed as application accelerators. All HPC applications have parallelism available at multiple levels. Domain decomposition, generally implemented with the Message Passing Interface (MPI) library, exploits the highest level (coarsest granularity) of parallelism. However, additional parallelism exists at lower levels and finer granularity, which generally cannot be exploited by MPI due to the high overhead of communication. These levels can be exploited by processors that support vector and multithread computation. Thus, even highly scalable applications can benefit from heterogeneous hardware. MPI processing requires full, state-of-the-art microprocessors capable of rapidly executing serial code; this may not be the most cost-effective or energy-efficient processor type for applications dominated by low-level vector or threaded parallelism.
How can we quantify the "value" of a heterogeneous computer system? Two key parameters are SPP and time-to-solution (TTS). The former relates to cost-effectiveness, the latter to rate of discovery. For scientists in very competitive commercial markets, TTS can weigh more heavily than SPP. A system with high SPP and high TTS can provide parallel runs of an application, but that does not always increase the rate of discovery. Some applications are run serially, with what is learned from one run determining input to the next.
In assessing the value of heterogeneous systems, this article focuses on which type of processor can run which applications with the highest SPP. It includes a simple model that allows data center managers to size components of a heterogeneous system to maximize throughput of a heterogeneous workload.

A Heterogeneous System: The Cray XT5h
As a first step toward enabling production-quality heterogeneous computer systems, the Cray XT5h was introduced at Supercomputing 2007 (SC07). The XT5h supercomputer can integrate multiple processor architectures into a single system, accelerating the most demanding computational workflows. The XT5h is a hybrid system consisting of X2 nodes (sidebar "X2 Specifications") with Cray BlackWidow (BW) vector processors, XT nodes with AMD Opteron scalar microprocessors, and optionally on nodes using field programmable gate arrays (FPGAs). With a proven and mature programming environment for vector processors, programs written in C or Fortran are compiled directly to the vector processors, a capability that doesn't exist today for other accelerators. The XT5h system has global addressing capabilities programmable by Co-Array Fortran (CAF) and Unified Parallel C (UPC) that can solve problems beyond the capabilities of MPI.
Through the use of more powerful processors, heterogeneous hardware components can be viewed as application accelerators.
The XT partition includes a set of service and input/output (SIO) nodes, which provide login, programming environment, job scheduling, system administration, global file system, and other operating system services for the system. SIO nodes are shared by the X2 compute nodes, XT compute nodes, and any other compute nodes in the system, providing a unified user environment.
This architecture allows data center managers to purchase one system but to vary the size of XT and X2 partitions, depending on the fraction of their workload that is best suited for microprocessors or vector processors. This system design also supports applications with heterogeneous workflows that exchange data between the scalar (XT) and vector (X2) nodes through a common file system or via direct memory-to-memory communication. Each vector processor supports abundant memory-level parallelism with up to 4,000 outstanding global memory references per processor. Latency hiding and efficient synchronization are central to the X2 design, and the network provides high global bandwidth and low latency for efficient synchronization. The high-radix fat-tree network allows the system to scale up to 32,000 processors.

Scalar and Vector System Performance
Because the XT5h hybrid system focuses on optimizing heterogeneous workloads, it is useful to discuss the performance of application kernels that demonstrate that the vector processor accelerates workloads that have significant vector content relative to commodity microprocessors. Such results are an important first step in helping scientists size X2 versus XT portions of an XT5h system.
Figure 2. System configurations for benchmarking.
In this article, results are presented comparing X2 performance to an HP XC3000 system using the dual core Intel Woodcrest processor and to an XT4 system using dual core AMD Opteron processors. System configuration parameters for benchmarks are shown in figure 2. A processor consists of one or more silicon die in a single package or socket; the package may contain one or more cores, each of which is capable of independently processing an instruction stream. The raw, bi-directional network injection bandwidths shown in row 7 of figure 2 are per-socket. Performance comparisons between X2, the HP XC3000 system, and the XT4 system will be both in terms of the individual core level and at the processor level.
These two benchmarks focus on two important attributes of compute phases of applications—local memory bandwidth and floating point execution rates.
Figure 3 illustrates processor-to-processor comparison of STREAM (which measures sustainable memory bandwidth) and DGEMM (which measures raw capability) results from the HPCC Benchmarks for X2 and Woodcrest systems.
These two benchmarks focus on two important attributes of compute phases of applications—local memory bandwidth and floating-point execution rates. The table shows much higher memory bandwidth of the BW vector processor compared to the Woodcrest processor. Both processors show high floating-point rate for matrix multiply, although the X2 system runs at 86% of peak while the Woodcrest system runs at 39% of peak.
Figure 3. HPCC STREAMS, DGEMM on HP XC300 versus Cray X2.
Figure 4 shows G-PTRANS, G-RA, and RandomRing results from the HPCC Benchmarks for X2 and Woodcrest systems. These benchmarks highlight the main network attributes—global bandwidth, single-word random access, and MPI latency—that determine the communication performance of applications. These results were for 16-processor systems. The Cray X2 system with 16 vector processors shows slightly better MPI latency but significantly higher MPI bandwidth than the HP XC3000 system with 16 Woodcrest processors (32 cores). In particular, native hardware support in X2 for random, single-word, global references through the use of CAF leads to much higher performance than the HP system.
Figure 4. HPCC G-PTRANS, G-RA, RandomRing on HP XC300 versus Cray X2.
Figure 5 shows a comparison between the X2 and quad-core Clovertown systems for two standard test cases of the Overflow2 aerodynamics code. The Intel Clovertown system is composed of two dual-core Intel Woodcrest die packaged on a multi-chip module. Thus, the Clovertown processor provides four cores per socket. Overflow2 generally requires high-memory bandwidth. Scalability is limited (by load imbalance), and time-to-solution is critical. (These are signatures of a runtime-sensitive application.) This code is chosen to illustrate the X2 system's designation as a workload accelerator for Overflow2. The runs are labeled as Test 1 and Test 2. Overflow2 assigns multiple computational grids to processors to roughly achieve load balance. Each test case has the same number of grids (7) and the same total number of points (1.119 million). Test 1 runs 100 time steps on a coarse grid, then 100 time steps on a medium grid, then 20 time steps on a fine grid. Test 2 runs 100 time steps on a coarse grid, 100 time steps on a medium grid, and 400 time steps on the fine grid. Because Test 2 runs longer on the fine grid, we see greater acceleration over the course of the run.
At Supercomputing 2007, it was announced that BW set a new world record of 82 million lattice updates per second for single-processor LBM.
As shown in figure 5, the BW processor holds a 5.1x to 7.4x advantage over a single core of the Clovertown system. When four MPI processes are running (four BWs on a node of X2 and four cores of the quad-core Clovertown processor), the BW holds a 5.6x to 10.1x advantage over the Clovertown system. The last row in figure 5 shows a 2.6x to 3.8x advantage of two BW processors (2-way MPI) relative to two Clovertown processors (4 cores each, 8-way MPI).
Figure 5. Overflow2 benchmarking comparing BlackWidow and Intel Woodcrest processors.

Single-Processor Runs
In this section, the term "processor" refers to a single core. Three single-processor application kernels discussed here are the Lattice-Boltzmann Method (LBM), Mersenne Twister, and "Swim." LBM is an alternative formulation of computational fluid dynamics that is especially attractive for time-dependent compressible flows. Macroscopic fluid quantities such as velocity and density are given as moments of a particle distribution function, whose evolution each time step results from advancing the linearized Maxwell-Boltzmann equation relative to an equilibrium distribution function. A relaxation time parameter is introduced, which is related to kinematic viscosity of the fluid. LBM performance can be modeled as being limited either by memory bandwidth time (true of microprocessors) or peak compute rate time (true of Cray and NEC vector processors). At Supercomputing 2007, it was announced that BW set a new world record of 82 million lattice updates per second for single-processor LBM. The Mersenne Twister code is a single-processor code for computing random numbers. Optimizing this code to use the Bit-Matrix Multiply (BMM) unit of BW results in high performance relative to microprocessors. Swim is a long-standing benchmark kernel in the climate modeling community, which uses an explicit solution of the shallow water equations. It is characterized by low computational intensity (few flops per memory reference), so high-bandwidth systems are preferred.
The results of the three single-processor runs are shown in figure 6. Blue bars of the figure show performance of one core of XT4 normalized to 1.0, and green bars show the relative performance of one BW processor. In this set of kernels, BW has memory bandwidth or special hardware feature advantages over Opteron, which make it an attractive accelerator.
Figure 6. Relative single processor performance of X2 (green) versus XT4 (blue) on three application kernels.

Multiprocessor Runs Involve Variety of Applications
The multiprocessor applications discussed in this article compare 16 BW processors (16-way MPI) to 16 dual core Opterons processors (32-way MPI) used in XT4. The first code is Leslie3D, a general purpose computational fluid dynamics (CFD) code used in studies of mixing, combustion, acoustics, and fluid mechanics. Xflow is a highly dynamic, finite element compressible CFD code. Xflow is unique in that it is capable of efficiently adapting the finite element mesh every time step. POP2 is a popular ocean dynamics code used as a module in climate studies. The baroclinic portion of the code uses finite difference operators over a 3D grid and nearest-neighbor boundary exchanges. The barotropic portion of the code solves a 2D problem on the surface of the ocean by the conjugate gradient algorithm, which is dominated by global dot products. Himeno is a benchmark code focused on studying performance of generic 3D stencil operations and near-est-neighbor halo exchanges. VH1 is a hydrodynamics code augmented with nuclear physics and is used in astrophysical and nuclear fission reactor studies. It has rich computational loops and extensive use of math intrinsic functions like exponential and logarithm. S3D is a 3D multi-physics combustion code that uses high-order finite differences, adaptive mesh refinement, and other techniques to resolve the wide range of length and time scales associated with turbulence, flame fronts, chemical reactions, and multiphasic phenomena. The final multiprocessor application is fvCAM, a finite-volume dynamics module for global climate studies.
The results of the multiprocessor runs are shown in figure 7 (p48). Blue bars of the figure show performance of 16 XT4 Opteron processors (32 cores) normalized to 1.0, and green bars show relative performance of 16 X2 BW processors. The codes have a variety of flop rate, memory bandwidth, and communication characteristics. Problems chosen for the application runs were typical of the problem sizes solved on 16–32 processors using MPI. Thus, for this representative set of vector-friendly applications, X2 exhibits a 2.5x to 9x performance advantage over XT4, at least for 16 processors. This robustness encourages consideration of X2 as a viable workload accelerator. A second conclusion is that some codes may receive large boosts in performance due to special X2 features. Xflow is a memory bandwidth-intensive code, but native support for UPC in X2 hardware and compiler also helps give X2 a large advantage over XT4. On XT4, the University of California–Berkeley UPC compiler was used for Xflow, but performance was not good. Xflow makes single-word remote references, and these present efficiency issues because the XT4 does not have underlying hardware support for them. Furthermore, preliminary investigation suggests that the advantage of X2 over XT4 increases as the number of MPI ranks increases.
How many processors of each type should be purchased to maximize system throughput and provide "the most science per dollar?"
Figure 7. Relative multiprocessor performance of X2 (green) versus XT4 (blue).

Sustained Price Performance: A Simple Model
How can scientists quantify the benefit of a heterogeneous system? One can look at a model of system-level SSP for sizing partitions of heterogeneous processor types for a heterogeneous workload. That is, for fixed total system cost and given workload, how many processors of each type should be purchased to maximize system throughput and provide "the most science per dollar?" Our model provides some answers.
Let "Type 1" processors be microprocessors such as the AMD Opteron processors used in XT4, while "Type 2" processors will be custom vector processors, specifically the BlackWidow processor of X2. A heterogeneous XT5h system has some number of Type 1 and Type 2 processors. Researchers whose workload includes non-vector-friendly applications may have most of their investment in Type 1 processors. In this case, Type 2 processors are viewed as specialized processors for more specialized (vector-friendly) applications.
Consider two systems for addressing a heterogeneous workload. System "A" is a homogeneous system composed solely of p1 Type 1 processors, while "B" is a heterogeneous system composed of p1 processors of Type 1 and p2 processors of Type 2. Fully loaded costs (or prices) per socket of Type 1 and Type 2 processors are C1 and C2, respectively. Equation 1 (figure 8) shows fixed cost C of the two systems.
Figure 8. Equations, 1-8.
The heterogeneous workload can be divided into two categories. Type 1 applications are those that have higher SPP on Type 1 processors, and Type 2 applications are those with higher SPP on Type 2 processors. In determining this split, application, problem definition and processor counts of each type are chosen to provide the desired single-job performance improvement over the previous-generation system. In choosing a problem definition for each application code, the number of processors or number of MPI tasks could be different for the two processor types. In the example presented here, 16 processors of each type (but varying number of cores) were used on the same global problem. Note that highly scalable applica-tions may also be vector friendly. Thus, either Type 1 or Type 2 processors could have higher SPP on highly scalable applications. Vector-friendly runtime-sensitive applications are most likely to have higher SPP on Type 2 processors.
In System A, both types of applications are run on Type 1 processors. In System B, Type 1 applications are run on Type 1 processors and Type 2 applications on Type 2 processors. While any consistent and quantitative performance metric would suffice for our analysis, floating-point operation rate (flop rate) is sufficiently independent of "core count" and will be used for this analysis, though the result is general for any consistent quantitative performance metric. Let R1 be the average sustained flop rate per processor for Type 1 processors (System A or B) on Type 1 applications. Let R2 be the average sustained flop rate per processor for Type 2 processors (System B) on Type 2 applications. Let r12 be the average sustained flop rate per processor for Type 1 proces-sors on Type 2 applications (System A). With these definitions, SPP metrics (units of Mflops/$) for systems A and B can be defined as equation 2 (figure 8).
Given loaded system costs for X2 and XT4, one can calculate sustained price-performance for all codes.
We can state the constraint that Type 2 applications are those that run with higher SPP on Type 2 processors than on Type 1 processors as equation 3 (figure 8).
To complete a quantitative system-level SPP model, one more workload attribute is specified by the data center manager. Let F1 and F2 denote the target total number of operations to be computed, say annually, for each type of application. F1 and F2 reflect the "weighting" by users of the portion of their workload assigned to Type 1 and Type 2 applications.
Runtime of a real application has contributions from input/output (I/O), communication, and computation phases. In the simple model presented here, flop rates in equation 2 represent overall flop rates, including all phases of application runtime. It is clear from figures 3 and 4 that X2 has both high flop rate and communication rates, so overall flop rate R2 will have like-kind contributions from both phases of runtime. The time for System B to run the workload is equation 4 (figure 8).
Equation 4 indicates that the XT portion of the heterogeneous system is kept busy executing Type 1 applications, while the X2 portion is kept busy executing Type 2 applications. In fact, equation 1 and equation 4 can be used to size the Opteron and BW portions of an XT5h system in terms of other specified parameters, as in equation 5 (figure 8).
The time required for System A to run the workload is equation 6 (figure 8). equation 6 says the entire System A first runs Type 1 applica-tions and then Type 2 applications. By using equation 4 through equation 6, the two runtimes are related by equation 7 (figure 8).
From equation 3, it follows that TA is greater than or equal to TB (equation 8, figure 8). That is, equation 8 says that for a fixed price, the homogeneous System A will take longer to run the heterogeneous workload than the heterogeneous System B. Furthermore, equation 7 makes it possible to quantify how much more work can be done on System B compared to System A in a given amount of time.

Conclusions on Performance Advantages
For the set of applications and scale considered here, the X2 vector system has a performance advantage of 2.5x to 9x over the XT4 microprocessor system on a processor-to-processor basis. Given loaded system costs for X2 and XT4, one can calculate sustained price-performance (SPP in units of Mflops/$) for all codes. For this article, we looked at applications with higher SPP on XT4 processors and applications with higher SPP on X2 processors. Given user-specified target aggregate operations for the two application types in a workload, a model was provided to size the XT and X2 portions of the systems and to calculate the increased throughput that results from selecting a heterogeneous system over a pure microprocessor-based system.
The benchmarks presented here from the SC07 paper used dual core AMD Opteron processors of the XT4 system. The XT5h system will use the latest quad core AMD Opteron processor, so the Opteron benchmark results should be updated and new comparisons made with BW processors.

Further Reading
Condensed Results for HPCC Challenge Benchmarks.
http://icl.cs.utk.edu/hpcc/hpcc results.cgi.

Intel Core2 Duo.
http://h20311.www2.hp.com/HPC/cache/274276-0-0-0-121.html.

G. Wellein, T. Zeiser, G. Hager, and S. Donath. 2006. On the single processor performance of simple Lattice Boltzmann kernels. Comput and Fluids, 35: 910-919.