In this article, results are presented comparing X2 performance to an HP XC3000 system using the dual-core Intel Woodcrest processor and to an XT4 system using dual-core AMD Opteron processors. System configuration parameters for the benchmarks are shown in figure 2. A processor consists of one or more silicon dies in a single package or socket; the package may contain one or more cores, each of which is capable of independently processing an instruction stream. The raw, bidirectional network injection bandwidths shown in row 7 of figure 2 are per socket. Performance comparisons between the X2, the HP XC3000 system, and the XT4 system are made both at the individual core level and at the processor level.
Figure 3 illustrates a processor-to-processor comparison of STREAM (which measures sustainable memory bandwidth) and DGEMM (which measures raw floating-point capability) results from the HPCC Benchmarks for the X2 and Woodcrest systems.
These two benchmarks focus on two important attributes of the compute phases of applications: local memory bandwidth and floating-point execution rates. The results show much higher memory bandwidth for the BlackWidow (BW) vector processor than for the Woodcrest processor. Both processors show a high floating-point rate for matrix multiply, although the X2 system runs at 86% of peak while the Woodcrest system runs at 39% of peak.
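As a rough illustration of what these two numbers mean, the sketch below shows how a STREAM-style triad kernel turns a timing into a bandwidth figure, and how a fraction-of-peak value such as those quoted above is formed. The function names and problem size are hypothetical, and an interpreted Python loop says nothing about the hardware discussed here. Only the accounting (three 8-byte words moved per triad element) follows the usual STREAM convention.

```python
import time
from array import array


def triad_bytes(n, word_bytes=8):
    """Bytes counted for one triad pass a[i] = b[i] + q * c[i].

    The STREAM convention counts two loads and one store per element.
    """
    return 3 * word_bytes * n


def fraction_of_peak(sustained_rate, peak_rate):
    """Sustained rate divided by peak rate (both in the same units)."""
    return sustained_rate / peak_rate


def run_triad(n=200_000, q=3.0):
    """Time one triad pass and return (result array, GB/s).

    Illustrative only: the interpreter overhead dominates the timing.
    """
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    a = array("d", (b[i] + q * c[i] for i in range(n)))
    elapsed = time.perf_counter() - start
    return a, triad_bytes(n) / elapsed / 1e9


if __name__ == "__main__":
    _, gbs = run_triad()
    print(f"triad bandwidth (interpreted Python): {gbs:.3f} GB/s")
    # Hypothetical rates chosen only to show the arithmetic.
    print(f"fraction of peak: {fraction_of_peak(8.6, 10.0):.2f}")
```

A real measurement would use the compiled STREAM and DGEMM codes from the HPCC suite; the sketch only makes the bookkeeping behind the reported figures explicit.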
Figure 4 shows G-PTRANS, G-RA, and RandomRing results from the HPCC Benchmarks for the X2 and Woodcrest systems. These benchmarks highlight the main network attributes (global bandwidth, single-word random access, and MPI latency) that determine the communication performance of applications. The results are for 16-processor systems. The Cray X2 system with 16 vector processors shows slightly better MPI latency and significantly higher MPI bandwidth than the HP XC3000 system with 16 Woodcrest processors (32 cores). In particular, native hardware support in the X2 for random, single-word, global references, used through Co-Array Fortran (CAF), leads to much higher performance than on the HP system.
Figure 5 shows a comparison between the X2 and quad-core Clovertown systems for two standard test cases of the Overflow2 aerodynamics code. The Intel Clovertown processor is composed of two dual-core Intel Woodcrest dies packaged on a multi-chip module, so it provides four cores per socket. Overflow2 generally requires high memory bandwidth, its scalability is limited by load imbalance, and time-to-solution is critical; these are signatures of a runtime-sensitive application. This code was chosen to illustrate the X2 system's role as a workload accelerator. The runs are labeled Test 1 and Test 2. Overflow2 assigns multiple computational grids to processors to achieve rough load balance. Each test case has the same number of grids (7) and the same total number of points (1.119 million). Test 1 runs 100 time steps on a coarse grid, then 100 time steps on a medium grid, then 20 time steps on a fine grid. Test 2 runs 100 time steps on a coarse grid, 100 time steps on a medium grid, and 400 time steps on the fine grid. Because Test 2 runs longer on the fine grid, it shows greater acceleration over the course of the run.
As shown in figure 5, the BW processor holds a 5.1x to 7.4x advantage over a single core of the Clovertown system. When four MPI processes are running (four BW processors on one X2 node versus the four cores of one quad-core Clovertown processor), the BW holds a 5.6x to 10.1x advantage over the Clovertown system. The last row in figure 5 shows a 2.6x to 3.8x advantage for two BW processors (2-way MPI) relative to two Clovertown processors (4 cores each, 8-way MPI).
Single-Processor Runs
In this section, the term "processor" refers to a single core. The three single-processor application kernels discussed here are the Lattice-Boltzmann Method (LBM), Mersenne Twister, and "Swim."
LBM is an alternative formulation of computational fluid dynamics that is especially attractive for time-dependent compressible flows. Macroscopic fluid quantities such as velocity and density are given as moments of a particle distribution function, whose evolution at each time step results from advancing the linearized Maxwell-Boltzmann equation relative to an equilibrium distribution function. A relaxation time parameter is introduced, which is related to the kinematic viscosity of the fluid. LBM performance can be modeled as being limited either by memory bandwidth (as on microprocessors) or by peak compute rate (as on Cray and NEC vector processors). At Supercomputing 2007, it was announced that BW set a new world record of 82 million lattice updates per second for single-processor LBM.
The Mersenne Twister code is a single-processor code for computing random numbers. Optimizing this code to use the Bit-Matrix Multiply (BMM) unit of BW results in high performance relative to microprocessors.
Swim is a long-standing benchmark kernel in the climate modeling community that uses an explicit solution of the shallow water equations. It is characterized by low computational intensity (few flops per memory reference), so high-bandwidth systems are preferred.
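The two-regime LBM performance model mentioned above can be sketched as a simple minimum of two limits. This is a minimal illustration under stated assumptions, not the model used by the authors: the function name is hypothetical, and the per-update byte and flop counts and the machine rates in the usage section are placeholders rather than measurements of any system in this article.

```python
def lbm_updates_per_second(mem_bandwidth, peak_flops,
                           bytes_per_update, flops_per_update):
    """Estimate lattice updates per second and the limiting regime.

    mem_bandwidth    -- sustainable memory bandwidth in bytes/s
    peak_flops       -- peak floating-point rate in flops/s
    bytes_per_update -- memory traffic per lattice-site update (assumed)
    flops_per_update -- floating-point work per lattice-site update (assumed)
    """
    bandwidth_limit = mem_bandwidth / bytes_per_update
    compute_limit = peak_flops / flops_per_update
    if bandwidth_limit < compute_limit:
        return bandwidth_limit, "bandwidth-bound"
    return compute_limit, "compute-bound"


if __name__ == "__main__":
    # Placeholder kernel costs: 19 distributions, 8 bytes each, read and
    # written once per update; roughly 200 flops per update.
    BYTES_PER_UPDATE = 19 * 8 * 2
    FLOPS_PER_UPDATE = 200

    # Two hypothetical machines with the same peak but different bandwidth.
    for name, bw, peak in [("cache-based", 10e9, 20e9),
                           ("vector-like", 60e9, 20e9)]:
        rate, regime = lbm_updates_per_second(
            bw, peak, BYTES_PER_UPDATE, FLOPS_PER_UPDATE)
        print(f"{name}: {rate / 1e6:.1f} million updates/s ({regime})")
```

With equal peak rates, the lower-bandwidth placeholder machine lands in the bandwidth-bound regime and the higher-bandwidth one in the compute-bound regime, which is the distinction the text draws between microprocessors and vector processors.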
The results of the three single-processor runs are shown in figure 6. The blue bars show the performance of one core of the XT4, normalized to 1.0, and the green bars show the relative performance of one BW processor. For this set of kernels, the BW has memory bandwidth or special hardware-feature advantages over the Opteron, which make it an attractive accelerator.
Multiprocessor Runs Involve Variety of Applications
The multiprocessor applications discussed in this article compare 16 BW processors (16-way MPI) to the 16 dual-core Opteron processors (32-way MPI) used in the XT4. The first code is Leslie3D, a general-purpose computational fluid dynamics (CFD) code used in studies of mixing, combustion, acoustics, and fluid mechanics. Xflow is a highly dynamic, finite element compressible CFD code. Xflow is unique in that it is capable of efficiently adapting the finite element mesh every time step. POP2 is a popular ocean dynamics code used as a module in climate studies. The baroclinic portion of the code uses finite difference operators over a 3D grid and nearest-neighbor boundary exchanges. The barotropic portion of the code solves a 2D problem on the surface of the ocean by the conjugate gradient algorithm, which is dominated by global dot products. Himeno is a benchmark code focused on studying the performance of generic 3D stencil operations and nearest-neighbor halo exchanges. VH1 is a hydrodynamics code augmented with nuclear physics and is used in astrophysical and nuclear fission reactor studies. It has rich computational loops and makes extensive use of math intrinsic functions such as exponential and logarithm. S3D is a 3D multi-physics combustion code that uses high-order finite differences, adaptive mesh refinement, and other techniques to resolve the wide range of length and time scales associated with turbulence, flame fronts, chemical reactions, and multiphase phenomena. The final multiprocessor application is fvCAM, a finite-volume dynamics module for global climate studies.
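To make the remark about the barotropic solver concrete, the sketch below is a textbook conjugate gradient iteration in plain Python that counts its dot products. It is not POP2 code: the function and variable names are hypothetical, and the matrix is a tiny dense example. In a distributed-memory run, each call to `global_dot` would correspond to a global reduction across all MPI ranks, which is why the algorithm is sensitive to network latency.

```python
def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Solve A x = b for a small symmetric positive-definite matrix A.

    Returns the solution and the number of dot products performed.
    """
    n_dots = 0

    def global_dot(x, y):
        # In a parallel code this would be a local sum plus a global reduction.
        nonlocal n_dots
        n_dots += 1
        return sum(xi * yi for xi, yi in zip(x, y))

    def matvec(x):
        # Matrix-vector product; in a stencil code this needs only
        # nearest-neighbor communication.
        return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    p = list(r)                      # initial search direction
    rs_old = global_dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / global_dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = global_dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, n_dots


if __name__ == "__main__":
    solution, dots = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    print(solution, "dot products:", dots)
```

Each iteration performs two dot products for every matrix-vector product, so on a small 2D problem the reductions can easily outweigh the local arithmetic.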
The results of the multiprocessor runs are shown in figure 7. The blue bars show the performance of 16 XT4 Opteron processors (32 cores), normalized to 1.0, and the green bars show the relative performance of 16 X2 BW processors. The codes have a variety of flop rate, memory bandwidth, and communication characteristics. The problems chosen for the application runs are typical of the problem sizes solved on 16–32 processors using MPI. For this representative set of vector-friendly applications, the X2 exhibits a 2.5x to 9x performance advantage over the XT4, at least for 16 processors. This robustness encourages consideration of the X2 as a viable workload accelerator.
A second conclusion is that some codes may receive large boosts in performance from special X2 features. Xflow is a memory bandwidth-intensive code, but native support for Unified Parallel C (UPC) in the X2 hardware and compiler also helps give the X2 a large advantage over the XT4. On the XT4, the University of California–Berkeley UPC compiler was used for Xflow, but performance was poor. Xflow makes single-word remote references, and these are inefficient because the XT4 has no underlying hardware support for them. Furthermore, preliminary investigation suggests that the advantage of the X2 over the XT4 increases as the number of MPI ranks increases.
Sustained Price Performance: A Simple Model
How can scientists quantify the benefit of a heterogeneous system? One can look at a model of system-level sustained price performance (SPP) for sizing partitions of heterogeneous processor types for a heterogeneous workload. That is, for a fixed total system cost and a given workload, how many processors of each type should be purchased to maximize system throughput and provide "the most science per dollar"? Our model provides some answers.
Let "Type 1" processors be microprocessors such as the AMD Opteron processors used in the XT4, and let "Type 2" processors be custom vector processors, specifically the BlackWidow processor of the X2. A heterogeneous XT5h system has some number of Type 1 and Type 2 processors. Researchers whose workload includes non-vector-friendly applications may have most of their investment in Type 1 processors. In this case, Type 2 processors are viewed as specialized processors for more specialized (vector-friendly) applications.
Consider two systems for addressing a heterogeneous workload. System "A" is a homogeneous system composed solely of p1 Type 1 processors, while System "B" is a heterogeneous system composed of p1 processors of Type 1 and p2 processors of Type 2. The fully loaded costs (or prices) per socket of Type 1 and Type 2 processors are C1 and C2, respectively. Equation 1 (figure 8) gives the fixed cost C of the two systems.
The heterogeneous workload can be divided into two categories. Type 1 applications are those that have higher SPP on Type 1 processors, and Type 2 applications are those with higher SPP on Type 2 processors. In determining this split, the application, problem definition, and processor counts of each type are chosen to provide the desired single-job performance improvement over the previous-generation system. In choosing a problem definition for each application code, the number of processors or number of MPI tasks could be different for the two processor types. In the example presented here, 16 processors of each type (but a varying number of cores) were used on the same global problem. Note that highly scalable applications may also be vector friendly. Thus, either Type 1 or Type 2 processors could have higher SPP on highly scalable applications. Vector-friendly, runtime-sensitive applications are most likely to have higher SPP on Type 2 processors.
In System A, both types of applications are run on Type 1 processors. In System B, Type 1 applications are run on Type 1 processors and Type 2 applications on Type 2 processors. While any consistent and quantitative performance metric would suffice for our analysis, floating-point operation rate (flop rate) is sufficiently independent of "core count" and is used here. Let R1 be the average sustained flop rate per processor for Type 1 processors (System A or B) on Type 1 applications. Let R2 be the average sustained flop rate per processor for Type 2 processors (System B) on Type 2 applications. Let r12 be the average sustained flop rate per processor for Type 1 processors on Type 2 applications (System A). With these definitions, SPP metrics (in units of Mflops/$) for Systems A and B can be defined as in equation 2 (figure 8).
We can state the constraint that Type 2 applications are those that run with higher SPP on Type 2 processors than on Type 1 processors as equation 3 (figure 8).
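The classification rule can be illustrated with a few lines of Python. This is a sketch under assumptions: it takes SPP to be sustained flop rate per processor divided by loaded cost per socket, which matches the stated units of Mflops/$ but is not copied from figure 8, and the function names and numbers are hypothetical.

```python
def spp(flop_rate_mflops, cost_per_socket):
    """Sustained price performance in Mflops per dollar (assumed definition)."""
    return flop_rate_mflops / cost_per_socket


def is_type2_application(R2, C2, r12, C1):
    """True if the application has at least as high an SPP on Type 2
    processors (rate R2, cost C2) as on Type 1 processors (rate r12, cost C1).
    """
    return spp(R2, C2) >= spp(r12, C1)


if __name__ == "__main__":
    # Hypothetical costs and rates, chosen only to show both outcomes.
    C1, C2 = 1000.0, 4000.0          # dollars per socket
    r12 = 500.0                      # Mflops on a Type 1 processor

    # A vector-friendly code: 6x faster on Type 2, which is 4x the price.
    print(is_type2_application(R2=3000.0, C2=C2, r12=r12, C1=C1))
    # A less vector-friendly code: only 3x faster on Type 2.
    print(is_type2_application(R2=1500.0, C2=C2, r12=r12, C1=C1))
```

The point of the comparison is that a raw speedup on the vector processor only makes an application "Type 2" when it exceeds the price ratio C2/C1.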
To complete a quantitative system-level SPP model, one more workload attribute is specified by the data center manager. Let F1 and F2 denote the target total number of operations to be computed, say annually, for each type of application. F1 and F2 reflect the "weighting" by users of the portion of their workload assigned to Type 1 and Type 2 applications.
The runtime of a real application has contributions from input/output (I/O), communication, and computation phases. In the simple model presented here, the flop rates in equation 2 represent overall flop rates, including all phases of application runtime. It is clear from figures 3 and 4 that the X2 has both high flop rates and high communication rates, so the overall flop rate R2 will have like-kind contributions from both phases of runtime. The time for System B to run the workload is given by equation 4 (figure 8).
Equation 4 indicates that the XT portion of the heterogeneous system is kept busy executing Type 1 applications, while the X2 portion is kept busy executing Type 2 applications. In fact, equation 1 and equation 4 can be used to size the Opteron and BW portions of an XT5h system in terms of the other specified parameters, as in equation 5 (figure 8).
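A sizing calculation of this kind can be sketched as follows. The code assumes, since figure 8 is not reproduced here, that the cost constraint is C = p1*C1 + p2*C2 and that both partitions finish their share of the workload at the same time, TB = F1/(p1*R1) = F2/(p2*R2). The function name and the numbers are hypothetical.

```python
def size_partitions(C, C1, C2, R1, R2, F1, F2):
    """Return (p1, p2, TB) for a heterogeneous system of total cost C.

    Assumed model (not taken from figure 8):
      cost:    C  = p1 * C1 + p2 * C2
      balance: TB = F1 / (p1 * R1) = F2 / (p2 * R2)
    """
    w1 = F1 / R1                     # processor-time needed for Type 1 work
    w2 = F2 / R2                     # processor-time needed for Type 2 work
    TB = (C1 * w1 + C2 * w2) / C     # runtime that exactly spends the budget
    p1 = w1 / TB
    p2 = w2 / TB
    return p1, p2, TB


if __name__ == "__main__":
    # Hypothetical, unit-free inputs chosen so the arithmetic is easy to check.
    p1, p2, TB = size_partitions(C=100.0, C1=1.0, C2=4.0,
                                 R1=1.0, R2=8.0, F1=60.0, F2=80.0)
    print(f"p1 = {p1:.1f}, p2 = {p2:.1f}, TB = {TB:.2f}")
    print("cost check:", p1 * 1.0 + p2 * 4.0)
```

The result is generally fractional, so a real procurement would round to whole sockets or nodes and accept a small imbalance between the two partitions.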
The time required for System A to run the workload is given by equation 6 (figure 8). Equation 6 says the entire System A first runs Type 1 applications and then Type 2 applications. By using equation 4 through equation 6, the two runtimes are related by equation 7 (figure 8).
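Because figure 8 is not reproduced in this text, the following is one possible reconstruction of equations 1 through 7 from the prose definitions, not a copy of the original figure. The symbol $p_A$ for the processor count of System A (to distinguish it from the Type 1 count $p_1$ of System B), the per-processor form of the SPP metrics, and the assumption that both partitions of System B finish at the same time are all assumptions made for this sketch.

```latex
\begin{align}
C &= p_A C_1 = p_1 C_1 + p_2 C_2 \tag{1}\\
\mathrm{SPP}_{1} &= \frac{R_1}{C_1}, \qquad
\mathrm{SPP}_{2} = \frac{R_2}{C_2}, \qquad
\mathrm{SPP}_{12} = \frac{r_{12}}{C_1} \tag{2}\\
\frac{R_2}{C_2} &\ge \frac{r_{12}}{C_1} \tag{3}\\
T_B &= \frac{F_1}{p_1 R_1} = \frac{F_2}{p_2 R_2} \tag{4}\\
p_1 &= \frac{C\,F_1/R_1}{C_1 F_1/R_1 + C_2 F_2/R_2}, \qquad
p_2 = \frac{C\,F_2/R_2}{C_1 F_1/R_1 + C_2 F_2/R_2} \tag{5}\\
T_A &= \frac{F_1}{p_A R_1} + \frac{F_2}{p_A r_{12}} \tag{6}\\
\frac{T_A}{T_B} &= \frac{p_1 C_1 + p_2 C_1 R_2 / r_{12}}{p_1 C_1 + p_2 C_2} \tag{7}
\end{align}
```

Under these assumptions, equation 7 follows by substituting $F_1/R_1 = p_1 T_B$ and $F_2/R_2 = p_2 T_B$ from equation 4 into equation 6 and using $p_A = C/C_1$ from equation 1. The numerator and denominator of equation 7 differ only in the factor multiplying $p_2$, which is $C_1 R_2/r_{12}$ in one and $C_2$ in the other.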
From equation 3, it follows that TA is greater than or equal to TB (equation 8, figure 8). That is, equation 8 says that for a fixed price, the homogeneous System A will take longer to run the heterogeneous workload than the heterogeneous System B. Furthermore, equation 7 makes it possible to quantify how much more work can be done on System B compared to System A in a given amount of time.
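A short numerical sketch shows how such a throughput comparison might be computed. It rests on assumptions not taken from figure 8: System A spends the whole budget C on Type 1 processors and runs the two application types one after the other, while System B is sized so that both partitions finish together. Function names and inputs are hypothetical.

```python
def runtime_system_a(C, C1, R1, r12, F1, F2):
    """Runtime of the homogeneous system: all of C buys Type 1 processors,
    which run the Type 1 work at rate R1 and then the Type 2 work at r12."""
    p_a = C / C1
    return F1 / (p_a * R1) + F2 / (p_a * r12)


def runtime_system_b(C, C1, C2, R1, R2, F1, F2):
    """Runtime of the heterogeneous system when the budget C is split so
    that the Type 1 and Type 2 partitions finish at the same time."""
    return (C1 * F1 / R1 + C2 * F2 / R2) / C


def throughput_gain(C, C1, C2, R1, R2, r12, F1, F2):
    """Ratio TA / TB: how many times faster System B completes the workload."""
    return (runtime_system_a(C, C1, R1, r12, F1, F2)
            / runtime_system_b(C, C1, C2, R1, R2, F1, F2))


if __name__ == "__main__":
    # Hypothetical, unit-free inputs. Type 2 processors cost 4x as much
    # but run Type 2 applications 8x faster than Type 1 processors do.
    gain = throughput_gain(C=100.0, C1=1.0, C2=4.0,
                           R1=1.0, R2=8.0, r12=1.0, F1=60.0, F2=80.0)
    print(f"TA / TB = {gain:.2f}")
```

When the Type 2 applications have exactly the same SPP on both processor types (R2/C2 equal to r12/C1), the ratio reduces to 1, and it grows as the SPP advantage of the Type 2 processors or the Type 2 share of the workload grows.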
Conclusions on Performance Advantages
For the set of applications and the scale considered here, the X2 vector system has a performance advantage of 2.5x to 9x over the XT4 microprocessor system on a processor-to-processor basis. Given loaded system costs for the X2 and XT4, one can calculate sustained price performance (SPP, in units of Mflops/$) for all codes. For this article, we looked at applications with higher SPP on XT4 processors and applications with higher SPP on X2 processors. Given user-specified target aggregate operations for the two application types in a workload, a model was provided to size the XT and X2 portions of the system and to calculate the increased throughput that results from selecting a heterogeneous system over a pure microprocessor-based system.
The benchmarks presented here, taken from the SC07 paper, used the dual-core AMD Opteron processors of the XT4 system. The XT5h system will use the latest quad-core AMD Opteron processor, so the Opteron benchmark results should be updated and new comparisons made with BW processors.
Further Reading
Condensed Results for HPC Challenge Benchmarks. http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
Intel Core2 Duo. http://h20311.www2.hp.com/HPC/cache/274276-0-0-0-121.html
G. Wellein, T. Zeiser, G. Hager, and S. Donath. 2006. On the single processor performance of simple Lattice Boltzmann kernels. Computers & Fluids, 35: 910-919.
Published by IOP Publishing in association with Oak Ridge National Laboratory, for the US Department of Energy, Office of Science. Copyright © 2008 by IOP.