| Hardware: Core Strength |
| Pixel-Painting Technology is the FUTURE of HPC |
As computing interfaces have become more and more graphical, industry has developed special hardware to meet the increasing demands for faster and more complex visual displays. Such graphics hardware is focused on dividing its tasks among more and more small engines — similar to the parallel computing approach used in scientific research and other fields. Thus, hardware and software developers are currently advancing graphics processing technology, making it a real option for the next generation of cutting-edge supercomputers. |
| In the dark ages, there was the command line interface and the black-and-white or black-and-green monitor. You had a keyboard and a screen. With the keyboard you told the computer, through typed commands, what you wanted it to do, and the screen (or printer) showed the results. Ever more powerful computer chips, or central processing units (CPUs), ran the show, and they were more than capable of displaying characters on a screen. |
| Then computing became visual. For most people, the command line gave way to the point-and-click graphical user interface. Computers gained the ability to display and manipulate photographs and other images. Games became more dazzling, and the internet turned the computer screen into a smaller cousin of the television set. |
| The CPU, which was good at running computers and crunching numbers, strained under the new load of advanced graphics. Calculating the right color for a screen pixel was not difficult, but CPUs had trouble calculating the rapidly changing colors of thousands, and then millions, of pixels. In response, the industry developed specialized processors designed to handle this growing task. The processing cores that comprised early graphics chips were not especially powerful or flexible, but there were many of them, and in large numbers they were capable. |
Taking a New Look
The computing approach being explored today by researchers and hardware manufacturers is known as GPGPU: general-purpose computing on graphics processing units (GPUs; figure 1). In this approach, CPUs control the simulations while GPUs handle the heavy number-crunching (sidebar “GPUs and Real World Applications” p62). |
 |
| NVIDIA |
| Figure 1. The Tesla C1060 GPU computing board (top) has one Tesla GPU for workstations. The board features 4 GB on-board memory and delivers 1 teraflop/s of processing power. The Tesla S1070 1U GPU computing system for datacenters contains four Tesla GPUs, 4 GB on-board memory per GPU, and a total of 4 teraflop/s processing power. |
|
| “CPUs will always be a component of systems,” noted Marc Adams of graphics chip maker NVIDIA. “There are fundamental things CPUs do extremely well, and they always will. We’re talking about large dataset computing, where you can break up data and parallelize certain aspects of that at a kernel level or function level — any of those things that can be parallelized to take advantage of our architecture is going to work well.” |
| NVIDIA introduced the GeForce 256 in 1999, billing the new chip as the first GPU. As graphics chips became more programmable and powerful (figures 2 and 3, p60, 61), they drew the attention of some members of the scientific computing community. |
 |
| NVIDIA |
| Figure 2. The Fermi architecture block diagram. |
|
 |
| Source: NVIDIA Illustration: A. Tovey |
| Figure 3. Comparison of the NVIDIA G80, GT200, and Fermi architectures. |
|
| The world’s most powerful research supercomputers are parallel behemoths. Oak Ridge National Laboratory’s (ORNL) Jaguar system, currently the world’s most powerful for open scientific computing, sports more than 180,000 computing cores that deliver up to 1.64 thousand trillion calculations every second (1.64 petaflop/s). |
| For supercomputing centers, then, the attraction of GPUs was obvious. As researchers careened down the inevitable road of parallel computing, they began looking closely at chips that were, literally, made for this approach to computing. |
| “Right now on the XT5 there are 12 cores per node,” explained Douglas Kothe, director of science at ORNL’s National Center for Computational Sciences. “Using one of these accelerators is like bringing a 500-ish core processor right on the node. You can look at these accelerators as a collection of processor cores; they’re not as fast, they’re less able, but there are many, many of them.” |
| For Kothe, one major advantage of GPUs is they decrease the time spent by processors communicating with one another. |
| “I don’t have to go off-node to talk to another node to get at that resource,” he said. “It’s like having a powerful cluster hooked up to my node. I’ve got a processor with hundreds of cores close that I don’t have to go out to communicate to. Even though each core is fairly weak, when you put them together and work in concert, the net parallel performance is impressive.” |
Software developer John Stone from the University of Illinois at Urbana-Champaign (UIUC) noted another advantage GPU usage has over alternative computing technologies such as the Cell processor, which is most famous for running Sony’s PlayStation 3 gaming console. “The economics of GPUs makes them a very desirable option,” he said. “That itself is an almost overriding issue. The average researcher does not have a large budget. If you can buy a GPU for $300 or $700 and get significant acceleration, that becomes far more attractive than the alternatives.” |
Pass the Heavy Lifting
Yesterday’s graphics processors were good at what they did because there were so many of them. Also known as graphics accelerators, the chips were an early approach to parallel computing. Calculating the color value for one pixel required no knowledge of any other pixel. As a result, the process of controlling a million-pixel monitor could conceivably be farmed out to as many as a million processors with minimal inconvenience. |
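The independence of each pixel can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `shade` function, the gradient it produces, and the tiny screen size are hypothetical, and the "engines" are simulated by interleaving pixel indices rather than by running on real graphics hardware. What it shows is that any way of dividing the pixels among workers yields the same image, because no pixel depends on another.

```python
WIDTH, HEIGHT = 8, 4  # a tiny hypothetical "screen"

def shade(x, y):
    """Return an (r, g, b) color for one pixel using only its own coordinates."""
    r = (255 * x) // (WIDTH - 1)
    g = (255 * y) // (HEIGHT - 1)
    b = 128
    return (r, g, b)

def render_serial():
    """One processor visits every pixel in order."""
    return [shade(i % WIDTH, i // WIDTH) for i in range(WIDTH * HEIGHT)]

def render_split(num_workers):
    """Simulate num_workers engines, each shading an interleaved share of the pixels."""
    image = [None] * (WIDTH * HEIGHT)
    for worker in range(num_workers):
        # Worker k handles pixels k, k + num_workers, k + 2 * num_workers, ...
        for i in range(worker, WIDTH * HEIGHT, num_workers):
            image[i] = shade(i % WIDTH, i // WIDTH)
    return image
```

Calling `render_split(5)` or `render_split(32)` returns the same list as `render_serial()`; only the division of labor changes.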
| “A CPU wasn’t very good at drawing tens of thousands of dots on a screen,” noted Andy Keane of NVIDIA. “It could kind of do that, but ultimately, there was a special chip that was good at it — known as a video controller. The CPU could tell the chip to draw a line or a square. The chip had fixed functions, but it could do them faster than the CPU.” |
| According to Keane, the CPU and the video controller advanced in different directions. CPUs focused on taking advantage of Moore’s Law, Intel cofounder Gordon Moore’s observation that the number of transistors that can be placed on an integrated circuit doubles every two years or so. Graphics hardware, on the other hand, focused on dividing its tasks among more and more small engines. CPUs had become very fast at tackling jobs that were complex and sequential, or serial; video controllers shone on jobs that were straightforward and parallel. |
| “For example, you might get up in the morning, get dressed, drink a cup of coffee, and drive to work,” Keane explained. “You have to do those things in order. They are inherently serial. But let’s say when you get to work you find the meeting room messy and grab five people to help clean it up. You can get the job done five times faster because you have five people helping.” |
| In the last decade, however, makers and users of these two technologies have run up against the limits of the technologies’ strengths, and each has drawn from the strengths of the other. Researchers approaching problems far too large and complex for a traditional processor tied the technologies together to make clusters and supercomputers that incorporate thousands, and even hundreds of thousands, of processing cores, creating enormous parallel computing machines. At the same time, developers in the computer graphics arena chafed at the limits of the simple, dedicated processors contained in the highly parallel graphics processors. They needed more power and control. |
“Video controllers had a limited set of functions,” noted Keane. “We got to the point where game developers wanted to combine these pieces. The list of commands people wanted to use could be too unwieldy, so we created programmable processors in 2000 and 2001. We got more and more sophisticated until the nature of the GPU changed, and most of the resources on the chip were processors, which were programmable. We really wanted the creativity of game developers and CADD (computer-aided drafting and design) developers to shine.” |
Pluses and Minuses
For scientific computing, the potential benefits from GPUs are impressive. Nevertheless, it is taking a substantial and coordinated effort among graphics chip makers, technology developers, and researchers to realize the full potential of this approach. |
| In the first place, graphics processors were not created for scientific research; they were created to light pixels on a computer screen, creating challenges for both manufacturers and coders. “In the past two years or so, a lot of strides have been made in programmability and architecture,” said Jeffrey Vetter, leader of the Future Technologies Group in ORNL’s Computing and Computational Sciences directorate. “Before, they didn’t have the types of instructions and floating-point operations that people needed in scientific computing. The GPUs were for gaming and graphics. They would sacrifice accuracy for speed.” |
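One way to see what sacrificing accuracy for speed can cost is to compare single- and double-precision accumulation. The sketch below assumes the trade-off in question is reduced floating-point precision; it emulates single precision in software with Python's `struct` module and does not run on a GPU. The step of 0.1 and the count of 100,000 are arbitrary choices made for illustration.

```python
import struct

def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times, rounding to float32 if requested."""
    total = 0.0
    if single_precision:
        step = to_float32(step)
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)  # emulate a single-precision register
    return total

COUNT = 100_000
EXACT = 10_000.0  # the mathematically exact value of 0.1 * 100,000

error_single = abs(accumulate(0.1, COUNT, single_precision=True) - EXACT)
error_double = abs(accumulate(0.1, COUNT, single_precision=False) - EXACT)
print(f"single-precision error: {error_single:.6g}")
print(f"double-precision error: {error_double:.6g}")
```

The single-precision total drifts much further from the exact answer than the double-precision total. That gap is harmless when the result is a pixel color but can matter in a long-running simulation.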
| Although the GPUs became programmable in the early years of this decade, the programming models and languages were designed to give graphics developers maximum video control and performance. These tools were alien to scientific researchers, and until the last few years the only way to use them for a scientific application was to trick them into thinking it was in fact a graphics application. |
| “I was programming these when there was barely a graphics language for it,” said Future Technologies Group member Jeremy Meredith, “back in the days when you were essentially writing assembler language for your graphics card. You weren’t dealing with x, y, and z coordinates; you weren’t even dealing with unnamed sets of values. You had to know that this floating-point number meant red, this floating-point number meant green, and this floating-point number meant blue. |
“There were all these subtle things you had to worry about when you were doing graphics programming that scientists — not only did they not care, they didn’t even want to know about. Scientists don’t want to have to learn about graphics and lighting, and what you do with polygons. It was very much doing graphics programming to get real general-purpose computation out of it.” |
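The workaround Meredith describes can be sketched as follows. This is a plain-Python illustration, not a real graphics API: the function names (`make_texture`, `gravity_shader`, `draw`, `unpack`) and the choice of velocity as the physical quantity are hypothetical. The idea is that scientific values are smuggled into color channels, and the "shader" applied to every pixel is really a physics update.

```python
# Each "texel" is an (r, g, b, a) tuple, but the numbers are not colors here:
# r, g, b hold a particle's x, y, z velocity and a is unused padding.
def make_texture(velocities):
    """Pack a list of (vx, vy, vz) tuples into RGBA-style texels."""
    return [(vx, vy, vz, 0.0) for (vx, vy, vz) in velocities]

def gravity_shader(texel, dt=0.5, g=-9.8):
    """A 'shader' that is really one physics step: gravity changes the green channel (vy)."""
    red, green, blue, alpha = texel
    return (red, green + g * dt, blue, alpha)

def draw(texture, shader):
    """Apply the shader to every texel, as a graphics pipeline would for every pixel."""
    return [shader(texel) for texel in texture]

def unpack(texture):
    """Recover (vx, vy, vz) tuples from the 'image' that was 'drawn'."""
    return [(r, g, b) for (r, g, b, _a) in texture]

velocities = [(1.0, 0.0, 0.0), (0.0, 2.0, -1.0)]
result = unpack(draw(make_texture(velocities), gravity_shader))
```

The programmer has to remember that "green" means vertical velocity. That bookkeeping is the burden scientists did not want to carry.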
Getting with the Programmer
The benefits of marrying scientific computing with highly parallel graphics chips have been clear for the better part of a decade, but there was a problem: scientists and graphics hardware did not speak the same language. |
| Early graphics hardware was created simply to color pixels on a screen, a job it did efficiently and automatically. Programmers, in fact, had no direct control over it. Even as the processors became more powerful and more programmable, largely in response to pressure from game developers, the improvements were not at first intended for scientific users. |
| ORNL’s John Turner grappled with these issues several years ago, when he led an effort at Los Alamos National Laboratory (LANL) to explore the power of graphics accelerators. |
| “Back in those days, the software environment was much more immature,” said Turner, who now leads ORNL’s Computational Engineering and Energy Sciences Group, “and it was changing very rapidly. You had to do a lot more very low-level programming, and you had to do it pretending your physics computation was a graphics computation.” |
| There were early tools for programming the chips, but the process was still alien to scientific researchers unaccustomed to thinking of their codes in terms of primary colors, simple shapes, and shading values. |
| “GPUs were, up until about 18 months ago, intolerable to have to program,” noted Vetter. “They were exotic architectures that nobody knew how to program. And just in the past two years or so a lot of strides have been made in programmability and architecture for applying to scientific computing.” |
Chip makers and developers alike have worked to overcome these obstacles, pushing toward a world in which researchers don’t have to trick hardware into thinking it’s doing something it’s not. NVIDIA is developing a programming environment known as CUDA (Compute Unified Device Architecture) that allows researchers to program in the languages they are most comfortable with. |
CUDA Libre
The biggest change in the past couple of years is that graphics chip maker NVIDIA launched CUDA, an environment that allows researchers to code in their favorite languages — at this point principally C and Fortran — while ignoring the graphics heritage of the hardware itself. |
| “The advantage to CUDA was that, because it came out of NVIDIA, it wasn’t trying to abstract all the graphics stuff; it literally bypassed it,” said Meredith, also a member of Vetter’s group at ORNL. “You’re not just hiding all the graphics stuff under another layer; you’re actually talking a little closer to the hardware, but in a more direct way. It’s not like it was trying to hide something from you, the complexity simply wasn’t there.” |
| “When CUDA came out, it was really a breakthrough technology,” said Jim Phillips, a colleague of Stone’s at UIUC. Phillips and Stone are working with the National Institutes of Health to develop two prominent scientific computing tools using CUDA. The first, nanoscale molecular dynamics (NAMD), is a tool for simulating large biomolecular systems; the second, called Visual Molecular Dynamics (VMD), is a molecular visualization and analysis program (figure 4, p63). “First of all, CUDA is a pretty good programming model for GPUs,” said Phillips. “Secondly, it’s supported by a major GPU vendor in a real way. It has major vendor support. That shows this was not a research project; this was a company saying, ‘We’re going to do this, and if you buy it, it will run.’” |
 |
| K. Schulten, J. Stone, and L. Trabuco, UIUC |
| Figure 4. This image shows the placement of ions near the RNA and ribosome in nucleic acids in preparation for simulation. The project, which uses the VMD application, is led by Klaus Schulten of UIUC. The image was created by senior research programmer John Stone and graduate student Leonardo Trabuco, also of UIUC. See Further Reading for more information. |
|
| According to NVIDIA’s Adams, the research community has responded to the approach. “We have hundreds of applications that have been CUDA-ized and ported over in a multitude of domains,” he said, noting that CUDA is being taught in more than 200 universities. In addition, he said, the CUDA software development kit has been downloaded nearly 200,000 times, and there are 50,000 or so developers worldwide. |
Easier, but Not Easy
While the various APIs available for CUDA may have liberated the coder from unnecessary complexity, plenty of necessary complexity remains. Not only must an application divide the computing job among many compute nodes, it must now divide jobs within each compute node among many processing cores in the GPU (figure 5). |
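The two levels of division can be sketched as a planning exercise in Python. This is a hedged illustration only: a real application would typically use a message-passing library across nodes and a GPU programming interface within each node, and the helper names `split_evenly` and `decompose`, along with the node and core counts, are hypothetical.

```python
def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def decompose(work_items, num_nodes, cores_per_gpu):
    """Two-level split: first across compute nodes, then across the GPU cores on each node."""
    plan = {}
    for node, node_items in enumerate(split_evenly(work_items, num_nodes)):
        for core, core_items in enumerate(split_evenly(node_items, cores_per_gpu)):
            plan[(node, core)] = core_items
    return plan

# Hypothetical job: 100 work items, 4 nodes, 8 GPU cores per node.
plan = decompose(list(range(100)), num_nodes=4, cores_per_gpu=8)
```

Each `(node, core)` key maps to the items that core would process. Every item appears exactly once, so the application has to manage both the split across nodes and the split within each node.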
 |
| Source: NVIDIA Illustration: A. Tovey |
| Figure 5. NVIDIA third-generation streaming multiprocessor. |
|
| “We’ve got pictures of code written in C and the same code written in C for CUDA,” noted Adams. “It is basically C with some extensions, so it doesn’t require you to learn a new programming language. Scientists typically get started with C for CUDA or another API like CUDA Fortran from PGI and can very quickly see a two- to three-fold speedup at the ‘application’ level with very minimal programming effort. This really gets them excited as to the potential for GPUs and they will then delve further into additional optimizations that are possible. This requires them to think a lot more about their data structures and application usage but with this additional effort the ability to achieve greater than twenty-fold speedups at an application level is not unusual.” |
“The learning curve for C for CUDA was about 80–20,” explained Dan Negrut of the University of Wisconsin’s mechanical engineering faculty. Negrut is working with collaborators Mihai Anitescu of Argonne National Laboratory and Alessandro Tasora of the University of Parma in Italy on a CUDA project simulating the dynamics of sand. “With 20% of the effort you get 80% of the results. It was easy to understand what we needed. You just have to identify the computational bottlenecks. Usually it’s a small portion of the code, and you focus on that.” |
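Why accelerating one bottleneck can pay off so well, and why application-level speedups stay below kernel-level ones, can be estimated with Amdahl's law. The article does not name this law; it is introduced here as a standard back-of-the-envelope model. The runtime fractions and the 50-fold kernel speedup below are made-up inputs, not measurements from any project mentioned in the text.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction          # runtime left on the CPU
    parallel_part = accelerated_fraction / kernel_speedup  # runtime moved to the GPU
    return 1.0 / (serial_part + parallel_part)

# Hypothetical profiles: the share of runtime spent in the bottleneck, assuming
# the bottleneck runs 50 times faster once ported to the GPU.
for fraction in (0.70, 0.90, 0.99):
    print(f"{fraction:.0%} of runtime accelerated 50x -> "
          f"{overall_speedup(fraction, 50.0):.1f}x overall")
```

Under these assumed numbers, porting a bottleneck that accounts for 70% of the runtime gives roughly a threefold gain overall. Larger gains require shrinking the part of the code that stays on the CPU, which is consistent with the extra optimization effort Adams describes.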
A Vendor-Agnostic Approach
NVIDIA is also working with researchers and other manufacturers to develop a cross-platform environment for programming mixed CPU/GPU systems. Known as OpenCL, the effort aims to let researchers rework applications totaling as many as tens of thousands of lines of code to take advantage of graphics accelerators without having to rely on a single system or manufacturer. |
“People are saying, okay, we’re starting to believe in the revolution, so now we want to make it easier for applications people to write their code once and have it work on multiple heterogeneous devices,” said Vetter. “Right now if you go to any supercomputer center, any procurement is going to have mandatory requirements for a C standard compiler, a Fortran standard compiler, an MPI-compliant library, a parallel file system, and some other nooks and crannies of the software development environment that are absolutely critical. Part of the reason for that is that everybody’s built all their codes around these APIs and languages. Right now there isn’t a similar software stack for heterogeneous computing. The hope is if OpenCL matures, it will be one language that you can program an NVIDIA, or an Intel, or an AMD graphics processor, or even a ClearSpeed or a Cell processor.” |
The Future of Supercomputing
In any case, it looks clear that many, if not most, computational scientists will have to adapt to the use of accelerators. |
| “The point is that the next-generation systems are going to have more and more cores, some big and some small,” said Kothe. “Our argument is that the applications have to be prepared for this and refactor and redesign. Codes as they are aren’t the way forward. You’re going to have to rewrite.” |
| Graphics chip manufacturers have also worked to correct hardware shortcomings that might stand in the way of GPUs being used for scientific research (figures 6 and 7). |
 |
| Source: NVIDIA Illustration: A. Tovey |
| Figure 6. Overlapped kernel execution in the Fermi architecture. |
|
 |
| Source: NVIDIA Illustration: A. Tovey |
| Figure 7. Early performance evaluations show Fermi performing 4.2 times faster than the GT200 in double-precision applications. |
|
| “NVIDIA realized early on that this was not just a flash-in-the-pan trend,” said Adams. “Fundamentally, GPUs had become more and more important within the context of computing. We started adding things like shared memory caches on our processing units (figure 8) and other things, to allow programmers to section data for large dataset computing. It also allows for better communication because you have on-chip memory now, which is available to programmers so you’re not having to constantly access off-chip memory.” |
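The benefit of sectioning data into on-chip memory can be illustrated with a simple access-counting model. This sketch makes several assumptions: the three-point averaging stencil is a hypothetical workload, the "on-chip" buffer is just a Python list slice, and each element copied from the full array is counted as one off-chip read. It models the idea only and says nothing about actual hardware performance.

```python
def stencil_naive(data):
    """3-point average; every input value used is counted as one off-chip read."""
    reads = 0
    out = []
    for i in range(1, len(data) - 1):
        out.append((data[i - 1] + data[i] + data[i + 1]) / 3.0)
        reads += 3
    return out, reads

def stencil_tiled(data, tile):
    """Same result, but each tile (plus one halo cell per side) is staged on-chip once."""
    reads = 0
    out = []
    for start in range(1, len(data) - 1, tile):
        stop = min(start + tile, len(data) - 1)
        shared = data[start - 1:stop + 1]  # copy tile and halo: one off-chip read each
        reads += len(shared)
        for j in range(1, len(shared) - 1):
            # All three operands now come from the staged copy, not off-chip memory.
            out.append((shared[j - 1] + shared[j] + shared[j + 1]) / 3.0)
    return out, reads

data = [float((i * i) % 7) for i in range(1026)]
naive_out, naive_reads = stencil_naive(data)
tiled_out, tiled_reads = stencil_tiled(data, tile=256)
```

Both versions produce identical output, but in this model the tiled version performs about a third as many off-chip reads, because neighboring outputs reuse values already staged in the shared buffer.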
 |
| Source: NVIDIA Illustration: A. Tovey |
| Figure 8. The Fermi memory hierarchy. |
|
| Hardware and software developers’ efforts are showing a payoff by advancing graphics accelerators beyond the realm of promising oddity and making them a real option in the next generation of cutting-edge supercomputers. |
| “We’ve looked at this before and have some experience, but it wasn’t the right time then,” said Turner, who led an effort at LANL to explore these technologies several years ago. “We’ve used that experience, and now’s the right time to be doing this, and it’s the kind of thing a leadership computing facility should do. It’s not going to be easy, but we should be doing it.” |
Kothe agreed. “This is the way machines are headed. Period,” he said. “I don’t see any other way to get beyond where we are.” |
Contributors Leo Williams, ORNL; Marc Adams and James Wang, NVIDIA |
Further Reading
J. E. Stone et al. 2007. Accelerating molecular modeling applications with graphics processors. J. Comput. Chem. 28: 2618-2640.
http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2007 |