DOE SciDAC Review Office of Science
BIOLOGY and Medical Research at the EXASCALE
Advances in computational hardware and algorithms that have transformed areas of physics and engineering have recently brought similar benefits to biology and biomedical research. Biological sciences are undergoing a revolution. High-performance computing has accelerated the transition from hypothesis-driven to design-driven research at all scales, and computational simulation of biological systems is now driving the direction of biological experimentation and the generation of insights.
As recently as 10 years ago, success in predicting how proteins assume their intricate three-dimensional forms was considered highly unlikely if there was no related protein of known structure. For those proteins whose sequence resembles a protein of known structure, the three-dimensional structure of the known protein can be used as a template to deduce the unknown protein structure. At the time, about 60% of protein sequences arising from the genome sequencing projects had no homologs of known structure.
In 2001 Rosetta, a computational technique developed by Dr. David Baker and colleagues at the Howard Hughes Medical Institute (HHMI), successfully predicted the three-dimensional structure of a folded protein from its linear sequence of amino acids. Baker now develops tools to enable researchers to test new protein scaffolds, examine additional structural hypotheses regarding determinants of binding, and ultimately design proteins that tightly bind endogenous cellular proteins.
In 2003 a 13-year project to sequence the human genome was declared a success, making available to scientists worldwide billions of "letters" of DNA to conduct post-genomic research, including annotation of the human genome. Today, with technology driving the sequencing and annotation of thousands of single-celled organisms, many of the diverse microbial organisms will be completely sequenced by the time exascale computing arrives.
In results also published in 2003, a research team led by the University of Chicago described the first molecular dynamics (MD) computation that accurately predicted a folded protein — a small, 36-residue alpha helical protein called the villin headpiece. This protein, which folds on the microsecond timescale, remains the subject of computational studies today.
In 2005, when Los Alamos National Laboratory (LANL) researchers Dr. Kevin Sanbonmatsu and Dr. Chang-Shung Tung conducted the first million-atom simulation in biology, they noted that the first million-particle simulations in materials science and cosmology had been performed over a decade previously (and today computational physicists in both fields perform multibillion-particle simulations). The all-atom biomolecular simulation of the ribosome is considered a high-performance computing milestone in the field of computational molecular biology — it is the most demanding in terms of computation, communication speed, and memory.
Figure 1. A coarse-graining scheme. (a) Complete STMV particle in the CG representation (simulation simSTMVfull). The capsid (gray) consists of 60 identical proteins; 5 proteins around the 5-fold symmetry axis are shown in various colors. Part of the capsid is removed to demonstrate the RNA core of the virus (red). The inset shows a unit protein of STMV in both the all-atom and the CG (green) representations. (b) The CG model of the reovirus core (partially open to show the inside), the largest system simulated (simRV). The STMV particle (upper left) is shown to scale for comparison. At the bottom, several proteins (drawn to scale) from the reovirus core are presented (all-atom versus CG, the coloring of the CG models corresponds to the colors used in the snapshot of the full particle).
In 2006, Dr. Klaus Schulten and colleagues from the University of Illinois–Urbana-Champaign (UIUC) and the University of California–Irvine published findings from an all-atom MD simulation of a complete virus, the satellite tobacco mosaic virus (STMV). Although the full processes of assembly and disassembly were too slow to be fully simulated, brief MD simulations on the STMV structure under various conditions provided a time-resolved picture of the molecular processes involved in the virus life cycle (figure 1). Understanding the mechanics of these processes is the key to developing more effective treatments for viral diseases.
Exascale Computing Challenges
Fifteen years ago, petascale computation was not possible. The introduction of the Cray XT and IBM Blue Gene leadership-class computers enabled researchers to begin addressing more complex problems (sidebars "Leadership Facilities Advancing Biology: INCITE 2008" p34, and "Biology Research and INCITE 2009" p36). Today, simulation of biological processes is already pushing beyond the petascale class of computing systems coming online. Such simulations are now capable of delivering sustained performance approaching 10^15 floating-point operations per second (petaflop/s) on large, memory-intensive applications.
A series of three town hall meetings was held in 2007 to engage the computational science community about the potential benefits of advanced computing. Several global challenge problems were identified in the areas of energy, the environment, and basic science, all with significant opportunities to exploit computing at the exascale (10^18) and all with a common thread: biology.
In energy, scientists look to exascale to be able to attack problems in combustion, the solution of which could improve the efficient use of liquid fuels, whether from fossil sources or from renewable sources. First-principles computational design and optimization of catalysts will also become possible at the exascale, as will de novo design of biologically mediated pathways for energy conversion.
In the environment, scientists anticipate the need for exascale computing in climate modeling; integrated energy, economics, and environmental modeling; and multiscale modeling from molecules to ecosystems. Many biological processes of interest to the U.S. Department of Energy (DOE) Office of Science are mediated by membrane-associated proteins, including the detoxification of organic waste products.
In biology, the challenges of modeling at multiple scales (from atomic, through genomic and cellular, to ecosystems) are driving the need for exascale computing and a new set of algorithms and approaches. For example, systems biology, a computational approach to understanding cellular machines and their related genes and biochemical pathways, aims to develop validated capabilities for simulating cells as spatially extended mechanical and chemical systems in a way that accurately represents processes such as cell growth, metabolism, locomotion, and sensing. Today, modeling and simulation provide only a local view of each process, without interaction between modalities and scales. Exascale computing is needed to represent the various macroscopic subsystems and to enable a multiscale approach to biological modeling.
New Tools Needed for New Challenges
Even with exascale systems, however, the infrastructure needed to generate and analyze molecular data will require development of simulation management tools that encompass clustering, archiving, comparison, debugging, visualization, and communication — all of which must also address current computing bottlenecks that limit the scope of analysis.
For example, researchers are limited to the current microsecond timescale for protein folding by the huge number of intermolecular interaction computations required. Scientists also lack rigorous coarse-grained models that permit the scaling up of macromolecular pathways and supramolecular cellular processes. Similarly, systems biology methods lack the dynamic resolution needed for coupling genomic and other data in order to map cellular networks, to predict their functional states, and to control the time-varying responses of living cells. Nor can current analytic models adequately analyze the dynamics of complex living systems.
Researchers have achieved impressive methodological advances that permit the modeling of the largest assemblies in the cell, but only for short periods of time. And unfortunately, these simulations are unlikely to scale to the size of a single cell, even a small bacterium, for relevant times such as minutes or hours — even if researchers can employ computers capable of achieving 1,000 petaflop/s. New, scalable, high-performance computational tools are essential.
Seven Success Stories
In anticipation of exascale computing, and capitalizing on the capabilities of current leadership-class computers, researchers are conducting simulations previously believed infeasible.
Large-Scale Simulations of Cellulases
Dr. Jeremy Smith, a molecular biophysicist at Oak Ridge National Laboratory (ORNL), together with colleagues at the National Renewable Energy Laboratory in Colorado and Cornell University, is using ORNL's Jaguar supercomputer to model bacterial and fungal cellulases in action. Large-scale molecular dynamics simulations run on the supercomputer allow researchers to "watch" these simulated enzymes attack digital cellulose strands, transfer a strand's sugar molecules to the enzyme's catalytic zone, and chemically digest the sugar to provide the microbe with energy. Understanding how cellulases degrade cellulose is the key to increasing the efficiency and lowering the cost of ethanol production using sugar from cellulose in trees and other biomass. If the team can understand how the cellulase enzyme functions, how it recognizes cellulose strands, and how the chemistry is accomplished inside the enzyme, it may be able to determine which rate-limiting steps might be genetically engineered to make cellulase more efficient at degrading cellulose into glucose.
Building a Cognitive Computing Chip
Scientists from IBM Research and five university partners are leading an effort to understand the complex wiring system of the brain and to build a computer that can simulate and emulate the brain's abilities of sensation, perception, action, interaction, and cognition while rivaling its low power consumption and compact size. Using the Dawn Blue Gene/P supercomputer at Lawrence Livermore National Laboratory (LLNL) with 147,456 processors and 144 terabytes of main memory, the team achieved a simulation with 1 billion spiking neurons and 10 trillion individual learning synapses. This is equivalent to 1,000 cognitive computing chips, each with 1 million neurons and 10 billion synapses, and exceeds the scale of a cat cerebral cortex. The simulation ran 100 to 1,000 times slower than real time. The team has also developed a new algorithm, BlueMatter, which exploits the Blue Gene supercomputing architecture to noninvasively measure and map the connections between all cortical and subcortical locations within the human brain using magnetic resonance diffusion-weighted imaging. Mapping the wiring diagram of the brain is crucial to untangling its vast communication network and understanding how it represents and processes information. Only recently has the technology advanced sufficiently to match the density of neurons and synapses in real brains: around 10 billion per square centimeter.
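The scale figures quoted for this simulation can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative, and the per-processor figure assumes neurons were spread evenly across processors, which the text does not state.

```python
# Figures quoted in the text for the Dawn Blue Gene/P simulation.
chips = 1_000
neurons_per_chip = 1_000_000
synapses_per_chip = 10_000_000_000
processors = 147_456

total_neurons = chips * neurons_per_chip      # 1 billion
total_synapses = chips * synapses_per_chip    # 10 trillion
synapses_per_neuron = total_synapses // total_neurons

# Assumption (not stated in the text): neurons are divided evenly
# across all processors.
neurons_per_processor = total_neurons / processors

print(f"neurons: {total_neurons:,}")
print(f"synapses: {total_synapses:,}")
print(f"synapses per neuron: {synapses_per_neuron:,}")
print(f"neurons per processor: {neurons_per_processor:,.0f}")
```

The chip-equivalent figures and the simulation totals agree, and each simulated neuron carries on the order of ten thousand synapses.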
International Union of Physiological Sciences Physiome Project
The Physiome Project is a worldwide public-domain effort to provide a computational framework for understanding human and other eukaryotic physiology. It aims to develop integrative models at all levels of biological organization, from genes to the whole organism, via gene regulatory networks, protein pathways, integrative cell function, and tissue and whole organ structure/function relations. Current projects include the development of ontologies to organize biological knowledge and access to databases; markup languages to encode models of biological structure and function in a standard format for sharing between different application programs and for reuse as components of more comprehensive models; databases of structure at the cell, tissue, and organ levels; software to render computational models of cell function in 2D and 3D graphical form; and software for displaying and interacting with the organ models that will allow the user to move across all spatial scales.
Large-Scale Simulation of the Ribosome
Dr. Kevin Sanbonmatsu and Dr. Chang-Shung Tung of LANL have simulated the rate-limiting step in genetic decoding by the ribosome. The simulations used experimentally determined ribosome structures in different functional states as the initial and final conditions, making the simulations rigorously consistent with the experimental data. The calculations required approximately 1 million CPU-hours on 768 CPUs, or about 10% of the 13.88 teraflop/s LANL Q Machine. Previously, only static snapshot structures of the ribosome were available; limitations in time resolution and spatial resolution prevented experimental imaging of the ribosome in motion in atomic detail. The simulations on the Q Machine allowed the researchers to visualize the motion of transfer RNAs inside the ribosome occurring during decoding. The ribosome simulations helped to elucidate a crucial molecular mechanism for gene expression, which opens the door for simulations of other large molecular machines important for gene expression and drug design.
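To give a feel for the cost quoted above, the following sketch converts the CPU-hour total into wall-clock time. It assumes that all 768 CPUs ran concurrently for the entire job, which the text does not state, so the result is a rough lower bound on elapsed time rather than a reported figure.

```python
# Figures quoted in the text for the ribosome simulation.
cpu_hours = 1_000_000        # "approximately 1 million CPU-hours"
cpus = 768
machine_peak_tflops = 13.88  # LANL Q Machine
fraction_of_machine = 0.10   # "about 10%"

# Assumption: all 768 CPUs run concurrently for the whole job.
wall_clock_hours = cpu_hours / cpus
wall_clock_days = wall_clock_hours / 24

# Peak rate of the slice of the machine used. This is an upper bound
# on throughput, not a measured sustained rate.
slice_peak_tflops = machine_peak_tflops * fraction_of_machine

print(f"wall-clock time: {wall_clock_hours:.0f} hours "
      f"(about {wall_clock_days:.0f} days)")
print(f"peak rate of the slice used: {slice_peak_tflops:.3f} teraflop/s")
```

Under that assumption the job occupies its share of the machine for roughly 54 days of continuous running.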
Figure 3. An alpha-synuclein (Asyn) pentamer (various colors for each participating Asyn molecule) on the cell membrane interacting with beta-amyloid 1–42 (Abeta) shown in orange. This interaction can contribute to neurodegeneration during the combination of Parkinson’s and Alzheimer’s diseases.
Figure 4. An Asyn pentamer on the membrane. The pentamer is constructed with theoretical docking of Asyn conformers that occur at 4 ns of molecular dynamics simulation. These conformers have the best membrane contacting properties (calculated by the program MAPAS). The geometrical dimensions of this pentamer correspond to those experimentally elucidated by electron microscopy.
Large-Scale, Folding-Based Molecular Simulation
In 2005 Kobe University researchers Dr. Nobuyasu Koga and Dr. Shoji Takada published the results of their folding-based molecular simulations that revealed the mechanisms of the rotary motor F1-ATPase. Biomolecular machines, such as the ribosome, transporters, and molecular motors, fulfill their function through large-amplitude conformational change. Molecular dynamics simulation is potentially powerful because it can provide full time-dependent structural information about biomolecular machines, but functional cycles of these systems typically take milliseconds or longer, which is far beyond the current reach of molecular simulations with standard all-atom force fields. Another approach is to use a coarse-grained molecular representation, which enables simulation over time scales that are orders of magnitude longer; however, coarse-graining drops some details from the model. Structural information before and after conformational change has been provided by X-ray crystallography and other methods in many cases, but these methods do not directly observe the molecular dynamics that connect the two end structures. These dynamical aspects can be observed directly by fluorescence and other time-resolved spectroscopy; however, those methods monitor local structure and do not give global structural information. The solution formulated by Drs. Koga and Takada was a computational framework, called a "switching Gō model," for simulating large-amplitude motion of biomolecular machines. Gō models are suitable for representing large-amplitude conformational dynamics because they account for both small fluctuations around the native basin and large fluctuations that involve local unfolding. By combining all available experimental data with simulation results, the team identified the rotary motion of F1-ATPase, which had long been under debate in the field.
This work opens an avenue of simulating large-scale motion involved in dynamical function of large biomolecular complexes by folding-based models.
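As a rough illustration of the idea behind a structure-based (Gō-type) energy and of switching between reference structures, the sketch below scores a toy six-bead chain against two references. Everything in it is an assumption made for illustration: the 12-10 contact term, the repulsive term, the cutoff and spacing values, the bead coordinates, and the function names `native_contacts` and `go_energy`. It is not the potential or the parameters used by Drs. Koga and Takada.

```python
import math
from itertools import combinations

def native_contacts(ref, cutoff=6.5, min_sep=3):
    """Map (i, j) -> distance for bead pairs closer than `cutoff` in `ref`."""
    contacts = {}
    for i, j in combinations(range(len(ref)), 2):
        if j - i >= min_sep:
            d = math.dist(ref[i], ref[j])
            if d < cutoff:
                contacts[(i, j)] = d
    return contacts

def go_energy(coords, contacts, eps=1.0, sigma=4.0, min_sep=3):
    """Toy structure-based energy: reference contacts attract, others repel."""
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        if j - i < min_sep:
            continue
        r = math.dist(coords[i], coords[j])
        if (i, j) in contacts:
            r0 = contacts[(i, j)]
            # 12-10 form with its minimum (-eps) at the reference distance.
            energy += eps * (5 * (r0 / r) ** 12 - 6 * (r0 / r) ** 10)
        else:
            energy += eps * (sigma / r) ** 12
    return energy

# Two hypothetical reference structures for a six-bead chain (angstroms).
closed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]

closed_contacts = native_contacts(closed)
extended_contacts = native_contacts(extended)

# "Switching": the same conformation is scored against a new reference.
before = go_energy(closed, closed_contacts)
after = go_energy(closed, extended_contacts)
print(f"closed chain, closed reference:   {before:.3f}")
print(f"closed chain, extended reference: {after:.3f}")
```

When the reference is switched, the conformation that sat at the bottom of the energy basin becomes unfavorable, so a dynamics run under the new reference would drive the chain toward the other end structure.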
Large-Scale Simulation of Ion Channels
Voltage-gated potassium channels, or Kv channels, are involved in the generation and spread of electrical signals in neurons, muscle, and other excitable cells. In order to open the gate of a channel, the electric field across the cellular membrane acts on specific charged amino acids that are strategically placed in the protein in a region called the voltage sensor. In humans, malfunction of these proteins, sometimes owing to the misbehavior of only a few atoms, can result in neurological diseases. A wealth of experimental data exists from a wide range of approaches, but its interpretation is complex. One must ultimately be able to visualize atom-by-atom how these tiny mechanical devices move and change their shape as a function of time while they perform their function. Dr. Benoit Roux and a team of researchers from Argonne National Laboratory (ANL) and the University of Chicago are using a tight integration of experiment, modeling, and simulation to gain insights into Kv channels (sidebar "Understanding the Structure and Function of Ion Channels" p39). Their studies serve as a roadmap for simulating, visualizing, and elucidating the inner workings of these nanoscale molecular machines. Because these channels are functional electromechanical devices, they could be used in the design of artificial switches in various nanotechnologies (figure 5, p38). The practical applications of this work are significant. For example, the research in ion channel mechanisms may help identify strategies for treating cardiovascular disorders such as long-QT syndrome, which causes irregular heart rhythms and is associated with more than 3,000 sudden deaths each year in children and young adults in the United States. Moreover, the studies may help researchers find a way to switch or block the action of toxins, such as those emitted by scorpions and bees, that plug the ion channel pores in humans.
Figure 5. Complete model of the Kv1.2 channel assembled using the Rosetta method. The atomic model comprises 1,560 amino acids, 645 lipid molecules, 80,850 water molecules and ~300 K+ and Cl- ion pairs. In total, there are more than 350,000 atoms in the system. The simulations were generated by using NAMD on the Cray XT (Jaguar) at ORNL and the Blue Gene/P at ANL.
Protein Folding
In 2001, HHMI investigator Dr. David Baker and his colleagues at the University of Washington successfully predicted the 3D structure of a folded protein from its linear sequence of amino acids (figure 6). Key to the success was Rosetta, a computer algorithm for predicting protein folding. Experimental studies of protein folding by Dr. Baker's laboratory and many others had shown that each local segment of the chain flickers between a different subset of local conformations. Folding to the native structure occurs when the conformations adopted by the local segments and their relative orientations allow burial of the hydrophobic residues, pairing of the beta strands, and other low-energy features of native protein structures. In the Rosetta algorithm, the distribution of conformations observed for each short sequence segment in known protein structures is taken as an approximation of the set of local conformations that sequence segment would sample during folding. The program then searches for the combination of these local conformations that has the lowest overall energy. Dr. Baker has also set up a project, called Rosetta@home (see Further Reading), to run the Rosetta program on unused computer resources. The intent is to accurately predict and design protein structures and protein complexes that may ultimately lead to finding cures for some major human diseases.
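The fragment-based search described above can be sketched in miniature. The example below is a toy under stated assumptions, not the Rosetta code or its energy function: the three-state alphabet (H, E, L), the fragment library, the scoring terms in `toy_energy`, and the Metropolis acceptance rule in `assemble` are all hypothetical stand-ins chosen to show the shape of the idea, namely swapping in local conformations drawn from a library and keeping the lowest-energy combination found.

```python
import math
import random

HYDROPHOBIC = set("AVILMFWC")

def toy_energy(sequence, states):
    """Hypothetical score: lower is better. Not a physical energy."""
    energy = 0.0
    for residue, state in zip(sequence, states):
        if residue in HYDROPHOBIC and state == "L":
            energy += 1.0  # stand-in for an unburied hydrophobic residue
    for a, b in zip(states, states[1:]):
        if a != b:
            energy += 0.5  # stand-in for a broken structural element
    return energy

def assemble(sequence, library, steps=2000, temperature=0.5, seed=0):
    """Search over fragment combinations with Metropolis acceptance."""
    rng = random.Random(seed)
    frag_len = len(next(iter(library.values()))[0])
    states = ["L"] * len(sequence)
    energy = toy_energy(sequence, states)
    best_states, best_energy = states[:], energy
    for _ in range(steps):
        # Pick a window and one of its candidate local conformations.
        start = rng.choice(sorted(library))
        fragment = rng.choice(library[start])
        trial = states[:]
        trial[start:start + frag_len] = fragment
        trial_energy = toy_energy(sequence, trial)
        delta = trial_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            states, energy = trial, trial_energy
            if energy < best_energy:
                best_states, best_energy = states[:], energy
    return "".join(best_states), best_energy

# Hypothetical sequence and library: every 3-residue window may adopt
# one of three local conformations.
sequence = "AVLKGDEIVF"
library = {i: ["HHH", "EEE", "LLL"] for i in range(len(sequence) - 2)}

best, best_energy = assemble(sequence, library)
print(best, best_energy)
```

In a real predictor the library would differ from window to window, reflecting the conformations actually observed for each short sequence segment in known structures, and the score would capture features such as hydrophobic burial and strand pairing.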
Figure 6. Researchers determined the structure of this large protein, ALG13, which is 200 amino acids in length, using a new methodology called NMR structure determination without side chain assignments.
Exciting Applications Awaiting Exascale
Researchers have identified numerous exciting areas in biology, including the following, for which exascale computing is required.
  • "Building the system" problems: rapid and high-fidelity assessment of metabolic and regulatory potential of thousands of cultured and sequenced prokaryotes of DOE mission importance
  • "Simulating the behavior" problems: predicting and simulating microbial behavior and response to changing environmental or process-related conditions — from simple to complex communities and ecosystems — spanning a range of spatial and temporal scales
  • Reverse engineering of the brain: bottom-up models incorporating all available physiological detail in order to capture the biological function of the brain, predicting consequences of activation and of pharmacological or electrophysiological intervention
  • Image-based phenotyping: image segmentation methods that scale well to large amounts of data (for example, a human) and that could lead to personalized medicine
  • Phylogenetics: phylogeny estimation, models of evolution, comparative biological methods, and population genetics, with particular focus on understanding horizontal gene transfer and the evolution of populations
  • Genome analysis and sequence analysis: genome assembly, genome and chromosome annotation, gene finding, alternative splicing, comparative genomics, multiple sequence alignment, sequence search and clustering, function prediction, motif discovery, and functional site recognition in protein, RNA, and DNA sequences
  • Structural bioinformatics: structure matching, prediction, analysis, and comparison; methods and tools for docking; protein design and drug design
  • Systems biology: systems approaches to molecular biology, multiscale modeling, pathways, gene networks, large-scale development of models for many organisms and comparative modeling
  • Microbial ecology: engineering of stable microbial communities for practical applications, and understanding the carbon cycle through multiscale modeling of complex ecosystems.
Contributors: Laura Wolf and Dr. Gail W. Pieper, Argonne National Laboratory
Further Reading