BIOPILOT
DATA-DRIVEN Computing for BIOLOGICAL Systems
Figure 1. The B-band O-antigen chain structure of Pseudomonas aeruginosa, illustrating the potential cation complexation sites as revealed by a molecular dynamics simulation.
Bioenergy production, carbon sequestration, and bioremediation of contaminated sites are among the most urgent issues facing today's society. Major incentives—economic, geopolitical, and environmental—drive a scientific mandate for the Department of Energy (DOE) to research and develop cost-effective and beneficial solutions for energy and environmental security. The BioPilot project strives to provide the computational tools necessary for large-scale predictive modeling and simulation of complex biological systems.
What are the design principles for a biological system that attains low-cost ethanol production from plant biomass? How do microbes convert toxic waste to nontoxic substances? What are the mechanisms underlying bacterial degradation of lignin, the second most abundant carbon polymer on Earth?
Answers to such questions are of groundbreaking significance, but are hard to obtain. Although the starting point is often experimental data, it is imperative to truly synthesize the three pillars of scientific discovery—experimentation, advanced theory, and large-scale computation—to address these challenges effectively and comprehensively.
Biological systems are inherently complex (figure 1). This complexity arises from the selective and nonlinear interconnection of functionally diverse components to produce coherent behavior. For example, in eukaryotes, such as humans, an intricate interplay of 50 or more proteins may regulate the activity of a single gene performing a relatively complex function. Computational modeling and simulation that reproduce and predict such behavior form the Holy Grail of next-generation biological science.
Arguably, the paradigm shift from descriptive to predictive biological science began in earnest with the discovery of the DNA double helix, recognized by the 1962 Nobel Prize, and with the DOE-initiated Human Genome Project. The beauty of these discoveries is that mathematics, computing, and technology enrich each other and potentially lead to engineering solutions for designing and controlling biological systems. That promise has inspired researchers to try to translate many complex biological patterns into computer models defined by sets of simple rules.
Unlike Maxwell's four equations, which describe all electromagnetic phenomena, the fundamental rules (simplicity) that quantify the low-dimensional behavior of biological systems have yet to be discovered. Many Grand Challenge problems in the life sciences would greatly benefit if a systematic framework for discovering such rules existed. A promising approach aims to interrelate the emerging disparate and noisy "-omics" data by relying on mathematics, computer science, information technology, and computing. Conventional software and hardware have, unfortunately, been unable to deal efficiently and effectively with such massive datasets.
The goal of DOE's BioPilot project, "Data-Intensive Computing for Complex Biological Systems," is to provide an integrated suite of flexible high-performance data analysis tools to enable large-scale predictive modeling and simulation of complex biological systems. The BioPilot project is a collaborative, multidisciplinary research effort between Pacific Northwest National Laboratory (PNNL) and Oak Ridge National Laboratory (ORNL). Dr. Nagiza Samatova of ORNL and Dr. T. P. Straatsma of PNNL serve as principal investigators of the project. The BioPilot's open source software is being used to address specific challenges facing the DOE and our society.

From Data-Driven to Simulation-Driven Biological Science
Ultrascale simulations in physical sciences such as fusion, combustion, and accelerator science have revolutionized the way the science is conducted. Over the past decade, a phase transition in simulation science has occurred—a shift from validation science, when simulation results agree with experiments, to predictive science, when simulation results drive novel experiments.
Newly developed simulations of complex biological phenomena are quite promising. Yet the majority of biological problems are still being addressed qualitatively. There remains a huge gap between qualitative, experiment-driven biological science and quantitative, simulation-driven, predictive science. Closing this gap is critical for addressing the important challenges in energy and environmental security.
Our understanding of biological systems can be measured by our ability to symbolically reconstruct their inner workings and to predict their dynamic behaviors in response to changes in environmental conditions. Predictive simulations of biological systems' dynamics require data-driven model-building, unlike simulations from "first principles," where underlying models are described by a system of equations.
Figure 2. Iterative data-driven predictive modeling and simulation of complex biological systems.
Data-driven model construction is an iterative process (figure 2). It frequently starts with a comprehensive enumeration of "components" derived from experimental data (data analysis). Putative interactions between these "components" then lead to model abstractions, such as biological networks (model abstraction). Parameterizations for specific environmental conditions enable simulations of biological systems' dynamics. Analysis of simulation results then leads to predictions of biological systems' functions (modeling and simulation). These predictions generate specific hypotheses that can then be experimentally tested. Comparisons between predictions and experiments can then be used to refine in silico models to reflect improved understanding of biological systems. The organizational structure of the BioPilot project parallels this three-step process of data-driven model building.
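As a rough, purely illustrative sketch of this cycle—not BioPilot code—the loop can be written in a few lines. Everything in the example is a made-up stand-in: a one-parameter "model" (a growth rate) is repeatedly simulated, compared with a hypothetical experiment, and refined until predictions and observations agree.

```python
# Toy illustration of the iterative cycle in figure 2 (not BioPilot code).
import random

def simulate(growth_rate, hours):
    """Modeling and simulation step: predict abundance over time."""
    return [(1.0 + growth_rate) ** h for h in range(hours)]

def run_experiment(hours, true_rate=0.30):
    """Stand-in for a wet-lab experiment: noisy observations of the true system."""
    random.seed(1)
    return [(1.0 + true_rate) ** h * random.uniform(0.95, 1.05) for h in range(hours)]

def discrepancy(pred, obs):
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

model = {"growth_rate": 0.10}                       # initial abstraction from the data
observations = run_experiment(hours=10)
for iteration in range(20):
    predictions = simulate(model["growth_rate"], 10)
    if discrepancy(predictions, observations) < 0.5:  # predictions agree with experiment
        break
    # Refinement step: nudge the rate toward the value implied by the final observation.
    implied_rate = observations[-1] ** (1.0 / 9.0) - 1.0
    model["growth_rate"] += 0.5 * (implied_rate - model["growth_rate"])
print(iteration, round(model["growth_rate"], 3))
```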

Enabling the End-to-End Scientific Discovery Cycle
Iterative data-driven construction of predictive models for biological systems faces challenges from both data intensity and computational complexity, yet neither high-end computing hardware nor software is optimized to address them. Historically, each has been well configured for running simulations. Fundamental differences exist, however, between running simulations and building data-driven models of biological systems.
Figure 3. Distinct data access in running simulations and in building models. Data-driven model building requires a different mix of memory, disk storage, and communication trade-offs.
Data-driven model construction is often cast as a combinatorial optimization problem, in which one searches for a particular object or enumerates all objects with given properties. The data-intensive nature of this problem, however, makes existing methods fail to meet the required scale of data size, heterogeneity, and dimensionality. Solving combinatorial optimization problems for data-intensive applications will require a fundamentally novel approach. Figure 3 highlights some of those differences in terms of input size and structure, memory access, disk access, output size, communication requirements, and the types of arithmetic operations. Such differences necessitate novel architectural designs and algorithms with an intelligent mix of memory, disk storage, and communication trade-offs.
Making these methods computationally tractable requires:
  • the ability to translate a domain-specific problem to a mathematical problem for which existing techniques are relatively mature;

  • theoretical advances in computational science, mathematics, and statistics to deal with combinatorial problems over such high-dimensional, heterogeneous, noisy, and massive data;

  • high-performance scalable implementations on next-generation high-end computing architectures utilizing cutting-edge computer science technologies.

Figure 4. A software stack for building data-driven in silico models.
To meet these requirements, the BioPilot team exploits the strategy of developing cross-cutting computational technologies that are applicable to many data-intensive biological problems. Figure 4 (p14) depicts the software stack being developed by the BioPilot team for building in silico models of biological systems, such as 3D protein structures or networks of protein-protein interactions. By employing and advancing mathematical and computer science technologies, BioPilot aims to provide robust production implementations of our computational methods, to facilitate the use of these methods for solving large-scale biological applications of DOE relevance, and to make them available as open source software to the biological science community. Thus, the BioPilot libraries and tools will ultimately provide the link between the DOE Office of Advanced Scientific Computing Research (ASCR) and Bioenergy Genomics:GTL programs.

From High-Throughput Data to Components of Biological Systems
High-throughput experimental technologies, such as DNA sequencing and mass spectrometry (MS), have created a unique opportunity to screen thousands of genes and proteins in a matter of hours. Yet there remains a growing gap between the high-throughput biological data and the analytical tools capable of deriving from those data all the working "components"—the key step in building models of biological systems. As a result, the number of hypothetical (unknown-function) proteins is growing at an exponential, Moore's Law-like rate, most metabolic pathways are incomplete, and pieces of regulatory pathway models have been obtained for only a handful of organisms. These are just a few examples where this tsunami of data is becoming a curse rather than a blessing, and today's computational limitations in mining these data are tomorrow's nightmares when attempting full systems-level understanding. Closing this gap is critical for the success of DOE missions in energy production and environmental bioremediation.
Figure 5. Technology trends are a rate-limiting factor. From left to right: a graph depicting the growth of DNA sequences; the performance of memory and disk access latency; and the trend of compute power and disk capacity.
The available genome sequence information is growing at an exponential rate, with a doubling time of approximately 18 months. However, the doubling time for bandwidth to memory and to disk is 2.7 years; over that period, memory latency improves by only 20% and disk latency by only 30% (figure 5). At these scales, performing multiple sequence analysis, which is routinely used by biologists, is limited by integer-operation performance and by memory bandwidth, and is increasingly out of biologists' reach. The core calculations on these databases scale at least quadratically, in both time and number of operations, with the number of gene sequences. Search operations access the entire sequence database residing in memory and produce output comparable in size to the input database, which often needs to be globally sorted before being written to disk. As new genomes are sequenced, the analysis results often have to be recalculated.
Figure 6. Processor scaling demands for parallel BLAST. Callouts indicate anticipated database size over time.
Figure 6 projects the scaling demands, in terms of the number of processors, for completing within 24 hours an "all genomes versus all genomes" comparison with the frequently used Basic Local Alignment Search Tool (BLAST) from the National Center for Biotechnology Information (NCBI). Starting from approximately two million microbial protein sequences in 2007, and assuming the expected 18-month sequence doubling time together with projected increases in compute power and memory bandwidth, BLAST must scale to thousands of processors within the next few years. The BioPilot's ScalaBLAST tool, developed by Dr. Christopher Oehmen and his colleagues at PNNL, is on the path toward achieving this scalability (sidebar "New Tool Speeds Up Genomic Sequence Processing").
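A back-of-the-envelope calculation shows why the curve in figure 6 climbs so steeply. The sketch below is illustrative only: the 2007 baseline processor count and the hardware-improvement rate are assumptions, not measured ScalaBLAST figures; only the 18-month data doubling time and the quadratic cost of an all-versus-all comparison come from the discussion above.

```python
# Back-of-the-envelope projection of processors needed for an all-vs-all BLAST
# run finishing in 24 hours. The 2007 baseline and the per-processor speedup
# rate are illustrative assumptions.

def processors_needed(years_from_2007,
                      baseline_procs=100.0,        # assumed 2007 requirement
                      seq_doubling_years=1.5,      # 18-month data doubling
                      hw_doubling_years=2.0):      # assumed per-processor speedup
    data_growth = 2.0 ** (years_from_2007 / seq_doubling_years)
    work_growth = data_growth ** 2                 # all-vs-all comparison is quadratic
    hw_growth = 2.0 ** (years_from_2007 / hw_doubling_years)
    return baseline_procs * work_growth / hw_growth

for year in (0, 3, 6):
    print(2007 + year, round(processors_needed(year)))   # grows into the thousands
```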
Figure 7. An individual MS peak gives the mass and amount of a fragment. Peaks of the same ion series differ by the masses of amino acid residues, so chains of peaks spell out the underlying sequence. In this example, the two peaks indicated differ by the mass of an alanine (A) residue.
The data size for mass spectrometry protein databases currently exceeds the terabyte level and is expected to grow to the petabyte scale as the DOE focus shifts from individual microbes to microbial communities (figure 8). About 70 terabytes of MS-proteomics data are currently stored at the DOE Environmental Molecular Sciences Laboratory (EMSL) facility at PNNL. Moreover, the upper orange line in figure 8 demonstrates that the use of spectral libraries (sidebar "Computational Mass Spectrometry Proteomics in a Nutshell") adds three orders of magnitude to the amount of data to be considered. The inclusion of even one type of amino acid modification increases the data size by another order of magnitude. A standard analysis should typically be able to search for at least three types of amino acid modifications, which increases the data size combinatorially. With current computational resources, however, this is not feasible.
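The combinatorial growth from variable modifications is easy to see with a small count. In the sketch below, the assumption that each candidate site carries at most one modification, and the site counts used in the example, are illustrative; production search engines impose their own limits.

```python
# Illustrative count of how variable modifications blow up a peptide database.
# Assumes each candidate site is either unmodified or carries exactly one of
# `n_mod_types` modifications.

from math import comb

def variant_count(n_sites, n_mod_types, max_mods_per_peptide=None):
    if max_mods_per_peptide is None:
        return (1 + n_mod_types) ** n_sites        # unconstrained combinatorics
    # Allow at most max_mods_per_peptide modified sites.
    return sum(comb(n_sites, k) * n_mod_types ** k
               for k in range(max_mods_per_peptide + 1))

print(variant_count(5, 1))        # 32 variants of one peptide, one modification type
print(variant_count(5, 3))        # 1,024 variants with three modification types
print(variant_count(5, 3, 2))     # 106 variants if at most two sites are modified
```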
Figure 8. Protein database size for mass spectrometry database searches.
In addition to computational challenges, a number of mathematical and statistical challenges have to be addressed to make the existing analytical tools practical. Due to a poor characterization of the expected signal in the data, typically only 7-25% of mass spectrometry spectra are identified with a peptide and the remaining 75% or more of the spectra are thrown away. The BioPilot teams led by Dr. William Cannon at PNNL and Dr. Andrey Gorin at ORNL are advancing mathematical technologies to push the boundaries of peptide identification and confidence estimation through their MSPolygraph database search technology and Probability Profile Method (PPM) de novo sequencing technology, respectively.
Figure 9. Search performance of de novo sequencing for different database sizes. (a) Processing time (in seconds) for protein databases of different sizes with simple string search (orange), mass index search (green) and sequence index search (blue). (b) Performance ratios of simple-vs-mass and simple-vs-sequence searches with ~1,400 times advantage of sequence index search over simple string search.
To accelerate the search time of de novo sequencing algorithms, Dr. Gorin's team was the first in the field to propose the use of precomputed, memory-resident indices for constant-time look-ups. Figure 9 demonstrates a time reduction by a factor of 1,400 per processor on real-life applications, using a large set (10⁵) of sequence tags on the ORNL SGI Altix 3700 computer.
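The idea behind such indices can be sketched in a few lines: build the index once, then answer each tag query in (average) constant time instead of rescanning the whole database. The code below is a simplified illustration, not the team's actual implementation; the toy sequences and tag length are arbitrary.

```python
# Minimal sketch of a precomputed sequence index for constant-time tag lookups,
# with a naive full-database scan shown for contrast.

from collections import defaultdict

def build_tag_index(proteins, tag_length=3):
    """Map every length-k subsequence to the proteins (and offsets) containing it."""
    index = defaultdict(list)
    for prot_id, seq in proteins.items():
        for i in range(len(seq) - tag_length + 1):
            index[seq[i:i + tag_length]].append((prot_id, i))
    return index

def lookup_tag(index, tag):
    return index.get(tag, [])                      # average O(1) per query

def scan_database(proteins, tag):
    """Naive alternative: O(total database size) per query."""
    return [(pid, i) for pid, seq in proteins.items()
            for i in range(len(seq)) if seq.startswith(tag, i)]

proteins = {"P1": "MKVLAAGICK", "P2": "GICKDEALLK"}
index = build_tag_index(proteins)
print(lookup_tag(index, "ICK"))   # [('P1', 7), ('P2', 1)]
```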
Both database search and de novo methods are now beginning to tackle the large-scale problem of analyzing mass spectrometry datasets from communities of microbes on a level not previously feasible. These capabilities are expected to have significant impacts on experimental studies of microbes involved in carbon fixation, hydrogen production, and biofuels.

Advanced Mathematics for Interpretation of Proteomics Data
Our team has developed new mathematical approaches for both database search and de novo algorithms (sidebar "Computational Mass Spectrometry Proteomics in a Nutshell," p16). While intuitively simple—directly reading a protein sequence from the spectrum—de novo reconstruction is tremendously complex algorithmically. Historically, it was done almost as an art form, only by human experts and only on a very small fraction of all spectra. There is currently no theory for confidence estimates applicable to de novo approaches, and only limited results are available for validating database-search outputs.
To address these challenges, the BioPilot team led by Dr. Gorin has developed the probability profile method (PPM) for de novo peptide identification. PPM infers the chemical identities of individual spectral peaks by examining their spectral neighborhoods within the context of Bayesian statistics. It also analytically estimates the confidence of peptide identifications. To derive an analytical confidence function, we developed a theory dealing with statistical measures in large sets of strongly correlated events. Our formulae rigorously generalize the independent Bernoulli trial approximation to arbitrarily correlated events. These developments set a foundation for comprehensive analytical measures of the informational content in mass spectrometry data flows.
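To see what the independent-trials baseline looks like—the approximation that PPM's theory generalizes—consider the sketch below, which computes the chance probability of a peptide identification when every peak match is treated as an independent Bernoulli trial. The peak count and match probabilities are invented for illustration; the correlated-event formulae themselves are beyond this simple example.

```python
# Baseline illustration only: confidence of a peptide identification when each
# spectral peak match is an independent Bernoulli trial with its own chance
# probability (a Poisson-binomial model). PPM's theory extends this to
# correlated events, which this sketch does not capture.

def tail_probability(chance_probs, observed_matches):
    """P(at least `observed_matches` peaks match by chance), by dynamic programming."""
    dist = [1.0]                                   # dist[k] = P(k chance matches so far)
    for p in chance_probs:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - p)
            new[k + 1] += prob * p
        dist = new
    return sum(dist[observed_matches:])

# Ten candidate peaks, each with a 10% chance of a random match; seven matched.
p_value = tail_probability([0.1] * 10, 7)
print(f"{p_value:.2e}")                            # ~9.1e-06 under independence
```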
De novo search with PPM demonstrated significant gains in the identification rates, especially pronounced in highly confident identifications, where the number of identifications was doubled. The capability to detect unexpected phenomena led to significant corrections in the datasets regarded as benchmarks in the field.
Addressing the problem of identification reliability for database search methods, Dr. Cannon and co-workers have developed MSPolygraph, a spectral analysis tool, and STRIP, a kernel-based analysis of peptide candidates, both able to handle massive numbers of peptide candidates as well as large numbers of spectra. The mathematical algorithms used in the process allow for the incorporation of physical models of fragmentation derived from molecular simulations and empirical approximations. The use of statistical analyses of the peptide candidates, together with kernel-based functions for comparing theoretical and experimental spectra, makes the analysis both selective and sensitive.
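The kernel-comparison idea can be illustrated with a toy scoring function that rewards experimental peaks falling close to theoretical ones, weighted by a Gaussian of the mass difference. This sketch is not the MSPolygraph or STRIP scoring function; the peak lists and mass tolerance are invented for illustration.

```python
# Illustrative Gaussian-kernel similarity between a theoretical and an
# experimental spectrum, each given as (m/z, intensity) pairs.

import math

def kernel_similarity(theoretical, experimental, sigma=0.5):
    """Sum of Gaussian-weighted intensity products over all peak pairs."""
    score = 0.0
    for mz_t, int_t in theoretical:
        for mz_e, int_e in experimental:
            weight = math.exp(-((mz_t - mz_e) ** 2) / (2.0 * sigma ** 2))
            score += int_t * int_e * weight
    return score

theoretical = [(147.1, 1.0), (294.2, 1.0), (365.2, 1.0)]
experimental = [(147.2, 0.8), (294.0, 0.6), (401.3, 0.3)]
print(round(kernel_similarity(theoretical, experimental), 3))
```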
We are currently analyzing 13,000 spectra from seawater collected at the Bermuda Atlantic Time Series Station for a DOE-funded project using the database search methodology. This work will open the door to comprehensive analyses employing a full list of peptides from seawater community proteomics, and it supports the DOE missions related to climate change, carbon sequestration, and global warming.

From Components to In Silico Models
The complexity of biological systems comes from the interconnections of their constituent components. While analytical tools that derive the components from high-throughput experimental data significantly reduce the amount of data to be dealt with, the challenge still remains of how to "connect the dots," that is, to construct predictive in silico models of these biological systems. The combinatorial space of feasible solutions is enormous, and advanced methods for constraining such a space and for efficient search of optimal solutions are in great demand.
Specifically, the prediction of high-resolution computational models of interacting molecules is hampered by a number of mathematical and computational challenges. First, the interaction between two molecules, such as a protein-protein interaction, can be described through the set of contacting amino acid residues. The possible set of features describing this interface is almost infinite. How can one select the features that distinguish true interfaces from merely feasible associations of two rigid bodies? Second, the hierarchical nature of most biological systems leads to short- and long-range interactions between the features. How can one select independent features, or deal with the consequences of feature dependencies? Third, the space of all possible features to consider is of extremely high dimensionality. Robust estimation of the required statistics is quite challenging, since the number of observed instances is often limited. How will potential errors in the estimation of Bayesian factors affect prediction precision? Molecular docking codes are also very computationally demanding: successful docking codes based on all-atom models of the interacting proteins require tens of thousands of central processing unit (CPU) hours per protein complex. To address these challenges, the BioPilot team led by Dr. Gorin and Dr. Ed Uberbacher (ORNL) is developing the Comprehensive Orientation Sampling Method (COSM), a fast and high-resolution prediction tool for protein docking (sidebar "COSM: Fast and High-Resolution Protein Docking," p18).
Analysis and modeling of biological networks also faces combinatorial intractability challenges. Exact algorithms for combinatorial problems on biological networks frequently use recursive strategies for exploring the search space to find the optimum solution. Since input instances are typically huge (thousands or millions of nodes), they should not be copied indiscriminately in a recursive search process. The demands for storage are often enormous, and intelligent memory management is a critical feature. In addition, enumeration problems (such as maximal clique enumeration) can generate output that is exponential in the size of the input network and may reach petabyte scale even on modest-sized networks.
Parallelization of recursive enumeration algorithms requires special attention in deploying a load-balancing scheme. Although some strategies (such as breadth-first traversal) may seem embarrassingly parallelizable, they inherently suffer from extremely unbalanced loads. Due to the nature of recursive search problems, a search tree may grow highly irregularly and be practically impossible to predict a priori. This essentially prohibits static allocation strategies, with which many processors may finish exploring their search trees quickly while a few "unlucky" ones are still struggling to expand theirs. Load rebalancing via synchronization and data movement sounds promising, but for data-intensive applications it is often prohibitively expensive. Therefore, a highly tailored strategy that minimizes the end-to-end execution time by balancing these discrepancies is particularly desirable. The BioPilot team is tackling these challenges and developing scalable graph libraries for modeling and analysis of biological networks.

Scalable Graph Algorithms for Analysis of Biological Networks
Designing a microbial system that, for example, efficiently produces ethanol or degrades toxins will require an understanding of how its interacting biochemical pathways result in specific traits (ethanol resistance, high ethanol yield, or toxin uptake). Such problems cannot be solved by experiments alone. Comparative analysis of biological pathways and networks across multiple genomes and across "-omics" information spaces has a unique potential to direct bioengineers to the right solutions. This opportunity, however, presents a large-scale computing challenge—to perform such analyses across hundreds of organisms with millions of genes organized into thousands of metabolic pathways controlled by hundreds of regulatory processes, all of which are uncertainly defined.
Biological pathways and networks can be mathematically represented as graphs, where nodes might be genes or metabolites and edges represent some kind of relationship between them. The types of relationships might be regulation, physical interaction, or catalytic activity converting a metabolite to a substrate. Questions about these biological networks can then be translated into problems on graphs. For example, the question of identifying all protein complexes from mass spectrometry pull-down experiments can be reduced to the problem of enumerating maximal cliques in the network of pair-wise protein interactions.
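For readers unfamiliar with the graph formulation, the sketch below enumerates maximal cliques in a toy interaction network using the classical Bron-Kerbosch recursion. It is a serial illustration of the problem only; BioPilot's scalable codes add pivoting, parallelism, and the memory and load-balancing machinery described below.

```python
# Compact Bron-Kerbosch maximal-clique enumeration over a protein-interaction
# graph stored as an adjacency dictionary (toy example, not BioPilot code).

def bron_kerbosch(R, P, X, adj, cliques):
    if not P and not X:
        cliques.append(sorted(R))                  # R is a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}
        X = X | {v}

# Toy protein-interaction network: proteins A-E with pairwise interactions.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(cliques)   # maximal cliques, e.g. ['A', 'B', 'C'], ['B', 'C', 'D'], ['D', 'E']
```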
Figure 11. Finding critical genes responsible for a given trait, such as aerobic microbial growth, may require running clique enumeration algorithms on graphs with millions of nodes, such as the example depicted here.
While the idea of leveraging the relatively mature apparatus of graph theory and graph algorithms sounds attractive, in reality most of these algorithms fail for large-scale biological networks. For example, finding the critical genes responsible for a given trait, such as aerobic microbial growth, may require running clique enumeration algorithms on graphs with millions of nodes (figure 11). The difficulty lies in the exponential time complexity of these algorithms with respect to graph size and in the lack of scalable parallel implementations on high-performance computing systems.
Figure 12. Design and implementation of a parallel depth-first backtracking search strategy has reduced memory requirements by four orders of magnitude.
The BioPilot team led by Dr. Samatova is advancing the theory of combinatorial search space reduction and developing a library of scalable graph algorithms, demonstrating their capability to address large-scale problems of DOE relevance. The key to scaling the graph codes lies in exploiting multilevel parallelism, in a probabilistic load-balancing strategy, and in confining the search for solutions to (relatively) small subgraphs that are generated in polynomial time. Our design and implementation of a parallel depth-first backtracking search strategy has reduced memory requirements by four orders of magnitude (figure 12). With the current library, we are able not only to handle very large networks (millions or even billions of nodes) on which existing codes fail to run, but also to reduce overall execution time from days to hours.

From System Models to Predictive Modeling and Simulation
Computer modeling and simulation have become an integral component of scientific research in many scientific disciplines, and are a major source of massive scientific datasets. While optimized for running simulations, hardware and software are not adequately configured for analysis and visualization of such data.
Fundamental data context differences exist between the task of running the simulation and the task of processing simulation output. With the former, space-time simulation proceeds from one time step to the next and requires the context of only two time steps at a time. Analysis of the simulation output—such as harmonic analysis, dimension reduction, and clustering—often requires the full space-time context of the available data. In fact, simulations that are driven by local space-time relationships are largely performed with the purpose of discovering or explaining non-local and large-scale space-time relationships through analysis and visualization. Most full-context data analysis software, with a few notable exceptions, requires the entire dataset to be in memory. Scaling up to full context faces severe constraints of computer memory and sometimes also multiplicative effects of feature search combinatorics.
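The difference in data context can be made concrete with two toy trajectory analyses: a streaming quantity that needs only two frames at a time, and an all-pairs comparison (the kind clustering relies on) that needs every frame resident in memory. The frame format and the analyses below are illustrative assumptions, not DIANA code.

```python
# Contrast between a streaming, two-frame analysis and a full-context analysis
# of a molecular dynamics trajectory (frames as NumPy coordinate arrays).

import numpy as np

def mean_frame_displacement(frames):
    """Streaming: needs only the previous and current frame in memory."""
    prev, total, count = None, 0.0, 0
    for frame in frames:                            # frames can be a generator
        if prev is not None:
            total += np.sqrt(((frame - prev) ** 2).sum(axis=1)).mean()
            count += 1
        prev = frame
    return total / max(count, 1)

def pairwise_rmsd_matrix(frames):
    """Full context: clustering needs every frame pair, so all frames stay resident."""
    frames = [np.asarray(f) for f in frames]        # entire trajectory in memory
    n = len(frames)
    rmsd = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Coordinate RMSD without superposition, for simplicity.
            d = np.sqrt(((frames[i] - frames[j]) ** 2).sum(axis=1).mean())
            rmsd[i, j] = rmsd[j, i] = d
    return rmsd

frames = [np.random.rand(100, 3) for _ in range(50)]   # 50 frames, 100 atoms
print(mean_frame_displacement(frames))
print(pairwise_rmsd_matrix(frames).shape)               # (50, 50)
```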
Addressing these challenges for large-scale analysis of molecular dynamics trajectories generated by biomolecular simulations is the focus of the research effort led by Dr. T. P. Straatsma at PNNL. The continuing increase in available protein structures allows for the comparative analysis of molecular dynamics simulations of a wide range of proteins, protein-protein complexes, and other complex biological systems. Analysis of protein conformational dynamics is important for understanding protein function and has a number of applications, such as the bioengineering of proteins that increase bioethanol yield. This requires a rigorous framework that provides capabilities for efficient analysis of large-scale trajectories on massively parallel architectures, automated detection of interesting events in these very large datasets, and comparative analysis of multiple trajectories. Dr. Straatsma's team is developing a software package, DIANA, that aims to bring these capabilities to the large biological community (sidebar "DIANA: Scalable Analysis of Large-Scale Molecular Simulations," p19).
For simulation of biological network dynamics, kinetic models described by nonlinear ordinary or partial differential equations are typically used. Although there are many commonalities, two particular aspects distinguish biological networks from networks found in other science areas. First, biological systems tend to have well-organized structural hierarchies that require the use of multiscale mathematical models. Second, in biological systems the copy numbers of many species are very low, which can give rise to significant relative fluctuations, making the problem inherently stochastic. Therefore, a deterministic approach alone may not be sufficient for cellular systems, and stochastic methods may be more suitable.
The use of stochastic algorithms in the simulation of spatially resolved biological models is particularly demanding. This stems from the fact that stochastic simulation algorithms do not scale well with network size. Therefore, the scaling properties of stochastic simulation algorithms need significant improvement before they can be widely used for larger problems. "Fully stochastic 3D simulations, where the location and dynamical properties of individual molecules are tracked, are both data- and compute-intensive and pose a significant computational challenge," says Dr. Haluk Resat from PNNL. His team has been developing NWLANG, a simulation framework for spatially resolved multiscale modeling of biological systems in which the resolution can be kept at the individual reaction level (sidebar "Multiscale Kinetic Simulation of Biological Networks," p21).
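The stochastic effects of low copy numbers can be seen even in a minimal Gillespie-type simulation of a single species that is produced and degraded. The sketch below is a textbook direct-method example, not NWLANG; the rate constants are arbitrary.

```python
# Minimal Gillespie stochastic simulation of a two-reaction birth-death system,
# illustrating why low copy numbers make kinetics noisy.

import math
import random

def gillespie_birth_death(k_produce=1.0, k_degrade=0.1, x0=5, t_end=100.0):
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while t < t_end:
        a1 = k_produce                 # propensity of production (0 -> X)
        a2 = k_degrade * x             # propensity of degradation (X -> 0)
        a0 = a1 + a2
        t += -math.log(1.0 - random.random()) / a0   # exponential waiting time
        x += 1 if random.random() * a0 < a1 else -1  # pick which reaction fired
        trajectory.append((t, x))
    return trajectory

traj = gillespie_birth_death()
print(traj[-1])   # final (time, copy number); fluctuates around k_produce/k_degrade = 10
```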

From Predictive Models to Discoveries
The BioPilot project develops conceptually new approaches to dealing with biological data, and the inherent complexity of those data cannot be fully appreciated without working on real-life bioscience projects, such as efforts to more fully understand the carbon cycle or the biology of biofuel production. The BioPilot team has gained this crucial experience through our collaborations with DOE Genomics:GTL researchers from several large efforts related to DOE missions in energy and bioremediation. The ultimate confirmation of the strategy is the new insights, hypotheses, and discoveries made by the teams using our algorithms.
Figure 14. The cellular machinery involved in stress-induced transcriptional reprogramming of yeast cells was discovered by a graph-theoretical computational framework developed at ORNL.


Enhancing Yeast Tolerance to Stresses During Bioethanol Production
In the industrial production of bioethanol by yeast, high concentrations of toxic constituents produced during thermochemical pretreatment of plant material expose the organism to diverse environmental stresses. At present, the cellular mechanisms underlying the organism's stress resistance are poorly understood. To reveal these mechanisms and to find specific genes that might enhance yeast tolerance to ethanol, we applied graph-theoretical computational tools to genome-scale studies of the yeast proteome and transcriptome.
From thousands of protein-protein interactions, we discovered that the cellular stress response was characterized by intensive transcriptional reprogramming of the yeast genome for activation of the stress-related genes (figure 14). From the gene expression profiles over 173 experimental conditions representing the stress response of the yeast cells, we computationally inferred the gene network induced by stresses. Coupling these data with ethanol-sensitivity screens of 526 yeast deletion mutants, we identified two specific enzymes that are the most essential for yeast growth under high ethanol concentrations. We predicted that independent expression of these genes in a genetically modified yeast strain could enhance yeast tolerance to ethanol. Additional studies are necessary to validate the computationally derived hypotheses.
Figure 15. A diagram depicting Earth's carbon cycle.


Managing the Global Carbon Cycle via Biodegradation
Biodegradation of aromatic compounds is a vital link in the carbon cycle of our ecological system (figure 15). Recycling of lignin, perhaps the second most abundant carbon polymer on Earth, requires degradation of its phenolic monomers. Large quantities of industrially generated aromatic compounds in the environment need to be cleaned up by degradation. A photosynthetic bacterium, Rhodopseudomonas palustris, is capable of degrading a variety of complex aromatic compounds, including lignin monomers. It is hypothesized that these aromatic compounds are degraded through the benzoyl-CoA pathway. But it is unclear how the complex aromatic compounds enter this pathway and how the global metabolism is impacted by utilizing the aromatic compounds as the carbon source.
To address these questions, we collaborated with the DOE Genomics:GTL teams led by Dr. Carol Harwood at University of Washington, Dr. Michelle Buchanan, and Dr. Robert Hettich, both at ORNL. By applying the ProRata tool to large-scale proteomics data and integrating transcriptomics data, we predicted the metabolic pathway that converts p-coumarate to benzoyl-CoA in R. palustris. For high-throughput proteomics measurements with varying signal-to-noise ratios, ProRata not only robustly estimates abundance ratios for thousands of proteins but also provides an "error bar" that reflects its estimation precision and statistical uncertainty. ProRata was developed in collaboration with the SciDAC Scientific Data Management Center ("From Data to Discovery," SciDAC Review, Fall 2006, p28).
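As a generic illustration of the kind of output ProRata provides—an abundance ratio with an explicit error bar—the sketch below combines hypothetical peptide-level log2 ratios into a protein-level estimate with a bootstrap confidence interval. ProRata's actual estimator is different; this merely shows the ratio-plus-uncertainty concept.

```python
# Hedged illustration: protein abundance ratio with an error bar, computed as
# the median of noisy peptide log2 ratios plus a bootstrap confidence interval.
# This is not the ProRata algorithm.

import random
import statistics

def protein_ratio_with_error(peptide_log2_ratios, n_boot=2000, seed=0):
    rng = random.Random(seed)
    estimate = statistics.median(peptide_log2_ratios)
    boots = []
    for _ in range(n_boot):
        sample = [rng.choice(peptide_log2_ratios) for _ in peptide_log2_ratios]
        boots.append(statistics.median(sample))
    boots.sort()
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return estimate, (lo, hi)

peptides = [1.1, 0.9, 1.3, 0.7, 1.0, 1.2]          # hypothetical peptide log2 ratios
est, (lo, hi) = protein_ratio_with_error(peptides)
print(f"log2 ratio = {est:.2f}  (95% CI {lo:.2f} to {hi:.2f})")
```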

Molecular Simulations Reveal Microbial Uranyl Uptake Dependence on pH
Heavy metal environmental contaminants cannot be destroyed but require containment, preferably in a solid or immobile form for recycling or final disposal. Microorganisms are able to take up and deposit high levels of contaminant metals, including radioactive metals such as uranium and plutonium, into their cell walls. Consequently, these microbial systems are of great interest for bioremediation technologies.
The outer membranes of Gram-negative microbes are highly non-symmetric and exhibit a significant electrostatic potential gradient across the membrane. This gradient has a significant effect on the uptake and transport of charged and dipolar compounds. To aid the design of microbial remediation technologies, knowledge of what factors determine the affinity of a particular bacterial outer membrane for the most common ionic species found in contaminated soils and groundwater is of great importance.
Figure 17. On the left, at slightly basic pH, deprotonated groups are cross-linked, making it difficult for uranyl to penetrate the membrane. On the right, at slightly acidic pH, protonated groups create channels, exposing phosphate groups for uranyl binding.
The membrane ultrastructure and its ability to take up a variety of ions (including uranyl) from the environment under different physico-chemical conditions, such as pH, are being rationalized by comparative analysis of large-scale simulations. We predicted that at slightly basic pH, deprotonated groups are cross-linked, making it difficult for uranyl to penetrate, while at slightly acidic pH, protonated groups create channels, exposing phosphate groups for uranyl binding (figure 17).

Educating a Crop of Young Researchers
The BioPilot team mentored a number of students, including those from high schools and from under-represented groups joining us through DOE-sponsored programs. Three high school students placed first nationally at the 2006-2007 Siemens Math, Science, and Technology competition for their thesis "Linking Supercomputing and Systems Biology for Efficient Bioethanol Production" (sidebar "Students' Perspective: Realizing the Power of Supercomputing"; "Mentoring Scientists of the Future," SciDAC Review, Spring 2007, p39). Additionally, a high school senior became a semi-finalist at the 2006-2007 Intel Science Talent Search competition.
Contributors: Dr. Nagiza F. Samatova, Dr. Andrey Gorin, Dr. Ed Uberbacher, Dr. Tatiana Karpinets, Dr. Byung-Hoon Park, and Dr. Chongle Pan, ORNL; Dr. T. P. Straatsma, Dr. William Cannon, Dr. Haluk Resat, Dr. Roberto D. Lins, and Dr. Christopher Oehmen, PNNL