DOE SciDAC Review, Office of Science
ECOLOGY AND METAGENOMICS
Gaining New Insights into MICROBIAL Communities
We are not alone on this planet. Of course, we share the Earth with plants and animals that we see all around us, but what we usually do not appreciate are the unseen organisms that surround us, encompass us, and affect us every day—microbes. With leadership-class computing facilities and new technologies for sequencing and analyzing genomes, microbiologists are gaining new insights into what these tiny organisms look like, how they function, and how we might utilize them to improve our environment and our lives.
 
An estimated 10^30 microbial cells exist on the Earth. These are the smallest members of our environment and are visible only under the microscope (hence their name). "Microbes" is a catch-all term that covers members of each of the three domains of life (bacteria, archaea, and small eucaryotes) as well as viruses, which most do not really consider alive. Microbes are single-celled entities that live by basic rules: eat, do not be eaten, and divide (sex is optional). Depending on where they are and what they are using for food, their life cycle can be as short as a few hours or days or as long as hundreds or thousands of years.
Microbes inhabit every conceivable environment on Earth and are responsible for a vast array of processes touching every aspect of our lives. For example, the oceans are teeming with them; every mouthful of seawater has about 10 million microbes swimming around. There are vastly more microbes than sharks in the ocean, and the microbes may also be more dangerous: most of the diseases that ail us are caused by our microbial foes. Although microbes are well known as causes of food poisoning and tuberculosis, they are increasingly being implicated in chronic diseases that take many years to develop, such as atherosclerosis (heart disease) and cancer.
Microbes also are responsible for many beneficial processes. Most people are aware of the important roles of microbes in food production (such as beer, wine, and cheese). Microbes additionally are the source of almost all antibiotics that are used to treat infections. Microbes that live in our bodies are beneficial, too. They are critical for our survival through the production of essential amino acids that we cannot make ourselves. Beneficial microbes also coat our surfaces (skin, intestines, and so on) and prevent or limit "bad bugs" from binding to those surfaces. This is the natural action of many of the probiotics promoted by the health food industry. 
Microbes are tenacious, able to survive in seemingly inhospitable environments. Recently, bacteria have been found living inside rocks deep in the Earth; these remarkable creatures use uranium radiation instead of sunlight or heat as an energy source. Microbes have adapted to extremes of acidity or temperature as well; some can live in solid ice at or below 0°C, forming small channels with highly-concentrated solutes that prevent them from freezing and allow them to move around. In fact, the physical limits—too hot, too cold, too much pressure, and so forth— beyond which microbes are unable to survive are unknown.
Biologists have spent years trying to understand what the different kinds of microbes are doing and how they are doing it. Recently, however, the field known as microbial ecology has received a double shot in the arm, from new sequencing technology and from leadership-class computing facilities, with the promise of dramatically enhancing our understanding of these fascinating creatures and how we can use them to improve our lives.
 
DNA—The Code of Life
The genetic material of life, deoxyribonucleic acid (DNA), encodes almost all functions that microbes (and the rest of us) carry out. Ribonucleic acid (RNA) is a primordial variant of DNA. A DNA molecule is generally a long strand made up of four different chemical bases: guanine (G), cytosine (C), adenine (A), and thymine (T). All variation in life is due to the different organizations of these bases, or "letters." A single strand of DNA can be hundreds, thousands, or millions of letters long. The human genome, for example, is approximately three billion letters spread over 23 strands of DNA. Most microbe genomes are much smaller, typically a few million letters long on one or two strands of DNA. Viruses are smaller still, often only a few thousand or tens of thousands of letters long. This genetic information is a storage system. It evolved to allow data to be retained over long periods of time and to be faithfully replicated. The entire complement of DNA in a cell, its genome (sidebar "Key Terms" p51), is analogous to a file system, and a single DNA molecule is analogous to a single disk drive: it contains multiple files that can be accessed essentially in any order whenever they are needed.
J. Insley
Figure 1. Tyrosine, one of the 20 amino acids used by cells to synthesize proteins.
Biologists call the equivalent of a file on a strand of DNA a gene: a string of letters that encodes a specific series of amino acids. A single DNA molecule will typically have thousands of genes along its length in a somewhat linear order (genes can overlap slightly, though they tend not to). Within a gene, the letters are read by the cell in triplets, three letters at a time, and each triplet is translated into one of the 20 possible amino acids. The amino acids are joined together as proteins to make the enzymes, cell structures, and so on that are the key components of life. For example, the triplet CCA encodes proline and the triplet GGC encodes glycine. Each gene is demarcated by a start and a stop triplet, so four triplets have special meaning to the machinery that reads the DNA: ATG means "start here with a methionine"; and TGA, TAA, and TAG mean "stop here without adding anything else." (ATG serves a dual function: in the middle of a region it simply means "add a methionine.") Thus, the machinery runs along the DNA until it finds an ATG and then starts with a methionine. It translates each successive set of three letters into the corresponding amino acid until it reaches one of TGA, TAA, or TAG, when it stops the translation. The machinery then tracks along until it finds another ATG signaling the start of the next region. Observant readers will note that there are 20 different amino acids commonly used in creating proteins (figure 1) but 64 possible combinations of three-letter words using a four-letter alphabet. The genetic code is therefore redundant: more than one triplet can encode a single amino acid. For example, CCA, CCC, CCG, and CCT all encode proline.
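The reading rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than a model of the real cellular machinery: it assumes an uppercase string containing only A, C, G, and T, reads a single strand from left to right, and uses the standard genetic code. The names `CODON_TABLE` and `translate_regions` are invented for this example.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
# "*" marks the three stop triplets (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_regions(dna):
    """Return the protein (one-letter codes) for each ATG...stop region."""
    proteins = []
    i = 0
    while i + 3 <= len(dna):
        if dna[i:i + 3] != "ATG":
            i += 1  # keep scanning for a start triplet
            continue
        protein = []
        j = i
        found_stop = False
        while j + 3 <= len(dna):
            amino_acid = CODON_TABLE[dna[j:j + 3]]
            j += 3
            if amino_acid == "*":
                found_stop = True
                break
            protein.append(amino_acid)
        if found_stop:
            proteins.append("".join(protein))
        i = j  # resume scanning after the region just read
    return proteins


print(translate_regions("GGATGCCAGGCTGATTTATGCCCTAAGG"))  # ['MPG', 'MP']
```

Here M, P, and G are the one-letter codes for methionine, proline, and glycine. The first region reads ATG, CCA, GGC, and then stops at TGA; the second reads ATG, CCC, and stops at TAA.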
Figuring out the order of amino acids in a protein requires incredibly complex chemical reactions. The process is difficult to perform in the laboratory and not amenable to high-throughput analysis. In contrast, generating the DNA sequence is relatively trivial to perform, and converting from the gene sequence to the protein sequence is a simple computational task. Most organisms use the same translation dictionary to go from DNA triplets to amino acids. There are variations, but they tend to be extreme cases that can be accommodated computationally. Therefore, part of our understanding of genome sequences and the genetics of life comes from sequencing the DNA strands of individual genes, chromosomes, and organisms. However, genetic experiments provide most of our basal-level understanding of the genes that DNA contains, the proteins that those genes encode, and the functions of those proteins. Biologists disrupt individual genes and then look for perturbations in the growth of the organism. Readily identifiable effects, for example, may be resistance to an antibiotic or the inability to grow on a particular sugar or chemical.
Identifying the role that a particular gene product plays in the cellular machinery of life is a time-consuming, expensive, and complex task that may take decades. However, once the role of a protein has been identified in one organism (the protein has been annotated), similar proteins can be readily identified in other organisms through homology searches. Large swathes of bioinformatics—the application of computational science to biological problems—involve the accurate and appropriate transfer of annotations from the very few experimentally characterized proteins to the very many that have only been predicted into existence through implication based on their DNA sequence (figure 2).
Source: R. Edwards, SDSU/ANL Illustration: A. Tovey
Figure 2. A map of the genome of the common soil micro-organism Bradyrhizobium japonicum. The genome is almost 10 million letters long and contains over 8,500 different genes. Only the region from 0 to 10,000 letters is shown, and the yellow boxes represent the genes in that region.

Gene Products Working and Living Together
Genes that tend to be needed at the same time are organized together along the chromosome. Biologists call these regions on the chromosome "clusters" or "operons." The latter term implies that there is a direct relationship between the access of one gene and the access of the next gene; however, this relationship is essentially impossible to prove computationally and relatively difficult to prove experimentally. Therefore, the more generic term "cluster" is usually used unless a direct relationship has been shown. These clusters may be thought of as directories in a file system. They order files together that are used together, allowing immediate access for a whole suite of information. Once the index of the directory or the start of the cluster has been identified, all of the information contained therein can be realized essentially in O(1) time. For example, the amino acid arginine is a complex chemical made by a variety of different bacteria. The cellular manufacture of arginine requires seven enzymes, and many bacteria possess the ability to make it from scratch. Each of the bacteria able to make this chemical stores the necessary information in its DNA. In most cases that have been studied, the genes describing how to make arginine are organized head to tail in a cluster along a single stretch of DNA.
This arrangement makes annotating genomes easier. The clusters can be used as guides, and genes that work together can be categorized based on the things that they do. Just as with the file system analogy, each directory may contain a vastly different set of files, and a single file could possibly be placed in more than one directory but likely ends up in the most appropriate place. Often a single protein is used in more than one pathway, although typically its gene lies in a cluster with other genes that are used at the same time, so the genetic hierarchy can be represented as a directed acyclic graph. Efficient information access and retrieval emerges from the organization of genes along a genome. This natural classification is leveraged in an annotation schema, using the clusters as a foundation for an annotation hierarchy based on the notion of a subsystem. The subsystems cover activities such as making amino acids from scratch or breaking amino acids down either to use the chemicals in the amino acid or to convert between different amino acids. The subsystems also cover common cellular processes, such as making more cell wall or copying the DNA. In fact, all the processes and functions of the cell are described in terms of the genes required to fulfill those functions.
 
New Technology Revolutionizing Our Understanding of Life
One of the original sequencing technologies, developed by Frederick Sanger and colleagues in 1975, has remained with us for over 30 years and has been the mainstay of DNA sequencing so far. Sanger is one of only four people awarded two Nobel prizes: the first in chemistry in 1958 for studies on the structure of insulin, and the second in chemistry in 1980 for the technology to sequence DNA. This technology, largely unchanged, was used to sequence the very first genome (a virus sequenced in 1977 containing only 5,375 letters) as well as both the private and public human genome sequences published in 2003. Nowadays, Sanger sequencing can unambiguously identify the order of about 750 to 1,000 letters on a single strand of DNA at a time. To obtain the complete sequence of an organism's DNA, many separate reactions are performed, each of which generates the sequence of a random stretch of the overall DNA. Sufficient sequences are generated so that each letter occurs in eight to ten reactions; then all the sequences are combined computationally by identifying contiguous overlapping sections. Typically the algorithms that identify the overlaps require a minimum of 20 letters, providing 4^20 (about 10^12) possibilities. Identical sequences should therefore not occur by chance but should represent those occasions when the same piece of DNA has been sequenced more than once. The two fragments can be joined at the point of overlap to create a single contiguous sequence.
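The joining step can be illustrated with a short Python sketch. It is a deliberately simplified version that assumes error-free reads and exact matches only; real assemblers tolerate sequencing errors and handle many reads at once. The function name `merge_reads` and the example sequence are made up for illustration.

```python
def merge_reads(left, right, min_overlap=20):
    """Join two reads if the end of `left` exactly matches the start of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` letters exists.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Number of distinct 20-letter strings over a four-letter alphabet.
print(4 ** 20)  # 1099511627776

# A made-up 50-letter stretch of DNA and two reads that share 22 letters.
genome = "ACGTAGCTTGACCATGGTCAAGCTTCGAGTACCGATTGCATCGGAATCCA"
read_1 = genome[:30]
read_2 = genome[8:]
print(merge_reads(read_1, read_2) == genome)  # True
```

With roughly 10^12 possible 20-letter strings, two unrelated reads are very unlikely to share such an overlap by chance, which is why the threshold works as a signal that the same stretch of DNA was sequenced twice.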
The primary limitation in analyzing sequence data using Sanger sequences was the actual biological and chemical interpretation of the order of the letters on the DNA strand; the physical science was limiting the computational science. "But all this has changed in the past five to ten years," says Dr. Robert Edwards, computational biologist at San Diego State University and Argonne National Laboratory. "Several new technologies have been developed that dramatically reduce both the time and the cost of sequencing, while simultaneously increasing the amount of sequence generated. For example, with the new generation of sequencing machines, the yield of a single reaction is typically between 500 million and one billion letters!"
Pyrosequencing (commercially available from Roche Applied Sciences) is by far the most advanced technology of the next-generation sequencers. This approach uses an enzymatic reaction to generate a pulse of light each time a particular letter is read from a DNA strand. A single piece of DNA is attached to a spherical bead 28 microns in diameter. The bead-DNA combination is spread over glass plates that contain very small wells, each only 54 microns in diameter. Only a single bead will fit in a well, thereby ensuring only one piece of DNA is read at a time. The raw chemicals that create the letters are added to the plate one at a time: first the A, then after a wash step the G, then after another wash step the C, and so on through hundreds of cycles of the four letters. When a letter is added to the DNA on the bead, the copying reaction releases a chemical (inorganic pyrophosphate) that causes an enzyme in the well to create a brief light pulse that is captured by a CCD (charge-coupled device) camera. Therefore, with each addition of a chemical, only those wells that have incorporated that particular letter will flash. Over the course of all the cycles, each well is read multiple times, identifying the letters and the order in which they are added. The current capacity of this machine, called a 454 FLX, is to sequence about 200 consecutive letters from as many as 500,000 pieces of DNA in just a few hours.
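The cycle of additions and flashes for a single well can be mimicked with a toy simulation. This sketch makes several assumptions that go beyond the description above: the flow order A, G, C, T is only illustrative, the signal is taken to be a noise-free count of how many identical letters were added in one flow, and `read` stands for the sequence being reported from that well. The function names are hypothetical.

```python
def simulate_flows(read, flow_order="AGCT", max_cycles=200):
    """Return (letter, signal) pairs for one well.

    `signal` is the number of consecutive letters incorporated during
    that flow; 0 means the well stayed dark.
    """
    signals = []
    position = 0
    for flow in range(max_cycles * len(flow_order)):
        if position >= len(read):
            break  # the whole read has been reported
        letter = flow_order[flow % len(flow_order)]
        count = 0
        while position < len(read) and read[position] == letter:
            count += 1
            position += 1
        signals.append((letter, count))
    return signals


def decode_flows(signals):
    """Rebuild the read from the recorded flashes."""
    return "".join(letter * count for letter, count in signals)


flows = simulate_flows("GATTC")
print(flows[:4])            # [('A', 0), ('G', 1), ('C', 0), ('T', 0)]
print(decode_flows(flows))  # GATTC
```

In this simplified picture, the order of the flashes across cycles is all that is needed to recover the letters in the well.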
This technology has dramatically affected all aspects of sequence-based biology. For example, the original Human Genome Project cost $3 billion (U.S.) and was completed in a little over 13 years. Currently, commercial sequencing of a human genome costs about $350,000 and takes about six months. Cheaper, higher-throughput sequencing is revolutionizing our understanding of the world around us, and particularly of the microbes that inhabit it (figure 3).
P. Morris
Figure 3. The explosion of sequence data continues.

Genome Sequencing Meets Microbial Ecology
For many years microbial ecologists traveled to sites near and far to collect samples. After days, months, or even years of nurturing, they could occasionally culture individual bacteria from their samples. Then the long process of analyzing and investigating the metabolism began. These studies resulted in 5,000 to 10,000 well-characterized bacteria, catalogued in several books, such as Bergey's Manual of Determinative Bacteriology. From most environments, far less than 1% of all bacteria can be grown in the laboratory. With the development of new DNA isolation approaches, new sequencing technology, and new computational analyses, it became possible to unravel and decode the DNA of all of the microbes in any environment at once without growing any of them. Rather than the painstaking nurturing and culturing, plating, purification, and identification that had typified their research, microbial ecologists began homogenizing samples, busting everything open, and sequencing the DNA.
For example, with a sample from the ocean, about 20 liters of water is size-fractionated by filtration. Large filters are used to remove the debris and larger items such as sharks and fishes, allowing the microbes to pass through. Smaller filters are used to collect the microbes, allowing other contaminants and the viruses to pass through. These effluents may also be filtered with yet smaller filters, if desired, allowing all of the different biological entities to be collected based on their size. The microbial cells are broken open by enzymes, chemicals, pressure, or heat; and the cellular debris (essentially everything that is not DNA) is removed from the solution. That DNA becomes the raw material for the sequencing reactions.
As an aside, one of the main advantages that the new technologies provided was the ability to generate DNA sequence from raw DNA; unlike the approach used with Sanger sequencing, the pieces did not need to be captured in surrogate cells (a step biologists call cloning). With the old approaches, domesticated bacteria—often a very common bacterium called Escherichia coli (figure 4)—were used to grow the large volumes of DNA necessary for the sequencing reactions. The problem is that E. coli and other workhorse bacteria are picky about which DNA they like. If, for example, the DNA encodes a protein that is toxic, the bacteria will not grow. In the new approach there is no need for this step, and therefore a large source of bias has been removed: toxic genes are sequenced as efficiently as everything else. It is now generally assumed that the DNA that is sequenced is roughly proportional to its concentration in the environment, although that assumption remains to be proven.
This approach (grinding up the cells, extracting the DNA, and sequencing it) is called either metagenomics or random community genomics to emphasize the abstract nature of the process. Skilled scientists working on soil samples can process a sample through the steps needed to go from corn field to computer file in as little as four days. The challenge now lies in the computational analysis of the data stream being generated. This is precisely where leadership-class computing facilities excel.
 
The Questions Tackled by Metagenomics
Although every study has its own goals and focus, several key questions remain that are being addressed by metagenomics. The most fundamental question is: what is there? Biologists would like to know what species of microbes are present in each environment being studied. Just as biologists who study plants and animals want to describe the objects of their research, environmental microbiologists need that descriptive information, too.
Related to identifying the players in an environment, the sequence data also provide perspectives on how much of each organism is in the environment, or how it is organized. Biologists would like to know, for example, whether the environment is dominated by one or a few species, or if a lot of different organisms are present. Just as a very old forest may contain only a few species of trees, the rest being lost over time, a mature microbial community may contain only a few species. A very disturbed community, where the organisms are under some form of stress, may contain a greater range of microbes. For example, our intestines contain millions of microbes, almost all of which are essential for our health and well-being, aiding our digestion and providing essential amino acids. While we are healthy, our microbial community remains constant, with the predominant members dividing as quickly as they are being lost. If we take a course of antibiotics, however, the microbial community will be disturbed, many of the most abundant microbes will be killed off, and other microbes (some of which may cause harm) can start growing in their place. This is why an upset stomach often follows a short course of antibiotics and is usually resolved once the antibiotics are completed and the normal microbial populations can recover.
R. Bizzoco, SDSU
Figure 4. E. coli—the model organism. Because it is common and is easily manipulated, E. coli has long been used in biotechnology studies to grow large volumes of DNA for sequencing.
A third question being tackled by metagenomics is concerned with what is going on in each environment: what is it doing? Part of the answer to this question comes from analysis of what is in each environment. If we can describe what microorganisms are there, we can infer what they are doing. It is becoming increasingly clear, however, how limited our understanding of microbial functions is when it is based just on the organisms that are present. By taking all of the DNA in the sample at once and identifying the protein functions that are present (see below), we are learning that different microbes are doing the same things and, surprisingly, that the same microbes have different roles in different environments.
Source: R. Edwards, SDSU/ANL Illustration: A. Tovey
Figure 5. BLAST complexity of similarity searches for 10 trillion letters of DNA used to understand microbial metagenomes.

Computational Comparisons for Identifying Community Members
DNA and proteins are, at their heart, strings of characters. For decades computational scientists developing bioinformatics algorithms have leveraged numerous string-comparison algorithms developed in other fields to identify similar DNA or protein sequences in different samples. The basic notion is that if a sequence is similar to something that is already known, and especially if it is similar to something someone working in a laboratory has studied, we can reasonably infer that our new sequence is performing the same function; that is, we can transfer the annotation from the original sequence to the new sequence. To address the question "what is there?" researchers use a combinatorial approach. First, all instances of a special gene, called the 16S rRNA gene, are identified based on their similarity to a database of known sequences. This approach uses the heuristic BLAST method to quickly scan the sequences for likely instances of this gene (figure 5). Depending on the length of the query sequence, the size of the database, and the overall score, some alignments might occur just by the random chance of finding two somewhat similar strings of letters. Some of these are screened out by a statistical analysis, which provides an estimate of the likelihood that the alignment occurs at random, but near matches, including some potentially spurious alignments, are refined using a global alignment program. During this alignment process, the order of the sequences is usually shuffled to find the most parsimonious representation of the alignment. Once a significant and reliable alignment is generated, the number of differences between each pair of sequences is counted. These differences are an approximate measure of the length of time since what biologists call a common ancestor. After a cell divides in two, each of the daughter cells begins a new lineage, accumulating changes over time.
These natural changes occur very slowly and are generally just mistakes in the DNA replication. On average, and in normal growth conditions, about one mistake is made during the replication of about 10 million letters of DNA. Over generations and generations of time, however, these mutations accumulate and can be used as a surrogate marker for the time since the cell divided in two. If two DNA sequences have very few changes, they are likely from cells that are closely related to each other. In contrast, if sequences have very many changes between them, it has probably been a long time since the original cell division. Although the topic remains controversial, most biologists consider that two organisms are the same if their 16S genes are about 97% identical to each other. As the similarity between these sequences decreases, the organisms are less and less likely to be the same. Using this approach, scientists can identify the organisms in an environment from the raw DNA (figure 6).
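The counting step can be sketched as follows, assuming the two 16S sequences have already been aligned to the same length, with "-" marking gaps. The 97% threshold comes from the text; how gaps are scored is a choice made for this example (a gap opposite a letter counts as a difference), and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns in which both sequences carry the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both sequences carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def likely_same_organism(aligned_a, aligned_b, threshold=97.0):
    """Apply the roughly 97% rule of thumb for 16S genes."""
    return percent_identity(aligned_a, aligned_b) >= threshold


reference = "ACGT" * 25                                      # 100 letters
variant = "T" + reference[1:50] + "A" + reference[51:]       # 2 changes
print(percent_identity(reference, variant))      # 98.0
print(likely_same_organism(reference, variant))  # True
```

Two changes in 100 letters leave the sequences 98% identical, so under this rule of thumb they would be treated as the same organism.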
Source: NASA Illustration: A. Tovey
Figure 6. Snapshot of the tree of life first proposed by Carl Woese and now accepted by scientists to reflect the likely evolutionary history of life on Earth. Bacteria, archaea, and small eucaryotes are small, single-celled organisms. The archaea and eucaryotes share a common ancestor that split off from bacteria approximately 3,000 million years ago. Bacteria radiated into the variety of forms that are studied today and are so influential to our lives, while eucaryotes evolved into plants, fungi, animals, and other multicellular organisms.
Another computational approach taken to compare sequences is to repeat the alignment many hundreds or thousands of times, each time randomizing the order of addition of sequences to the build. Hence, no single early strong alignment dominates the remaining datasets. In this approach, the number of times two sequences are found to be near each other is counted and may be used either as a surrogate for time, or more routinely as support for a single assertion of time from one alignment.
A third approach used to categorize the sequences in a sample is to compare the frequency of different combinations of letters in the DNA sequence. Recall that there are 64 possible triplets that could be used to encode the 20 different amino acids that bacteria use. Since the earliest days of molecular biology it has been known that not all microbes use these triplets with the same efficiency or frequency. Some microbes favor a particular suite of triplets, although the reasons why are not fully understood. However, by counting the occurrence of single letters, pairs, triplets, and so forth in the DNA sequences, each fragment may be placed in an appropriate classification schema. Before analyzing the de novo data, training sets are developed by using known complete genomes and the classification of the organisms from which those genomes were derived. The test data are then compared to the training data to suggest the most likely placement of the unknown sequences in the classification schema.
Two similar approaches classify fragments based on their raw nucleotide composition. The first, developed by researchers at the Max Planck Institute for Marine Microbiology, uses four consecutive letters (tetranucleotides); the second, developed by researchers at IBM Life Sciences, classifies sequences based on seven consecutive nucleotides (heptamers) using a support vector machine classification algorithm.
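The composition idea can be illustrated with a small Python sketch that counts tetranucleotides and assigns a fragment to the training genome with the most similar profile. The nearest-profile rule used here is a stand-in chosen for brevity; it is not the Max Planck method or the IBM support vector machine, and the function names and toy sequences are invented for the example.

```python
from collections import Counter


def kmer_frequencies(sequence, k=4):
    """Relative frequency of every run of k consecutive letters in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Sum of absolute frequency differences (0 = identical, 2 = nothing shared)."""
    return sum(abs(p.get(kmer, 0.0) - q.get(kmer, 0.0)) for kmer in set(p) | set(q))


def classify(fragment, training_genomes, k=4):
    """Return the label of the training genome whose k-mer profile is closest."""
    fragment_profile = kmer_frequencies(fragment, k)
    return min(
        training_genomes,
        key=lambda label: profile_distance(
            fragment_profile, kmer_frequencies(training_genomes[label], k)
        ),
    )


# Toy training set: two made-up "genomes" with very different composition.
training = {
    "AT-rich organism": "ATATTAATATAT" * 5,
    "GC-rich organism": "GCGCCGGCGCGC" * 5,
}
print(classify("ATATATTAAT", training))  # AT-rich organism
```

Setting `k=7` would count heptamers instead, at the cost of a much larger and sparser profile for short fragments.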
 
Mathematical Approaches for Organizing Community Members
Mathematical modeling has led to a greater understanding of how the microbes in a given community are organized. Like every other community, microbes are stratified with a very large number of a few species, fewer occurrences of more species, and a lot of species that are present in very low numbers. This organization can be represented graphically with a rank-abundance curve plotting the number of times a species is found on the y-axis, and ordering them from most abundant to least abundant on the x-axis. Since all the DNA is extracted from an environmental sample and then a proportion of it sequenced at random, the frequency with which sequences are identified should reflect the overall complexity of the environment. That is, the rank abundance of the species in the DNA sequences should reflect the rank abundance of the species in the environment. Rarely, however, is an entire sequence found more than once, and so an alternative species definition was developed to reconstruct the rank abundance curve from the sample.
Recall from the description of complete genome sequencing that adjacent pieces of DNA are joined into a contiguous stretch of sequence if they have 20 letters or more in common. It is very rare (a chance of 1 in 4^20) that this would occur randomly. Recall also that, at the 16S level described above, about 97% identity is required for two sequences to be considered to come from the same organism. We can merge these two simple tenets to achieve a new definition of a species from a metagenome: we require at least 20 base pairs of greater than 97% sequence identity between two overlapping fragments. If we can identify all the times that these similar sequences occur in our sample, we can calculate the number of species present in the sample. Moreover, if three sequences overlap and fit our modified species concept, we know that they are more abundant than if only two sequences overlap and fit our new concept. Therefore, we can generate a rank-abundance curve based on similarity between sequences in our sample, without comparing them to any other sequence dataset. This is an exclusively intrinsic measure of the number and organization of the microbes in the sample as it does not require similarity to any external sequences or other data.
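A rough sketch of this intrinsic measure is shown below. It applies the overlap rule from the text (at least 20 letters at greater than 97% identity) to every pair of fragments, groups fragments that are linked directly or through intermediates, and reports the group sizes from largest to smallest. The all-against-all comparison and the grouping strategy are simplifications chosen for clarity, not the production algorithm, and the names are hypothetical.

```python
from collections import Counter


def overlap_identity(a, b, k):
    """Fraction of matching letters when the last k letters of a overlay the first k of b."""
    return sum(x == y for x, y in zip(a[-k:], b[:k])) / k


def same_species(a, b, min_len=20, min_identity=0.97):
    """True if the fragments overlap by at least min_len letters at > min_identity."""
    for first, second in ((a, b), (b, a)):
        for k in range(min_len, min(len(first), len(second)) + 1):
            if overlap_identity(first, second, k) > min_identity:
                return True
    return False


def rank_abundance(fragments):
    """Group linked fragments and return group sizes, most abundant first."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # shorten the path as we walk up
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if same_species(fragments[i], fragments[j]):
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(len(fragments)))
    return sorted(sizes.values(), reverse=True)


# Three overlapping reads from one made-up genome plus one unrelated read.
genome = "ACGTAGCTTGACCATGGTCAAGCTTCGAGTACCGATTGCATCGGAATCCA"
reads = [genome[:30], genome[8:38], genome[16:46], "G" * 30]
print(rank_abundance(reads))  # [3, 1]
```

Note that with a 20-letter overlap, "greater than 97%" allows no mismatches at all; a single mismatch is only tolerated once the overlap reaches 34 letters.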
Source: P. Morris Illustration: A. Tovey
Figure 7. Cycling of carbon, nitrogen, and phosphorus through the environment and into soil storage. The process is driven primarily by microbes.
 
Several standard algorithms have been identified to map all of the sequences to each other in a sequence sample. Most of these were developed for complete genome sequencing where, as described earlier, the ultimate goal is the reconstruction of a single complete DNA fragment from small pieces. With minor modifications, we can reuse this code to estimate the number and abundance of different microbes in any given sample based on our modified species criteria. These rank abundance curves can be parameterized against a series of standard equations (for example, the power law or log-normal distribution), and by error minimization techniques we can estimate the rank abundance of the microbes in the original sample. Researchers at San Diego State University have generated a web interface for this analysis (http://biome.sdsu.edu).
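As one example of such a parameterization, a power law of the form abundance = a × rank^(-b) becomes a straight line when both axes are logarithmic, so it can be fitted by ordinary least squares on the logarithms. The sketch below assumes that form and that fitting method; the web tool mentioned above may use different equations or error measures, and `fit_power_law` is a name made up for this example.

```python
import math


def fit_power_law(abundances):
    """Fit abundance = a * rank**(-b) to a rank-abundance curve.

    `abundances` must be positive and ordered from most to least abundant.
    Returns (a, b), estimated by least squares on log(rank) versus log(abundance).
    """
    if len(abundances) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(rank) for rank in range(1, len(abundances) + 1)]
    ys = [math.log(n) for n in abundances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic curve generated from a = 100, b = 1.5, so the fit should recover them.
curve = [100.0 * rank ** -1.5 for rank in range(1, 11)]
a, b = fit_power_law(curve)
print(round(a, 3), round(b, 3))
```

Comparing the residual error of this fit with that of, say, a log-normal fit would indicate which model describes a given sample better.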
 
Functional Annotation and Determining What the Microbes Are Doing
The accurate prediction, projection, and extension of the functions of different proteins between organisms and samples are perhaps the most demanding of the holy grails of bioinformatics. Subsystems classification has assisted this effort by providing a set of annotations that are both accurate and reliable, being curated by human experts and then compared across all microbial genomes computationally. These annotations are sorted into groups based on aggregating sets that perform a related function in the cell.
The process of annotating the DNA sequences begins with comparing the new sequences to those that have been seen before. The analysis uses the heuristic BLAST approach, ascribing a probability that each pair of sequences is similar. Once all the similarities have been found for a particular sequence, a judgment call is made to determine the function of the new sequence. In the annotation of complete genomes, where each sequence should have only a single instance of a given function, teams of expert biologists peruse the data and attempt to assign correct functions to the genes and proteins. In metagenomes, however, the volume of data vastly exceeds anything that a single curator can examine; hence, the new function is almost always simply taken to be the same function as that of the known sequence with highest probability of being the same. Computational biologists are developing some artificial intelligence approaches to assign a function from the subsystems, attempting to emulate the decisions that the experienced annotator would make. The subsystems approach provides many immediate advantages, including the fact that the built-in hierarchical classification can be leveraged for downstream analyses.
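The "take the best match" rule used for metagenomes can be sketched in a few lines of Python. The sketch assumes the similarity search has already been run and that each hit has been reduced to a (query, known sequence, E-value) triple, where a lower E-value means a more significant match. The data layout, the cutoff value, and the function names are assumptions for illustration, not the format of any particular pipeline.

```python
def assign_functions(hits, known_functions, max_evalue=1e-5):
    """Transfer to each query the function of its most significant hit.

    hits            -- iterable of (query_id, subject_id, evalue) triples
    known_functions -- dict mapping subject_id to an annotated function
    max_evalue      -- hypothetical significance cutoff; weaker hits are ignored
    """
    best = {}  # query_id -> (subject_id, evalue) of the best hit so far
    for query, subject, evalue in hits:
        if evalue > max_evalue or subject not in known_functions:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: known_functions[subject] for query, (subject, _) in best.items()}


# Made-up example data.
known = {"protA": "cellulase", "protB": "nitrogenase"}
example_hits = [
    ("read1", "protA", 1e-30),
    ("read1", "protB", 1e-8),   # weaker hit, ignored in favor of protA
    ("read2", "protB", 1e-12),
    ("read3", "protA", 0.5),    # above the cutoff, so read3 stays unannotated
]
print(assign_functions(example_hits, known))
```

Queries with no hit under the cutoff are left out of the result, which mirrors the many metagenome fragments that have no recognizable match among known sequences.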
R. Edwards, SDSU/ANL
Figure 8. Ecoregions in the United States and soil sampling sites within them. There are 104 ecoregions that cover the lower 48 states, and so far 20 different sites have been targeted for sampling. Once these have been analyzed, samples will be collected from the remaining regions and sequenced.
At this point high-performance computing becomes critical. Our accumulated knowledge of all proteins currently extends to 7,139,712 proteins comprising 2,503,057,853 letters. (Recall that there are more than three times as many DNA letters, since three DNA letters are used to encode one protein letter, and that some DNA letters do not encode proteins but are spacers, used to contain regulatory elements, and so on.) In order to identify putative functions within the sequences, the metagenome—the raw DNA sequence—is translated to the protein sequence computationally, and then the new, unknown sequences are compared to the known sequences using BLAST. This analysis is done in protein space because the protein sequence changes more slowly than the DNA sequence. Since more than one set of three DNA letters encodes for a single amino acid, the DNA can change but not alter the protein sequence—so-called silent or synonymous mutations.
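The translation step and the idea of a silent mutation can be shown with a short Python sketch. It uses the standard genetic code and translates a DNA string in all six reading frames (three on each strand), since the reading frame of a raw fragment is not known in advance. The function names are hypothetical, and real pipelines typically rely on established libraries rather than hand-written tables.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate DNA to protein starting at offset frame (0, 1, or 2).
    Codons containing unknown letters are written as 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frame_translation(dna):
    """Three forward frames followed by three reverse-strand frames."""
    reverse = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [
        translate(reverse, f) for f in range(3)
    ]


# GCT and GCC both encode alanine, so this single-letter DNA change is silent.
print(translate("ATGGCTGAA"))  # MAE
print(translate("ATGGCCGAA"))  # MAE
```

The two DNA strings differ by one letter yet give the same protein, which is why comparisons in protein space detect similarities that DNA-level comparisons can miss.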
In metagenomics analysis, the BLAST comparisons of all the fragments that are sequenced against the standard databases of known data are the most time-consuming task. Currently it takes about 100 CPU-hours to compare a single metagenome against known datasets and assign the functions. Of course, this process depends on the sizes of both the metagenome and the database and, as noted above, both of these are increasing with the deployment of next-generation sequencing machines. In an analysis of more than 10 gigabase pairs of DNA sequence comparisons (1 x 10^10 letters compared to a standard database of approximately 1.5 x 10^9 letters), researchers at Argonne National Laboratory showed that the compute time scales with linear complexity and that with sixteen-core architecture it takes approximately seven seconds to process 1,000 DNA sequence letters, or 7 x 10^7 seconds (about 810 days) to process the entire dataset. Once the comparison has been computed, the transference of annotation from known protein to unknown protein and the compilation of collections of proteins in different systems are relatively trivial computational tasks. Since these data are maintained in a series of relational databases, annotation is not a rate-limiting step. The Argonne databases are constructed to be agnostic of the underlying hardware and software profiles so that they can be rapidly deployed to new hardware as available. The machines used were a cluster of PowerPC G4 nodes and a few Intel quad-core PCs.
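The timing estimate in this paragraph follows directly from linear scaling, as the short Python calculation below shows. The throughput and dataset size come from the text; the function names and the assumption of perfect division of work across independent machines are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def serial_seconds(letters, seconds_per_1000_letters=7.0):
    """Compute time under linear scaling: seven seconds per 1,000 letters."""
    return letters / 1000 * seconds_per_1000_letters


def ideal_parallel_days(letters, workers, seconds_per_1000_letters=7.0):
    """Wall-clock days if the work divides perfectly across workers
    (an idealization that ignores data transfer and scheduling)."""
    return serial_seconds(letters, seconds_per_1000_letters) / workers / SECONDS_PER_DAY


total = serial_seconds(1e10)  # the 10 gigabase dataset from the text
print(f"{total:.1e} seconds, about {total / SECONDS_PER_DAY:.0f} days")
# prints: 7.0e+07 seconds, about 810 days
print(f"{ideal_parallel_days(1e10, workers=100):.1f} days on 100 such systems")
```

Under the idealized assumption, one hundred such systems would bring the job down to about eight days, which is why the ability to split the search across many machines matters so much.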
These computations have radically altered our view of the microbial communities around us. For the first time microbiologists are gaining an understanding of the complex interplay between micro-organisms in their natural environment. Microbes are performing fascinating functions, sharing substrates or reaction products, and not acting in isolation. By comparing the functions in different samples, a team from San Diego State University, Argonne National Laboratory, and their collaborators showed that the annotation becomes predictive of the sample. For example, when supplied with a set of DNA sequences from an unknown sample, they can correctly assign it to an environmental domain about 90% of the time, based solely on predictions of what the microbes are doing with their DNA. Moreover, they showed that this was true both for the microbes themselves and, unexpectedly, for the viruses that prey on them. Although it is not yet obvious what those viruses are doing, they appear to be sampling their hosts' DNA and enriching the most useful things.
 
Impacts of Metagenomics
Now we will take a look at some recent findings selected from the vast range of different environments that are being targeted by the new tools that metagenomics provides.
 
Soil Metagenomics
Microbes are fundamental constituents of our soil, turning over the dead vegetable and animal matter and returning the nutrients therein back into the soil (figure 7). Microbial processes in the soil promote carbon sequestration and contribute to the approximately one petagram (1,000,000,000 metric tons) of CO2 equivalents sequestered by soils annually. If computational biologists can identify which processes are accumulating carbon in the soil (not just converting atmospheric CO2 to fixed carbon through photosynthesis, but converting that fixed carbon to soil organic matter), biologists may be able to enhance the deposition of carbon in the soils. Microbes are also chewing on the nitrogen-based fertilizers applied to our crops. Typical corn-belt crops are supplemented with 150 kilograms of nitrogen per hectare and 50 kilograms of phosphorus per hectare, adding 10-20¢ per liter to the cost of corn-derived ethanol. Much of this fertilizer is not directly used by the plants but is instead consumed by the microbes and may be lost forever from the plant food cycle. By identifying the microbes that are using the ammonia fertilizers and understanding how the nitrogen is being converted, the amount of fertilizer that needs to be applied to crops to achieve sustainable, energy-efficient agriculture can be reduced.
As shown in figure 8, 104 different ecoregions have been defined to cover the contiguous 48 states. It is not known how the microbes vary between or within these regions, from the mature forest of the Northeast, through the farmlands of the Midwest, to the arid desert of the Southwest. If the microbes that promote healthy soils, reduce the breakdown of ammonia, and enhance carbon sequestration can be identified and encouraged, the health of our nation can be promoted. In 2008, Argonne National Laboratory established the terragenomics project to further our understanding of the soil microbial processes that govern carbon sequestration, the turnover of fertilizers, and the development of fertile agricultural land. "The project has the ambitious goal of integrating metagenomics sequence data with biogeochemical, ecological, meteorological, satellite, and any other pertinent data to understand the cycling of nutrients through the terrestrial environment," says Rick Stevens, associate laboratory director for Computing, Environment, and Life Sciences at Argonne. The project started modestly, with the aim of sequencing representative microbial communities from 20 different soil samples dispersed around the United States. These sites were chosen because of the abundant metadata that describe the locations and that will be integrated into the analysis; each site is part of the National Science Foundation National Ecological Observatory Network. These sites are also the target of long-term ecological studies, and so a wealth of additional data is available to support the sequence-based analysis. In subsequent phases, more sites that have been studied by other ecosystems biologists will be sampled, and many of the previous sites will be resampled to see how they have changed over time and with different environmental and nutrient treatments (figure 9).
P. Morris
Figure 9. The proportion of the five most abundant subsystems identified from a prairie site, two grasslands, and a soy field site. The bars show the frequency out of 100,000 sequences.
 
Biofuels
The conversion of plant matter to ethanol or other fuels is hampered largely by a single factor: plant cell walls are extremely resistant to degradation. If such were not the case, all the plants in the world would become food for the microbes, and the plants would end up as a gooey mess! In the current ethanol production pipelines, the cell walls are broken down by a combination of high temperature and high pressure, a process that requires a lot of energy. Bacteria in a few, very specialized environments have evolved to break down plant cell walls and use the breakdown products as energy. For example, cows have a complex digestive system where plants are broken down. By studying the microbes in the rumen, scientists aim to identify the genes that encode the enzymes that cleave the cellulosic material that holds plants together. With genetic engineering techniques, these enzymes could be made in bioreactors, providing a cleaner, cheaper, more fuel-efficient source of ethanol than the current methods used to convert corn. Similarly, at DOE's Joint Genome Institute researchers have focused on termite hindguts, small microbial factories that have been nurtured by the termites to eat cellulosic wood, converting the recalcitrant carbon into more usable sources. The termites benefit from this relationship by using the microbial fermentation products as an energy source. Using metagenomics, researchers have identified, cloned, and expressed several key microbial genes that are responsible for the degradation of wood to fermentative products. These isolated products, or benign bacteria harboring them, could potentially be used in the bioreactors that convert corn to ethanol, resulting in cheaper and more efficient ethanol production.
 
Human Health
Metagenomics is also influencing human health. Humans have about ten times as many microbes as human cells, and they not only affect our acute health through common diseases like diarrhea or pneumonia, but also cause chronic diseases such as cancer. The combination of an organism (such as a person or a coral reef) and all associated microflora is an emerging topic of study in ecology; it is referred to as a holobiont, and the genetic complement of host and microbes is called the hologenome. Recently the metagenomics approach was used to show that the types of bacteria in the intestines might be responsible for our girth. Obesity correlated with an abundance of a type of bacteria that has the ability to turn over energy more rapidly than normal, and it may be that the bacteria are releasing more energy from our food that is subsequently consumed and stored as excess fat. A plethora of diseases is associated with microbial causes, but specific microbes may not cause many of these. Rather, shifts in the microbial communities may alter functions that microbes are doing. In 2007 the National Institutes of Health announced the Human Microbiome Project, a large sequencing effort to describe all of the microbes associated with people and to begin to compare the presence and function of microbes in healthy and diseased individuals. This effort may ultimately lead to so-called personalized medicine. In what may revolutionize visits to family physicians, researchers will begin by sequencing our genomes, and the genomes of our microbes, to figure out what ails us. Then, based on our genetic complement and known drug-genome interactions, specific pharmaceuticals will be crafted specially to heal our wounds without causing unwanted side effects with other aspects of our metabolism.
 
Computational Challenges
Metagenomics presents several problems for computational science. Some are welcome, with easy solutions; others will require more development and deployment of computing infrastructure. One of the primary challenges presented by these data is the inherent linearity of the searches. Since each sequence must be compared to the existing dataset, there are few ways to avoid this linearity. Current architectures achieve a throughput of approximately 1,000 DNA letters in about seven seconds. A standard dataset contains approximately 1 x 10^8 letters. On the other hand, sequence analysis is embarrassingly parallel. The statistics of DNA sequence comparisons are well understood, and the output of a single comparison is independent of all other comparisons. Therefore, the data can be cut and diced into smaller and smaller datasets, ensuring that both the query sequences and the known sequences remain in memory and do not require disk access. Of course, the problem with cutting the datasets smaller then becomes the interconnect transfer rates to place the known sequences on each compute node. The standard program for comparing DNA and protein datasets was published in 1990 (BLAST), but researchers at DOE's Pacific Northwest National Laboratory recently released a parallel version, called ScalaBLAST, that takes advantage of global arrays stored across a whole cluster to improve computation time in part by reducing network latency. This version results in a significant speedup compared to traditional approaches (figure 10).
Source: PNNL Illustration: A. Tovey
Figure 10. Prefetching hides memory latency for distributed systems. In this figure, an approximately 50% boost in scaling is evident on MPP2 from prefetching, allowing ScalaBLAST to perform on the distributed-memory system (Linux cluster with dual 1.5 GHz Itanium-2, and Quadrics QSnet-II network) almost exactly the same as it does on the shared-memory system (SGI Altix, 1.5 GHz Itanium-2). Prefetching was implemented by creating local memory buffers that are being filled via remote direct-memory access calls (transparent to the user) while processing is taking place on the previously filled buffer. When the end of a buffer is reached, the one-sided memory operation is then checked for completion and the active buffer is swapped while a new prefetch operation is initiated.
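Because each comparison is independent of all the others, the query sequences can be split into chunks, searched separately, and the results merged afterward. The Python sketch below illustrates that structure only. It is not ScalaBLAST or BLAST: the scoring function is a hypothetical stand-in that counts shared 8-letter words, the database is assumed to be non-empty, and threads are used purely to keep the example portable. A real deployment would spread the chunks across separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=8):
    """All words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(query, database):
    """Stand-in for a real similarity search: name of the database
    entry sharing the most k-letter words, or None if nothing is shared."""
    query_words = kmers(query)
    score, name = max(
        (len(query_words & kmers(seq)), name) for name, seq in database.items()
    )
    return name if score > 0 else None


def search_chunk(chunk, database):
    """Search one chunk of (query_id, sequence) pairs on its own."""
    return {query_id: compare(seq, database) for query_id, seq in chunk}


def parallel_search(queries, database, workers=4):
    """Split the queries into chunks, search each independently, merge."""
    items = sorted(queries.items())
    chunks = [items[i::workers] for i in range(workers)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(search_chunk, chunks, [database] * workers):
            results.update(partial)
    return results


# Made-up sequences for illustration.
toy_database = {"dbA": "ATGGCGTACGTTAGC", "dbB": "TTTTCCCCAAAAGGGG"}
toy_queries = {"q1": "GCGTACGTTA", "q2": "CCCCAAAAGG", "q3": "ACACACACAC"}
print(parallel_search(toy_queries, toy_database, workers=3))
```

The merged result is the same as searching all queries in one pass, which is the property that makes the problem embarrassingly parallel. The remaining cost, as noted above, is getting the known sequences onto every compute node.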
Another challenge for computational science is the integration of these disparate datasets in a meaningful manner. Moving toward the goal of sequencing the microbes from soils around the United States requires the infrastructure to store, retrieve, and analyze not only the sequence data but also all of the abstract data types that are collected. As noted above, researchers are integrating all scales of environmental analyses in the terragenomics project—everything from the molecular and genetic characterization of individual proteins, through the viruses, bacteria, archaea, and eukaryotes (including both eukaryotic microbes such as yeasts and large plants such as corn and soyabean), to ecological-scale interactions between plants and animals, and even to satellite-scale imagery. Integration and simulation of these datasets will require exascale computing facilities, far beyond those currently available. Similarly, human health studies must consider the health of the holobiont, integrating all of the health history of the patient with the environmental influences and the genomes of the microbes that reside in and on us. Only then will we be able to truly design medical services that are tailored to our individual needs.
 
Toward Practical Applications
Metagenomics is unveiling the complexity of microbial communities that surround us and affect our everyday lives. We are finally beginning to understand what bacteria are doing and how they are doing it. The implications for energy production and consumption, agriculture, human health and every other facet of our economy are enormous, and the interdisciplinary melding of computational sciences and biology, a field called bioinformatics, is leading the way.
 
Contributors: Dr. Robert Edwards, computational biologist at Argonne National Laboratory and San Diego State University
 
Further Reading
Terragenomics Project
http://terragenomics.mcs.anl.gov/

E. A. Dinsdale et al. 2008. Functional metagenomic profiling of nine biomes. Nature 452: 629-632.