| ECOLOGY AND METAGENOMICS |
| Gaining New Insights into MICROBIAL Communities |
We are not alone on this planet. Of course, we share the Earth with plants and animals that we see all around us, but what we usually do not appreciate are the unseen organisms that surround us, encompass us, and affect us every day—microbes. With leadership-class computing facilities and new technologies for sequencing and analyzing genomes, microbiologists are gaining new insights into what these tiny organisms look like, how they function, and how we might utilize them to improve our environment and our lives.
|
| An estimated 10^30 microbial cells exist on the Earth. These are the smallest members of our environment and are visible only under the microscope (hence their name). "Microbes" is a catch-all term that covers each of the three domains of life (bacteria, archaea, and small eucaryotes) as well as viruses (not really considered alive by most). Microbes are single-celled entities that live by basic rules: eat, do not be eaten, and divide (sex is optional). Depending on where they are and what they are using for food, their life cycle can be as short as a few hours or days or as long as hundreds or thousands of years. |
| Microbes inhabit every conceivable environment on Earth and are responsible for a vast array of processes touching every aspect of our lives. For example, the oceans are teeming with them; every mouthful of seawater has about 10 million microbes swimming around. There are vastly more microbes than sharks in the ocean, and the microbes may also be more dangerous: most of the diseases that ail us are caused by our microbial foes. Although microbes are well known as causes of food poisoning and tuberculosis, they are increasingly becoming implicated in chronic diseases that take many years to develop, such as atherosclerosis (heart disease) and cancer. |
| Microbes also are responsible for many beneficial processes. Most people are aware of the important roles of microbes in food production (such as beer, wine, and cheese). Microbes additionally are the source of almost all antibiotics that are used to treat infections. Microbes that live in our bodies are beneficial, too. They are critical for our survival through the production of essential amino acids that we cannot make ourselves. Beneficial microbes also coat our surfaces (skin, intestines, and so on) and prevent or limit "bad bugs" from binding to those surfaces. This is the natural action of many of the probiotics promoted by the health food industry. |
| Microbes are tenacious, able to survive in seemingly inhospitable environments. Recently, bacteria have been found living inside rocks deep in the Earth; these remarkable creatures use uranium radiation instead of sunlight or heat as an energy source. Microbes have adapted to extremes of acidity or temperature as well; some can live in solid ice at or below 0°C, forming small channels with highly-concentrated solutes that prevent them from freezing and allow them to move around. In fact, the physical limits—too hot, too cold, too much pressure, and so forth— beyond which microbes are unable to survive are unknown. |
Biologists have spent years trying to understand what the different kinds of microbes are doing and how they are doing it. Recently, however, the field of microbial ecology has received a double shot in the arm, from new sequencing technology and from leadership-class computing facilities, with the promise of dramatically enhancing our understanding of these fascinating creatures and how we can use them to improve our lives. |
DNA—The Code of Life
The genetic material of life, deoxyribonucleic acid (DNA), encodes almost all functions that microbes (and the rest of us) carry out. Ribonucleic acid (RNA) is a primordial variant of DNA. A DNA molecule is generally a long strand made up of four different chemical bases: guanine (G), cytosine (C), adenine (A), and thymine (T). All variation in life is due to the different organizations of these bases, or "letters." A single strand of DNA can be hundreds, thousands, or millions of letters long. The human genome, for example, is approximately three billion letters spread over 23 strands of DNA. Most microbe genomes are much smaller, typically a few million letters long on one or two strands of DNA. Viruses are smaller still, often only a few thousand or tens of thousands of letters long. This genetic information is a storage system. It evolved to allow data to be retained over long periods of time and to be faithfully replicated. The entire complement of DNA in a cell, its genome (sidebar "Key Terms" p51), is analogous to a file system, and a single DNA molecule is analogous to a single disk drive: it contains multiple files that can be accessed essentially in any order whenever they are needed. |
 |
| J. Insley |
| Figure 1. Tyrosine, one of the 20 amino acids used by cells to synthesize proteins.
|
|
| Biologists call the equivalent of a file on a strand of DNA a gene: a string of letters that encodes a specific series of amino acids. A single DNA molecule will typically have thousands of genes along its length in a somewhat linear order (genes can overlap slightly, though they tend not to). Within a gene, the letters are read by the cell in triplets (three letters at a time), and each triplet is translated into one of the 20 possible amino acids. The amino acids are joined together as proteins to make the enzymes, cell structures, and so on that are the key components of life. For example, the triplet CCA codes for proline and the triplet GGC for glycine. Each gene is demarcated by a start and a stop triplet, so four triplets have special meaning to the machinery that reads the DNA: ATG means "start here with a methionine"; and TGA, TAA, and TAG mean "stop here without adding anything else." (ATG also serves a second function, meaning "add a methionine" if it is in the middle of a region.) Thus, the machinery runs along the DNA until it finds an ATG and then starts with a methionine. It translates each subsequent set of three letters into an amino acid until it reaches one of TGA, TAA, or TAG, when it stops the translation. The machinery then tracks along until it finds another ATG signaling the start of the next region. Observant readers will note that there are 20 different amino acids commonly used in creating proteins (figure 1) but 64 possible combinations of three-letter words using a four-letter alphabet. More than one combination of three letters can therefore code for a single amino acid. For example, CCA, CCC, CCG, and CCT all code for proline. In this sense the genetic code is redundant. |
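The reading rules described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular annotation tool works: the function name `translate_first_gene` is hypothetical, the table is the standard genetic code written in the conventional T, C, A, G ordering, and the sketch assumes a single strand read left to right with only the letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three stop triplets (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_first_gene(dna):
    """Find the first ATG, then translate triplets until a stop triplet.

    Returns the protein as one-letter amino acid codes (M = methionine,
    P = proline, G = glycine, ...). Returns "" if there is no ATG.
    Assumes `dna` contains only A, C, G, and T (illustrative sketch).
    """
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # TGA, TAA, or TAG: stop without adding anything
            break
        protein.append(amino_acid)
    return "".join(protein)


# Redundancy of the code: every triplet that codes for proline (P).
proline_triplets = sorted(t for t, aa in CODON_TABLE.items() if aa == "P")
```

For example, `translate_first_gene("GGATGCCAGGCTGATT")` skips the leading letters, starts at ATG, reads CCA and GGC, and stops at TGA. Building the table from a 64-character string keeps the sketch short; a real pipeline would also handle the alternative translation tables that some organisms use.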
| Figuring out the order of amino acids in a protein requires incredibly complex chemical reactions. The process is difficult to perform in the laboratory and not amenable to high-throughput analysis. In contrast, generating the DNA sequence is relatively trivial to perform, and converting from the gene sequence to the protein sequence is a simple computational task. Most organisms use the same translation dictionary to go from DNA triplets to amino acids. There are variations, but they tend to be extreme cases that can be accommodated computationally. Therefore, part of our understanding of genome sequences and the genetics of life comes from sequencing the DNA strands of individual genes, chromosomes, and organisms. However, genetic experiments provide most of our basal-level understanding of the genes that DNA contains, the proteins that those genes encode, and the functions of those proteins. Biologists disrupt individual genes and then look for perturbations in the growth of the organism. Readily identifiable effects may include, for example, resistance to an antibiotic or the inability to grow on a particular sugar or chemical. |
| Identifying the role that a particular gene product plays in the cellular machinery of life is a time-consuming, expensive, and complex task that may take decades. However, once the role of a protein has been identified in one organism (the protein has been annotated), similar proteins can be readily identified in other organisms through homology searches. Large swathes of bioinformatics—the application of computational science to biological problems—involve the accurate and appropriate transfer of annotations from the very few experimentally characterized proteins to the very many that have only been predicted into existence through implication based on their DNA sequence (figure 2). |
 |
| Source: R. Edwards, SDSU/ANL Illustration: A. Tovey |
| Figure 2. A map of the genome of the common soil micro-organism Bradyrhizobium japonicum. The genome is almost 10 million letters long and contains over 8,500 different genes. Only the region from 0 to 10,000 letters is shown, and the yellow boxes represent the genes in that region. |
|
Gene Products Working and Living Together
Genes that tend to be needed at the same time are organized together along the chromosome. Biologists call these regions on the chromosome "clusters" or "operons." The latter term implies that there is a direct relationship between the access of one gene and the access of the next gene; however, this relationship is essentially impossible to prove computationally and relatively difficult to prove experimentally. Therefore, the more generic term "cluster" is usually used unless a direct relationship has been shown. These clusters may be thought of as directories in a file system. They order files together that are used together, allowing immediate access for a whole suite of information. Once the index of the directory or the start of the cluster has been identified, all of the information contained therein can be realized essentially in O(1) time. For example, the amino acid arginine is a complex chemical made by a variety of different bacteria. The cellular manufacture of arginine requires seven enzymes, and many bacteria possess the ability to make it from scratch. Each of the bacteria able to make this chemical stores the necessary information in its DNA. In most cases that have been studied, the genes describing how to make arginine are organized head to tail in a cluster along a single stretch of DNA. |
This arrangement makes annotating genomes easier. The clusters can be used as guides, and genes that work together can be categorized based on the things that they do. Just as with the file system analogy, each directory may contain a vastly different set of files, and a single file could possibly be placed in more than one directory but likely ends up in the most appropriate place. Often a single protein is used in more than one pathway, although typically its gene lies in a cluster with other genes that are used at the same time, so the genetic hierarchy can be represented as a directed acyclic graph. Efficient information access and retrieval emerges from the organization of genes along a genome. This natural classification is leveraged in an annotation schema, using the clusters as a foundation for an annotation hierarchy based on the notion of a subsystem. The subsystems cover activities such as making amino acids from scratch or breaking amino acids down either to use the chemicals in the amino acid or to convert between different amino acids. The subsystems also cover common cellular processes, such as making more cell wall or copying the DNA. In fact, all the processes and functions of the cell are described in terms of the genes required to fulfill those functions. |
New Technology Revolutionizing Our Understanding of Life
One of the original sequencing technologies, developed by Frederick Sanger and colleagues in 1975, has remained with us for over 30 years and has been the mainstay of DNA sequencing so far. Sanger is one of only four people awarded two Nobel prizes: the first in chemistry in 1958 for studies on the structure of insulin, and the second in chemistry in 1980 for the technology to sequence DNA. This technology, largely unchanged, was used to sequence the very first genome (a virus sequenced in 1977 containing only 5,375 letters) as well as both the private and public human genome sequences published in 2003. Nowadays, Sanger sequencing can unambiguously identify the correct order of about 750 to 1,000 letters on a single strand of DNA at a time. To obtain the complete sequence of an organism's DNA, a great many separate reactions are performed, each of which generates the sequence of a different, randomly chosen stretch of the overall DNA sequence. Sufficient sequences are generated so that each letter occurs in eight to ten reactions; then all the sequences are combined computationally by identifying contiguous overlapping sections. Typically the algorithms that identify the overlaps require a minimum of 20 letters, providing 4^20 (roughly one trillion) possibilities. Identical sequences should therefore not occur by chance but should represent those occasions when the same piece of DNA has been sequenced more than once. The two fragments can be joined at the point of overlap to create a single contiguous sequence. |
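The joining step can be illustrated with a short Python sketch. It is a deliberately simplified version of what assembly programs do: it requires the overlap to match exactly (real assemblers tolerate sequencing errors), it considers only one orientation of each read, and the function name `merge_if_overlapping` is hypothetical. The 20-letter minimum comes from the text; with four possible letters at each of 20 positions there are 4^20 distinct overlaps.

```python
MIN_OVERLAP = 20  # minimum shared letters, as described in the text


def merge_if_overlapping(left, right, min_overlap=MIN_OVERLAP):
    """Join two reads if the end of `left` exactly matches the start of `right`.

    Tries the longest possible overlap first and returns the merged
    contiguous sequence, or None when no overlap of at least
    `min_overlap` letters exists. Exact matching is a simplification.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Number of distinct 20-letter strings over a four-letter alphabet.
POSSIBLE_OVERLAPS = 4 ** 20  # 1099511627776
```

Trying the longest overlap first means two reads that share more than 20 letters are merged along their full shared region rather than at the first 20-letter match.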
| The primary limitation in analyzing sequence data using Sanger sequences was the actual biological and chemical interpretation of the order of the letters on the DNA strand; the physical science was limiting the computational science. "But all this has changed in the past five to ten years," says Dr. Robert Edwards, computational biologist at San Diego State University and Argonne National Laboratory. "Several new technologies have been developed that dramatically reduce both the time and the cost of sequencing, while simultaneously increasing the amount of sequence generated. For example, with the new generation of sequencing machines, the yield of a single reaction is typically between 500 million and one billion letters!" |
| Pyrosequencing (commercially available from Roche Applied Sciences) is by far the most advanced technology of the next-generation sequencers. This approach uses an enzymatic reaction to generate a pulse of light each time a particular letter is read from a DNA strand. A single piece of DNA is attached to a spherical bead 28 microns in diameter. The bead-DNA combination is spread over glass plates that contain very small wells, each only 54 microns in diameter. Only a single bead will fit in a well, thereby ensuring only one piece of DNA is read at a time. Raw chemicals that create the letters are added to the plate one at a time: first the G, then after a wash step the A, then after another wash step the C, and so on through hundreds of cycles of G, A, C, T. As one of the DNA chemicals is added, the reaction that copies the DNA on the bead releases a chemical (pyrophosphate) that causes an enzyme in the well to create a brief light pulse that is captured by a CCD (charge-coupled device) camera. Therefore, with each addition of a chemical, only those wells that have added that particular letter will flash. Over the course of all the cycles, each well is read multiple times, identifying the letters and the order in which they are added. The current capacity of this machine, called a 454 FLX, is to sequence about 200 consecutive letters from as many as 500,000 pieces of DNA in just a few hours. |
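The cycle-by-cycle readout for a single well can be mimicked with a toy simulation. This is only a sketch under stated assumptions: the flow order G, A, C, T is taken from the cycle named in the text, the function names `flowgram` and `decode` are hypothetical, and the signal is idealized as the exact number of identical letters added during one flow (in practice the light intensity is a noisy measurement, and a run of identical letters is added in a single flow).

```python
FLOW_ORDER = "GACT"  # assumed order of chemical additions within one cycle


def flowgram(sequence, flow_order=FLOW_ORDER, max_flows=400):
    """Simulate one well: return a (letter, signal) pair for every flow.

    `sequence` is the strand as it is synthesized, letter by letter.
    The signal is the number of letters added during that flow; 0 means
    the well stays dark. Idealized: real signals are noisy intensities.
    """
    flows = []
    position = 0
    for n in range(max_flows):
        if position >= len(sequence):
            break
        letter = flow_order[n % len(flow_order)]
        run = 0
        while position < len(sequence) and sequence[position] == letter:
            run += 1
            position += 1
        flows.append((letter, run))
    return flows


def decode(flows):
    """Rebuild the sequence from the recorded flashes."""
    return "".join(letter * signal for letter, signal in flows)
```

For instance, `flowgram("GATT")` records a flash for G, a flash for A, no light for C, and a double-strength signal for T, and `decode` turns that record back into `"GATT"`.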
| This technology has dramatically affected all aspects of sequence-based biology. For example, the original Human Genome Project cost $3 billion (U.S.) and was completed in a little over 13 years. Currently, commercial sequencing of a human genome costs about $350,000 and takes about six months. Cheaper, higher-throughput sequencing is revolutionizing our understanding of the world around us, and particularly of the microbes that inhabit it (figure 3). |
 |
| P. Morris |
| Figure 3. The explosion of sequence data continues. |
|
Genome Sequencing Meets Microbial Ecology
For many years microbial ecologists traveled to sites near and far to collect samples. After days, months, or even years of nurturing, they could occasionally culture individual bacteria from their samples. Then the long process of analyzing and investigating the metabolism began. These studies resulted in 5,000 to 10,000 well-characterized bacteria, catalogued in several books, such as Bergey's Manual of Determinative Bacteriology. From most environments, far less than 1% of all bacteria can be grown in the laboratory. With the development of new DNA isolation approaches, new sequencing technology, and new computational analyses, it became possible to unravel and decode the DNA of all of the microbes in any environment at once without growing any of them. Rather than the painstaking nurturing and culturing, plating, purification, and identification that had typified their research, microbial ecologists began homogenizing samples, busting everything open, and sequencing the DNA. |
| For example, with a sample from the ocean, about 20 liters of water is size-fractionated by filtration. Large filters are used to remove the debris and larger items such as sharks and fishes, allowing the microbes to pass through. Smaller filters are used to collect the microbes, allowing other contaminants and the viruses to pass through. These effluents may also be filtered with yet smaller filters, if desired, allowing all of the different biological entities to be collected based on their size. The microbial cells are broken open by enzymes, chemicals, pressure, or heat; and the cellular debris (essentially everything that is not DNA) is removed from the solution. That DNA becomes the raw material for the sequencing reactions. |
| As an aside, one of the main advantages that the new technologies provided was the ability to generate DNA sequence from raw DNA; unlike the approach used with Sanger sequencing, the pieces did not need to be captured in surrogate cells (a step biologists call cloning). With the old approaches, domesticated bacteria—often a very common bacterium called Escherichia coli (figure 4)—were used to grow the large volumes of DNA necessary for the sequencing reactions. The problem is that E. coli and other workhorse bacteria are picky about which DNA they like. If, for example, the DNA encodes a protein that is toxic, the bacteria will not grow. In the new approach there is no need for this step, and therefore a large source of bias has been removed: toxic genes are sequenced as efficiently as everything else. It is now generally assumed that the DNA that is sequenced is roughly proportional to its concentration in the environment, although that assumption remains to be proven. |
This approach (grinding up the cells, extracting the DNA, and sequencing it) is called either metagenomics or random community genomics to emphasize the abstract nature of the process. Skilled scientists working on soil samples can process a sample through the steps needed to go from corn field to computer file in as little as four days. The challenge now lies in the computational analysis of the data stream being generated. This is precisely where leadership-class computing facilities excel. |
The Questions Tackled by Metagenomics
Although every study has its own goals and focus, several key questions remain that are being addressed by metagenomics. The most fundamental question is: what is there? Biologists would like to know what species of microbes are present in each environment being studied. Just as biologists who study plants and animals want to describe the objects of their research, environmental microbiologists need that descriptive information, too. |
| Related to identifying the players in an environment, the sequence data also provide perspectives on how much of each organism is in the environment, or how it is organized. Biologists would like to know, for example, whether the environment is dominated by one or a few species, or if a lot of different organisms are present. Just as a very old forest may contain only a few species of trees, the rest being lost over time, a mature microbial community may contain only a few species. A very disturbed community, where the organisms are under some form of stress, may contain a greater range of microbes. For example, our intestines contain millions of microbes, almost all of which are essential for our health and well-being, aiding our digestion and providing essential amino acids. While we are healthy, our microbial community remains constant, with the predominant members dividing as quickly as they are being lost. If we take a course of antibiotics, however, the microbial community will be disturbed, many of the most abundant microbes will be killed off, and other microbes (some of which may cause harm) can start growing in their place. This is why an upset stomach often follows a short course of antibiotics and is usually resolved once the antibiotics are completed and the normal microbial populations can recover. |
 |
| R. Bizzoco, SDSU |
| Figure 4. E. coli—the model organism. Because it is common and is easily manipulated, E. coli has long been used in biotechnology studies to grow large volumes of DNA for sequencing. |
|
| A third question being tackled by metagenomics is concerned with what is going on in each environment: what is it doing? Part of the answer to this question comes from analysis of what is in each environment. If we can describe what microorganisms are there, we can infer what they are doing. It is becoming increasingly clear, however, how limited our understanding of microbial functions is when it is based just on the organisms that are present. By taking all of the DNA in the sample at once and identifying the protein functions that are present (see below), we are learning that different microbes are doing the same things and, surprisingly, that the same microbes have different roles in different environments. |
 |
| Source: R. Edwards, SDSU/ANL Illustration: A. Tovey |
| Figure 5. BLAST complexity of similarity searches for 10 trillion letters of DNA used to understand microbial metagenomes. |
|
Computational Comparisons for Identifying Community Members
DNA and proteins are, at their heart, strings of characters. For decades computational scientists developing bioinformatics algorithms have leveraged numerous string-comparison algorithms developed in other fields to identify similar DNA or protein sequences in different samples. The basic notion is that if a sequence is similar to something that is already known, and especially if it is similar to something someone working in a laboratory has studied, we can infer with reasonable confidence that our new sequence is performing the same function; that is, we can transfer the annotation from the original sequence to the new sequence. To address the question "what is there?", researchers use a combined, multistep approach. First, all instances of a special gene, called the 16S rRNA gene, are identified based on the similarity with a database of known sequences. This approach uses the heuristic BLAST method to quickly scan the sequences for likely instances of this gene (figure 5). Depending on the length of the query sequence, the size of the database, and the overall score, some alignments might occur just by the random chance of finding two somewhat similar strings of letters. Some of these are screened out by a statistical analysis, which provides an estimate of the likelihood that the alignment occurs at random; near matches, including some potentially spurious alignments, are then refined using a global alignment program. During this alignment process, the order of the sequences is usually shuffled to find the most parsimonious representation of the alignment. Once a significant and reliable alignment is generated, the number of differences between each of the sequences is counted. These differences are an approximate measure of the length of time since what biologists call a common ancestor. After a cell divides in two, each of the daughter cells begins a new lineage, accumulating changes over time.
These natural changes occur very slowly and are generally just mistakes in the DNA replication. On average, and in normal growth conditions, about one mistake is made during the replication of about 10 million letters of DNA. Over many generations, however, these mutations accumulate and can be used as a surrogate marker for the time since the cell divided in two. If two DNA sequences have very few changes, they are likely from cells that are closely related to each other. In contrast, if sequences have very many changes between them, it has probably been a long time since the original cell division. Although the topic remains controversial, most biologists consider two organisms to belong to the same species if their 16S genes are at least about 97% identical to each other. As the similarity between these sequences decreases, the organisms are less and less likely to be the same. Using this approach, scientists can identify the organisms in an environment from the raw DNA (figure 6). |
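Counting differences between two aligned 16S sequences and applying the 97% rule of thumb can be sketched as follows. The sketch assumes the two sequences have already been aligned to the same length by an alignment program, with "-" marking gaps. The function names and the choice to count a gap against one letter as a difference are illustrative assumptions, not a description of any specific tool.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions at which two sequences agree.

    Both inputs must be pre-aligned and of equal length, using "-" for
    gaps. Positions where both sequences have a gap are ignored; a gap
    opposite a letter counts as a difference (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def same_species(aligned_a, aligned_b, threshold=97.0):
    """Apply the roughly 97% identity rule of thumb for 16S genes."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Two aligned 100-letter sequences that differ at three positions score 97% and pass the threshold; a fourth difference drops them below it.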
 |
| Source: NASA Illustration: A. Tovey |
|
| Figure 6. Snapshot of the tree of life first proposed by Carl Woese and now accepted by scientists to reflect the likely evolutionary history of life on Earth. Bacteria, archaea, and small eucaryotes are small, single-celled organisms. The archaea and eucaryotes share a common ancestor that split off from bacteria approximately 3,000 million years ago. Bacteria radiated into the variety of forms that are studied today and are so influential to our lives, while eucaryotes evolved into plants, fungi, animals, and other multicellular organisms.
|
|
| Another computational approach taken to compare sequences is to repeat the alignment many hundreds or thousands of times, each time randomizing the order of addition of sequences to the build. Hence, no single early strong alignment dominates the remaining datasets. In this approach, the number of times two sequences are found to be near each other is counted and may be used either as a surrogate for time, or more routinely as support for a single assertion of time from one alignment. |
| A third approach used to categorize the sequences in a sample is to compare the frequency of different combinations of letters in the DNA sequence. Recall that there are 64 possible triplets that could be used to encode the 20 different amino acids that bacteria use. Since the earliest days of molecular biology it has been known that not all microbes use these triplets with the same efficiency or frequency. Some microbes favor a particular suite of triplets, although the reasons why are not fully understood. However, by counting the occurrence of single letters, pairs, triplets, and so forth in the DNA sequences, each fragment may be placed in an appropriate classification schema. Before analyzing the de novo data, training sets are developed by using known complete genomes and the classification of the organisms from which those genomes were derived. The test data are then compared to the training data to suggest the most likely placement of the unknown sequences in the classification schema. |
Two similar approaches classify fragments based on their raw nucleotide composition. The first, developed by researchers at the Max Planck Institute for Marine Microbiology, uses four consecutive letters (tetranucleotides); the second, developed by researchers at IBM Life Sciences, classifies sequences based on seven consecutive nucleotides (heptamers) using a support vector machine classification algorithm. |
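The idea of classifying fragments by their letter composition can be illustrated with a small Python sketch. It is not the method of either group mentioned above: instead of a support vector machine it uses a much simpler nearest-profile rule based on cosine similarity, and the function names, the toy training sequences, and the default of four-letter words are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(sequence, k=4):
    """Relative frequency of every k-letter word found in `sequence`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Similarity of two frequency profiles (1.0 means identical direction)."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def classify(fragment, training, k=4):
    """Return the label of the training sequence with the most similar profile.

    `training` maps a label (for example, a group of organisms) to a
    representative sequence, such as a known complete genome.
    """
    profile = kmer_profile(fragment, k)
    return max(
        training,
        key=lambda label: cosine_similarity(profile, kmer_profile(training[label], k)),
    )
```

With a toy training set such as `{"gc_rich": "GCGCGGCCGCGCGGCCGCGC", "at_rich": "ATATTAATATATTAATATAT"}`, the fragment `"GCGCGCGGCC"` is assigned to `"gc_rich"` because it shares four-letter words only with that profile. Real training sets are built from complete genomes, and longer words such as heptamers need far more sequence to estimate reliably.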
Mathematical Approaches for Organizing Community Members
Mathematical modeling has led to a greater understanding of how the microbes in a given community are organized. Like every other community, microbes are stratified with a very large number of a few species, fewer occurrences of more species, and a lot of species that are present in very low numbers. This organization can be represented graphically with a rank-abundance curve plotting the number of times a species is found on the y-axis, and ordering them from most abundant to least abundant on the x-axis. Since all the DNA is extracted from an environmental sample and then a proportion of it sequenced at random, the frequency with which sequences are identified should reflect the overall complexity of the environment. That is, the rank abundance of the species in the DNA sequences should reflect the rank abundance of the species in the environment. Rarely, however, is an entire sequence found more than once, and so an alternative species definition was developed to reconstruct the rank abundance curve from the sample. |
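A rank-abundance curve is straightforward to compute once each sequence has been assigned to a species. The following minimal sketch assumes that assignment has already been made and is supplied as a list of labels; the function name `rank_abundance` and the alphabetical tie-breaking are illustrative choices.

```python
from collections import Counter


def rank_abundance(observations):
    """Return (rank, species, count) tuples, most abundant species first.

    `observations` holds one species label per sequence found in the
    sample. Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(observations)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, species, n) for rank, (species, n) in enumerate(ranked, start=1)]
```

Plotting the rank on the x-axis against the count on the y-axis gives the curve described above.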
| Recall from the description of complete genome sequencing that adjacent pieces of DNA are joined into a contiguous stretch of sequence if they have 20 letters or more in common. It is very rare (1 in 4^20) that this would occur randomly. Recall also that, at the 16S level described above, about 97% identity is required for two sequences to be considered to come from the same organism. We can merge these two simple tenets to achieve a new definition of a species from a metagenome: we require at least 20 base pairs of greater than 97% sequence identity between two overlapping fragments. If we can identify all the times that these similar sequences occur in our sample, we can calculate the number of species present in the sample. Moreover, if three sequences overlap and fit our modified species concept, we know that they are more abundant than if only two sequences overlap and fit our new concept. Therefore, we can generate a rank-abundance curve based on similarity between sequences in our sample, without comparing them to any other sequence dataset. This is an exclusively intrinsic measure of the number and organization of the microbes in the sample as it does not require similarity to any external sequences or other data. |
 |
| Source: P. Morris Illustration: A. Tovey |
| Figure 7. Cycling of carbon, nitrogen, and phosphorus through the environment and into soil storage. The process is driven primarily by microbes. |
|
Several standard algorithms have been identified to map all of the sequences to each other in a sequence sample. Most of these were developed for complete genome sequencing where, as described earlier, the ultimate goal is the reconstruction of a single complete DNA fragment from small pieces. With minor modifications, we can reuse this code to estimate the number and abundance of different microbes in any given sample based on our modified species criteria. These rank abundance curves can be parameterized against a series of standard equations (for example, the power law or log-normal distribution), and by error minimization techniques we can estimate the rank abundance of the microbes in the original sample. Researchers at San Diego State University have generated a web interface for this analysis (http://biome.sdsu.edu). |
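Fitting a rank-abundance curve to one of the standard equations can be sketched for the power-law case. This is an illustration only, not the procedure used by the web interface mentioned above: it assumes the model abundance = a × rank^(-b), minimizes squared error on log-transformed values by ordinary least squares, and uses the hypothetical function name `fit_power_law`.

```python
from math import exp, log


def fit_power_law(abundances):
    """Fit abundance = a * rank**(-b) to counts sorted from most to least abundant.

    Uses ordinary least squares on log(rank) versus log(abundance),
    which is one simple form of error minimization. Requires at least
    two ranks and strictly positive abundances. Returns (a, b).
    """
    xs = [log(rank) for rank in range(1, len(abundances) + 1)]
    ys = [log(n) for n in abundances]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return exp(intercept), -slope
```

Fitting in log space weights every rank equally, so the many rare species influence the fit as much as the few abundant ones. Other error measures, or a log-normal model, would give different parameter estimates.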
Functional Annotation and Determining What the Microbes Are Doing
The accurate prediction, projection, and extension of the functions of different proteins between organisms and samples are perhaps the most demanding of the holy grails of bioinformatics. Subsystems classification has assisted this effort by providing a set of annotations that are both accurate and reliable, being curated by human experts and then compared across all microbial genomes computationally. These annotations are sorted into groups based on aggregating sets that perform a related function in the cell. |
| The process of annotating the DNA sequences begins with comparing the new sequences to those that have been seen before. The analysis uses the heuristic BLAST approach, ascribing a probability that each pair of sequences is similar. Once all the similarities have been found for a particular sequence, a judgment call is made to determine the function of the new sequence. In the annotation of complete genomes, where each sequence should have only a single instance of a given function, teams of expert biologists peruse the data and attempt to assign correct functions to the genes and proteins. In metagenomes, however, the volume of data vastly exceeds anything that a single curator can examine; hence, the new function is almost always simply taken to be the same function as that of the known sequence with highest probability of being the same. Computational biologists are developing some artificial intelligence approaches to assign a function from the subsystems, attempting to emulate the decisions that the experienced annotator would make. The subsystems approach provides many immediate advantages, including the fact that the built-in hierarchical classification can be leveraged for downstream analyses. |
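The best-match rule described above can be sketched in a few lines. This is an illustrative Python example under stated assumptions, not the pipeline used by the authors: the `Hit` record, the function name `best_hit_functions`, and the cutoff value are hypothetical, and each hit is assumed to carry an E-value of the kind BLAST reports, where a smaller value indicates a more significant match.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str              # identifier of the unknown metagenome sequence
    subject_function: str   # annotated function of the known sequence
    evalue: float           # smaller means a more significant similarity


def best_hit_functions(hits: Iterable[Hit],
                       max_evalue: float = 1e-5) -> Dict[str, str]:
    """Assign each query the function of its most significant hit.

    The cutoff max_evalue is an assumed, illustrative threshold.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # too weak to transfer an annotation
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: hit.subject_function for query, hit in best.items()}
```

Queries with no hit below the cutoff are simply left out of the result, which corresponds to leaving those sequences unannotated.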
 |
| R. Edwards, SDSU/ANL |
| Figure 8.
Ecoregions in the United States and soil sampling sites within them. There are 104 ecotypes that cover the lower 48 states, and so far 20 different sites have been targeted for sampling. Once these have been analyzed, samples will be collected from the remaining regions and sequenced.
|
|
| At this point high-performance computing becomes critical. Our accumulated knowledge of all proteins currently extends to 7,139,712 proteins comprising 2,503,057,853 letters. (Recall that there are more than three times as many DNA letters, since three DNA letters are used to encode one protein letter, and that some DNA letters do not encode proteins but are spacers, used to contain regulatory elements, and so on.) In order to identify putative functions within the sequences, the metagenome—the raw DNA sequence—is translated to the protein sequence computationally, and then the new, unknown sequences are compared to the known sequences using BLAST. This analysis is done in protein space because the protein sequence changes more slowly than the DNA sequence. Since more than one set of three DNA letters encodes for a single amino acid, the DNA can change but not alter the protein sequence—so-called silent or synonymous mutations. |
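To make the translation step and the idea of a silent mutation concrete, here is a small Python sketch. It builds the standard genetic code from its conventional compact listing and translates a single reading frame; the function name `translate` is hypothetical, and a real analysis would typically translate all six reading frames rather than the one shown here.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


original = translate("ATGGCTTAA")   # GCT encodes alanine
silent = translate("ATGGCCTAA")     # GCC also encodes alanine
changed = translate("ATGGATTAA")    # GAT encodes aspartate
```

Here `original` and `silent` are identical protein sequences even though the DNA differs by one letter, which is why comparisons in protein space are less sensitive to such changes.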
| In metagenomics analysis, the BLAST comparisons of all the sequenced fragments against the standard databases of known data are the most time-consuming task. Currently it takes about 100 CPU-hours to compare a single metagenome against known datasets and assign the functions. Of course, this process depends on the sizes of both the metagenome and the database and, as noted above, both of these are increasing with the deployment of next-generation sequencing machines. In an analysis of more than 10 gigabase pairs of DNA sequence comparisons (1 x 10^10 letters compared to a standard database of approximately 1.5 x 10^9 letters), researchers at Argonne National Laboratory showed that the compute time scales linearly and that with a sixteen-core architecture it takes approximately seven seconds to process 1,000 DNA sequence letters, or 7 x 10^7 seconds (about 810 days) to process the entire dataset. Once the comparison has been computed, the transference of annotation from known protein to unknown protein and the compilation of collections of proteins in different systems are relatively trivial computational tasks. Since these data are maintained in a series of relational databases, annotation is not a rate-limiting step. The Argonne databases are constructed to be agnostic of the underlying hardware and software profiles so that they can be rapidly deployed to new hardware as available. The machines used were a cluster of PowerPC G4 nodes and a few Intel quad-core PCs. |
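The compute-time estimate quoted above for the 10-gigabase analysis follows from simple arithmetic, reproduced in the Python sketch below. The function names are hypothetical, and the `workers` parameter assumes perfect linear scaling across machines, which is an idealization rather than a measured result.

```python
SECONDS_PER_DAY = 86_400


def serial_seconds(total_letters, seconds_per_1000_letters=7):
    """Time to process a dataset at the quoted rate of 7 s per 1,000 letters."""
    return total_letters / 1000 * seconds_per_1000_letters


def ideal_days(total_letters, workers=1):
    """Wall-clock days if the work divided perfectly across `workers` (assumed)."""
    return serial_seconds(total_letters) / workers / SECONDS_PER_DAY


dataset_letters = 10 ** 10                     # 1 x 10^10 letters
total_seconds = serial_seconds(dataset_letters)  # 7 x 10^7 seconds
total_days = ideal_days(dataset_letters)         # roughly 810 days
```

Under the idealized scaling assumption, spreading the same work across about 800 such machines would bring the wall-clock time down to roughly a day.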
|
These computations have radically altered our
view of the microbial communities around us. For
the first time microbiologists are gaining understandings
of the complex interplay between
micro-organisms in their natural environment.
Microbes are performing fascinating functions,
sharing substrates or reaction products, and not
acting in isolation. By comparing the functions in
different samples, a team from San Diego State
University, Argonne National Laboratory, and
their collaborators showed that the annotation
becomes predictive of the sample. For example,
when supplied with a set of DNA sequences from
an unknown sample, they can correctly assign it
to an environmental domain about 90% of the
time, based solely on predictions of what the
microbes are doing with their DNA. Moreover,
they showed that this was true both for the
microbes themselves and, unexpectedly, for the
viruses that prey on them. Although it is not yet
obvious what those viruses are doing, they appear
to be sampling their hosts' DNA and enriching the
most useful things. |
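One simple way to picture how a functional profile can predict the environment a sample came from is a nearest-profile rule, sketched below in Python. This is a swapped-in illustration and not the statistical method used by the team; the function names are hypothetical, and any subsystem names and counts supplied to it are the caller's own example values, not measurements.

```python
import math


def normalize(counts):
    """Convert raw subsystem counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


def distance(profile_a, profile_b):
    """Euclidean distance between two proportion profiles."""
    names = set(profile_a) | set(profile_b)
    return math.sqrt(sum((profile_a.get(n, 0.0) - profile_b.get(n, 0.0)) ** 2
                         for n in names))


def nearest_environment(sample_counts, reference_counts):
    """Return the reference environment whose profile is closest to the sample."""
    sample = normalize(sample_counts)
    return min(reference_counts,
               key=lambda env: distance(sample, normalize(reference_counts[env])))
```

The design choice here is to compare proportions rather than raw counts, so that samples sequenced to different depths remain comparable.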
Impacts of Metagenomics
Now we will take a look at some recent findings
selected from the vast range of different environments
that are being targeted by the new tools
that metagenomics provides. |
Soil Metagenomics
Microbes are fundamental constituents of our
soil, turning over the dead vegetable and animal
matter and returning the nutrients therein back
into the soil (figure 7). Microbial processes in the
soil promote carbon sequestration and contribute
to the approximately one petagram (1,000,000,000
metric tons) of CO2 equivalents sequestered by
soils annually. If computational biologists can
identify which processes are accumulating carbon
in the soil—not just converting atmospheric
CO2 to fixed carbon through photosynthesis, but
converting that fixed carbon to soil organic matter—
biologists may be able to enhance the deposition
of carbon in the soils. Microbes are also
chewing on the nitrogen-based fertilizers applied
to our crops. Typical corn-belt crops are supplemented
with 150 kilograms of nitrogen per
hectare and 50 kilograms of phosphorus per
hectare, adding 10-20¢ per liter to the cost of
corn-derived ethanol. Much of this fertilizer is not
directly used by the plants but is instead consumed
by the microbes and may be lost forever
from the plant food cycle. By identifying the
microbes that are using the ammonia fertilizers
and understanding how the nitrogen is being converted,
the amount of fertilizer that need be
applied to crops to achieve sustainable, energy-efficient
agriculture can be reduced. |
| As shown in figure 8, 104 different ecoregions
have been defined to cover the contiguous 48
states. It is not known how the microbes vary
between or within these regions, from the mature
forest of the Northeast, through the farmlands of
the Midwest to the arid desert of the Southwest. If
the microbes that promote healthy soils, reduce
the breakdown of ammonia, and enhance carbon
sequestration can be identified and encouraged,
the health of our nation can be promoted. In 2008,
Argonne National Laboratory established the terragenomics
project to further our understanding
of the soil microbial processes that govern carbon
sequestration, the turnover of fertilizers, and the
development of fertile agricultural land. "The project
has the ambitious goal of integrating metagenomics
sequence data with biogeochemical,
ecological, meteorological, satellite, and any other
pertinent data to understand the cycling of nutrients
through the terrestrial environment," says
Rick Stevens, associate laboratory director for
Computing, Environment, and Life Sciences at
Argonne. The project started modestly, with the
aim of sequencing representative microbial communities
from 20 different soil samples dispersed around the United States. These sites were chosen
because of the abundant metadata that
describe the locations and that will be integrated
into the analysis; each site is part of the National
Science Foundation National Ecological Observing
Network. These sites are also the target of
long-term ecological studies, and so a wealth of
additional data is available to support the
sequence-based analysis. In subsequent phases,
more sites that have been studied by other ecosystems
biologists will be sampled, and many of the
previous sites will be resampled to see how they
have changed over time and with different environmental
and nutrient treatments (figure 9). |
 |
| P. Morris |
| Figure 9.
The proportion of the five most abundant subsystems identified from a prairie site, two grasslands, and a soy field site. The bars show the frequency out of 100,000 sequences.
|
|
Biofuels
The conversion of plant matter to ethanol or
other fuels is hampered largely by a single factor:
plant cell walls are extremely resistant to degradation.
If such were not the case, all the plants in
the world would become food for the microbes,
and the plants would end up as a gooey mess! In
the current ethanol production pipelines, the cell
walls are broken down by a combination of high
temperature and high pressure, a process that
requires a lot of energy. Bacteria in a few, very
specialized environments have evolved to break
down plant cell walls and use the breakdown
products as energy. For example, cows have a
complex digestive system where plants are broken
down. By studying the microbes in the
rumen, scientists aim to identify the genes that
encode the enzymes that cleave the cellulosic
material that holds plants together. With genetic
engineering techniques, these enzymes could be
made in bioreactors, providing a cleaner, cheaper,
more fuel-efficient source of ethanol than the
current methods used to convert corn. Similarly,
at DOE's Joint Genome Institute researchers have
focused on termite hindguts, small microbial factories
that have been nurtured by the termites to
eat cellulosic wood, converting the recalcitrant
carbon into more usable sources. The termites
benefit from this relationship by using the microbial
fermentation products as an energy source.
Using metagenomics, researchers have identified,
cloned, and expressed several key microbial
genes that are responsible for the degradation of
wood to fermentative products. These isolated
products, or benign bacteria harboring them,
could potentially be used in the bioreactors that
convert corn to ethanol, resulting in cheaper and
more efficient ethanol production. |
|
Human Health
Metagenomics is also influencing human health.
Humans have about ten times as many microbes
as human cells, and they not only affect our acute
health through common diseases like diarrhea
or pneumonia, but also cause chronic diseases
such as cancer. The combination of an organism
(such as a person or a coral reef) and all associated
microflora is an emerging topic of study in
ecology; it is referred to as a holobiont, and the
genetic complement of host and microbes is
called the hologenome. Recently the metagenomics
approach was used to show that the
types of bacteria in the intestines might be
responsible for our girth. Obesity correlated with
an abundance of a type of bacteria that has the
ability to turn over energy more rapidly than
normal, and it may be that the bacteria are releasing
more energy from our food that is subsequently
consumed and stored as excess fat. A
plethora of diseases is associated with microbial
causes, but specific microbes may not cause
many of these. Rather, shifts in the microbial
communities may alter functions that microbes
are doing. In 2007 the National Institutes of
Health announced the Human Microbiome Project,
a large sequencing effort to describe all of
the microbes associated with people and to begin
to compare the presence and function of
microbes in healthy and diseased individuals.
This effort may ultimately lead to so-called personalized
medicine. In what may revolutionize
visits to family physicians, researchers will begin
by sequencing our genomes, and the genomes of
our microbes, to figure out what ails us. Then,
based on our genetic complement and known
drug-genome interactions, specific pharmaceuticals will be crafted specially to heal our wounds
without causing unwanted side effects with other
aspects of our metabolism. |
Computational Challenges
Metagenomics presents several problems for computational
science. Some are welcome, with easy
solutions; others will require more development and
deployment of computing infrastructure. One of
the primary challenges presented by these data is the
inherent linearity of the searches. Since each
sequence must be compared to the existing dataset,
there are few ways to avoid this linearity. Current
architectures achieve a throughput of approximately
1,000 DNA letters in about seven seconds. A standard
dataset contains approximately 1 x 10^8 letters.
On the other hand, sequence analysis is embarrassingly
parallel. The statistics of DNA sequence comparisons
are well understood, and the output of a
single comparison is independent of all other comparisons.
Therefore, the data can be cut and diced
into smaller and smaller datasets, ensuring that both
the query sequences and the known sequences
remain in memory and do not require disk access.
Of course, the problem with cutting the datasets
smaller then becomes the interconnect transfer rates
to place the known sequences on each compute
node. The standard program for comparing DNA
and protein datasets was published in 1990 (BLAST),
but researchers at DOE's Pacific Northwest National
Laboratory recently released a parallel version, called
ScalaBLAST, that takes advantage of global arrays
stored across a whole cluster to improve computation
time in part by reducing network latency. This
version results in a significant speedup compared to
traditional approaches (figure 10). |
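Because each comparison is independent of every other, the query set can be split into chunks, processed separately, and merged afterward without changing the result. The Python sketch below illustrates only that property. It is not ScalaBLAST or BLAST: the shared k-mer count is a toy stand-in for a real sequence comparison, all function names are hypothetical, and a thread pool stands in for the separate compute nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_shared_kmers(query, reference_kmers, k=4):
    """Toy comparison: how many k-mers of the query occur in the reference."""
    return sum(1 for i in range(len(query) - k + 1)
               if query[i:i + k] in reference_kmers)


def compare_chunk(chunk, reference_kmers):
    """Compare every query in one chunk; needs no data from other chunks."""
    return [(query, count_shared_kmers(query, reference_kmers)) for query in chunk]


def compare_all(queries, reference_kmers, chunk_size=2, workers=4):
    """Process chunks concurrently and merge results in the original order."""
    chunks = chunked(queries, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: compare_chunk(c, reference_kmers), chunks))
    return [pair for chunk_result in per_chunk for pair in chunk_result]
```

In this sketch every worker shares the same in-memory reference; on a real cluster, placing the known sequences on each node is exactly the interconnect cost discussed above.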
 |
| Source: PNNL Illustration: A. Tovey |
|
| Figure 10. Prefetching hides memory latency for distributed systems. In this figure, an approximately 50% boost in scaling is evident on MPP2 from prefetching, allowing ScalaBLAST to perform on the distributed-memory system (Linux cluster with dual 1.5 GHz Itanium-2, and Quadrics QSnet-II network) almost exactly the same as it does on the shared-memory system (SGI Altix, 1.5 GHz Itanium-2). Prefetching was implemented by creating local memory buffers that are being filled via remote direct-memory access calls (transparent to the user) while processing is taking place on the previously filled buffer. When the end of a buffer is reached, the one-sided memory operation is then checked for completion and the active buffer is swapped while a new prefetch operation is initiated. |
|
Another challenge for computational science is
the integration of these disparate datasets in a
meaningful manner. Moving toward the goal of
sequencing the microbes from soils around the
United States requires the infrastructure to store,
retrieve, and analyze not only the sequence data
but also all of the abstract data types that are collected.
As noted above, researchers are integrating
all scales of environmental analyses in the
terragenomics project—everything from the
molecular and genetic characterization of individual
proteins, through the viruses, bacteria,
archaea, and eukaryotes (including both eukaryotic
microbes such as yeasts and large plants such
as corn and soyabean), to ecological-scale interactions
between plants and animals, and even to
satellite-scale imagery. Integration and simulation
of these datasets will require exascale computing
facilities, far beyond those currently available.
Similarly, human health studies must consider
the health of the holobiont, integrating all of the
health history of the patient with the environmental
influences and the genomes of the
microbes that reside in and on us. Only then will
we be able to truly design medical services that
are tailored to our individual needs. |
Toward Practical Applications
Metagenomics is unveiling the complexity of
microbial communities that surround us and
affect our everyday lives. We are finally beginning
to understand what bacteria are doing and how
they are doing it. The implications for energy production
and consumption, agriculture, human
health and every other facet of our economy are
enormous, and the interdisciplinary melding of
computational sciences and biology, a field called
bioinformatics, is leading the way.
|
Contributors Dr. Robert Edwards, computational biologist
at Argonne National Laboratory and San Diego State
University |
Further Reading
Terragenomics Project
http://terragenomics.mcs.anl.gov/
E. A. Dinsdale et al. 2008. Functional metagenomic
profiling of nine biomes. Nature 452: 629-632. |
|