Metagenomics Software-Interpreting All the Data


Some 3.2 million 16S ribosomal RNA (rRNA) sequences have been logged to date by the Ribosomal Database Project (RDP) at Michigan State University. Each represents a single bacterium or archaeon. And yet, the vast majority of microbial life on Earth remains undocumented and uncultivable.

That presents a problem for the microbiologists who would probe the diversity and dynamics of microbial populations both in the environment and in human health—yet not an insurmountable one. These days, researchers have at least two options. They can study microbes cell by cell (single-cell genomics) or at the population level (metagenomics).

Metagenomics studies generally take one of two forms. Researchers either survey the diversity of the population by sequencing and analyzing a single common but highly variable gene, typically 16S rRNA, or they sequence all the DNA from the environment and try to reconstruct the population by assembling the genomes of the organisms that make it up. The latter strategy is called shotgun metagenomics.

Each has its pros and cons. But whichever approach users take, there exists a rich and growing toolset of analytical software to help.


16S rRNA analysis

Of the two metagenomics approaches, 16S rRNA (or, more generally, targeted metagenomics) is by far the easier.

“16S is like an industry-standard way of doing metagenomics analysis in a quick-and-dirty way,” says Prateek Kumar, product manager for Thermo Fisher Scientific’s Ion Reporter software, a cloud- or server-based package that can perform this analysis.

The process involves sequencing all or part of the 16S ribosomal RNA gene and comparing the resulting sequences against a database of sequences from known taxonomic groupings. Then, by essentially counting and clustering the reads into “operational taxonomic units” (OTUs)—each of which is a computational representation of a single species—researchers can determine which organisms were present in the sample and in what amount.
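The counting-and-clustering step can be illustrated with a minimal Python sketch. It is not the method of any package named in this article: it assumes reads that are already aligned to the same length, uses a simple greedy scheme in which each read joins the first existing cluster it matches, and the function names (identity, pick_otus, relative_abundance) are hypothetical. A 97% identity cutoff is a common convention for species-level OTUs, but the threshold is left as a parameter.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_otus(reads, threshold=0.97):
    """Greedy clustering: a read joins the first centroid it matches,
    otherwise it founds a new OTU. Returns (centroids, read counts per OTU)."""
    centroids = []
    counts = Counter()
    for read in reads:
        for otu_id, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                counts[otu_id] += 1
                break
        else:
            # No existing centroid was close enough: start a new OTU.
            centroids.append(read)
            counts[len(centroids) - 1] += 1
    return centroids, counts


def relative_abundance(counts):
    """Convert per-OTU read counts into fractions of the whole sample."""
    total = sum(counts.values())
    return {otu_id: n / total for otu_id, n in counts.items()}


if __name__ == "__main__":
    # Toy 10-base reads; the threshold is lowered so one mismatch is tolerated.
    toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    centroids, counts = pick_otus(toy_reads, threshold=0.9)
    print(centroids, relative_abundance(counts))
```

In practice the result of greedy clustering depends on read order, which is one reason production pipelines sort reads by abundance or match them against a reference database first.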

Beckman Coulter Genomics, a service provider based in Danvers, Mass., has run some 200 metagenomics projects in the three years it has been offering that service, says staff bioinformatician Tom Moloshok. Most of those jobs, he says, involved taxonomic analysis based on 16S rRNA: “It really comes down to a question of species diversity or genus diversity within samples.”

According to Moloshok, his company uses a custom pipeline built from third-party software. First, the raw sequence reads are assembled into “contigs” using MIRA (Mimicking Intelligent Read Assembly), providing a “collapsed set” of species present in the sample. The generated set is then searched against a reference sequence database using BLAST (Basic Local Alignment Search Tool). Finally, the results are grouped into discrete taxonomic bins using MEGAN (MEtaGenome ANalyzer, currently at version 5).

Also supporting the 16S workflow is QIAGEN, which in June launched the CLC Microbial Genomics Module, a software plug-in for its CLC Genomics Workbench and CLC Genomics Server products.

QIAGEN aims to guide customers “from sample to insight,” says Arne Materna, global product manager at QIAGEN Bioinformatics. The software transforms raw sequence reads into OTUs and guides users through statistical analysis in just four steps. “The output is very visual and interactive,” Materna says. “You can look at individual communities and adjust the taxonomic resolution to any level you want.”

The software also enables users to assess the diversity within a given sample (‘alpha diversity’) and, importantly, to compare diversity between samples (‘beta diversity’). Among other applications of such analyses, Materna says, criminalists can use population diversity to compare, say, soil samples with a specimen recovered from a crime scene. In fact, the company supplies a reference dataset and tutorial to walk new users through exactly that process. QIAGEN Bioinformatics plans to add whole-metagenome analysis to the software next year.
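To make the two diversity measures concrete, here is a minimal Python sketch. It uses two widely used statistics chosen for illustration, the Shannon index for within-sample diversity and Bray-Curtis dissimilarity for between-sample comparison; the article does not say which measures the QIAGEN module computes, and the function names and the taxon counts in the demo are hypothetical.

```python
import math


def shannon_index(counts):
    """Alpha diversity: Shannon index H = -sum(p * ln p) over taxa with reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)


def bray_curtis(sample_a, sample_b):
    """Beta diversity: Bray-Curtis dissimilarity between two count tables.
    0 means identical composition, 1 means no taxa in common."""
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    taxa = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / total


if __name__ == "__main__":
    # Hypothetical per-taxon read counts for two soil samples.
    crime_scene = {"Bacillus": 30, "Streptomyces": 50, "Pseudomonas": 20}
    suspect_shoe = {"Bacillus": 25, "Streptomyces": 55, "Rhizobium": 20}
    print(shannon_index(crime_scene), bray_curtis(crime_scene, suspect_shoe))
```

A low Bray-Curtis value between two samples suggests similar communities, which is the kind of comparison the soil example above relies on.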

16S rRNA analysis is also possible using Pacific Biosciences sequencing data, says PacBio chief scientific officer Jonas Korlach. He explains that the company’s long reads simplify both 16S and shotgun-metagenomics approaches. Using PacBio’s circular consensus sequencing (CCS) workflow, for instance, researchers can sequence the entirety of the 16S rRNA gene to high confidence, he says—something that isn’t possible with short-read technologies because of the gene’s length. On the shotgun-metagenomics side, PacBio’s long reads simplify genome assembly. And when combined with CCS, they provide high-resolution information on gene operons and genomic structure, he adds.

The company has released a set of tools called rDNATools on GitHub. Alternatively, researchers can upload sequences to the Chun Lab web portal in Korea and run their analyses that way.

With that web portal, “the user can submit Pacific Biosciences reads and get a very detailed picture of the phylogenetic tree and the composition of these populations,” Korlach says.

Shotgun metagenomics

Tools are also available for those tackling whole-metagenome assembly.

Kraken, for instance, uses a database of whole reference genomes—not simply 16S rRNA sequences—to sort whole-metagenome data into related bins, explains Derrick Wood, a post-doctoral fellow in computer science at Johns Hopkins University who helped develop the software as a graduate student. “For any given input sequence … Kraken’s goal is to give that sequence a taxonomic label,” he explains.

Among other applications, he says, researchers can use the software to filter out reads that are not of interest. For instance, a dataset from a study of host-microbiome interactions could be filtered to remove host sequences.
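The idea of labeling each read against reference genomes, then discarding unwanted reads, can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not Kraken's algorithm or interface: it indexes only the short subsequences (k-mers) that are unique to one reference, labels a read by majority vote, and ignores the taxonomic tree entirely. The function names, the toy sequences, and the "host" label are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer found in exactly one reference to that reference's label."""
    seen = {}
    for label, genome in references.items():
        for kmer in kmers(genome, k):
            seen.setdefault(kmer, set()).add(label)
    # Ambiguous k-mers (shared by several references) are dropped in this sketch.
    return {kmer: next(iter(labels))
            for kmer, labels in seen.items() if len(labels) == 1}


def classify(read, index, k):
    """Label a read by majority vote of its indexed k-mers; None if no k-mer hits."""
    votes = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def drop_host_reads(reads, index, k, host_label="host"):
    """Keep every read that is not classified as host, including unclassified ones."""
    return [read for read in reads if classify(read, index, k) != host_label]


if __name__ == "__main__":
    refs = {"host": "AAAACCCCGGGG", "microbe": "TTTTGACTGACT"}
    index = build_index(refs, k=4)
    print(drop_host_reads(["AAACCCC", "TTGACTG", "GTGTGTG"], index, k=4))
```

Keeping unclassified reads is a design choice here: in a host-microbiome study, a read that matches nothing in the database may still come from an uncharacterized microbe.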

According to Wood, Kraken requires significant computational resources—“at least 100 GB or more of RAM available.” But a stripped-down version, called MiniKraken, also is available, he says. “We get 70% to 90% of the sequence sensitivity of the full database, but it’s available to anyone with 4 GB RAM, which means I can run it on my laptop.”

Another approach to shotgun assembly involves Hi-C, a chromosome conformation capture technique.

Christopher Beitel, a graduate student at the University of California, Davis, Genome Center who published a report on this approach in 2014 [1], explains that Hi-C reveals “what pieces of DNA are nearby other pieces of DNA.” For eukaryote genomes, such data reflect chromosomal looping. But with bacteria, the data can be used to determine which pieces of DNA came from the same cell, including chromosomal and extrachromosomal sequences.

Metagenomics, Beitel says, is like trying to reassemble a set of phone books that have been put through a shredder. Applying Hi-C simplifies that problem by metaphorically coloring each book’s pages a different color first. “We are attempting to create a signal of cellular co-localization of DNA fragments before those [cells] are lysed,” he says.
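In code, the "coloring" amounts to grouping assembled fragments that share enough Hi-C contacts. The Python sketch below is only an illustration of that idea, not the analysis published in [1]: it assumes contact counts between pairs of contigs have already been tallied, treats any pair with at least min_links contacts as coming from the same cell, and merges linked contigs with a union-find structure. The function name, the threshold, and the contig names are hypothetical.

```python
def bin_contigs(contigs, contacts, min_links=5):
    """Group contigs into bins using thresholded Hi-C contact counts.

    contacts is an iterable of (contig_a, contig_b, link_count) tuples.
    Returns a list of sets, one per inferred cell of origin.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the bin representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, links in contacts:
        if links >= min_links:
            parent[find(a)] = find(b)  # merge the two bins

    bins = {}
    for c in contigs:
        bins.setdefault(find(c), set()).add(c)
    # Sort bins by their member names so the output order is deterministic.
    return sorted(bins.values(), key=lambda members: sorted(members))


if __name__ == "__main__":
    contigs = ["chrA_1", "chrA_2", "plasmidA", "chrB_1"]
    contacts = [
        ("chrA_1", "chrA_2", 40),    # same chromosome
        ("chrA_2", "plasmidA", 12),  # plasmid carried in the same cell
        ("chrA_1", "chrB_1", 1),     # background noise, below threshold
    ]
    print(bin_contigs(contigs, contacts))
```

Because the plasmid contig is linked to a chromosomal contig, it lands in the same bin, which mirrors the point that Hi-C data can tie extrachromosomal sequences to their host cell.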

And there are, of course, other tools. One popular option is MetAMOS, “a modular and customizable framework for metagenomics assembly and analysis,” according to a publication describing its capabilities [2]. Another popular choice is QIIME (pronounced “chime”), an open-source bioinformatics pipeline that is “designed to take users from raw sequencing data generated on the Illumina or other platforms through publication quality graphics and statistics,” according to the software’s home page. For those using Pacific Biosciences data, there’s the company’s HGAP tool for whole-metagenome assembly, Korlach says.

With so many options available, unraveling the metagenome is easier than ever. But microbiologists—and rRNA database curators—needn’t worry: With so many bacteria remaining to be discovered, there should be plenty of work to go around for years to come.


References

[1] Beitel, CW, et al., “Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products,” PeerJ, 2:e415, 2014. 

[2] Treangen, TJ, et al., “MetAMOS: A modular and open source metagenomics assembly and analysis pipeline,” Genome Biology, 14:R2, 2013.
