Analyze Your Sequence Data with Bioconductor

 Bioconductor for NGS Analysis
Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

It’s been 10 years since the first massively parallel, “next-generation” DNA sequencers came on the market, and to call their impact transformative would be a monumental understatement. Today’s ultrafast DNA-sequencing technology enables researchers to ask and answer questions they previously could only dream of. Human genetics, cancer biology, microbial diversity, epidemiology: these fields and more have been redefined by next-generation sequencing (NGS) technology that is faster, more data-rich and more affordable than ever before.

Yet the experiment doesn’t stop when the sequencing run ends. These days, the challenge isn’t obtaining a sequence, it is interpreting it. To wit: Illumina’s new HiSeq 4000 sequencer can produce 1.5 trillion bases of sequence—5 billion reads at 2 x 150 bases apiece—in 3.5 days. It simply isn’t possible to eyeball that volume of data and figure out what they mean.
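The arithmetic behind those figures is easy to verify. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative, and the per-day figure is a simple average that assumes constant output across the run.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
reads = 5_000_000_000        # 5 billion reads per run
bases_per_read = 2 * 150     # "2 x 150 bases apiece"
run_days = 3.5               # length of one run

total_bases = reads * bases_per_read
bases_per_day = total_bases / run_days   # simple average, assumes steady output

print(f"{total_bases:.2e} bases per run")
print(f"{bases_per_day:.2e} bases per day")
```

The total works out to 1.5 trillion bases, matching the quoted figure, or a little over 400 billion bases per day.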

“The language of biology has become computation,” says Sean Davis, a staff scientist at the Center for Cancer Research at the National Cancer Institute. “And the reason is that biology in general, and DNA sequencing in particular, has become a data science.”

Fortunately, biologists have responded with a fleet of commercial and open-source software packages to decipher those data, many of which are listed here. The software includes everything from cloud-based suites, like Illumina’s BaseSpace® and DNAnexus, to tool aggregators like GenomeSpace and Galaxy, to command-line toolsets such as the Broad Institute’s GATK and the Bioconductor project.

Bioconductor has become a popular choice for researchers looking to efficiently sift through the wealth of NGS and other types of experimental data they collect. Here's why.

About Bioconductor

As described in a recent perspective in Nature Methods, “Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology” [1].

Built atop the statistical programming language R—which gives scientists an interactive command-line interpreter—Bioconductor supports a variety of biological data types, including microarrays, flow cytometry, mass spectrometry and image data, and of course, DNA sequences. In all these cases, the software enables researchers to perform massive and sophisticated mathematical operations and to string together operations into complex pipelines, all with just a few lines of code.

“My own initiation into R was when I realized that 3,000 lines of C code could be expressed in six lines of R,” explains Martin Morgan, principal scientist in computational biology at the Fred Hutchinson Cancer Research Center in Seattle and head of the Bioconductor project.

Bioconductor offers similar clarity, says Davis, who is also a core Bioconductor developer. Just a few lines of code are required, for instance, to quantify the differential RNA expression between two RNA-Seq datasets. “The software becomes a way of encapsulating very complex ideas in simplicity,” Davis says.
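To give a rough sense of what "quantifying differential expression" involves, here is a deliberately simplified Python sketch. It is not Bioconductor code and not the method used by any Bioconductor package: it only scales raw counts to counts per million and takes a log2 ratio, whereas real analyses fit statistical models across replicates. The function names, gene names and counts are all made up for illustration.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so each sample sums to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """Per-gene log2(B / A) on CPM values; the pseudocount avoids log(0)."""
    cpm_a = counts_per_million(sample_a)
    cpm_b = counts_per_million(sample_b)
    genes = sorted(set(cpm_a) & set(cpm_b))  # only genes seen in both samples
    return {
        g: math.log2((cpm_b[g] + pseudocount) / (cpm_a[g] + pseudocount))
        for g in genes
    }

# Made-up counts for three hypothetical genes.
control = {"geneA": 500, "geneB": 300, "geneC": 200}
treated = {"geneA": 500, "geneB": 1200, "geneC": 300}
for gene, lfc in log2_fold_changes(control, treated).items():
    print(f"{gene}: {lfc:+.2f}")
```

Note that geneA has the same raw count in both samples yet comes out with a negative fold change, because the treated sample was sequenced twice as deeply. Normalizing for sequencing depth is one of the details that dedicated packages handle for the user.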

Nearly 23,000 articles in PubMed cite Bioconductor packages, according to the Nature Methods perspective [1]. And according to Davis, the suite is downloaded nearly 15,000 times per month. Thus, says Davis, it should be easy for newbies to find local help when needed. “There’s probably someone down the hall using Bioconductor.”

Kasper Daniel Hansen, assistant professor of biostatistics at the Bloomberg School of Public Health at Johns Hopkins University, uses and develops Bioconductor tools in his epigenetics research. Bioconductor, he says, offers “high-quality graphics, high-quality statistical routines, [and] a lot of care is taken to make this a language that’s usable for day-to-day data analytics.” Plus, he says, the tool sports “a vibrant user community and user base. I like to write software in a language people will use.”

Another advantage, says Hansen, is flexibility. Although graphical user interface-based software may be easier to manipulate, Bioconductor enables researchers to tweak their workflows to accommodate changing experimental conditions. “In my day-to-day work, no two analyses are always the same,” he explains. “There always are custom things that have to be done. … I never can use something off the shelf.”

Getting Started

R is compatible with Windows, MacOS and Linux computers; therefore, so is Bioconductor. Users typically install R, followed by a “core” set of Bioconductor packages, which they then supplement with more specialized packages, as needed.

According to Davis, all Bioconductor packages provide at least two levels of documentation. The first is detailed documentation of the algorithms (functions) the package supports. The second is a set of working “vignettes”—workflow examples that demonstrate how to actually apply those functions to user data. Oftentimes, says Davis, packages will include sample datasets to ensure researchers are using the packages correctly.

At present, users can choose from some 934 Bioconductor packages. Among the most popular for DNA-sequence analysis are Rsamtools (for reading sequence-alignment data into R), edgeR (for differential-expression analysis), GenomicRanges (for manipulating sequence features) and AnnotationHub (for downloading sequence data from online resources). Users also have created packages to encapsulate such popular external tools as Cufflinks and Bowtie (CummeRbund and Rbowtie, respectively) and to expose Bioconductor features in the online analysis tool Galaxy (RGalaxy).

According to Hansen, GenomicRanges is particularly powerful: “It implements an algebra” that enables users to ask questions like, Given a list of 80 million single-nucleotide polymorphisms (SNPs) and a list of gene promoters, which SNPs are contained in the promoters and might be likely to affect gene expression? “I use that every day, and it has transformed the way I do my work,” he says.
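The kind of question Hansen describes comes down to an interval-overlap query. The Python sketch below illustrates the idea only; it is not the GenomicRanges API, and the function name, the tuple layout and the toy coordinates are assumptions made for this example. It merges the promoter intervals on each chromosome, then uses binary search to test each SNP position.

```python
from bisect import bisect_right
from collections import defaultdict

def snps_in_promoters(snps, promoters):
    """Return the SNPs that fall inside at least one promoter interval.

    snps: iterable of (chrom, position) tuples.
    promoters: iterable of (chrom, start, end) tuples, inclusive coordinates.
    """
    # Group promoters by chromosome, then merge overlapping intervals so each
    # chromosome ends up with a sorted list of disjoint intervals.
    by_chrom = defaultdict(list)
    for chrom, start, end in promoters:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])

    hits = []
    for chrom, pos in snps:
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        # Locate the last interval that starts at or before the SNP position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= ends[i]:
            hits.append((chrom, pos))
    return hits

# Toy data: three promoter intervals and four SNPs.
promoters = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
snps = [("chr1", 120), ("chr1", 301), ("chr2", 50), ("chr3", 10)]
print(snps_in_promoters(snps, promoters))
```

Because each lookup is a binary search over sorted intervals, the cost grows with the number of SNPs times the logarithm of the number of promoters, which is what makes queries over tens of millions of positions practical.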

Combining GenomicRanges with AnnotationHub, users can answer even more sophisticated questions. Recently, for instance, Hansen had a list of genomic regions that displayed “interesting patterns” of DNA methylation. “I wanted to find out, are these regions enriched in any transcription-factor binding sites?” Using AnnotationHub, he downloaded 500 epigenetic “tracks” from the University of California, Santa Cruz (UCSC) Genome Browser—a subset of the output of the National Institutes of Health (NIH) ENCODE project—and compared those datasets against his methylation data. “That represents millions of dollars of [research] investment, and I could complete this entire task in two hours on my laptop,” he says.

Users interested in learning more about Bioconductor have several options. The Bioconductor website offers extensive resources, including course materials from training sessions and conferences (one is scheduled for July in Seattle); common workflow guides and vignettes; video tutorials; and a well-trafficked, Stack Overflow-like support site containing years’ worth of archived questions and answers. “Literally, the knowledge of a decade and a half and several thousand people are there,” says Davis.

With its low bar for entry and high-level functionality, Bioconductor “insulates” researchers from the underlying statistical issues they are trying to solve, Morgan says. But it also exposes users to the “statistical ‘opinions’ of the group that contributed the package in the first place”—opinions that are not necessarily widely shared or applicable to your particular research questions.

When in doubt, it never hurts to double-check your work—and to speak with an expert. “It behooves the analyst, who spent tens of thousands of dollars and potentially years of work generating the data, to be responsible in the analysis,” Morgan warns. “And that probably involves consulting with a statistician.”

Reference

[1] Huber, W, et al., “Orchestrating high-throughput genomic analysis with Bioconductor,” Nat Meth, 12:115-21, 2015. [PubMed ID: 25633503]

Image: Jeffrey M. Perkel
