Transcriptome Assembly, No Reference Required!

Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

There’s no denying the impact of whole-genome sequencing for basic research and in the clinic. Reference genomes provide a scaffold upon which to map traits, trace evolutionary relationships, assess genetic diversity and more, and for researchers studying well-characterized organisms like Drosophila, mouse, rat and human, they are invaluable resources.

But for the vast majority of species, whole-genome sequencing is a nonstarter. Perhaps there are too few investigators interested in the organism to get the ball rolling. Or maybe it’s a matter of money. Whatever the reason, despite steep declines in the cost of next-gen sequencing, many researchers simply cannot justify reading and interpreting every base in their organism of interest.

Fortunately, they don’t necessarily have to. A growing number of researchers are leveraging RNA-seq data to produce a sort of low-rent, partial-genome assembly, a process called de novo transcriptome assembly. The result doesn’t capture every base, but it does provide information on expressed genes. All it takes is an RNA-seq dataset, some specialized open-source software tools and the bioinformatics chops to use them.

Transcriptome vs. genome assembly

Though a complete genome sequence contains every nucleotide, an argument could be made that if a researcher had to choose, transcriptome data is actually more useful. For one thing, transcriptomes are far less expensive to produce. They show which genes are actually on in a given tissue and their relative expression levels. And they reveal information about gene structure, polymorphisms and alternative splicing, only some of which a genome alone can provide.

“We’re not convinced that if we did a genomic study that we would get as much useful information,” says Keenan Amundsen, a turfgrass geneticist at the University of Nebraska at Lincoln, who recently published a transcriptome assembly and comparative analysis of two buffalograss cultivars [1].

Amundsen and his bioinformatics analyst, Michael Wachholtz at the University of Nebraska at Omaha, powered their study using a tool called Oases, an add-on to the genome assembler, Velvet.

The reason researchers need tools like Oases (as opposed to standard assemblers like Velvet) is that transcriptome data is fundamentally different from genomics data.

For one thing, genome assemblers assume that all sequences are represented at more or less equivalent levels. But that’s not true of transcriptomes, in which transcripts can differ in abundance by many orders of magnitude. Furthermore, while genome assemblers ideally try to create one contig per chromosome, a complete transcriptome will contain one for every transcript, representing different splice forms, start sites and so on. Finally, although DNA is double-stranded, RNA-seq datasets can be strand-specific, information that genome assemblers would essentially ignore.

“The complexity is to try to infer the set of original transcripts that generated the reads that you see in your experiment,” says Manuel Garber, associate professor and director of the bioinformatics core at the University of Massachusetts School of Medicine.

Assembly strategies

Transcriptome assemblers actually come in two flavors. Programs like Oases, SOAPdenovo-Trans and the popular Trinity (comprising modules entitled Inchworm, Chrysalis and Butterfly) are truly de novo—they don’t require alignment to a reference genome to figure out transcript structure, but rather rely only on RNA-seq data.

Other tools, including Scripture and Cufflinks, require alignment of RNA-seq datasets to a reference genome before they can assemble and identify transcripts—a strategy sometimes called ab initio.

“Ab initio is from the data itself,” explains Garber, who codeveloped Scripture. “This [Scripture] is a genome-guided method, as opposed to genome-independent methods like Trinity.”

Some researchers blend the two approaches, says Brian Haas, a senior computational biologist at the Broad Institute of MIT and Harvard, who codeveloped Trinity. The user’s particular strain or cell line might differ from the reference genome by the presence of a new gene, viral sequence or genome rearrangement, for instance—information that could be lost by narrowly focusing on a reference genome. In one case, Haas says, by sequencing the transcriptomes of mouse dendritic cells, his team identified a long protein-coding transcript homologous to the human genome that was absent in the mouse reference.

“We are trying to combine genome-guided and de novo approaches to extract as much information as we can,” he says.

Required resources

Cold Spring Harbor Laboratory assistant professor of quantitative biology Michael Schatz, who studies genome assemblers, says users who opt for transcriptome assembly over whole genomes should be aware that although the approach is budget-friendly, it is analytically challenging, thanks to issues like ploidy, alternative splicing and extended gene families.

If you’re determined to try, you’ll need two things. The first is hardware, particularly RAM. According to Haas, Trinity requires approximately 1 gigabyte of memory for every million reads it analyzes. Even a “reasonable-sized” dataset contains hundreds of millions of reads, which by that rule of thumb would call for hundreds of gigabytes of RAM, so Haas recommends doing some “in silico normalization” first to simplify the dataset.
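To make that rule of thumb concrete, here is a minimal Python sketch that turns a read count into a rough RAM estimate. The function names and the 4-lines-per-record FASTQ assumption are illustrative choices, not part of Trinity, and the result is a planning figure rather than a guarantee.

```python
import math

# Rule of thumb quoted by Haas: roughly 1 GB of RAM per million reads.
GB_PER_MILLION_READS = 1.0


def count_fastq_reads(lines):
    """Count records in an iterable of FASTQ lines.

    Assumes the common layout of exactly 4 lines per record.
    """
    n_lines = sum(1 for _ in lines)
    return n_lines // 4


def estimate_memory_gb(n_reads, gb_per_million=GB_PER_MILLION_READS):
    """Return a rough RAM estimate in whole gigabytes, rounded up."""
    return math.ceil(n_reads / 1_000_000 * gb_per_million)


if __name__ == "__main__":
    for n_reads in (20_000_000, 300_000_000):
        print(f"{n_reads:,} reads -> about {estimate_memory_gb(n_reads)} GB")
```

To apply it to a real file, pass an open file handle to `count_fastq_reads` and feed the result to `estimate_memory_gb`.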

Researchers can often access the necessary computing power through university servers, but there are off-site options, too, such as the Data Intensive Academic Grid, the Pittsburgh Supercomputing Center, the Broad Institute’s GenePattern and the Galaxy web server, all of which provide access to bioinformatics tools.

The other thing researchers invariably need is a good command of the UNIX command line. Bioinformatics tools generally lack pretty graphical user interfaces and are executed from a command prompt using instructions such as:

$TRINITY_HOME/Trinity.pl --seqType fq --single single.fq --JM 20G

Users may need to write custom programs to filter files and convert them from one format to another. And because these are open-source tools, documentation can be spotty, though tool developers and online communities often are quite responsive.
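As an illustration of the kind of small conversion program mentioned above, the following Python sketch converts FASTQ records to FASTA. It assumes the simple 4-lines-per-record FASTQ layout (no wrapped sequences), and the function name is hypothetical; established toolkits handle more edge cases.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to FASTA.

    Returns the number of records written. Raises ValueError if a
    record header does not start with '@'.
    """
    n_written = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        seq = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored in FASTA
        fasta_handle.write(">" + header[1:].rstrip("\r\n") + "\n")
        fasta_handle.write(seq + "\n")
        n_written += 1
    return n_written


if __name__ == "__main__":
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n")
    out = io.StringIO()
    fastq_to_fasta(demo, out)
    print(out.getvalue(), end="")
```

With real data, the two handles would be files opened for reading and writing, respectively.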

“For some students, the whole concept is beyond them,” says Wachholtz. “Everyone is used to a graphical user interface and using a mouse.”

Getting started

Daniel Zerbino, Ensembl Regulation project leader at EMBL-EBI, who codeveloped Oases with Marcel Schulz, suggests users filter reads for quality prior to attempting assembly, as that simplifies the computational burden (one option: Trimmomatic). He also suggests users “subsample” the dataset to see if it requires further manipulation because, say, some abundant transcripts are dominating the data. “You can filter that [out] and save yourself a tremendous amount of work and effort.”
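The two steps Zerbino describes, quality filtering and subsampling, can be sketched in a few lines of Python. This is not how Trimmomatic works (it also trims adapters and low-quality ends); it is a simplified stand-in that assumes Phred+33 quality encoding and 4-line FASTQ records, with illustrative thresholds and hypothetical function names.

```python
import random


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records.

    A trailing incomplete record, if any, is silently dropped.
    """
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\r\n"), seq.rstrip("\r\n"), qual.rstrip("\r\n")


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def passes_filter(record, min_mean_q=20, min_len=25):
    """Keep reads that are long enough and of sufficient mean quality."""
    _header, seq, qual = record
    return len(seq) >= min_len and mean_quality(qual) >= min_mean_q


def subsample(records, k, seed=0):
    """Reservoir sampling: a uniform random sample of up to k records
    from a stream, without loading the whole stream into memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir
```

A typical use would chain the pieces: `subsample((r for r in read_fastq(handle) if passes_filter(r)), k=100_000)`, then inspect the sample for over-represented sequences before committing to a full assembly.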

Zerbino also recommends trying multiple assemblers to ensure all transcripts are identified. In the paper describing Oases, for instance, his team compared Oases to Trinity and another assembler called Trans-ABySS [2]. “All three methods find fairly different sets of transcripts,” Zerbino notes, and no one tool found everything. “Doing comparisons and trying to pick the best [transcript assemblies] from as many methods as possible seems to be the best possible strategy.”

Indeed, Wachholtz used Velvet/Oases to build his team’s published buffalograss assembly but later repeated the work using Trinity. The latter, he says, was much simpler to use, as “Trinity is a plug-and-chug program,” and it produced what he believes to be a superior assembly. “When I assembled with Velvet, most reads mapped to multiple locations,” he says. (That is, they could not be assigned to a single transcript.) “When I assembled with Trinity, the majority of reads, 60% to 70%, uniquely mapped.”
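A uniquely-mapped percentage of the kind Wachholtz cites can be computed from any aligner's output once it is reduced to (read, transcript) pairs. The Python sketch below assumes that reduction has already been done; the function name and input format are hypothetical, and it is not a reconstruction of his actual pipeline.

```python
from collections import defaultdict


def unique_mapping_fraction(alignments, total_reads=None):
    """Fraction of reads that align to exactly one transcript.

    alignments:  iterable of (read_id, transcript_id) pairs.
    total_reads: optional denominator (all sequenced reads); if omitted,
                 the fraction is taken over reads with at least one hit.
    """
    targets = defaultdict(set)
    for read_id, transcript_id in alignments:
        targets[read_id].add(transcript_id)
    unique = sum(1 for hits in targets.values() if len(hits) == 1)
    denominator = total_reads if total_reads is not None else len(targets)
    return unique / denominator if denominator else 0.0


if __name__ == "__main__":
    pairs = [("r1", "t1"), ("r2", "t1"), ("r2", "t2"), ("r3", "t3")]
    print(f"{unique_mapping_fraction(pairs):.0%} of mapped reads are unique")
```

Comparing this number across assemblies of the same reads gives one rough indication of how redundant each assembly's transcripts are.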

To make sure everything is working as expected before diving into your RNA-seq data, Schatz recommends running test analyses against published results. “Try to reproduce an assembly from yeast so you get some experience with the files and output formats.” If that doesn’t work, he says, collaboration with a bioinformatician might be in order.

In fact, collaboration might make sense in any event. A transcriptome assembler is exactly that—it assembles transcripts. You’ll need other tools to assign function (try BLAST), measure abundance (RSEM or eXpress) and assess expression changes across tissues or conditions. Plus, there’s all the quality-control work required to ensure the results make sense.

“There is no ‘explain transcriptome’ button in Excel,” Schatz says.

But there are resources that can help. One excellent starting point: Haas and colleagues’ recent Nature Protocols article describing the de novo assembly process using Trinity, as well as downstream tasks such as expression analysis [3]. (See also a 2011 review in Nature Methods by Garber and Cufflinks developer Cole Trapnell [4].)

“We think of Trinity assembly as being the end of the beginning of de novo RNA-seq analysis,” Haas says, a sentiment that applies to all transcriptome assemblers.

References

[1] Wachholtz, M, et al., “Transcriptome analysis of two buffalograss cultivars,” BMC Genomics, 14:613, 2013.

[2] Schulz, MH, et al., “Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels,” Bioinformatics, 28:1086-92, 2012.

[3] Haas, BJ, et al., “De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis,” Nat Protocols, 8:1494-1512, 2013.

[4] Garber, M, et al., “Computational methods for transcriptome annotation and quantification using RNA-seq,” Nat Methods, 8:469-77, 2011.
