Transcriptome Analysis Using RNA-Seq

Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

When you get right down to it, the difference between a skin cell, say, and a kidney cell is a matter of gene expression. All cells in an organism carry essentially the same DNA; it’s the proteins they produce that define their behavior. The instructions to build those proteins are carried by RNA, and researchers have long recognized the value of probing RNAs to gain insight into the expression differences that define tissues, developmental stages and disease.

RNA-Seq vs microarray

Just a few years ago, researchers who wanted to get a 30,000-foot overview of the transcriptional state of a cell—the so-called “transcriptome,” or cellular RNA content—had one option: DNA microarrays. But the rise of next-generation DNA sequencing (NGS) technologies, coupled with plummeting prices, has shifted the technology landscape.

Today, transcriptome analysis is performed most commonly using an NGS application called RNA-seq, in which some RNA pool—total RNA, messenger RNA or noncoding RNA, for instance—is reverse-transcribed into cDNA, converted into a sequencing library, sequenced and analyzed.

The technique offers several advantages over DNA microarrays, says John Marioni, research group leader at the European Bioinformatics Institute on the Wellcome Trust Genome Sciences Campus in Cambridge, UK. Most obviously, RNA-seq works even for species for which no reference genome or DNA microarray exists. Microarrays cannot be built without at least a partial genome sequence and some understanding of what sequences the researcher is looking for. And microarray manufacturers produce chips mostly for the classic laboratory models: Drosophila and C. elegans, mouse and rat.

“If you want to look at organisms way down the evolutionary ladder, like sponges or marine mollusks, there’s no way to do that with arrays,” Marioni says.

In contrast, RNA-seq is unbiased. It reads whatever cDNA is in the sample, regardless of whether researchers have seen that DNA before or not.

Marioni, a statistician and computational biologist who develops tools for analyzing RNA-seq data, has been using the technique since 2008. This year he co-authored a paper in which he applied it to genetic differences and variation among 16 mammalian species, including 11 non-human primates, of which seven had “little or no genomic data . . . previously available.” [1]

His goal, he says, is to create tools that can turn raw data into biological insights. “The idea is you get counts of transcripts from primate livers, and you want to develop models to take in count data and get biological inferences out, so you can know these are not chance events and there is meaningful data in the numbers you’ve analyzed,” Marioni explains.

RNA-seq has other advantages, too. It offers a wider dynamic range than microarrays and can generally detect less abundant transcripts. And unlike microarrays, which report relative expression values based on fluorescence intensity, RNA-seq can report those abundances absolutely, because it counts the transcripts that it reads. Finally, RNA-seq can reveal transcript structure and splicing and can even identify novel isoforms, gene fusions, allele-specific variants and the like.

Naturally, given its growing popularity, tools for performing RNA-seq are widely available, and more are coming to market. Whether it’s sample preparation on the front-end or bioinformatics analysis on the back-end, you’re sure to find a tool to fit your needs.

RNA-Seq sample preparation

In comparing RNA-seq to other next-gen sequencing applications, the primary difference, says Jeremy Preston, director of product marketing at Illumina, is, well, the RNA. “You can’t sequence raw RNA. You have to convert it to DNA first. That’s the key step that makes [RNA-seq] different in terms of the assay.” Once you have that cDNA, it’s just like any other sample, Preston says, which then feeds directly into your sequencer’s library preparation protocol.

Illumina’s TruSeq RNA Sample Preparation Kits, for instance, prepare sequencing libraries from total RNA. The kits enable “indexing” (i.e., barcoding) of up to 24 samples per lane, meaning up to 384 samples can be processed per HiSeq 2000 run (24 samples per lane, multiplied by 16 lanes, equals 384). New indexing reagents that will increase multiplexing to 96 samples per lane are in development, Preston says.

For typical expression profiling studies, researchers collect 10 million to 20 million reads per sample in a transcriptome analysis, so there’s room for at least 100 samples in a full HiSeq run (3 billion reads), says Preston. For more in-depth analysis, for instance to identify novel transcripts or rare noncoding transcripts, users might dedicate 50 million to 100 million reads per sample (still enough for two samples per Illumina lane), “but that’s real edge-of-the-curve research,” he says.
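The arithmetic behind those figures can be checked in a few lines. The sketch below is illustrative only: the function names are invented for this example, the inputs are the round numbers quoted above (16 lanes, 24 indexes per lane, roughly 3 billion reads per run) rather than instrument specifications, and it assumes reads are shared evenly among samples.

```python
def samples_by_index(indexes_per_lane, lanes):
    """Barcoding limit: one sample per index in every lane."""
    return indexes_per_lane * lanes


def samples_by_depth(reads_per_run, reads_per_sample):
    """Depth limit: whole samples that fit into one run's read budget."""
    return reads_per_run // reads_per_sample


def reads_per_sample(reads_per_run, lanes, samples_per_lane):
    """Reads each sample receives when every lane is shared equally."""
    return reads_per_run // (lanes * samples_per_lane)


LANES = 16                      # lanes per HiSeq 2000 run, as quoted above
READS_PER_RUN = 3_000_000_000   # approximate reads per full run, as quoted above

print(samples_by_index(24, LANES))                   # 384
print(samples_by_depth(READS_PER_RUN, 20_000_000))   # 150
print(samples_by_depth(READS_PER_RUN, 10_000_000))   # 300
print(reads_per_sample(READS_PER_RUN, LANES, 2))     # 93750000
```

By depth alone, a run holds 150 to 300 expression-profiling samples, comfortably above the "at least 100" quoted, and two samples sharing a lane each receive about 94 million reads, near the top of the 50 million to 100 million range.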

Illumina purchased Epicentre Biotechnologies in 2011 and through that subsidiary, the company offers additional RNA-seq tools, including the Ribo-Zero™ ribosomal RNA removal kits, which can boost sensitivity by removing highly abundant but irrelevant transcripts from your sample, and ScriptSeq™ Complete Kits “for seamless, end-to-end preparation of RNA-Seq libraries in [one] day,” according to company literature.

Other sequencer vendors also provide reagents. Offerings include Life Technologies’ Ion Total RNA-Seq Kit v2, which “contains everything needed to construct representative cDNA libraries for strand-specific RNA sequencing on the Ion PGM™ Sequencer,” according to product literature. Roche Applied Science does not offer a dedicated RNA-seq kit for its 454 GS FLX and GS Junior systems, says marketing manager Clotilde Teiling, but the company does have a cDNA Synthesis System Kit that can be used to convert RNA into DNA for library preparation.

Third-party tools, such as the SureSelect RNA Capture kit from Agilent Technologies, are also available.

NGS sequencer hardware

Fortunately, when it comes to RNA-seq, the technology is somewhat platform-agnostic. Each of the “Big Three” sequencing vendors, Illumina, Roche/454 and Life Technologies, supports the application on its hardware, and clients of service provider SeqWright can choose to run their analyses on any of them. The companies’ product offerings include the 454™ Titanium and GS-FLX+ from Roche/454; the HiSeq™ 2000 and MiSeq™ from Illumina; and the SOLiD™ 4, the 5500xl and the Ion Torrent PGM™ from Life Technologies.

“Each platform has its own advantages and disadvantages,” explains Adam Pond, a marketing associate at Houston-based SeqWright. “For a project that’s larger, with multiple pooled samples, you would want a platform like Illumina HiSeq, whereas for bacterial transcriptomes you might want Ion Torrent, which gives you the most data at the best cost. But every platform is capable of RNA-seq.”

Marioni uses the Illumina Genome Analyzer IIx in his work. “It’s the technology you would automatically think of doing RNA-seq data [collection] on,” he says. In part, that’s because the platform collects so many reads that it’s possible to read very deeply into the transcriptome. Indeed, Teiling advises those 454 customers who simply want to count transcripts to use Illumina’s sequencers or even DNA microarrays, the latter of which are “still the experiment of choice for measurements of mRNA levels.”

But RNA-seq accounts for a significant share of the user base on 454’s GS FLX, as well, she says, especially for non-model organisms for which researchers want to use transcriptome data to build rough genome assemblies. That’s because, at about 700 bases in length, 454 reads are long enough (albeit fewer in number than those from competing platforms) to easily align to a reference genome, if available, or to each other. The resulting “isotigs” (RNA contigs) can be used to probe transcript structure, detect allelic differences and identify novel splice variants, polymorphisms and fusions, Teiling says.

In one 2011 study, for instance, researchers at 454 (including Teiling) and Cornell University used the GS FLX Titanium platform to compare tame and aggressive silver foxes—a mammal for which the genomic DNA sequence was not available—to each other and the domesticated dog. In the process, they identified “over 30,000 high-confidence fox-specific SNPs, fox orthologs of over 14,000 dog genes, and yielded insights into potentially important differences in expression of genes in the pre-frontal cortex between tame and aggressive foxes.” [2]

Analytical tools

Of course these days, it’s not so much the sequencing that’s difficult, but the analysis, and RNA-seq neophytes could face significant hurdles, says Marioni. For one thing, the sheer number of free and commercial analytical tools can be dizzying. One step, aligning reads to a reference genome, is represented by “at least 60 algorithms,” he says.

Complicating matters, RNA-seq data are not like other sequencing data. When sequencing genomic DNA, the aim is typically genome assembly or variant discovery. But with RNA-seq, the goal is usually transcript counting. First, though, you must align the reads to a reference, and there are two distinct approaches, he says: aligning to a reference genome or aligning to a transcriptome. With the former, reads that span splice junctions map to stretches of the genome separated by introns, so your analytical software needs to be capable of handling that.
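Why splice junctions complicate genome alignment can be seen in a toy example. The sketch below uses made-up sequences and a naive exact-match search, not a real aligner, and the helper name `spliced_match` is invented for illustration. A read that spans an exon-exon junction matches the spliced transcript in one piece, but in the genome it can only be placed as two pieces separated by the intron.

```python
EXON1 = "ATGGCCTTAGC"
INTRON = "GTAAGTTTTTCCCAG"
EXON2 = "GATTCAGGCTAA"

GENOME = EXON1 + INTRON + EXON2   # the intron is still present in genomic DNA
TRANSCRIPT = EXON1 + EXON2        # the intron has been spliced out of the mRNA

# A read covering the last 6 bases of exon 1 and the first 6 bases of exon 2.
read = EXON1[-6:] + EXON2[:6]


def spliced_match(read, genome, min_anchor=4):
    """Place a read on the genome as two exact pieces with a gap between them.

    Tries each split point that leaves at least `min_anchor` bases on both
    sides and returns (left_start, split_point, right_start) for the first
    split that works, or None if no split works.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:k], read[k:]
        i = genome.find(left)
        if i == -1:
            continue
        # The right piece must start beyond the end of the left piece.
        j = genome.find(right, i + k + 1)
        if j != -1:
            return i, k, j
    return None


print(TRANSCRIPT.find(read))        # contiguous hit on the transcript
print(GENOME.find(read))            # -1: no contiguous hit on the genome
print(spliced_match(read, GENOME))  # split hit that skips the intron
```

Real splice-aware aligners tolerate mismatches and score candidate junctions; the point here is only that a genome-based alignment has to allow a gap the size of an intron.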

Marioni recommends free, open-source command-line tools like Bowtie and BWA, which computer-savvy users can string together into “pipelines” for analyzing RNA-seq data. Alternatively, users can use the software that comes with their sequencers or outsource the work to online data analysis platforms like DNAnexus or service providers like SeqWright.
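As a rough idea of what stringing tools into a pipeline can look like, here is a minimal sketch that indexes a reference and aligns reads with BWA. It assumes a BWA version that provides the `mem` subcommand is installed and on the PATH; the file names and the helper functions `build_steps` and `run_pipeline` are hypothetical, and a real pipeline would add quality control, sorting and counting steps.

```python
import subprocess
from typing import List


def build_steps(reference: str, reads: str) -> List[List[str]]:
    """Return the two BWA commands, in the order they must run."""
    return [
        ["bwa", "index", reference],        # build the index files next to the reference
        ["bwa", "mem", reference, reads],   # align reads; SAM is written to stdout
    ]


def run_pipeline(reference: str, reads: str, sam_out: str) -> None:
    """Run both steps, stopping at the first command that fails."""
    index_cmd, align_cmd = build_steps(reference, reads)
    subprocess.run(index_cmd, check=True)
    with open(sam_out, "w") as sam:
        # Capture the aligner's standard output in the SAM file.
        subprocess.run(align_cmd, check=True, stdout=sam)


if __name__ == "__main__":
    # Hypothetical file names; replace with real paths before running.
    run_pipeline("transcripts.fa", "reads.fastq", "aligned.sam")
```

Keeping the command list separate from the code that runs it makes the pipeline easy to inspect or log before anything is executed.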

The result from such analyses is typically a list of genes whose expression levels changed during the experiment. It’s then up to the user to make sense of those lists, to determine which genes to go after next. “A lot of translational medicine researchers need to take action on that knowledge and pivot their investigations to the pathways or biomarkers that are really closely tied to the phenotype they are investigating,” says Megan Laurance, scientific leader for iReport at Ingenuity Systems. That’s where software like Ingenuity’s iReport comes in.
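How a list of changed genes falls out of read counts can be sketched with a toy calculation. The example below uses made-up counts and gene names, scales each sample to counts per million so that libraries of different sizes are comparable, and flags genes whose expression at least doubled or halved. It is a simplification: real analyses use replicates and statistical models to separate genuine changes from chance, and the twofold cutoff is an arbitrary choice for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def log2_fold_change(cpm_before, cpm_after, pseudocount=1.0):
    """Log2 ratio of after to before; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_after[gene] + pseudocount) / (cpm_before[gene] + pseudocount))
        for gene in cpm_before
    }


# Hypothetical raw counts; the treated library was sequenced twice as deeply.
control = {"geneA": 100, "geneB": 400, "geneC": 500}
treated = {"geneA": 800, "geneB": 800, "geneC": 400}

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))

# Keep genes that changed at least twofold in either direction.
changed = sorted(gene for gene, value in lfc.items() if abs(value) >= 1.0)
print(changed)
```

Note that geneB's raw count doubles, yet its normalized expression is unchanged, because the treated sample simply produced twice as many reads overall.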

iReport is a web-based analytical tool that helps researchers identify the biological pathways implicated in their RNA-seq data. Built on the company’s Knowledge Base, which comprises nearly 5 million findings curated from biomedical references and databases, the software enables researchers to identify key genes and biological processes underpinning the system they are studying.

“Our whole goal with iReport is . . . giving [researchers] a really fast and simple way to ‘grok’ the data, so they know how much data they’re dealing with, and what biological story is contained in the data,” Laurance says. Users can obtain that story for $495 per report, she says.


References

[1] Perry, GH, et al., “Comparative RNA sequencing reveals substantial genetic variation in endangered primates,” Genome Res, 22:602-10, 2012.

[2] Kukekova, AV, et al., “Sequence comparison of prefrontal cortical brain transcriptome from a tame and an aggressive silver fox (Vulpes vulpes),” BMC Genomics, 12:482, 2011.

 

The image at the top of the page is Life Technologies' SOLiD™ 4 System.
