Bioinformatics Tools for lncRNA Research

Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

It is by now well established that the eukaryotic gene deserts once derided as “junk DNA” are anything but. Though they do not code for proteins, these genetic loci play key roles in gene regulation, including the production of long noncoding RNAs (lncRNAs). lncRNAs arise from protein-coding regions, too, and although they don’t code for protein, many do play key functional roles, such as recruiting chromatin-remodeling factors to control gene expression.

Naturally, researchers are keen to study these noncoding transcripts, and for the most part, the research tools to do so already are in place (read: RNA-seq). After all, the sequencer doesn’t care if a transcript encodes protein or not. Data analysis, though, is a different matter.

For one thing, some lncRNAs have different physical characteristics than mRNAs, notes Will Jeck, an M.D./Ph.D. candidate at the University of North Carolina Medical School in Chapel Hill, N.C. Some lack a poly-A tail, for instance, or are even circular, both of which can throw off some sequence-analysis tools.

But perhaps the biggest difficulty, says John Rinn, the Alvin and Esta Star Associate Professor of Stem Cell and Regenerative Biology at Harvard Medical School, is that it’s surprisingly difficult to prove a transcript really is “noncoding.” Just about every long transcript contains at least one short open reading frame (ORF), he notes. Determining that a given ORF is never translated into protein is incredibly difficult; you can never really prove a negative result. “We call them lncRNA essentially because we don’t know what else to call them,” Rinn says. As a result, researchers tend to lump together transcripts whose only common feature is an (apparent) lack of translation.

To help tease out such challenges, bioinformaticians are amassing a growing kit of lncRNA tools. Here are a few options.

Circular RNAs

What bioinformatics tools you need depends, of course, on the question you’re asking. If all you want to do is count molecules, traditional RNA-seq analysis packages should work just fine, assuming the RNAs in question have already been annotated, says Julia Salzman, assistant professor of biochemistry at Stanford University. “The canonical packages for mRNA discovery and quantification are probably just as good as any other lncRNA-specific algorithm,” she says, “because the essential problems are identical.”

Or rather, they usually are identical. One notable exception is circular RNAs (circRNAs).

As Salzman, who as a post-doc identified many such molecules, explains, circRNAs are ubiquitous noncoding transcripts that for years were hidden in plain sight, overlooked, among other reasons, because the bioinformatics tools designed to find novel transcripts were unprepared for the possibility of a circular topology. (There is now a database of circular transcripts at circRNA.org; for lncRNAs in general, check out lncRNAdb.org.)

Jeck, who also has studied circRNAs, says the key to studying them is to use a bioinformatics tool that can accurately detect so-called “backsplices”: sequence reads in which downstream (3’) exons precede upstream exons as a result of the RNA’s circularization. Originally, he says, few if any programs had that capability; they essentially ignored the sequence evidence in such cases. Today, many can, including the mapping program segemehl.
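
The backsplice signature Jeck describes comes down to coordinate order. This toy classifier (not segemehl’s or MapSplice’s actual logic, and with invented coordinates) simply checks whether the donor site of a split read lies downstream of the acceptor site on the plus strand:

```python
# Toy illustration of backsplice detection: in a linear transcript the
# splice donor (end of the upstream exon) precedes the splice acceptor
# (start of the downstream exon). A read crossing a circRNA's back-join
# shows the reverse order.

def junction_type(donor_pos: int, acceptor_pos: int) -> str:
    """Classify a splice junction by genomic coordinates (+ strand)."""
    return "linear" if donor_pos < acceptor_pos else "backsplice"

# A split read whose first half maps downstream of its second half:
print(junction_type(donor_pos=5400, acceptor_pos=1200))  # backsplice
print(junction_type(donor_pos=1200, acceptor_pos=5400))  # linear
```

Real aligners must also handle strand, exon annotation and alignment ambiguity, which is why early tools that assumed donor-before-acceptor discarded these reads entirely.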

Another circRNA-ready tool, and one Jeck used in his own research, is MapSplice. Capable of quantifying both previously annotated and novel transcripts, MapSplice is “a software for mapping RNA-seq data to reference genome for splice junction discovery that depends only on reference genome, and not on any further annotations,” according to the tool’s website. But Jeck recommends TopHat for most users in light of what he calls the application’s “superior documentation” and broader user base. “There’s definitely a usefulness to that broad use that makes it easier to get help when you need it.”

Salzman has developed another algorithm for circRNA analysis. As she explains, every abundance estimate derived from sequence reads has an inherent error, which is a function of transcript abundance. “Basic statistics teach us that the higher the signal, the more precisely we can measure something.” Because lncRNAs tend to be expressed at low levels, quantitation estimates can be thrown off, as can structural analyses, making it difficult for researchers to know which novel transcripts to follow up on.

Salzman’s algorithm, codeveloped with grad student Linda Szabo and not yet published, essentially determines the false discovery rate for each candidate transcript—both circular and linear—thereby giving researchers a way to prioritize follow-up work and assign confidence values.
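
The algorithm itself is unpublished, so its details aren’t reproduced here. As a loose illustration of the underlying statistical point (low read counts carry high relative uncertainty), here is a toy Poisson tail test on a junction’s supporting read count; the noise rate of 1.5 is an invented figure:

```python
# Illustrative sketch only, not the published algorithm: model the
# read count at a candidate splice junction as Poisson, and ask how
# likely a count at least this large is under a background noise rate.
from math import exp

def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly."""
    term = exp(-lam)   # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

# Two supporting reads over a noise rate of 1.5 is unremarkable;
# twenty reads over the same rate is very hard to explain as noise.
print(poisson_sf(2, 1.5))   # ~0.44 -- weak evidence
print(poisson_sf(20, 1.5))  # effectively zero -- strong evidence
```

Ranking candidates by such tail probabilities is one simple way to turn raw counts into the kind of prioritized, confidence-scored list the article describes.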

Salzman says her lab is validating the algorithm by experimentally looking for novel transcripts that seem solid based on sequence coverage, but in which her algorithm has low confidence. “So far, we haven’t been able to detect these transcripts,” she says. “That’s a good sign.”

Now, she says, she hopes the broader research community adopts it. “It will add the general ability to assign confidence in every detected splice junction either at annotated boundaries [or] totally novel boundaries, and that’s just not available in most algorithms.”

Seeing structure

Walter (Larry) Ruzzo, professor of computer science and engineering at the University of Washington, says lncRNA analysis represents a challenging bioinformatics problem because “the kinds of signals we rely on for analyzing the function of protein-coding sequences are different than in the noncoding world.” Codon frequencies, for instance, have no meaning in a noncoding RNA. Evolutionary signatures appear different, too.

One common strategy for inferring function in a piece of nucleic acid is to look at how it has evolved over time—that is, to compare orthologous genes in different species. An important protein-coding gene, for instance, should exhibit relatively few amino acid changes over evolutionary time—that is, its “codon substitution frequency” should be low. But noncoding RNAs are, well, noncoding, meaning the sequence experiences a different sort of evolutionary pressure. And they tend to evolve relatively rapidly compared with protein. Thus, sequence conservation is lower than might be anticipated, complicating analyses.
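
The codon-level signature can be seen in miniature. In this toy pair of aligned “orthologs” (hypothetical sequences), substitutions pile up at the third codon position, where many changes are synonymous; noncoding RNA shows no such three-base periodicity, which is one reason codon-based metrics fail there:

```python
# Toy illustration of an evolutionary signature in coding sequence:
# third codon positions tolerate more substitutions than first or
# second positions, because many third-position changes preserve the
# encoded amino acid.

def substitutions_by_codon_position(seq_a: str, seq_b: str):
    """Count mismatches at codon positions 1, 2, 3 of an aligned pair."""
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            counts[i % 3] += 1
    return counts

# Hypothetical aligned orthologs: every difference is third-position,
# and each one (GCT>GCA, GAA>GAG, CTT>CTC) is synonymous.
human = "ATGGCTGAACTT"
mouse = "ATGGCAGAGCTC"
print(substitutions_by_codon_position(human, mouse))  # [0, 0, 3]
```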

Still, functional RNAs must derive their function from some feature, and if it isn’t sequence per se, it could be structure, or how the molecule folds upon itself. To identify such features, Ruzzo’s team developed CMfinder, a tool that can essentially predict the secondary structure of a given sequence and compare it against other transcripts.
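
CMfinder itself builds covariance models over sequence alignments, which is well beyond a snippet. As a minimal stand-in for the idea of predicting how an RNA folds upon itself, here is the classic Nussinov dynamic program, which scores a sequence by its maximum number of nested base pairs:

```python
# Not CMfinder, but the textbook Nussinov algorithm: a dynamic program
# that finds a nested folding maximizing the number of base pairs,
# subject to a minimum hairpin-loop length.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}

def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `rna`."""
    n = len(rna)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                         # i left unpaired
            if (rna[i], rna[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                   # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCCC"))  # 3 -- the GGG...CCC stem
```

Production tools add thermodynamic energy models and, as in CMfinder, comparative information across species; but the recursion above is the kernel they share.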

Applying the algorithm to bacterial RNAs, Ruzzo’s team has identified likely metabolite-binding control modules called “riboswitches,” which control translation based on metabolite abundance. Now his team has applied CMfinder to the human genome. “It has turned up literally thousands of candidate regions … [in which] the pattern of conservation is better explained by conservation of secondary structure than primary sequence.”

Ultimately, of course, bioinformatics tools are just that, tools. Although they can help formulate hypotheses, those hypotheses must be tested at the bench. But at least, thanks to these and other algorithms, researchers will have a better sense of where to start looking.
