Non-coding RNAs (ncRNAs), RNA molecules that are generally not translated into proteins, were once dismissed as “junk” with no role in gene expression. Within the past decade or so, however, it has become clear that much of the human genome (and that of many other mammals) is transcribed into a variety of ncRNAs. Thought to comprise about 60% of total RNA, ncRNAs are now understood to play a key role in controlling gene expression by binding to proteins or other molecules to switch genes on or off, in most cases without affecting the DNA sequence.

“We originally thought, ‘Oh, this is just noise and garbage,’” says Frank Pugh, Ph.D., professor of biochemistry and molecular biology at Penn State University. “Then a few cases developed where it became clear that ncRNAs weren’t exactly irrelevant. As we started peeling away the layers, we saw that more and more ncRNAs have functions. Some may be functions that we cannot even imagine right now, or that are difficult to define—very subtle functions affecting morphology or states of liquid phases inside the nucleus. If they are disrupted, the effects may be very minor…but they could be fitness effects over the evolutionary lifetime of an organism. Subtle doesn’t mean unimportant.”

A Rosetta Stone for lncRNA function

ncRNAs longer than 200 nucleotides are broadly categorized as long non-coding RNAs (lncRNAs), which are thought to have key regulatory roles that, when disrupted, contribute to disease. Exactly what is the function of each of these many thousands of lncRNAs? For all but a handful of them, that question has remained a mystery. But scientists at the University of North Carolina have now developed an algorithm that they believe can rapidly categorize lncRNAs by their likely functions.

One of the best known lncRNAs is Xist, which plays a key role in normal development in females. “Expressed early in female mammalian development, its function is to turn off gene expression across the entire 150 million base pairs of the X chromosome, over 100 years and a billion cell divisions,” explains Mauro Calabrese, Ph.D., assistant professor of pharmacology and member of the University of North Carolina’s Lineberger Comprehensive Cancer Center. “It’s really remarkable, and good at what it does. Other RNAs have been identified that function in a similar way on a smaller scale, silencing gene expression in smaller regions of the genome, but Xist is a total outlier, turning off the whole chromosome. Nonetheless, Xist and its lesser-known cousins seem unquestionably to be working through the same mechanism, recruiting a histone methyltransferase called the polycomb complex to turn off gene expression.”

But when they compared Xist to these other lncRNAs using BLAST (Basic Local Alignment Search Tool) or other common sequence alignment algorithms, Dr. Calabrese and his team found no similarities in their sequences. “So we thought, maybe we’re just looking at this the wrong way. A tool like BLAST is designed to detect an evolutionary relationship. Protein A and Protein B have similarity via BLAST because they evolved from a single precursor, protein Z. But that inherent assumption isn’t true for these non-coding RNAs. Xist and its cousins do the same thing by recruiting the same enzyme, but each evolved independently.”

Non-coding RNAs recruit RNA-binding proteins that recognize tiny pieces of sequence. So the team theorized that lncRNAs could be categorized by counting short “words,” each a stretch of 6 nucleotides, and comparing the abundance of every possible word in each sequence. “You only have the four letters: A, G, C, and U,” he explains. “If you assume a word is 6 nucleotides long, you have 4096 possible words in that ‘dictionary.’”
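The word-counting idea can be illustrated with a short Python sketch. This is not the team's SEEKR code; the function name `kmer_counts`, the toy sequence, and the choice to count overlapping windows are assumptions made purely for illustration.

```python
from itertools import product

ALPHABET = "ACGU"  # the four RNA letters

def kmer_counts(sequence, k=6):
    """Count every overlapping k-letter word in an RNA sequence.

    Hypothetical helper for illustration only, not part of SEEKR.
    """
    sequence = sequence.upper().replace("T", "U")  # accept DNA-style input
    # Start every possible word at zero so absent words are still reported.
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # skip windows containing ambiguous letters such as N
            counts[word] += 1
    return counts

# Toy sequence, far shorter than a real lncRNA.
counts = kmer_counts("AUGGCUAUGGCUAA")
print(len(counts))       # size of the 6-letter "dictionary": 4**6 words
print(counts["AUGGCU"])  # how often one particular word occurs
```

With four letters and six positions the dictionary has 4 × 4 × 4 × 4 × 4 × 4 = 4096 entries, which matches the figure quoted above.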

The computer-based tool Dr. Calabrese and his team developed, called SEEKR, finds and compares protein-binding sequences in lncRNAs based on those “words” (which they call “k-mers”), regardless of their precise locations. He compares it to an English-language document filled with paragraphs. “If you can’t read the language fluently, you can get a sense of what each paragraph is about by counting the abundance of words in each paragraph. A paragraph that mentions the word ‘dog’ 100 times is far more likely to be about dogs than one that never does.”
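The sketch below shows one way to compare sequences by word abundance rather than by position. It is an assumption-laden illustration, not SEEKR itself. The helper names `kmer_profile` and `pearson`, the toy sequences, the use of 3-letter words (chosen so the short examples are not almost entirely zeros), and the use of Pearson correlation as the similarity score are all choices made here.

```python
from itertools import product
from math import sqrt

def kmer_profile(sequence, k=3):
    """Return k-mer frequencies (count per window) in a fixed word order.

    Hypothetical helper for illustration only.
    """
    sequence = sequence.upper().replace("T", "U")
    words = ["".join(p) for p in product("ACGU", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    profile = [0.0] * len(words)
    windows = max(len(sequence) - k + 1, 1)
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])
        if j is not None:  # ignore windows with ambiguous letters
            profile[j] += 1.0 / windows  # normalize for sequence length
    return profile

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy sequences: a and b share most of their words but in different places;
# c is built from entirely different words.
a = "GCGCGCGCAUAUGCGCGCGC"
b = "AUAUGCGCGCGCGCGCGCGC"
c = "AAAAAUUUUUAAAAAUUUUU"

sim_ab = pearson(kmer_profile(a), kmer_profile(b))
sim_ac = pearson(kmer_profile(a), kmer_profile(c))
print(sim_ab > sim_ac)  # a resembles b more than c, despite the rearrangement
```

A position-based alignment would penalize `b` for having its AUAU stretch in a different place, whereas the word-frequency comparison treats `a` and `b` as close relatives, much like the two paragraphs about dogs.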

The group found that about half of all human and mouse lncRNAs could be grouped into five different communities, based on similarities in their k-mer content. The k-mer-based approach also could help predict where lncRNAs are normally found within cells and to what kinds of protein they bind. “We can now take sequence information from a well-studied lncRNA, and use it to discover lncRNAs that may be functioning through a related mechanism. In a way, it’s like being able to finally understand the different scripts in the Rosetta Stone,” Calabrese says.

“This is the first opportunity we have had to really get a foothold into this process. If we know what maybe 30 lncRNAs do, we can now take SEEKR and find other RNAs that are likely functioning in a similar manner.”

To test SEEKR’s capabilities further, Dr. Calabrese and his group plan to deploy it on a larger group of RNAs with Xist-like functions, which could be regulating gene expression in important ways during development and in different diseases. “We think we can identify many RNAs that regulate gene expression in a way similar to Xist, and we’re now in the process of testing that hypothesis.”

Ultraconserved lncRNAs

One particularly intriguing lncRNA is Evf2, which appears to play a crucial role in early development of the hippocampus. Jhumku Kohtz, Ph.D., research professor of pediatrics in the department of developmental biology at Northwestern University Feinberg School of Medicine, discovered Evf2 and reported it as the first known ultraconserved lncRNA in 2006. She focuses her research on the biological significance of Evf2 and other lncRNAs in the regulation of GABAergic interneuron development. (Defective GABAergic transmission has been implicated in a number of developmental disorders, including epilepsy, autism, and schizophrenia.)

“Not only does Evf2 have enhancer-regulating activity and transcriptional activity, but it’s particularly intriguing because it forms clouds that suggest it is regulating not just genes that are adjacent to it, but genes far away,” she says. “It was previously thought that enhancers only acted ‘next door,’ or one Mb away at a maximum. This tells us that the linear distance of a chromosome is not a limitation; it’s all about what’s happening in three-dimensional space.”

In their most recent research, published in Molecular Cell in August, Dr. Kohtz and her team reported that Evf2 regulates the selection and expression of four specific genes in GABAergic interneurons during embryonic brain development and, by doing so, plays an important role in cells that produce the neurotransmitter GABA. Evf2 selects these genes, some activated and some repressed and ranging from 1.6 Mb to 27 Mb distant, and places a key DNA region near each, which makes them accessible for regulation. In a mouse model without Evf2, adult mice were more susceptible to seizures due to reduced GABA inhibitory function.

“We confirmed that Evf2 RNA decreases seizure susceptibility,” says Dr. Kohtz. “It is exciting to discover how this RNA actually works in the embryo and the crucial impact it has on subsequent neurological activity in the adult brain.”

The research is challenging, Dr. Kohtz adds, because they must work on primary neuronal cells from tissue. “We are looking at a very specific time in development when GABAergic neural progenitors differentiate in vivo. We can’t do this in culture because the minute you do that, you automatically get artefacts: methylation changes that happen even in the presence of RNA. So that’s slowed us down a bit, but it’s very exciting to have these results. Since GABAergic interneurons are the most diverse cell type in the body, a major question has been how does RNA organize such a huge set of genes across a region with a common function? Now we know that folding shortens the distances on the chromosome, which in retrospect seems so obvious.”