Researchers at the Children’s Hospital of Philadelphia (CHOP) have developed a new computational tool to more accurately discover and quantify RNA molecules from error-prone long-read RNA sequencing data. The tool, called ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options), was reported in Science Advances.
Alternative splicing, which allows a single gene to encode several different proteins, is an important step in many biological processes, such as when stem cells mature into tissue-specific cells. However, in the context of disease, alternative splicing can be dysregulated. Therefore, it is important to examine the transcriptome (all the RNA molecules that might stem from genes) to understand a condition’s root cause.
Historically, it has been difficult to “read” RNA molecules in their entirety because they are usually thousands of bases long. Instead, researchers have relied on short-read RNA sequencing, which breaks RNA molecules into fragments and sequences them in much shorter pieces – somewhere between 200 and 600 bases, depending on the platform and protocol.
Computer programs are then used to reconstruct the complete sequences of RNA molecules. Short-read RNA sequencing can give highly accurate sequencing data, with a low per-base error rate of approximately 0.1%. However, it is limited in the information it can provide due to the short length of the sequencing reads.
Recently, “long-read” platforms that can sequence RNA molecules over 10,000 bases in length end-to-end have become available. These platforms do not require RNA molecules to be broken up before sequencing, but they have a much higher per-base error rate, typically between 5% and 20%. This well-known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of novel, previously unknown RNA molecules discovered in a particular condition or disease.
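To put these error rates in perspective, the short Python sketch below works through the arithmetic. It assumes, purely for illustration, that errors occur independently at each base with a fixed probability, which is a simplification of how real sequencing errors behave. The function names and the 10-base window are hypothetical choices, not values from the study.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate


def prob_error_free(window: int, error_rate: float) -> float:
    """Probability that a stretch of `window` bases contains no errors,
    assuming independent per-base errors (an illustrative simplification)."""
    return (1.0 - error_rate) ** window


if __name__ == "__main__":
    # Short read: 600 bases at roughly 0.1% per-base error.
    print("short read, expected errors:", expected_errors(600, 0.001))
    # Long read: 10,000 bases at 5% and 20% per-base error.
    print("long read (5%), expected errors:", expected_errors(10_000, 0.05))
    print("long read (20%), expected errors:", expected_errors(10_000, 0.20))
    # Chance that a hypothetical 10-base window next to a splice site is clean.
    for rate in (0.001, 0.05, 0.20):
        print(f"rate {rate}: P(clean 10-base window) = {prob_error_free(10, rate):.3f}")
```

Under these assumptions a short read carries less than one expected error, while a 10,000-base long read carries hundreds to thousands. A 10-base window is almost always clean at a 0.1% error rate but clean only about one time in ten at 20%, which is why the exact position of a splice site is hard to pin down from a single noisy long read.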
ESPRESSO can accurately discover and quantify different RNA molecules from the same gene – known as RNA isoforms – using error-prone long-read RNA sequencing data alone. To do so, the computational tool compares all long RNA sequencing reads of a given gene to its corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify splice junctions – places where the nascent RNA molecule has been cut and joined – as well as their corresponding full-length RNA isoforms.
By finding areas of perfect matches between long RNA sequencing reads and genomic DNA, as well as borrowing information across all long RNA sequencing reads of a gene, the tool is able to identify highly reliable splice junctions and RNA isoforms, including those that have not been previously documented in existing databases.
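The general idea of trusting perfectly matched reads and borrowing their information for noisier ones can be illustrated with a toy Python sketch. This is not ESPRESSO's actual algorithm or code: the `JunctionCall` record, the `min_support` and `max_shift` thresholds, and the nearest-junction rule are all hypothetical simplifications chosen only to make the concept concrete.

```python
from collections import Counter
from typing import NamedTuple, Optional


class JunctionCall(NamedTuple):
    """One splice junction as seen in one long read (hypothetical record)."""
    donor: int           # genomic coordinate where the upstream exon ends
    acceptor: int        # genomic coordinate where the downstream exon starts
    perfect_flank: bool  # True if the read matches the genome exactly near the junction


def reliable_junctions(calls: list[JunctionCall], min_support: int = 2) -> set[tuple[int, int]]:
    """Keep junctions backed by at least `min_support` perfectly matching reads."""
    support = Counter((c.donor, c.acceptor) for c in calls if c.perfect_flank)
    return {junction for junction, n in support.items() if n >= min_support}


def correct_call(
    call: JunctionCall,
    reliable: set[tuple[int, int]],
    max_shift: int = 10,
) -> Optional[tuple[int, int]]:
    """Snap a noisy call to the nearest reliable junction within `max_shift`
    bases on both ends, or return None if no reliable junction is close enough."""
    key = (call.donor, call.acceptor)
    if key in reliable:
        return key
    nearby = [
        j for j in reliable
        if abs(j[0] - call.donor) <= max_shift and abs(j[1] - call.acceptor) <= max_shift
    ]
    if not nearby:
        return None
    return min(nearby, key=lambda j: abs(j[0] - call.donor) + abs(j[1] - call.acceptor))


if __name__ == "__main__":
    calls = [
        JunctionCall(100, 500, True),
        JunctionCall(100, 500, True),
        JunctionCall(103, 498, False),  # noisy read, slightly shifted
        JunctionCall(100, 900, True),   # only one supporting read
    ]
    reliable = reliable_junctions(calls)
    for call in calls:
        print(call, "->", correct_call(call, reliable))
```

In this toy example, the two perfectly matching reads establish the junction at (100, 500), and the noisy third read is reassigned to it. The fourth call has too little support and no reliable junction nearby, so it is left unresolved rather than forced onto a distant site.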
“Long-read RNA sequencing is a powerful technology that will allow us to uncover RNA variation in rare genetic diseases and other conditions, like cancer,” says senior author Yi Xing, PhD, director of the Center for Computational and Genomic Medicine at CHOP. “We are probably at an inflection point in how we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.”