With continued improvements in library preparation kits and high-throughput instrumentation, sequencing data is being generated at a rapid pace. Consequently, NGS data analysis is quickly becoming the new sequencing bottleneck, replacing challenges already overcome in library generation and assay development. With the spotlight now on data, companies and researchers alike are turning their focus to analytics tools that accelerate and ease NGS data analysis and interpretation.

“The difference in NGS data analysis from other informational analytics drives unique challenges that other areas don’t deal with,” explains Ramon Felciano, CTO at Qiagen. Three factors make data analysis such a substantial obstacle for NGS: the sheer amount of data and scale of experiments, the type of data generated, and the quality of that data.

Different types need different tools

Most of what we know about the genome has come from the standardization of short-read sequencing technologies. While these technologies excel at high-throughput sequencing, they have not yet provided a complete picture of the genome because of limits in resolving complex and repetitive regions. The short-read pipeline, developed primarily around Illumina’s technology, has standardized sequencing, alignment to reference sequences, variant calling, and individualized interpretation.
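As a concrete illustration of that pipeline, the sketch below chains alignment, sorting, and small-variant calling. The specific tools (BWA-MEM, samtools, bcftools) and the file names are assumptions chosen for familiarity, not tools named in the article; a production workflow would also add steps such as duplicate marking and quality recalibration.

```python
# Minimal sketch of the canonical short-read pipeline described above:
# align reads to a reference, sort/index the alignments, then call small variants.
# Assumes bwa, samtools, and bcftools are on PATH; all file names are placeholders.
import subprocess

def run(cmd):
    """Run one pipeline step, echoing the command for traceability."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

reference = "GRCh38.fa"          # reference genome (indexed with `bwa index`)
reads_1, reads_2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# 1. Alignment to the reference sequence
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", reference, reads_1, reads_2],
                   stdout=sam, check=True)

# 2. Coordinate-sort and index the alignments
run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"])
run(["samtools", "index", "sample.sorted.bam"])

# 3. Small-variant calling (SNPs and short indels)
with open("sample.vcf", "w") as vcf:
    mpileup = subprocess.Popen(
        ["bcftools", "mpileup", "-f", reference, "sample.sorted.bam"],
        stdout=subprocess.PIPE)
    subprocess.run(["bcftools", "call", "-mv"],
                   stdin=mpileup.stdout, stdout=vcf, check=True)
    mpileup.stdout.close()
    mpileup.wait()
```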

Long-read sequencing introduces a new data type, structural variants (SVs), and with it a new workflow, and PacBio has now developed a comparable pipeline to support further sequencing studies and validation. As Jonas Korlach, CSO at PacBio, points out, “The data alone is only as useful as the tools to analyze and interpret it. And since different data types require different tools, analytics techniques must be adjusted to fit the data.”

With a new SV pipeline in place, researchers can get at previously hidden information in the genome. Extensive databases such as dbSNP and ClinVar have been built from short-read and SNP data, allowing clinical interpretation to leverage the knowledge aggregated by thousands of scientists. However, while commercial databases like Qiagen’s Ingenuity Knowledge Base already include structural variant annotations such as gene fusions, translocations, and larger insertions and deletions, public databases covering these kinds of SVs don’t yet exist. Once completed, SV databases could be used alongside current SNP databases to increase the diagnostic yield of NGS by examining the whole spectrum of genetic variation. “With only about 30% of the variant bases being in SNPs, perhaps it is not surprising that thus far the diagnostic yields from short-read NGS have only been about 30%,” Korlach adds.

To shorten the path from NGS data to diagnoses that include SVs, PacBio is launching its Joint SV Caller in 2018. The tool analyzes sequences from patients and their parents, using the parental sequences as a baseline against which the patient’s variants are compared. This differential comparison of structural variants lets clinicians interpret results quickly until comprehensive SV databases can be established.
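The article does not describe the Joint SV Caller’s internals, but the idea of a trio-based differential comparison can be sketched as follows. This is an illustrative Python sketch only, not PacBio’s implementation; the record fields and the 500 bp breakpoint tolerance are assumptions.

```python
# Illustrative sketch (not PacBio's implementation) of trio-based SV filtering:
# keep proband SV calls that are not matched by either parent's call set.
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuralVariant:
    chrom: str
    start: int
    end: int
    sv_type: str   # e.g. "DEL", "INS", "DUP", "INV"

def matches(a: StructuralVariant, b: StructuralVariant, tol: int = 500) -> bool:
    """Treat two SVs as the same event if type and chromosome agree and
    both breakpoints fall within a tolerance (500 bp here, an arbitrary choice)."""
    return (a.sv_type == b.sv_type and a.chrom == b.chrom
            and abs(a.start - b.start) <= tol
            and abs(a.end - b.end) <= tol)

def candidate_de_novo(proband, mother, father):
    """Return proband SVs with no match in either parent."""
    parental = list(mother) + list(father)
    return [sv for sv in proband
            if not any(matches(sv, p) for p in parental)]

# Toy example: the deletion is inherited, the insertion is a de novo candidate.
proband = [StructuralVariant("chr1", 100_000, 105_000, "DEL"),
           StructuralVariant("chr2", 250_000, 250_300, "INS")]
mother  = [StructuralVariant("chr1", 100_050, 105_020, "DEL")]
father  = []
print(candidate_de_novo(proband, mother, father))
```

In practice, matching SVs across samples must also tolerate differences in breakpoint representation and event size, which is part of the reason joint calling across a trio is attractive compared with simply intersecting independently generated call sets.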

Demonstrating the need for different data types in analysis, researchers at Uppsala University set out to map genetic variation in 1,000 Swedish individuals. Uppsala’s Adam Ameur analyzed whole genomes from the large-scale DNA sequencing project and identified 33 million genetic variants, 10 million of which were novel. While numerous new variants were discovered, the team also noted that the SV results were limited by the challenge of detecting structural events from short-read sequencing data.

Employing long-read technologies allowed more structural variants to be detected than short-read sequencing alone. The SV data made available in this initial project is nevertheless a first step toward cataloging structural events in a population and estimating their frequencies. Such catalogs also add clinical value for WGS-based diagnostics of genomic aberrations by making it possible to filter out false-positive events and SVs that occur at high frequency in the population.
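A population catalog with frequency estimates would be used roughly as in the sketch below: SVs that are common in the catalog are set aside before clinical interpretation. The 1% cutoff, the exact-key matching, and the data themselves are illustrative assumptions, not values from the Swedish project.

```python
# Sketch of population-frequency filtering for clinical SV interpretation,
# assuming a catalog that maps an SV key to its observed population frequency.
# The 1% cutoff and exact-key matching are simplifying assumptions.
from typing import Dict, List, Tuple

SVKey = Tuple[str, int, int, str]   # (chrom, start, end, type)

def filter_rare_svs(calls: List[SVKey],
                    population_freq: Dict[SVKey, float],
                    max_freq: float = 0.01) -> List[SVKey]:
    """Drop SVs that are common in the population catalog; events absent
    from the catalog are kept as potentially novel."""
    kept = []
    for sv in calls:
        freq = population_freq.get(sv, 0.0)
        if freq <= max_freq:
            kept.append(sv)
    return kept

catalog = {("chr7", 117_480_000, 117_500_000, "DEL"): 0.12,   # common event
           ("chr5", 70_900_000, 70_950_000, "DEL"): 0.004}    # rare event
patient_calls = [("chr7", 117_480_000, 117_500_000, "DEL"),
                 ("chr5", 70_900_000, 70_950_000, "DEL"),
                 ("chrX", 31_100_000, 31_200_000, "DUP")]      # not in catalog
print(filter_rare_svs(patient_calls, catalog))
```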

Controlling scale while monitoring quality

With improvements in whole-genome, exome, and transcriptome sequencing, researchers are no longer forced to focus on a single gene of interest when an entire genome can be sequenced, yielding vastly more information. “However, with this comes more complex datasets as researchers are looking to study and understand different aspects of molecular biology that come together and drive a particular biological function,” explains Felciano. “It will be important to embrace this complexity using analysis tools that can help us understand and also interpret the data.”

Integrated suites such as the Qiagen GeneReader System, which bundles instrumentation, assays, and informatics into a pre-integrated NGS workflow for clinical testing labs, can make the technology accessible to more labs, allowing more samples to be sequenced and analyzed with ease and larger studies to be completed.

Combining assays and analytics also provides an avenue to control scale while allowing deeper analysis. Roche’s AVENIO ctDNA Analysis kits and accompanying software workflow provide the complete set of tools required for data analysis and variant reporting. Pairing analysis software with specific applications helps maintain reporting quality and delivers the sensitivity that ctDNA analysis in particular requires.

Quality standards in sequencing data could become more stringent as NGS makes its way into all facets of research and medicine. Thus, the significance of reference databases cannot be overlooked when analyzing data. Reference databases built with the same technology used for testing, for example, can introduce serious bias into analysis. But incomplete analysis can lead to more than just bias.

During the Uppsala genome project, Ameur discovered parasitic worm DNA that aligned with several of the team’s novel variants. The worm’s genome had been sequenced and submitted to public databases, yet the novel sequences discovered by the Swedish team were definitively human, indicating that human sequence had been mistakenly incorporated into the worm entry in the database.

Without orthogonal contributions to reference databases, there is no reliable way to know whether errors exist or where they might be. Orthogonal methods are fast becoming a critical part of NGS data analysis: they validate sequencing results, guard against bias, lend a more comprehensive view of the data, and make variant calling and diagnosis more robust and accurate.

As Korlach explains, “We need to insist on quality, maintain quality, and correct errors to move forward with sequencing in research and especially as a diagnostic tool.” With this goal in mind, the Broad Institute’s Heng Li and colleagues developed a dataset from de novo PacBio assemblies of two human cell lines and combined it with current short-read datasets to provide an accurate and less biased view of error rates for small variant calls, delivering high confidence in validation and improvement of analytic methods.
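Such a high-confidence truth set is typically used to benchmark a variant caller by counting false positives and false negatives. The sketch below is a simplified illustration of that idea, not the Broad team’s actual evaluation: real benchmarking tools normalize variant representations before comparison, whereas this version matches on exact coordinates and alleles, and the example variants are made up.

```python
# Sketch of how a high-confidence truth set is used to estimate error rates:
# compare a call set against the truth set and report precision/recall.
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]   # (chrom, pos, ref, alt)

def benchmark(calls: Set[Variant], truth: Set[Variant]):
    tp = len(calls & truth)            # true positives
    fp = len(calls - truth)            # false positives
    fn = len(truth - calls)            # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall}

# Made-up variants for illustration only.
truth = {("chr1", 10_001, "A", "G"), ("chr1", 20_500, "C", "T"),
         ("chr2", 5_000, "G", "GA")}
calls = {("chr1", 10_001, "A", "G"), ("chr2", 5_000, "G", "GA"),
         ("chr3", 7_777, "T", "C")}   # one false positive, one missed call
print(benchmark(calls, truth))
```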

Addressing these challenges of scale and analytic integration, Felciano comments, “By providing a universal strategy where bioinformatics can be used across all technologies, Qiagen offers purposeful and continued support in overcoming these challenges of scale. One approach is using cloud technologies to get the benefit of high-performance computing without having to invest in individual IT facilities. The CLC Genomics Cloud Engine does just this, providing a way to scale up set processes in virtual private data centers, while taking advantage of cloud components and squeezing more performance out of current technologies.”

Interpreting results

A significant aspect of data analysis is not just the aggregation and statistical processing of results but the interpretation of the data. Identifying genetic signals associated with disease is only a first step; biological and clinical relevance must then be established. That means putting the data in context by overlaying results against the world’s existing knowledge of genome biology and disease genetics.
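In its simplest form, that overlay step is a join between called variants and a curated annotation source. The sketch below uses a placeholder, in-memory gene-to-disease table; it is not drawn from Ingenuity or any other real knowledge base, and the coordinates and interpretations are illustrative only.

```python
# Toy sketch of the "overlay" step: attach gene/disease annotations from a
# local table to called variants. All entries below are placeholder data,
# not drawn from any real knowledge base.
gene_annotations = {
    "BRCA1": "hereditary breast and ovarian cancer (placeholder entry)",
    "CFTR":  "cystic fibrosis (placeholder entry)",
}

variant_calls = [
    {"chrom": "chr17", "pos": 43_100_000,  "gene": "BRCA1",    "change": "c.68_69del"},
    {"chrom": "chr7",  "pos": 117_559_590, "gene": "CFTR",     "change": "p.Phe508del"},
    {"chrom": "chr12", "pos": 1_234_567,   "gene": "UNKNOWN1", "change": "c.100A>T"},
]

for v in variant_calls:
    # Variants in genes without an annotation fall back to "no known association".
    v["interpretation"] = gene_annotations.get(v["gene"], "no known association")
    print(v["gene"], "->", v["interpretation"])
```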

The use of cloud technologies and applied algorithms enables levels of analysis and interpretation most labs might not be able to create on their own. Cloud technologies offer support for analysis through unprecedented storage and computing capabilities. For instance, Qiagen’s Ingenuity Pathway Analysis, Ingenuity Variant Analysis, and Clinical Insight software tools are delivered on a secure, clinical-grade cloud to provide a shared infrastructure where everyone uses the same technology to drive collaboration in a high-quality environment.

Many of Qiagen's software tools use cloud technologies, which help drive collaboration. 

In addition, it is important to consider contributions from companies like Google (with DeepMind and its DeepVariant deep-learning variant caller) and IBM (with Watson for Genomics), which provide time-saving analytic algorithms for use with available databases. These companies are also working to expand databases through funded research and community partnerships.

From pure variant calling to interpretation, data analysis has become a huge task. With so much data available to work with, researchers must first ensure that current references and established databases are kept up to date, and then integrate powerful new analysis tools and technologies to pull together relevant results and add value for further translation.