Analyzing Whole Genome Sequencing Data

November 07, 2019

Whole genome sequencing (WGS) is an increasingly accessible tool for obtaining the full genomic code of an organism or a patient. Unfortunately, the challenges posed by WGS data analysis can keep researchers from taking advantage of it. WGS generates a huge amount of data in the form of sequence reads. Interpreting these data entails a multistep process using different software tools that line up the reads, look for variations in genetic codes, and compare them to reference genomes, among many other tasks. This used to take weeks or more, but innovations in software, along with the recent adoption of cloud computing, have made WGS analysis significantly faster and cheaper to perform.

Despite recent gains, WGS data analysis still poses challenges that make it hard for labs to exploit the potential of WGS. One reason is that WGS itself is still evolving: most WGS today is “second-generation,” that is, next-generation sequencing (NGS) on platforms such as the widely used Illumina systems, while “third-generation” WGS technology is quickly developing at companies such as PacBio and Oxford Nanopore Technologies. Another challenge is that the software needed for WGS analysis is also still evolving, whether for second- or third-generation data, large population data, or cloud computing.

This article looks at a single snapshot in time of this fast-moving field. Researchers wrestle with data analysis conundrums posed by different lengths of sequence reads, for example, or the increased time and cost of analyzing large population data. Open-source software tools outnumber commercially available ones, as the algorithms that researchers have at their disposal keep changing. Many analysis tools are easily available, yet no one software solution will get the job done. Some researchers resort to writing their own—welcome to WGS data analysis in late 2019.

Multiple software tools needed for WGS analysis

One of those who must sometimes create his own solutions is Sek Won Kong, assistant professor at Harvard Medical School. He focuses on translational genomics and clinical WGS in rare genetic disorders, and is also a faculty member of the Computational Health Informatics Program at Boston Children's Hospital, which advances biomedical informatics. Kong routinely uses multiple methods to analyze WGS data in his research. “There's no one pipeline, which can perform all the analyses, so we have to use multiple different tools,” he says. “I use three or four types of software to analyze a genome, which I chose after doing a comparative analysis of different pipelines,” which he recently published with colleagues.¹

Kong analyzes WGS data generated from families of people with rare genetic and neurodevelopmental disorders, and combines this information with metabolomics and transcriptomics data. “There is no publicly available tool for the research I am performing, so I sometimes have to develop my own tools to perform some types of research,” he says, noting that this isn’t uncommon among researchers performing WGS analysis. “Some collaborate with bioinformatics researchers to develop their own software, and to help answer their questions.”

The generation gap

The data-analysis challenges that scientists face can differ depending on whether their data derive from second- or third-generation WGS (the latter produces longer sequence reads). “For second-generation WGS data, the biggest challenge is the speed of mapping and variants calling,” especially for large population data, says Zhuo Song, CTO and co-founder of Genetalks Biotech. Song uses “BWA+GATK” to analyze second-generation WGS data. BWA (Burrows-Wheeler Aligner) is software that maps sequence reads against a large reference genome.² GATK (Genome Analysis Toolkit), a widely used software package whose functions include variant discovery, was developed by the Data Sciences Platform at the Broad Institute of MIT and Harvard.³
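To make the “BWA+GATK” combination concrete, here is a minimal sketch in Python of the commands a basic align-then-call workflow might chain together. It is an illustration under stated assumptions, not Song's actual pipeline: the function name, file names, and sample name are hypothetical, the reference is assumed to be already indexed (bwa index, samtools faidx, gatk CreateSequenceDictionary), and steps that production pipelines usually add, such as duplicate marking and base quality recalibration, are left out. The function only builds the command strings; it does not run them.

```python
import shlex


def build_short_read_pipeline(reference, fastq_r1, fastq_r2, sample, threads=8):
    """Build shell commands for a basic align -> sort/index -> call workflow.

    All file and sample names are placeholders chosen for illustration.
    """
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"

    # GATK requires read-group information; bwa mem accepts it via -R,
    # with the tab separators written as a literal backslash-t.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    # Step 1: map paired-end reads to the reference with BWA-MEM,
    # piping the alignments into samtools to sort them by coordinate.
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group,
             reference, fastq_r1, fastq_r2]
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, "-"]

    # Step 2: index the sorted BAM so the variant caller can seek within it.
    index = ["samtools", "index", bam]

    # Step 3: call variants with GATK HaplotypeCaller.
    call = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", vcf]

    return [
        shlex.join(align) + " | " + shlex.join(sort),
        shlex.join(index),
        shlex.join(call),
    ]


if __name__ == "__main__":
    for command in build_short_read_pipeline(
        "ref.fa", "sampleA_1.fq.gz", "sampleA_2.fq.gz", "sampleA", threads=4
    ):
        print(command)
```

Mapping and variant calling are the two stages Song identifies as the speed bottleneck, which is why hardware acceleration targets them in particular.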

Song solves the speed problem with computing acceleration. “We speed up the software with a home-brewed FPGA acceleration chip, like DRAGEN from Edico Genome does,” he says. The DRAGEN (Dynamic Read Analysis for GENomics) Bio-IT Platform⁴ uses field-programmable gate array (FPGA) technology to cut the analysis time for NGS data from hours to minutes. In 2018, Illumina acquired the start-up Edico Genome to incorporate DRAGEN into Illumina’s genomic data-analysis tools. In September 2019, Illumina and the Broad Institute announced a collaboration to create open-source analysis software that combines the strengths of both DRAGEN and GATK.

Analysis of third-generation WGS data faces “growing pains,” in that the algorithms used to assemble its long sequence reads are still under active development. “The biggest challenge is to keep third-generation WGS analysis results updated,” says Song. “Researchers may have to recalculate or combine their data with different algorithms.” He uses two types of software for assembling third-generation WGS data: wtdbg2⁵ and CANU⁶. “Among them, wtdbg2 is new and fast, and CANU is older but widely used,” says Song.

In the cloud

As WGS becomes feasible in clinical research and even therapeutics, there is an increasing need to reduce both the cost and the time of analysis. One solution is using cloud computing for the massive calculations needed for WGS analysis. Song and colleagues recently published a system to do this called GT-WGS⁷, which won first place in the high performance genomics computing competition held by the International Congress of Genomics. GT-WGS returns results in minutes with accuracy comparable to the well-known GATK. It takes advantage of the dynamic pricing of Amazon Web Services (AWS) to reduce the cost of large-scale WGS analysis dramatically.

Song and colleagues have developed parallel cloud computing versions of their chosen analysis solutions. The FPGA-based acceleration system, GTX.one, replaces BWA+GATK for analysis of second-generation WGS data, and a parallel cloud-computing version of CANU is used for third-generation WGS data. They also developed GTX.Zip (also called GTZ)⁸ to help researchers with large-scale population WGS data analysis, after observing people “trying to re-invent the wheel” in computing, compressing, and transporting WGS data, says Song. GTZ is a compression and cloud transmission tool that achieves a particularly high compression ratio on genetic data stored as FASTQ files (a common file format for NGS data).
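For readers unfamiliar with FASTQ, the short Python sketch below shows the format's four-line record layout and one reason sequence data compress well: each of the four bases can be stored in two bits rather than a full byte. This is only a toy illustration and makes no claim about how GTZ works internally. The function names are hypothetical, and the packing handles only A, C, G, and T (ambiguous bases such as N, and the quality scores, would need separate handling).

```python
BASES = "ACGT"
BASE_CODES = {base: code for code, base in enumerate(BASES)}


def parse_fastq(text):
    """Yield (read_id, sequence, quality) from FASTQ text.

    Each record is four lines: '@' + id, the bases, a '+' separator line,
    and one quality character per base.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def pack_bases(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODES[base]
        # Left-align a final partial chunk so unpacking can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the original base string; `length` is the number of bases."""
    out = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        out.append(BASES[(packed[i // 4] >> shift) & 0b11])
    return "".join(out)


if __name__ == "__main__":
    example = "@read1\nGATTACA\n+\nIIIIIII\n"
    for read_id, seq, qual in parse_fastq(example):
        packed = pack_bases(seq)
        print(read_id, len(seq), "bases ->", len(packed), "bytes")
        assert unpack_bases(packed, len(seq)) == seq
```

In this toy example, seven bases fit in two bytes. Real FASTQ compressors must also deal with read identifiers and quality strings, which typically account for much of the file size.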

“The good news is that with the increasing volume of data, problems related to scale will be overcome soon enough,” says Song. “Bioinformatics combined with high-performance computing is the future.”