Never Miss a Variant Again with These Sequence-based CNV Detection Tools

CNV Detection from NGS Data
Jeffrey Perkel has been a scientific writer and editor since 2000. He holds a PhD in Cell and Molecular Biology from the University of Pennsylvania, and did postdoctoral work at the University of Pennsylvania and at Harvard Medical School.

For years, much of the work of genetic analysis has revolved around single-nucleotide polymorphisms (SNPs). SNPs are easily detected, first with DNA microarrays and more recently with next-generation DNA sequencing (NGS). But SNPs represent just one kind of genetic variant. Copy-number variants (CNVs) are considerably more difficult to find—at least using NGS.

Why? In a word, length.

Today’s NGS technologies produce millions upon millions of sequence reads, but most are relatively short, measuring just a few hundred bases. Such data make it difficult to piece together the subtle structural variations that distinguish one individual from another, simply because individual reads often are too short to span the variant regions of the genome. For that reason, CNVs—typically defined as chromosomal insertions, deletions and inversions (relative to a reference genome) larger than 1 kilobase—traditionally have been detected with microarrays, in a strategy called array comparative genomic hybridization (arrayCGH), or with fluorescence in situ hybridization (FISH). Both methods offer limited resolution, on the order of tens to hundreds of kilobases or more. NGS, though, is catching up.

“The sequencing-based approach gets [you] the best bang for your buck,” says Ankit Malhotra, associate computational scientist at the Jackson Laboratory for Genomic Medicine in Farmington, Conn. But it isn’t easy: Malhotra compares CNV detection from NGS data to recreating a book using just a list of the words it contains. “You can determine sequence, but not order or abundance,” he explains.

Still, researchers increasingly are turning to NGS to extract CNV information. Perhaps they’ve already collected the sequencing data for another purpose and now want to know if they can use it to pull out CNV information. Maybe they have limited sample and figure they can kill two birds with one stone using sequencing rather than arrayCGH. Or perhaps they simply have recognized that array-based methods are on the decline. In any event, for the researchers who would use NGS data for CNV detection, there are dozens of algorithms to choose from. The trick is figuring out which one to use.

CNV detection strategies

In theory, detecting CNVs from sequencing data should be straightforward. Suppose one region of the genome is duplicated, and another is deleted. If sequencing were completely unbiased, simply counting the number of reads over each segment of the genome would illuminate the CNVs as an increase or decrease in read depth.
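That idealized read-depth logic can be sketched in a few lines of Python. This is a toy illustration, not any published tool: it assumes perfectly uniform coverage, and the bin size, expected depth and gain/loss thresholds are arbitrary.

```python
def read_depth_profile(read_starts, genome_length, bin_size=1000):
    """Count reads whose start positions fall into fixed-size bins."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    depth = [0] * n_bins
    for pos in read_starts:
        depth[pos // bin_size] += 1
    return depth

def call_cnvs(depth, expected, gain=1.5, loss=0.5):
    """Flag bins whose read count deviates markedly from expectation."""
    calls = []
    for i, count in enumerate(depth):
        if count >= expected * gain:
            calls.append((i, "gain"))
        elif count <= expected * loss:
            calls.append((i, "loss"))
    return calls

# Simulated reads: ~10 per 1-kb bin, with a duplicated second bin
# and a deleted fourth bin.
reads = [50] * 10 + [1050] * 20 + [2050] * 10 + [3050] * 2 + [4050] * 10
depth = read_depth_profile(reads, genome_length=5000)
print(depth)                          # [10, 20, 10, 2, 10]
print(call_cnvs(depth, expected=10))  # [(1, 'gain'), (3, 'loss')]
```

On real data, of course, the uniformity assumption underpinning this sketch is exactly what breaks down.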

Unfortunately, sequencing is biased with respect to DNA content, explains Louis Culot, vice president for business development at BioDiscovery, a bioinformatics software company. Some regions amplify more efficiently than others, for instance, and chemical damage in formalin-fixed paraffin-embedded (FFPE) tissue samples can be uneven. “These [factors] combine to make [uniformity] a big problem.”

One approach to addressing those problems is with longer reads, and companies like Pacific Biosciences are indeed making strides on that front. In the meantime, computational approaches can help filter signal from noise.

According to a recent review in BMC Bioinformatics, today’s CNV detection algorithms can adopt any of five strategies [1]. Paired-end mapping looks for unexpected differences in the distances between paired-end reads relative to a reference genome. Split-read-based methods detect CNV breakpoints when one of two paired-end reads fails to fully map to a reference genome. Read-depth methods use the number of sequencing reads across a region to estimate copy number. A fourth method relies on de novo assembly; rather than comparing reads to a reference genome, these algorithms assemble the newly sequenced genome from scratch and compare that assembly to a reference. And the fifth approach is combinatorial, relying on a combination of the other four methods.
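Of the five, paired-end mapping is perhaps the easiest to illustrate: mates in a sequencing library sit a roughly known distance apart, so an anomalous mapped separation hints at a variant between them. The sketch below uses made-up insert-size statistics and a simple three-sigma cutoff; real callers model the distribution and cluster supporting pairs.

```python
def flag_discordant_pairs(pairs, mean, sd, n_sd=3):
    """Flag read pairs whose mapped distance falls outside the expected
    insert-size distribution.  Mates mapping too far apart suggest a
    deletion between them; mates too close together suggest an
    insertion.  `pairs` is a list of (pair_name, mapped_distance)."""
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    flags = []
    for name, dist in pairs:
        if dist > hi:
            flags.append((name, "possible deletion"))
        elif dist < lo:
            flags.append((name, "possible insertion"))
    return flags

# Library with a 400 bp mean insert size (sd 50 bp) -> bounds 250-550.
pairs = [("p1", 410), ("p2", 900), ("p3", 120)]
print(flag_discordant_pairs(pairs, mean=400, sd=50))
# [('p2', 'possible deletion'), ('p3', 'possible insertion')]
```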

For would-be bioinformaticians, the surfeit of approaches to CNV detection could prove overwhelming—the review lists some 48 different software tools [1]. Partly, that’s because “copy-number variation detection” does not mean the same thing in every instance. Some researchers, for instance, are interested in precisely mapping where CNVs begin and end. Others are more interested in the absolute structure of the CNV—that is, how many copies are present and in what configuration. Some want to extract variants from whole-genome sequence information, and others prefer to use exome-sequencing data.

In the case of Henry Wood, senior research fellow at the Leeds Institute of Cancer and Pathology, CNV breakpoints are less interesting than CNV amplitude. “I want to know common regions that are gained or lost” across multiple patient samples, he explains. “Not the exact base—I’m looking for trends.”

Wood and his team are interested in extracting CNV data from small, precious samples—a small piece of a biopsy, say, or a few nanograms of FFPE genomic DNA. They tend to collect relatively low-coverage sequence data—fewer than 10 million reads per FFPE sample—and estimate CNVs from read-depth data using a Bioconductor package called DNAcopy, originally designed for arrayCGH. The resulting data lack some resolution, but for Wood’s needs, they suffice. “We find it’s a tradeoff. If you want to sequence 100 samples cheaply, you’re not going to get good resolution.”
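The job DNAcopy does in that pipeline is segmentation: collapsing thousands of noisy per-bin coverage ratios into a handful of segments with shared copy number. DNAcopy itself is an R package implementing circular binary segmentation; the greedy Python sketch below is emphatically not that algorithm, only an illustration of the segmentation idea on invented log2-ratio data.

```python
def segment(log_ratios, tol=0.3):
    """Greedy segmentation of per-bin log2 coverage ratios: start a new
    segment whenever a bin departs from the running segment mean by
    more than `tol`.  Returns (start_bin, end_bin, mean) tuples.  A
    deliberately crude stand-in for DNAcopy's circular binary
    segmentation, which is statistically far more rigorous."""
    segments = []
    start, total = 0, log_ratios[0]
    for i in range(1, len(log_ratios)):
        mean = total / (i - start)
        if abs(log_ratios[i] - mean) > tol:
            segments.append((start, i, mean))
            start, total = i, log_ratios[i]
        else:
            total += log_ratios[i]
    segments.append((start, len(log_ratios), total / (len(log_ratios) - start)))
    return segments

# Three bins near zero, two gained bins, one normal bin.
print(segment([0.0, 0.1, -0.1, 1.0, 1.1, 0.0]))
# [(0, 3, 0.0), (3, 5, 1.05), (5, 6, 0.0)]
```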

In 2013, Yu-Ping Wang, associate professor of biomedical engineering and biostatistics and bioinformatics at Tulane University, compared six popular read-depth-based options: CNV-seq, Control-FREEC, readDepth, CNVnator, SegSeq and event-wise testing (EWT). The results, Wang says, show that no one tool is best in every case; in fact, the article describing the comparison suggests different options for different applications. For instance, EWT is a good choice for mapping CNV breakpoints. But if a user wants to estimate the actual copy number—that is, how many copies of a sequence are present—CNVnator is a better option [2].

Based on that comparison, Wang and his team developed and published a new method, called CNV-TV, which addresses what he sees as limitations in the six previous methods [3]. In particular, the tool uses variable-sized “windows” across the genome, making it more sensitive to short variants that fixed, larger windows can miss. “We thought we can provide a better approach that can detect CNVs with shorter size,” he says.

But even CNV-TV cannot solve every problem, says Wang. As a rule of thumb, users probably should run three or more algorithms and combine the outputs to build a cohesive picture. Combining data collected using different sequencing strategies may boost resolution, he adds. “Each approach has its pros and cons,” he says.
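Combining callers in practice usually comes down to intersecting their call sets. The sketch below is one simple way to do that—a 50% reciprocal-overlap rule with a minimum-support vote—and the rule, thresholds and tool roles in the example are illustrative assumptions, not a procedure from the cited papers.

```python
def overlaps(a, b, min_frac=0.5):
    """Reciprocal-overlap test between two (start, end) intervals: the
    intersection must cover at least min_frac of each interval."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return False
    return inter >= min_frac * (a[1] - a[0]) and inter >= min_frac * (b[1] - b[0])

def consensus(primary, other_callsets, min_support=2):
    """Keep calls from one tool confirmed (by reciprocal overlap) in at
    least min_support callers overall, the primary caller included."""
    kept = []
    for call in primary:
        support = 1 + sum(
            any(overlaps(call, other) for other in calls)
            for calls in other_callsets
        )
        if support >= min_support:
            kept.append(call)
    return kept

tool_a = [(100, 200), (500, 900)]    # e.g. a read-depth caller
tool_b = [(110, 210)]                # e.g. a paired-end caller
tool_c = [(95, 205), (2000, 2100)]   # e.g. a split-read caller
print(consensus(tool_a, [tool_b, tool_c]))  # [(100, 200)]
```

Requiring agreement trades sensitivity for precision, which is why combining evidence types—read depth plus paired ends, say—tends to work better than stacking near-identical callers.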

Distinguishing features

Culot says one key distinction between different computational approaches is the particular application for which they were intended. Some, for instance, are better able to detect variants in “constitutional” samples by analyzing many samples from the same sequencing run together in an effort to uncover systematic bias. Others are designed to pick out differences between a tumor genome and its matched normal control. Similarly, some algorithmic applications are better at handling whole-genome data while others are more suited to whole-exome sequences or targeted panels.

Another differentiator is the size of the dataset an application can handle. Wood suggests comparing your dataset to the example data included in the packages to determine whether they are comparably sized. Wood developed an algorithmic approach termed CNAnorm, which can handle up to about 10 million reads. “If I had 500 million reads, I know it’s not the best [algorithm] out there.”

Unfortunately, there’s no one-size-fits-all method users can use in every situation. Malhotra says his team uses a variety of tools in its informatics pipelines, including CNV-seq, CNVnator and readDepth for read-depth-based detection; Hydra, BreakDancer, Lumpy and VariationHunter for breakpoint detection using paired-end mapping and split-read detection; and TIGRA-SV for de novo assembly. “Depending on the kind of question, you could be using a different tool or pipeline.”

But Malhotra cautions that every bioinformatics tool will generate an answer. The trick is determining if that answer actually is correct. He recommends following stepwise instructions carefully and, when in doubt, consulting the literature or a bioinformatics expert. “You can easily run these pipelines, and if you are not paying attention, get results that do not make sense.”

References

[1] Zhao, M, et al., “Computational tools for copy number variation (CNV) detection using next-generation sequencing data: Features and perspectives,” BMC Bioinformatics, 14(Suppl 11):S1, 2013. [PubMed ID: NA]

[2] Duan, J, et al., “Comparative studies of copy number variation detection methods for next-generation sequencing technologies,” PLoS ONE, 8(3):e59128, 2013. [PubMed ID: 23527109]

[3] Duan, J, et al., “CNV-TV: A robust method to discover copy number variation from short sequencing reads,” BMC Bioinformatics, 14:150, 2013. [PubMed ID: 23634703]

