About one-third of the nearly 15,000 copy-number-variation (CNV) papers listed in PubMed at least touch on CNVs’ impacts on disease. The growing realization of the roles that major structural variations play in human ills is a product of, and a spur to, the rapid evolution of methods for studying them. Wet-lab methods dominated CNV studies from their beginnings through about 2015: first comparative genomic hybridization (CGH) and then array CGH (aCGH) were the tools of choice.


Figure 1. PubMed CNV publications. Left axis: CGH/aCGH papers (red bars) and WES/WGS papers (green bars). Right axis: Total CNV papers (purple line); CNV and disease papers (blue line).

With the emergence of next-generation sequencing instruments, however, CNV researchers have turned to CNV callers—statistics-based pattern-matching software operating on whole-genome, whole-exome, and targeted sequence data. In 2016, sequence-analysis approaches edged out CGH/aCGH in the research literature. By the end of 2018, publications on sequence-based CNV identification outnumbered CGH/aCGH papers by better than three to one. At the same time, though, hybridization methods continue to rule as the gold standard, particularly for confirming structural variants flagged in software.

The reason: CNV callers have idiosyncratic problems with sensitivity, specificity, and false positives. Any specific CNV caller will correctly identify only its own subset of the actual CNVs in sample data, and each will produce its own collection of false-positive identifications. In 2017, Fatima Zare, working with colleagues at the University of Connecticut and the University of California, San Diego, tested five then-recent, commonly used CNV callers on both SNP-array and whole-exome sequences from 10 breast cancer tumors.1 In general, they found only moderate sensitivity (50% to 80%), fair specificity (70% to 94%), and high false discovery rates (27% to 60%). One package had notably higher sensitivity in calling both gene amplifications and deletions; another had, by a wide margin, the lowest false discovery rate and a higher specificity for amplifications; while a third had the lowest false discovery rate and the highest specificity for deletions.
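
These three metrics follow directly from a caller's confusion counts against a validated truth set. A minimal sketch with invented counts (not Zare's data):

```python
# Toy confusion counts for a hypothetical CNV caller evaluated
# against a validated truth set (numbers are illustrative only).
tp = 70   # true CNVs the caller found
fn = 30   # true CNVs it missed
fp = 40   # calls with no matching true CNV
tn = 860  # regions correctly left uncalled

sensitivity = tp / (tp + fn)   # recall: share of real CNVs detected
specificity = tn / (tn + fp)   # share of non-CNV regions left uncalled
fdr = fp / (tp + fp)           # share of the caller's calls that are wrong

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} FDR={fdr:.2f}")
```

Note that a caller can look respectable on sensitivity and specificity while still burying users in false positives: here more than a third of all calls are wrong.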

This isn’t surprising. CNV callers are complex statistical machines, embodying a number of assumptions and tools. To begin with, the designers choose among calling strategies, including:

  • read-depth (or depth of coverage)
  • paired-end read
  • split read
  • de novo assembly.2

Inputs may be long-read or short-read, derived from any of at least three kinds of sequence data (whole-genome, whole-exome, and targeted sequencing, as noted). And the analysis may depend on any of a host of assumptions about the underlying statistics: one package might assume a Gaussian read-depth distribution, another posits a negative binomial distribution, yet another uses a Poisson distribution, while others make no distributional assumptions at all.3
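
To make the read-depth strategy concrete, here is a toy sketch assuming, as some callers do, Poisson-like coverage. The window counts, expected depth, and z-score cutoff are all invented; real callers add normalization, GC correction, and segmentation on top of this:

```python
import math
import random

random.seed(1)

# Simulate per-window read counts: diploid background at an expected
# depth of 100 reads per window (Poisson counts approximated by a
# matched-variance Gaussian), with a heterozygous deletion
# (copy number 1, roughly half the reads) in windows 40-49.
expected = 100
counts = [random.gauss(expected, 10) for _ in range(100)]
for i in range(40, 50):
    counts[i] = random.gauss(expected / 2, 7)

def call_windows(counts, expected, z_cut=3.0):
    """Flag windows whose depth deviates from the expected value.
    Under a Poisson model the standard deviation is sqrt(mean)."""
    sd = math.sqrt(expected)
    calls = []
    for i, c in enumerate(counts):
        z = (c - expected) / sd
        if z <= -z_cut:
            calls.append((i, "loss"))
        elif z >= z_cut:
            calls.append((i, "gain"))
    return calls

calls = call_windows(counts, expected)
```

With these settings the deleted windows sit about five standard deviations below the expected depth, so nearly all of them are flagged as losses while the diploid background stays almost entirely quiet.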

CNV software developers are working to improve existing CNV callers and introduce new packages, inspiring new evaluation and benchmarking projects that strive to identify the best options among programs with names like BreakDancer, CANOES, CNVnator, Control-FREEC, FermiKit, LUMPY, and SOAPsv. (See Table 1.) In the meantime, other researchers are turning to machine learning and artificial intelligence to increase reliability and extract meaning from multiple callers.


Figure 2. Average numbers of CNVs detected by CNV callers ADTEx, CONTRA, cn.MOPS, ExomeCNV, and VarScan2. Left: Amplified genes. Right: Deleted genes. (Zare 2017, used under Creative Commons License)

In June 2019, Whitney Whitford and a team of University of Auckland neuroscientists evaluated how well five CNV callers detected deletions from whole-genome sequences.4 Their results showed less variation than the Zare study did, reflecting continued intense development of the tools. But it is still true that no single package can find all of the known CNVs in a sample. Why focus on deletions? “Indels and single nucleotide variants are relatively simple to identify,” said Whitford, “as the variants are contained within a single sequencing read, which are aligned to a single locus in the reference genome. In comparison, deletions result in two distant DNA sequences being brought together, which can be problematic for alignment algorithms.”

More recently, L. Zhang, with co-authors from Sichuan University and the Chinese Academy of Sciences, cataloged 15 CNV callers, and benchmarked 10 of them.5 The study concluded that: "LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Additionally, CNVnator and GROM-RD perform well for low-depth sequencing data."

In an even more ambitious review, researchers from the University of Santiago de Compostela and the Galician Research and Development Center in Advanced Telecommunications cataloged and evaluated 32 freeware CNV callers.2 Again, the study found variability in results, stemming from GC content, repeating DNA elements, and variable library preparation and sequencing. The DECoN package was their choice for targeted NGS data—even though the program runs only from a terminal command line, even under Windows. Other top performers included ExomeDepth (though it requires “at least moderate R programming skills”) and ExomeCNV (though its report format is cumbersome, and data require two stages of pre-processing).

Ultimately, the group concluded, they could increase sensitivity only by using five or more control samples and applying nine or more different CNV callers, though combining smaller numbers of callers might improve results in certain circumstances.

New callers

Because of performance issues with existing CNV callers, researchers continue to introduce new methods. Among those are the following:

SeqCNV

One novel approach focuses specifically on targeted capture sequencing, since whole-genome sequencing data can still be cost-prohibitive and complex. Capture NGS provides a greater depth of coverage in regions of interest for better quality and fidelity at lower cost. Fei Wang, Rui Chen, and colleagues at the Baylor College of Medicine, the Shanghai Key Lab of Intelligent Information Processing, and Fudan University aimed to overcome one of the major limitations with capture NGS, which can easily miss large CNVs. Their goal was to develop a new calling method that could reliably identify all types of variation, from small single nucleotide polymorphisms and indels to large duplications and deletions.6

They developed SeqCNV to identify variations of any size from capture sequence data with improved sensitivity and specificity. SeqCNV extracts read depth information to accurately call multiple CNVs in a variety of sample cohorts by identifying copy number ratio and CNV boundary using the maximum penalized likelihood estimation (MPLE) model. (SeqCNV is available here.)
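
The copy-number-ratio signal that SeqCNV segments can be illustrated with a toy case/control example. The depths, median normalization, and log2 cutoffs below are invented for illustration; SeqCNV's actual MPLE boundary estimation is considerably more sophisticated:

```python
import math
from statistics import median

# Invented per-target read depths for a case/control capture pair.
case_depth    = [210, 195, 205, 410, 420, 405, 200, 95, 100, 205]
control_depth = [200] * 10

def copy_ratio_calls(case, control, amp_cut=0.8, del_cut=-0.8):
    """Call per-target gains and losses from the log2 case:control
    depth ratio, using median normalization so that large CNVs do
    not skew the library-size correction."""
    scale = median(control) / median(case)
    calls = []
    for i, (c, n) in enumerate(zip(case, control)):
        log2r = math.log2(c * scale / n)
        if log2r >= amp_cut:
            calls.append((i, "amplification"))
        elif log2r <= del_cut:
            calls.append((i, "deletion"))
    return calls

print(copy_ratio_calls(case_depth, control_depth))
```

Here targets 3–5 (roughly doubled depth) come out as amplifications and targets 7–8 (roughly halved depth) as deletions; median rather than total-depth normalization is what keeps the large amplification from dragging the baseline up.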

Key Takeaways

  • Sequence-based methods are supplanting hybridization for detecting copy number variations.
  • Hybridization retains an important role as the gold standard for confirming CNVs.
  • Individual sequence-based CNV callers typically detect different subsets of CNVs. Specificity and sensitivity are often low, and false-positive rates are often high.
  • Researchers are increasingly using multiple CNV callers and then applying higher-level analyses and/or artificial intelligence to improve overall result quality.

modSaRa2

modSaRa2 updates the original modSaRa (a modified “screening and ranking” algorithm built on a normal mean change-point model), a local-search strategy developed to reduce the computational complexity of approaches like the circular binary segmentation (CBS) recursive breakpoint test.7 The new modSaRa2 further streamlines computing and integrates “more genetic information and external empirical statistics.” Applied to cutaneous melanoma whole-genome SNP sequence data, modSaRa2 found one new deletion and three duplication variants. And in simulations comparing modSaRa2 with modSaRa and established CNV callers CBS and PennCNV, all four packages performed comparably in detecting CN gains and losses and detected “almost all” breakpoints. modSaRa2, however, shone in producing a high true positive rate for detecting weak signals. (modSaRa2 is available here as well as here.)
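
The change-point screening at the heart of this family of methods can be caricatured with a sliding local diagnostic: score each position by the difference of means between its right and left neighborhoods, and treat peaks in the absolute score as candidate breakpoints. The window size, cutoff, and step signal below are invented, and modSaRa/modSaRa2 layer ranking, pruning, and empirical statistics on top of this idea:

```python
def local_diagnostic(y, h):
    """D[i] = mean(y[i:i+h]) - mean(y[i-h:i]) for h <= i <= len(y)-h."""
    d = [0.0] * len(y)
    for i in range(h, len(y) - h + 1):
        left = sum(y[i - h:i]) / h
        right = sum(y[i:i + h]) / h
        d[i] = right - left
    return d

def candidate_breakpoints(y, h=5, cut=0.5):
    d = local_diagnostic(y, h)
    cands = []
    for i in range(h, len(y) - h + 1):
        window = d[max(0, i - h):i + h + 1]
        # keep i only if it is a local maximum of |D| above the cutoff
        if abs(d[i]) >= cut and abs(d[i]) == max(abs(v) for v in window):
            cands.append(i)
    return cands

# Step signal: a copy-number drop between positions 20 and 35.
signal = [0.0] * 20 + [-1.0] * 15 + [0.0] * 15
print(candidate_breakpoints(signal))   # prints [20, 35]
```

The local search visits each position once with a fixed-size window, which is what makes this style of screening so much cheaper than recursive segmentation schemes like CBS.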

Dhaka Project

The Dhaka Project—with members from Microsoft, Carnegie Mellon University, the University of British Columbia, and the British Columbia Cancer Agency—is developing a variational autoencoder (a multilayered perceptron neural network) that deconstructs single-cell sequence data into a more tractable, lower-dimensional representation that makes it easier to see the structural variations that distinguish subpopulations among tumor cells.8 Dhaka then tries to reconstruct the original dataset by reverse-transforming the lower-dimensional representation. The neural net repeats this process over and over, eventually learning the rules that will allow it to reconstitute the data into a form as much like the original as possible. Imagine a Google Translate that can translate the Gettysburg Address from English into Chinese and back to English, and still wind up with “Four score and seven years ago….” What this means is that the simplified representation retains all the essential features of the complex representation in an easy-to-handle form—ideal for discerning structural variations. (Dhaka is available here.)
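
The encode–sample–decode loop can be sketched structurally. The toy dimensions and random, untrained weights below are invented, so this shows only the shape of a variational autoencoder's forward pass, not Dhaka's trained model:

```python
import math
import random

random.seed(0)

# A toy "cell" expression vector is encoded to a low-dimensional mean
# and log-variance, a latent code is sampled, and the decoder maps it
# back to the original dimensionality. Training would tune the weights
# so the reconstruction matches the input; here they are random.
IN_DIM, LATENT = 8, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

w_mu, w_logvar = rand_matrix(LATENT, IN_DIM), rand_matrix(LATENT, IN_DIM)
w_dec = rand_matrix(IN_DIM, LATENT)

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def encode(x):
    return matvec(w_mu, x), matvec(w_logvar, x)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps the sampling step differentiable
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1)
            for m, lv in zip(mu, logvar)]

def decode(z):
    return matvec(w_dec, z)

cell = [random.random() for _ in range(IN_DIM)]
mu, logvar = encode(cell)
z = reparameterize(mu, logvar)     # 2-D representation of the cell
reconstruction = decode(z)         # back to 8-D expression space
```

The low-dimensional `z` is the payoff: once training forces `reconstruction` to resemble `cell`, clustering cells by `z` separates tumor subpopulations far more cleanly than the raw high-dimensional data.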

Turning to multiple callers and AI

Increasingly, though, CNV researchers have taken another approach: Rather than trying to develop the perfect single CNV caller, they are using multiple different calling packages, and/or using machine learning to extract better identifications. On PubMed, just 33 papers mention CNV and either artificial intelligence or machine learning; two-thirds of those were published in 2018 or the first half of 2019.

CN-Learn

Current methods for detecting CNVs from exome-sequencing data are limited by high false-positive rates and low concordance because of inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn diagram approaches to identify "high-confidence" CNVs. This approach misses potentially true calls that do not have consensus from multiple callers.
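
A minimal illustration of the Venn-style consensus filter, with hypothetical call sets:

```python
from collections import Counter

# Each caller's output as a set of CNV identifiers (invented labels).
caller_a = {"cnv1", "cnv2", "cnv4"}
caller_b = {"cnv1", "cnv3"}
caller_c = {"cnv2", "cnv3", "cnv5"}

# Keep only calls made by at least two callers. Singleton calls are
# dropped whether they are false positives or genuinely true CNVs
# that only one caller happened to detect.
votes = Counter()
for calls in (caller_a, caller_b, caller_c):
    votes.update(calls)

high_confidence = {cnv for cnv, n in votes.items() if n >= 2}
print(sorted(high_confidence))   # prints ['cnv1', 'cnv2', 'cnv3']
```

This is exactly the weakness the paragraph above describes: "cnv4" and "cnv5" vanish from the high-confidence set regardless of whether they are real.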

Santhosh Girirajan and collaborators at Pennsylvania State University developed CN-Learn, “a machine-learning framework that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs.”3 They applied the method to CNV calls by four exome-based packages (CANOES, CODEX, XHMM, and CLAMMS) operating on 503 samples. The machine-learning approach, they said, identifies true CNVs with ~90% precision and ~85% recall, even if trained with relatively small amounts of data (about 30 samples). CN-Learn, they found, recovers twice as many CNVs as any single CNV caller or Venn diagram-based approaches. (CN-Learn is available here.)

EnsembleCNV

Noting that, “The associations between diseases/traits and copy number variants…have not been systematically investigated in genome-wide association studies…primarily due to a lack of robust and accurate tools for CNV genotyping,” researchers drawn from the Icahn Mount Sinai School of Medicine, Johns Hopkins, the Karolinska Institutet, Tartu University Hospital, and Tongji University have proposed “a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data.”9

The new caller, they say, identifies and eliminates batch effects at the raw-data level, then uses a heuristic algorithm to assemble CNV calls from “existing callers with complementary strengths” (PennCNV, QuantiSNP, and iPattern) into individual CNV regions (CNVRs). They report that ensembleCNV outperformed competing methods, with a 93.3% call rate and 98.6% reproducibility, while capturing 85% of common CNVs documented in the 1000 Genomes Project.9 (EnsembleCNV is available here.)
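
The "assemble into CNVRs" step can be pictured as interval merging. The coordinates below are invented, and ensembleCNV's heuristic additionally weighs caller concordance and genotypes each region; this shows only the merging core:

```python
# Sketch: calls from several callers on the same chromosome are merged
# wherever they overlap, yielding one CNV region (CNVR) per cluster.
def merge_into_cnvrs(calls):
    """calls: list of (start, end) intervals from any number of callers."""
    cnvrs = []
    for start, end in sorted(calls):
        if cnvrs and start <= cnvrs[-1][1]:   # overlaps the current CNVR
            cnvrs[-1] = (cnvrs[-1][0], max(cnvrs[-1][1], end))
        else:
            cnvrs.append((start, end))        # start a new CNVR
    return cnvrs

calls = [(100, 250), (120, 300),   # two callers report the same event
         (280, 350),               # a third extends the same region
         (900, 1000)]              # a second, separate CNVR
print(merge_into_cnvrs(calls))     # prints [(100, 350), (900, 1000)]
```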

Sparse Learning and Joint Effects

Machine learning excels at drawing correlations from unwieldy masses of data. Zhiyong Wang, Benika Hall, Jinbo Xu, and Xinghua Shi (researchers from the Toyota Technological Institute at Chicago and the University of North Carolina at Charlotte, writing in IEEE/ACM Transactions on Computational Biology and Bioinformatics) note that although current research is illuminating the biology of individual CNVs, there has been little effort to understand the cumulative effects of “multiple interactive CNVs” on complex traits—largely because the computations are a bear.10 The group has stepped into the breach with an approach that combines sparse machine learning with biological networks to identify interacting CNVs.

After theoretical validation with simulated data, they applied their method to human genomic data from the 1000 Genomes Project. They identified 622 candidate CNV groups (with an average of 8.77 CNVs per group) that might exert combined effects. (Code for the group's sparse learning tools is available here, with supporting material under CNVnet here.)

Table 1. Evaluations of Selected CNV Callers

| Tool | Year | Calling method | Roca (2019)2 | Hehir-Kwa (2018)11 | Zhang L (2019)5 | Zare (2017)1 | Whitford (2019)4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ADTex | 2014 | Paired & pooled | | X | | X | |
| BIC-seq | 2011 | Read depth | X | | | | |
| BreakDancer | 2009 | Paired end | X | | | | X |
| CANOES | 2014 | Read depth | X | | | | |
| Canvas | 2011 | Read depth | | | X | | |
| CLAMMS | 2015 | Read depth | X | | | | |
| cn.MOPS | 2012 | Read depth | X | | X | X | |
| CNVem | 2013 | Read depth | | | X | | |
| CNVer | 2010 | Read depth | | | X | | |
| CnvHiTSeq | 2012 | Paired-end, split read, read depth | X | | | | |
| CNVkit | 2016 | Read depth | X | | | | |
| CNVnator | 2011 | Read depth | X | X | X | | X |
| CNVrd2 | 2014 | Read depth | X | | | | |
| CNV-seq | 2009 | Read depth | X | | | | |
| CODEX | 2015 | Read depth | X | | | | |
| CONIFER | 2012 | Read depth | X | X | | | |
| CONTRA | 2012 | Read depth | X | | | X | |
| Control-FREEC | 2011 | Read depth | | X | X | | |
| CoNVaDING | 2015 | Read depth | X | | | | |
| Copy-Seq | 2010 | Read depth | X | | | | |
| Cortex | 2012 | Assembly | X | | | | |
| DECoN | 2016 | Read depth | X | | | | |
| Delly | 2012 | Paired-end, split read | | | | | X |
| Excavator | 2013 | Paired | | X | | | |
| ExomeCNV | 2011 | Read depth | X | X | | X | |
| ExomeCopy | 2011 | Read depth | X | X | | | |
| ExomeDepth | 2012 | Read depth | X | X | | | |
| FermiKit | 2015 | Assembly | | | | | X |
| GASV | 2009 | Paired end | X | | | | |
| GATK | 2010 | Paired & pooled | | X | | | |
| Genome STRiP v2 | 2015 | Read depth, paired-end | X | | | | |
| GROM-RD | 2015 | Read depth | | | X | | |
| iCopyDAV | 2018 | Read depth | | | X | | |
| JointSLM | 2011 | Read depth | | | X | | |
| LUMPY | 2014 | Paired-end, split read, read depth | X | | X | | |
| Magnolya | 2009 | De novo assembly | X | | | | |
| m-HMM | 2013 | Read depth | X | | | | |
| mrCaNaVAR | 2009 | Read depth | | | X | | |
| PEMer | 2009 | Paired-end mapping | X | | | | |
| Pindel | 2009 | Split read | X | | | | X |
| RDXplorer | 2009 | Read depth | X | | X | | |
| ReadDepth | 2011 | Read depth | | | X | | |
| RSICNV | 2017 | Read depth | | | X | | |
| Samblaster | 2013 | Paired end | | | | | |
| SegSeq | 2009 | Read depth | X | | | | |
| SeqCNV | 2017 | Read depth | X | | | | |
| SOAPsv | 2011 | | | | | | |
| Ulysses | 2015 | Paired-end mapping | X | | | | |
| VariationHunter | 2009 | Paired-end mapping | | X | | | |
| VarScan | 2012 | N/A | | | | X | |

References

1. F. Zare, M. Dow, N. Monteleone, A. Hosny and S. Nabavi, "An evaluation of copy number variation detection tools for cancer using whole exome sequencing data," BMC Bioinformatics, vol. 18, no. 1, p. 286, 2017.

2. I. Roca, L. Gonzalez-Castro, H. Fernandez, M. L. Couce and A. Fernandez-Marmiesse, "Free-access copy-number variant detection tools for targeted next-generation sequencing data," Mutat Res, vol. 779, pp. 114-124, 2019.

3. V. K. Pounraja, G. Jayakar, M. Jensen, N. Kelkar and S. Girirajan, "A machine-learning approach for accurate detection of copy number variants from exome sequencing," Genome Res, vol. 29, no. 7, pp. 1134-1143, 2019.

4. W. Whitford, K. Lehnert, R. G. Snell and J. C. Jacobsen, "Evaluation of the performance of copy number variant prediction tools for the detection of deletions from whole genome sequencing data," J Biomed Inform, p. 103174, 2019.

5. L. Zhang, W. Bai, N. Yuan and Z. Du, "Comprehensively benchmarking applications for detecting copy number variation," PLoS Comput Biol, vol. 15, no. 5, 2019.

6. Y. Chen, L. Zhao, Y. Wang, M. Cao, V. Gelowani, M. Xu, S. A. Agrawal, Y. Li, S. P. Daiger, R. Gibbs, F. Wang and R. Chen, "SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data," BMC Bioinformatics, vol. 18, no. 1, p. 147, 2017.

7. F. Xiao, X. Luo, N. Hao, Y. S. Niu, X. Xiao, G. Cai, C. I. Amos and H. Zhang, "An Accurate and Powerful Method for Copy Number Variation Detection," Bioinformatics, 2019.

8. S. Rashid, S. Shah and R. Pandya, "Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data," Bioinformatics, 2019.

9. Z. Zhang, H. Cheng, X. Hong, A. F. Di Narzo, O. Franzen, S. Peng, A. Ruusalepp, J. C. Kovacic, J. L. M. Bjorkegren, X. Wang and K. Hao, "EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data," Nucleic Acids Res, vol. 47, no. 7, p. e39, 2019.

10. Z. Wang, B. Hall, J. Xu and X. Shi, "A Sparse Learning Framework for Joint Effect Analysis of Copy Number Variants," IEEE/ACM Trans Comput Biol Bioinform, vol. 14, no. 5, pp. 1013-1027, 2017.

11. J. Y. Hehir-Kwa, B. B. J. Tops and P. Kemmeren, "The clinical implementation of copy number detection in the age of next-generation sequencing," Expert Rev Mol Diagn, vol. 18, no. 10, pp. 907-915, 2018.