Tools for Understanding and Analyzing Copy Number Variation

Mike May earned an M.S. in biological engineering from the University of Connecticut and a Ph.D. in neurobiology and behavior from Cornell University. He worked as an associate editor at American Scientist, and he is the author of hundreds of articles for clients that include Nature, Science, Scientific American and many others.

In basic genetics, we learn that humans usually carry two copies of a gene—one from our mother and one from our father—but that can vary. The number can be something other than two, and this is called copy number variation (CNV). During DNA replication, genes can be lost or duplicated, which changes the number. Increasing or decreasing the number of copies of a gene often creates more or less, respectively, of its encoded protein. In fact, CNV can be used to study many areas of biology and medicine.

CNV is “linked to human evolution, traits and disease,” says Jonas Korlach, chief scientific officer at Pacific Biosciences. “To highlight just a few examples, increases in the copy number of the AQP7 gene in humans compared to apes are connected to the improved ability to sweat to regulate body temperature, which may have enabled ancestral humans to run long distances during hunts.” He adds, “Within humans, adoption of a starch-based diet in Japanese and other populations is associated with copy number increases in the AMY1 gene, a digestive enzyme.”

Large variations in copy number are also tied to several diseases, including autism, cancer and heart defects.

“Copy number differences in smaller repeat counts are connected to many neurodegenerative diseases, like amyotrophic lateral sclerosis—ALS or Lou Gehrig’s disease,” Korlach says.

High-tech toolbox

Traditionally, biologists studied CNV with cytogenetic techniques, especially fluorescence in situ hybridization (FISH). As Kim Caple, vice president of marketing, microarrays, at Thermo Fisher Scientific, points out: “FISH is visual—subjective.” Although it is also inexpensive, FISH can identify copy number variants only at the scale of thousands of base pairs or larger.

For more specificity in analyzing CNV, some scientists use microarray techniques, such as array comparative genomic hybridization (aCGH). The microarray technology “has dedicated data-analysis software that is automated, flexible, provides visualization and minimizes time to the final result,” says Caple. “Microarrays can also provide data on SNP—single nucleotide polymorphism—variation in addition to copy number.” A microarray approach can target specific genes or the whole genome. Its resolution is also better than that of FISH: microarrays can resolve variants of 25,000 base pairs or less, whereas FISH resolves only those of 40,000 to 100,000 base pairs.

In a study of patients with renal cell carcinoma published in the journal Clinical Genitourinary Cancer, Cynthia A. Schandl, a pathologist at the Medical University of South Carolina, and her colleagues used SNP microarrays to measure CNV. Using this technology, Schandl says, “allows not only computation of copy number across the genome but also … determination of mosaicism and regions of homozygosity—loss of heterozygosity.” Imagine a person who carries one normal and one abnormal copy of a tumor-suppressor gene. Losing the normal copy, an example of loss of heterozygosity (LOH), can allow cancer growth. In fact, LOH plays a role in the development of many cancers, and CNV analysis can be used to study this cancer-causing mechanism.
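
As a rough illustration of how copy number and LOH might be read off SNP-array data, the Python sketch below assumes two hypothetical per-probe inputs that the article does not describe: a log2 ratio of sample to reference intensity, and a B-allele frequency (BAF). The function names and thresholds are illustrative assumptions, not the method used in Schandl's study.

```python
def copy_number_from_log2_ratio(log2_ratio, reference_copies=2):
    """Estimate copy number, assuming a pure sample and a diploid reference."""
    return reference_copies * 2 ** log2_ratio


def looks_like_loh(bafs, het_low=0.35, het_high=0.65):
    """Flag a region that contains no heterozygous-looking probes.

    Heterozygous SNPs have a BAF near 0.5. A region where no probe falls in
    the (assumed) het_low..het_high band is a candidate region of homozygosity.
    """
    return len(bafs) > 0 and not any(het_low <= b <= het_high for b in bafs)


# Hypothetical values, chosen only to exercise the functions.
print(copy_number_from_log2_ratio(0.0))    # no change relative to the reference
print(copy_number_from_log2_ratio(-1.0))   # half the reference signal
print(looks_like_loh([0.02, 0.98, 0.01, 0.99]))
print(looks_like_loh([0.02, 0.51, 0.48, 0.99]))
```

Under these assumptions, an estimate that falls between whole numbers could hint at mosaicism or at a sample that mixes normal and abnormal cells, although noise can produce the same picture.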

To delve even deeper into the structure of chromosomes, DNA sequencing can be used to analyze CNV. These sequencing techniques “are used regularly in research and are making their way to the clinic,” Korlach says. “Some sequencing assays can identify copy number variants to the precision of single base pairs.”

The sequencing approach provides clear advantages. “The main benefits are genome-wide, high-resolution assessment of copy number variation—in contrast to methods such as FISH that investigate only a few predetermined loci,” says Kristin Knouse, an M.D./Ph.D. graduate student in Angelika Amon’s lab at Massachusetts Institute of Technology. “Moreover, when sequencing is performed using paired-end reads at sufficient depth, one can potentially determine additional information about putative CNVs, such as loss of heterozygosity and structural rearrangements.”
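
One common way to turn sequencing data into copy number estimates is read-depth analysis: count the reads that fall in fixed-size bins along the genome and compare each bin with a typical bin. The sketch below assumes hypothetical per-bin read counts and a mostly diploid genome. It is a toy illustration, not the pipeline used by Knouse or by any particular tool.

```python
from statistics import median


def estimate_copy_numbers(bin_counts, baseline_copies=2):
    """Scale each bin's read count by the median bin, assumed to be diploid."""
    typical = median(bin_counts)
    if typical == 0:
        raise ValueError("median read count is zero; cannot normalize")
    return [round(baseline_copies * count / typical) for count in bin_counts]


# Hypothetical read counts for eight consecutive genomic bins.
counts = [100, 104, 98, 151, 149, 102, 50, 99]
print(estimate_copy_numbers(counts))  # [2, 2, 2, 3, 3, 2, 1, 2]
```

Here the two bins with roughly 1.5 times the typical depth come out as three copies and the bin with half the depth as one copy. Real data are far noisier, so a call is usually trusted only when several neighboring bins agree.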

Nonetheless, there is no free lunch in sequencing. “The primary drawback has been the cost, though it is becoming increasingly cheaper,” Knouse says.

Not so simple

Despite the ongoing advances in analyzing CNV, “accurate interpretation of copy number variation remains difficult,” Korlach shares. “The first challenge is from imprecision in identifying the variation.” That is, the techniques—FISH, aCGH and even some DNA-sequencing techniques—do not indicate the precise boundaries of increases or decreases in copy number. “Furthermore, these assays do not indicate where in the genome new copies are inserted—for example, next to the original copy or on another chromosome,” Korlach explains. “Knowing the precise copy-number-variable region and the context in which copies are present is important to fully understand the variation.”

Despite the advantages of more data depth with sequencing, one drawback is “the potential for artifactual copy number alterations secondary to non-uniform whole-genome amplification,” Knouse says. “As such, it is critical to use stringent quality-control criteria to limit the likelihood of false positives.”

Beyond such fundamental obstacles to accurate information about CNV, the technology itself remains complex. When asked about the primary challenges in analyzing and applying copy number variation, Caple points out that the analysis requires expertise and remains time-consuming. In particular, she notes the challenge of fully automating the process and then handling so-called variants of unknown significance, or VOUS.

Today’s tricked-out tools

To learn more about CNV, scientists need more precise data, and modern tools are starting to provide that. “New DNA sequencing technologies promise to pinpoint the boundaries and location of copy number variations to a single base pair, the highest possible resolution,” Korlach notes. Nonetheless, not all sequencing technologies share that capability.

“Short-read sequencers analyze DNA in 100- to 200-base-pair pieces and are unable to fully resolve many copy number variations, which are often 10 to 100 times larger than the reads,” Korlach explains. “Long-read DNA sequencers, like the PacBio RS II and Sequel platforms, produce DNA reads up to 50,000 base pairs.” Simply put, the longer the read from a sequencer, the easier it is to analyze variations that cover more nucleotides.

Korlach also points out that long-read DNA sequencers “identify exactly how many copies are present, and where in the genome those copies lie.” He adds, “Even smaller trinucleotide-repeat copy number variations are typically beyond the reach of short reads, either due to the length of the repeat or because of the sometimes highly skewed sequence context—such as CGG sequence repeats in the FMR1 gene linked to fragile X syndrome.”
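
To make the idea of counting copies within a single read concrete, here is a small Python sketch that finds the longest uninterrupted run of a repeat unit, such as CGG, in a read, and checks whether a read of a given length could cover a repeat plus some flanking sequence. The function names, the example sequence and the 20-base flank are hypothetical. Real repeat-sizing tools must also cope with sequencing errors and interrupted repeats.

```python
import re


def longest_repeat_run(read, unit="CGG"):
    """Return the copy count of the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", read)
    return max((len(run) // len(unit) for run in runs), default=0)


def read_spans_repeat(read_length, repeat_copies, unit_length=3, flank=20):
    """A read can size a repeat only if it covers it plus a flank on each side."""
    return read_length >= repeat_copies * unit_length + 2 * flank


read = "ACGT" + "CGG" * 30 + "TTGA"      # made-up read carrying 30 CGG copies
print(longest_repeat_run(read))          # 30
print(read_spans_repeat(150, 30))        # True: 130 bases needed
print(read_spans_repeat(150, 200))       # False: 640 bases needed
print(read_spans_repeat(10_000, 200))    # True
```

The second function captures the point about read length: once the repeat plus its flanks is longer than the read, no single read can report the full copy count.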

Exploring the outcomes

The decreasing cost of long-read platforms is already expanding the use of sequencing, and that is producing larger datasets.

“This will provide a better understanding of ‘normal’ copy number variation and finally make it possible to comprehensively separate disease-causing from normal variation,” Korlach says.

The growing amounts of data also drive the need for more ways to analyze the results across orthogonal approaches to CNV. As an example, Caple mentions Affymetrix ChAS 3.1, which is a data-analysis package for cytogenetic approaches to CNV. She adds, “patient data sharing will reduce VOUS.”

Ultimately, data sharing across the community of CNV researchers will uncover more information from these results, and that knowledge will enhance both basic and applied research.

