Molecular Cytogenetics Using Next-Gen Sequencing

Josh P. Roberts has an M.A. in the history and philosophy of science, and he also went through the Ph.D. program in molecular, cellular, developmental biology, and genetics at the University of Minnesota, with dissertation research in ocular immunology.

The challenge of cytogenetics is simple to articulate: define the structural features of one genome that distinguish it from another. In practice, it is tricky. For years, researchers have been limited by the relatively crude tools at their disposal, such as Giemsa staining, FISH and DNA microarrays. But DNA sequencing technologies are boosting resolution to unprecedented levels, essentially bridging the gap between molecular cytogenetics and molecular genetics.

It isn’t easy, however. Sequencing read lengths are still mostly too short to answer cytogenetics questions directly. But new strategies are in development. Here we look at the promise of genomic sequencing for discovering (and genotyping) chromosomal structural variations (CVs), some areas where sequencing clearly falls short, and how recent developments and anticipated improvements may ultimately allow it to deliver.

The technologies of molecular cytogenetics

Techniques such as chromosome staining, banding and karyotyping are great for looking at megabase-scale chromosomal aberrations such as aneuploidy and gross rearrangements, which can actually be seen under a microscope. Using fluorescence in-situ hybridization (FISH)-based approaches, researchers can detect structural differences as small as 500 kb.

The introduction of DNA microarrays allowed for querying of copy number variations, deletions and the like down to perhaps 5–10 kb. Yet even these are unable to provide location-specific information about duplicate copies. They are limited to detecting only sequences that are part of the reference set used to design them. They cannot detect balanced translocations. And perhaps most importantly, arrays do not do well with highly repetitive sequences — where many CNVs either originate or are found — and thus they typically cannot resolve the breakpoints of the aberrant sequences.

Next-generation DNA sequencing (NGS), on the other hand, can detect “all aberration types” including balanced translocations, inversions and sequence-level variations, says Rich Shippy, director of product marketing for reproductive and genetic health at Illumina. CNVs, he says, can be resolved at least 100 times more precisely with NGS than with arrays, down to the single-nucleotide level.

Strategies for NGS analysis

The ability of sequencing to perform molecular cytogenetics is, of course, dependent on a variety of factors including library size, read length, sequencing depth and the uniqueness of the sequences being analyzed. It also depends on the type of analyses performed on the data, of which there are four general approaches.

The number of reads that align to a given 5–10-kb window of a reference genome will give a measure called read-depth. “Simply look for how many reads are placed in that window versus how many should place there if it’s a diploid genome,” says Evan Eichler, professor of genome sciences at the University of Washington. With 25-fold coverage, you can reliably detect deletions (fewer or missing reads) and duplications (an overabundance of reads) down to 2–3 kb, he notes, a resolution at least an order of magnitude finer than that of other molecular cytogenetic methods. Read-depth is the only sequencing analysis able to determine the absolute copy number, yet it says nothing about inserted sequences (there is nothing to align to) or inversions (they will align in either orientation), and also cannot discriminate whether duplicated sequences are found in tandem or dispersed in the genome.
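The read-depth idea can be sketched in a few lines of Python. This is only an illustration, not any published caller: the function name `copy_number_by_window`, its parameters, and the choice of the median window as the diploid baseline are all assumptions made for the example.

```python
def copy_number_by_window(read_starts, genome_length, window=5000, ploidy=2):
    """Estimate an integer copy number for each fixed-size window.

    read_starts: 0-based alignment start positions on one reference sequence.
    Assumption (not from the article): the median window count is taken to
    represent the normal diploid state, so most of the region must be normal.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1

    baseline = sorted(counts)[n_windows // 2]  # median-like window count
    if baseline == 0:
        return [None] * n_windows  # too little data to estimate anything

    # Scale each window's count to the baseline: half the reads suggests one
    # copy (a heterozygous deletion), 1.5x the reads suggests three copies.
    return [round(ploidy * c / baseline) for c in counts]
```

A window with roughly half the baseline number of reads comes back as copy number 1, and a window with 1.5 times the baseline comes back as 3. As the text notes, nothing in this calculation says where an extra copy actually sits in the genome.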

To query the locations of duplications, researchers primarily depend upon read-pair-based approaches, which take advantage of the fact that the DNA is sequenced from both ends, with the forward and reverse reads having a fixed distance between them representing the insert size. “What you’re looking for is a set of reads to be anchored to one position, and then a set of the opposite reads to be anchored at some position which is inconsistent with a collinear relationship,” Eichler explains. “In the case of a translocation it’s simple: You would have one set of ends mapping to say chromosome 10, and the other set of ends mapping to chromosome 20. Now I’ve traversed it; I’ve captured the breakpoint.” For an inversion, the paired read would have a reversed orientation.
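The read-pair logic Eichler describes can be sketched as a simple classifier. The `ReadPair` record, the `classify_pair` function, and the insert-size thresholds below are hypothetical names and values chosen for illustration; the sketch also assumes a library in which normal pairs map to opposite strands.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, expected_insert=400, tolerance=100):
    """Label one read pair by how it departs from a collinear mapping.

    Assumptions for this sketch: concordant pairs map to opposite strands
    of the same chromosome, about expected_insert bases apart.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"  # the two ends anchor to different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion"      # one end has a reversed orientation
    span = abs(pair.pos2 - pair.pos1)
    if span > expected_insert + tolerance:
        return "deletion"       # ends map farther apart than the insert size
    if span < expected_insert - tolerance:
        return "insertion"      # ends map closer together than the insert size
    return "concordant"
```

In practice a single discordant pair proves little; callers generally look for a cluster of pairs that all support the same rearrangement before reporting it.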

Split-read analyses look for read pairs in which only one read maps uniquely to the reference genome while the second cannot be mapped (because it spans the rearrangement breakpoint). An algorithm then searches the reference genome for where separate fragments of the unmapped read are able to map. Such analyses promise to detect even small deletions and insertions.
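A toy version of the split-read search might look like the following. It uses exact string matching on a single strand, which real aligners do not; `find_all`, `split_read_map` and the `min_fragment` parameter are hypothetical names introduced only for this sketch.

```python
def find_all(reference, fragment):
    """Return every start position of fragment in reference (exact match)."""
    hits = []
    start = reference.find(fragment)
    while start != -1:
        hits.append(start)
        start = reference.find(fragment, start + 1)
    return hits


def split_read_map(reference, read, min_fragment=8):
    """Find a split point where both pieces of an unmapped read map uniquely.

    Returns (prefix_pos, suffix_pos, split_index) for the first such split,
    or None. Simplifications: exact matches only, forward strand only.
    """
    for split in range(min_fragment, len(read) - min_fragment + 1):
        prefix_hits = find_all(reference, read[:split])
        suffix_hits = find_all(reference, read[split:])
        if len(prefix_hits) == 1 and len(suffix_hits) == 1:
            return prefix_hits[0], suffix_hits[0], split
    return None
```

For a read that crosses a deletion, the gap between the end of the prefix placement and the start of the suffix placement, `suffix_pos - (prefix_pos + split_index)`, gives the size of the deleted segment.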

The fourth approach is sequence assembly, in which reads are put together by merging overlapping sequences into a continuous contig. This typically relies on a combination of de novo and local assembly algorithms (of which there are many), and is considered the most capable method to resolve the breakpoints of any chromosomal aberration, given long and sufficiently accurate sequences.
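The core step of assembly, merging overlapping reads into a contig, can be illustrated with a greedy overlap merge. This is a teaching sketch and not how production assemblers work; it assumes error-free reads from one strand, and the names `overlap`, `greedy_assemble` and `min_overlap` are made up for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left overlaps; keep the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs
```

Given the reads `ACGTAC`, `TACGGA` and `GGATTC`, the sketch joins them into one contig. A greedy merge like this can go wrong when a repeat is longer than the reads, because several reads then overlap equally well, which is why the text stresses long and sufficiently accurate sequences.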

The promise of longer reads

“If all the breakpoints in the genome related to a cytogenetics translocation, deletion, or duplication or inversion, mapped within unique sequences there would be no problem,” points out Eichler. “But many of the translocations, and in particular many of the inversions, map to large, highly identical repeat sequences, and that’s where the system begins to fall apart. We don’t have libraries that are large enough, or read pairs where the single reads are long enough, to accurately place outside and traverse into a repeat sequence.”

So-called third-generation, or next-next-generation, sequencing platforms address the issue by offering substantially longer reads. Pacific Biosciences’ latest chemistry, for example, delivers average read lengths of 8,500 bases, “with dozens of reads that go beyond 30,000 bases,” says chief scientific officer Jonas Korlach.

Many repetitive regions will be spanned by such reads, allowing them to be uniquely mapped to a reference genome.
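Why spanning a repeat matters can be shown with a made-up reference that carries two identical copies of a repeat. The sequences and the `count_placements` helper are invented for this illustration, and exact matching stands in for real alignment.

```python
def count_placements(reference, read):
    """Count the positions where read matches reference exactly."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


# Toy reference: unique flanks around two identical 30-base repeat copies.
repeat = "ATATATGCGC" * 3
reference = "CCGTAGGTCA" + repeat + "TTGACCAGTA" + repeat + "GGCATTCAGG"

# A short read that lies entirely inside the repeat fits both copies.
short_read = repeat[5:25]

# A longer read that spans the repeat and reaches unique sequence on both
# sides fits only one place.
long_read = "GGTCA" + repeat + "TTGAC"

print(count_placements(reference, short_read))
print(count_placements(reference, long_read))
```

The short read cannot be assigned to either copy, while the long read is anchored by the unique bases at its ends.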

Rather than having read pairs, “you’re generating the full sequence of a single molecule, so every sequence represents a distinct haplotype, period,” points out Eichler.

Illumina’s Moleculo technology offers an alternative haplotyping strategy based not on long reads but on statistics and barcoding. The company recently described the approach, which it now calls “statistically aided, long-read haplotyping” (SLRH), in Nature Biotechnology. Using SLRH, the authors report they were able to phase “99% of single-nucleotide variants in three human genomes into long haplotype blocks 0.2–1 Mbp in length,” from as little as 30 Gbp of data [1].

Of course, the ultimate goal of longer reads is to construct a genome assembly from scratch. After all, that’s really what molecular cytogenetics is after, an accurate depiction of the architecture of the genome. “That is the holy grail of this field: If you could take a genome, sequence it, and assemble it accurately without any guide, you’re done. All the molecular cytogeneticists would be out of a job. You’d know all the inversions, deletions, and duplications,” Eichler says.

That day may not be far off. At last month’s Advances in Genome Biology and Technology (AGBT) meeting, Pacific Biosciences announced the release of “the first de novo human genome assembly from PacBio-only sequence reads.” The 54x genome was assembled from a single library using reads averaging nearly 7,700 bases in length, and according to the company, reads could reach 20,000 bases within a year. When that happens, cytogenetic answers could be just an assembly away.

References

[1] Kuleshov, V., et al., “Whole-genome haplotyping using long reads and statistical methods,” Nat Biotechnol, 32:261–6, 2014. [PubMed ID: 24561555]
