Biology is immensely complex, and truly understanding disease mechanisms, heritability, and potential diagnoses, as well as developing therapies, requires access to more genomic information. Currently, the predominant method for genome analysis involves sequencing an individual genome with short reads, without retaining haplotype information, and then aligning those reads to a haploid consensus reference assembly. While this approach provides sufficient power to call single nucleotide variants (SNVs) across most of the genome, it does not permit a complete analysis of the genome.1,2

Despite the widespread recognition of the contribution of structural variants (SVs) and copy number variants (CNVs) to health, they remain among the most difficult types of variation to ascertain accurately from genomic data, in part because they tend to cluster in duplicated and repetitive regions of the genome that are typically not resolved by short-read sequencing.3,4

Limitations in sequencing technology and the human reference have led the field to approach human genome analysis as if the genome were haploid rather than diploid. As a result, the signal from the two haplotypes is averaged, diluting the variant signal. Historically, this has made it difficult to separate true variation from the background noise of depth variability, stochastic changes in allelic representation, probe failure, and alignment artifacts.

Linked-Read technology overcomes these historic limitations by providing long-range genomic information via short-read sequencing. Using this approach, short-read sequences are assigned unique barcodes that can be used to link them back to their original high molecular weight genomic DNA (gDNA) molecules. The resulting Linked-Reads are a powerful tool for reconstructing long-range information, providing a more comprehensive view of the genome and exome. Haplotype information can be deduced, which makes it possible to exploit the diploid nature of the genome. Assessing variation haplotype by haplotype also provides a much more favorable signal-to-noise ratio.
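To make the signal-to-noise argument concrete, the sketch below compares the expected read depth over a heterozygous deletion when reads from both haplotypes are pooled and when they are separated by haplotype. It is an idealized illustration only: the function name and the 30x depth are assumptions, and the model ignores sampling noise and mapping bias.

```python
def expected_depth(total_depth, copies_hap1, copies_hap2):
    """Return (pooled, hap1, hap2) expected depth for one region.

    Idealized assumption: each haplotype contributes half of the total
    depth for every copy of the region it carries.
    """
    per_haplotype = total_depth / 2
    hap1 = per_haplotype * copies_hap1
    hap2 = per_haplotype * copies_hap2
    return hap1 + hap2, hap1, hap2


# Hypothetical 30x sample: a normal region vs. a deletion on haplotype 2 only.
normal = expected_depth(30, 1, 1)
het_deletion = expected_depth(30, 1, 0)

# Fractional drop in depth seen without and with haplotype information.
pooled_drop = 1 - het_deletion[0] / normal[0]
phased_drop = 1 - het_deletion[2] / normal[2]

print(f"pooled depth drops by {pooled_drop:.0%}")
print(f"haplotype 2 depth drops by {phased_drop:.0%}")
```

In this toy model the deletion removes half of the pooled depth but all of the depth on the affected haplotype, which is why a per-haplotype view is easier to distinguish from ordinary depth variability.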

Linked-Reads also enable the mapping of reads to repetitive regions of the genome where structural variant breakpoints often cluster. Barcode information can be used beyond phasing to identify regions located abnormally close to or far from each other. This method proves to be particularly powerful in the detection of copy-neutral events such as inversions and balanced translocations.
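The idea of using shared barcodes to flag regions that sit unexpectedly close together can be sketched as follows. This is a simplified illustration, not the algorithm used by any particular pipeline: the Jaccard overlap measure, the function name, and the toy barcodes are all assumptions chosen for clarity.

```python
def barcode_overlap(barcodes_a, barcodes_b):
    """Jaccard overlap between the barcode sets seen in two genomic windows."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical barcodes observed on reads mapping to three windows.
window_1 = ["BC01", "BC02", "BC03", "BC04"]
window_2_adjacent = ["BC02", "BC03", "BC04", "BC05"]
window_3_distant = ["BC04", "BC09", "BC10", "BC11"]

adjacent_overlap = barcode_overlap(window_1, window_2_adjacent)
distant_overlap = barcode_overlap(window_1, window_3_distant)

print(f"adjacent windows: {adjacent_overlap:.2f}")
print(f"distant windows:  {distant_overlap:.2f}")
```

Windows that lie on the same input molecules tend to share many barcodes, so an unusually high overlap between windows that are far apart in the reference, or an unusually low overlap between neighboring windows, may point to a rearrangement in the sample.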

Here, we highlight important tips to make use of the benefits of Linked-Read technology for long-range genomic analysis. Hyperlinks within the text will direct you to technical literature with additional details and discussion.

Tip #1 Sample Preparation

Capturing long-range genomic information requires high-quality, high molecular weight (HMW) gDNA. However, not every application calls for the highest molecular weight DNA. While detection of complex and balanced structural variants, de novo assembly, and megabase-scale phasing require DNA of at least 50 kb, standard variant calling (insertions, deletions, SNVs) and gene- and SNV-scale phasing are minimally affected by shorter DNA.

Before starting your experiment, consider what types of analyses will be performed on the dataset and what types of variants are of interest. A collection of Demonstrated Protocols for extracting high molecular weight DNA (>50 kb) from various cell and tissue samples is available online.

Regardless of the DNA extraction method used, some universal best practices apply.

Do:

  • Pipet HMW gDNA slowly using a wide-bore pipet tip to prevent shearing.
  • Elute and store HMW gDNA in Tris-EDTA buffer (not water).
  • Use physical grinding instead of chemical lysis buffers when extracting gDNA from tissue samples.

Don’t:

  • Use steps that can denature, nick or damage the DNA, including heat incubation, extreme pH buffers, and chemical dissociation buffers for tissue.
  • Mix samples by vortexing; use brief “pulse vortexing” if absolutely necessary.

Tip #2 Workflow

Barcoded, Illumina®-compatible sequencing libraries are produced using the Chromium™ Controller with Genome or Exome reagent kits. The workflow is straightforward and requires minimal hands-on time. To avoid common pitfalls, read the user guide (for Genome) or the Demonstrated Protocol (for Exome); each includes tips at the beginning of every Protocol Step and a “Practical Tips and Troubleshooting” guide in its appendix.

Tip #3 Sequencing

We recommend >30x sequencing depth for whole genome applications, corresponding to ~800 million reads or ~128 Gb for human samples. For targeted sequencing, the recommended coverage is >60x. The amount of sequencing required to achieve this depth will depend on the capture method and baits used. For example, for Agilent SureSelect™ Human All Exon V6 capture baits, 60x coverage is achieved with ~90 million reads or ~9 Gb of sequence.
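As a rough sanity check on these figures, raw depth can be estimated as total sequenced bases divided by the size of the target, and the implied read length as total bases divided by read count. The sketch below applies this to the numbers quoted above. The helper names are illustrative, and the ~3.1 Gb human genome size is an assumption. Raw depth overstates usable depth, since not every sequenced base ends up as deduplicated, mapped coverage.

```python
def bases_per_read(total_bases, n_reads):
    """Average number of sequenced bases per read."""
    return total_bases / n_reads


def raw_depth(total_bases, target_size_bases):
    """Upper bound on mean depth: every base assumed to map to the target."""
    return total_bases / target_size_bases


# Figures quoted in the text.
wgs_bases, wgs_reads = 128e9, 800e6
exome_bases, exome_reads = 9e9, 90e6

# Assumption: approximate human genome size of 3.1 Gb.
human_genome_bases = 3.1e9

wgs_read_len = bases_per_read(wgs_bases, wgs_reads)        # 160.0
exome_read_len = bases_per_read(exome_bases, exome_reads)  # 100.0
wgs_raw_depth = raw_depth(wgs_bases, human_genome_bases)

print(f"WGS: {wgs_read_len:.0f} bases/read, raw depth about {wgs_raw_depth:.0f}x")
print(f"Exome: {exome_read_len:.0f} bases/read")
```

Under these assumptions the whole-genome recommendation works out to a raw depth somewhat above 40x, which leaves headroom over the >30x target for bases that do not contribute to final coverage.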

Tip #4 Analysis

Once the sample has been sequenced, it is just as important to have a powerful means of analyzing your data. The Chromium™ Software Suite provides tools for analyzing and visualizing Linked-Read sequencing data. The Long Ranger™ Analysis pipelines produce standard output file formats to maintain compatibility with common analysis tools. For example, to verify alignment quality and other features, the barcoded BAM files produced by the Long Ranger pipelines can be viewed in standard genome browsers such as the Integrative Genomics Viewer (IGV).
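Because the output uses standard formats, barcode information can also be inspected with ordinary scripting. The sketch below groups SAM-format alignment records (for example, text produced by `samtools view`) by barcode. It assumes the barcode is stored in a `BX:Z:` optional tag, which you should confirm against your pipeline's documentation. The function name and the example records are made up for illustration.

```python
from collections import defaultdict


def group_reads_by_barcode(sam_lines, tag="BX"):
    """Map barcode -> list of (read name, chromosome, position).

    Reads without the tag are collected under the key None.
    Assumes tab-separated SAM records with 11 mandatory fields.
    """
    prefix = tag + ":Z:"
    groups = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        read_name, chrom, pos = fields[0], fields[2], int(fields[3])
        barcode = None
        for optional in fields[11:]:
            if optional.startswith(prefix):
                barcode = optional[len(prefix):]
                break
        groups[barcode].append((read_name, chrom, pos))
    return dict(groups)


# Hypothetical records: two reads share a barcode, one has none.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t1000\t60\t50M\t*\t0\t0\t*\t*\tBX:Z:AAAC-1",
    "r2\t0\tchr1\t45000\t60\t50M\t*\t0\t0\t*\t*\tBX:Z:AAAC-1",
    "r3\t0\tchr2\t500\t60\t50M\t*\t0\t0\t*\t*",
]
by_barcode = group_reads_by_barcode(records)
for barcode, reads in by_barcode.items():
    print(barcode, reads)
```

Reads that share a barcode and map within tens of kilobases of each other, like `r1` and `r2` here, are candidates for having come from the same input molecule.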

The Loupe™ Genome Browser provides multiple views of the Linked-Read data to aid in structural variant analysis. The "Haplotypes" view in Loupe provides a visualization of all variants in the context of their haplotypes, showing both the candidate and called structural events alongside coverage information. In the screenshot below, examples of phased homozygous and heterozygous variants, as well as unphased and non-reference SNVs, are labeled. Details, including the reference and alternate allele, appear in the browser window when you select an SNV (Figure 1).

Figure 1. Haplotype view visualizes all variants in the context of their haplotypes, showing both the candidate and called structural events alongside coverage information.

The "Linked-Reads" view in Loupe allows for visualizing reads linked by barcode. Linked-Reads assigned to each haplotype are grouped and color-coded, while unphased Linked-Reads are shown in grey at the bottom of the screen. Figure 2 illustrates an approximately 30 kb heterozygous deletion on haplotype 2, with a much smaller deletion present on haplotype 1.

Figure 2. Linked-Reads view illustrates an approximately 30 kb heterozygous deletion on haplotype 2, with a much smaller deletion present on haplotype 1.

The “Structural Variants” view in Loupe can be used to analyze large structural variants (>30 kb), which are revealed by the degree of barcode overlap between genomic regions. For example, Figure 3 plots barcode overlap patterns over a genomic region in samples without (left) and with (right) a structural rearrangement.

Figure 3. Structural Variants view provides a method to analyze large (>30 kb) and complex SVs. (Left panel) Sample with no SV detected. (Right panel) Sample with a large deletion on chromosome 2.

Conclusion

Standard short-read sequencing, while powerful, cannot reveal a complete picture of the genome. With Linked-Reads, it is possible to link distant loci and reconstruct long-range haplotypes, providing the information needed to understand the genome more fully. By following the tips in this article, you can get more out of Linked-Reads, turning short-read sequencing data into long-range genomic information and gaining insight into previously challenging applications.

References

1. Dugoff L, Norton ME, Kuller JA. The use of chromosomal microarray for prenatal diagnosis. Am J Obstet Gynecol. 215, B2-9 (2016). [PMID: 27427470]

2. Manning M, Hudgins L; Professional Practice and Guidelines Committee. Array-based technology and recommendations for utilization in medical genetics practice for detection of chromosomal abnormalities. Genet Med. 12, 742-5 (2010). [PMID: 20962661]

3. Quinlan AR, Hall IM. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 28, 45-53 (2012). [PMID: 22094265]

4. Chaisson MJ, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari F, Antonacci F, Surti U, Sandstrom R, Boitano M, Landolin JM, Stamatoyannopoulos JA, Hunkapiller MW, Korlach J, Eichler EE. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 517, 608-11 (2015). [PMID: 25383537]

Images: 10x Genomics