Next-generation sequencing (NGS) entered the mainstream in 2008, surpassing traditional Sanger sequencing in its ability to produce hundreds of gigabases (Gb) of data in a single run. Although NGS is similar in principle to capillary-based Sanger sequencing (the bases of a DNA fragment are identified from the signal emitted as each fragment is synthesized from a template strand), NGS extends this technology across millions of parallel reactions. Of the steps in the NGS workflow (isolating and fragmenting samples, generating and quantifying libraries, and sequencing), generating libraries is arguably the most critical for the integrity of downstream data.
Biocompare spoke with Mark Stenglein, a postdoctoral fellow in the DeRisi lab at the University of California, San Francisco, for some helpful advice on NGS library preparation. Stenglein recently discovered two divergent arenaviruses, identifying the likely cause of fatal inclusion body disease (IBD) in captive snakes. Stenglein’s hypothesis was simple: If there’s an infection caused by an unknown virus, the viral nucleic acid (RNA) should be detectable among all the host nucleic acid of the diseased and healthy snake samples. The DeRisi lab is well known for identifying viral pathogens of unknown origin and cause, one of the many applications suited for deep sequencing.
“I needed to assemble the entire viral genomes from overlapping short sequences. To do this you need a library that’s complex, free of systematic bias, and not composed of PCR jackpots,” says Stenglein. “The simplest trick is to limit the number of PCR cycles for any next gen experiment.”
Minimizing bias
PCR is a principal source of bias in NGS library preparation. Loci with extreme base compositions are frequently under-represented or absent, even though they are often important targets. GC-rich molecules amplify less efficiently no matter how well an amplification protocol is optimized. With each additional PCR cycle, GC-rich sequences fall further behind, resulting in uneven coverage. Libraries that are uneven (i.e., not complex) push researchers to sequence to excessive mean coverage in order to detect polymorphisms with adequate sensitivity or to complete de novo genome assemblies.
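One rough way to look for these problems after a pilot run is to examine the GC distribution of the reads and the fraction of exact duplicates, which tends to rise when a few molecules have been over-amplified. The Python sketch below is illustrative only: the function names are invented for this example, reads are assumed to be plain strings already in memory, and exact-sequence matching is a crude stand-in for the position-based duplicate marking that dedicated tools perform.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bins(reads: list[str], n_bins: int = 5) -> list[int]:
    """Count reads per GC bin; bin 0 is the most AT-rich, the last the most GC-rich."""
    bins = [0] * n_bins
    for read in reads:
        # A read with 100% GC would index one past the end, so clamp it.
        index = min(int(gc_fraction(read) * n_bins), n_bins - 1)
        bins[index] += 1
    return bins


def duplicate_fraction(reads: list[str]) -> float:
    """Fraction of reads that are exact copies of another read in the set."""
    if not reads:
        return 0.0
    unique = len(Counter(reads))
    return 1.0 - unique / len(reads)


if __name__ == "__main__":
    # Toy reads, not real data.
    reads = ["ACGT", "ACGT", "GGCC", "ATAT"]
    print("GC bins:", gc_bins(reads))
    print("duplicate fraction:", duplicate_fraction(reads))
```

A sparsely populated GC-rich bin, or a duplicate fraction that climbs as PCR cycles are added, would be consistent with the bias described above, although neither number proves it on its own.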
Knowing your library before scaling-up
Stenglein also recommends performing some quality control on a finished library before scaling up. “The first time you try a new preparation protocol, you want to validate that you’re getting what you want . . . [and] that certain molecules aren’t taking over your whole library,” he says. “If you sequence your library molecules you’ll get a good idea of the quality of your library and can proceed with confidence.”
A good way to do this is to TOPO clone—a method of cloning PCR products into a plasmid—after constructing a library. From there researchers can Sanger sequence library molecules to get the complete sequence, including the inserts and end adapters. (Molecules without proper adapter placement and orientation are not competent for sequencing.)
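Once the Sanger reads from the clones are in hand, checking adapter placement and orientation can be scripted. The sketch below assumes each clone read is a plain string and that the two adapter sequences are known; the adapter sequences and function names shown here are made-up placeholders, not those of any real platform. Because an insert can land in a cloning vector in either orientation, the reverse complement is tried as well.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extract_insert(read: str, left: str, right: str) -> Optional[str]:
    """Return the sequence between the left and right adapters, or None
    if either adapter is missing or they appear in the wrong order."""
    read = read.upper()
    start = read.find(left)
    end = read.rfind(right)
    if start == -1 or end == -1:
        return None
    insert_start = start + len(left)
    if end < insert_start:
        return None
    return read[insert_start:end]


def extract_insert_any_orientation(read: str, left: str, right: str) -> Optional[str]:
    """Try the read as given, then its reverse complement."""
    insert = extract_insert(read, left, right)
    if insert is None:
        insert = extract_insert(reverse_complement(read), left, right)
    return insert


if __name__ == "__main__":
    # Placeholder adapters and a toy clone read, for illustration only.
    LEFT, RIGHT = "CATGCATTGA", "TGGACCTAGT"
    clone = "TT" + LEFT + "GATTACAGG" + RIGHT + "CC"
    print(extract_insert_any_orientation(clone, LEFT, RIGHT))
```

Clones for which this returns None lack a correctly placed adapter pair under exact matching. Real Sanger reads contain base-calling errors, so an approximate match would be a sensible refinement.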
Quantify before loading
The concentration of library oligonucleotides is also critical. For instance, on one commercial platform, the adapted library oligonucleotides are captured on the surface of a flow cell and then amplified there to boost the emitted signal. (The Y-shaped adapters attach the library molecules to the surface of the flow cell for amplification.) The oligonucleotides need to be loaded at a density high enough to make full use of the flow cell and yield plenty of reads. (Accuracy in NGS is achieved by sequencing a given region many times, through massively parallel processing, with each sequence contributing to coverage depth. NGS data require a sufficient number of overlapping reads for coverage.) If too much product is loaded, the clusters become too dense and overlap, creating unusable mixed reads. Currently, the most popular method for quantification is quantitative PCR (qPCR). However, qPCR reports only the overall signal and not the size of the fragments. Droplet digital PCR (ddPCR) is just beginning to emerge as a more comprehensive way to measure a sample, down to the individual molecule.
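Loading calculations are usually done in molar terms, which is why fragment size matters alongside mass concentration. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names are invented for this example, and the mean fragment length is assumed to come from a separate sizing measurement, since qPCR does not supply it.

```python
AVG_MASS_PER_BP = 660.0  # g/mol per base pair of dsDNA (standard approximation)


def library_molarity_nm(conc_ng_per_ul: float, mean_length_bp: float) -> float:
    """Convert a mass concentration (ng/uL) to molarity (nM) for dsDNA."""
    if mean_length_bp <= 0:
        raise ValueError("mean_length_bp must be positive")
    # ng/uL divided by g/mol gives mmol/L scaled by 1e-6; multiply by 1e6 for nM.
    return conc_ng_per_ul / (AVG_MASS_PER_BP * mean_length_bp) * 1e6


def stock_volume_ul(stock_nm: float, target_nm: float, final_volume_ul: float) -> float:
    """Volume of stock needed to reach target_nm in final_volume_ul (C1*V1 = C2*V2)."""
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds stock concentration")
    return target_nm * final_volume_ul / stock_nm


if __name__ == "__main__":
    # Hypothetical library: 2 ng/uL with a 400 bp mean fragment length.
    molarity = library_molarity_nm(2.0, 400)
    print(f"library: {molarity:.2f} nM")
    print(f"stock needed for 20 uL at 2 nM: {stock_volume_ul(molarity, 2.0, 20.0):.2f} uL")
```

The same mass concentration corresponds to half the molarity if the fragments are twice as long, so an error in the assumed size translates directly into over- or under-loading.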
Determine a good analysis strategy early
Although NGS platforms have dramatically increased throughput and substantially lowered cost in comparison to traditional Sanger methods, the production of billions of NGS reads challenges the infrastructure of many existing information technology systems. Researchers need to have alliances with core facilities that are capable of the massive downstream data handling.
“People often underestimate the end analyses,” says Stenglein. “It’s really important to have a good analysis strategy worked out before you start sequencing. The amount of data from a single run is just enormous.” He notes that pooling 35 snake samples into one lane of a flow cell resulted in 240 million sequences. Finally, along those lines, Stenglein advises researchers to avoid sequencing first and asking questions later. He recommends analyzing each run as you go, working with a collaborator well versed in bioinformatics.
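A back-of-the-envelope calculation can help with that planning. The sketch below divides the 240 million sequences evenly across the 35 samples and estimates coverage as reads times read length divided by target size. The even split, the read length, the target genome size, and the fraction of reads that come from the target are all hypothetical values chosen for illustration; none of them is taken from the study.

```python
def reads_per_sample(total_reads: int, n_samples: int) -> int:
    """Reads per sample if a lane were shared perfectly evenly."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_reads // n_samples


def expected_coverage(n_reads: int, read_length: int, target_size: int,
                      fraction_on_target: float = 1.0) -> float:
    """Mean depth over the target: on-target bases sequenced / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return n_reads * read_length * fraction_on_target / target_size


if __name__ == "__main__":
    per_sample = reads_per_sample(240_000_000, 35)
    print("reads per sample:", per_sample)
    # Hypothetical: 100-base reads, a 10,000-base viral genome, and
    # 0.1% of reads derived from the virus rather than the host.
    depth = expected_coverage(per_sample, 100, 10_000, fraction_on_target=0.001)
    print(f"estimated mean coverage: {depth:.1f}x")
```

Running the numbers for a few plausible on-target fractions before sequencing gives a sense of how many samples can share a lane and how much data the downstream analysis will have to handle.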
For more information on the arenavirus project methods see "Identification, Characterization, and In Vitro Culture of Highly Divergent Arenaviruses from Boa Constrictors and Annulated Tree Boas: Candidate Etiological Agents for Snake Inclusion Body Disease," by Stenglein et al.
The image at the top of the page was provided by the DeRisi laboratory.