Single-cell experiments tend to yield notoriously large and noisy datasets. The main challenge in single-cell data analysis is to preserve the variance originating from biological differences, and to clarify these signals by reducing variance from non-biological sources. This is especially challenging in single-cell experiments when the biological signals are small, yet functionally important. “Even genetically identical cells can behave very differently,” says Eric Hobbs, Executive VP of R&D and Operations at Bruker. “The risk is mistaking this natural variability for noise, or worse, averaging it away and losing the biology entirely.” Preserving the insights encoded in biological differences is the goal. This article examines challenges encountered and strategies for success in single-cell analysis.
Best practices for a strong start
Decisions made at the experimental design stage can make later data analysis easier. “Many of the phenomena that complicate interpretation, such as batch effects, elevated noise, and dropout events, can often be traced back to upstream decisions in experimental design and sample handling,” says Peter Smibert, VP, Biology at 10x Genomics. Therefore, try to minimize the time between sample collection and processing to reduce artifacts from degraded or stressed cells that can obscure genuine biological signals.
“Careful sample preparation is one of the most effective ways to prevent downstream data quality issues,” says Vicky Morrison, Senior Product Manager of Software at Parse Biosciences. Working quickly yet gently during tissue dissociation, and keeping samples cold, will reduce stress and cell death. “Attention to these steps not only reduces the need for aggressive filtering later, but also preserves the true biological signal, resulting in cleaner, more interpretable datasets,” she says. When quick sample processing isn’t possible or required, consider fixing the samples. “Fixation preserves the biological state of cells at the time of collection, effectively locking in gene expression profiles and preventing transcriptional changes that can occur during handling and processing,” Morrison adds.
Another front-end best practice is to keep data analysis top-of-mind when designing experiments. Just as with wet-lab techniques, researchers need to understand how their analytical methods work, in principle if not in granular mathematical detail. “While it can be tempting to run tools without understanding their underlying structure, blind analysis approaches are best avoided,” says Smibert. “Instead, teams should select computational tools intentionally based on their experimental questions and practical observations, including active community development and engagement [with computational experts].”
Some tricky bits
Though some might argue that most things about single-cell experiments are difficult, there are points within the single-cell analysis workflow that seem like open doors to chaos. Anything that obscures biological signals is problematic—and not unusual, because the signals are already very small. However, batch effects, background noise, dropout events, and doublets are known offenders for which you can be prepared.
Batch effects
Batch effects can easily result in variance that masks true biological signals, especially if samples are processed at various times, in different runs, by a variety of researchers. The key to preventing batch effects is consistency—and while this is true for any experiment, it’s especially true for those generating signals close to background noise. “Standardized protocols for sample preparation, along with sample multiplexing and randomizing samples across sequencing runs, help ensure a balanced experimental design and reduce batch effects,” says Vilija Lomeikaite, Lead Bioinformatician at Vugene.
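The idea of randomizing samples across sequencing runs while keeping the design balanced can be sketched in a few lines of Python. This is a minimal illustration, not a published protocol: the function name `assign_runs`, the sample labels, and the two-run layout are all hypothetical. It shuffles samples within each condition and then deals them out round-robin, so no single run ends up holding only one condition.

```python
import random
from collections import defaultdict

def assign_runs(samples, n_runs, seed=0):
    """Spread each condition's samples across runs as evenly as possible.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping run index -> list of sample_ids.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    runs = {r: [] for r in range(n_runs)}
    offset = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within each condition
        for i, sample_id in enumerate(ids):
            runs[(i + offset) % n_runs].append(sample_id)
        offset += len(ids)  # continue the round-robin where the last condition stopped
    return runs

# Hypothetical design: 4 control and 4 treated samples, 2 sequencing runs.
samples = [(f"ctrl_{i}", "control") for i in range(4)]
samples += [(f"trt_{i}", "treated") for i in range(4)]
runs = assign_runs(samples, n_runs=2)
for run, ids in runs.items():
    print(run, sorted(ids))
```

With this layout each run receives two control and two treated samples, so a run-to-run difference cannot be mistaken for a treatment effect.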
Where prevention falls short, batch correction during analysis can remove remaining differences between samples processed under different conditions. “If uncorrected, batch effects can drive artificial clustering and mask true biological relationships,” says Morrison. Lomeikaite notes that “data integration using tools like Harmony remove[s] technical variation while preserving biological differences.” [1]
Background noise
Background noise has several possible origins, and filtering can reduce it. For example, in single-cell RNA sequencing (scRNA-seq), background noise can arise from residual RNA released from burst and apoptotic cells. Thus, in droplet-based methods, “quality control involves removing ambient RNA using tools like SoupX, which estimates background RNA in empty droplets and subtracts it from real cell profiles,” says Lomeikaite. [2]
In scRNA-seq, it is especially important to filter cells during analysis based on the proportion of transcripts from mitochondrial and ribosomal protein-coding genes. This step removes dead or stressed cells (with high mitochondrial content) and metabolically inactive cells (with low ribosomal content). “Filtering is the first key step and focuses on removing low-quality data points and technical noise,” says Morrison. “This includes excluding background, including empty droplets (common when using droplet-based approaches), and removing dead or dying cells that typically show elevated mitochondrial gene expression.” She adds that it’s also important to consider biological context when setting filtering thresholds for each of these factors, as some systems will have an inherently higher or lower proportion of mitochondrial and/or ribosomal protein-coding transcripts.
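The filtering logic described above can be illustrated with a small, dependency-free Python sketch. Everything specific here is an assumption made for illustration: the helper names, the toy count data, the 20% mitochondrial cutoff, and the 500-count minimum are hypothetical and, as Morrison notes, real thresholds should reflect the biological system. The sketch also assumes mitochondrial genes carry the "MT-" prefix used for human gene symbols.

```python
def mito_fraction(counts, mito_prefix="MT-"):
    """Fraction of a cell's total counts that come from mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in counts.items() if gene.upper().startswith(mito_prefix))
    return mito / total

def filter_cells(cells, max_mito=0.2, min_counts=500):
    """Keep cells with enough total counts and a mitochondrial fraction <= max_mito.

    Both thresholds are illustrative defaults, not recommendations.
    """
    kept = {}
    for barcode, counts in cells.items():
        if sum(counts.values()) >= min_counts and mito_fraction(counts) <= max_mito:
            kept[barcode] = counts
    return kept

# Toy data: gene -> raw count for three hypothetical barcodes.
cells = {
    "healthy": {"ACTB": 700, "GAPDH": 250, "MT-CO1": 50},  # 5% mitochondrial
    "dying": {"ACTB": 300, "GAPDH": 200, "MT-CO1": 500},   # 50% mitochondrial
    "empty": {"ACTB": 8, "MT-ND1": 2},                      # too few counts overall
}
kept = filter_cells(cells)
print(sorted(kept))
```

A ribosomal filter could follow the same pattern with a different gene prefix and a lower bound instead of an upper bound.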
Dropouts and doublets
Another source of unwanted, non-biological variance is dropout events. These occur when transcripts are so low in abundance that the assay does not detect them, producing a zero reading and leaving doubt as to whether the transcripts are truly absent. “Ensuring sufficient sequencing depth and isolating healthy cells via MACS or FACS can mitigate data sparsity and minimize the impacts of dropout events,” says Lomeikaite.
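To see why low-abundance transcripts are the ones that drop out, consider a deliberately simplified model, assumed here for illustration only: the number of molecules of a transcript that are captured and sequenced from one cell follows a Poisson distribution, so the chance of a zero reading is exp(-expected captured molecules). The function name and the capture efficiencies below are hypothetical, and real assays have additional sources of loss.

```python
import math

def dropout_probability(molecules_per_cell, capture_efficiency):
    """P(zero observed counts) under a simple Poisson capture model.

    capture_efficiency lumps together every step between the cell and the
    final count (capture, amplification, sequencing depth) into one number.
    """
    expected_captured = molecules_per_cell * capture_efficiency
    return math.exp(-expected_captured)

# Compare rare and abundant transcripts at two hypothetical efficiencies.
for molecules in (1, 5, 20):
    for efficiency in (0.05, 0.2):
        p = dropout_probability(molecules, efficiency)
        print(f"{molecules:>2} molecules, {efficiency:.0%} captured: P(dropout) = {p:.2f}")
```

Under this model, raising the effective capture rate (for example through deeper sequencing) lowers the dropout probability for every transcript, but the rarest transcripts remain the most affected.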
In droplet-based scRNA-seq platforms, doublets and multiplets occur when a droplet contains and barcodes two or more cells, creating misleading hybrid expression profiles that distort downstream analyses. Morrison notes that optimizing cell concentrations helps to limit the occurrence of doublets and multiplets. Bioinformatics tools, such as DoubletFinder for scRNA-seq data, can help to remove doublet signals, also called false or hybrid cells.
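The link between cell concentration and multiplet rate can be made concrete with a common simplifying assumption: cells are loaded into droplets at random, so the number of cells per droplet follows a Poisson distribution. The sketch below, with a hypothetical function name and loading values, computes the share of cell-containing droplets that hold two or more cells. It is a back-of-the-envelope model, not a platform specification.

```python
import math

def multiplet_fraction(mean_cells_per_droplet):
    """Among droplets holding at least one cell, the fraction holding two or more.

    Assumes Poisson loading with the given mean number of cells per droplet.
    """
    lam = mean_cells_per_droplet
    if lam <= 0:
        raise ValueError("mean_cells_per_droplet must be positive")
    p_zero = math.exp(-lam)        # empty droplets
    p_one = lam * math.exp(-lam)   # droplets with exactly one cell
    return (1 - p_zero - p_one) / (1 - p_zero)

# Hypothetical loading densities: more concentrated cells, more multiplets.
for lam in (0.05, 0.1, 0.3):
    print(f"mean {lam} cells/droplet -> {multiplet_fraction(lam):.1%} multiplets")
```

At low loading densities the fraction is roughly half the mean number of cells per droplet, which is why diluting the cell suspension reduces multiplets at the cost of more empty droplets.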
Additional considerations for wrangling data
Attention to the preceding issues will set you up for data analysis that is easier to interpret and that better reflects biological signals. It is important to understand your analysis methods well enough that you can apply them with the same rigor as any wet-lab method. “Each analysis is an iterative, hypothesis-driven process where you may need to test parameters, evaluate outputs, or refine your approach based on outcomes,” says Smibert. “Computational work is a natural extension of experimental science, not separate from it.” Batch correction and data normalization methods allow integration of data for comparisons across batches. Lomeikaite notes that analysis methods such as SCTransform can “stabilize variance and account for differences in sequencing depth across the samples.”
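As a rough illustration of what "accounting for differences in sequencing depth" means, the sketch below applies plain library-size normalization followed by a log transform. This is a simpler stand-in chosen for clarity and is not the SCTransform method; the function name, the scaling target of 10,000, and the toy counts are assumptions.

```python
import math

def normalize_cell(counts, target_sum=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(n / total * target_sum) for gene, n in counts.items()}

# Two hypothetical cells with identical composition but a 10x depth difference.
shallow = {"ACTB": 50, "CD3E": 5, "GAPDH": 45}
deep = {"ACTB": 500, "CD3E": 50, "GAPDH": 450}
print(normalize_cell(shallow))
print(normalize_cell(deep))
```

After normalization the two cells have matching profiles, so the depth difference no longer looks like a difference in expression.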
Single-cell data analysis is only beginning to intersect with AI and machine learning, and it will be fascinating to watch the discoveries that unfold from this combination. “Today’s [single-cell analysis] platforms generate extraordinarily rich, high-dimensional data, capturing function, spatial context, and dynamic behavior,” says Hobbs. “But in practice, we compress that data into low-dimensional representations, and in doing so, we often discard the very signals we’re trying to understand.”
Today, newer AI models are capable of integrating multimodal single-cell data. This will allow machine learning to delve deeper into the data and find true biological signals that have so far gone undetected. “Instead of forcing data into forms that humans can easily visualize, we should move the data into systems that can handle its full complexity,” says Hobbs. “If we do this well, we move beyond describing cellular behavior to predicting it, and that’s what will ultimately unlock the next generation of biology.”
References
1. Korsunsky, I., Millard, N., Fan, J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019). https://doi.org/10.1038/s41592-019-0619-0
2. Young, M.D., Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151 (2020). https://doi.org/10.1093/gigascience/giaa151