“Proteomics is at an inflection point where genetics was 15–20 years ago,” declared Chris Whelan, Director of Data Science, Neuropsychiatry at Johnson & Johnson and chair of the UK Biobank Pharma Proteomics Project (UKB-PPP), at the recently held 2022 American Society of Human Genetics (ASHG) Annual Meeting. “It’s a very exciting time!”

Two decades ago, the first whole human genome was sequenced, and now, sequencing half a million genomes is an attainable goal. “Genomics has paved the way, with over 1 million genomes and 10 million exomes sequenced, but fewer than 1% of all genomic markers have been characterized,” said Theo Platt, Vice President, Data Engineering & Software at Seer.

Ultimately, genes encode proteins and regulate their production. Genetics without proteomics therefore requires far-reaching assumptions about protein production, leaving genetic studies somewhat incomplete. “Today, we understand very little about the proteome,” said Platt. “Understanding proteins and proteomes has a complexity that has been largely insurmountable technologically.”

Only within the past few years has technology for proteomic analysis advanced enough to perform population-scale studies, generating plasma proteomic data for tens of thousands of individuals. Combining large-scale genetics studies with proteomics, i.e., proteogenomics, has proven powerful in revealing novel drug targets, biomarkers, genetic and proteomic signatures of disease, and more. Achieving meaningful and unbiased large-scale proteomic studies, however, is a challenge.

Technologies for large-scale proteomics

A proteome is the set of all proteins within a given sample, predicted at ~2 million potential protein species (or proteoforms) in human cells once alternative splicing, single amino acid polymorphisms, and post-translational modifications are taken into account. However, identifying and quantifying all of these proteoforms in a sample is not currently possible; no matter the technology, only a subset of proteins can be detected. Each technology has different limitations that must be carefully weighed in large-scale experiments.
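To see why the numbers grow so quickly, consider a back-of-the-envelope calculation in Python; the per-gene figures here are purely illustrative, not measured values:

```python
# Illustrative combinatorics for a single hypothetical gene: splice isoforms,
# single amino acid polymorphisms (SAPs), and post-translational modification
# (PTM) sites multiply, which is how ~20,000 protein-coding genes can yield
# millions of potential proteoforms.
splice_isoforms = 3
sap_variants = 2
ptm_sites = 4                 # assume each site is either modified or not

proteoforms_per_gene = splice_isoforms * sap_variants * 2 ** ptm_sites
print(proteoforms_per_gene)   # 96 potential proteoforms from one gene alone
```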

Today, there are two approaches to proteomics: targeted technologies that use ligands to bind specific proteins, and unbiased approaches that use mass spectrometry (MS) to detect proteins. For decades, MS was virtually the only technology able to detect large numbers of proteins simultaneously in a given sample; however, the workflows that precede MS to separate proteins from samples (such as liquid chromatography) are complex and require too much equipment, manual labor, and time to be scalable.

To overcome many limitations of conventional MS proteomics, Seer has developed proprietary engineered nanoparticles that reproducibly bind to a subset of proteins within a given sample. “When a nanoparticle comes in contact with a biological sample, it quickly forms a layer of biomolecules on its surface, called a protein corona," said Platt. “It leverages the innate way that proteins interact with each other in nature.”

Using several nanoparticles to separate proteins in one sample yields a proteomic subset that can be scalably analyzed using MS. In a proof-of-concept paper published in Nature Communications, this technology detected over 2,000 proteins across almost 150 plasma samples. Since the nanoparticles do not target specific proteins, the proteome that is revealed is unbiased, meaning many proteoforms can be detected.
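Conceptually, the gain from a nanoparticle panel is a set union: each particle's corona captures a different, overlapping slice of the proteome. A toy sketch with entirely hypothetical protein sets (not Seer's actual panel):

```python
# Toy illustration: each engineered nanoparticle's protein corona captures
# a different subset of plasma proteins, so pooling several particles
# broadens coverage ahead of a single MS analysis.
corona_np1 = {"ALB", "APOA1", "IL6"}
corona_np2 = {"ALB", "CRP", "PR3"}
corona_np3 = {"APOB", "IL6", "TNF"}

panel_coverage = corona_np1 | corona_np2 | corona_np3   # set union
print(len(panel_coverage), "unique proteins across the panel")  # 7
```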

Unbiased proteoform detection is a significant advantage of MS over other protein detection techniques; however, its throughput is inherently, and substantially, lower. Ligand-based detection, although targeted to specific proteoforms, allows for higher-throughput, faster analysis.

Olink Proteomics and SomaLogic are leading the way in scalable, targeted, ligand-based protein detection. SomaLogic’s technology has been used for large-scale studies of over 35,000 samples, while Olink’s technology has recently been used in a study of over 50,000 participants.

One reason these technologies scale is that their endpoint is the detection of DNA. “One of the challenges of performing multiple protein measurements at the same time is related to maintaining the specificity of the measurements in the presence of a large number of proteins,” explained Nebojsa Janjic, Chief Science Officer at SomaLogic. “We solved this problem by using high affinity reagents that essentially help us convert the protein measurement into a DNA measurement, which is much easier to do on a large scale.”

SomaLogic achieves protein detection using SOMAmers, synthetic single-stranded DNA (ssDNA) aptamers with protein-like appendages that bind 1:1 to a specific protein target. SOMAmers that are bound to proteins are eluted and uniquely detected using standard DNA quantification techniques. The SomaLogic platform can detect 7,000 proteins simultaneously.
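The logic of this conversion can be sketched in a few lines of Python. This is a conceptual toy, not SomaLogic's pipeline, and the barcode-to-protein mapping is invented; the point is that a 1:1 binding scheme turns DNA counting into relative protein quantification:

```python
from collections import Counter

# Hypothetical mapping of SOMAmer DNA barcode -> protein target (1:1 binding)
barcode_to_protein = {"ACGT": "IL6", "TTGC": "CRP", "GGAA": "PR3"}

# Barcodes recovered after eluting protein-bound SOMAmers
eluted_barcodes = ["ACGT", "ACGT", "GGAA", "TTGC", "ACGT", "GGAA"]

# Counting DNA barcodes stands in for counting the proteins they bound
protein_counts = Counter(barcode_to_protein[b] for b in eluted_barcodes)
print(protein_counts)  # Counter({'IL6': 3, 'PR3': 2, 'CRP': 1})
```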

Olink’s platform uses Proximity Extension Assay (PEA) technology, in which pairs of antibodies raised against the same protein are attached to complementary oligonucleotides carrying unique DNA sequences. When both antibodies bind their target protein in a sample, the oligonucleotides are brought into close proximity and can hybridize only to each other. Subsequent proximity extension creates unique DNA reporter sequences, which are amplified and quantified by real-time PCR. Olink’s platform is currently able to detect 3,000 proteins simultaneously.
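The quantification step at the end relies on standard real-time PCR arithmetic: each cycle roughly doubles the reporter, so a lower quantification cycle (Ct) implies more starting template, and hence more protein-bound antibody pairs. A minimal sketch of that generic math (not Olink's proprietary NPX normalization):

```python
def relative_abundance(ct_sample: float, ct_reference: float) -> float:
    """Fold change in starting reporter template relative to a reference,
    assuming ~100% PCR efficiency (i.e., doubling per cycle)."""
    return 2.0 ** (ct_reference - ct_sample)

# Hypothetical Ct values for one protein's DNA reporter in two samples:
# reaching threshold 3 cycles earlier implies ~8x more starting template.
print(relative_abundance(ct_sample=22.0, ct_reference=25.0))  # 8.0
```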

Whatever technology is incorporated into a large-scale genetics study, working at the scale of thousands, tens of thousands, or more samples takes thoughtful planning of study design and quality control.

Large-scale study design to mitigate variation

When analyzing thousands of samples, study design is a critical consideration: it must account for variation across samples so that results can be normalized for meaningful analysis. Olink Proteomics experienced these difficulties firsthand when participating in the UKB-PPP, a collaboration between the UK Biobank (UKB) and 13 biopharmaceutical companies to characterize the plasma proteomic profiles of 54,306 participants.

According to Klev Diamanti, Data Scientist at Olink Proteomics, “Randomization is the most important factor for normalization. It becomes extremely hard to remove non-randomization using data processing. We needed to take into account the collection time point, preparation, and storage of the samples.” The reality is that plasma samples taken for large-scale studies come from different places and times, are collected in various ways, are processed differently, and have variable storage conditions.

“Storage time is an important problem of biobanks,” said Diamanti. “Samples kept on site at the UK Biobank can be frozen for 10 years. Long-term freezing can affect proteins. Even at -80 [deg C], proteins can degrade, and some of them even increase. Up to 16% of proteins are affected and can lead to anything from 4-35% variance, which is a lot.”
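The generic way to quantify such an effect is to ask how much of a protein's measured variance is explained by storage time. A minimal sketch on simulated data (the figures Diamanti quotes come from the project's own analyses, not from this toy):

```python
import numpy as np

rng = np.random.default_rng(0)
storage_years = rng.uniform(0, 10, size=500)                      # simulated storage times
protein_level = 0.1 * storage_years + rng.normal(0, 1, size=500)  # simulated drift

# Fit a linear trend of protein level on storage time and compute R^2,
# the fraction of variance attributable to storage duration.
slope, intercept = np.polyfit(storage_years, protein_level, 1)
residuals = protein_level - (slope * storage_years + intercept)
r_squared = 1 - np.var(residuals) / np.var(protein_level)
print(f"variance explained by storage time: {r_squared:.1%}")
```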

In addition, plasma may have been collected from groups of patients, such as patients undergoing treatment for diabetes. These samples would likely be collected, processed, and stored in the same way, and then sent for analysis as a group. Since not all samples can be analyzed at the same time, the samples must be randomized as much as possible so that bias is not introduced into the results.
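A minimal sketch of this kind of randomization, using invented sample labels: shuffling before plating breaks the link between cohort and processing batch, so no plate carries only one patient group.

```python
import random

# Hypothetical samples from two cohorts that arrived (and would otherwise
# be processed) as intact groups.
samples = [f"diabetes_{i}" for i in range(40)] + [f"control_{i}" for i in range(56)]

random.seed(42)          # fixed seed so the plate layout is reproducible/auditable
random.shuffle(samples)  # break the link between cohort and processing order

PLATE_SIZE = 88          # e.g., a 96-well plate with 8 wells reserved for controls
plates = [samples[i:i + PLATE_SIZE] for i in range(0, len(samples), PLATE_SIZE)]
for n, plate in enumerate(plates, start=1):
    print(f"plate {n}: {len(plate)} samples, "
          f"{sum(s.startswith('diabetes') for s in plate)} from the diabetes cohort")
```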

The type of assay must also be considered in large-scale, high-throughput analysis. When endpoint detection uses real-time PCR, for example, outputs need to fall in the linear range, which occurs only within a certain concentration of starting DNA template. Samples may therefore need to be analyzed at several dilutions to keep each protein within the assay’s dynamic range.
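The underlying arithmetic is simple: a high-abundance protein must be diluted until its expected concentration falls below the assay's upper quantification limit. A hedged sketch with illustrative numbers (not platform specifications):

```python
def dilution_factor(concentration: float, upper_limit: float) -> int:
    """Smallest 10-fold serial dilution bringing the sample under the
    upper limit of the assay's linear range."""
    factor = 1
    while concentration / factor > upper_limit:
        factor *= 10
    return factor

# Illustrative plasma concentrations in ng/mL: an albumin-range protein,
# a mid-abundance protein, and a low-abundance cytokine.
for conc in (5e7, 2e3, 0.5):
    print(f"{conc:g} ng/mL -> dilute 1:{dilution_factor(conc, upper_limit=1e4)}")
```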

When these many details are carefully considered, and the data can be normalized and trusted, deep insights can be drawn about the relationships among genes, proteins, and disease.

Therapeutic potential of linking genes and proteins

Ultimately, large-scale proteogenomics studies can be used to further precision medicine. “Linking a gene mutation to what happens to the proteins downstream is information that you can actually act on with therapeutic intervention,” said Janjic. “Most of the drug targets are proteins, but research has shown that genetic support for drug targets increases the probability of therapeutic success, so having both pieces of information (genetics and proteomics) is very useful.”

Protein quantitative trait loci (pQTLs) are genomic loci associated with protein abundance that, when identified, can shed light on causal links between genetic variants and disease, highlighting clinically important proteins. Of note, two separate large-scale proteogenomic studies have identified pQTLs that may be clinically relevant for autoinflammatory and autoimmune diseases.
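At its simplest, a single pQTL test is a regression of (normalized) protein abundance on allele dosage. The sketch below uses simulated data; real analyses adjust for covariates such as age, sex, and genetic principal components, and apply genome-wide multiple-testing correction:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
dosage = rng.integers(0, 3, size=2000)            # 0/1/2 copies of the tested allele
protein = 0.3 * dosage + rng.normal(0, 1, 2000)   # simulated cis effect on abundance

fit = linregress(dosage, protein)
print(f"beta = {fit.slope:.3f}, p = {fit.pvalue:.2e}")  # significant beta -> candidate pQTL
```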

In 2018, the Genomic atlas of the human plasma proteome was published in Nature (Sun et al.), identifying 1,927 genetic associations with 1,478 proteins. This study used SomaLogic’s aptamer-based approach to measure the relative concentrations of 3,622 plasma proteins, or protein complexes, in blood plasma from 3,301 participants of the INTERVAL study. A subset of pQTLs was also validated using the Olink platform.

This study identified a pQTL that has led to a therapeutic hypothesis for the autoimmune disease anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis. This form of vasculitis is characterized by vascular inflammation and autoantibodies to the neutrophil protease proteinase 3 (PR3). Specifically, the authors showed that the vasculitis risk allele at PRTN3 (which encodes PR3) is associated with higher plasma levels of PR3, suggesting that eliminating, or tolerizing to, the PR3 protein may treat PR3+ ANCA vasculitis.

In 2022, the first results of the UKB-PPP study were published as a bioRxiv preprint titled Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants (Sun et al.). Using the Olink Explore 1536 platform, the study mapped pQTLs for 1,463 proteins, identifying 10,248 primary genetic associations, 85% of which were newly discovered.

The study also identified multiple trans-pQTL associations between inflammasome components and downstream effector proteins implicated in inherited autoinflammatory conditions. These associations involved genes encoding inflammasome scaffolding proteins, negative regulators of inflammasome activity, and GSDMD, which enables the non-canonical secretion of IL-18 and IL-1β and is an activator of pyroptosis. The results indicate a significant role for common genetic variation in inflammasome-mediated innate immune responses, which may lead to actionable genetic signatures of autoinflammatory diseases.

Ultimately, we’ve reached a new era of proteogenomic population-scale studies, which will undoubtedly accelerate the discovery of disease causation and development of novel therapeutics and biomarkers. It is an exciting time in biology!