Metabolomics is, essentially, basic biochemistry. It’s the study of how (small) molecules are taken apart, modified, and put together by biological processes, in locations as diverse as soil, the gut, and bodily fluids. Metabolites are building blocks, waste products, and contaminants. Many are parts of networks and pathways, both established and yet to be elucidated. They can serve as biomarkers. Yet many of these metabolites are known only through the traces they leave on chromatograms and mass spectra—“seen,” but not identified. These unidentified compounds make up the “dark metabolome.”

Here we look at the implications of not knowing what the vast majority of metabolites are, and some tools and methods—including curated databases and machine learning algorithms—being employed to address the challenge.

Into the dark

“If we run a high-resolution experiment on a biological sample, like muscle tissue, and we do an extract, we can literally detect 20,000–30,000 compounds that we don’t know what they are,” says Facundo Fernández, professor and Vasser-Woolley Chair in Bioanalytical Chemistry at Georgia Institute of Technology. Estimates of how much of that is knowable, given enough resources and today’s technology, range widely, from about 1% to 20–30%.

The remaining (up to) 99% consists of genuine biological metabolites, artifacts of the experimental process, and contaminants: dyes and surfactants, flame retardants added to carpets, and ingredients of sprays and deodorants “that end up everywhere. And when we have very sensitive equipment, this is potentially what we see,” says David Wishart, professor of Biological Sciences and of Computing Science at the University of Alberta. As long as these signals remain in the dark, there is no way to distinguish the biologically relevant from what is essentially experimental noise.

The metabolome is closer to the phenotype than even the genome, transcriptome, or proteome. Metabolites are regulators. “To understand the organism, you need to look at what it’s producing,” says Wishart.

Fernández adds that “We’re not going to understand how living organisms function unless we understand how the metabolome changes, [and] we can’t ignore the dark metabolome.”

Finding and identifying the features that change through development or with exposure to environmental stress, that discriminate control groups from diseased patients, or that flag potential drug targets requires instrumentation capable of resolving what’s there, as well as tools such as libraries, databases, and artificially intelligent software to identify what’s been resolved.

Hardware

Mass spectrometry (MS)-based metabolomics continues to be a popular choice for its sensitivity. It allows one to determine the accurate mass of a compound, which yields an elemental composition and aids the metabolite identification process. Yet “there can be hundreds of metabolites that share the same accurate mass,” says Baljit Ubhi, market manager, metabolomics and lipidomics, at SCIEX. So MS is generally followed by a fragmentation step (often termed MS/MS or MS2) that reveals what pieces (product, or daughter, ions) the compound can be broken into. These fragments, in turn, can be matched against a library of spectra, allowing higher confidence in the assignment of a molecule.
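As a rough illustration of what that library matching involves (a simplified sketch, not any vendor’s actual algorithm), the Python below scores an observed MS/MS peak list against reference spectra using a binned cosine similarity. The compound names, peak lists, and bin width are illustrative; real search tools add intensity weighting, noise filtering, and precursor-mass checks.

import math
from collections import defaultdict

def binned(spectrum, bin_width=0.01):
    # Collapse (m/z, intensity) peaks into discrete m/z bins.
    bins = defaultdict(float)
    for mz, intensity in spectrum:
        bins[round(mz / bin_width)] += intensity
    return bins

def cosine_score(query, reference, bin_width=0.01):
    # Cosine similarity between two peak lists: 1.0 means identical fragment patterns.
    q, r = binned(query, bin_width), binned(reference, bin_width)
    dot = sum(q[b] * r[b] for b in q.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

# Hypothetical reference library: compound name -> MS/MS peaks as (m/z, intensity)
library = {
    "citrate":   [(111.01, 100), (87.01, 45), (85.03, 30)],
    "glutamine": [(130.05, 100), (84.04, 60)],
}
observed = [(111.01, 95), (87.01, 50), (85.03, 25)]  # spectrum of an unknown feature

for name, ref in library.items():
    print(name, round(cosine_score(observed, ref), 2))  # citrate scores ~1.0

A score near 1.0 against a curated reference, together with a matching precursor mass, is what lets an unknown feature graduate to an annotated metabolite.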

MS can be preceded by liquid chromatography or ion mobility separation to improve the resolution. “If you have one peak that’s overlapping with ten others in the same mass space you’ve got a mixed spectrum and you’re never going to be able to definitively identify that metabolite,” she explains.

In fact, “the path to better metabolite annotation, and therefore better dark metabolome discovery, is the use of orthogonal molecular metrics that will describe a given molecule accurately,” and the more the better, opines Fernández. So not only, say, reversed-phase liquid chromatography (LC), but also a second, orthogonal LC dimension such as HILIC. “I’m personally an advocate of the use of ion mobility cross-sections as another parameter.”
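To see why each added metric matters, consider the toy candidate filter below, a sketch built on invented compounds, values, and tolerances: two candidates share an accurate mass, and only retention time and collision cross section tell them apart.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mz: float       # accurate mass-to-charge ratio
    rt_min: float   # reversed-phase LC retention time, in minutes
    ccs: float      # ion mobility collision cross section, in squared angstroms

def matches(feature, cand, mz_ppm=5.0, rt_tol=0.3, ccs_pct=2.0):
    # Keep a candidate only if every orthogonal measurement agrees within tolerance.
    mz_ok = abs(feature["mz"] - cand.mz) / cand.mz * 1e6 <= mz_ppm
    rt_ok = abs(feature["rt_min"] - cand.rt_min) <= rt_tol
    ccs_ok = abs(feature["ccs"] - cand.ccs) / cand.ccs * 100 <= ccs_pct
    return mz_ok and rt_ok and ccs_ok

candidates = [
    Candidate("isomer A", 180.0634, 2.1, 139.5),
    Candidate("isomer B", 180.0634, 4.8, 146.2),  # same mass, different RT and CCS
]
feature = {"mz": 180.0637, "rt_min": 4.7, "ccs": 146.0}  # one measured unknown

print([c.name for c in candidates if matches(feature, c)])  # -> ['isomer B']

Mass alone leaves both candidates standing; each extra dimension removes ambiguity that would otherwise keep the feature in the dark.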

Nuclear magnetic resonance (NMR) can also be used, but it doesn’t have the sensitivity to deal with complex mixtures. “Its purpose is in structural identification for specific compounds that are isolated in high quantities,” points out Oliver Fiehn, director of the West Coast Metabolomics Center at the University of California, Davis. He “and others have shown that even chiral isomers can be separated and identified without NMR. So NMR is not in high use anymore in metabolomics.”

Ideally, all the orthogonal techniques would be combined in a single, high-throughput instrument. For now, though, with LC, for example, “we’re talking about runs of 10, 15, 20 minutes, typically. Can we do a one- or two-minute separation before sending things into the MS? Do we get the same capacity? Can we detect the same number of peaks?” asks Fernández. “I think we’re getting there—we’re probably a few years away.”

Matching and making spectra

Chromatographic retention time, NMR, MS, and MS2 spectra, ion mobility, and even information about the number of acidic protons and other sub-structures such as phosphate or acetyl groups, all contribute pieces of the compound identity puzzle. These bits of data are generally compared with libraries to find the best matches. Yet if there is no match in the database, or there are just too many matches to make even educated guesses, then the metabolite remains part of the dark metabolome.

“It’s an informatics problem,” notes Ubhi. “There is a challenge of a lack of databases, and curated databases, specifically for LC-MS/MS data, which leaves the user interrogating only a small percentage of their dataset. The ultimate goal is how can we actually extract more knowledge from the data that we’ve collected.”

To be clear, there are plenty of libraries, amalgams, and meta-analyses, both open source and proprietary. Some, like BloodExposome.org, include “basically everything that people have reported,” says creator Fiehn. But that “doesn’t mean there’s not more.”

There are algorithms that look for similarities between unknowns and experimental spectra. Others use artificial intelligence to predict the results of a theoretical experiment—an MS/MS fragmentation spectrum, for example, or chemical bond fragmentations or rearrangements—and rank the predictions by likelihood. And databases continue to be created that house and catalog such in silico data.

The problem, historically, is that there are databases such as PubChem with millions of structures, but “99.99% are not biological, are not going to match anything,” Wishart says. His group created BioTransformer, which uses rules (like Phase I and Phase II enzyme transformations and gut metabolism) to mimic what goes on in the body. “People use it now to take any compound they want, run it through, and it simulates its metabolism.”
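BioTransformer itself operates on chemical structures, but the flavor of a rule-based approach can be conveyed with simple mass arithmetic: apply well-known biotransformation mass shifts (a few Phase II conjugations below) to a parent compound and check whether any predicted product matches an unassigned feature. The parent compound, feature masses, and tolerance in this sketch are chosen purely for illustration and are not BioTransformer’s actual rules or output.

# Common Phase II conjugation mass shifts (monoisotopic, in daltons)
PHASE_II_SHIFTS = {
    "glucuronidation": 176.03209,  # + C6H8O6
    "sulfation":       79.95682,   # + SO3
    "methylation":     14.01565,   # + CH2
    "acetylation":     42.01057,   # + C2H2O
}

def annotate(features, parent_mass, ppm_tol=10.0):
    # Match unknown feature masses against predicted conjugates of the parent compound.
    hits = []
    for name, shift in PHASE_II_SHIFTS.items():
        predicted = parent_mass + shift
        for f in features:
            if abs(f - predicted) / predicted * 1e6 <= ppm_tol:
                hits.append((f, name))
    return hits

parent = 151.06333  # e.g., acetaminophen (C8H9NO2), neutral monoisotopic mass
unknown_features = [327.09542, 231.02015, 288.90000]  # hypothetical dark features

print(annotate(unknown_features, parent))
# -> [(327.09542, 'glucuronidation'), (231.02015, 'sulfation')]

Scaled up to full enzyme rule sets and real structures, this is how rule-based simulation can propose identities for masses that no experimental library yet contains.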

Elucidating the dark metabolome is made even more challenging by the fact that (unlike the case with genomics, proteomics, and even lipidomics) metabolites aren’t made up of repeating units, and are thus structurally very diverse. But the host of tools being developed—this article has barely scratched the surface—will allow some much-needed light to be shone on this potential treasure trove.