The pace of discovery in cell signaling/signal transduction research is accelerating. Although papers on “signal transduction” date back to 1949, and the term “cell signaling” first appeared in a PubMed-indexed paper1 in 1981, more than half of all 458,056 articles on cell signaling/signal transduction have been published in the past 10 years.


Image: Human interactome. Functional modules in the “first human ‘all-by-all’ binary reference interactome map” from the Center for Cancer Systems Biology at the Dana-Farber Cancer Institute. Image made available under a CC-BY-NC-ND 4.0 International license.

Data on pathway members and member candidates have exploded. A reference map of the human protein interactome tracks 9,095 proteins involved in 64,006 binary interactions with one another.2 The resulting map of protein nodes and interaction connections has produced what at least one interactome researcher has called a “hairball”: a knot of interconnectedness that is maddeningly difficult to unravel.

The complexity would seem to cry out for machine learning, but the technique has penetrated more slowly than it has in other areas (such as copy number variant research). Just 42 signaling papers incorporated machine learning in 2018, though that is more than twice the number appearing in 2017 or 2016.


Image: Signaling publications. As attention to explicating signal pathways grows, a small but growing number of researchers are bringing machine-learning tools to bear. (D. McCormick from PubMed data)

It may be too early to call machine learning a trend in cell signal analysis, much less declare it a major force. At the same time, when Jia Xu and colleagues at IBM Watson Health reviewed the field, they concluded that “Integration of artificial intelligence (AI) approaches such as machine learning, deep learning, and natural language processing (NLP) to tackle the challenges of scalability and high dimensionality of data and to transform big data into clinically actionable knowledge is expanding and becoming the foundation of precision medicine.”3

Machine learning in brief

Among the many readily available machine-learning primers are several aimed specifically at biological research.4,5,6 Tools for machine learning and related analyses abound. The review by the IBM Watson Health team3 and the primer by James Zhou et al.5 include extensive lists, and several groups (such as Bing Zhang’s at Baylor7 and the shifting partnerships of Anton Buzdin and Nicolas Borisov at Sechenov University8) have been prolific tool producers.

Broadly, though, a machine-learning analysis breaks down into four steps: reduce dimensionality, train, predict, and confirm.

Reduce dimensionality. As the IBM Watson Health group noted, genomic and proteomic data in all their forms are complex. To reduce the number of variables and streamline calculations, researchers must usually simplify their data first.3 Ideally, the researcher can identify a handful of orthogonal factors: those that have maximum influence on the final picture without being affected by changes in the other variables. Dimensional reduction is like projecting a 3D blob of data onto a wall. Think of the metal sculptures that look like random tangles of rods but cast a recognizable picture when lit by a bright light from the right place (as in the accompanying photo of “Light” by sculptor Fred Eerdekens).


Image: Reducing dimensionality: Finding the right way to project a complex dataset onto a simplified representation that makes it easier to analyze. (Metal sculpture and photo of sculpture by Fred Eerdekens. © Fred Eerdekens. Used by permission.)
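To make the projection idea concrete, here is a minimal sketch using principal component analysis (PCA), one common way to find orthogonal factors. The article does not say which reduction method any of the cited groups used, so PCA is an assumption here, and the `expression` matrix is synthetic data invented for illustration.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto its top principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are orthogonal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ components.T, components, explained[:n_components]

# Hypothetical data: 200 samples, 50 measured features that are really
# driven by 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 50))
expression = hidden @ loadings + 0.1 * rng.normal(size=(200, 50))

projected, components, explained = pca_project(expression, n_components=2)
print(projected.shape)   # (200, 2): each sample is now described by 2 numbers
print(explained.sum())   # fraction of total variance kept by the projection
```

In this toy case two components retain nearly all of the variance because the data were built from two factors. Real omics data rarely collapse so cleanly, which is why choosing the "right wall" to project onto is a research decision rather than a formality.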





Train. The machine-learning algorithm is, at heart, an iterative approximation method (like Newton’s method, which bedevils first-semester calculus students) that builds its own equations by trial and error. It needs to be trained. There are many approaches to training, but a common one is to feed in case after case, each with all of its variables and a known end result. Eventually, the algorithm finds a calculation that approximates the right end result for each set of inputs.
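The trial-and-error loop can be sketched in a few lines. The example below trains a logistic-regression classifier by gradient descent, which is a different iterative method from Newton's but follows the same guess, check, and adjust pattern. The data, the learning rate, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def train_logistic(features, labels, learning_rate=0.1, n_steps=2000):
    """Fit weights by repeatedly nudging them to shrink the prediction error."""
    n_samples, n_features = features.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_steps):
        scores = features @ weights + bias
        predicted = 1.0 / (1.0 + np.exp(-scores))   # probability of class 1
        error = predicted - labels                   # how wrong each guess is
        weights -= learning_rate * (features.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias

def predict(features, weights, bias):
    """Assign class 1 wherever the fitted score is positive."""
    return (features @ weights + bias > 0).astype(int)

# Hypothetical cases: 5 input variables each, with a known end result (0 or 1).
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 5))
true_weights = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
labels = (features @ true_weights > 0).astype(int)

# Train on the first 200 cases, then check the rule on 100 unseen cases.
weights, bias = train_logistic(features[:200], labels[:200])
held_out_accuracy = (predict(features[200:], weights, bias) == labels[200:]).mean()
print(held_out_accuracy)
```

Holding back cases the algorithm never saw during training is the simplest computational check on whether it learned a general rule or merely memorized its examples.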

The Xu analysis stresses that data quality control is vital for both training and prediction.3 There’s another caveat: evolutionary relatedness among gene families might skew machine-learning results. J.D. Washburn, Hai Wang, and colleagues at Cornell, the U.S. Department of Agriculture, and the Chinese Academy of Agricultural Sciences warn that “applying [deep learning] methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions.”9 Two measures help counter “evolutionary bias.” First, include examples from multiple gene families in training and prediction data. Second, include orthologous genes (from different species, but derived from a common ancestor).

Predict. After the training, the algorithm is ready to set to work. The language becomes slippery here. The machine produces predictions rather than results: The reduced-dimension representation of the input data produces a reduced-dimensional prediction—a reduced-dimension signature that should correspond to a higher-dimension state. Mapping that signature back onto the real world is what the process is all about. Or that’s what one would think…

Confirm. …but confirmation is a real challenge. The joy and bane of the most exciting machine-learning results is this: They can be surprising, which means that they demand extraordinary laboratory confirmation.

As Xu and the IBM Watson Health group warned, AI and machine-learning results cannot be accepted in isolation: “well-designed studies with causal inference are needed to filter out biomarkers that have strong correlative effect but no real causative effects in tumorigenesis.”3

Surprising and unexpected results

Despite the exhaustive work required (the most successful current applications of machine learning to cell pathway analysis have involved nearly heroic confirmation studies), the results are exciting.

Consider Bruce Clurman and colleagues at the Fred Hutchinson Cancer Center, who have been unraveling functions of Fbw7 for more than a decade. Fbw7 (F-box/WD repeat-containing protein 7) is the substrate receptor of the SCF (SKP1-CUL1-F-box protein) ubiquitin ligase complex. It’s a recognized tumor suppressor; one of its functions is to target phosphorylated transcription factors for disposal. Mutated Fbw7 is often found in tumors, but how its mutations contribute to oncogenesis has remained murky. As the Clurman group has written, genes like Fbw7 “regulate diverse processes, and their complex biology has confounded mechanistic studies of carcinogenesis and targeted therapy development.”

To try to pin down how mutations in FBXW7 (the gene encoding Fbw7) contribute to tumor formation, the group looked up from the signal-pathway trees to scan the transcriptomic forest. They analyzed hundreds of genes, finding 123 candidates that showed mutation rates greater than 4% in at least 2 of 10 tumor types.10
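The candidate filter described above is simple to express in code. The sketch below applies the stated rule (a mutation rate greater than 4% in at least 2 of 10 tumor types) to a small invented table. The gene names and rates are hypothetical placeholders, not values from the study.

```python
import numpy as np

def select_candidates(mutation_rates, gene_names, min_rate=0.04, min_tumor_types=2):
    """Keep genes whose mutation rate exceeds `min_rate` in enough tumor types.

    mutation_rates: array of shape (n_genes, n_tumor_types), values in [0, 1].
    """
    tumor_types_over_threshold = (mutation_rates > min_rate).sum(axis=1)
    keep = tumor_types_over_threshold >= min_tumor_types
    return [gene for gene, flag in zip(gene_names, keep) if flag]

# Hypothetical rates for 3 genes across 10 tumor types.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
mutation_rates = np.array([
    [0.10, 0.06, 0.01, 0.00, 0.02, 0.03, 0.01, 0.00, 0.02, 0.01],  # over 4% in 2 types
    [0.20, 0.04, 0.03, 0.00, 0.01, 0.02, 0.01, 0.00, 0.02, 0.01],  # over 4% in 1 type
    [0.05, 0.05, 0.05, 0.00, 0.00, 0.01, 0.01, 0.00, 0.02, 0.01],  # over 4% in 3 types
])

candidates = select_candidates(mutation_rates, gene_names)
print(candidates)  # ['GENE_A', 'GENE_C']
```

Note that GENE_B is excluded because a rate of exactly 4% does not count as "greater than 4%"; whether the original analysis treated the boundary this way is not stated in the text.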

With collaborators from the University of Washington and a pair of data scientists with Oregon Health & Science University (OHSU), they sifted through data from The Cancer Genome Atlas to correlate mutations in Fbw7 with alterations in each of the 123 candidate genes in each of the 10 tumor types.

OHSU’s Mehmet Gönen (also affiliated with Koç University) and Adam Margolin (now at Mount Sinai School of Medicine) had developed “kernelized Bayesian transfer learning” (KBTL), a machine-learning tool for extremely complex datasets.11 Unlike many other packages, KBTL works on multiple related problems at once. It can thus train on multiple large datasets at the same time, sharing what it learns in one problem with processes at work on others. The developers say that this expedites finding features common to all of the comparisons.
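The idea of sharing what is learned across related problems can be illustrated with a much simpler stand-in. The sketch below is not KBTL and uses neither kernels nor Bayesian inference. It pools the data from several tasks to find one shared low-dimensional subspace (by PCA), then fits a separate least-squares model per task inside that subspace. All data, dimensions, and function names are assumptions made for illustration.

```python
import numpy as np

def fit_shared_subspace(task_features, n_components):
    """Learn one projection from the pooled data of every task."""
    pooled = np.vstack(task_features)
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return mean, vt[:n_components]

def fit_task_models(task_features, task_targets, mean, components):
    """Fit a separate linear model per task in the shared subspace."""
    models = []
    for features, targets in zip(task_features, task_targets):
        reduced = (features - mean) @ components.T
        design = np.column_stack([reduced, np.ones(len(reduced))])  # add intercept
        coefficients, *_ = np.linalg.lstsq(design, targets, rcond=None)
        models.append(coefficients)
    return models

# Hypothetical setup: 4 related tasks, 30 samples each, 40 features that all
# reflect the same 3 hidden factors; each task reads those factors differently.
rng = np.random.default_rng(2)
loadings = rng.normal(size=(3, 40))
task_features, task_targets = [], []
for _ in range(4):
    hidden = rng.normal(size=(30, 3))
    task_features.append(hidden @ loadings + 0.05 * rng.normal(size=(30, 40)))
    task_targets.append(hidden @ rng.normal(size=3))

mean, components = fit_shared_subspace(task_features, n_components=3)
models = fit_task_models(task_features, task_targets, mean, components)
print(len(models), models[0].shape)  # one small model per task
```

Each task alone has fewer samples (30) than features (40), so fitting it in isolation would be underdetermined. Pooling the tasks to learn the subspace is what makes the per-task fits tractable, which is the general appeal of multitask approaches.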

What the Clurman group found stunned them. Finding strong effects from Fbw7 mutations in all of the tumors studied was exciting. Finding that the mutations affected mitochondrial processes in every case was, to put it mildly, unexpected. Every kind of Fbw7 mutation (changes, knockdowns, or losses) increased mitochondrial gene expression.

In the exhaustive confirmation phase, examination of metabolic signatures showed that the mutations ramped up mitochondrial respiration. In some cases, mutations prompted the mitochondria to increase glucose metabolism. Others (including knockouts) not only boosted metabolic activity but also allowed the mitochondria to burn other fuels (such as glutamine).

“Surprisingly,” the Clurman team writes, “Fbw7 mutations shifted cellular metabolism toward oxidative phosphorylation [and] …revealed unexpected metabolic reprogramming and possible therapeutic targets in Fbw7-mutant cancers,” and the approach “provides a framework to study other complex, oncogenic mutations.”10

More on the way

Other big-data studies are producing results that require thought and confirmation but usher in new ways of understanding and treating cancer.

In colorectal cancer research, for example, the Clinical Proteomic Tumor Analysis Consortium compared whole exome, copy-number array, RNA-seq, miRNA-seq, and proteomic data from tumor cells and “normal adjacent tissues” (NATs) from 110 colon cancer patients. Ultimately, they wrote, “proteomics identified an association between decreased CD8 T cell infiltration and increased glycolysis in microsatellite instability-high (MSI-H) tumors, suggesting glycolysis as a potential target to overcome the resistance of MSI-H tumors to immune checkpoint blockade.”12

Another group (from Seoul National University and Vietnam National University) searched for motifs that predict better or worse clinical outcomes.13 Along with markers for poor disease-free survival, they found that enrichment for two biomarkers associated with better outcomes (CXCL8 and CXCL11) was associated with epithelial cell signaling in Helicobacter pylori infection and inflammatory processes.

References

1. W. E. Gall and G. M. Edelman, "Lateral diffusion of surface molecules in animal cells and tissues," Science, vol. 213, no. 4510, pp. 903-905, 1981.

2. K. Luck, D.-K. Kim, L. Lambourne, K. Sprohn, D. E. Hill, M. Vidal, F. P. Roth and M. A. Calderwood, "A reference map of the human protein interactome," bioRxiv, pp. 1-64, 10 April 2019.

3. J. Xu, P. Yang, S. Xue, B. Sharma, M. Sanchez-Martin, F. Wang, K. A. Beaty, E. Dehan and B. Parikh, "Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives," Hum Genet, vol. 138, no. 2, pp. 109-124, 2019.

4. D. Bzdok, M. Krzywinski and N. Altman, "Machine learning: a primer," Nature Methods, pp. 1119-1120, December 2017.

5. J. Zhou, M. Huss, A. Abid, P. Mohammadi, A. Torkamani and A. Telenti, "A primer on deep learning in genomics," Nature Genetics, vol. 51, no. 1, pp. 12-18, 2019.

6. N. Kriegeskorte and T. Golan, "Neural network models and deep learning: a primer for biologists," Current Biology, vol. 29, no. 7, 2019.

7. Zhang Lab

8. "PubMed Search for papers by Nicolas/Nikolay Borisov and Anton A Buzdin," [Online]. Search results. [Accessed 9 Sept. 2019].

9. J. D. Washburn, M. K. Mejia-Guerra, G. Ramstein, K. A. Kremling, R. Valluru, E. S. Buckler and H. Wang, "Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence," Proc Natl Acad Sci U S A, vol. 116, no. 12, pp. 5542-5549, 2019.

10. R. J. Davis, M. Gönen, D. H. Margineantu, S. Handeli, J. Swanger, P. Hoellerbauer, P. J. Paddison, H. Gu, D. Raftery, J. E. Grim, D. M. Hockenbery, A. A. Margolin and B. E. Clurman, "Pan-cancer transcriptional signatures predictive of oncogenic mutations reveal that Fbw7 regulates cancer cell oxidative metabolism," Proc Natl Acad Sci U S A, vol. 115, no. 21, pp. 5462-5467, 22 May 2018.

11. M. Gönen and A. A. Margolin, "Kernelized Bayesian transfer learning," in Twenty-Eighth AAAI Conference on Artificial Intelligence, Quebec, Canada, 2014.

12. S. Vasaikar, C. Huang, X. Wang, V. A. Petyuk, S. R. Savage, B. Wen, Y. Dou, Y. Zhang, Z. Shi, O. A. Arshad, M. A. Gritsenko, L. J. Zimmerman, J. E. McDermott, T. R. Clauss, R. J. Moore, R. Zhao, M. E. Monroe, Y. T. Wang, M. C. Chambers, B. Zhang, K. D. Rodland, D. C. Liebler, T. Liu and Clinical Proteomic Tumor Analysis Consortium, "Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities," Cell, vol. 177, no. 4, pp. 1035-1049, 2019.

13. N. P. Long, S. Park, N. H. Anh, T. D. Nghi, S. J. Yoon, J. H. Park, J. Lim and S. W. Kwon, "High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer," Int J Mol Sci, vol. 20, no. 12, p. E296, 2019.