The Biology and Laboratory Paradigm Shift: A Review of Machine Learning and Artificial Intelligence in Biomolecular Data Interpretation and Predictive Modeling
Abstract
Background: The life sciences are experiencing an explosion of data from high-throughput genomics, proteomics, and metabolomics. Interpreting these complex data sets remains a major challenge, and it has arisen in parallel with rapid developments in artificial intelligence (AI) and machine learning (ML).
Aim: This review categorizes the contributions of AI/ML to biomolecular data science between 2015 and 2024, focusing on their use in multi-omics analysis, protein structure prediction, and experimental automation.
Methods: We performed a systematic literature review focusing on the application of computational models such as deep neural networks, graph neural networks, and transformer architectures to diverse biomolecular data.
Results: The reviewed literature indicates that AI/ML has fundamentally changed the discipline. These technologies facilitate the discovery of new biomarkers and drug targets from multi-omics data and have enabled breakthroughs in protein structure prediction, most notably with AlphaFold2. In addition, AI is now automating experimental design, enabling closed-loop systems that accelerate discovery.
Conclusion: AI and ML are no longer ancillary tools but intrinsic drivers of a new paradigm in molecular biology. Although challenges in data quality and interpretability persist, incorporating AI is essential for decoding the patterns of complex biological systems and for developing personalized medicine.
Authors
Copyright (c) 2024 Saeed Ali Alasmari, Abdulmajeed Saad Bin Baz, Mohammed Awadh Alshehri, Saad Salem Aldawsari, Awadh Jarallah Alkaabi, Saad Mahdi Saleh Alamri, Turki Saeed Alwadaie, ALhanouf Mohammed Moredh, Turkia Mohammed Alharthi, Abdullah Ahmed Alamer, Abdullah Salman Al Salman

This work is licensed under a Creative Commons Attribution 4.0 International License.
