
NemaLife News

Big Data and Small Worm: Towards Foundational AI Models for Bioactive Discovery

Writer: Pragya Srivastava

The quest for bioactives that enhance overall health and well-being has captivated researchers, health practitioners, and consumers across the world for centuries. This pursuit fuses traditional wisdom with cutting-edge scientific research, and it has generated a vast repository of knowledge [1–3] spanning traditional medicine, high-throughput screening, scientific studies, and clinical trials. Leveraging this repository is pivotal for steering future breakthroughs: pinpointing beneficial compounds, predicting new synergies, and revealing fresh perspectives on the intricate interplay between bioactives and human well-being.


But how do we harness such expansive data, scattered across time, knowledge systems, and organisms, and channel it towards our goals of bioactive discovery and product innovation? Here, generative AI emerges as a transformative tool, capable of unlocking new insights and accelerating scientific and translational discovery [4]. Powered by comprehensive datasets, scientific understanding, and contextual awareness of human health, such AI systems offer transformative potential in biomedical research, precision medicine, and personalized healthcare [5]. By integrating and analyzing diverse data sources, generative AI can transform bioactive discovery, treatment optimization, and our overall approach to health and disease management [6].


However, before generative AI tools can assist us, we first need to assist AI: it cannot discern patterns in complex biological and biomedical data, or generate actionable insights, without the right inputs. We need to provide AI with the right 'nourishment' to learn, 'grow', and become a formidable augmentation of our capabilities. This nutritious 'diet' consists of complete, harmonized, multimodal foundational datasets, meticulously curated to minimize noise and redundancy. These datasets must also be enriched with comprehensive scientific and medical context, which serves as the bedrock for AI's understanding of biological complexity. Creating such high-quality, context-rich datasets is our first significant hurdle [7,8].

Generative AI models offer transformative potential for bioactive discovery but require high-quality, integrated foundational datasets spanning diverse knowledge systems, timescales, and organisms for accurate, actionable insights.

Challenges in building AI models

Even though a vast amount of data on bioactives exists from multiple biological models and human trials, it is challenging to integrate them to build foundational datasets for AI [9,10]. Some specific challenges include:


  • Heterogeneities: Integrating data is complicated first by several kinds of heterogeneity. On the biological level, studies may have been conducted on different organisms or diverse populations. On the technical level, differences in experimental methods, compound composition, and interpretation, as well as data drift over time, can all introduce heterogeneity across datasets.


  • Inconsistencies: Differences in sources of data, terminology, formats, structures, metadata, and semantics make it difficult to integrate data in a seamless manner.


  • Data quality: Data collected from different sources may differ in accuracy, noise, redundancy, and bias, and these flaws may be further amplified by AI. Removing them to build relevant, high-quality datasets can be time-consuming.


  • Data accessibility and retrieval: Data silos, data protection and compliance regulations, and suboptimal data-storage strategies can make it impractical to use existing data to build datasets for AI.
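To make the inconsistency and data-quality challenges above concrete, here is a minimal Python sketch of harmonizing records from two sources into one schema. It is an illustration under stated assumptions, not a description of any existing pipeline: the field names, the synonym table, the unit table, and the `AssayRecord`, `harmonize`, and `deduplicate` names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical synonym table: maps source-specific compound names
# to a single canonical label (assumed for illustration only).
COMPOUND_SYNONYMS = {
    "vitamin c": "ascorbic acid",
    "l-ascorbic acid": "ascorbic acid",
}

# Conversion factors from each assumed unit spelling to micromolar.
UNIT_TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "µM": 1.0, "μM": 1.0, "mM": 1e3}


@dataclass(frozen=True)
class AssayRecord:
    compound: str   # canonical compound name
    organism: str   # normalized organism label
    dose_um: float  # dose converted to micromolar
    source: str     # provenance of the record


def harmonize(raw: dict, source: str) -> AssayRecord:
    """Map one raw record onto the common schema."""
    name = raw["compound"].strip().lower()
    compound = COMPOUND_SYNONYMS.get(name, name)
    unit = raw["unit"].strip()
    if unit not in UNIT_TO_MICROMOLAR:
        raise ValueError(f"unknown concentration unit: {unit!r}")
    dose_um = float(raw["dose"]) * UNIT_TO_MICROMOLAR[unit]
    organism = raw["organism"].strip().lower()
    return AssayRecord(compound, organism, dose_um, source)


def deduplicate(records):
    """Keep the first record for each (compound, organism, dose) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec.compound, rec.organism, round(rec.dose_um, 6))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


if __name__ == "__main__":
    lab_a = harmonize(
        {"compound": "Vitamin C", "organism": "C. elegans",
         "dose": "0.5", "unit": "mM"}, source="lab_a")
    lab_b = harmonize(
        {"compound": "L-ascorbic acid ", "organism": "c. elegans",
         "dose": 500, "unit": "uM"}, source="lab_b")
    print(deduplicate([lab_a, lab_b]))
```

The two example records use different names, units, and capitalization for the same measurement. After harmonization they collapse to a single entry. Real datasets would need curated ontologies and provenance tracking rather than a hand-written lookup table.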


Given these complex challenges, the goal becomes clear: to rapidly and cost-effectively generate high-quality, consistent, and complete datasets that can fuel meaningful advancements in AI-driven bioactive discovery.


C. elegans: A 4-time Nobel Prize winner

In addressing this need, we can once again turn to an unlikely hero in biomedical research: the microscopic nematode C. elegans. This tiny organism, which has contributed to multiple Nobel Prize-winning discoveries over the decades, may offer yet another innovative solution [11]. C. elegans has become a remarkable powerhouse in biological research, making waves in genetics, neurobiology, and the study of developmental processes [12,13].

Caenorhabditis elegans is a celebrated preclinical model with multiple organs including the nervous, muscular, reproductive, and gastrointestinal systems. Adapted from: Ann K. Corsi, Bruce Wightman, and Martin Chalfie. WormBook, 2015.

What makes this little worm so captivating is its simplicity: with a fully mapped genome and a well-defined neural connectome, researchers can easily manipulate its genetic makeup, paving the way for high-throughput studies that generate vast datasets. Its significance as a model organism is nothing short of extraordinary: work on it has contributed to four Nobel Prizes. From uncovering the secrets of programmed cell death to revolutionizing gene silencing techniques and illuminating the mysteries of microRNA regulation, C. elegans has proven time and again that big discoveries can come from the smallest creatures.


Complete and massive datasets from a tiny worm

In addition to its scientific stardom, C. elegans also shines when it comes to generating large, high-quality, multimodal datasets to power AI for several reasons [14–16]:


  • High-throughput screening: The short lifespan, high reproductive rate, and ease of maintenance and manipulation of C. elegans, combined with high-throughput experimental methods such as microfluidics, enable parallel screening of synchronized populations of tens of thousands of worms.


  • Cost-effectiveness: Working with C. elegans significantly reduces resource burdens compared to mammalian studies, thanks to its low maintenance costs and rapid life cycle of approximately three days, which allows for quick generation of large sample sizes.


  • Extensive scientific studies: As a well-studied organism with a fully sequenced genome and well-characterized developmental pathways and connectome, C. elegans offers a rich repository of existing scientific literature with 2000+ publications per year. This extensive background enhances contextual understanding and supports the generation of robust datasets for AI applications.


  • Living proof data from multiple timescales: Experiments with C. elegans can capture behaviors over various timescales—from seconds to hours to days—allowing researchers to generate structured datasets that reflect both immediate responses to stimuli and longer-term developmental processes, including aging.


  • Completeness of the data: Molecular discovery methods that target specific pathways, and in vitro methods generally, often focus on a single modality. In a living organism, even one as simple as C. elegans, biological interdependencies span multiple scales and cannot be captured by data from one or a few modalities; complete datasets that capture behavior at as many scales as possible are needed. C. elegans is an ideal organism for generating data across multiple modalities, including genomics, transcriptomics, proteomics, metabolomics, and whole-organism phenotypes. The completeness of the datasets obtainable from C. elegans is crucial for training AI models capable of making accurate predictions.

Worm-to-human AI models for bioactive discovery and validation to improve human health and wellbeing.

The magic of C. elegans-powered AI models doesn’t end with training on its foundational datasets. This remarkable organism doubles as an unparalleled test bed for validating AI-driven predictions and evaluating novel compounds. With its easily quantifiable phenotypic behaviors, dose-dependent responses, and addiction-like behavioral patterns, C. elegans offers a robust system for high-throughput screening of bioactive candidates.


What truly sets C. elegans apart is the sheer depth and completeness of the data it provides. These holistic datasets, when paired with AI models and existing scientific knowledge, have the power to uncover hidden patterns and untangle complex biological interdependencies like never before [17,18]. The result? A groundbreaking shift in bioactive discovery—faster insights, smarter predictions, and transformative implications for human health.


References:

  1. Waltenberger, B., Mocan, A., Šmejkal, K., Heiss, E. H. & Atanasov, A. G. Natural products to counteract the epidemic of cardiovascular and metabolic disorders. Molecules 21, 807 (2016).

  2. Harvey, A. L., Edrada-Ebel, R. & Quinn, R. J. The re-emergence of natural products for drug discovery in the genomics era. Nat. Rev. Drug Discov. 14, 111–129 (2015).

  3. Sorrenti, V., Burò, I., Consoli, V. & Vanella, L. Recent Advances in Health Benefits of Bioactive Compounds from Food Wastes and By-Products: Biochemical Aspects. International Journal of Molecular Sciences 24 (2023). https://doi.org/10.3390/ijms24032019.

  4. King, R. D. et al. The automation of science. Science 324, 85–89 (2009).

  5. Mak, K. K. & Pichika, M. R. Artificial intelligence in drug development: present status and future prospects. Drug Discov Today 24, 773–780 (2019).

  6. Dzobo, K., Adotey, S., Thomford, N. E. & Dzobo, W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS 24, 247–263 (2020).

  7. AI-Ready Datasets: The Key to Optimizing Foundation Models in Biomedical R&D. https://www.elucidata.io/blog/creating-ai-ready-datasets-for-foundation-models-in-biomedical-r-and-d.

  8. Fukushima, R. The Hidden Complexity of Scaling a Healthcare Data Business. Medium (Jan 2025). https://medium.com/@ryfukushima/the-hidden-complexity-of-scaling-a-healthcare-data-business-595fb578b2b8.

  9. Kondratyeva, L., Alekseenko, I., Chernov, I. & Sverdlov, E. Data Incompleteness May form a Hard-to-Overcome Barrier to Decoding Life’s Mechanism. Biology (Basel) 11, (2022).

  10. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15, (2018).

  11. Corsi, A. K., Wightman, B. & Chalfie, M. A Transparent Window into Biology: A Primer on Caenorhabditis elegans. Genetics 200, 387–407 (2015).

  12. Ardiel, E. L. & Rankin, C. H. An elegant mind: Learning and memory in Caenorhabditis elegans. Learning and Memory 17, 191–201 (2010).

  13. Bargmann, C. I. Neurobiology of the Caenorhabditis elegans genome. Science 282, 2028–2033 (1998).

  14. Pati, A. et al. CMGSDB: Integrating heterogeneous Caenorhabditis elegans data sources using compositional data mining. Nucleic Acids Res 36, (2008).

  15. Hutter, H. & Moerman, D. Big Data in Caenorhabditis elegans: Quo vadis? Molecular Biology of the Cell 26, 3909–3914 (2015). https://doi.org/10.1091/mbc.E15-05-0312.

  16. Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787 (2010).

  17. Thomas, A. et al. Topological Data Analysis of C. elegans Locomotion and Behavior. Front Artif Intell 4, (2021).

  18. Helm, A., Blevins, A. S. & Bassett, D. S. The growing topology of the C. elegans connectome. Preprint at https://doi.org/10.1101/2020.12.31.424985 (2021).
