Generalized biological foundation model with unified nucleic acid and protein language

Yong He*, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, Feng Zhu, Edward C. Holmes, Jieping Ye, Jun Li, Yuelong Shu, Mang Shi*, Zhaorong Li*

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review


Abstract

The language of biology, encoded in DNA, RNA and proteins, forms the foundation of life but remains challenging to decode owing to its complexity. Traditional computational methods often struggle to integrate information across these molecules, limiting a comprehensive understanding of biological systems. Advances in natural language processing with pre-trained models offer possibilities for interpreting biological language. Here we introduce LucaOne, a pre-trained foundation model trained on nucleic acid and protein sequences from 169,861 species. Through large-scale data integration and semi-supervised learning, LucaOne shows an understanding of key biological principles, such as DNA–protein translation. Using few-shot learning, it effectively comprehends the central dogma of molecular biology and performs competitively on tasks involving DNA, RNA or protein inputs. Our results highlight the potential of unified foundation models to address complex biological questions, providing an adaptable framework for bioinformatics research and enhancing the interpretation of life’s complexity.

© The Author(s) 2025

Original language: English
Pages (from-to): 942-953
Journal: Nature Machine Intelligence
Volume: 7
Issue number: 6
Publication status: Published - 18 Jun 2025

Funding

This work was supported by the National Natural Science Foundation of China (82341118). M.S. is funded by the Shenzhen Science and Technology Program (KQTD20200820145822023), the Guangdong Province ‘Pearl River Talent Plan’ Innovation and Entrepreneurship Team project (2019ZT08Y464), and the Guangzhou National Laboratory Major Project (GZNL2023A01001). Y.P. is funded by the National Natural Science Foundation of China (NSFC) Basic Research Project for Doctoral Students (grant number 323B2018).

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
