TY - JOUR
T1 - Generalized biological foundation model with unified nucleic acid and protein language
AU - He, Yong
AU - Fang, Pan
AU - Shan, Yongtao
AU - Pan, Yuanfei
AU - Wei, Yanhong
AU - Chen, Yichang
AU - Chen, Yihao
AU - Liu, Yi
AU - Zeng, Zhenyu
AU - Zhou, Zhan
AU - Zhu, Feng
AU - Holmes, Edward C.
AU - Ye, Jieping
AU - Li, Jun
AU - Shu, Yuelong
AU - Shi, Mang
AU - Li, Zhaorong
PY - 2025/6/18
Y1 - 2025/6/18
N2 - The language of biology, encoded in DNA, RNA and proteins, forms the foundation of life but remains challenging to decode owing to its complexity. Traditional computational methods often struggle to integrate information across these molecules, limiting a comprehensive understanding of biological systems. Advances in natural language processing with pre-trained models offer possibilities for interpreting biological language. Here we introduce LucaOne, a pre-trained foundation model trained on nucleic acid and protein sequences from 169,861 species. Through large-scale data integration and semi-supervised learning, LucaOne shows an understanding of key biological principles, such as DNA–protein translation. Using few-shot learning, it effectively comprehends the central dogma of molecular biology and performs competitively on tasks involving DNA, RNA or protein inputs. Our results highlight the potential of unified foundation models to address complex biological questions, providing an adaptable framework for bioinformatics research and enhancing the interpretation of life’s complexity. © The Author(s) 2025
UR - http://www.scopus.com/inward/record.url?scp=105008314635&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-105008314635&origin=recordpage
U2 - 10.1038/s42256-025-01044-4
DO - 10.1038/s42256-025-01044-4
M3 - RGC 21 - Publication in refereed journal
SN - 2522-5839
VL - 7
SP - 942
EP - 953
JO - Nature Machine Intelligence
JF - Nature Machine Intelligence
IS - 6
ER -