
TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

Yiqing Shen (Co-first Author), Zan Chen (Co-first Author), Michail Mamalakis (Co-first Author), Yungeng Liu (Co-first Author), Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang*

*Corresponding author for this work

Research output: RGC 32 - Refereed conference paper (with host publication), peer-reviewed

Abstract

The structural similarities between protein sequences and natural languages have led to parallel advances in deep learning across both domains. While large language models (LLMs) have made substantial progress in natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully exploit the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model designed specifically for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLM-Dataset, a dataset comprising 17.46 billion tokens of text and protein sequences for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B outperforms GPT-4 on ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, with 62.18% accuracy. Leveraging TourSynbio-7B's enhanced protein sequence understanding, we introduce TourSynbio-Agent, a framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet-lab case studies on vanilla key enzyme modification and steroid compound catalysis.
Our results show that this combination facilitates protein engineering tasks in wet labs, leading to higher positive rates, improved mutations, shorter delivery times, and increased automation. The model weights are available at https://huggingface.co/tsynbio/Toursynbio and the code at https://github.com/tsynbio/TourSynbio. © 2024 IEEE.
Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine
Editors: Mario Cannataro, Huiru (Jane) Zheng, Lin Gao, Jianlin (Jack) Cheng, João Luís de Miranda, Ester Zumpano, Xiaohua Hu, Young-Rae Cho, Taesung Park
Publisher: IEEE
Pages: 2382-2389
ISBN (Electronic): 9798350386226
ISBN (Print): 9798350386233
Publication status: Published - Dec 2024
Externally published: Yes
Event: 2024 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2024) - Lisbon, Portugal
Duration: 3 Dec 2024 - 6 Dec 2024
Internet address: https://ieeebibm.org/BIBM2024/

Publication series

Name: Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM
ISSN (Print): 2156-1125
ISSN (Electronic): 2156-1133

Conference

Conference: 2024 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2024)
Abbreviated title: BIBM 2024
Place: Portugal
City: Lisbon
Period: 3/12/24 - 6/12/24

Research Keywords

  • AI Agent
  • Deep Learning
  • Multi-modal Large Model
  • Protein Engineering
