ViraLM: empowering virus discovery through the genome foundation model

Cheng Peng (Co-first Author), Jiayu Shang (Co-first Author), Jiaojiao Guan, Donglin Wang, Yanni Sun*

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review


Abstract

Motivation: Viruses, with their ubiquitous presence and high diversity, play pivotal roles in ecological systems and public health. Accurate identification of viruses in various ecosystems is essential for comprehending their variety and assessing their ecological influence. Metagenomic sequencing has become a major strategy to survey the viruses in various ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising in novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency in virus search results produced by available tools further highlights the urgent need for a more robust tool for virus identification.
Results: In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By using the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.
Availability and implementation: The source code of ViraLM is available via: https://github.com/ChengPENG-wolf/ViraLM.
© The Author(s) 2024. Published by Oxford University Press.
Original language: English
Article number: btae704
Journal: Bioinformatics
Volume: 40
Issue number: 12
Online published: 23 Nov 2024
Publication status: Published - Dec 2024

Funding

This work was supported by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) [11209823], the Hong Kong Innovation and Technology Fund (ITF) [MRP/071/20X], and the City University of Hong Kong.

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
