Accurate Characterization of Bacteriophages by Integrating Their Properties with Deep Learning Models

Project: Research


Bacteriophages (“phages”), viruses that mainly infect bacteria, are highly abundant and ubiquitous. They can kill their bacterial hosts or integrate their sequences into the hosts’ genomes. Thus, they play vital roles in regulating the composition and function of microbial communities. Recently, phage therapy has regained attention because of growing concern over antibiotic-resistant bacteria. These bacteria evolve to defeat antibiotic medicines and have become a major threat to global health. Phage therapy is one promising strategy for treating these bacterial infections. Despite the importance of phages, novel phages awaiting discovery and characterization constitute a large portion of “viral dark matter.” Recent advances in high-throughput sequencing, particularly metagenomic sequencing, have made novel phage discovery significantly faster. However, a major barrier to leveraging all the data for phage study is the lack of optimized computational methods and tools. Toward our long-term goal of investigating “what phages are there” and “what do they do in various habitats”, we propose three research objectives.

The first objective is to detect phage contigs in noisy and heterogeneous metagenomic data, which is a primary resource for new phage discovery. The second objective is to classify novel phages into their taxonomic groups, which can provide essential knowledge of their evolution and functions. The third objective is to develop a computational method to identify the hosts of new phages, which can provide key insights into phages’ roles in the microbiome and fundamental knowledge for designing phage therapy.

To achieve the three objectives, we must tackle several major challenges. The first challenge is the high diversity of phages, which can defeat even the most powerful sequence comparison methods for phage classification and detection. Second, thousands of phage genera form a long-tailed distribution, so the data are highly imbalanced, and phages from rare groups are often misclassified into large groups. The third challenge is the limited characterization of phage-host interactions: the interactions known so far are just the tip of the iceberg compared to all phages on the planet.

We will tackle these challenges using modern machine learning models that can extract more abstract features, beyond sequence conservation, from large amounts of sequencing data. By integrating phage properties with customized learning models, our research is expected to advance the field of computational characterization of phages. We will test and optimize our new tools on both public data and the marine microbial sequencing data provided by our collaborator.


Project number: 9043533
Grant type: GRF
Effective start/end date: 1/01/24 → …