Illuminating the Dark Matter of Viruses: Bacteriophage Identification and Characterization using Deep Learning 


Student thesis: Doctoral Thesis





Award date: 4 Sept 2023


Bacteriophages (phages) are viruses that mainly infect bacteria. They are the most widely distributed and abundant biological entities in the biosphere, with an estimated population of more than 10^31 particles. Phages have important pharmaceutical applications because they can be used as an alternative to antibiotics to kill bacterial pathogens, an approach known as phage therapy. Over the past decades, the overuse of antibiotics has driven the evolution of bacteria into superbugs that carry antibiotic resistance genes, posing a significant threat to human health. In 2017, superbugs caused over 23,000 deaths, and it is estimated that by 2050 they could kill more people than cancer. Researchers are therefore exploring the potential of phages to combat superbugs and improve patient outcomes. Phage therapy treats bacterial infections by targeting specific bacteria without harming the body's healthy cells. In addition, phages play a crucial role in maintaining and shaping microbial ecology, either by killing their hosts or by integrating into host genomes. For example, phages can lyse 20%-40% of bacteria per day in marine ecosystems. A growing number of studies show that phages are also important in other applications, such as the food industry, bacterial genome engineering, and disease diagnostics.

Despite their abundance and importance, phages are much less studied than other microbes. Novel phages waiting to be discovered and characterized constitute a large portion of the "viral dark matter." Today, the primary means of novel phage discovery is (viral) metagenomic sequencing, which is culture-independent and can therefore sequence all genetic material from a given environment. Intensive sequencing efforts across different ecosystems have revealed many virus-like particles that cannot easily be cultured using traditional methods. However, the lack of optimized methods and tools has become a significant barrier to leveraging these data for phage research. In particular, there is a critical need to discover and characterize phages that are not closely related to the known ones in public databases.

Since 2006, AI-based models, especially deep learning, have achieved performance breakthroughs in computer vision, natural language processing, and video/speech recognition. Their ability to learn from massive amounts of data has made them the most widely used computational approaches to complex cognitive tasks. Deep learning is therefore expected to outperform traditional methods for phage identification and characterization. However, because cultivating and characterizing phages in the laboratory is time-consuming, labor-intensive, and costly, the existing datasets usually contain large amounts of unlabeled data, long-tailed label distributions, and noisy annotations. All these properties limit the generalizability of deep learning models when applied to genomic data. In this work, we provide efficient deep learning-based solutions for deciphering biological sequences for phage discovery and analysis. To help interpret the features and rules learned by the models, we also visualize their essential components and analyze the biological meaning.

Toward our goal of a comprehensive understanding of "who is there" and "what do they do" for phages, we investigate five key research questions: 1) phage discovery from metagenomic data; 2) phage lifestyle prediction; 3) phage taxonomic classification; 4) phage host prediction; and 5) phage virion protein annotation. Answering these questions allows us to gain deeper insights into the diversity and distribution of phages in different habitats, their properties and functions, and their interactions with other microbes.

Recent large language models, such as ChatGPT, have demonstrated significant advances in natural language processing (NLP). Inspired by semantic analysis problems in NLP, we employ the state-of-the-art contextualized embedding model, the Transformer, to automatically learn abstract patterns from the "language" of phages. In this language, phage sequences are regarded as sentences defined on a phage-aware vocabulary of proteins. This formulation has two major advantages. First, some proteins play critical roles in the phage life cycle. For example, coat proteins and receptor-binding proteins can help us distinguish phages from bacteria. These proteins act as strong signals, much like words that express obvious emotions in human language. Second, proteins often interact with other proteins to carry out biological functions. Just as multiple words can form phrases with different meanings, some protein combinations in the contigs can provide important evidence for phage identification. Under this formulation, we adapt the Transformer model to protein-based tokens and tackle two protein-aware tasks: phage identification and lifestyle prediction.
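As a rough illustration of the "sentence" formulation, the sketch below turns a contig into a fixed-length sequence of protein token IDs that a Transformer could consume. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the protein cluster labels ("coat", "tail", "rbp"), the reserved IDs, and the padding length are all hypothetical.

```python
# Hypothetical reserved token IDs (not taken from the thesis).
PAD, UNK = 0, 1

def build_vocab(cluster_ids):
    """Map each known protein cluster label to an integer token ID (starting at 2)."""
    return {cid: i + 2 for i, cid in enumerate(sorted(set(cluster_ids)))}

def contig_to_sentence(protein_hits, vocab, max_len=8):
    """Encode one contig as a padded list of token IDs.

    protein_hits: cluster labels for the contig's proteins in genomic order;
    None (or any label missing from the vocabulary) becomes UNK.
    """
    ids = [vocab.get(hit, UNK) for hit in protein_hits][:max_len]
    return ids + [PAD] * (max_len - len(ids))

vocab = build_vocab(["coat", "tail", "rbp"])
sentence = contig_to_sentence(["coat", None, "tail"], vocab, max_len=5)
print(sentence)
```

Keeping the proteins in genomic order preserves the "phrase" context that the attention layers are meant to exploit.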

However, taxonomic classification and host prediction pose an additional challenge. Phages with known taxa and bacterial hosts are just the tip of the iceberg. Given the enormous diversity of phages and the sheer number of unlabeled phages, we formulate phage taxonomic classification and host prediction as semi-supervised learning problems. In these tasks, we construct a knowledge graph to connect labeled and unlabeled phages and use a graph convolutional network to learn its topological structure. Thus, information from both labeled and unlabeled data can be used to enlarge the receptive field.
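To make the receptive-field idea concrete, the sketch below applies the standard graph convolution propagation rule, H' = ReLU(Â H W) with Â the symmetrically normalized adjacency matrix with self-loops, to a toy three-node chain. It is a minimal sketch under stated assumptions: the toy graph, the single feature, and the identity weight are illustrative only, and the edge construction used in the thesis is not reproduced here.

```python
import math

def normalized_adjacency(edges, n):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected graph with n nodes."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(a_hat, h, w):
    """One graph convolution layer: ReLU(a_hat @ h @ w)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy chain 0 - 1 - 2: only node 0 carries a signal (e.g. a labeled phage).
a_hat = normalized_adjacency([(0, 1), (1, 2)], n=3)
h0 = [[1.0], [0.0], [0.0]]
w = [[1.0]]  # identity weight, for illustration only
h1 = gcn_layer(a_hat, h0, w)
h2 = gcn_layer(a_hat, h1, w)
```

After one layer the signal from node 0 reaches only its neighbor; after two layers it also reaches node 2, which is how stacking layers lets unlabeled nodes draw on labeled nodes several hops away.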

Unlike the aforementioned tasks, in which each input contains multiple proteins, the inputs for phage protein annotation are individual protein sequences. Thus, the main focus of the protein annotation task is how to efficiently encode these proteins into numerical matrices. Considering that phage proteins are highly diverse, we apply chaos game representation to encode protein sequences. Chaos game representation is a generalized Markov chain that allows a one-to-one mapping between the image and the sequence. The converted images can therefore embed biological features, such as motifs (short, similar, recurring subsequences), which the deep learning model can exploit. Then, inspired by pattern recognition problems in computer vision, we apply the Vision Transformer to learn the importance of different sub-images and their associations for protein annotation.
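The iteration behind chaos game representation can be sketched as follows. This is a generic version under stated assumptions, not the encoding scheme used in the thesis: placing the 20 standard amino acids on a regular polygon, the contraction ratio of 0.5, and the 32x32 grid are all illustrative choices, and with this particular layout the one-to-one property is not guaranteed.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def cgr_points(seq, alphabet=AMINO_ACIDS, ratio=0.5):
    """Walk from the origin toward the vertex of each residue in turn."""
    n = len(alphabet)
    vertices = {a: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
                for i, a in enumerate(alphabet)}
    x, y = 0.0, 0.0
    points = []
    for ch in seq:
        if ch not in vertices:
            continue  # skip ambiguous or non-standard residues
        vx, vy = vertices[ch]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points

def cgr_image(seq, size=32, **kwargs):
    """Bin the CGR points into a size x size count matrix."""
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq, **kwargs):
        col = min(int((x + 1) / 2 * size), size - 1)
        row = min(int((y + 1) / 2 * size), size - 1)
        grid[row][col] += 1
    return grid

image = cgr_image("MKTAYIAK")
```

Because each point depends on the whole prefix of the sequence, recurring subsequences tend to land in the same region of the image, which is what makes the matrix usable as input to an image model.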

Finally, we developed a web server named PhaBOX that integrates all the methods mentioned above for the benefit of virologists. To the best of our knowledge, this is the first web server for comprehensive phage sequence analysis in metagenomic data. To help users conduct downstream analysis, PhaBOX also visualizes the essential features behind its predictions, such as the similarity-based relationships between the query sequences and other phages, the predicted proteins on the sequences, and protein homology. In summary, PhaBOX provides a one-stop shop for phage identification and analysis for users with or without informatics training. We hope it can help advance phage research in various ecosystems.