Advancing Plasmid Characterization in Metagenomic Data: Plasmid Protein Annotation and Host Range Identification

Project: Research


Experts refer to antibiotic resistance as “the slow-moving pandemic”, a phrase that underlines three things: its high mortality rate, the extensive scope of its impact, and the long duration of its influence. Antibiotic resistance arises when bacteria acquire the ability to survive antibiotic treatment. They can do this in two ways: 1) by inheriting antibiotic resistance genes (ARGs) from their parent bacteria; or 2) by acquiring these genes from other bacteria through a process called horizontal gene transfer (HGT). Key players in HGT are plasmids, self-replicating DNA molecules found within many bacterial cells. Plasmids can pass their ARGs on to other bacteria in the same environment, making those bacteria resistant to antibiotics too. Because plasmids can disseminate ARGs, monitoring them is important for assessing the risk of antibiotic resistance. Many plasmids have been sequenced with advanced sequencing technologies, but plasmid characterization lags far behind the sequencing effort. Two fundamental gaps remain: 1) many plasmid proteins have not been annotated, meaning we do not know their functions; 2) many newly sequenced plasmids lack host range information, so we do not know which bacteria can host them. The central goal of this proposal is to develop domain knowledge-guided deep learning models for understanding plasmid-mediated gene transfer. To this end, we have three objectives: 1) annotating the functions of plasmid proteins; 2) predicting the hosts of plasmids; and 3) deriving a plasmid-mediated ARG transfer network in collaboration with two microbiologists. The lack of protein annotation poses a significant challenge for plasmid characterization and host prediction. Traditional methods, such as sequence alignment, have had limited success because plasmids are highly diverse. In our approach, we will leverage the modular structure of proteins in plasmids. 
Accordingly, our methodology will build on large protein models and modern natural language models to enhance feature representation and context-aware learning. Using the annotated functions, we will select the proteins involved in the plasmid transfer process and design a host prediction algorithm that integrates diverse features: the key proteins involved in plasmid transfer, plasmid similarity, plasmid vs. chromosome similarity, and protein structure binding. Finally, we will apply the resulting algorithms to two types of samples: those from wild rats in Hong Kong and those from wastewater treatment plants. Because these samples are closely associated with human habitats, their microbiomes may harbor pathogens of interest. Moreover, these ecosystems exhibit intricate dynamics, making them ideal testbeds for investigating the potential transfer of ARGs to pathogens via plasmids. 
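
As a rough illustration of what integrating the four feature types for host prediction could look like, the Python sketch below ranks candidate hosts by a weighted average of per-feature scores. It is not the project's algorithm, which is described as a deep learning model: the names HostEvidence, WEIGHTS, host_score and rank_hosts, the equal weights, and the assumption that each feature has already been scaled to a score in [0, 1] are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class HostEvidence:
    """Hypothetical evidence for one candidate host; each score is assumed to lie in [0, 1]."""
    transfer_protein: float       # support from key plasmid-transfer proteins
    plasmid_similarity: float     # similarity to plasmids with known hosts
    chromosome_similarity: float  # plasmid vs. host chromosome similarity
    structure_binding: float      # protein structure binding compatibility


# Illustrative equal weights (an assumption, not values from the project).
WEIGHTS: Dict[str, float] = {
    "transfer_protein": 0.25,
    "plasmid_similarity": 0.25,
    "chromosome_similarity": 0.25,
    "structure_binding": 0.25,
}


def host_score(evidence: HostEvidence, weights: Dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of the feature scores for one candidate host."""
    total_weight = sum(weights.values())
    weighted_sum = sum(w * getattr(evidence, name) for name, w in weights.items())
    return weighted_sum / total_weight


def rank_hosts(candidates: Dict[str, HostEvidence]) -> List[Tuple[str, float]]:
    """Return (host, score) pairs sorted from the most to the least likely host."""
    scored = [(host, host_score(ev)) for host, ev in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for two placeholder hosts, purely to show the call pattern.
    demo = {
        "host_A": HostEvidence(0.9, 0.8, 0.7, 0.6),
        "host_B": HostEvidence(0.2, 0.3, 0.1, 0.4),
    }
    for host, score in rank_hosts(demo):
        print(host, round(score, 3))
```

A learned model would replace the fixed weights with parameters fitted to plasmids of known host range; the linear combination here only shows how heterogeneous evidence can be reduced to a single ranking.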


Project number: 9043697
Grant type: GRF
Status: Not started
Effective start/end date: 1/01/25 → …