Abstract
The rapid iteration of computational intelligence has fueled advancements in modern computational biology, enabling better solutions to challenging problems such as the exploration of DNA-transcription factor binding mechanisms, peptide property prediction, and small-molecule drug discovery. This thesis centers on exploring and identifying molecular sequence patterns, aiming to uncover the structural and functional implications embedded in biomolecular sequences. Through a series of methodological contributions, we introduce approaches that bridge molecular sequences with interaction-associated analysis across three key domains: (1) DNA shape motif discovery for elucidating transcription factor binding motifs beyond sequence-level signals; (2) peptide property prediction via structure-informed graph modeling; and (3) all-atom level recognition of intermolecular and intramolecular interactions for protein complex structural interpretation.
In the first study, we tackle the problem of DNA shape motif discovery by developing three methods that generalize shape motif detection across multiple DNA shape features. DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors (TFs), bind to specific DNA motifs to regulate gene expression and other cellular activities. Although traditional sequence motifs are well-established, recent insights highlight the role of DNA’s intrinsic three-dimensional topology in modulating protein–DNA interactions. As the study of DNA three-dimensional structure has advanced, an increasing number of spatial features have been discovered. To date, 14 types of DNA shape features at different scales have been proposed. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features together remain insufficient.
To address this, we propose a series of methods to discover non-redundant DNA shape motifs that generalize to multiple features. First, the Gibbs sampling method (ShapeMF) is generalized to multi-feature DNA shape motif discovery. In addition, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are developed for shape motif discovery with improved performance, convergence, and efficiency. The discovered DNA shape motif instances provide insights into low-signal ChIP-Seq peak summits, complementing existing sequence motif discovery efforts. Additionally, our model captures the potential interplay across multiple shape features. For long-term impact, we provide a platform of tools for exploring DNA shape motifs.
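As a rough illustration of the Gibbs sampling idea mentioned above, the following minimal sketch samples one motif start per sequence from a single numeric shape-feature track. It is not ShapeMF or any method from the thesis: the function names (`window_score`, `gibbs_shape_motif`), the per-position Gaussian profile with a fixed `sigma`, and the absence of a background model are all assumptions made for illustration. A multi-feature variant would presumably score several tracks jointly.

```python
import math
import random


def window_score(window, means, sigma):
    """Gaussian log-likelihood (up to a constant) of a window given per-position means."""
    return -sum((x - m) ** 2 for x, m in zip(window, means)) / (2.0 * sigma ** 2)


def gibbs_shape_motif(tracks, width, iters=200, sigma=1.0, seed=0):
    """Sample one motif start per track (hypothetical sketch; needs at least two tracks)."""
    rng = random.Random(seed)
    # Start from a random window in every track.
    starts = [rng.randrange(len(t) - width + 1) for t in tracks]
    for _ in range(iters):
        for i, track in enumerate(tracks):
            # Estimate the motif profile from every other track's current window.
            others = [tracks[k][starts[k]:starts[k] + width]
                      for k in range(len(tracks)) if k != i]
            means = [sum(col) / len(col) for col in zip(*others)]
            # Score each candidate window of the held-out track.
            scores = [window_score(track[s:s + width], means, sigma)
                      for s in range(len(track) - width + 1)]
            # Subtract the maximum before exponentiating for numerical stability.
            top = max(scores)
            weights = [math.exp(sc - top) for sc in scores]
            # Sample a new start in proportion to its likelihood under the profile.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Sampling a start, rather than always taking the best-scoring window, is what distinguishes this from a greedy search and lets the sampler escape poor initial alignments. An EM variant would instead weight every window by its posterior probability when re-estimating the profile.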
In addition to traditional machine learning algorithms, geometric deep learning methods have experienced rapid development in recent years. The graph structure is well-suited for modeling various bio-active molecules, such as proteins, peptides, and small molecules. Methods represented by AlphaFold and RoseTTAFold have significantly improved the accuracy of protein 3D structure prediction. However, there remains substantial room for improvement in graph-based deep learning methods for peptides. In the second project, we aimed to model peptide data and leveraged amino acid sequence inputs to achieve more accurate peptide property prediction, as bio-active peptide therapeutics have long been a topic of interest. Notably, antimicrobial peptides (AMPs) have been extensively studied for their therapeutic potential. Meanwhile, the demand for annotating other therapeutic peptides, such as antiviral peptides (AVPs) and anticancer peptides (ACPs), has also increased in recent years. However, we contend that the structure of peptide chains and the intrinsic interactions between amino acids have not been fully explored in existing protocols. To address this, we developed a new graph deep learning model, TP-LMMSG, which is lightweight and easy to deploy while improving annotation performance in a generalizable manner. The results demonstrate that our model accurately predicts the properties of different peptides. It outperforms the state-of-the-art models on AMP, AVP, and ACP prediction across multiple experimentally validated datasets. Moreover, TP-LMMSG addresses the challenge of time-consuming pre-processing in graph neural network frameworks. With its flexibility in integrating heterogeneous peptide features, our model can make substantial contributions to the screening and discovery of therapeutic peptides.
In the third study, we aimed to advance the understanding of both intermolecular interactions, such as ligand–protein binding, and prevalent intramolecular interactions within molecules such as peptides. In the aforementioned studies, we recognized that accurate interaction detection plays a crucial role in understanding molecular binding and folding. However, there remains substantial room for improvement in existing open-source tools for this task. To this end, we developed PLI+, a high-resolution, all-atom molecular interaction recognition method. It is a novel method that simultaneously supports the identification of both intermolecular and intramolecular non-covalent interactions, while also extending the recognition scope to the all-atom scale. PLI+ enables a wide range of interaction profiling in ligand–protein complexes, including hydrogen bonds, salt bridges, hydrophobic contacts, halogen bonds, metal coordination, π–cation, π–stacking, π–hydrogen bonds, water bridges, and customizable non-classical interactions. Powered by rigorous SMARTS-sequence-based substructure parsing with optimized geometric constraints, PLI+ achieves superior detection reliability compared to existing tools. Furthermore, our method supports the detection of intramolecular interactions and achieves high computational efficiency. Notably, the PLI+ method incorporates optimized visualization for analyzing intramolecular interactions, making it suitable not only for high-throughput molecular screening and ligand–receptor binding pose analysis, but also for probing internal interplay within structurally flexible molecules such as bio-active peptides and nucleic acids. By integrating PLI+ into our work, we establish a unified analytical strategy that enables interaction-aware modeling across diverse molecular sequence types, including small molecules, amino acids, and nucleotides, thereby establishing a closed loop from molecular sequence analysis to functional identification.
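To make the notion of a geometric constraint concrete, the sketch below checks a single hydrogen-bond candidate using a donor–acceptor distance cutoff and a donor–hydrogen–acceptor angle cutoff. This is not the PLI+ implementation: the function names and the default thresholds (`max_dist=3.5` Å, `min_angle=120.0` degrees) are illustrative assumptions, and the SMARTS-based step that would select donor and acceptor atoms in the first place is omitted.

```python
import math


def angle_deg(a, b, c):
    """Angle at vertex b, in degrees, formed by the points a-b-c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def is_hydrogen_bond(donor, hydrogen, acceptor, max_dist=3.5, min_angle=120.0):
    """Return True if the geometry passes both cutoffs (thresholds are assumed values)."""
    close_enough = math.dist(donor, acceptor) <= max_dist
    linear_enough = angle_deg(donor, hydrogen, acceptor) >= min_angle
    return close_enough and linear_enough


# Coordinates in angstroms: a nearly ideal, linear donor-H...acceptor arrangement.
print(is_hydrogen_bond((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0)))
```

The same pattern (select candidate atoms, then filter by distances and angles) applies to other interaction types, with different atoms and cutoffs for each.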
| Date of Award | 22 Sept 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Ka Chun WONG (Supervisor) & Jianbo YUE (Supervisor) |