Abstract
The effectiveness of the immune system in combating a myriad of evolving pathogens relies on a dynamic and diverse repertoire of T cell receptors (TCRs). TCRs are pivotal components of the adaptive immune response, enabling T cells to identify and respond to a broad spectrum of antigens, including those derived from pathogens, tumors, and other foreign entities. The enduring nature of immunological memory highlights the significance of high-throughput sequencing of adaptive TCR repertoires (TCR-seq), which provides comprehensive insights into both historical and contemporary interactions of the human immune system. This technology greatly enhances our understanding of fundamental immune processes with far-reaching implications for biomedicine. Given the substantial biological insights embedded within TCR repertoires, mapping the amino acid sequences of TCRs to their biological and clinical properties presents a significant challenge in the field of computational immunology. During my Ph.D. studies, I concentrated on developing novel computational methodologies for the characterization of TCR repertoires at the sequence level. This included probabilistic modeling of TCR repertoires, predicting the antigen specificity of TCRs, representation learning of TCRs, and quantifying the selection in TCR repertoires.
Probabilistic modeling of TCR repertoires offers a valuable approach for characterizing the intricate sequence patterns inherent in these repertoires, thereby providing novel insights into the functioning of the adaptive immune system. We introduced TCRpeg, an autoregressive deep learning model to capture the underlying sequence patterns of TCR repertoires from a probabilistic standpoint. Benchmark evaluations indicate that TCRpeg outperforms existing probabilistic models in inferring sequence probability distributions, achieving a significant improvement in average performance. With its promising capabilities in probability inference, TCRpeg enhances a variety of TCR-related tasks, including classification of antigen-specific TCRs, validation of previously identified TCR motifs, generation of novel TCR sequences, and augmentation of TCR data.
Subsequently, we investigated the characterization of the antigen specificity of TCRs, a critical step toward the advancement of personalized immunotherapy and the development of targeted vaccines. We introduced a deep learning model named TEINet to predict TCR binding specificity, incorporating two distinct pretrained encoders to transform TCRs and epitopes into numerical vectors. In addition to demonstrating superior prediction accuracy compared to existing tools, we summarized and compared the negative sampling strategies employed in this prediction task to identify the most suitable one for the future development of TCR specificity prediction models.
Next, we developed a deep representation learning framework named TCR2vec to effectively characterize TCR sequences within the embedding space. Recognizing that TCRs that bind to the same antigen tend to display highly similar sequences or binding patterns, we introduced a novel pretraining task known as similarity preservation modeling (SPM). This task is designed to maintain the pairwise sequence similarities of TCRs within the embedding space. By jointly optimizing masked language modeling (MLM) and SPM through a multi-task learning approach, TCR2vec generates embeddings that effectively encode both contextual and functional information about TCRs. We demonstrated the utility and robustness of TCR2vec across two significant downstream tasks: the prediction of antigen specificity and TCR clustering.
Then, we characterized the selection pressure exerted on TCRs. During maturation, precursor T cells undergo a series of selection processes that are essential for developing a functional and self-tolerant immune repertoire. To this end, we introduced TCRsep, a computational tool designed to predict selection factors that quantify the selection dynamics within immune receptor repertoires. Unlike previous selection models that rely on indirect learning approaches, TCRsep directly minimizes the L2 difference between the predicted and ideal selection factors, even in the absence of access to the ideal values. Through extensive simulation experiments and supplementary evaluations, we comprehensively validated the inference capacity and superiority of TCRsep compared to existing methods. By applying TCRsep to the analysis of over 1,500 repertoire samples, we elucidated the relationship between selection pressure and repertoire diversity during aging, explored the potential stability and individuality of selection, examined the role of selection in characterizing TCR sharing profiles, and demonstrated the effectiveness of screening disease-associated TCRs by analyzing their sharing profiles.
In summary, the analysis of TCR repertoires yields critical insights into the immune response to pathogens and provides a historical record of prior infections. Given the inherent complexity of TCR repertoire analysis and the interconnectedness of various factors influencing TCR dynamics, this thesis explores multiple avenues for characterizing these repertoires. By focusing on probabilistic characterization, antigen specificity, deep learning representations, and selection pressure, we aim to achieve a comprehensive understanding of TCR behavior. This multifaceted approach is essential for informing the development of advanced therapeutic strategies and effective vaccine designs.
| Date of Award | 15 Aug 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Shuaicheng LI (Supervisor) |