Decoding T Cell Receptor Repertoire Profiling in Immune Related Diseases


Student thesis: Doctoral Thesis

Award date: 20 Jun 2023


T cell receptors (TCRs), expressed on the surface of T cells, recognize and bind a wide range of self and foreign antigens to induce the adaptive immune response. Their capacity to protect the human body rests on the vast diversity generated by the rearrangement of gene segments. Because TCRs specifically recognize the antigens that appear in immune-related diseases, they could play a crucial role in disease detection and treatment, for example in cancer immunotherapy. Currently, the TCR repertoire can be captured experimentally from millions of T cells and then sequenced by next-generation sequencing (NGS). However, several issues remain to be addressed. First, the paired-end (PE) reads generated by NGS must be merged into a single sequence using the region in which they overlap. The wide range of TCR lengths leads to low accuracy when tools originally developed for other genetic sequencing data are used, so a dedicated tool for processing TCR data is required. Second, decoding the TCR repertoire in immune-related diseases is urgently needed. In theory, TCRs could serve as biomarkers for disease, so it is crucial to determine how to identify disease-associated TCRs (daTCRs) and how to use them for disease diagnosis and prediction of treatment outcome.

To address these issues, we first developed IMperm, a software package for merging immune PE reads from sequencing data. IMperm uses a k-mer-and-vote strategy to locate the overlapping region rapidly. It can handle all types of PE reads, eliminate adapter contamination, and successfully merge low-quality reads as well as reads with minor or no overlap. Compared with existing tools, IMperm performed better on both simulated and real sequencing data. Notably, IMperm was well suited to processing data for minimal residual disease (MRD) detection in leukemia and lymphoma, and it detected 19 novel MRD clones in 14 patients with leukemia from previously published data. IMperm can also handle PE reads from other sources, as we demonstrated on two genomic datasets and one cell-free DNA dataset. IMperm is implemented in the C programming language and requires little runtime and memory.
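The abstract does not describe the k-mer-and-vote strategy in detail. As a rough illustration only, the general idea can be sketched in Python under the following assumptions: the second read is reverse-complemented, every k-mer it shares with the first read casts a vote for the alignment offset implied by that match, and the offset with the most votes defines the overlap. The function names, the value of k, and the vote threshold below are hypothetical and are not taken from IMperm, which is written in C and additionally handles base qualities, adapter contamination, and non-overlapping reads.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def vote_offset(read1, read2_rc, k=8):
    """Vote for the position in read1 at which read2_rc starts.

    Returns (offset, votes) for the best-supported offset,
    or None if the two reads share no k-mer.
    """
    # Index every k-mer of read1 by its start positions.
    positions = defaultdict(list)
    for i in range(len(read1) - k + 1):
        positions[read1[i:i + k]].append(i)

    # Each shared k-mer votes for the offset it implies.
    votes = Counter()
    for j in range(len(read2_rc) - k + 1):
        for i in positions.get(read2_rc[j:j + k], ()):
            votes[i - j] += 1

    if not votes:
        return None
    return votes.most_common(1)[0]


def merge_pair(read1, read2, k=8, min_votes=3):
    """Merge a read pair, or return None if no confident overlap is found.

    Hypothetical simplification: bases in the overlap are taken from read1,
    and negative offsets (read-through into adapters) are not handled.
    """
    read2_rc = reverse_complement(read2)
    best = vote_offset(read1, read2_rc, k)
    if best is None:
        return None
    offset, n_votes = best
    if n_votes < min_votes or offset < 0:
        return None
    overlap = len(read1) - offset
    # Append the part of read2_rc that extends beyond read1.
    return read1 + read2_rc[overlap:]
```

For example, if `read1` is the first 40 bases of a 60-base fragment and `read2` is the reverse complement of its last 40 bases, `merge_pair(read1, read2)` reconstructs the full fragment. Voting on offsets rather than scoring every possible alignment is what would make such an approach fast and tolerant of a few mismatched k-mers.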

Next, we profiled the TCR repertoire in myasthenia gravis (MG), an autoimmune disease. In theory, autoreactive TCRs, which specifically recognize patients’ auto-antigens, could be ideal biomarkers of the disease. We sequenced the TCR repertoire by NGS from peripheral blood samples of 549 MG patients. Compared with healthy controls, MG samples showed distinct TCR repertoire characteristics. We developed a bioinformatic pipeline, IMisc2, to identify daTCRs and to distinguish MG samples from controls with a classification model. IMisc2 identified 398 TCR clusters as MG daTCRs, which were further validated in an independent MG dataset and in related autoimmune diseases. Furthermore, MG daTCRs could be used to predict disease severity in patients.

Then, we profiled the TCR repertoire in two cancers: lung cancer and ovarian cancer. For lung cancer, we developed a bioinformatic pipeline, IMtaTCR, to identify tumor-associated TCRs (taTCRs) by comparing the TCR repertoire in tumor tissue with that in normal tissue. Using public sequencing data, we identified thousands of taTCRs each for adenocarcinoma (ADCA), squamous carcinoma (SCCA), and small cell lung carcinoma (SCLC). Next, we modified an existing convolutional neural network model to predict a taTCR score from blood samples. We collected pre-treatment blood samples from 162 lung cancer patients who were to undergo chemotherapy or adjuvant chemotherapy. Higher taTCR scores indicated slower disease progression and longer overall survival in patients with ADCA, SCCA, or SCLC, each analyzed separately. Additionally, we analyzed spatial heterogeneity in ovarian cancer by multi-site sampling with multi-omics data, including whole genome sequencing (WGS), single-cell RNA sequencing (scRNA-seq), and TCR sequencing. Regulatory T cells and exhausted T cells (Tex) were significantly increased in ovarian sites, while central memory T cells were enriched in omental lesions. Neoantigens identified from the WGS data were significantly correlated with the TCR diversity of two Tex cell subtypes. Neoantigen-associated TCRs were inferred from these neoantigens, and these TCRs originated from the same two Tex cell subtypes.

We believe that our methods can help other investigators process TCR sequencing data and that our findings can provide new perspectives on disease progression and treatment.

    Research areas

  • Bioinformatics, TCR repertoire, IMperm, Myasthenia gravis, Cancer