Decoding Disease and Immunity by High-throughput Sequencing: From Genomes to Immune Receptors

Student thesis: Doctoral Thesis

Abstract

High-throughput sequencing is a revolutionary technological advancement that has unveiled the complexity of biological sequences. This complexity is best appreciated in genomes and immune receptor repertoires: the human genome has 3 × 10^9 base pairs, whereas the theoretical diversity of lymphocyte antigen receptors is at least 10^13. Accompanying the complexity are notable individual differences, which form the basis for deciphering diseases and immune responses. With the constant increase in sequencing data volume and the emergence of single-cell technologies, the demand for computational analysis methods continues to grow. Focusing on disease and immunity, we examined the background of genomic and immune profiling along with current computational analysis approaches. On this basis, we introduced novel methods that address previous limitations and improve performance.

Genome-wide coverage has long served as a metric of sequencing quality and quantity, yet the biological insights underlying it have remained unexploited. In this work, we performed a comparative analysis of genome-wide coverage profiles between nuclear genomic DNA (gDNA, n=3,202; 1000 Genomes Project) and cell-free DNA (cfDNA; European Genome-phenome Archive) samples from healthy individuals (n=113) and cancer patients (n=362). Across all sample types, we observed a conserved coverage landscape characterized by segmentation, in which adjacent genomic windows present similar coverage. Beyond GC content, we identified protein-coding gene density and nucleosome density as key factors influencing gDNA and cfDNA coverage, respectively. Notably, differential coverage between cfDNA and gDNA was found in immune-receptor loci, intergenic regions, and non-coding genes, reflecting distinct genomic activities across cell types. In cancer cfDNA samples, increased coverage in non-coding genes and intergenic regions, coupled with decreased coverage in protein-coding genes and genic regions, suggested a reduced contribution by normal cells. Importantly, we discovered a distinct pattern of coverage convergence in cancer-derived cfDNA, with the extent of convergence positively correlated with cancer stage. Leveraging these insights, we developed and validated an outlier-detection method for cfDNA-based cancer screening that does not require cancer samples for training. The proposed method outperformed existing benchmarks in both condition-matched and condition-unmatched cancer detection tasks.
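
The idea of screening without cancer samples in training can be illustrated with a generic outlier detector fitted on healthy samples only. The sketch below is not the method developed in the thesis; it is a minimal illustration under stated assumptions: each sample is a list of normalized per-window coverage values, the score is the mean absolute z-score against a healthy reference, and the threshold is a quantile of the healthy scores. All function names and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def fit_reference(healthy):
    """Per-window (mean, standard deviation) estimated from healthy samples only."""
    windows = list(zip(*healthy))  # transpose: one tuple of values per genomic window
    return [(mean(w), stdev(w)) for w in windows]


def outlier_score(sample, reference):
    """Mean absolute z-score of one sample against the healthy reference."""
    zs = [abs(x - mu) / sd for x, (mu, sd) in zip(sample, reference) if sd > 0]
    return mean(zs) if zs else 0.0


def fit_threshold(healthy, reference, quantile=0.95):
    """Score cut-off taken as a quantile of the healthy samples' own scores."""
    scores = sorted(outlier_score(s, reference) for s in healthy)
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]


def is_outlier(sample, reference, threshold):
    """Flag a sample whose score exceeds the healthy-derived threshold."""
    return outlier_score(sample, reference) > threshold


# Toy data (hypothetical): four healthy samples, three genomic windows each.
healthy = [
    [1.0, 1.0, 1.0],
    [1.1, 0.9, 1.0],
    [0.9, 1.1, 1.05],
    [1.0, 1.0, 0.95],
]
reference = fit_reference(healthy)
threshold = fit_threshold(healthy, reference)

typical_sample = [1.0, 1.0, 1.0]
deviating_sample = [2.0, 0.2, 1.8]
flag_typical = is_outlier(typical_sample, reference, threshold)
flag_deviating = is_outlier(deviating_sample, reference, threshold)
```

Because the reference and the threshold are both derived from healthy samples, no cancer sample is needed at training time; a real pipeline would also need GC correction and held-out healthy samples for calibrating the threshold.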

The specificity of a T-cell receptor (TCR) repertoire defines an individual's immune capacity. While existing methods have focused on the qualitative aspects of TCR specificity, the quantitative dimensions have remained unaddressed. We developed TCRanno, a Python package to quantify TCR repertoire specificity. We generated epitope-aware embedding representations of TCR sequences which are indicative of specificity. Quantitative profiles of TCR repertoire specificity at epitope, antigen and organism levels are generated by aggregating clonotype frequencies. Applying TCRanno to 4,195 TCR repertoires uncovered significant quantitative changes in specificity upon infections, autoimmune diseases, and cancers. Notably, TCRanno identified cytomegalovirus-specific TCRs in seronegative healthy individuals, suggesting the possibility of abortive infections. It also discovered an age-accumulated subpopulation of SARS-CoV-2-specific TCRs in pre-pandemic samples, which may explain the aggressive symptoms and age-related severity of COVID-19. Additionally, TCRanno revealed that encounters with Hepatitis B antigens might trigger systemic lupus erythematosus. TCRanno annotations distinguished TCR repertoires between healthy individuals and those with cancers such as melanoma, lung cancer, and breast cancer. Finally, TCRanno demonstrated its use in single-cell TCR sequencing coupled with gene expression data by enabling the isolation of T-cells with specificity of interest.
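
The aggregation step described above, in which clonotype frequencies are summed into profiles at the epitope, antigen, and organism levels, can be sketched as follows. This is not TCRanno's API; the function name, the annotation layout, and all identifiers and labels in the toy data are hypothetical placeholders, and the annotation step itself (assigning a specificity to each clonotype) is assumed to have been done already.

```python
from collections import defaultdict


def specificity_profile(clonotypes, annotations, level):
    """Sum clonotype frequencies per label at one annotation level.

    clonotypes:  dict mapping a clonotype identifier to its repertoire frequency
    annotations: dict mapping a clonotype identifier to a dict with the keys
                 'epitope', 'antigen' and 'organism' (hypothetical layout)
    level:       one of 'epitope', 'antigen', 'organism'
    """
    profile = defaultdict(float)
    for clonotype, frequency in clonotypes.items():
        label = annotations.get(clonotype, {}).get(level)
        if label is not None:  # unannotated clonotypes are skipped
            profile[label] += frequency
    return dict(profile)


# Toy repertoire with placeholder identifiers and labels.
clonotypes = {"clone_a": 0.5, "clone_b": 0.25, "clone_c": 0.125, "clone_d": 0.125}
annotations = {
    "clone_a": {"epitope": "epitope_1", "antigen": "antigen_X", "organism": "organism_P"},
    "clone_b": {"epitope": "epitope_2", "antigen": "antigen_X", "organism": "organism_P"},
    "clone_c": {"epitope": "epitope_3", "antigen": "antigen_Y", "organism": "organism_Q"},
    # clone_d has no annotation and does not contribute to any profile
}

epitope_profile = specificity_profile(clonotypes, annotations, "epitope")
antigen_profile = specificity_profile(clonotypes, annotations, "antigen")
organism_profile = specificity_profile(clonotypes, annotations, "organism")
```

Summing frequencies rather than counting clonotypes means an expanded clone weighs more than a rare one, which is what makes the resulting profile quantitative.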

B-cell lineage trees illustrate the evolutionary process during the affinity maturation of B-cell receptors (BCRs). Current methods for constructing B-cell lineage trees generally do not account for the inheritance and accumulation of advantageous mutations from parent to child, which is central to affinity maturation. To overcome this limitation, we developed AffMB (Affinity Maturation of B-cell receptor), a comprehensive toolkit to generate and visualize somatic hypermutation (SHM)-ordered, inheritance-based B-cell lineage trees using single-cell or bulk sequencing data of B-cell receptor repertoires. AffMB's SHM-ordered inheritance tree algorithm demonstrated advantages over state-of-the-art benchmarks on various simulations. When applied to single-cell data from BNT162b2 vaccination (n=42), AffMB demonstrated the ability to predict vaccination responses and showed the feasibility of discovering candidate high-affinity antibodies.

Date of Award: 8 Aug 2025
Original language: English
Awarding Institution: City University of Hong Kong
Supervisor: Shuaicheng LI
