Algorithms for Sorting by Reversal with Inverted Pairs of Repeats and Whole Chromosome Haplotype Assembly

Project: Research

Project Details

Description

In this project, we will design algorithms for two important problems in computational biology. The first problem is pan-genome analysis, emphasizing scaffold comparison and rearrangement event identification; the second is the chromosomal haplotype assembly problem.

A pan-genome (or supra-genome) describes the full complement of genes in a clade (typically a species, for bacteria and archaea), which can show large variation in gene content among closely related strains. For pan-genome analysis, the genomes from different strains of the same species are decomposed, using multiple sequence alignment tools, into core segments (in all the strains), dispensable segments (in two or more strains) and strain-specific segments (in one strain only). Various statistical analyses have been carried out after the decomposition. Here we are the first to propose scaffold comparison of pan-genomes and to study the various kinds of rearrangement events and the mechanisms behind them. Since genomes within the same species are much more similar to each other, pan-genome comparison allows us to observe the most recent evolutionary events and the sequence segments near those events, which will be very helpful in revealing the mechanisms underlying these rearrangement operations. We have a preliminary finding: in bacteria such as E. coli and Pseudomonas aeruginosa, it is very common for a pair of inverted TEs to be associated with the two ends of a reversal segment. This mechanism can also explain why breakpoint reuse happens for reversal events. In this project, we propose to develop tools and algorithms for pan-genome scaffold comparison and for the analysis of rearrangement events (such as reversal, block interchange, insertion and deletion).

Haplotypes play a crucial role in genetic analysis and have many applications, such as genetic disease diagnosis, association studies and ancestry inference. With current sequencing techniques, the reads are decomposed into a set of disjoint blocks, where reads from different blocks do not overlap. Consequently, the assembled haplotype usually contains thousands of small disjoint pieces (blocks). Even with third-generation sequencing techniques such as PacBio, it is estimated that each chromosome may still contain about 100 blocks. Obtaining a single haplotype piece covering the whole chromosome remains a challenging problem and has attracted much attention recently. This problem is referred to as the chromosomal haplotype assembly problem. In this project, we propose to use the sequencing data of a family containing at least three individuals (instead of one individual) to infer the haplotypes of the individuals for the whole chromosome (resulting in one block for each chromosome). We will design algorithms and develop software packages to solve the problem.
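To make the pan-genome decomposition in the description concrete, here is a minimal sketch in Python. It assumes segments have already been identified and given IDs (for example by a multiple sequence alignment tool); the function name `classify_segments`, the input layout and the segment IDs are hypothetical and are not taken from the project. It also assumes that a segment found in every strain counts as core rather than dispensable.

```python
from collections import defaultdict


def classify_segments(strain_segments):
    """Split segments into core, dispensable and strain-specific sets.

    strain_segments: dict mapping a strain name to the set of segment IDs
    found in that strain (hypothetical input format).
    """
    n_strains = len(strain_segments)

    # Record which strains carry each segment.
    carriers = defaultdict(set)
    for strain, segments in strain_segments.items():
        for seg in segments:
            carriers[seg].add(strain)

    classes = {"core": set(), "dispensable": set(), "strain_specific": set()}
    for seg, strains in carriers.items():
        if len(strains) == n_strains:
            classes["core"].add(seg)             # in all the strains
        elif len(strains) == 1:
            classes["strain_specific"].add(seg)  # in one strain only
        else:
            classes["dispensable"].add(seg)      # in two or more, not all
    return classes


# Toy example with three strains and made-up segment IDs.
example = {
    "strain_A": {"s1", "s2", "s3"},
    "strain_B": {"s1", "s2"},
    "strain_C": {"s1", "s4"},
}
result = classify_segments(example)
```

In this toy input, `s1` is carried by all three strains, `s2` by two of them, and `s3` and `s4` by one strain each, so they fall into the core, dispensable and strain-specific sets respectively.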
Project number: 9042346
Grant type: GRF
Status: Finished
Effective start/end date: 1/01/17 to 17/06/21

Keywords

  • Algorithms, Sorting by Reversal, Gene Rearrangement, Haplotype assembly, Haplotype inference
