Genome Sequencing Assembly and the Applications in Biological Researches
基因組測序和組裝在生物學領域的應用
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 26 Jun 2019 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(10059830-3426-49d1-92f5-653290e667f6).html |
---|---|
Abstract
The rapid development of next-generation sequencing (NGS) technology has significantly reduced the cost of sequencing genomes, including de novo sequencing, a way to build a new reference genome. A reference genome is necessary for investigating a species at the molecular level, and the genome itself contains vital information for understanding the species' unique features and evolutionary history. Almost all model organisms have been sequenced and analyzed, while non-model species and metagenomes continue to be decoded. Bioinformatics tools and pipelines are indispensable for assembling new genomes and mining the data for biological purposes. In my PhD, I dedicated myself to the bioinformatics study of one non-model organism genome and one metagenome, adopting different sequencing technologies and various bioinformatics tools. The first project is a genomic study of the carnivorous plant Cephalotus, which is able to trap and digest small animals and absorb nutrients from them. We sequenced and analyzed the genome and compared it with those of other carnivorous plants; we identified genes related to carnivory and concluded that they originated from stress-related genes through convergent evolution. The second project investigates metagenome assembly by linked-read sequencing. We examined the effects of different parameters and produced a practical guideline for metagenome assembly using 10X linked-read sequencing technology. We concluded that the average sequencing coverage (C) is critical for assembling medium-abundance microbes. An eightfold increase in the average Coverage of short Reads per fragment (CR) can generate 2.51 times more assembled sequence, but a large CR leads to more unexpected PCR duplicates; around 0.3X is the best choice. The average physical Coverage of the genome by long DNA Fragments (CF) controlled contig quality, because contig connectedness is supported by multiple fragments.
Fewer fragments per partition (NFP) helped deconvolve the complex assembly graph into multiple simpler subassembly problems. Interestingly, the average fragment length (μFL) has little effect on contig length; 10 kb fragments are sufficient for metagenome assembly. Our observations lead to a practical protocol for metagenome assembly: merge multiple libraries with sufficient sequencing reads, each prepared with less DNA input than the 1 ng proposed by the standard protocol. Compared with Illumina short reads, assembly with linked-reads significantly improved both completeness and contiguity. To our knowledge, our study is the first to examine the potential influence of various factors on metagenome assembly by linked-reads. We believe this study moves us one step closer to identifying unknown microbes and generating reference metagenome sequences in the future.
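The coverage parameters named in the abstract can be related with a small sketch. It assumes the relation C = CR × CF that is commonly used for linked-read data; the abstract itself does not state this formula. The class `LinkedReadLibrary`, the function `merge_libraries`, and the example values other than CR = 0.3 are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LinkedReadLibrary:
    """Hypothetical container for the coverage parameters in the abstract."""
    cr: float  # CR: average coverage of short reads per fragment
    cf: float  # CF: average physical coverage of the genome by long fragments

    @property
    def c(self) -> float:
        # Assumption (not stated in the abstract): C = CR * CF.
        return self.cr * self.cf


def merge_libraries(libraries: Iterable[LinkedReadLibrary]) -> LinkedReadLibrary:
    """Merge libraries, assuming their fragments are sampled independently.

    Physical coverage (CF) and read coverage (C) both add up across
    libraries; the merged CR is then recovered as C / CF.
    """
    libs = list(libraries)
    total_cf = sum(lib.cf for lib in libs)
    total_c = sum(lib.c for lib in libs)
    return LinkedReadLibrary(cr=total_c / total_cf, cf=total_cf)


# Illustrative values only: CR near the 0.3X the abstract recommends,
# with an arbitrary CF chosen for the example.
single = LinkedReadLibrary(cr=0.3, cf=100.0)
merged = merge_libraries([single, single, single])
print(f"single library: C = {single.c:.1f}X")
print(f"three merged:   C = {merged.c:.1f}X, CF = {merged.cf:.1f}X, CR = {merged.cr:.2f}X")
```

Under these assumptions, merging libraries raises C and CF together while CR stays near the recommended value, which is one way to read the abstract's suggestion to merge several libraries rather than sequence one library more deeply.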
- Genomics, Bioinformatics