Improving the sensitivity of long read overlap detection using grouped short k-mer matches
Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review
Detail(s)
Original language | English |
---|---|
Article number | 190 |
Journal / Publication | BMC Genomics |
Volume | 20 |
Issue number | Suppl. 2 |
Online published | 4 Apr 2019 |
Publication status | Published - 2019 |
Conference
Title | 17th Asia Pacific Bioinformatics Conference (APBC 2019) |
---|---|
Place | China |
City | Wuhan |
Period | 14 - 16 January 2019 |
Link(s)
DOI | DOI |
---|---|
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85064118196&origin=recordpage |
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(6f05afd5-d9f9-466f-86d3-0ad03b5bf5e8).html |
Abstract
Background: Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variation. It also holds promise for deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly from long reads is to quickly identify overlapping reads. Because PacBio data has a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps and overlaps in which both reads have high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.
Results: In this work, we designed and implemented GroupK, an overlap detection program for third-generation sequencing reads based on grouped k-mer hits. While several existing programs use k-mer hits to detect read overlaps, our method uses a group of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small-overlap detection. Grouped k-mer hits were originally designed for homology search; we are the first to apply them to long read overlap detection. Experiments on both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.
Conclusions: GroupK is best used for detecting small overlaps in third-generation sequencing data. It provides a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
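The idea of grouping short k-mer hits under distance constraints can be illustrated with a minimal Python sketch. This is not GroupK's implementation: the function names (`kmer_hits`, `group_hits`, `has_candidate_overlap`) and the fixed thresholds (`max_gap`, `max_diag_shift`, `min_group_size`) are hypothetical stand-ins for the statistically derived constraints mentioned in the abstract, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def kmer_hits(read_a, read_b, k):
    """Return sorted (pos_a, pos_b) pairs where the two reads share a k-mer."""
    index_b = kmer_index(read_b, k)
    hits = []
    for i in range(len(read_a) - k + 1):
        for j in index_b.get(read_a[i:i + k], ()):
            hits.append((i, j))
    return sorted(hits)


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain hits into groups.

    A hit joins a group when it lies downstream of the group's last hit
    by at most max_gap bases in both reads, and its diagonal
    (pos_a - pos_b) differs from the last hit's diagonal by at most
    max_diag_shift. Both thresholds are assumed values, not the
    constraints derived in the paper.
    """
    groups = []
    for a, b in hits:
        best = None
        for group in groups:
            last_a, last_b = group[-1]
            close_enough = 0 < a - last_a <= max_gap and 0 < b - last_b <= max_gap
            same_diagonal = abs((a - b) - (last_a - last_b)) <= max_diag_shift
            if close_enough and same_diagonal:
                # Prefer the longest group that can accept this hit.
                if best is None or len(group) > len(best):
                    best = group
        if best is None:
            groups.append([(a, b)])
        else:
            best.append((a, b))
    return groups


def has_candidate_overlap(read_a, read_b, k=9, max_gap=30,
                          max_diag_shift=5, min_group_size=3):
    """Report a candidate overlap if any group has enough k-mer hits."""
    groups = group_hits(kmer_hits(read_a, read_b, k), max_gap, max_diag_shift)
    return any(len(group) >= min_group_size for group in groups)
```

Under these assumptions, a pair of reads is flagged when several short hits cluster near one diagonal, which is the intuition behind tolerating indel-rich reads that rarely share a single long exact match. A real overlapper would also index all reads at once and verify candidates by alignment.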
Research Area(s)
- Group hit criteria, Metagenomics, Overlap detection, Third-generation sequencing
Citation Format(s)
Improving the sensitivity of long read overlap detection using grouped short k-mer matches. / Du, Nan; Chen, Jiao; Sun, Yanni.
In: BMC Genomics, Vol. 20, No. Suppl. 2, 190, 2019.