
Improving the sensitivity of long read overlap detection using grouped short k-mer matches

  • Nan Du
  • Jiao Chen
  • Yanni Sun*
  • *Corresponding author for this work

Research output: Journal Publications and Reviews, RGC 21 - Publication in refereed journal, peer-review


Abstract

Background: Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variations. It also holds promise for deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data has a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.

Results: In this work, we designed and implemented an overlap detection program named GroupK for third-generation sequencing reads, based on grouped k-mer hits. While several existing programs use k-mer hits to detect overlaps between reads, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search; we are the first to apply group hits to long-read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.

Conclusions: GroupK is best used for detecting small overlaps in third-generation sequencing data. It is a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
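
As an illustration only, the idea of grouping short k-mer hits under distance constraints can be sketched in Python as follows. This is not GroupK's implementation: the function names `kmer_hits` and `group_hits`, the greedy chaining strategy, and the fixed thresholds `max_gap`, `max_diag_shift`, and `min_hits` are all assumptions made for this sketch, whereas the abstract states that GroupK's distance constraints are statistically derived.

```python
from collections import defaultdict


def kmer_hits(a, b, k):
    """Return all (i, j) with a[i:i+k] == b[j:j+k], ordered by i, then j."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            hits.append((i, j))
    return hits


def group_hits(hits, max_gap, max_diag_shift, min_hits):
    """Greedily chain hits into groups (hypothetical fixed thresholds).

    A hit joins the first group whose last hit lies at most max_gap
    positions earlier in both reads and whose diagonal (i - j) differs
    by at most max_diag_shift. Groups with fewer than min_hits hits
    are discarded.
    """
    groups = []
    for i, j in sorted(hits):
        for group in groups:
            last_i, last_j = group[-1]
            close_enough = 0 < i - last_i <= max_gap and 0 < j - last_j <= max_gap
            same_band = abs((i - j) - (last_i - last_j)) <= max_diag_shift
            if close_enough and same_band:
                group.append((i, j))
                break
        else:
            # No existing group accepts this hit, so it starts a new one.
            groups.append([(i, j)])
    return [g for g in groups if len(g) >= min_hits]


# Toy reads: the last 9 bases of read_a equal the first 9 bases of read_b.
read_a = "GGGACGTCATGA"
read_b = "ACGTCATGACCC"
hits = kmer_hits(read_a, read_b, k=4)
groups = group_hits(hits, max_gap=5, max_diag_shift=2, min_hits=3)
```

Requiring several nearby hits on similar diagonals, rather than one long exact match, is what lets short k-mers be used without accepting every chance match; the thresholds here are arbitrary and would need to be tuned to the error rate of real reads.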
Original language: English
Article number: 190
Journal: BMC Genomics
Volume: 20
Issue number: Suppl. 2
Online published: 4 Apr 2019
Publication status: Published - 2019
Event: 17th Asia Pacific Bioinformatics Conference (APBC 2019) - Wuhan, China
Duration: 14 Jan 2019 to 16 Jan 2019
http://glab.hzau.edu.cn/APBC2019/

Research Keywords

  • Group hit criteria
  • Metagenomics
  • Overlap detection
  • Third-generation sequencing

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
