Improving protein domain classification for third-generation sequencing reads using deep learning
Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review
Detail(s)
Original language | English |
---|---|
Article number | 251 |
Journal / Publication | BMC Genomics |
Volume | 22 |
Online published | 9 Apr 2021 |
Publication status | Published - 2021 |
Link(s)
DOI | DOI |
---|---|
Attachment(s) | Documents; Publisher's Copyright Statement |
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85104096332&origin=recordpage |
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(3bcc939a-b801-45e8-b6bb-ae950cae118a).html |
Abstract
Background: With the development of third-generation sequencing (TGS) technologies, researchers can obtain DNA sequences with lengths ranging from tens to hundreds of kb. These long reads allow protein domain annotation without assembly and can thus provide important insights into the biological functions of the underlying data. However, the high error rate of TGS data poses a new challenge for established domain analysis pipelines. State-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory domain classification accuracy on TGS data. New computational methods are therefore needed to improve domain prediction in long, noisy reads.
Results: In this work, we introduce ProDOMA, a deep learning model that performs domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate the task as an open-set problem, so the model can reject reads that do not contain the targeted domains. In experiments on simulated long reads of protein-coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification.
Conclusions: In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.
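The 3-frame translation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from ProDOMA itself: the names `CODON_TABLE`, `translate_frame`, and `three_frame_translate` are hypothetical, the standard genetic code is assumed, and the abstract does not describe how the resulting translations are encoded for the network. The sketch only shows why keeping all three forward frames helps with noisy reads: an insertion or deletion shifts the correct translation into a different frame rather than destroying it.

```python
# Illustrative sketch only; function names are hypothetical, not ProDOMA's API.
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate_frame(dna: str, frame: int) -> str:
    """Translate one forward reading frame (0, 1, or 2).

    Stop codons become '*'; codons with ambiguous bases (e.g. 'N') become 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def three_frame_translate(dna: str) -> list[str]:
    """Return the translations of all three forward reading frames."""
    return [translate_frame(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    clean = "ATGGCCAAAGGG"      # frame 0 reads M A K G
    noisy = "ATGGCCTAAAGGG"     # same read with one inserted 'T' after 'GCC'
    print(three_frame_translate(clean))
    print(three_frame_translate(noisy))
```

In the noisy read, frame 0 still yields the correct prefix `MA` before the insertion, and the downstream residues `KG` reappear in frame 1. A model that sees all three frames therefore still has access to each correctly translated segment.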
Citation Format(s)
Improving protein domain classification for third-generation sequencing reads using deep learning. / Du, Nan; Shang, Jiayu; Sun, Yanni.
In: BMC Genomics, Vol. 22, 251, 2021.