TY - JOUR
T1 - Exploring Scalable Parallelization for Edit Distance-Based Motif Search
AU - Qiu, Junqiao
AU - Ebnenasir, Ali
PY - 2023/3
Y1 - 2023/3
N2 - Motif Searching is an important problem that can reveal crucial information from biological data. Since the general motif searching is NP-hard and the volume of biological data is growing exponentially in recent years, there is a pressing need for developing time and space-efficient algorithms to find motifs. In this paper, we explore scalable parallelization for Edit Distance-Based Motif Search (EMS). We introduce two parallel designs, recursEMS which integrates the existing EMS solver into a parallel recursion tree running in multiple processes, and parEMS that presents a novel thread-based method which avoids the storage of redundant motif candidates. To make the parallel designs practical, we implement SPEMS, a Scalability-sensitive Parallel solver for EMS. For any given biological dataset and search instance, SPEMS can provide an EMS parallelization towards the optimal performance, or a sub-optimal performance but being more space efficient. Evaluations on two real-world DNA dataset TRANSFAC and ChIP-seq show that SPEMS can obtain 10× geometric mean speedup over the state-of-the-art at the expense of no less than 74.7% memory overheads, or provide 2.2× geometric mean speedup with the possibility of consuming less memory, when running on a 48-core machine. © 2022 IEEE.
AB - Motif Searching is an important problem that can reveal crucial information from biological data. Since the general motif searching is NP-hard and the volume of biological data is growing exponentially in recent years, there is a pressing need for developing time and space-efficient algorithms to find motifs. In this paper, we explore scalable parallelization for Edit Distance-Based Motif Search (EMS). We introduce two parallel designs, recursEMS which integrates the existing EMS solver into a parallel recursion tree running in multiple processes, and parEMS that presents a novel thread-based method which avoids the storage of redundant motif candidates. To make the parallel designs practical, we implement SPEMS, a Scalability-sensitive Parallel solver for EMS. For any given biological dataset and search instance, SPEMS can provide an EMS parallelization towards the optimal performance, or a sub-optimal performance but being more space efficient. Evaluations on two real-world DNA dataset TRANSFAC and ChIP-seq show that SPEMS can obtain 10× geometric mean speedup over the state-of-the-art at the expense of no less than 74.7% memory overheads, or provide 2.2× geometric mean speedup with the possibility of consuming less memory, when running on a 48-core machine. © 2022 IEEE.
KW - Biology
KW - Costs
KW - DNA
KW - Edit-distance motif search
KW - Parallel algorithms
KW - parallelism
KW - Proteins
KW - Search problems
KW - Time complexity
UR - http://www.scopus.com/inward/record.url?scp=85139433690&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85139433690&origin=recordpage
U2 - 10.1109/TCBB.2022.3208867
DO - 10.1109/TCBB.2022.3208867
M3 - RGC 21 - Publication in refereed journal
SN - 1545-5963
VL - 20
SP - 1587
EP - 1593
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 2
ER -