TY - JOUR
T1 - Fast exact algorithms for the closest string and substring problems with application to the planted (L,d)-motif model
AU - Chen, Zhi-Zhong
AU - Wang, Lusheng
PY - 2011
Y1 - 2011
N2 - We present two parameterized algorithms for the closest string problem. The first runs in O(nL + nd · 17.97 d) time for DNA strings and in O(nL + nd · 61.86 d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + nd · 13.92 d) time for DNA strings and in O(nL + nd · 47.21 d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n-1)m 2(L + d · 17.97 d · m ⌊log2(d+1)⌋)) time for DNA strings and in O((n-1)m 2(L + d · 61.86 d · m ⌊log2(d+1)⌋)) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms significantly improve the previous bests. To verify experimentally the theoretical improvements in the time complexity, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm uses less memory too. © 2006 IEEE.
AB - We present two parameterized algorithms for the closest string problem. The first runs in O(nL + nd · 17.97 d) time for DNA strings and in O(nL + nd · 61.86 d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + nd · 13.92 d) time for DNA strings and in O(nL + nd · 47.21 d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n-1)m 2(L + d · 17.97 d · m ⌊log2(d+1)⌋)) time for DNA strings and in O((n-1)m 2(L + d · 61.86 d · m ⌊log2(d+1)⌋)) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms significantly improve the previous bests. To verify experimentally the theoretical improvements in the time complexity, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm uses less memory too. © 2006 IEEE.
KW - closest string
KW - closest substring
KW - DNA motif discovery
KW - Parameterized algorithm
UR - http://www.scopus.com/inward/record.url?scp=79960921508&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-79960921508&origin=recordpage
U2 - 10.1109/TCBB.2011.21
DO - 10.1109/TCBB.2011.21
M3 - RGC 21 - Publication in refereed journal
C2 - 21282867
SN - 1545-5963
VL - 8
SP - 1400
EP - 1410
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 5
M1 - 5708134
ER -