TY - JOUR
T1 - Memory-Based Hardware Architectures to Detect ClamAV Virus Signatures with Restricted Regular Expression Features
AU - Or, Nga Lam
AU - Wang, Xing
AU - Pao, Derek
PY - 2016/4/1
Y1 - 2016/4/1
N2 - We aim to implement a single-chip hardware detection engine for virus scanning. Our study is based on the ClamAV virus database, which contains 88.9K strings and 9.6K extended hex-signatures with restricted regular expression (regex) features. We have previously presented cost-effective hardware architectures to detect the 88.9K strings and the 3.2K regex patterns that are composed of multiple string segments. In this paper, we present hardware architectures to detect the remaining 6.4K regex patterns. Our method is based on the information reduction approach: we transform the byte-oriented matching problem into a token-based matching problem. A regex pattern contains one or more segments, and a segment may be subdivided into multiple non-trivial tokens. In general, a token is associated with only one or a few regexes. The input byte-stream is converted into a token-stream by dedicated hardware units, where the number of tokens is much smaller than the number of bytes. The token-stream is processed by an NFA-based aggregation unit to determine whether any segment can be found. Detected segments are further processed by a scoreboard to determine whether any multi-segment pattern can be found. As a proof of concept, our method is implemented on a Virtex-6 FPGA and consumes 1.84 MB of on-chip memory.
AB - We aim to implement a single-chip hardware detection engine for virus scanning. Our study is based on the ClamAV virus database, which contains 88.9K strings and 9.6K extended hex-signatures with restricted regular expression (regex) features. We have previously presented cost-effective hardware architectures to detect the 88.9K strings and the 3.2K regex patterns that are composed of multiple string segments. In this paper, we present hardware architectures to detect the remaining 6.4K regex patterns. Our method is based on the information reduction approach: we transform the byte-oriented matching problem into a token-based matching problem. A regex pattern contains one or more segments, and a segment may be subdivided into multiple non-trivial tokens. In general, a token is associated with only one or a few regexes. The input byte-stream is converted into a token-stream by dedicated hardware units, where the number of tokens is much smaller than the number of bytes. The token-stream is processed by an NFA-based aggregation unit to determine whether any segment can be found. Detected segments are further processed by a scoreboard to determine whether any multi-segment pattern can be found. As a proof of concept, our method is implemented on a Virtex-6 FPGA and consumes 1.84 MB of on-chip memory.
KW - Hardware architecture
KW - regular expression matching
KW - string matching
KW - virus detection
UR - http://www.scopus.com/inward/record.url?scp=84963723911&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-84963723911&origin=recordpage
U2 - 10.1109/TC.2015.2439274
DO - 10.1109/TC.2015.2439274
M3 - RGC 21 - Publication in refereed journal
SN - 0018-9340
VL - 65
SP - 1225
EP - 1238
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 4
M1 - 7115115
ER -