Abstract
Deep learning (DL) models have significantly improved the performance of text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have aimed to develop adversarial attacks against DL models in realistic black-box settings (i.e., assuming no model knowledge is accessible to attackers). These attacks typically follow a two-phase framework: (1) sensitivity estimation through gradient-based or deletion-based methods to evaluate the sensitivity of each token to the prediction of the target model, and (2) perturbation execution to craft adversarial examples based on the estimated token sensitivity. However, the gradient-based methods used to estimate sensitivity often have difficulty capturing token directionality, and the deletion-based methods often suffer from overlapping token sensitivities. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME or SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA's attack performance on DL models executing text classification on the IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity datasets and DL models conducting text regression on the My Personality, Drug Review, and CommonLit Readability datasets. The proposed XATA outperformed the existing gradient-based and deletion-based adversarial attack baselines in both tasks. These findings indicate that the ever-growing research focused on improving the explainability of DL models with additive feature attribution explainable methods can provide attackers with weapons to launch targeted adversarial attacks. © 2023 IEEE.
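The two-phase framework described in the abstract can be illustrated with a minimal sketch. This is not XATA itself: it uses the simpler deletion-based baseline for phase 1, and everything named here (`toy_model`, `TOY_WEIGHTS`, `deletion_sensitivity`, `perturb`, the substitute map) is a hypothetical stand-in chosen for illustration, not part of the paper. An explanation-based attack would presumably replace the phase 1 scores with LIME or SHAP attributions.

```python
import math
from typing import Callable, Dict, List

# Hypothetical black-box target: the attacker only sees the output score.
TOY_WEIGHTS: Dict[str, float] = {"great": 2.0, "good": 1.0, "boring": -1.5, "awful": -2.0}


def toy_model(tokens: List[str]) -> float:
    """Return a pseudo-probability of the positive class for a token list."""
    logit = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def deletion_sensitivity(tokens: List[str], model: Callable[[List[str]], float]) -> List[float]:
    """Phase 1 (deletion-based baseline): score each token by how much the
    model output drops when that single token is removed."""
    base = model(tokens)
    return [base - model(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def perturb(tokens: List[str], scores: List[float],
            substitutes: Dict[str, str], budget: int = 1) -> List[str]:
    """Phase 2: replace up to `budget` tokens, most sensitive first,
    using an assumed substitute map (e.g. typos or synonyms)."""
    order = sorted(range(len(tokens)), key=lambda i: abs(scores[i]), reverse=True)
    adversarial = list(tokens)
    changed = 0
    for i in order:
        if changed >= budget:
            break
        if adversarial[i] in substitutes:
            adversarial[i] = substitutes[adversarial[i]]
            changed += 1
    return adversarial


tokens = "the movie was great".split()
scores = deletion_sensitivity(tokens, toy_model)
adversarial = perturb(tokens, scores, {"great": "grate"}, budget=1)
print(tokens, toy_model(tokens))
print(adversarial, toy_model(adversarial))
```

In this toy setting the single most sensitive token is swapped for a misspelling that the stand-in model does not recognize, which lowers its positive score. Real attacks query an actual model and constrain the substitutions so that the text stays readable.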
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 12400-12414 |
| Journal | IEEE Transactions on Knowledge and Data Engineering |
| Volume | 35 |
| Issue number | 12 |
| Online published | 26 Apr 2023 |
| Publication status | Published - Dec 2023 |
| Externally published | Yes |
Funding
This work was supported in part by the NSFC under Grants 72171071, 72293581, 72271084, 72188101, 72101079, 72293580, 71771131, and 72110107003, in part by the Excellent Fund of HFUT under Grant JZ2021HGPA0060, and in part by the China Scholarship Council under Grant 202206690034.
Research Keywords
- Additive feature attribution
- Additives
- Adversarial attack
- Closed box
- Explainable methods
- Mathematical models
- Perturbation methods
- Sensitivity
- Task analysis
- Text categorization
- Text classification
- Text regression
Fingerprint
Dive into the research topics of 'Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression'. Together they form a unique fingerprint.