Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression

  • Yidong Chai
  • Ruicheng Liang*
  • Sagar Samtani
  • Hongyi Zhu
  • Meng Wang
  • Yezheng Liu
  • Yuanchun Jiang*
  • *Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

Abstract

Deep learning (DL) models have significantly improved the performance of text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have aimed to develop adversarial attacks against DL models in realistic black-box settings (i.e., assuming no model knowledge is accessible to attackers). These attacks typically operate with a two-phase framework: (1) sensitivity estimation through gradient-based or deletion-based methods to evaluate the sensitivity of each token to the prediction of the target model, and (2) perturbation execution to craft adversarial examples based on the estimated token sensitivity. However, the gradient-based and deletion-based methods used to estimate sensitivity often struggle, respectively, to capture token directionality and to handle overlapping token sensitivities. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME or SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA's attack performance on DL models executing text classification on the IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity datasets and DL models conducting text regression on the My Personality, Drug Review, and CommonLit Readability datasets. The proposed XATA outperformed the existing gradient-based and deletion-based adversarial attack baselines in both tasks. These findings indicate that the ever-growing research focused on improving the explainability of DL models with additive feature attribution explainable methods can provide attackers with weapons to launch targeted adversarial attacks. © 2023 IEEE.
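
The two-phase framework described in the abstract can be illustrated with a minimal sketch. This is not the authors' XATA implementation: the toy scoring function, the substitution table, the 0.5 decision threshold, and all function names are hypothetical, and exact Shapley enumeration stands in for LIME or SHAP, which estimate additive attributions by sampling. The sketch compares a deletion-based sensitivity estimate with an additive attribution on a toy model whose score saturates, which is one way two tokens' sensitivities can overlap.

```python
from itertools import combinations
from math import factorial

# Hypothetical stand-in for the target model: the attacker only sees the score.
WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.3, "boring": -1.5}
# Hypothetical substitution table used in the perturbation phase.
SUBSTITUTES = {"great": "fine", "good": "okay"}


def black_box_score(tokens):
    """Toy sentiment score clipped to [-2, 2], so strong tokens overlap."""
    raw = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return max(-2.0, min(2.0, raw))


def deletion_sensitivity(tokens, score_fn):
    """Phase 1, deletion-based: score drop when one token is removed."""
    full = score_fn(tokens)
    return [full - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]


def shapley_values(tokens, score_fn):
    """Phase 1, additive attribution: exact Shapley value of each token.

    Enumerates every subset, so it is only practical for short inputs.
    """
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [tokens[j] for j in subset]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without))
    return values


def greedy_attack(tokens, score_fn, sensitivities, substitutes, threshold=0.5):
    """Phase 2: substitute the most sensitive tokens until the label flips."""
    original_label = score_fn(tokens) > threshold
    sign = 1.0 if original_label else -1.0
    adversarial = list(tokens)
    order = sorted(range(len(tokens)),
                   key=lambda i: sign * sensitivities[i], reverse=True)
    for i in order:
        if sign * sensitivities[i] <= 0:
            break  # remaining tokens do not support the current prediction
        if adversarial[i] in substitutes:
            adversarial[i] = substitutes[adversarial[i]]
            if (score_fn(adversarial) > threshold) != original_label:
                break  # prediction flipped
    return adversarial


if __name__ == "__main__":
    sentence = "a great and good movie".split()
    for name, estimator in [("deletion", deletion_sensitivity),
                            ("shapley", shapley_values)]:
        sens = estimator(sentence, black_box_score)
        adv = greedy_attack(sentence, black_box_score, sens, SUBSTITUTES)
        print(name, sens, adv, black_box_score(adv))
```

In this toy setting, deleting "good" alone leaves the clipped score unchanged, so the deletion-based estimate assigns it zero sensitivity, while the Shapley attribution still credits it with part of the prediction. Real attacks would query an actual model and use meaning-preserving substitutions; the sketch only shows how the sensitivity estimate drives which tokens are perturbed.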
Original language: English
Pages (from-to): 12400-12414
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 35
Issue number: 12
Online published: 26 Apr 2023
Publication status: Published - Dec 2023
Externally published: Yes

Funding

This work was supported in part by the NSFC under Grants 72171071, 72293581, 72271084, 72188101, 72101079, 72293580, 71771131, and 72110107003, in part by the Excellent Fund of HFUT under Grant JZ2021HGPA0060, and in part by the China Scholarship Council under Grant 202206690034.

Research Keywords

  • Additive feature attribution
  • Additives
  • Adversarial attack
  • Closed box
  • Explainable methods
  • Mathematical models
  • Perturbation methods
  • Sensitivity
  • Task analysis
  • Text categorization
  • Text classification
  • Text regression
