Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs

  • Xiangwen Wang
  • Jie Peng
  • Kaidi Xu
  • Huaxiu Yao
  • Tianlong Chen

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Attacks on large language models (LLMs) are increasingly used to assess their safety. Yet existing attack methods face limitations: some require access to model weights, while others only ensure that the LLM outputs harmful information without controlling the specific content of that output. Exact control of the LLM output can produce more inconspicuous attacks, which could open a new chapter for LLM security. To achieve this, we propose RLTA, the Reinforcement Learning Targeted Attack, a framework designed for attacking LLMs that is adaptable to black-box (weight-inaccessible) scenarios. It automatically generates malicious prompts that trigger target LLMs to produce specific outputs. We demonstrate RLTA in two scenarios: LLM trojan detection and jailbreaking. The experimental results show the potential of RLTA for strengthening the security measures surrounding contemporary LLMs.

© 2024 Association for Computational Linguistics.
Original language: English
Title of host publication: PrivateNLP 2024 - The Fifth Workshop on Privacy in Natural Language Processing
Subtitle of host publication: Proceedings of the Workshop
Editors: Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander,
Publisher: Association for Computational Linguistics
Pages: 170-177
ISBN (Print): 9798891761391
Publication status: Published - Aug 2024
Externally published: Yes
Event: 5th Workshop on Privacy in Natural Language Processing (PrivateNLP 2024) - Bangkok, Thailand
Duration: 15 Aug 2024 → …
https://aclanthology.org/2024.privatenlp-1

Publication series

Name: PrivateNLP - Workshop on Privacy in Natural Language Processing, Proceedings of the Workshop

Conference

Conference: 5th Workshop on Privacy in Natural Language Processing (PrivateNLP 2024)
Place: Thailand
City: Bangkok
Period: 15/08/24 → …

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
