Abstract
Recently, there has been a growing focus on attacking large language models (LLMs) to assess their safety. Yet existing attack methods face challenges: they either require access to model weights or merely ensure that the LLM outputs harmful information, without controlling the specific content of that output. Exact control of the LLM output can produce more inconspicuous attacks, which could open a new chapter for LLM security. To achieve this, we propose RLTA (Reinforcement Learning Targeted Attack), a framework designed for attacking LLMs that is adaptable to black-box (weight-inaccessible) scenarios. It can automatically generate malicious prompts that trigger target LLMs to produce specific outputs. We demonstrate RLTA in two scenarios: LLM trojan detection and jailbreaking. The comprehensive experimental results show the potential of RLTA in enhancing the security measures surrounding contemporary LLMs. © 2024 Association for Computational Linguistics.
| Original language | English |
|---|---|
| Title of host publication | PrivateNLP 2024 - The Fifth Workshop on Privacy in Natural Language Processing |
| Subtitle of host publication | Proceedings of the Workshop |
| Editors | Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, |
| Publisher | Association for Computational Linguistics |
| Pages | 170-177 |
| ISBN (Print) | 9798891761391 |
| Publication status | Published - Aug 2024 |
| Externally published | Yes |
| Event | 5th Workshop on Privacy in Natural Language Processing (PrivateNLP 2024), Bangkok, Thailand. Duration: 15 Aug 2024 → …. https://aclanthology.org/2024.privatenlp-1 |
Publication series
| Name | PrivateNLP - Workshop on Privacy in Natural Language Processing, Proceedings of the Workshop |
|---|---|
Conference
| Conference | 5th Workshop on Privacy in Natural Language Processing (PrivateNLP 2024) |
|---|---|
| Place | Thailand |
| City | Bangkok |
| Period | 15/08/24 → … |
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
Paper title: 'Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs'