Papers on Prompt Injection

The search for papers about prompt injection yielded several relevant results, particularly focused on trustworthiness and security aspects of Generative Pre-trained Transformer (GPT) models. Here are some of the notable papers:

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (Oral & Poster)
- Authors: Not listed for the oral version; Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li for the poster version.
- Abstract: This work conducts a comprehensive trustworthiness evaluation for GPT-4 and GPT-3.5, considering aspects like toxicity, bias, adversarial robustness, and privacy. It highlights vulnerabilities in these models, including their susceptibility to being misled by toxic and biased prompts, privacy leaks, and more. Specifically, it discusses 'PromptInject', a method that inserts harmful content into prompts to mislead language models, demonstrating the vulnerability of GPT-3 to such attacks.
- Relevant Chunks: Pages 107,108 discuss PromptInject and its applications in misleading GPT models.
- URL: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Oral, DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Poster
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP (Poster)
- Authors: Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, Xiangyu Zhang.
- Abstract: This paper proposes a test-time poisoned sample detection framework for NLP models, leveraging the interpretability of model predictions. It uses ChatGPT as a paraphrasing tool to detect poisoned samples, considering the prompt's role in maintaining semantics while removing triggers.
- Relevant Chunks: Pages 2, 4.
- URL: ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
Jailbroken: How Does LLM Safety Training Fail? (Poster & Oral)
- Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt.
- Abstract: This paper investigates why jailbreak attacks succeed against safety-trained large language models (LLMs) like GPT-4 and Claude v1.3. It identifies two failure modes: competing objectives and mismatched generalization. The paper demonstrates that vulnerabilities persist despite extensive safety training.
- Relevant Chunks: Page 8 discusses specific jailbreak strategies and their implications.
- URL: Jailbroken: How Does LLM Safety Training Fail? Poster, Jailbroken: How Does LLM Safety Training Fail? Oral

These papers collectively provide insights into the vulnerabilities and trustworthiness issues of GPT models, particularly focusing on the impact of prompt manipulation techniques like PromptInject and jailbreaking prompts.