New chat
what is the latest on adversarial attacks in instruction tuning ?

The latest research on adversarial attacks in instruction tuning reveals several significant findings:
-
Exploitability of Instruction Tuning: Shu et al. (2023) discuss how instruction tuning, an effective technique for aligning large language models (LLMs) with human intent, can be exploited by adversaries. They introduce "AutoPoison," an automated data poisoning pipeline, which can inject specific instruction-following examples into training data to change the model's behavior. This includes content injection and over-refusal attacks. Their results indicate that a small fraction of poisoned data can significantly alter a model's behavior while maintaining stealthiness in the examples【Shu, M., Wang, J., Zhu, C., Geiping, J., Xiao, C., & Goldstein, T. (2023). On the Exploitability of Instruction Tuning. Page 3, Link】.
-
General-purpose Vision-Language Models with Instruction Tuning: Dai et al. (2023) explore instruction tuning in the context of vision-language models. Their work, InstructBLIP, involves a systematic study of vision-language instruction tuning based on pretrained BLIP-2 models. This research underscores the potential of instruction tuning in enhancing the capabilities of vision-language models【Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P. N., & Hoi, S. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. Page 8, Link】.
-
Instruction Tuning on Open Resources: Wang et al. (2023) investigate the state of instruction tuning on a range of open instruction-following datasets. Their study finds that different instruction-tuning datasets can uncover or enhance specific skills, but no single dataset provides the best performance across all evaluations. This highlights the complexity and challenges in instruction tuning for language models【Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K., Wadden, D., MacMillan, K., Smith, N., Beltagy, I., & Hajishirzi, H. (2023). How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. Pages 1, 2, Link】.
These studies collectively highlight the evolving nature of instruction tuning, its potential vulnerabilities to adversarial attacks, and the ongoing efforts to understand and enhance its effectiveness across different domains.