MMLU Performance Insights
What's the latest on MMLU performance, and who is ahead?

The latest research on large language model (LLM) performance on the Massive Multitask Language Understanding (MMLU) benchmark offers several insights and developments. Here are some key findings from recent papers:
- In-Context Impersonation Reveals Large Language Models' Strengths and Biases: This study by Salewski, Alaniz, Rio-Torto, Schulz, and Akata (2023) found that when LLMs are asked to impersonate domain experts, their performance on MMLU tasks improves compared to when they impersonate non-domain experts (a minimal prompting sketch appears after this list). This trend was consistent across all four MMLU domains. The study also noted that performance on Humanities tasks was generally higher than in other domains, in line with findings in prior literature【Salewski, L., Alaniz, S., Rio-Torto, I., Schulz, E., & Akata, Z. (2023). In-Context Impersonation Reveals Large Language Models' Strengths and Biases. Page 7, Link】.
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models: This research by Huang et al. (2023) discussed how LLM performance on MMLU fluctuates with scale, noting that accuracy tends to stay around random chance (about 25% on four-option questions) until models reach the scale of GPT-3【Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Lei, J., Fu, Y., Sun, M., & He, J. (2023). C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Page 9, Link】.
- Effective Human-AI Teams via Learned Natural Language Rules and Onboarding: Mozannar et al. (2023) used ChatGPT for evaluation on the MMLU dataset. In their evaluation, which focused on a subset of MMLU subjects, ChatGPT achieved an accuracy of 69% (see the evaluation sketch after this list)【Mozannar, H., Lee, J., Wei, D., Sattigeri, P., Das, S., & Sontag, D. (2023). Effective Human-AI Teams via Learned Natural Language Rules and Onboarding. Pages 25-26, Link】.
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models: This paper by Zhang et al. (2023) emphasized the use of human exam questions, like those in MMLU, to evaluate LLMs, noting that tackling such questions often requires diverse skills such as language understanding and complex reasoning【Zhang, W., Aljunied, M., Gao, C., Chia, Y. K., & Bing, L. (2023). M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. Pages 2-3, Link】.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena: This research by Zheng et al. (2023) evaluated models derived from LLaMA on MMLU, using GPT-4 for grading. They found that fine-tuning on high-quality dialog datasets consistently improved model performance on MMLU【Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Page 9, Link】.
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources: Wang et al. (2023) provided a performance comparison of various models, including on the MMLU benchmark. They noted that no single dataset or combination of datasets provided the best performance across all evaluations【Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K., Wadden, D., MacMillan, K., Smith, N., Beltagy, I., & Hajishirzi, H. (2023). How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. Link】.
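
To make the impersonation result from Salewski et al. (2023) concrete, here is a minimal sketch of persona-style prompting on an MMLU-style multiple-choice question. It assumes the OpenAI Python client; the persona wording, example question, and model name are illustrative placeholders, not the paper's actual prompts or models.

```python
# Minimal sketch of in-context impersonation prompting on an MMLU-style
# multiple-choice question (illustrative; not the exact prompts or models
# used by Salewski et al., 2023).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "Which of the following is a primary color of light?\n"
    "A. Green\nB. Orange\nC. Purple\nD. Brown"
)

def answer_as(persona: str, question: str) -> str:
    """Ask the model to answer while impersonating the given persona."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whichever LLM you evaluate
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"You are {persona}. Answer with a single letter (A, B, C, or D)."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

# Compare a domain-expert persona with a non-expert persona on the same item.
print("physics professor:", answer_as("a physics professor", QUESTION))
print("four-year-old child:", answer_as("a four-year-old child", QUESTION))
```

Running the same item under several personas and averaging over many questions is, in spirit, how the expert vs. non-expert gap on MMLU domains is measured.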
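
For context on how accuracy figures like the 69% reported by Mozannar et al. (2023) are typically obtained, below is a minimal sketch of a zero-shot MMLU accuracy loop. The Hugging Face dataset identifier (cais/mmlu), its field names, and the ask_model callable are assumptions for illustration; published harnesses differ in subject selection, few-shot prompting, and answer extraction, so this is not any specific paper's setup.

```python
# Minimal sketch of a zero-shot MMLU accuracy loop. The dataset identifier,
# field names, and the ask_model callable are assumptions for illustration;
# real harnesses differ in prompt format, few-shot examples, and answer parsing.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example: dict) -> str:
    """Render one MMLU item as a multiple-choice prompt."""
    options = "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(LETTERS, example["choices"]))
    return f"{example['question']}\n{options}\nAnswer with a single letter."

def evaluate(ask_model, subject: str = "high_school_physics") -> float:
    """Return accuracy of ask_model (prompt -> answer string) on one MMLU subject."""
    test = load_dataset("cais/mmlu", subject, split="test")
    correct = 0
    for example in test:
        prediction = ask_model(format_prompt(example)).strip().upper()[:1]
        correct += prediction == LETTERS[example["answer"]]
    return correct / len(test)

# Example usage with a trivial baseline that always answers "A":
print(evaluate(lambda prompt: "A"))  # lands near the ~25% random-chance level
```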