RocketEval: Efficient automated LLM evaluation via grading checklist

Tianjun Wei (Co-first Author), Wei Wen (Co-first Author), Ruizhi Qiao*, Xing Sun, Jianghong Ma*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

Abstract

Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, using a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expense, privacy and security concerns, and limited reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method, named RocketEval, that leverages a lightweight LLM as the judge. We first identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributable to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with supervised annotations. Our experiments on the automated evaluation benchmarks MT-BENCH and WILDBENCH reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at https://github.com/Joinn99/RocketEval-ICLR. © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
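The checklist-grading idea in the abstract can be illustrated with a minimal sketch. In the hypothetical Python example below, `judge_yes_prob` stands in for a call to a lightweight judge LLM (e.g., Gemma-2-2B) returning its probability of answering "Yes" to one checklist question, and the per-item weights stand in for the reweighting learned from supervised annotations; these names and the exact prompt format are illustrative assumptions, not the authors' implementation, which is available at https://github.com/Joinn99/RocketEval-ICLR.

```python
# Minimal sketch of checklist-based grading in the spirit of RocketEval.
# All identifiers here are hypothetical; see the authors' repository for
# the actual implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    question: str        # yes/no question probing one facet of the response
    weight: float = 1.0  # in the paper, reweighted against supervised annotations


def grade_response(
    query: str,
    response: str,
    checklist: List[ChecklistItem],
    judge_yes_prob: Callable[[str], float],
) -> float:
    """Score a response as a weighted average of per-item 'Yes' probabilities.

    `judge_yes_prob` wraps a lightweight judge LLM and returns its probability
    of answering "Yes" to a single checklist prompt.
    """
    total, norm = 0.0, 0.0
    for item in checklist:
        prompt = (
            f"Query: {query}\n"
            f"Response: {response}\n"
            f"Checklist question: {item.question}\n"
            "Answer strictly Yes or No."
        )
        total += item.weight * judge_yes_prob(prompt)
        norm += item.weight
    return total / norm if norm else 0.0


if __name__ == "__main__":
    checklist = [
        ChecklistItem("Does the response directly answer the query?"),
        ChecklistItem("Is the response free of factual errors?", weight=2.0),
    ]
    # Placeholder judge; replace with a real model call returning P("Yes").
    dummy_judge = lambda prompt: 0.8
    print(grade_response("What is 2+2?", "4", checklist, dummy_judge))
```

Per the abstract, the item weights are fit to align with supervised annotations; a simple regression over human preference labels would be one plausible way to do this, though the paper should be consulted for the actual procedure.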
Original language: English
Title of host publication: International Conference on Learning Representations 2025 (ICLR 2025)
Editors: Y. Yue, A. Garg, N. Peng, F. Sha, R. Yu
Publisher: International Conference on Learning Representations, ICLR
Pages: 101641-101667
Number of pages: 27
ISBN (Print): 9798331320850
Publication status: Published - Apr 2025
Event: 13th International Conference on Learning Representations (ICLR 2025), Singapore EXPO, Singapore
Duration: 24 Apr 2025 - 28 Apr 2025
https://iclr.cc/Conferences/2025

Publication series

Name: International Conference on Learning Representations, ICLR

Conference

Conference: 13th International Conference on Learning Representations (ICLR 2025)
Abbreviated title: ICLR 2025
Place: Singapore
City: Singapore
Period: 24/04/25 - 28/04/25
Internet address: https://iclr.cc/Conferences/2025

Funding

This work was partially supported by the National Natural Science Foundation of China (Project No. 62202122 and No. 62073272), the Shenzhen Science and Technology Program under Grant No. GXWD20231130110308001, and the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2024A1515011949.
