2026-03-24
Learning to judge: LLMs designing and applying evaluation rubrics
Publication
Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable, task-aware evaluation dimensions and apply them consistently within a single model, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
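As an illustrative sketch only, not the paper's implementation, the two-stage judge pattern the abstract describes (a model first drafts a rubric for a task, then scores an output against each criterion) can be expressed as follows. The `llm` callable, the prompt wording, and the 1-5 scale are assumptions for demonstration.

```python
from typing import Callable, Dict, List

def design_rubric(llm: Callable[[str], str], task_description: str) -> List[str]:
    """Ask a judge model to propose its own evaluation criteria for a task.

    `llm` is a hypothetical text-in/text-out interface; the prompt wording is
    illustrative, not the wording used in GER-Eval.
    """
    prompt = (
        f"Task: {task_description}\n"
        "List the criteria you would use to evaluate an output for this task, "
        "one criterion per line."
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

def apply_rubric(llm: Callable[[str], str], output: str, criteria: List[str]) -> Dict[str, int]:
    """Score a system output against each self-generated criterion on a 1-5 scale."""
    scores = {}
    for criterion in criteria:
        prompt = (
            f"Criterion: {criterion}\n"
            f"Output: {output}\n"
            "Rate how well the output satisfies the criterion on a 1-5 scale. "
            "Reply with the number only."
        )
        reply = llm(prompt).strip()
        scores[criterion] = int(reply) if reply.isdigit() else 0  # fall back on unparsable replies
    return scores

if __name__ == "__main__":
    # Toy stand-in for a real model client, so the sketch runs end to end.
    def toy_llm(prompt: str) -> str:
        return "- fluency\n- factual accuracy" if "List the criteria" in prompt else "4"

    rubric = design_rubric(toy_llm, "Summarise a news article in three sentences.")
    print(apply_rubric(toy_llm, "Example summary text.", rubric))
```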
| Additional Metadata | |
|---|---|
| Conference | The 19th Conference of the European Chapter of the Association for Computational Linguistics |
| Organisation | Human-Centered Data Analytics |
| Citation | Siro, C., Aliannejadi, P., & Aliannejadi, M. (2026). Learning to judge: LLMs designing and applying evaluation rubrics. In Findings of the European Chapter of the Association for Computational Linguistics (EACL 2026). |