Large language models (LLMs) have demonstrated task-solving abilities not present in smaller models. Leveraging these capabilities for automated evaluation (LLM4Eval) has recently attracted considerable attention across multiple research communities. Building on the success of the previous workshops, which established foundations in automated judgments and RAG evaluation, this third iteration addresses emerging challenges as IR systems become increasingly personalized and interactive. The main goal of the workshop is to bring together researchers from industry and academia to explore three critical areas: the evaluation of personalized IR systems while maintaining fairness, the boundaries between automated and human assessment in subjective scenarios, and evaluation methodologies for systems that combine multiple IR paradigms (search, recommendation, and dialogue). By examining these challenges, we seek to understand how evaluation approaches can evolve to match the sophistication of modern IR applications. The workshop follows an interactive format, including roundtable discussion sessions, to foster dialogue about the future of IR evaluation rather than one-sided discussion. This third iteration follows the successful events at SIGIR 2024 and WSDM 2025, the first of which attracted over 50 participants.

