Traditional batch evaluation metrics assume that user interaction with search results is limited to scanning down a ranked list. However, modern search interfaces include additional elements that support result list refinement (RLR) through facets and filters, making user search behavior increasingly dynamic. We develop an evaluation framework that goes beyond the interaction assumption of traditional evaluation metrics and allows for batch evaluation of systems with and without RLR elements. In our framework we model user interaction as switching between different sublists. This provides a measure of user effort based on the joint effect of user interaction with RLR elements and result quality. We validate our framework by conducting a user study and comparing model predictions with real user performance. Our model predictions show a significant positive correlation with real user effort. Further, in contrast to traditional evaluation metrics, our framework's predictions of when users stand to benefit from RLR elements reflect the findings of our user study. Finally, we use the framework to investigate under what conditions systems with and without RLR elements are likely to be effective. We simulate varying conditions concerning ranking quality, users, tasks, and interface properties, demonstrating a cost-effective way to study whole-system performance.
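The idea of measuring effort over sublists can be illustrated with a toy sketch. It is not the framework from the paper: the effort definition (results inspected until a target number of relevant ones is found), the fixed `switch_cost`, and the names `scan_effort` and `effort_with_switch` are all assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# Each result is (is_relevant, facet_value); both fields are illustrative assumptions.
Result = Tuple[bool, str]


def scan_effort(results: List[Result], needed: int) -> Optional[int]:
    """Count results inspected, top to bottom, until `needed` relevant ones are found.

    Returns None if the list runs out first.
    """
    if needed <= 0:
        return 0
    found = 0
    for position, (is_relevant, _facet) in enumerate(results, start=1):
        if is_relevant:
            found += 1
            if found == needed:
                return position
    return None


def effort_with_switch(
    results: List[Result], facet: str, needed: int, switch_cost: float
) -> Optional[float]:
    """Effort when the user first switches to the sublist for one facet value.

    The switch is charged a fixed, hypothetical cost; the sublist keeps the
    original ranking order.
    """
    sublist = [r for r in results if r[1] == facet]
    effort = scan_effort(sublist, needed)
    return None if effort is None else switch_cost + effort


# Toy ranking: relevant results are all tagged with facet value "a".
ranking = [
    (False, "a"), (False, "b"), (True, "a"),
    (False, "b"), (False, "b"), (True, "a"),
]

plain = scan_effort(ranking, needed=2)
refined = effort_with_switch(ranking, "a", needed=2, switch_cost=1.0)
print(plain, refined)
```

In this toy ranking the filter removes non-relevant results that sit between the relevant ones, so refinement pays off whenever the switch cost is smaller than the number of inspections it saves. With a better ranking or a costlier switch, the plain list wins, which is the kind of trade-off the abstract refers to.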
Behavior-aware Search Evaluation for Information Retrieval
Annual ACM SIGIR Conference
Human-Centered Data Analytics

He, J., Bron, M., de Vries, A.P., Azzopardi, L., & de Rijke, M. (2015). Untangling Result List Refinement and Ranking Quality: a Framework for Evaluation and Prediction. In Proceedings of Annual ACM SIGIR Conference 2015 (SIGIR 38). ACM. doi:10.1145/2766462.2767740