Recent advances in Large Language Models (LLMs) have enabled powerful systems that perform tasks by reasoning over tabular data [9, 10, 13, 7, 4]. While these systems typically assume the relevant data is provided with a query, real-world use cases are mostly open-domain: a query arrives without any context about the underlying tables. Retrieving relevant tables is typically done over dense embeddings of serialized tables [5]. Yet, there is limited understanding of how effective different inputs and serialization methods are when using such off-the-shelf text-embedding models for table retrieval. In this work, we show that different serialization strategies lead to significant variations in retrieval performance. Additionally, we surface shortcomings in commonly used benchmarks applied in open-domain settings, motivating further study and refinement.
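To make the setup concrete, the following is a minimal sketch of dense table retrieval with an off-the-shelf text-embedding model, comparing two illustrative serialization strategies (metadata only vs. metadata plus linearized rows). The model choice, serialization formats, and example table are assumptions for illustration, not the exact configuration studied in the paper.

```python
# Minimal sketch (illustrative, not the paper's exact setup): embed a query and
# two serializations of the same table, then compare cosine similarities.
import numpy as np
from sentence_transformers import SentenceTransformer

table = {
    "title": "City populations",
    "columns": ["city", "country", "population"],
    "rows": [["Amsterdam", "NL", 921402], ["Berlin", "DE", 3645000]],
}

def serialize_metadata_only(t):
    # Use only table metadata: title and column headers.
    return f"{t['title']}: " + ", ".join(t["columns"])

def serialize_with_rows(t):
    # Append linearized row values to the metadata.
    rows = " | ".join(", ".join(map(str, r)) for r in t["rows"])
    return serialize_metadata_only(t) + " | " + rows

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
query = "Which European city has the largest population?"

for serialize in (serialize_metadata_only, serialize_with_rows):
    q_emb, t_emb = model.encode([query, serialize(table)], normalize_embeddings=True)
    # With normalized embeddings, the dot product is the cosine similarity.
    print(serialize.__name__, float(np.dot(q_emb, t_emb)))
```

In an open-domain setting, the same idea scales to a corpus: every table is serialized, embedded once, and indexed, and the highest-scoring tables for a query embedding are returned as candidates.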

ELLIS workshop on Representation Learning and Generative Models for Structured Data

Gomm, D., & Hulsebos, M. (2025). Metadata matters in dense table retrieval.