Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration, and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in the wild, but lack detailed schema information about how tables relate to each other, as well as metadata such as data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas.

doi.org/10.1145/3654975
ACM on Management of Data (SIGMOD 2024)
Database Architectures

Doehmen, T., Geacu, R., Hulsebos, M., & Schelter, S. (2024). SchemaPile: A large collection of relational database schemas. In Proceedings of the ACM International Conference on Management of Data (SIGMOD) (pp. 172:1–172:25). doi:10.1145/3654975