SQaLe is a large-scale, semi-synthetic Text-to-SQL dataset grounded in real-world database schemas. It was designed to push the boundaries of natural-language-to-SQL generation by combining realistic schema diversity, complex query structures, and linguistically varied natural language questions. The code for the dataset's generation pipeline is available on GitHub.
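To make the task concrete, the sketch below shows what a single Text-to-SQL example could look like: a schema, a natural language question, and a SQL query that answers it. The record layout, field names (`schema`, `question`, `query`), tables, and rows are all hypothetical and chosen for illustration only; they are not taken from the dataset. The helper `run_example` is likewise an assumption, and simply checks that the query executes against the schema in an in-memory SQLite database.

```python
import sqlite3

# Hypothetical record: field names and contents are illustrative assumptions,
# not the dataset's actual format or data.
example = {
    "schema": """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total REAL
        );
    """,
    "question": "What is the total order value per country?",
    "query": """
        SELECT c.country, SUM(o.total) AS total_value
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.country
        ORDER BY c.country;
    """,
}


def run_example(record, seed_sql=""):
    """Create the schema in memory, optionally load rows, and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(record["schema"])
        if seed_sql:
            conn.executescript(seed_sql)
        return conn.execute(record["query"]).fetchall()
    finally:
        conn.close()


# Made-up rows so the query returns something to inspect.
seed = """
INSERT INTO customers VALUES (1, 'Ada', 'NL'), (2, 'Bo', 'DE'), (3, 'Cy', 'NL');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 5.5), (3, 3, 4.5);
"""

rows = run_example(example, seed)
print(rows)
```

Executing the query against its schema in this way is one simple sanity check for a question and SQL pair; it confirms the SQL is valid for the schema, not that it correctly answers the question.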

Democratizing Insight Retrieval from (Semi-)Structured Data
opensource.org/license/MIT
Database Architectures

Wolff, C., Gomm, D., & Hulsebos, M. (2025). SQaLe: A large-scale semi-synthetic dataset.