Composable pipeline for curating large-scale text-to-SQL corpora by extending database schemas, synthesising natural-language questions, and validating SQL programs with LLMs. The resulting dataset is available on Hugging Face Datasets as trl-lab/SQaLe-text-to-SQL-dataset.
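
To make the validation step concrete, here is a minimal sketch of one way a generated SQL program could be checked against a schema. It is not taken from the pipeline itself: the choice of SQLite, the helper name `validate_sql`, and the toy schema are all assumptions made for illustration. The query is only planned against an empty in-memory database, so syntax errors and references to missing tables or columns are caught without any data being needed.

```python
import sqlite3


def validate_sql(schema_ddl: str, query: str) -> tuple[bool, str]:
    """Hypothetical helper: report whether `query` compiles against `schema_ddl`.

    Returns (True, "") on success, or (False, error message) on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build the (empty) schema from its CREATE TABLE statements.
        conn.executescript(schema_ddl)
        # Ask SQLite to plan the query; this resolves tables and columns
        # without executing the query itself.
        conn.execute(f"EXPLAIN QUERY PLAN {query}")
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Toy schema, assumed for illustration only.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
"""

ok, err = validate_sql(
    SCHEMA,
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name",
)
bad, bad_err = validate_sql(SCHEMA, "SELECT nickname FROM customers")
```

A check of this kind only establishes that a query is well-formed for the schema; whether it actually answers the natural-language question is a separate matter, which the description above attributes to LLM-based validation.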

Democratizing Insight Retrieval from (Semi-)Structured Data
www.gnu.org/licenses/gpl-3.0.en.html
Database Architectures

Wolff, C., Gomm, D., & Hulsebos, M. (2025). SQaLe: A text-to-SQL dataset generation pipeline grounded in real schemas.