This paper presents a case study on synthesizing relational datasets within Official Statistics, with the specific aim of generating synthetic data for testing and validating software code. We conduct a comparative analysis of synthesis approaches applied to a multi-table relational database featuring a one-to-one relationship versus a single table. We use state-of-the-art single-table and multi-table synthesis methods and evaluate their potential to maintain the analytical validity of the data, ensure data utility, and mitigate disclosure risk. The evaluation of analytical validity assesses how well synthetic data replicates the structure and characteristics of real datasets. First, we compare synthesis methods on their ability to maintain the constraints and conditional dependencies found in real data. Second, we evaluate the utility of synthetic data by training linear regression models on both real and synthetic datasets. Lastly, we quantify privacy risk by conducting attribute inference attacks, which measure the disclosure risk of sensitive attributes. Our experimental results indicate that the single-table synthesis method outperforms the multi-table synthesis method in terms of analytical validity, utility, and privacy preservation. However, we find promise in multi-table data synthesis for protecting against attribute disclosure, although future work is needed to improve the utility of the resulting data.
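The utility evaluation described above (training linear regression models on real and on synthetic data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the helper names `fit_linear` and `r2_on`, and in particular the Gaussian sampler, which is only a stand-in and not one of the synthesis methods evaluated in the paper. Both models are scored on the same held-out real records, so the gap in R² gives a rough indication of how much utility is lost by training on synthetic data.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2_on(coef, X, y):
    """Coefficient of determination of a fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)

# Toy "real" data (hypothetical): two predictors and one numeric target.
n = 2000
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
real = np.column_stack([X, y])
train, test = real[:1500], real[1500:]

# Stand-in synthesizer (assumption): sample from a multivariate Gaussian
# fitted to the real training data. A real study would plug in the
# single-table or multi-table synthesizer under evaluation here.
synth = rng.multivariate_normal(
    train.mean(axis=0), np.cov(train, rowvar=False), size=len(train)
)

# Train one model on real data and one on synthetic data.
coef_real = fit_linear(train[:, :2], train[:, 2])
coef_synth = fit_linear(synth[:, :2], synth[:, 2])

# Score both models on the same held-out real records.
r2_real = r2_on(coef_real, test[:, :2], test[:, 2])
r2_synth = r2_on(coef_synth, test[:, :2], test[:, 2])
print(f"R2 trained on real: {r2_real:.3f}, trained on synthetic: {r2_synth:.3f}")
```

Comparing the fitted coefficients directly (`coef_real` against `coef_synth`) is a complementary check when the interest lies in whether statistical conclusions carry over, not only in predictive accuracy.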

doi.org/10.1007/978-3-031-69651-0_27
Lecture Notes in Computer Science, International Conference on Privacy in Statistical Databases
AI, Media & Democracy Lab
International Conference, PSD 2024
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Slokom, M., Agrawal, S., Krol, N. C., & de Wolf, P.-P. (2024). Relational or single: A comparative analysis of data synthesis approaches for privacy and utility on a use case from statistical office. In Proceedings of the International Conference on Privacy in Statistical Databases (pp. 403–419). doi:10.1007/978-3-031-69651-0_27