In this paper, we investigate approaches for generating synthetic microdata from open-source aggregated data, focusing specifically on macro-to-micro data synthesis. We explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. The generated synthetic data is intended for educational and software-testing use cases. We propose three scenarios for producing realistic, high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The scenarios differ in how much is known about the underlying properties of the real microdata, i.e., the standard deviation and the covariate. Our evaluation includes matching tests to assess the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis achieves better privacy preservation than other methods, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.
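
To make the Gaussian copula idea concrete, the following is a minimal sketch of how microdata records might be drawn when only marginal summaries are available. It is not the paper's implementation. The function name `sample_gaussian_copula`, the two marginal distributions, their parameters, and the correlation value are all hypothetical and chosen only for illustration. In practice the marginals would be fitted to published aggregates, and the dependence structure would come from whatever knowledge the scenario allows.

```python
import numpy as np
from scipy import stats


def sample_gaussian_copula(marginals, corr, n, seed=0):
    """Draw n records whose columns follow `marginals` and whose
    dependence is induced by a Gaussian copula with correlation `corr`."""
    rng = np.random.default_rng(seed)
    corr = np.asarray(corr, dtype=float)
    # Step 1: correlated standard normal draws.
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n)
    # Step 2: map each column to Uniform(0, 1) with the normal CDF.
    u = stats.norm.cdf(z)
    # Step 3: push the uniforms through each marginal's inverse CDF.
    columns = [m.ppf(u[:, j]) for j, m in enumerate(marginals)]
    return np.column_stack(columns)


# Hypothetical marginals (assumed values, not taken from the paper):
# age ~ Normal(mean 41, sd 12), income ~ LogNormal(median 30000).
marginals = [
    stats.norm(loc=41.0, scale=12.0),
    stats.lognorm(s=0.5, scale=30000.0),
]
# Assumed latent correlation; an identity matrix would treat the
# variables as independent.
corr = [[1.0, 0.3], [0.3, 1.0]]

synthetic = sample_gaussian_copula(marginals, corr, n=5000)
```

Each row of `synthetic` is one synthetic record. The marginal distributions are preserved by construction, while the correlation matrix controls only the rank dependence between columns.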

doi.org/10.1007/978-3-031-69651-0_28
Lecture Notes in Computer Science, International Conference on Privacy in Statistical Databases
International Conference on Privacy in Statistical Databases - PSD 2024
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Aghaddar, M., Su, L. N., Slokom, M., Barnhoorn, L., & de Wolf, P.-P. (2024). A case study exploring data synthesis strategies on tabular vs. aggregated data sources for official statistics. In International Conference on Privacy in Statistical Databases (pp. 420–435). doi:10.1007/978-3-031-69651-0_28