Differentially Private string sanitization for frequency-based mining tasks

Chen, Huiping; Dong, Changyu; Fan, Liyue; Loukides, Grigorios; Pissis, Solon; Stougie, Leen

doi:10.1109/ICDM51629.2021.00014

H. Chen (Huiping), C. Dong (Changyu), L. Fan (Liyue), G. Loukides (Grigorios), S. Pissis (Solon) and L. Stougie (Leen)

2021-12-07

Presented at the 21st IEEE International Conference on Data Mining, ICDM 2021 (December 2021), Virtual, Online

Strings are used to model genomic, natural language, and web activity data, and are thus often shared broadly. However, sharing string data raises privacy concerns for two reasons: knowledge of the length-k substrings of a string and their frequencies (multiplicities) may be sufficient to uniquely reconstruct the string, and the inference of such substrings may leak confidential information. We thus introduce the problem of protecting the length-k substrings of a single string S by applying Differential Privacy (DP) while maximizing data utility for frequency-based mining tasks. Our theoretical and empirical evidence suggests that classic DP mechanisms are not suitable for addressing this problem. In response, we employ the order-k de Bruijn graph G of S and propose a sampling-based mechanism for enforcing DP on G. We consider the task of enforcing DP on G using our mechanism while preserving the normalized edge multiplicities in G. We define an optimization problem on integer edge weights that is central to this task and develop an algorithm based on dynamic programming to solve it exactly. We also consider two variants of this problem with real edge weights. By relaxing the constraint of integer edge weights, we are able to develop linear-time exact algorithms for these variants, which we use as stepping stones towards effective heuristics. An extensive experimental evaluation using real-world large-scale strings (on the order of billions of letters) shows that our heuristics are efficient and produce near-optimal solutions that preserve data utility for frequency-based mining tasks.
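
To make the central object of the abstract concrete, the following minimal Python sketch builds a de Bruijn graph of a string together with its edge multiplicities. It is an illustration only, not the authors' implementation, and it rests on two assumptions that the abstract does not spell out: that nodes are the length-(k-1) substrings of S and each occurrence of a length-k substring contributes one edge (a common convention), and that "normalized" means dividing each edge multiplicity by the total number of length-k substring occurrences. The function names are hypothetical. The sketch applies no privacy mechanism; it only shows the data that such a mechanism would need to protect.

```python
from collections import Counter


def de_bruijn_edges(s: str, k: int) -> Counter:
    """Return edge multiplicities of a de Bruijn graph of s.

    Assumed convention (not confirmed by the abstract): each length-k
    substring x of s yields one edge from its length-(k-1) prefix to its
    length-(k-1) suffix, so an edge's multiplicity equals the frequency
    of the corresponding length-k substring.
    """
    if k < 2 or k > len(s):
        raise ValueError("k must satisfy 2 <= k <= len(s)")
    edges = Counter()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def normalized_multiplicities(edges: Counter) -> dict:
    """Divide each multiplicity by the total edge count (assumed normalization)."""
    total = sum(edges.values())
    return {edge: count / total for edge, count in edges.items()}


# Toy input: the length-3 substrings of "abababc" are aba, bab, aba, bab, abc.
edges = de_bruijn_edges("abababc", 3)
norm = normalized_multiplicities(edges)
for (u, v), count in sorted(edges.items()):
    print(f"{u} -> {v}: multiplicity {count}, normalized {norm[(u, v)]:.2f}")
```

On this toy string the edge from "ab" to "ba" and the edge from "ba" to "ab" each have multiplicity 2, and the edge from "ab" to "bc" has multiplicity 1. These multiplicities are exactly the length-3 substring frequencies that a frequency-based mining task would consume.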

Additional Metadata
Keywords	Differential privacy, String algorithms, Data sanitization, Frequent pattern mining
Persistent URL	doi.org/10.1109/ICDM51629.2021.00014
Conference	21st IEEE International Conference on Data Mining, ICDM 2021
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Chen, H., Dong, C., Fan, L., Loukides, G., Pissis, S., & Stougie, L. (2021). Differentially Private string sanitization for frequency-based mining tasks. In Proceedings of the 21st IEEE International Conference on Data Mining, ICDM 2021 (pp. 41–50). doi:10.1109/ICDM51629.2021.00014
