Hide and mine in strings: Hardness, algorithms, and experiments

Bernardini, Giulia; Conte, Alessio; Gourdel, Garance; Grossi, Roberto; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon; Punzi, Giulia; Stougie, Leen; Sweering, Michelle

doi:10.1109/TKDE.2022.3158063

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study of the fundamental relation between the two in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may introduce spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contributions are as follows. First, we present several hardness results for different variants of this problem, essentially showing that these variants cannot be solved, or even approximated, in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which run in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
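
As an illustration only (not the authors' method or code), the following minimal Python sketch shows how hiding a pattern can create a spurious frequent pattern. It assumes that a pattern is a substring of fixed length k, that it counts as frequent when it occurs at least tau times, and that a pattern is spurious when it is frequent after sanitization but was not before. The function names, the toy strings, and the values of k and tau are all hypothetical.

```python
from collections import Counter


def frequent_patterns(text, k, tau):
    """Return the set of length-k substrings occurring at least tau times."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return {pattern for pattern, count in counts.items() if count >= tau}


def spurious_patterns(original, sanitized, k, tau):
    """Patterns frequent in the sanitized string but not in the original.

    Assumption for this sketch: 'spurious' means newly frequent.
    """
    return frequent_patterns(sanitized, k, tau) - frequent_patterns(original, k, tau)


if __name__ == "__main__":
    original = "abcbab"  # suppose every pattern containing 'c' is confidential
    # Two ways of hiding the confidential patterns by replacing the letter 'c':
    print(spurious_patterns(original, "ababab", k=2, tau=3))
    print(spurious_patterns(original, "abdbab", k=2, tau=3))
```

In this toy instance, replacing the confidential letter with `a` makes `ab` newly frequent, whereas replacing it with `d` creates no new frequent pattern. Both replacements hide the confidential patterns, so the choice of replacement letter determines the harm to mining.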

Additional Metadata
Keywords	Data mining, Bioinformatics, Genomics, DNA, Data integrity, Privacy, Resists, Data privacy, Data sanitization, Knowledge hiding, Frequent pattern mining, String algorithms
Persistent URL	doi.org/10.1109/TKDE.2022.3158063
Journal	IEEE Transactions on Knowledge and Data Engineering
Project	Networks, Algorithms for PAngenome Computational Analysis, Pan-genome Graph Algorithms and Data Integration, Optimization for and with Machine Learning
Grant	This work was funded by the Netherlands Organisation for Scientific Research (NWO); grant id nwo/024.002.003 - Networks, This work was funded by the European Commission 7th Framework Programme; grant id h2020/956229 - Algorithms for PAngenome Computational Analysis (ALPACA), This work was funded by the European Commission 7th Framework Programme; grant id h2020/872539 - Pan-genome Graph Algorithms and Data Integration (PANGAIA), This work was funded by the Netherlands Organisation for Scientific Research (NWO); grant id nwo/OCENW.2019.015 - Optimization for and with Machine Learning
Citation	Bernardini, G., Conte, A., Gourdel, G., Grossi, R., Loukides, G., Pisanti, N., … Sweering, M. (2023). Hide and mine in strings: Hardness, algorithms, and experiments. IEEE Transactions on Knowledge and Data Engineering, 35(6), 5948–5963. doi:10.1109/TKDE.2022.3158063

See Also
software|data	Hide and mine solver. G. Bernardini (Giulia), A. Conte (Alessio), Gourdel (Garance), R. Grossi (Roberto), G. Loukidis, N. Pisanti (Nadia), S. Pissis (Solon), G. Punzi (Giulia), L. Stougie (Leen) and M.J.M. Sweering (Michelle)
