The rise of data sharing through private and public data portals necessitates more attention to detecting and protecting sensitive data before datasets are published. While research and practice have converged on the importance of documenting Personally Identifiable Information (PII), automatic, accurate, and scalable methods for detecting such data in (tabular) datasets lag behind. Moreover, we argue that sensitive data detection is more than PII type detection, and that methods should consider the more fine-grained context of the dataset and how its publication can be misused beyond the identification of individuals. To guide research in this direction, we present a novel framework for contextual sensitive data detection based on type contextualization and domain contextualization. For type contextualization, we introduce the detect-then-reflect mechanism, in which large language models (LLMs) first detect potentially sensitive column types in tables (e.g. PII types such as email address), and then assess their actual sensitivity based on the full table context. For domain contextualization, we propose the retrieve-then-detect mechanism, which contextualizes LLMs with external domain knowledge, such as data governance instruction documents, to identify sensitive data beyond PII. Experiments on synthetic and humanitarian datasets show that: 1) the detect-then-reflect mechanism significantly reduces the number of false positives for type-based sensitive data detection, whereas 2) the retrieve-then-detect mechanism is an effective stepping stone for domain-specific sensitive data detection, and retrieval-augmented LLM explanations already provide useful input for making manual data auditing processes more efficient.
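To make the two mechanisms concrete, the following is a minimal sketch of how their control flow might look. It is not the authors' implementation: the `llm` callable (prompt in, text out), the prompt wording, the `DETECT`/`REFLECT`/`DOMAIN` prefixes, and the keyword-overlap retrieval are all assumptions made for illustration. A real system would call an actual model and would likely use a stronger retriever.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to a text answer.
LLM = Callable[[str], str]
Table = Dict[str, List[str]]  # column name -> sample values


def table_to_text(table: Table) -> str:
    """Serialize the table so it can be placed in a prompt."""
    return "\n".join(f"{col}: {', '.join(vals)}" for col, vals in table.items())


def detect_then_reflect(table: Table, llm: LLM) -> Dict[str, str]:
    """Return {column: detected type} for columns judged sensitive in context."""
    context = table_to_text(table)
    flagged: Dict[str, str] = {}
    for col, vals in table.items():
        # Step 1 (detect): ask for a candidate sensitive type per column.
        col_type = llm(f"DETECT\ncolumn: {col}\nvalues: {', '.join(vals)}").strip()
        if col_type.lower() == "none":
            continue
        # Step 2 (reflect): reassess the candidate given the full table.
        verdict = llm(
            f"REFLECT\ncolumn: {col}\ntype: {col_type}\ntable:\n{context}"
        ).strip().lower()
        if verdict.startswith("sensitive"):
            flagged[col] = col_type
    return flagged


def retrieve_then_detect(
    table: Table, documents: List[str], llm: LLM, top_k: int = 1
) -> str:
    """Retrieve domain guidance, then ask the model to assess the table."""
    column_terms = {col.lower() for col in table}

    def score(doc: str) -> int:
        # Assumed retrieval: count column names that appear in the document.
        return len(column_terms & set(doc.lower().split()))

    retrieved = sorted(documents, key=score, reverse=True)[:top_k]
    prompt = (
        "DOMAIN\nguidance:\n"
        + "\n".join(retrieved)
        + "\ntable:\n"
        + table_to_text(table)
    )
    return llm(prompt)
```

The reflect step is where false positives can be dropped: a column detected as an email address may still be judged non-sensitive if the table context suggests it holds, for example, generic organizational contact addresses.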

Expert Meeting on Statistical Data Confidentiality (SDC2025), Conference of European Statisticians, United Nations Economic Commission for Europe
Database Architectures

Telkamp, L., Rabier, M., Teran, J., & Hulsebos, M. (2025). Detecting contextually sensitive data with AI. In United Nations Economic Commission for Europe, Conference of European Statisticians, Expert Meeting on Statistical Data Confidentiality, 2025.