2025-09-27
Rethinking dataset discovery with DataScout
Publication
Dataset search, the process of finding appropriate datasets for a given task, remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding data characteristics (e.g., granularity, attributes, size), semantics (e.g., data semantics, creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive: users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria, which makes query iteration challenging. To bridge these gaps, we introduce DataScout, which proactively steers users through the process of dataset discovery via (i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators, generated dynamically based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search reveals that users uniquely employ DataScout’s features not only for structured explorations, but also to glean feedback on their search queries and build conceptual models of the search space.
| Additional Metadata | |
|---|---|
| doi.org/10.1145/3746059.3747727 | |
| Democratizing Insight Retrieval from (Semi-)Structured Data | |
| 38th Annual ACM Symposium on User Interface Software and Technology | |
| creativecommons.org/licenses/by/4.0/ | |
| Organisation | Database Architectures |

Lin, R., Chopra, B., Lin, W., Shankar, S., Hulsebos, M., & Parameswaran, A. G. (2025). Rethinking dataset discovery with DataScout. In Proceedings of the Annual ACM Symposium on User Interface Software and Technology (pp. 179:1–179:16). doi:10.1145/3746059.3747727