Exploratory data analysis is the primary technique used by data scientists to extract knowledge from new data sets. This type of workload is composed of trial-and-error, hypothesis-driven queries with a human in the loop. To keep up with the data scientist's productivity, the system must be capable of answering queries in interactive times. Given that these queries are highly selective multidimensional queries, multidimensional indexes are necessary to ensure low latency. However, creating the appropriate indexes is not a given due to the highly exploratory and interactive nature of such human-in-the-loop scenarios.

In this paper, we identify four main objectives that are desirable for exploratory data analysis workloads: (1) low overhead over the initial queries, (2) low query variance (i.e., high robustness), (3) predictable index convergence, and (4) low total workload time. Given that not all of them can be achieved at the same time, we present three novel incremental multidimensional indexing techniques that represent three sample points on a Pareto front for this multi-objective optimization problem. (a) The Adaptive KD-Tree is designed to achieve the lowest total workload time at the expense of a higher indexing penalty for the initial queries, lack of robustness, and unpredictable convergence. (b) The Progressive KD-Tree has predictable convergence and a user-defined indexing cost for the initial queries. However, total workload time can be higher than with Adaptive KD-Trees, and per-query time still varies. (c) The Greedy Progressive KD-Tree aims at full robustness at the expense of only improving the per-query cost after full index convergence.

Our extensive experimental evaluation using both synthetic and real-life data sets and workloads shows that (a) the Adaptive KD-Tree reduces total workload time by up to a factor of 2 compared to the state-of-the-art, (b) the Progressive KD-Tree achieves predictable convergence with up to one order of magnitude lower initial query cost, and (c) the Greedy Progressive KD-Tree exhibits the lowest query variance, up to three orders of magnitude lower than the state-of-the-art.
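
To make the idea of query-driven incremental indexing concrete, the following is a minimal sketch in Python, not the paper's algorithm. It assumes a simplified scheme in which each range query splits the unindexed pieces it touches on one of its own bounds, so the index is refined only where queries look. The names `Node`, `range_query`, and `MIN_PIECE` are hypothetical and chosen for illustration; the abstract does not specify how the Adaptive KD-Tree picks pivots or when it stops refining.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

# Hypothetical threshold: pieces this small are scanned instead of split.
MIN_PIECE = 2


@dataclass
class Node:
    # A leaf holds an unindexed piece of the data in `points`.
    # An internal node has points=None and splits on `dim` at `pivot`.
    points: Optional[List[Point]] = None
    dim: int = 0
    pivot: float = 0.0
    left: Optional["Node"] = None   # points with p[dim] < pivot
    right: Optional["Node"] = None  # points with p[dim] >= pivot


def _try_split(node: Node, dim: int, pivot: float) -> bool:
    """Turn a leaf into an internal node if the pivot separates its points."""
    lo = [p for p in node.points if p[dim] < pivot]
    hi = [p for p in node.points if p[dim] >= pivot]
    if not lo or not hi:
        return False  # pivot does not divide this piece; keep the leaf
    node.dim, node.pivot = dim, pivot
    node.left, node.right = Node(points=lo), Node(points=hi)
    node.points = None
    return True


def range_query(node: Node, low: Point, high: Point) -> List[Point]:
    """Return points p with low[d] <= p[d] < high[d] in every dimension d.

    As a side effect, leaves touched by the query are split on the
    query's bounds, so later queries in the same region scan less data.
    """
    if node.points is None:
        result: List[Point] = []
        if low[node.dim] < node.pivot:
            result += range_query(node.left, low, high)
        if high[node.dim] > node.pivot:
            result += range_query(node.right, low, high)
        return result

    if len(node.points) > MIN_PIECE:
        for d in range(len(low)):
            for bound in (low[d], high[d]):
                if _try_split(node, d, bound):
                    # The node is now internal; continue through its children.
                    return range_query(node, low, high)

    # Small piece, or no query bound separates it: scan it directly.
    return [p for p in node.points
            if all(low[d] <= p[d] < high[d] for d in range(len(low)))]


data = [(1, 1), (2, 5), (3, 3), (6, 2), (7, 7), (8, 4), (5, 5), (4, 8)]
root = Node(points=list(data))
hits = range_query(root, (2, 2), (6, 6))
```

In this sketch the first query on a region pays for the partitioning it triggers, which mirrors the trade-off described for the adaptive approach: the indexing cost falls on whichever queries arrive first, and nothing bounds how much work a single query performs. A progressive variant would instead cap the indexing work done per query.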

doi.org/10.1109/ICDE51399.2021.00060
Cross-Industry Predictive Maintenance Optimization Platform
37th IEEE International Conference on Data Engineering
Database Architectures

Nerone, M., Holanda, P., de Almeida, E., & Manegold, S. (2021). Multidimensional adaptive & progressive indexes. In Proceedings of the IEEE International Conference on Data Engineering (ICDE) (pp. 624–635). doi:10.1109/ICDE51399.2021.00060