SPA: Economical and workload-driven indexing for data analytics in the cloud

Boncz, Peter; Chronis, Yannis; Finis, Jan; Halfpap, Stefan; Leis, Viktor; Neumann, Thomas; Nica, Anisoara; Sauer, Caetano; Stolze, Knut; Zukowski, Marcin

doi:10.1109/ICDE55515.2023.00302

P.A. Boncz (Peter), Y. Chronis (Yannis), J. Finis (Jan), S. Halfpap (Stefan), V. Leis (Viktor), T. Neumann (Thomas), A. Nica (Anisoara), C. Sauer (Caetano), K. Stolze (Knut) and M. Zukowski (Marcin)

2023-07-26

Presented at the 39th IEEE International Conference on Data Engineering, ICDE 2023 (April 2023), Anaheim, CA, USA

Selective queries are not uncommon in large-scale data analytics, for example, when drilling down into a specific customer in a dashboard. Traditionally, selective queries are accelerated by creating secondary indexes. However, because of their large size, expensive maintenance, and difficulty to tune and automate, indexes are typically not used in modern cloud data warehouses or data lakes. Instead, such systems rely mostly on full table scans and lightweight optimizations like min/max filtering, whose effectiveness depends heavily on the data layout and value distributions.

We propose SPA as the vision for automatically optimizing selective queries for immutable copy-on-write data formats. SPA adaptively indexes subsets of the data in an incremental and workload-driven manner. It makes fine-grained decisions and continuously monitors their benefit, dynamically allocating an optimization budget in a way that bounds the additional cost of indexing. Furthermore, it guarantees a performance improvement in the cases where indexes (potentially partial ones) prove to be beneficial. When indexes lose their benefit due to a shifting workload, they are gradually deconstructed in favor of optimizations that accommodate recent trends. As SPA does not require information about updates performed on the data, it can also be employed as an accelerator for systems that do not control the data, e.g., in cloud data lakes.
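
The abstract's remark that min/max filtering depends heavily on data layout can be illustrated with a small sketch. This is not code from the paper and does not implement SPA; the function names `build_blocks` and `scan_equal`, the block size, and the two sample layouts are assumptions chosen purely for illustration.

```python
def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks


def scan_equal(blocks, key):
    """Evaluate `value == key`, skipping blocks whose [min, max] range excludes key.

    Returns the matching values and the number of blocks that had to be read.
    """
    matches, blocks_read = [], 0
    for lo, hi, chunk in blocks:
        if key < lo or key > hi:
            continue  # min/max filter: this block cannot contain the key
        blocks_read += 1
        matches.extend(v for v in chunk if v == key)
    return matches, blocks_read


# Hypothetical layouts of the same 1000 distinct ids.
sorted_ids = list(range(1000))                          # clustered by value
scattered_ids = [(i * 37) % 1000 for i in range(1000)]  # values spread across blocks

sorted_result = scan_equal(build_blocks(sorted_ids, 100), 500)
scattered_result = scan_equal(build_blocks(scattered_ids, 100), 500)
```

With the sorted layout, only one of the ten blocks is read; with the scattered layout, every block's min/max range covers the key, so all ten are read even though a single row matches. This is the kind of case in which a lightweight filter offers no pruning and an index on the affected data could help.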

Additional Metadata
Keywords	Drilling, Data analysis, Filtering, Layout, Maintenance engineering, Big Data applications, Market research
Persistent URL	doi.org/10.1109/ICDE55515.2023.00302
Conference	39th IEEE International Conference on Data Engineering, ICDE 2023
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Boncz, P., Chronis, Y., Finis, J., Halfpap, S., Leis, V., Neumann, T., … Zukowski, M. (2023). SPA: Economical and workload-driven indexing for data analytics in the cloud. In Proceedings of the IEEE International Conference on Data Engineering (ICDE) (pp. 3740–3746). doi:10.1109/ICDE55515.2023.00302

Full Text (Author Manuscript, 243kb)
