Analytical database systems offer high-performance in-memory aggregation. If there are many unique groups, temporary query intermediates may not fit RAM, requiring the use of external storage. However, switching from an in-memory to an external algorithm can degrade performance sharply. We revisit external hash aggregation on modern hardware, aiming instead for robust performance that avoids a 'performance cliff' when memory runs out. To achieve this, we introduce two techniques for handling temporary query intermediates. First, we propose unifying the memory management of temporary and persistent data. Second, we propose using a page layout that can be spilled to disk despite being optimized for main memory performance. These two techniques allow operator implementations to process larger-than-memory query intermediates with only minor modifications. We integrate these into DuckDB's parallel hash aggregation. Experimental results show that our implementation gracefully degrades performance as query intermediates exceed the available memory limit, while main memory performance is competitive with other analytical database systems.

, ,
doi.org/10.1109/ICDE60146.2024.00288
40th IEEE International Conference on Data Engineering, ICDE 2024
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Kuiper, L., Boncz, P., & Mühleisen, H. (2024). Robust external hash aggregation in the solid state age. In Proceedings of the International Conference on Data Engineering (pp. 3753–3766). doi:10.1109/ICDE60146.2024.00288