Interactive data analysis is often conveniently done on personal computers that have limited memory. Current analytical data management systems rely almost exclusively on main memory for computation. When the data size exceeds the memory limit, many systems cannot complete queries or resort to an external execution strategy that assumes a high I/O cost. These strategies are often much slower than the in-memory strategy. However, I/O cost has gone down: Most modern laptops have fast NVMe storage. We believe that the difference between in-memory and external does not have to be this big. We implement a parallel external sorting operator in DuckDB that demonstrates this. Experimental results with our implementation show that even when the data size far exceeds the memory size, the performance loss is negligible. From this result, we conclude that it is possible to have a graceful degradation from in-memory to external sorting.

, , , ,
CEUR Workshop Proceedings
2021/2022 British International Conference on Databases, BICOD 2021/2022
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Kuiper, L., Raasveldt, M., & Mühleisen, H. (2022). Efficient external sorting in DuckDB. In Proceedings of BICOD 2021/2022 (pp. 40–45).