We formulate a conceptual model for white-box compression, which represents the logical columns of tabular data as openly defined functions over the physical columns that are actually stored. Each block of data is thus accompanied by a header that describes this functional mapping. Because the compression functions are openly defined, database systems can exploit them during query optimization and execution, enabling, for example, better filter-predicate pushdown. In addition, we show that white-box compression identifies a broad variety of new compression opportunities, leading to much better compression factors. These opportunities are found by an automatic learning process that derives the functions from the data, and we provide a recursive, pattern-driven algorithm for this learning. Finally, we demonstrate the effectiveness of white-box compression on a new benchmark we contribute here: the Public BI benchmark, a rich set of real-world datasets.
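To give a rough feel for the conceptual model, the sketch below expresses one logical string column as an explicit function over two physical columns (a dictionary-coded prefix and an integer part). This is a minimal illustration only; the column names, data, and the particular decomposition are invented here and do not reflect the paper's actual format or learning algorithm.

```python
# Hypothetical sketch: a logical string column such as "US-1234" stored as two
# physical columns plus an openly defined reconstruction function. Because the
# function is visible to the engine, a predicate like id LIKE 'US-%' could be
# pushed down onto the dictionary-coded prefix column instead of decompressing rows.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhysicalColumn:
    name: str
    values: List  # the data actually stored (small codes, integers, ...)


@dataclass
class LogicalColumn:
    name: str
    inputs: List[PhysicalColumn]        # physical columns the function reads
    decompress: Callable[..., str]      # openly defined mapping, described in the block header


# Physical representation: dictionary codes for the prefix, raw integers for the suffix.
prefix_dict = ["US", "DE", "NL"]
prefix_codes = PhysicalColumn("id_prefix", [0, 2, 0, 1])
numbers = PhysicalColumn("id_number", [1234, 17, 4096, 250])

order_id = LogicalColumn(
    name="order_id",
    inputs=[prefix_codes, numbers],
    decompress=lambda code, num: f"{prefix_dict[code]}-{num}",
)

# Reconstruct the logical values row by row.
logical_values = [
    order_id.decompress(c, n)
    for c, n in zip(prefix_codes.values, numbers.values)
]
print(logical_values)  # ['US-1234', 'NL-17', 'US-4096', 'DE-250']
```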

We believe our basic prototype of white-box compression opens the way for future research into transparent compressed data representations on the one hand, and database system architectures that can efficiently exploit them on the other. It should be seen as another step toward data management systems that are self-learning and optimize themselves for the data on which they are deployed.


Ghita, B., Gomes Tomé, D., & Boncz, P. (2020). White-box compression: Learning and exploiting compact table representations. In Proceedings of the Conference on Innovative Data Systems Research.