As data science is increasingly deployed in operational applications, it becomes important for data science frameworks to be able to perform computations in interactive, sub-second time. Indexing and caching are two key techniques that can make interactive query processing on large datasets possible. In this demo, we show the design, implementation, and performance of a new indexing abstraction in Apache Spark, called the Indexed DataFrame. This is a cached DataFrame that incorporates an index to support fast lookup and join operations, and that supports updates with multi-version concurrency. We demonstrate the Indexed DataFrame on a social network dataset that grows continuously, using microbenchmarks and real-world graph processing queries.

doi.org/10.1145/3299869.3320227
ACM SIGMOD International Conference on Management of Data
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

Uta, A., Ghit, B., Dave, A., & Boncz, P. (2019). [Demo] Low-latency Spark queries on updatable data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD) (pp. 2009–2012). doi:10.1145/3299869.3320227