As data science is increasingly deployed in operational applications, it becomes important for data science frameworks to perform computations in interactive, sub-second time. Indexing and caching are two key techniques that can make interactive query processing on large datasets possible. In this demo, we show the design, implementation and performance of a new indexing abstraction in Apache Spark, called the Indexed DataFrame. This is a cached DataFrame that incorporates an index to support fast lookup and join operations, and supports updates with multi-version concurrency. We demonstrate the Indexed DataFrame on a social network dataset using microbenchmarks and real-world graph processing queries, on datasets that are continuously growing.
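To make the idea concrete, the following is a minimal single-machine sketch in plain Python of a cached table with a hash index, point lookups, an index-based join, and append-only updates that readers can view at an earlier version. It is not the Indexed DataFrame implementation or its Spark API: the class `ToyIndexedTable`, its methods, and the scheme of using the row count as a version number are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Tuple[Any, ...]


class ToyIndexedTable:
    """Hypothetical toy: an in-memory, append-only table with a hash index."""

    def __init__(self, key_col: int) -> None:
        self.key_col = key_col
        self.rows: List[Row] = []                        # cached row data
        self.index: Dict[Any, List[int]] = defaultdict(list)  # key -> row positions

    def append(self, new_rows: Iterable[Row]) -> int:
        """Append rows and return the new version (assumed here: the row count)."""
        for row in new_rows:
            self.index[row[self.key_col]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key: Any, version: Optional[int] = None) -> List[Row]:
        """Return rows matching `key` that are visible at `version`."""
        limit = len(self.rows) if version is None else version
        return [self.rows[pos] for pos in self.index.get(key, []) if pos < limit]

    def join(self, probe_rows: Iterable[Row], probe_key_col: int,
             version: Optional[int] = None) -> List[Row]:
        """Index-based join: probe the index once per incoming row."""
        return [left + right
                for left in probe_rows
                for right in self.lookup(left[probe_key_col], version)]


# Usage: a small edge list (source, destination) indexed on the source vertex.
edges = ToyIndexedTable(key_col=0)
v1 = edges.append([(1, 2), (1, 3), (2, 3)])
v2 = edges.append([(1, 4)])                 # a later update

old_view = edges.lookup(1, version=v1)      # readers at v1 do not see the update
new_view = edges.lookup(1)                  # latest version includes it
joined = edges.join([(2, "bob")], probe_key_col=0)
```

Because rows are only appended, a reader holding an older version number keeps a consistent view while the table grows, which is the property the abstract describes for continuously growing datasets; a distributed implementation would also need partitioning and concurrent writers, which this sketch omits.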

Additional Metadata
Persistent URL dx.doi.org/10.1145/3299869.3320227
Conference ACM SIGMOD International Conference on Management of Data
Citation
Uta, A, Ghit, B, Dave, A, & Boncz, P.A. (2019). [Demo] Low-latency Spark queries on updatable data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 2009–2012). doi:10.1145/3299869.3320227