[Demo] Low-latency Spark queries on updatable data
As data science is increasingly deployed in operational applications, it becomes important for data science frameworks to be able to perform computations in interactive, sub-second time. Indexing and caching are two key techniques that can make interactive query processing on large datasets possible. In this demo, we show the design, implementation and performance of a new indexing abstraction in Apache Spark, called the Indexed DataFrame. This is a cached DataFrame that incorporates an index to support fast lookup and join operations, and supports updates with multi-version concurrency. We demonstrate the Indexed DataFrame on a social network dataset using microbenchmarks and real-world graph processing queries, on datasets that are continuously growing.
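To make the combination of index, append-only updates and versioned reads more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use any Spark API; the class `ToyIndexedTable`, its methods and the column names are hypothetical, chosen only for illustration. It assumes a single-process, in-memory table where a "version" is simply the number of rows that existed when a batch was appended.

```python
from collections import defaultdict


class ToyIndexedTable:
    """Hypothetical toy: append-only rows plus a hash index on one column.

    A version is the row count after an append, so a reader holding an
    older version never sees rows appended later.
    """

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = []                  # append-only row storage
        self.index = defaultdict(list)  # key -> positions in self.rows

    def append(self, new_rows):
        """Append rows, update the index, and return the new version."""
        for row in new_rows:
            self.index[row[self.key_column]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key, version=None):
        """Return rows with this key that are visible at `version`."""
        limit = len(self.rows) if version is None else version
        return [self.rows[p] for p in self.index.get(key, []) if p < limit]

    def join(self, probe_rows, probe_column, version=None):
        """Index-based join: one index probe per probe row, no full scan."""
        return [
            {**match, **probe}
            for probe in probe_rows
            for match in self.lookup(probe[probe_column], version)
        ]


# Illustrative data only: a tiny edge list that grows over time.
edges = ToyIndexedTable("src")
v1 = edges.append([{"src": 1, "dst": 2}, {"src": 1, "dst": 3}])
v2 = edges.append([{"src": 1, "dst": 4}, {"src": 2, "dst": 3}])

old_view = edges.lookup(1, version=v1)   # rows visible before the second append
new_view = edges.lookup(1)               # rows visible at the latest version
joined = edges.join([{"id": 1, "name": "a"}], "id")
```

In this sketch a lookup touches only the positions listed in the index for the requested key, and a reader that keeps `v1` continues to see a stable snapshot while the table grows. A distributed system would additionally need partitioning and concurrent access control, which the sketch leaves out.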
|Conference||ACM SIGMOD International Conference on Management of Data|
|Organisation||Centrum Wiskunde & Informatica, Amsterdam, The Netherlands|
Uta, A, Ghit, B, Dave, A, & Boncz, P.A. (2019). [Demo] Low-latency Spark queries on updatable data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 2009–2012). doi:10.1145/3299869.3320227