In-Memory Indexed Caching for Distributed Data Processing

Uta, Alexandru; Ghit, Bogdan; Dave, Ankur; Rellermeyer, Jan; Boncz, Peter

2021-12-12

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
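To make the ideas in the abstract concrete, the following is a minimal single-process sketch in plain Python of an append-only cache with a hash index on one column. It is not the Indexed DataFrame library and does not use Spark; the class `ToyIndexedCache`, its methods, and the choice of "row count at append time" as the version number are all hypothetical, chosen only to illustrate indexed lookup, index-based join, and versioned appends under simplified assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[Any, ...]


class ToyIndexedCache:
    """Hypothetical append-only row store with a hash index on one column."""

    def __init__(self, key_col: int) -> None:
        self.key_col = key_col
        self.rows: List[Row] = []  # append-only list of cached rows
        # Maps a key value to the positions of all rows carrying that key.
        self.index: Dict[Any, List[int]] = defaultdict(list)

    def append(self, new_rows: List[Row]) -> int:
        """Append rows; return the new version (row count after the append)."""
        for row in new_rows:
            self.index[row[self.key_col]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key: Any, version: Optional[int] = None) -> List[Row]:
        """Return rows with the given key that are visible at `version`.

        Rows are never updated in place, so a reader holding an older
        version simply ignores positions at or beyond that version.
        """
        limit = len(self.rows) if version is None else version
        return [self.rows[i] for i in self.index.get(key, []) if i < limit]

    def join(self, probe_rows: List[Row], probe_key_col: int,
             version: Optional[int] = None) -> List[Row]:
        """Probe the index once per incoming row instead of scanning the cache."""
        return [p + m
                for p in probe_rows
                for m in self.lookup(p[probe_key_col], version)]


cache = ToyIndexedCache(key_col=0)
v1 = cache.append([(1, "a"), (2, "b")])   # first snapshot
v2 = cache.append([(1, "c")])             # later append, visible from v2 on

old_view = cache.lookup(1, version=v1)    # sees only the row from the first append
new_view = cache.lookup(1)                # sees rows from both appends
joined = cache.join([(1, "x"), (3, "y")], probe_key_col=0, version=v1)
```

In this sketch a reader that keeps `v1` continues to see a stable snapshot while later appends proceed, which is a simplified stand-in for multi-version concurrency control; a real distributed implementation would additionally need partitioning, thread safety, and memory management, none of which are modeled here.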

Additional Metadata
Stakeholder	Databricks
Organisation	Database Architectures
Citation	Uta, A., Ghit, B., Dave, A., Rellermeyer, J., & Boncz, P. (2021). In-Memory Indexed Caching for Distributed Data Processing.

See Also
inProceedings In-memory indexed caching for distributed data processing A. Uta (Alexandru), B. Ghit (Bogdan), A. Dave (Ankur), J. Rellermeyer (Jan) and P.A. Boncz (Peter)
