We describe the design and implementation of EdgeFrame: a graph-specialized Spark DataFrame that caches the edges of a graph in compressed form on all worker nodes of a cluster, and provides a fast and scalable Worst-Case-Optimal Join (WCOJ) that is especially useful for matching of complex and cyclical patterns in large graphs. Our choice to forego shuffle- or communication-based WCOJ is motivated by our analysis of the Shares algorithm for distributed WCOJ, that was proven communication-optimal, but which we show to quickly deteriorate to a full broadcast of all data already with moderately complex graph patterns. Our work shows that specializing WCOJ to a multi-way self-join, and leveraging compressed storage, provides a significant opportunity for better WCOJ performance. Finally, we investigate WCOJ parallelization and load-balancing strategies and show that fine-grained dynamic load-balancing with work-stealing is to be preferred, creating interesting insights and challenges for the future evolution of the Spark scheduler.

dx.doi.org/10.1145/3398682.3399162
3rd ACM SIGMOD Joint International Workshop on Graph Data Management Experiences and Systems and Network Data Analytics, GRADES-NDA 2020
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Fuchs, P, Boncz, P.A, & Ghit, B. (2020). Edgeframe: Worst-case optimal joins for graph-pattern matching in spark. In Proceedings of the 3rd ACM SIGMOD Joint International Workshop on Graph Data Management Experiences and Systems and Network Data Analytics, GRADES-NDA 2020. doi:10.1145/3398682.3399162