CERN's Large Hadron Collider (LHC), the world's largest high-energy physics (HEP) instrument, collects tens of petabytes of data per year. The LHC's next phase is expected to produce up to ten times more data, which calls for novel, more efficient ways of storing and processing these data.

HEP collider data are prepared and provided to physicists as read-only data sets, stored in a custom columnar data format. While traditionally all data needed for a particular analysis were captured in a single data set, the increasing scale of the LHC and the advent of modern analysis techniques now require analysis workflows to use data from different data sets. However, the processing model established across the HEP community does not yet provide a straightforward way to achieve this and currently relies heavily on data duplication to produce the desired data sets. This leads to significant overhead in analysis workflows, both in runtime and storage.

To reduce this overhead, we propose more efficient ways to combine HEP data sets. Specifically, we design union and join operations, as defined in relational algebra, to combine HEP data sets at runtime, thereby eliminating the need for data duplication. In this paper, we specify these operations for HEP data and introduce EVENTSETPROCESSOR, an engine that implements these operations for HEP data processing. Through a first prototype, we show that this engine integrates well into existing HEP workflows, and that it can perform up to twice as fast as the current approach.
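
As a rough illustration of the two relational operations named in the abstract, the sketch below combines small in-memory data sets at read time instead of writing a merged copy. It is not the EVENTSETPROCESSOR API: the function names `union` and `join`, the `event_id` key, and the toy columns `pt` and `tag_score` are all hypothetical, and rows are modelled as plain dictionaries rather than the columnar format used for HEP data.

```python
from itertools import chain


def union(*datasets):
    """Lazily yield the rows of several same-schema data sets, one data set after another."""
    return chain.from_iterable(datasets)


def join(left, right, key="event_id"):
    """Lazily yield merged rows for entries whose key occurs in both inputs (inner equi-join)."""
    # Build a lookup table on the right-hand input so each left row is matched in O(1).
    index = {row[key]: row for row in right}
    for row in left:
        match = index.get(row[key])
        if match is not None:
            yield {**row, **match}


# Hypothetical toy data: two data sets with the same columns, plus one with an extra column.
run_a = [{"event_id": 1, "pt": 41.5}, {"event_id": 2, "pt": 17.0}]
run_b = [{"event_id": 3, "pt": 63.2}]
extra = [{"event_id": 1, "tag_score": 0.91}, {"event_id": 3, "tag_score": 0.12}]

# Union extends the data set with more events; join extends each event with more columns.
combined = list(join(union(run_a, run_b), extra))
```

Because both functions are generators over their inputs, no intermediate merged data set is materialised until `list(...)` is called, which mirrors the idea of combining data sets at runtime rather than duplicating them on disk.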

doi.org/10.1109/eScience65000.2025.00020
2025 IEEE International Conference on eScience (eScience)
Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands

de Geus, F. W., Padulano, V. E., Blomer, J., Mühleisen, H., & Varbanescu, A. L. (2025). EVENTSETPROCESSOR: An engine for efficiently combining high-energy physics data. In Proceedings - 2025 IEEE International Conference on e-Science, eScience 2025 (pp. 93–101). doi:10.1109/eScience65000.2025.00020