Robust subgroup discovery: Discovering subgroup lists using MDL

Proença, Hugo; Grünwald, Peter; H. W. Bäck, Thomas; van Leeuwen, Matthijs

doi:10.1007/s10618-022-00856-x

H.M. Proença (Hugo), P.D. Grünwald (Peter), T. H. W. Bäck (Thomas) and M. van Leeuwen (Matthijs)

2022-08-12

Data Mining and Knowledge Discovery, Volume 36, p. 1885-1970

We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
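
To make the notion of a subgroup list (an ordered set of subgroups) more concrete, the following is a minimal Python sketch of the ordering semantics only, for a single nominal target. It is not SSD++ and involves no MDL encoding. The names `Subgroup` and `fit_subgroup_list`, the example data, and the choice to pair the list with a default rule based on the marginal target distribution of the whole dataset are all assumptions of this sketch rather than details stated in the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Subgroup:
    """Hypothetical container: a readable description plus a coverage test."""
    name: str
    covers: Callable[[Row], bool]


def fit_subgroup_list(
    rows: List[Row], target: str, subgroups: List[Subgroup]
) -> Tuple[List[Counter], Counter]:
    """Count target values per subgroup, respecting the order of the list.

    A row is counted only for the FIRST subgroup that covers it, so later
    subgroups describe what the earlier ones have not already captured.
    The second return value is the marginal target count over all rows,
    used here as an assumed default for uncovered rows.
    """
    counts = [Counter() for _ in subgroups]
    marginal = Counter(row[target] for row in rows)
    for row in rows:
        for i, subgroup in enumerate(subgroups):
            if subgroup.covers(row):
                counts[i][row[target]] += 1
                break  # first match wins; skip the remaining subgroups
    return counts, marginal


# Made-up toy data, purely for illustration.
rows = [
    {"age": 25, "smoker": True, "y": "ill"},
    {"age": 30, "smoker": True, "y": "ill"},
    {"age": 70, "smoker": True, "y": "ill"},
    {"age": 65, "smoker": False, "y": "healthy"},
    {"age": 40, "smoker": False, "y": "healthy"},
]
subgroups = [
    Subgroup("age >= 60", lambda r: r["age"] >= 60),
    Subgroup("smoker", lambda r: bool(r["smoker"])),
]
counts, marginal = fit_subgroup_list(rows, "y", subgroups)
```

In this toy run the 70-year-old smoker is counted under "age >= 60" rather than "smoker", because the first matching subgroup takes the row. This first-match rule is one simple way to keep the subgroups in a list from redundantly describing the same rows.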

Additional Metadata
Keywords	Subgroup discovery, Subgroup list, The Minimum Description Length (MDL) principle, Interpretability
Persistent URL	doi.org/10.1007/s10618-022-00856-x
Journal	Data Mining and Knowledge Discovery
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Proença, H., Grünwald, P., Bäck, T., & van Leeuwen, M. (2022). Robust subgroup discovery: Discovering subgroup lists using MDL. Data Mining and Knowledge Discovery, 36, 1885–1970. doi:10.1007/s10618-022-00856-x
