DivExplorer Project

DivExplorer analyzes subgroup performance in datasets and efficiently identifies the data subgroups that behave anomalously.

Given a dataset, DivExplorer can find subgroups in which specified attributes have a higher or lower average value than in the overall dataset. For example, it can find the subgroups in a census dataset whose income is higher than average.
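To make the idea concrete, here is a minimal brute-force sketch in plain pandas. It is not the divexplorer package's API or algorithm: the function name `subgroup_divergence`, its parameters, and the toy census-like data are all hypothetical and chosen only for illustration. A subgroup is a conjunction of attribute=value conditions, and its divergence is taken here to be its average target value minus the overall average.

```python
from itertools import combinations

import pandas as pd


def subgroup_divergence(df, attributes, target, min_support=0.1, max_len=2):
    """Hypothetical helper: enumerate subgroups and compare their mean to the overall mean."""
    overall_mean = df[target].mean()
    rows = []
    for k in range(1, max_len + 1):
        for cols in combinations(attributes, k):
            for values, group in df.groupby(list(cols)):
                # Depending on the pandas version, a single-column key may be a scalar.
                if not isinstance(values, tuple):
                    values = (values,)
                support = len(group) / len(df)
                if support < min_support:
                    continue  # skip subgroups that are too small to be meaningful
                mean = group[target].mean()
                rows.append({
                    "subgroup": ", ".join(f"{c}={v}" for c, v in zip(cols, values)),
                    "support": support,
                    "mean": mean,
                    "divergence": mean - overall_mean,
                })
    result = pd.DataFrame(rows, columns=["subgroup", "support", "mean", "divergence"])
    return result.sort_values("divergence", ascending=False, ignore_index=True)


# Toy data, invented for this example (income in arbitrary units).
toy = pd.DataFrame({
    "education": ["college", "college", "highschool", "highschool", "college", "highschool"],
    "sex": ["F", "M", "F", "M", "M", "F"],
    "income": [60, 80, 30, 40, 90, 20],
})

result = subgroup_divergence(toy, ["education", "sex"], "income", min_support=0.3)
print(result)
```

On this toy data the subgroup with the highest positive divergence is education=college, sex=M. Enumerating every combination this way grows quickly with the number of attributes, so it is only suitable for small examples.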

In machine learning, DivExplorer enables the identification of data subgroups for which classifiers have higher false-positive or false-negative rates than average, and of subgroups that are ranked higher or lower than average.
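As a rough illustration of the classifier case, the sketch below computes the false-positive rate of each value of a single attribute and compares it with the overall rate. Again, this is plain pandas rather than the divexplorer package: the helper names, the column names `label` and `predicted`, and the toy data are assumptions made for the example. The false-positive rate is computed as the fraction of true negatives (label 0) that the classifier predicts as positive (1).

```python
import pandas as pd


def false_positive_rate(frame, y_true, y_pred):
    """Fraction of rows with a negative true label that are predicted positive."""
    negatives = frame[frame[y_true] == 0]
    if negatives.empty:
        return float("nan")  # undefined when the subgroup has no negatives
    return float((negatives[y_pred] == 1).mean())


def fpr_divergence(df, attribute, y_true="label", y_pred="predicted"):
    """Hypothetical helper: per-value FPR minus the overall FPR."""
    overall = false_positive_rate(df, y_true, y_pred)
    rows = []
    for value, group in df.groupby(attribute):
        rate = false_positive_rate(group, y_true, y_pred)
        rows.append({
            "subgroup": f"{attribute}={value}",
            "fpr": rate,
            "divergence": rate - overall,
        })
    table = pd.DataFrame(rows, columns=["subgroup", "fpr", "divergence"])
    return table.sort_values("divergence", ascending=False, ignore_index=True)


# Toy predictions, invented for this example.
predictions = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label": [0, 0, 1, 0, 0, 0, 0, 1],
    "predicted": [1, 1, 1, 0, 0, 1, 0, 1],
})

fpr_table = fpr_divergence(predictions, "group")
print(fpr_table)
```

Here group A has a false-positive rate above the overall rate and group B below it. The same pattern applies to false-negative rates by filtering on positive labels instead.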

Here you can find the papers and videos related to the project, as well as a Python package you can use to analyze your datasets.

Python Package

You can analyze your datasets using the divexplorer Python package, and you can look at its source code and documentation. Here is a notebook that demonstrates how to use the package to analyze the behavior of datasets and classifiers; it can be run on Google Colab.

In this repository you can find the source code and notebooks of a generalized version of DivExplorer that identifies and interprets subgroup behavior in data and models. It applies to dataset statistics, classification, regression, rankings, and scoring functions.

Videos

5-minute introduction

20-minute in-depth

Papers

Exploring Subgroup Performance in End-To-End Speech Models. A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, L. Cagliero, L. de Alfaro, E. Baralis, D. Amberti. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.

A Hierarchical Approach to Anomalous Subgroup Discovery. E. Pastor, E. Baralis, L. de Alfaro. In Proceedings of the 39th IEEE International Conference on Data Engineering (ICDE), 2023.

Looking for Trouble: Analyzing Classifier Behavior via Pattern Divergence. E. Pastor, L. de Alfaro, E. Baralis. In Proceedings of the 2021 ACM SIGMOD Conference, 2021.

How Divergent Is Your Data? E. Pastor, A. Gavgavian, E. Baralis, L. de Alfaro. In Proceedings of the 47th International Conference on Very Large Data Bases (VLDB), Demo Track, 2021.

Identifying Biased Subgroups in Ranking and Classification. E. Pastor, L. de Alfaro, E. Baralis. In Proceedings of the Responsible AI @ KDD 2021 Workshop, 2021.

Project Members

Elena Baralis, Luca de Alfaro, Eliana Pastor.