Semi-supervised learning can be applied to datasets that contain both labeled and unlabeled instances, and it can yield more accurate predictions than fully supervised or unsupervised learning when only limited labeled data are available. A subclass of problems, called Positive-Unlabeled (PU) learning, focuses on cases in which the labeled instances contain only positive examples. Because no negatively labeled data are available, estimating general performance is difficult. In this paper, we propose a new approach to approximate the F1 score for PU learning. It requires an estimate of the fraction of all positive instances that is present in the labeled set. We derive theoretical properties of the approach and apply it to several datasets to study its empirical behavior and to compare it to the most well-known score in the field, the LL score. Results show that even when this estimate deviates considerably from the true fraction of positive labels, the approximation of the F1 score is significantly better than that obtained with the LL score.

Elsevier B.V.
doi.org/10.1007/978-3-030-64583-0_15
Lecture Notes in Computer Science
Stochastics

Tabatabaei, S. A., Klein, J., & Hoogendoorn, M. (2021). Estimating the F1 score for learning from positive and unlabeled examples. In LOD 2020: Machine Learning, Optimization, and Data Science (pp. 1–12). doi:10.1007/978-3-030-64583-0_15