Instead of predicting a single emotion for an entire activity (e.g., video watching), fine-grained emotion recognition predicts emotions at a finer temporal resolution. Previous work on fine-grained emotion recognition requires segment-by-segment emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with collecting only one emotion label after the user has watched the stimulus (i.e., the post-stimuli emotion label). To recognize emotions at a finer granularity when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of being trained in a fully supervised manner, the instances are weakly supervised by the post-stimuli labels in the training stage. The V-A of each instance is estimated from its instance gain, which indicates the probability that the instance predicts the post-stimuli label. We tested EDMIL on three datasets collected in three different environments: desktop, mobile, and HMD-based Virtual Reality. Recognition results validated with fine-grained V-A self-reports show that, for subject-independent low/neutral/high V-A classification, EDMIL outperforms state-of-the-art methods. Our experiments find that weakly supervised learning can reduce overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals.
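
The multiple-instance setup described above can be illustrated with a minimal sketch. It is not the authors' EDMIL implementation: the linear instance scorer, the mean pooling, the instance length, the random signal, and all function names are hypothetical choices made only to show how one post-stimuli label per recording (the bag) can relate to per-segment (instance) probabilities. In particular, treating an instance's softmax probability for the post-stimuli class as its "gain" is an assumption; the abstract does not give the exact definition.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_into_instances(signal, instance_len):
    """Cut a 1-D signal into non-overlapping instances (remainder dropped)."""
    n = len(signal) // instance_len
    return signal[: n * instance_len].reshape(n, instance_len)

def instance_probs(instances, W, b):
    """Hypothetical linear scorer + softmax over low/neutral/high classes."""
    return softmax(instances @ W + b, axis=1)

def bag_prediction(probs):
    """Mean-pool instance probabilities into one bag-level distribution.
    (Assumed pooling; a weak label would supervise only this output.)"""
    return probs.mean(axis=0)

def instance_gains(probs, bag_label):
    """Assumed 'gain': each instance's probability of the post-stimuli class."""
    return probs[:, bag_label]

# Synthetic stand-in for one physiological recording (one bag).
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
instances = split_into_instances(signal, instance_len=100)  # shape (10, 100)

# Untrained, randomly initialised parameters, for illustration only.
W = rng.normal(scale=0.1, size=(100, 3))
b = np.zeros(3)

probs = instance_probs(instances, W, b)        # per-instance class probabilities
bag = bag_prediction(probs)                    # compared with the post-stimuli label
gains = instance_gains(probs, bag_label=2)     # e.g. post-stimuli label "high"
fine_grained = probs.argmax(axis=1)            # per-instance low/neutral/high estimate
```

In such a setup only `bag` would be compared against the single post-stimuli label during training, while `fine_grained` is read out per instance at inference time.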

doi.org/10.1109/TAFFC.2022.3158234
IEEE Transactions on Affective Computing
Distributed and Interactive Systems

Zhang, T, El Ali, A, Wang, C, Hanjalic, A, & César Garcia, P.S. (2022). Weakly-supervised learning for fine-grained emotion recognition using physiological signals. IEEE Transactions on Affective Computing. doi:10.1109/TAFFC.2022.3158234