
Speech emotion recognition based on the reconstruction of acoustic and text features in latent space


Bibliographic Details
Main Authors: Santoso, Jennifer, Sekiguchi, Rintaro, Yamada, Takeshi, Ishizuka, Kenkichi, Hashimoto, Taiichi, Makino, Shoji
Format: Conference Proceeding
Language: English
Description
Summary: Speech emotion recognition (SER) has been actively studied over the past decade and has achieved promising results. Most state-of-the-art SER methods take a classification approach that ultimately outputs softmax probabilities over the emotion classes. In contrast, we recently introduced an anomalous sound detection approach to improve SER performance on the neutral class. It uses a neutral speech detector consisting of an autoencoder that reconstructs acoustic and text features in latent space and is trained using only neutral speech data. The experimental results confirmed that the reconstruction error can be used as a cue to decide whether or not the class is neutral, and suggested that the approach could be applied to other emotion classes. In this paper, we propose an SER method based on the reconstruction of acoustic and text features in latent space, which uses a separate reconstructor for each emotion class, including the neutral class. The proposed method selects the emotion class with the lowest normalized reconstruction error as the SER result. Unlike the classifier approach, one reconstructor is dedicated to each emotion class and trained using only the data of that class. Each reconstructor can therefore be trained without being affected by imbalanced training data, and data augmentation can easily be applied to a specific emotion class alone. Our experimental results on the IEMOCAP dataset showed that the proposed method improved the class-average weighted accuracy by 1.7%, to 77.8%, compared with state-of-the-art SER methods.
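The selection rule described in the summary (pick the class whose reconstructor yields the lowest normalized reconstruction error) can be illustrated with a minimal sketch. The summary does not state how the error is computed or normalized, so everything here is an assumption made for illustration: the mean squared error in `reconstruction_error`, the z-score normalization against per-class error statistics, the function and variable names, and the toy numbers, which are not results from the paper.

```python
import numpy as np


def reconstruction_error(features, reconstructed):
    """Mean squared error between latent features and their reconstruction.

    Assumption: the summary does not specify the error measure; MSE is used
    here only as a common choice.
    """
    features = np.asarray(features, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((features - reconstructed) ** 2))


def select_emotion(raw_errors, error_mean, error_std):
    """Return the class with the lowest normalized reconstruction error.

    Assumption: each class's raw error is z-scored with the mean and standard
    deviation of that class's reconstructor errors (e.g. measured on its own
    training data). The actual normalization may differ.
    """
    normalized = {
        label: (raw_errors[label] - error_mean[label]) / error_std[label]
        for label in raw_errors
    }
    best = min(normalized, key=normalized.get)
    return best, normalized


# Hypothetical per-class errors for a single utterance (toy values).
raw_errors = {"neutral": 0.40, "happy": 0.55, "sad": 0.30, "angry": 0.70}
# Hypothetical per-class error statistics used for normalization (toy values).
error_mean = {"neutral": 0.35, "happy": 0.60, "sad": 0.20, "angry": 0.50}
error_std = {"neutral": 0.10, "happy": 0.10, "sad": 0.05, "angry": 0.20}

predicted, scores = select_emotion(raw_errors, error_mean, error_std)
print(predicted)
```

In this toy case the raw error is lowest for "sad", but after normalization "happy" is selected, which shows why raw errors from independently trained reconstructors may need to be put on a common scale before they are compared.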
ISSN: 2640-0103
DOI: 10.23919/APSIPAASC55919.2022.9980203