
JSV-VC: Jointly Trained Speaker Verification and Voice Conversion Models

Bibliographic Details
Main Authors: Seki, Shogo, Kameoka, Hirokazu, Tanaka, Kou, Kaneko, Takuhiro
Format: Conference Proceeding
Language: English
Description
Summary: This paper proposes a variational autoencoder (VAE)-based method for voice conversion (VC) on arbitrary source-target speaker pairs without parallel corpora, i.e., non-parallel any-to-any VC. One typical approach is to use speaker embeddings obtained from a speaker verification (SV) model as the condition for a VC model. However, when VC and SV models are naively combined, the converted speech is not guaranteed to reflect the target speaker's characteristics. Moreover, speaker embeddings are not designed for VC problems, leading to suboptimal conversion performance. To address these issues, the proposed method, JSV-VC, trains the VC and SV models jointly. The VC model is trained so that converted speech is verified as the target speaker by the SV model, while the SV model is trained to output consistent embeddings for speech before and after conversion. Experimental evaluations reveal that JSV-VC outperforms conventional any-to-any VC methods both quantitatively and qualitatively.
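The joint training scheme summarized above couples two objectives: the VC model is penalized when the SV model does not verify its converted speech as the target speaker, and the SV model is penalized when its embeddings are inconsistent before and after conversion. The following PyTorch sketch illustrates that idea only; the module architectures, the cosine-similarity verification loss, the MSE consistency loss, and the loss weights are all illustrative assumptions, not the paper's actual models or objectives.

```python
# Minimal, hypothetical sketch of joint VC+SV training (not the paper's method).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Toy SV model: maps a mel-spectrogram to a unit-norm speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.net = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, h = self.net(mel)
        return F.normalize(h[-1], dim=-1)


class VoiceConverter(nn.Module):
    """Toy VC model: re-synthesizes the input mel conditioned on a speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel, spk_emb):             # mel: (B, T, n_mels), spk_emb: (B, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, cond], dim=-1))


def joint_loss(vc, sv, src_mel, tgt_mel, lambda_sv=1.0, lambda_cons=1.0):
    """Losses for one joint-training step (weights are illustrative)."""
    tgt_emb = sv(tgt_mel)                        # embedding of real target speech
    converted = vc(src_mel, tgt_emb)             # source speech converted toward the target
    conv_emb = sv(converted)

    # VC objective: converted speech should be verified as the target speaker.
    loss_vc = 1.0 - F.cosine_similarity(conv_emb, tgt_emb).mean()

    # SV objective: embeddings should stay consistent before and after conversion
    # (here, real target speech vs. speech converted toward that target).
    loss_cons = F.mse_loss(conv_emb, tgt_emb.detach())

    return lambda_sv * loss_vc + lambda_cons * loss_cons


if __name__ == "__main__":
    vc, sv = VoiceConverter(), SpeakerEncoder()
    src = torch.randn(4, 100, 80)                # dummy source mel-spectrograms
    tgt = torch.randn(4, 100, 80)                # dummy target-speaker mel-spectrograms
    loss = joint_loss(vc, sv, src, tgt)
    loss.backward()
    print(float(loss))
```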
ISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10096901