SPTTS: Parallel Speech Synthesis without Extra Aligner Model

Bibliographic Details
Main Authors: Zhao, Zeqing, Chen, Xi, Liu, Hui, Wang, Xuyang, Yang, Lin, Wang, Junjie
Format: Conference Proceeding
Language: English
Online Access: Request full text
Description
Summary: In this work, we develop a novel non-autoregressive TTS model that predicts all mel-spectrogram frames in parallel. Unlike previous non-autoregressive TTS methods, which typically require an external aligner implemented by an attention-based autoregressive model, our model can be optimized jointly without sophisticated external aligners. Motivated by CTC-based speech recognition, a simple and effective way to obtain frame-level forced alignment between speech and text, our main idea is to treat aligner learning in TTS as a CTC-based speech-recognition-like task. Specifically, our model learns an alignment generator with the CTC loss, which provides supervision for the duration predictor on the fly. In this way, we learn a one-stage TTS system by optimizing the aligner jointly with the feed-forward transformer. At inference time, the aligner is removed and the duration predictor produces the duration sequence used to synthesize speech. To demonstrate our method, we conduct extensive experiments on an open-source Chinese standard Mandarin speech dataset (https://www.data-baker.com/open_source.html). The results show that our method achieves competitive performance compared with counterpart models (e.g., FastSpeech, a well-known non-autoregressive model with an extra aligner) in terms of synthesized speech quality and robustness.
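The abstract describes the training recipe only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible reading: a small recurrent aligner trained with the CTC loss on mel frames, plus a greedy routine that turns its best path into per-token durations that could supervise a duration predictor on the fly. All names here (AlignerSketch, ctc_align_loss, extract_durations) and the greedy extraction heuristic are assumptions for illustration, not the authors' implementation; only standard PyTorch APIs (torch.nn.GRU, torch.nn.functional.ctc_loss) are used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignerSketch(nn.Module):
    # Hypothetical aligner: predicts per-frame token posteriors from mel frames
    # and is trained with the CTC loss, as the abstract describes.
    def __init__(self, n_mels: int, n_tokens: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_tokens + 1)  # +1 class for the CTC blank

    def forward(self, mels):                      # mels: (B, T, n_mels)
        h, _ = self.encoder(mels)
        return self.proj(h).log_softmax(dim=-1)   # (B, T, n_tokens + 1)

def ctc_align_loss(log_probs, tokens, mel_lens, token_lens):
    # torch.nn.functional.ctc_loss expects (T, B, C) log-probabilities.
    return F.ctc_loss(log_probs.transpose(0, 1), tokens, mel_lens, token_lens,
                      blank=log_probs.size(-1) - 1, zero_infinity=True)

def extract_durations(log_probs, tokens, mel_len):
    # Crude greedy stand-in for forced alignment on one utterance: walk the
    # frame-wise best path and advance to the next token when its label appears;
    # every frame (including blanks) is counted toward the current token.
    best = log_probs[:mel_len].argmax(dim=-1)            # (T,)
    durations = torch.zeros(len(tokens), dtype=torch.long)
    idx = 0
    for frame_label in best.tolist():
        if idx + 1 < len(tokens) and frame_label == tokens[idx + 1].item():
            idx += 1
        durations[idx] += 1
    return durations   # per-token frame counts to supervise a duration predictor

# Example shapes: 2 utterances, 120 mel frames, 80 mel bins, 40 phoneme tokens.
aligner = AlignerSketch(n_mels=80, n_tokens=40)
log_probs = aligner(torch.randn(2, 120, 80))     # (2, 120, 41)

Consistent with the abstract, an aligner like this would be used only during training to supply duration targets; at inference it is dropped and the duration predictor alone drives parallel synthesis.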
ISSN:2640-0103