
Deep Convolutional Neural Networks for Large-scale Speech Tasks

Bibliographic Details
Published in: Neural Networks, 2015-04, Vol. 64, p. 39–48
Main Authors: Sainath, Tara N., Kingsbury, Brian, Saon, George, Soltau, Hagen, Mohamed, Abdel-rahman, Dahl, George, Ramabhadran, Bhuvana
Format: Article
Language: English
Description
Summary: Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot be modeled directly by CNNs because they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%–14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.
ISSN: 0893-6080 (print), 1879-2782 (online)
DOI: 10.1016/j.neunet.2014.08.005
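
To make the architecture described in the summary concrete, below is a minimal PyTorch sketch of a CNN acoustic model in this spirit: convolutional layers with pooling applied in frequency only, ReLU activations, and dropout on the fully connected layers. The feature dimensions, kernel sizes, layer widths, dropout rate, and output size are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a CNN acoustic model in the spirit of the paper: two
# convolutional layers with max pooling in frequency only, followed by
# fully connected ReLU layers with dropout. All sizes (40 log-mel bins,
# 11-frame context, kernel sizes, layer widths, number of HMM states)
# are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, num_states: int = 512):
        super().__init__()
        # Input: (batch, 1, 40 mel bins, 11 context frames)
        self.conv1 = nn.Conv2d(1, 128, kernel_size=(9, 9))    # -> (128, 32, 3)
        self.pool = nn.MaxPool2d(kernel_size=(3, 1))          # pool in frequency only -> (128, 10, 3)
        self.conv2 = nn.Conv2d(128, 256, kernel_size=(4, 3))  # -> (256, 7, 1)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(256 * 7 * 1, 1024)
        self.fc2 = nn.Linear(1024, num_states)  # HMM-state (senone) scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv1(x))
        x = self.pool(x)
        x = self.relu(self.conv2(x))
        x = x.flatten(1)
        x = self.drop(self.relu(self.fc1(x)))
        return self.fc2(x)  # logits; apply log-softmax/cross-entropy in training

feats = torch.randn(8, 1, 40, 11)  # a batch of log-mel feature patches
logits = SpeechCNN()(feats)        # shape (8, 512)
```

Pooling only along the frequency axis reflects the motivation stated in the summary: it gives the model some invariance to small spectral shifts (e.g., speaker-dependent formant positions) while preserving the temporal resolution needed for frame-level HMM-state targets.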