
HMM-based Lip Reading with Stingy Residual 3D Convolution

Bibliographic Details
Main Authors: Zeng, Qifeng; Du, Jun; Wang, Zirui
Format: Conference Proceeding
Language: English
Description
Summary: In this paper, we propose a novel approach to sentence-level lip-reading within a hidden Markov model (HMM) framework. To compute the posterior probabilities of HMM states, we design an architecture consisting of a convolutional-neural-network-based visual module followed by multi-headed self-attention Transformer layers. Recently, 3D convolution has become popular in the visual modules of lip-reading systems because it extracts temporal features and achieves higher accuracy than 2D convolution, at the cost of additional computation. This motivates us to design a plug-and-play compact 3D convolution unit called "Stingy Residual 3D" (StiRes3D). We use heterogeneous convolution kernels for different input channels and apply channel-wise and point-wise convolutions to keep the block compact. Evaluated on the Lip Reading Sentences 2 (LRS2-BBC) dataset, we first demonstrate that our HMM-based approach outperforms a connectionist temporal classification (CTC) based approach with the same visual module and Transformer architecture, yielding a word error rate reduction of 1.9%. We then show empirically that the proposed approach with a StiRes3D-based visual module achieves clear improvements in both recognition accuracy and model efficiency over the Pseudo 3D network, a compact 3D convolution design. Our approach also outperforms the current state-of-the-art approach with a word error rate reduction of 1.5%.
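
The abstract's description of StiRes3D (heterogeneous convolution kernels across input-channel groups, plus channel-wise and point-wise convolutions inside a residual unit) can be illustrated with a minimal PyTorch sketch. The code below is an assumption-level illustration only, not the paper's actual specification; the class name, channel split, and kernel sizes are hypothetical.

import torch
import torch.nn as nn

class CompactResidual3DBlock(nn.Module):
    """Sketch of a compact residual 3D block: heterogeneous kernels for two
    channel groups, channel-wise (depthwise) 3D convolutions, and a
    point-wise 1x1x1 convolution, wrapped in a residual connection."""
    def __init__(self, channels: int, split: float = 0.5):
        super().__init__()
        # Heterogeneous kernels: one channel group keeps a full 3x3x3
        # spatio-temporal kernel, the other a cheaper spatial-only 1x3x3 kernel.
        self.c3 = int(channels * split)
        self.c1 = channels - self.c3
        self.dw3 = nn.Conv3d(self.c3, self.c3, kernel_size=3, padding=1,
                             groups=self.c3, bias=False)   # channel-wise
        self.dw1 = nn.Conv3d(self.c1, self.c1, kernel_size=(1, 3, 3),
                             padding=(0, 1, 1), groups=self.c1, bias=False)
        self.pw = nn.Conv3d(channels, channels, kernel_size=1, bias=False)  # point-wise
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        a, b = torch.split(x, [self.c3, self.c1], dim=1)
        y = torch.cat([self.dw3(a), self.dw1(b)], dim=1)
        y = self.bn(self.pw(y))
        return self.act(x + y)  # residual connection keeps the block plug-and-play

# Example: two 25-frame mouth-region clips with 32 feature channels.
block = CompactResidual3DBlock(channels=32)
out = block(torch.randn(2, 32, 25, 88, 88))
print(out.shape)  # torch.Size([2, 32, 25, 88, 88])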
ISSN: 2640-0103