Action recognition based on element-level fine-grained multi-modal fusion

Bibliographic Details
Published in:	Journal of physics. Conference series 2021-09, Vol.2010 (1), p.12114
Main Authors:	Peng, Guozheng, Han, Lixin, Yang, Jiaxue
Format:	Article
Language:	eng

Description
Summary:	Traditional action recognition algorithms often rely only on RGB or optical flow features from video and make little use of the audio information in the video. Building on RGB and optical flow features, this paper adds audio processing and classifies videos using element-level fine-grained multi-modal fusion. In experimental comparison with simple modal splicing, the proposed multi-modal fusion algorithm improves accuracy by 7.38% on the HMDB51 dataset and 3.18% on the UCF101 dataset. The experiments also indicate that introducing the audio modality effectively improves the performance of the model.
ISSN:	1742-6588 1742-6596
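
The summary contrasts simple modal splicing with element-level fusion but does not spell out either operation. The sketch below is one possible reading, not the authors' method: it assumes splicing means concatenating the per-modality feature vectors, and it stands in for element-level fusion with a hypothetical per-dimension softmax weighting over the three modalities. The function names and the weighting scheme are illustrative assumptions.

```python
import math


def concat_fusion(rgb, flow, audio):
    """Assumed baseline ('modal splicing'): join the three vectors end to end."""
    return list(rgb) + list(flow) + list(audio)


def elementwise_fusion(rgb, flow, audio):
    """Hypothetical element-level fusion (not taken from the paper).

    Each feature dimension gets its own softmax weights over the three
    modalities, so the mix of RGB, flow and audio varies per element.
    """
    if not (len(rgb) == len(flow) == len(audio)):
        raise ValueError("all modalities must share one feature dimension")
    fused = []
    for r, f, a in zip(rgb, flow, audio):
        # Softmax over the three values at this position, shifted by the
        # maximum for numerical stability.
        peak = max(r, f, a)
        exps = [math.exp(v - peak) for v in (r, f, a)]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append(weights[0] * r + weights[1] * f + weights[2] * a)
    return fused


# Toy feature vectors (made-up values, two dimensions per modality).
rgb_feat = [1.0, 0.0]
flow_feat = [1.0, 2.0]
audio_feat = [1.0, -1.0]

spliced = concat_fusion(rgb_feat, flow_feat, audio_feat)
fused = elementwise_fusion(rgb_feat, flow_feat, audio_feat)
```

Under these assumptions the spliced vector grows with the number of modalities, while the fused vector keeps the original dimension and each of its elements is a convex combination of the three inputs at that position.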