HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval

Bibliographic Details
Published in:	IEEE Transactions on Multimedia, 2017-09, Vol. 19 (9), p. 1968-1983
Main Authors:	Jie Lin, Ling-Yu Duan, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang, Alex Kot, Wen Gao
Format:	Article
Language:	English
Subjects:	Artificial neural networks; Convolutional neural networks (CNNs); deep global descriptors; Demand analysis; Design improvements; Empirical analysis; Encoding; Feature extraction; handcrafted descriptors; hybrid nested invariance pooling; Image coding; Invariance; Localization; Matching; MPEG CDVA; MPEG CDVS; MPEG encoders; Performance evaluation; Representations; Retrieval; Robustness; Standardization; Transform coding; Video compression; Visualization

Description
Summary:	With emerging demand for large-scale video analysis, MPEG initiated the compact descriptor for video analysis (CDVA) standardization in 2014. Beyond handcrafted descriptors adopted by the current MPEG-CDVA reference model, we study the problem of deep learned global descriptors for video matching, localization, and retrieval. First, inspired by a recent invariance theory, we propose a nested invariance pooling (NIP) method to derive compact deep global descriptors from convolutional neural networks (CNNs), by progressively encoding translation, scale, and rotation invariances into the pooled descriptors. Second, our empirical studies show that a sequence of well-designed pooling moments (e.g., max or average) may drastically impact video matching performance, which motivates us to design hybrid pooling operations via NIP (HNIP). HNIP further improves the discriminability of deep global descriptors. Third, we report the technical merits and performance improvements of combining deep and handcrafted descriptors to better investigate their complementary effects. We evaluate the effectiveness of HNIP within the well-established MPEG-CDVA evaluation framework. Extensive experiments demonstrate that HNIP outperforms the state-of-the-art deep and canonical handcrafted descriptors with significant mAP gains of 5.5% and 4.7%, respectively. In particular, the combination of HNIP-incorporated and handcrafted global descriptors significantly boosts the performance of CDVA core techniques with comparable descriptor size.
ISSN:	1520-9210, 1941-0077
DOI:	10.1109/TMM.2017.2713410
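
Illustration:	The summary describes nested invariance pooling as progressively pooling CNN features over translations, then scales, then rotations, with a possibly different pooling moment at each level. The following is a minimal sketch of that nesting only, not the authors' implementation. The function names, the input layout `feature_maps[r][s][c]` (a flat list of activations for rotation r, scale s, channel c), the default moment order (average, average, max), and the final L2 normalization are all assumptions made for illustration; the record does not specify them.

```python
import math


def pool(values, moment):
    """Pool a list of numbers with the named moment ('avg' or 'max')."""
    if moment == "avg":
        return sum(values) / len(values)
    if moment == "max":
        return max(values)
    raise ValueError(f"unknown pooling moment: {moment}")


def pool_vectors(vectors, moment):
    """Pool a list of equal-length vectors element-wise."""
    return [pool(list(column), moment) for column in zip(*vectors)]


def nested_invariance_pooling(feature_maps, moments=("avg", "avg", "max")):
    """Hypothetical helper: nest pooling over translation, scale, rotation.

    feature_maps[r][s][c] is a flat list of spatial activations of channel c
    for rotated copy r and scaled copy s of the same input frame (assumed
    layout). moments = (translation, scale, rotation) pooling moments.
    """
    t_moment, s_moment, r_moment = moments
    per_rotation = []
    for scales in feature_maps:
        per_scale = []
        for channels in scales:
            # Translation level: pool each channel over all spatial positions.
            per_scale.append([pool(ch, t_moment) for ch in channels])
        # Scale level: pool the translation-pooled vectors across scales.
        per_rotation.append(pool_vectors(per_scale, s_moment))
    # Rotation level: pool across the rotated copies.
    descriptor = pool_vectors(per_rotation, r_moment)
    # L2-normalize so descriptors can be compared by dot product (assumption).
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor


# Toy input: 2 rotations x 2 scales x 2 channels x 2 spatial positions.
toy_maps = [
    [[[1.0, 3.0], [0.0, 0.0]], [[2.0, 2.0], [4.0, 4.0]]],
    [[[0.0, 0.0], [6.0, 6.0]], [[0.0, 0.0], [2.0, 2.0]]],
]
toy_descriptor = nested_invariance_pooling(toy_maps)
```

Because every level pools symmetrically over its set of transformed copies, reordering the rotated (or scaled) copies leaves the descriptor unchanged, which is the sense in which the pooled output is invariant in this sketch. Passing a different `moments` tuple swaps the pooling moment at each level, which is the kind of hybrid combination the summary refers to.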