A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, p. 3689-3700
Main Authors: Luo, Gen; Zhou, Yiyi; Sun, Jiamu; Sun, Xiaoshuai; Ji, Rongrong
Format: Article
Language: English
Description
Summary: One-stage Referring Expression Comprehension (REC) is a task that requires accurate alignment between text descriptions and visual content. In recent years, numerous efforts have been devoted to cross-modal learning for REC, but the influence of other factors on this task still lacks a systematic study. To fill this gap, we conduct an empirical study in this article. Concretely, we ablate 42 candidate designs/settings based on a common REC framework; these candidates cover the entire process of one-stage REC, from network design to model training. We then conduct over 100 experimental trials on three REC benchmark datasets. The extensive experimental results reveal the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation. Based on these findings, we further propose a simple yet strong model called SimREC, which achieves new state-of-the-art performance on these benchmarks. Beyond these results, we also find that, with much less training overhead and far fewer parameters, SimREC can achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, highlighting the special role of REC in existing V&L research.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3314153