
Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization

Bibliographic Details
Published in: International Journal of Computer Vision, 2024-09, Vol. 132 (9), pp. 3375-3407
Main Authors: Zhu, Lin; Yin, Weihan; Yang, Yiyao; Wu, Fan; Zeng, Zhaoyu; Gu, Qinying; Wang, Xinbing; Zhou, Chenghu; Ye, Nanyang
Format: Article
Language: English
Description
Summary: Recent advances in fine-tuning large-scale vision-language pre-trained models (VL-PTMs) have shown promising results in quick adaptation to downstream tasks. However, prior research often lacks a comprehensive investigation into out-of-distribution (OOD) generalization. Fine-tuning carries a risk of overfitting, especially on few-shot OOD datasets where significant distribution shifts occur between the few-shot training examples and the test sets. Previous research on fine-tuning’s robustness to distribution shifts does not account for the differing characteristics of distribution shifts and may not handle noisy data with spurious correlations effectively. To address these challenges, we propose Vision-Language Alignment Learning under Affinity and Divergence Principles (VLAD) to adapt VL-PTMs for robust few-shot OOD generalization with theoretical guarantees. Built upon the large-scale pre-trained vision-language foundation model CLIP, we leverage frozen language embeddings as invariant anchors that protect against distribution shifts, while using adapter layers to fine-tune the pre-trained visual features for improved vision-language alignment. In addition, we introduce affinity and divergence principles that further mitigate overfitting during vision-language alignment by increasing class discrimination and suppressing non-causal features. More importantly, we offer theoretical evidence for the superiority of general language knowledge in achieving more robust OOD generalization, and our analysis shows that the proposed regularization loss yields a tighter upper bound on the OOD generalization error. Our approach is substantiated by extensive experiments and ablation studies on diverse datasets, validating our theoretical findings. The code is available at https://github.com/LinLLLL/VLAD.
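
The abstract describes a concrete recipe: freeze CLIP's text embeddings as class anchors, train a lightweight adapter on the frozen visual features, and regularize alignment with an affinity term (pull each image toward its class anchor) and a divergence term (push apart non-matching features). The following is a minimal PyTorch sketch of that setup under stated assumptions; the Adapter module, the vlad_style_loss function, and the affinity_weight/divergence_weight/temperature parameters are illustrative stand-ins, not the authors' implementation. The exact VLAD losses are defined in the paper and in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter over frozen visual features (hypothetical)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update, then re-normalize so cosine similarities stay meaningful.
        return F.normalize(x + self.net(x), dim=-1)

def vlad_style_loss(img_feats, text_anchors, labels,
                    affinity_weight=1.0, divergence_weight=0.1,
                    temperature=0.01):
    """Affinity/divergence-style objective (illustrative, not the paper's exact loss).

    img_feats:    (B, D) adapter outputs, L2-normalized.
    text_anchors: (C, D) frozen, L2-normalized language embeddings.
    labels:       (B,)  class indices.
    """
    logits = img_feats @ text_anchors.t() / temperature  # (B, C) cosine logits
    # Affinity-style term: align each image with its own class anchor.
    affinity = F.cross_entropy(logits, labels)
    # Divergence-style term: penalize similarity between images of different
    # classes, a crude stand-in for suppressing shared, non-causal directions.
    sim = img_feats @ img_feats.t()                      # (B, B)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    divergence = sim[~same_class].clamp(min=0).mean()
    return affinity_weight * affinity + divergence_weight * divergence

# Usage with random stand-ins for CLIP features (512-d, as in ViT-B/32):
torch.manual_seed(0)
B, C, D = 8, 5, 512
text_anchors = F.normalize(torch.randn(C, D), dim=-1)   # frozen anchors
frozen_visual = F.normalize(torch.randn(B, D), dim=-1)  # frozen backbone output
labels = torch.randint(0, C, (B,))
adapter = Adapter(D)
loss = vlad_style_loss(adapter(frozen_visual), text_anchors, labels)
loss.backward()  # gradients flow only into the adapter parameters
```

Because the text anchors are frozen, gradients reach only the adapter, which mirrors the abstract's framing of the language embeddings as invariant anchors under distribution shift.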
ISSN: 0920-5691, 1573-1405
DOI: 10.1007/s11263-024-02036-4