Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size on standard image-text retrieval benchmarks, along with notable gains in compositionality.
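Below is a minimal, hedged sketch (not the authors' released code) of how the two training objectives described in the abstract can be combined: a CLIP-style symmetric contrastive loss on pooled image/text embeddings plus a standard next-token prediction loss on the caption. The `encode_image`/`encode_text` helpers and the pooling choice (e.g., taking the last hidden state after a discriminative prompt) are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) pooled embeddings of images and captions.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: match each image to its caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(lvlm, batch, lambda_ntp=1.0):
    # Hypothetical helpers: pool the LVLM's hidden states (e.g., the last
    # token after a discriminative prompt) into one embedding per sample.
    img_emb = lvlm.encode_image(batch["pixel_values"])
    txt_emb = lvlm.encode_text(batch["caption_ids"])
    l_contrastive = contrastive_loss(img_emb, txt_emb)

    # Next-token prediction: model the caption conditioned on the image
    # (standard language-modeling loss, as returned by HF-style LVLMs).
    out = lvlm(pixel_values=batch["pixel_values"],
               input_ids=batch["caption_ids"],
               labels=batch["caption_ids"])
    return l_contrastive + lambda_ntp * out.loss
```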
Community
TL;DR: The paper introduces **VladVA: Vision-Language Adaptation for Discriminative Visual Assistant**, a novel approach to enhance the image-text discriminative abilities of Large Vision-Language Models (LVLMs). While current CLIP-style models excel in zero-shot tasks, they often struggle with language comprehension and compositional reasoning, exhibiting a "bag of words" behavior. In contrast, LVLMs demonstrate superior vision-language reasoning but are less suitable for discriminative tasks due to their generative nature.
VladVA addresses this by transforming a generative LVLM into a discriminative one, unlocking its potential for powerful image-text discrimination and enhanced language understanding. Key innovations include:
- Tailored Training Framework: Leverages diverse image-text pairs, training with both contrastive and next-token prediction losses to boost discrimination while preserving compositional capabilities.
- Efficient Adaptation: Incorporates soft prompting and LoRA adapters for fine-tuning, ensuring effectiveness and computational efficiency (a minimal sketch of this setup follows the list).
- Performance Gains: Delivers significant improvements over state-of-the-art models in benchmarks for image-text retrieval and compositional reasoning, achieving up to 15% better accuracy.
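For the parameter-efficient adaptation, a hedged sketch of what a soft-prompt + LoRA setup could look like is given below, using the `peft` library for the LoRA adapters. The target module names assume a LLaMA-style language backbone, and the `SoftPrompt` module and its hyperparameters are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def add_lora(lvlm, r=16, alpha=32):
    # Wrap the LVLM so that only the low-rank adapter weights are trainable.
    # Module names assume LLaMA-style attention blocks (an assumption here).
    cfg = LoraConfig(r=r, lora_alpha=alpha, lora_dropout=0.05, bias="none",
                     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    return get_peft_model(lvlm, cfg)

class SoftPrompt(nn.Module):
    """A small set of learnable embeddings prepended to the token embeddings."""
    def __init__(self, n_tokens=8, hidden_dim=4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (B, L, D) embeddings from the model's embedding layer.
        prompt = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```

Only the soft-prompt parameters and LoRA weights would be updated during fine-tuning, keeping the bulk of the LVLM frozen.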
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2024)
- UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models (2024)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (2024)
- Multimodal Autoregressive Pre-training of Large Vision Encoders (2024)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models (2024)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant (2024)