Multimodal Language Model
What matters besides the data recipe when training a multimodal language model?
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 59
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 39
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 68
openbmb/MiniCPM-V-2_6
Image-Text-to-Text • Updated • 93.1k • 842
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 97
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
OpenGVLab/InternViT-6B-448px-V1-2
Image Feature Extraction • Updated • 249 • 25
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 53
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 16
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 8
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 57
Note:
1. The intra-image bidirectional attention is important; replacing it with causal attention hurts text-to-image generation (sketch below).
2. There is a clear advantage to using U-Net up and down blocks instead of a simple linear layer for modality mapping.
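A minimal sketch of the attention pattern in point 1, assuming a token sequence where each image occupies a contiguous span of patch tokens; the helper name and span format are illustrative, not the paper's code.

```python
import torch

def transfusion_style_mask(seq_len, image_spans):
    """Attention mask that is causal overall but bidirectional inside each
    image's token span (point 1 above, sketched).

    image_spans: list of (start, end) index pairs, end exclusive.
    Returns a [seq_len, seq_len] boolean mask where True = may attend.
    """
    # Start from a standard causal (lower-triangular) mask for text tokens.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Open up full bidirectional attention within every image span.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Example: 4 text tokens, an 8-token image patch block, then 4 more text tokens.
mask = transfusion_style_mask(seq_len=16, image_spans=[(4, 12)])
```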
LISA: Reasoning Segmentation via Large Language Model
Paper • 2308.00692 • Published • 1
Note:
1. Extract the feature of the [SEG] token from the last hidden layer of the LLM and project it to the SAM decoder (sketch below).
2. Joint training with pixel-level understanding data often leads to decreased image-level capability.
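Sketch of point 1: pull the last-layer hidden state at the [SEG] position out of the LLM output and project it into the SAM decoder's prompt-embedding space. Dimensions and the two-layer MLP are assumptions for illustration, not LISA's exact module.

```python
import torch
import torch.nn as nn

class SegTokenProjector(nn.Module):
    """Project the [SEG] token's last-layer hidden state to the SAM decoder."""

    def __init__(self, llm_hidden=4096, sam_prompt_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_hidden, llm_hidden),
            nn.ReLU(),
            nn.Linear(llm_hidden, sam_prompt_dim),
        )

    def forward(self, last_hidden_states, input_ids, seg_token_id):
        # last_hidden_states: [batch, seq_len, llm_hidden]
        seg_mask = input_ids == seg_token_id          # locate [SEG] positions
        seg_features = last_hidden_states[seg_mask]   # [num_seg, llm_hidden]
        return self.proj(seg_features)                # fed to the SAM mask decoder
```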
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 52
Note:
1. Image Encoder: a ConvNeXt-L-based CLIP model to reach high resolution.
2. Directly combining a frozen perception module with an LLM doesn't perform well.
3. Use a simple MLP to map the hidden state of the [SEG] token from the LLM output to the visual space.
4. Proposes a region encoder design adapted from a pre-trained image encoder (sketch below).
5. The answer format is "<Expression> [SEG]." Since the expression is flexible and variable, the LLM is less likely to overfit to a fixed response.
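For point 4, a generic masked-pooling region encoder over frozen high-resolution features; this is one common way to turn region prompts into visual tokens and is an assumption here, not necessarily OMG-LLaVA's exact design.

```python
import torch

def masked_region_pool(feature_map, region_masks):
    """Pool frozen image-encoder features inside each region mask.

    feature_map:  [C, H, W] features, e.g. from a ConvNeXt-L CLIP backbone.
    region_masks: [N, H, W] binary masks, one per region prompt.
    Returns:      [N, C] region tokens (one visual token per region).
    """
    c, h, w = feature_map.shape
    flat_feats = feature_map.reshape(c, h * w)             # [C, HW]
    flat_masks = region_masks.reshape(-1, h * w).float()   # [N, HW]
    weights = flat_masks / flat_masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ flat_feats.t()                        # [N, C]
```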
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 39
Note:
1. The Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible.
2. The Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., 6x downsampling) to focus on motion cues.
3. Concatenate them together and, bang, we have good video features even without training.
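A training-free sketch of the two pathways and their concatenation; the stride and pooling values are illustrative placeholders, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_stride=8, fast_pool=6):
    """Combine a Slow and a Fast pathway over frozen per-frame features.

    frame_feats: [T, H, W, C] features from a frozen image encoder.
    Slow path:   subsample frames (low frame rate), keep full spatial detail.
    Fast path:   keep all frames (high frame rate), pool spatially (e.g. 6x).
    """
    t, h, w, c = frame_feats.shape
    x = frame_feats.permute(0, 3, 1, 2)                    # [T, C, H, W]

    slow = x[::slow_stride]                                # few frames, full resolution
    fast = F.avg_pool2d(x, kernel_size=fast_pool)          # all frames, coarse grid

    slow_tokens = slow.flatten(2).transpose(1, 2).reshape(-1, c)
    fast_tokens = fast.flatten(2).transpose(1, 2).reshape(-1, c)
    return torch.cat([slow_tokens, fast_tokens], dim=0)    # fed to the LLM projector
```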
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Paper • 2311.05698 • Published • 9
Note:
1. Scales to 512 input video frames with the Token Turing Machine Combiner.
2. The ‘Process’ function is implemented with a standard Transformer of MHA and MLP layers; the ‘Read’, ‘Write’, and ‘Output’ functions are implemented with attention pooling.
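A minimal attention-pooling module of the kind point 2 describes for ‘Read’/‘Write’/‘Output’: a small set of learned queries cross-attends to the inputs and returns that many summary tokens. Sizes and query counts are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learned queries cross-attend to input tokens and return a fixed-size summary."""

    def __init__(self, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: [batch, n, dim] -> [batch, num_queries, dim]
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out

# TTM-style loop (informal): Read-pool (memory + new frame tokens) into a small
# working set, Process it with a standard Transformer, then Write/Output-pool again.
```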
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 122
Note:
1. Training methods:
1.1 Train on progressively higher-quality data, gradually increase the maximum image resolution, and unfreeze more model parts as training proceeds (see the stage sketch below).
2. Dataset:
2.1 Apply image deduplication; it is possible to train on just half of the LAION dataset with only a minimal reduction in performance compared to using the full dataset.
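A toy stage schedule in the spirit of point 1.1; the data descriptions, resolutions, and module names below are made up for illustration and are not the paper's actual recipe.

```python
# Each successive stage: higher-quality data, higher max image resolution,
# and more of the model unfrozen.
STAGES = [
    {"data": "large-scale web image-text",     "max_image_res": 384,
     "trainable": ["connector"]},
    {"data": "filtered / synthetic captions",  "max_image_res": 768,
     "trainable": ["connector", "vision_encoder"]},
    {"data": "high-quality instruction data",  "max_image_res": 980,
     "trainable": ["connector", "vision_encoder", "llm"]},
]

for i, stage in enumerate(STAGES, start=1):
    print(f"stage {i}: res={stage['max_image_res']}, unfrozen={stage['trainable']}")
```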
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 83
Note:
1. Unfreezing the CLIP encoder significantly improves performance when interpolating to an MLLM input resolution higher than the pre-training resolution.
2. Introduces a Pre-Alignment training stage:
2.1 Train each pre-trained vision expert with its own projector on SFT data, while keeping the language model frozen.
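A minimal setup sketch of the Pre-Alignment stage in point 2: every vision expert gets its own projector trained against a frozen language model. The `out_dim` attribute and the two-layer MLP projector are assumptions, not Eagle's exact code.

```python
import torch.nn as nn

def setup_prealignment(vision_experts, llm, llm_hidden=4096):
    """Build one projector per vision expert and freeze the language model."""
    projectors = nn.ModuleList([
        nn.Sequential(
            nn.Linear(expert.out_dim, llm_hidden),  # expert-specific projector
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )
        for expert in vision_experts
    ])
    for p in llm.parameters():        # the LLM receives no gradients in this stage
        p.requires_grad = False
    return projectors
```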
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
allenai/Molmo-7B-D-0924
Image-Text-to-Text • Updated • 77.3k • 446
meta-llama/Llama-3.2-11B-Vision-Instruct
Image-Text-to-Text • Updated • 2.14M • 1.03k
Video Instruction Tuning With Synthetic Data
Paper • 2410.02713 • Published • 37
Note:
1. Arrange slow and fast frames in an interleaving pattern: p × p pooling for slow frames and 2p × 2p pooling for fast frames (sketch below).
2. Use a tagging model to categorize the video content: InsTag (https://arxiv.org/pdf/2308.07074).
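Sketch of the interleaving in point 1; the choice of one slow frame per group of four and p = 2 are placeholder values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def interleaved_slowfast(frame_feats, p=2, group=4):
    """Pool slow frames with p x p and fast frames with 2p x 2p, interleaved.

    frame_feats: [T, C, H, W] per-frame features.
    Every `group`-th frame is treated as slow; the rest are fast.
    Returns a list of per-frame token tensors in original temporal order.
    """
    tokens = []
    for t, feat in enumerate(frame_feats):
        stride = p if t % group == 0 else 2 * p                      # slow vs fast
        pooled = F.avg_pool2d(feat.unsqueeze(0), kernel_size=stride)
        tokens.append(pooled.flatten(2).transpose(1, 2).squeeze(0))  # [hw, C]
    return tokens
```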
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Paper • 2410.16267 • Published • 15
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 74
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 24
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 40