Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2412.08687

about 2 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 39
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 20

Paper - Multimodal

Paper related to Multimodal Model - Research for a : Modular, Multimodal, Multi-Stream, Mixture of Expert, Universal Transformer, Matryoshka embedding

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published 7 days ago • 25
No More Adam: Learning Rate Scaling at Initialization is All You Need

Paper • 2412.11768 • Published 10 days ago • 41
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 8 days ago • 103
Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published 8 days ago • 13

MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

Paper • 2412.07147 • Published 16 days ago • 5
Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Paper • 2412.04429 • Published 21 days ago
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Paper • 2412.05939 • Published 18 days ago • 12
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Paper • 2412.08737 • Published 15 days ago • 51

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Paper • 2411.18499 • Published 29 days ago • 18
VLSBench: Unveiling Visual Leakage in Multimodal Safety

Paper • 2411.19939 • Published 27 days ago • 9
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Paper • 2412.02611 • Published 23 days ago • 22
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Paper • 2412.03205 • Published 22 days ago • 15

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Paper • 2410.09335 • Published Oct 12 • 16
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9 • 35
Emergent properties with repeated examples

Paper • 2410.07041 • Published Oct 9 • 8
Personalized Visual Instruction Tuning

Paper • 2410.07113 • Published Oct 9 • 69

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13 • 18
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24 • 13
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20 • 14
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17 • 30

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs