Collections

Collections including paper arxiv:2412.07112

about 2 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 39
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 20

Paper - Multimodal

Papers related to multimodal models. Research toward: modular, multimodal, multi-stream, mixture-of-experts, universal transformer, Matryoshka embedding

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published 7 days ago • 25
No More Adam: Learning Rate Scaling at Initialization is All You Need

Paper • 2412.11768 • Published 10 days ago • 41
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 8 days ago • 103
Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published 8 days ago • 13

multilingual vision models

Some papers I read to understand vision models and to add multilingual capabilities to them

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 87
Visual Instruction Tuning

Paper • 2304.08485 • Published Apr 17, 2023 • 13
Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 37
PALO: A Polyglot Large Multimodal Model for 5B People

Paper • 2402.14818 • Published Feb 22 • 23

Pending Classification

about 5 hours ago

Video Creation by Demonstration

Paper • 2412.09551 • Published 14 days ago • 8
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Paper • 2412.07589 • Published 16 days ago • 45
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

Paper • 2412.06531 • Published 17 days ago • 71
APOLLO: SGD-like Memory, AdamW-level Performance

Paper • 2412.05270 • Published 20 days ago • 38

microsoft/OmniParser

Image-Text-to-Text • Updated 24 days ago • 4.33k • 1.5k
Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published 16 days ago • 25

Multimodal Dataset

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

Paper • 2407.09413 • Published Jul 12 • 9
MAVIS: Mathematical Visual Instruction Tuning

Paper • 2407.08739 • Published Jul 11 • 30
Kvasir-VQA: A Text-Image Pair GI Tract Dataset

Paper • 2409.01437 • Published Sep 2 • 70
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Paper • 2409.05840 • Published Sep 9 • 45

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24 • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 87
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27 • 31
