
Introduction

We use TinyLLaVA Factory to train a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA-style models on edge devices that have only a few gigabytes of memory.

For the LLM and the vision tower, we use OpenELM-270M-Instruct and facebook/dinov2-small, respectively.
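
As a rough illustration of the memory goal, the sketch below estimates the size of the weights alone. It assumes every one of the 296M parameters is stored in FP16 (2 bytes each), which matches the tensor type listed for this checkpoint; activations, the KV cache, and runtime overhead are not counted, so real usage will be higher. The function name is hypothetical.

```python
# Back-of-the-envelope estimate of weight memory only.
# Assumption: all parameters are stored in FP16 (2 bytes per parameter).
NUM_PARAMS = 296_000_000  # parameter count stated in this card
BYTES_PER_PARAM = 2       # FP16


def weight_memory_bytes(num_params: int, bytes_per_param: int) -> int:
    """Return the number of bytes needed to hold the raw weights."""
    return num_params * bytes_per_param


weight_bytes = weight_memory_bytes(NUM_PARAMS, BYTES_PER_PARAM)
weight_gib = weight_bytes / 2**30

print(f"{weight_bytes / 1e6:.0f} MB ({weight_gib:.2f} GiB)")
# prints: 592 MB (0.55 GiB)
```

Under these assumptions the weights take well under 1 GiB, which leaves headroom on a device with a few gigabytes of memory.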

Results

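POPE

| Category | # Samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.613 |
| Popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
| Random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |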
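
The derived POPE columns can be recomputed from the raw TP/FP/TN/FN counts. The sketch below uses the standard confusion-matrix definitions and takes "Yes Ratio" to be the fraction of samples answered "yes", i.e. (TP + FP) / total; that reading is an assumption, although it reproduces the reported values. The helper name `pope_metrics` is hypothetical and is not part of any evaluation toolkit.

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the derived POPE columns from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed definition: share of samples the model answered "yes".
        "yes_ratio": (tp + fp) / total,
    }


# (TP, FP, TN, FN) for each POPE split, copied from the results above.
splits = {
    "Adversarial": (1264, 575, 925, 236),
    "Popular": (1264, 301, 1199, 236),
    "Random": (1264, 290, 1120, 236),
}

for name, counts in splits.items():
    metrics = pope_metrics(*counts)
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Recall is identical across the three splits because they share the same positive questions and differ only in how the negative objects are sampled.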

TextVQA

Samples: 5000, Accuracy: 27%

ScienceQA

Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%

MMMU

| Category | # Samples | Accuracy |
|---|---|---|
| Overall | 900 | 0.273 |
| Overall-Art and Design | 120 | 0.233 |
| Art | 30 | 0.233 |
| Art Theory | 30 | 0.167 |
| Design | 30 | 0.367 |
| Music | 30 | 0.167 |
| Overall-Business | 150 | 0.293 |
| Accounting | 30 | 0.367 |
| Economics | 30 | 0.467 |
| Finance | 30 | 0.200 |
| Management | 30 | 0.233 |
| Marketing | 30 | 0.200 |
| Overall-Science | 150 | 0.273 |
| Biology | 30 | 0.267 |
| Chemistry | 30 | 0.100 |
| Geography | 30 | 0.200 |
| Math | 30 | 0.433 |
| Physics | 30 | 0.367 |
| Overall-Health and Medicine | 150 | 0.293 |
| Basic Medical Science | 30 | 0.333 |
| Clinical Medicine | 30 | 0.200 |
| Diagnostics and Laboratory Med. | 30 | 0.233 |
| Pharmacy | 30 | 0.333 |
| Public Health | 30 | 0.367 |
| Overall-Humanities and Soc. Sci. | 120 | 0.267 |
| History | 30 | 0.333 |
| Literature | 30 | 0.300 |
| Sociology | 30 | 0.133 |
| Psychology | 30 | 0.300 |
| Overall-Tech and Engineering | 210 | 0.271 |
| Agriculture | 30 | 0.200 |
| Architecture and Engineering | 30 | 0.267 |
| Computer Science | 30 | 0.333 |
| Electronics | 30 | 0.267 |
| Energy and Power | 30 | 0.333 |
| Materials | 30 | 0.267 |
| Mechanical Engineering | 30 | 0.233 |
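
The MMMU "Overall" row appears to be the sample-weighted mean of the six discipline-level rows. The sketch below checks this under that assumption, using the rounded per-discipline accuracies reported above, so the result is only accurate to about three decimal places.

```python
# (number of samples, reported accuracy) for each "Overall-<discipline>" row.
disciplines = {
    "Art and Design": (120, 0.233),
    "Business": (150, 0.293),
    "Science": (150, 0.273),
    "Health and Medicine": (150, 0.293),
    "Humanities and Soc. Sci.": (120, 0.267),
    "Tech and Engineering": (210, 0.271),
}

total_samples = sum(n for n, _ in disciplines.values())

# Weight each discipline's accuracy by its share of the samples.
overall_accuracy = (
    sum(n * acc for n, acc in disciplines.values()) / total_samples
)

print(total_samples, round(overall_accuracy, 3))
# prints: 900 0.273
```

Because Tech and Engineering contributes 210 of the 900 samples, it influences the overall score more than any other discipline.
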

Model size: 296M params (Safetensors). Tensor type: FP16.