Introduction

We use TinyLLaVA Factory to build a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA models on edge devices with only a few gigabytes of memory.

For the LLM and the vision tower, we chose OpenELM-270M-Instruct and facebook/dinov2-small, respectively.

Results

Category	# Samples	TP	FP	TN	FN	Accuracy	Precision	Recall	F1 Score	Yes Ratio
Adversarial	3000	1264	575	925	236	0.7297	0.6873	0.8427	0.7571	0.6130
Popular	3000	1264	301	1199	236	0.8210	0.8077	0.8427	0.8248	0.5217
Random	2910	1264	290	1120	236	0.8192	0.8134	0.8427	0.8278	0.5340
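
The derived columns in the table above follow directly from the four confusion-matrix counts. The sketch below assumes the standard definitions of accuracy, precision, recall, and F1, and assumes that "Yes Ratio" is the share of samples the model answered "yes" to, i.e. (TP + FP) / total. The function name `binary_metrics` is illustrative and not part of any evaluation script.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the table's metric columns from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumption: "Yes Ratio" = fraction of samples answered "yes".
        "yes_ratio": (tp + fp) / total,
    }


# Counts copied from the Adversarial row of the table.
adversarial = binary_metrics(tp=1264, fp=575, tn=925, fn=236)
for name, value in adversarial.items():
    print(f"{name}: {value:.4f}")
```

Run on the Adversarial row, this reproduces the reported values (0.7297, 0.6873, 0.8427, 0.7571, 0.6130). Recall is the same in all three rows because TP and FN are the same in each; only the negative samples differ.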

Samples: 5000, Accuracy: 27%

Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%

Category	# Samples	Accuracy
Overall	900	0.273
Overall-Art and Design	120	0.233
Art	30	0.233
Art Theory	30	0.167
Design	30	0.367
Music	30	0.167
Overall-Business	150	0.293
Accounting	30	0.367
Economics	30	0.467
Finance	30	0.200
Management	30	0.233
Marketing	30	0.200
Overall-Science	150	0.273
Biology	30	0.267
Chemistry	30	0.100
Geography	30	0.200
Math	30	0.433
Physics	30	0.367
Overall-Health and Medicine	150	0.293
Basic Medical Science	30	0.333
Clinical Medicine	30	0.200
Diagnostics and Laboratory Med.	30	0.233
Pharmacy	30	0.333
Public Health	30	0.367
Overall-Humanities and Soc. Sci.	120	0.267
History	30	0.333
Literature	30	0.300
Sociology	30	0.133
Psychology	30	0.300
Overall-Tech and Engineering	210	0.271
Agriculture	30	0.200
Architecture and Engineering	30	0.267
Computer Science	30	0.333
Electronics	30	0.267
Energy and Power	30	0.333
Materials	30	0.267
Mechanical Engineering	30	0.233
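
The "Overall" row is consistent with a sample-weighted mean of the six "Overall-..." group rows. The sketch below checks this using only the numbers copied from the table. It assumes the overall score was aggregated this way, which the table suggests but does not state. The variable names are illustrative.

```python
# (number of samples, accuracy) for each "Overall-..." row of the table.
groups = {
    "Art and Design": (120, 0.233),
    "Business": (150, 0.293),
    "Science": (150, 0.273),
    "Health and Medicine": (150, 0.293),
    "Humanities and Soc. Sci.": (120, 0.267),
    "Tech and Engineering": (210, 0.271),
}

# Weight each group's accuracy by its sample count.
total_samples = sum(n for n, _ in groups.values())
weighted_accuracy = sum(n * acc for n, acc in groups.values()) / total_samples

print(total_samples)                # 900
print(round(weighted_accuracy, 3))  # 0.273
```

Because the groups differ in size (Tech and Engineering has 210 samples, Art and Design only 120), a plain average of the six group accuracies would weight them differently from the per-sample overall score.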