Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! 🤯
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook!
- SmolVLM can be fine-tuned on a Google Colab, or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models on video benchmarks, despite not even being trained on videos!
Hugging Face presents FineVideo 🔥! Unlocking the next generation of video understanding.
🤯 3,400 hours of annotated Creative Commons videos, with rich character descriptions, scene splits, per-scene mood and content descriptions, and QA pairs. 🔥 @mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.