AI & ML interests

Feeling and building multimodal intelligence.

  • [2024-11] 🤯🤯 We introduce Multimodal SAE, the first framework designed to interpret learned features in large-scale multimodal models using Sparse Autoencoders. We use LLaVA-OneVision-72B to analyze and explain the SAE-derived features of LLaVA-NeXT-LLaMA3-8B. We also show that clamping specific features can steer model behavior to alleviate hallucinations and avoid safety-related issues.

    GitHub | Paper

  • [2024-10] 🔥🔥 We present LLaVA-Critic, the first open-source large multimodal model designed as a generalist evaluator of LMM-generated responses across diverse multimodal tasks and scenarios.

    GitHub | Blog

  • [2024-10] 🎬🎬 Introducing LLaVA-Video, a family of open large multimodal models designed specifically for advanced video understanding. We're open-sourcing LLaVA-Video-178K, a high-quality, synthetic dataset for video instruction tuning.

    GitHub | Blog

  • [2024-08] 🤞🤞 We present LLaVA-OneVision, a family of LMMs developed by consolidating insights into data, models, and visual representations.

    GitHub | Blog

  • [2024-06] 🧑‍🎨🧑‍🎨 We release LLaVA-NeXT-Interleave, an LMM extending capabilities to real-world settings: Multi-image, Multi-frame (videos), Multi-view (3D), and Multi-patch (single-image).

    GitHub | Blog

  • [2024-06] 🚀🚀 We release LongVA, a long-context model with state-of-the-art video understanding performance.

    GitHub | Blog

Older Updates (2024-06 and earlier)
  • [2024-06] 🎬🎬 The lmms-eval/v0.2 toolkit now supports video evaluations for models like LLaVA-NeXT Video and Gemini 1.5 Pro.

    GitHub | Blog

  • [2024-05] 🚀🚀 We release LLaVA-NeXT Video, a model that performs on par with Google's Gemini on video understanding tasks.

    GitHub | Blog

  • [2024-05] 🚀🚀 The LLaVA-NeXT model family approaches GPT-4V performance on multimodal benchmarks, with models up to 110B parameters.

    GitHub | Blog

  • [2024-03] We release lmms-eval, a toolkit for holistic evaluations with 50+ multimodal datasets and 10+ models.

    GitHub | Blog