Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a solution for feature interpretation across model scales.

This research is inspired by Anthropic's remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that correlate with diverse semantics and can be leveraged to steer model behavior, enabling more precise control over, and a better understanding of, LMM functionality.
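To make the idea of steering with an SAE feature concrete, here is a minimal NumPy sketch of one common approach: adding a scaled copy of a feature's decoder direction to a hidden state. This card does not say how steering is actually performed for this model, so treat the technique, the function name `steer`, the toy shapes, and the random weights as assumptions for illustration only.

```python
import numpy as np

def steer(hidden, W_dec, feature_idx, strength):
    """Shift every token's hidden state along one SAE feature direction.

    hidden:      (tokens, d_model) hidden states (hypothetical layout)
    W_dec:       (n_features, d_model) SAE decoder weights (hypothetical layout)
    feature_idx: index of the feature to amplify
    strength:    how far to move along the unit-normalized direction
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit vector
    return hidden + strength * direction               # broadcast over tokens

# Toy sizes and random weights, NOT the real model or SAE.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_dec = rng.normal(size=(n_features, d_model))
hidden = rng.normal(size=(5, d_model))

steered = steer(hidden, W_dec, feature_idx=7, strength=10.0)
unit = W_dec[7] / np.linalg.norm(W_dec[7])
# Each token's projection onto the feature direction grows by `strength`.
shift = (steered - hidden) @ unit
```

In a real setup the modified hidden state would be written back into the model at the layer the SAE was trained on; that wiring depends on the codebase and is not shown here.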

This model is an SAE trained on LLaVA-NeXT SFT data, with 131k features and 256 activated features. For usage, refer to the instructions in the GitHub repository.
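As a rough illustration of what "131k features and 256 activated features" can mean, the sketch below implements a top-k sparse autoencoder forward pass in NumPy: a wide dictionary of features, of which only the k largest pre-activations per input are kept. It assumes a top-k architecture, which this card does not confirm, and uses tiny toy sizes in place of 131k and 256. The function name, weight layout, and random weights are hypothetical and do not reflect the released checkpoint or its loading API.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc                 # (batch, n_features)
    # Column indices of the k largest entries in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # ReLU on the kept entries; everything else stays exactly zero.
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = codes @ W_dec + b_dec                     # (batch, d_model)
    return codes, recon

# Toy sizes: 64 features with k = 4, standing in for 131k features with 256 active.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=(3, d_model))
codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_row = np.count_nonzero(codes, axis=1)      # at most k per input
```

Because each input activates at most k features, the nonzero entries of `codes` can be inspected one feature at a time, which is what makes per-feature interpretation tractable.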

Collection including lmms-lab/llama3-llava-next-8b-hf-sae-131k: Multimodal-SAE