Alpha-VLLM/Chameleon_7B_mGPT

This is the Chameleon-7B checkpoint, converted with the convert_chameleon_weights_to_hf.py script from the Lumina-mGPT repository.

This release is intended to simplify the initialization of Lumina-mGPT training. Before using this model, please make sure you have obtained permission to access the official Chameleon checkpoints on Hugging Face. Use of this model is at your own risk.

Differences from the official chameleon-7B release

This model is almost identical to the official chameleon-7B release, with one important difference in the qk-norm implementation.

For unknown reasons, in the 34B Chameleon model, which was trained with 8-way model parallelism, the weights of the qk-norm layers differ across model-parallel ranks, although they are expected to be identical (see here for details). More intuitively, the attention heads can be divided into 1 group for the 7B model and 8 groups for the 34B model, where the qk-norm parameters are identical within a group but differ between groups.

To accommodate this, transformers copies the qk-norm parameters into a tensor of shape num_heads * head_dim. However, if the Chameleon model is finetuned further, as in Lumina-mGPT, the copies receive independent updates, so the qk-norm parameters keep diverging until they differ between every pair of attention heads, which is not ideal.

To solve this problem, we slightly change the implementation so that the qk-norm parameters have shape model_parallel_size * head_dim, where model_parallel_size is 1 for the 7B model and 8 for the 34B model, and are expanded to num_heads * head_dim at forward time through repeat_interleave. This modification ensures that the qk-norm parameters always remain consistent within the existing groups.
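The grouped qk-norm idea can be illustrated with a minimal sketch. This is not the Lumina-mGPT or transformers code: it uses NumPy, whose np.repeat along axis 0 has the same row-wise semantics as repeat_interleave along dim 0, and the names expand_qk_norm_param and qk_norm, the eps value, and the tiny sizes are all assumptions chosen for illustration.

```python
import numpy as np


def expand_qk_norm_param(param, num_heads):
    """Expand a (model_parallel_size, head_dim) parameter to (num_heads, head_dim).

    Hypothetical helper: each group's row is repeated for all heads in that
    group, so heads within a group always share identical values.
    """
    groups, _ = param.shape
    if num_heads % groups != 0:
        raise ValueError("num_heads must be divisible by the number of groups")
    heads_per_group = num_heads // groups
    # Repeats each row consecutively: [g0, g0, ..., g1, g1, ...],
    # the same ordering repeat_interleave produces along dim 0.
    return np.repeat(param, heads_per_group, axis=0)


def qk_norm(x, weight, bias, eps=1e-5):
    """Per-head layer norm over head_dim (eps is an assumed value).

    x:            (..., num_heads, head_dim)
    weight, bias: (model_parallel_size, head_dim), expanded at forward time
    """
    num_heads = x.shape[-2]
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    scale = expand_qk_norm_param(weight, num_heads)
    shift = expand_qk_norm_param(bias, num_heads)
    return normed * scale + shift


# Illustrative sizes only, not the real Chameleon configuration.
num_heads, head_dim, model_parallel_size = 8, 4, 2
weight = np.arange(model_parallel_size * head_dim, dtype=np.float32)
weight = weight.reshape(model_parallel_size, head_dim)
bias = np.zeros((model_parallel_size, head_dim), dtype=np.float32)

expanded = expand_qk_norm_param(weight, num_heads)
x = np.random.default_rng(0).standard_normal((3, num_heads, head_dim))
out = qk_norm(x.astype(np.float32), weight, bias)
```

Because only model_parallel_size rows are stored, any update to a row affects every head in that group equally, which is why heads within a group cannot drift apart during finetuning. With model_parallel_size set to 1, as for the 7B model, all heads share a single row.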


