This version of Cephalo, lamm-mit/Cephalo-Idefics2-vision-3x8b-beta, is a Mixture-of-Experts model.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/b7BK8ZtDzTMsyFDi0wP3w.png)
This model leverages multiple expert networks to process different parts of the input, allowing for more efficient and specialized computation. For each token in the input sequence, a gating layer computes scores for all experts and selects the top-*k* experts based on these scores. We use a *softmax* activation function to ensure that the weights across the chosen experts sum to unity. The output of the gating layer is a set of top-*k* values and their corresponding indices. The selected experts' outputs Y are then computed and combined using a weighted sum, where the weights are given by the top-*k* values. This sparse MoE mechanism allows the model to dynamically allocate computational resources, improving efficiency and performance on complex vision-language tasks. The figure above depicts an overview of the architecture.
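The gating step described above can be sketched as a toy PyTorch layer. This is a minimal illustration of sparse top-*k* routing with softmax-normalized weights, not the actual Cephalo implementation; the class name, expert architecture, and layer sizes are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: a gate scores all experts and routes each
    token to its top-k experts; outputs are combined by a weighted sum."""

    def __init__(self, dim: int, num_experts: int = 3, k: int = 2):
        super().__init__()
        # Illustrative feed-forward experts (real experts would be full decoder blocks).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.gate(x)                             # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)   # top-k values and indices
        weights = F.softmax(top_vals, dim=-1)             # weights over chosen experts sum to 1
        out = torch.zeros_like(x)
        # Run each expert only on the tokens that selected it.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot : slot + 1] * expert(x[mask])
        return out

moe = SparseMoE(dim=16, num_experts=3, k=2)
y = moe(torch.randn(5, 16))
print(y.shape)  # torch.Size([5, 16])
```

Because only *k* of the experts run per token, compute per token stays close to that of a single dense model even as total parameter count grows with the number of experts.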
This sample model has 20b parameters in total (three experts of 8b each, with 8b parameters active during inference). The instructions below include a detailed explanation of how other models can be constructed.