This version of Cephalo, lamm-mit/Cephalo-Idefics2-vision-3x8b-beta, is a Mixture-of-Experts model.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/b7BK8ZtDzTMsyFDi0wP3w.png)
This model leverages multiple expert networks to process different parts of the input, allowing for more efficient and specialized computation. For each token in the input sequence, a gating layer computes scores for all experts and selects the top-*k* experts based on these scores. We use a *softmax* activation function to ensure that the weights across the chosen experts sum to unity. The output of the gating layer is a set of top-*k* values and their corresponding indices. The selected experts' outputs Y are then computed and combined using a weighted sum, where the weights are given by the top-*k* values. This sparse MoE mechanism allows the model to dynamically allocate computational resources, improving efficiency and performance on complex vision-language tasks. The figure above depicts an overview of the architecture.
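The gating step described above can be sketched as a toy PyTorch layer. This is a minimal illustration of sparse top-*k* routing with softmax-normalized weights, not the actual Cephalo implementation; the class name, expert architecture, and layer sizes are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: a gate scores all experts and routes each
    token to its top-k experts; outputs are combined by a weighted sum."""

    def __init__(self, dim: int, num_experts: int = 3, k: int = 2):
        super().__init__()
        # Illustrative feed-forward experts (real experts would be full decoder blocks).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.gate(x)                             # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)   # top-k values and indices
        weights = F.softmax(top_vals, dim=-1)             # weights over chosen experts sum to 1
        out = torch.zeros_like(x)
        # Run each expert only on the tokens that selected it.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot : slot + 1] * expert(x[mask])
        return out

moe = SparseMoE(dim=16, num_experts=3, k=2)
y = moe(torch.randn(5, 16))
print(y.shape)  # torch.Size([5, 16])
```

Because only *k* of the experts run per token, compute per token stays close to that of a single dense model even as total parameter count grows with the number of experts.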
This sample model has 20b parameters in total (three experts of 8b each, with 8b parameters active during inference). The instructions below include a detailed explanation of how other models can be constructed.