Commit 3d12574
Parent(s): 5a6c0cd

Update README.md

README.md CHANGED
@@ -49,18 +49,99 @@ Zephyr is a series of language models that are trained to act as helpful assista

## Performance

-| Model |MT Bench
+| Model |MT Bench⬇️|IFEval|
|-----------------------------------------------------------------------|------:|------:|
|[zephyr-7b-gemma](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma)| 7.81 | 28.76|
|[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 7.34 | 43.81|
-|[gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 6.38 | 38.01|
+|[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 6.38 | 38.01|


-
+
+| Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
|-----------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
-|[zephyr-7b-gemma](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma)| 34.22| 66.37| 52.19| 37.10| 47.47|
|[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
-|[
+|[zephyr-7b-gemma](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma)| 34.22| 66.37| 52.19| 37.10| 47.47|
+|[mlabonne/Gemmalpaca-7B](https://huggingface.co/mlabonne/Gemmalpaca-7B)| 21.6 | 40.87| 44.85 | 30.49| 34.45|
+|[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|
+
+
+<details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench </summary>
+
+### AGIEval
+| Task |Version| Metric |Value| |Stderr|
+|------------------------------|------:|--------|----:|---|-----:|
+|agieval_aqua_rat | 0|acc |21.65|± | 2.59|
+| | |acc_norm|25.20|± | 2.73|
+|agieval_logiqa_en | 0|acc |34.72|± | 1.87|
+| | |acc_norm|35.94|± | 1.88|
+|agieval_lsat_ar | 0|acc |19.57|± | 2.62|
+| | |acc_norm|21.74|± | 2.73|
+|agieval_lsat_lr | 0|acc |30.59|± | 2.04|
+| | |acc_norm|32.55|± | 2.08|
+|agieval_lsat_rc | 0|acc |49.07|± | 3.05|
+| | |acc_norm|42.75|± | 3.02|
+|agieval_sat_en | 0|acc |54.85|± | 3.48|
+| | |acc_norm|53.40|± | 3.48|
+|agieval_sat_en_without_passage| 0|acc |37.38|± | 3.38|
+| | |acc_norm|33.98|± | 3.31|
+|agieval_sat_math | 0|acc |30.91|± | 3.12|
+| | |acc_norm|28.18|± | 3.04|
+
+Average: 34.22%
+
+### GPT4All
+| Task |Version| Metric |Value| |Stderr|
+|-------------|------:|--------|----:|---|-----:|
+|arc_challenge| 0|acc |49.15|± | 1.46|
+| | |acc_norm|52.47|± | 1.46|
+|arc_easy | 0|acc |77.44|± | 0.86|
+| | |acc_norm|74.75|± | 0.89|
+|boolq | 1|acc |79.69|± | 0.70|
+|hellaswag | 0|acc |60.59|± | 0.49|
+| | |acc_norm|78.00|± | 0.41|
+|openbookqa | 0|acc |29.20|± | 2.04|
+| | |acc_norm|37.80|± | 2.17|
+|piqa | 0|acc |76.82|± | 0.98|
+| | |acc_norm|77.80|± | 0.97|
+|winogrande | 0|acc |64.09|± | 1.35|
+
+Average: 66.37%
+
+### TruthfulQA
+| Task |Version|Metric|Value| |Stderr|
+|-------------|------:|------|----:|---|-----:|
+|truthfulqa_mc| 1|mc1 |35.74|± | 1.68|
+| | |mc2 |52.19|± | 1.59|
+
+Average: 52.19%
+
+### Bigbench
+| Task |Version| Metric |Value| |Stderr|
+|------------------------------------------------|------:|---------------------|----:|---|-----:|
+|bigbench_causal_judgement | 0|multiple_choice_grade|53.68|± | 3.63|
+|bigbench_date_understanding | 0|multiple_choice_grade|59.89|± | 2.55|
+|bigbench_disambiguation_qa | 0|multiple_choice_grade|30.23|± | 2.86|
+|bigbench_geometric_shapes | 0|multiple_choice_grade|11.42|± | 1.68|
+| | |exact_str_match | 0.00|± | 0.00|
+|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|28.40|± | 2.02|
+|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|19.14|± | 1.49|
+|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|44.67|± | 2.88|
+|bigbench_movie_recommendation | 0|multiple_choice_grade|26.80|± | 1.98|
+|bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
+|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|52.75|± | 1.12|
+|bigbench_ruin_names | 0|multiple_choice_grade|33.04|± | 2.22|
+|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|33.37|± | 1.49|
+|bigbench_snarks | 0|multiple_choice_grade|48.62|± | 3.73|
+|bigbench_sports_understanding | 0|multiple_choice_grade|58.11|± | 1.57|
+|bigbench_temporal_sequences | 0|multiple_choice_grade|37.20|± | 1.53|
+|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|20.08|± | 1.13|
+|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|15.77|± | 0.87|
+|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|44.67|± | 2.88|
+
+Average: 37.1%
+
+</details>
+

## Intended uses & limitations

@@ -70,8 +151,7 @@ We then further aligned the model with [🤗 TRL's](https://github.com/huggingfa
Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:

```python
-#
-# pip install git+https://github.com/huggingface/transformers.git
+# pip install transformers>=4.38.2
# pip install accelerate

import torch
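# The diff hunk stops at `import torch`; the rest of the snippet is unchanged
# context and is not shown in this commit. Below is only a minimal sketch of how
# such a `pipeline()` call typically looks. The checkpoint id
# "HuggingFaceH4/zephyr-7b-gemma-v0.1", the prompt, and the generation settings
# are assumptions for illustration, not part of this diff.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-gemma-v0.1",  # assumed model id
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Chat-style input; the model's chat template builds the prompt string.
messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters here are illustrative defaults, not the card's values.
outputs = pipe(
    prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)
print(outputs[0]["generated_text"])
```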