Update README.md
README.md
CHANGED
# cerbero-7b Italian LLM

> **New Release**: **cerbero-7b-openchat**, our latest SOTA model based on [**openchat3.5**](https://github.com/imoneoi/openchat), delivering performance **on par with** or **superior** to **ChatGPT 3.5**!

> 🔥 The research paper unveiling the secrets behind **cerbero-7b** is now available on [arXiv](https://arxiv.org/abs/2311.15698)!

> 📢 **cerbero-7b** is the first **100% Free** and Open Source **Italian Large Language Model** (LLM) ready to be used for **research** or **commercial applications**.
The Stanford Question Answering Dataset (SQuAD) in Italian (SQuAD-it) is used to evaluate the model's question-answering capabilities:

| Model                   | F1 Score   | Exact Match (EM) |
|-------------------------|------------|------------------|
| **cerbero-7b-openchat** | **74.09%** | **56.0%**        |
| **cerbero-7b**          | **72.55%** | **55.6%**        |
| Fauno                   | 44.46%     | 0.00%            |
| Camoscio                | 37.42%     | 0.00%            |
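
For reference, the scores above follow the standard SQuAD-style metrics: **Exact Match** checks whether the normalized prediction equals the normalized gold answer, while **F1** measures token overlap between the two. A simplified sketch of these metrics (not the exact SQuAD-it evaluation harness used here) looks like this:

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing answers."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```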

EVALITA benchmarks assess the model's performance on tasks such as toxicity detection, irony detection, and sentiment analysis:

| Model                   | Toxicity Detection | Irony Detection | Sentiment Analysis |
|-------------------------|--------------------|-----------------|--------------------|
| **cerbero-7b-openchat** | **63.33%**         | **69.16%**      | **66.89%**         |
| **cerbero-7b**          | **63.04%**         | **48.51%**      | **61.80%**         |
| Fauno                   | 33.84%             | 39.17%          | 12.23%             |
| Camoscio                | 38.18%             | 39.65%          | 13.33%             |

cerbero-7b builds upon the formidable **mistral-7b** as its base model. This choice ensures a robust foundation, leveraging the power and capabilities of a cutting-edge language model.

- **Datasets: Cerbero Dataset**
  The Cerbero Dataset is a collection specifically curated to enhance cerbero-7b's proficiency in understanding and generating Italian text. It is built with a method that combines dynamic self-chat with an advanced Large Language Model (LLM); refer to the [paper](https://arxiv.org/abs/2311.15698) for more details.

- **Licensing: Apache 2.0**
  Released under the **permissive Apache 2.0 license**, cerbero-7b promotes openness and collaboration. This licensing choice gives developers the freedom of unrestricted use, fostering a community-driven approach to advancing AI in Italy and beyond.

## Models 🧬

**cerbero-7b** is available in several flavors, each tailored for specific applications and use cases. The table below lists these versions along with their training datasets and base models:

| Model Name          | Training Dataset | Base Model  | Huggingface Model                                           | Llama.cpp and Quantized Model                                     |
|---------------------|------------------|-------------|-------------------------------------------------------------|-------------------------------------------------------------------|
| cerbero-7b          | Cerbero Dataset  | mistral-7b  | [link](https://huggingface.co/galatolo/cerbero-7b)          | [link](https://huggingface.co/galatolo/cerbero-7b-gguf)           |
| cerbero-7b-openchat | Cerbero Dataset  | openchat3.5 | [link](https://huggingface.co/galatolo/cerbero-7b-openchat) | [link](https://huggingface.co/galatolo/cerbero-7b-openchat-gguf)  |

Each of these models brings its own strengths, making **cerbero-7b** a versatile tool for both research and commercial applications in the Italian-language AI domain.

We are committed to continuously improving **cerbero-7b**, and we plan to keep training and releasing new models as the 7B state of the art advances, so that **cerbero-7b** remains among the most capable and efficient open models for Italian.

If you do not have enough RAM to fit the `float32` model (for example, when using Colab), each model is also provided in a `float16` version that you can load with the `revision="float16"` argument:

```python
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")
```
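
Note that `revision="float16"` selects the half-precision copy of the weights on the Hub. If you also want the model to stay in half precision once loaded (rather than being cast back to `float32`), you can additionally pass the standard `torch_dtype` argument; a minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the float16 checkpoint and keep the weights in half precision in memory.
model = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    revision="float16",
    torch_dtype=torch.float16,
)
```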

## Training Details

**cerbero-7b** is a **fully fine-tuned** LLM, distinguishing itself from LoRA or QLoRA fine-tunes.
The model is an Italian Large Language Model (LLM) trained on synthetic datasets generated through dynamic self-chat, using a large context window of **8192 tokens**.

### Dataset Composition

> 📢 Details on the **Cerbero Dataset** will be updated shortly!
### Training Setup ⚙️

**cerbero-7b** is trained on an NVIDIA DGX H100:

- **Hardware:** 8x H100 GPUs, each with 80 GB of VRAM. 🖥️
- **Parallelism:** DeepSpeed ZeRO stage 1 parallelism for training efficiency (a minimal configuration sketch follows this list). ✨
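
For readers unfamiliar with it, ZeRO stage 1 shards only the optimizer states across the GPUs while keeping parameters and gradients replicated. The snippet below is an illustrative DeepSpeed configuration of that kind; the batch size, precision, and optimizer settings are assumptions for the sketch, not the actual values used to train cerbero-7b:

```python
import deepspeed

# Illustrative ZeRO stage 1 config: only optimizer states are partitioned across GPUs.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumed value, for illustration only
    "gradient_accumulation_steps": 8,      # assumed value, for illustration only
    "bf16": {"enabled": True},             # assumed precision, for illustration only
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},  # assumed optimizer settings
}

# With a torch.nn.Module `model`, training would then be wrapped as:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```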

The model has been trained for **1 epoch**, ensuring convergence and proficiency across diverse linguistic tasks.
## Getting Started

You can load **cerbero-7b** (or **cerbero-7b-openchat**) using [🤗transformers](https://huggingface.co/docs/transformers/index):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b")
tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

# The prompt is plain text: a short preamble followed by [|Umano|] / [|Assistente|] turns.
prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```
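
Since the prompt format is plain text with `[|Umano|]` and `[|Assistente|]` markers, a longer conversation can be built by simply concatenating turns. The helper below is a small illustrative sketch that follows the format shown above; it is not an official API of the model:

```python
def build_prompt(turns: list[tuple[str, str]]) -> str:
    """Assemble a cerbero-7b prompt from (speaker, text) turns.

    Speakers are "Umano" or "Assistente"; the prompt ends with an open
    [|Assistente|] tag so the model generates the next assistant reply.
    """
    prompt = "Questa è una conversazione tra un umano ed un assistente AI.\n"
    for speaker, text in turns:
        prompt += f"[|{speaker}|] {text}\n"
    return prompt + "[|Assistente|]"

# Example: a two-turn exchange followed by a new user question.
prompt = build_prompt([
    ("Umano", "Come posso distinguere un AI da un umano?"),
    ("Assistente", "Puoi fare domande su esperienze personali o emozioni."),
    ("Umano", "Puoi farmi un esempio concreto?"),
])
```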

### GGUF and llama.cpp

**cerbero-7b** is fully **compatible** with [llama.cpp](https://github.com/ggerganov/llama.cpp).

You can find the **original** and **quantized** versions of **cerbero-7b** in the `gguf` format [here](https://huggingface.co/galatolo/cerbero-7b-gguf/tree/main):

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the float16 gguf checkpoint from the Hub and load it with llama.cpp.
llm = Llama(
    model_path=hf_hub_download(
        repo_id="galatolo/cerbero-7b-gguf",
        filename="ggml-model-f16.gguf",
    ),
    n_ctx=4086,
)

# Calling the Llama object runs a completion directly on the raw prompt string.
output = llm("""Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]""", max_tokens=128)
print(output["choices"][0]["text"])
```
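
Loading one of the quantized variants works the same way; only the `filename` passed to `hf_hub_download` changes. The filename below is hypothetical, so check the [repository file list](https://huggingface.co/galatolo/cerbero-7b-gguf/tree/main) for the quantizations that are actually published:

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Hypothetical quantized filename: replace it with one listed in the gguf repository.
llm = Llama(
    model_path=hf_hub_download(
        repo_id="galatolo/cerbero-7b-gguf",
        filename="ggml-model-q4_0.gguf",
    ),
    n_ctx=4086,
)
```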

## Citation

If you use **cerbero-7b** in your research, please cite our paper:

```bibtex
@article{galatolo2023cerbero,
  title={Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation},
  author={Galatolo, Federico A and Cimino, Mario GCA},
  journal={arXiv preprint arXiv:2311.15698},
  year={2023}
}
```