galatolo committed
Commit fbea1e5 • 1 Parent(s): 7aafba3

Update README.md

Files changed (1)

  1. README.md +102 -2
README.md CHANGED
@@ -12,7 +12,9 @@ pipeline_tag: text-generation
  
  # cerbero-7b Italian LLM 🚀
  
- > 🔥 Attention! The **new** and **more capable** version of **cerbero-7b** is now **available**!
+ > 🚀 **New Release**: **cerbero-7b-openchat**, our latest SOTA model based on [**openchat3.5**](https://github.com/imoneoi/openchat), delivering performance **on par with** or **superior** to **ChatGPT 3.5**!
+ 
+ > 🔥 The research paper unveiling the secrets behind **cerbero-7b** is now available on [arXiv](https://arxiv.org/abs/2311.15698)!
  
  > 📢 **cerbero-7b** is the first **100% Free** and Open Source **Italian Large Language Model** (LLM) ready to be used for **research** or **commercial applications**.
  
@@ -42,6 +44,7 @@ The Stanford Question Answering Dataset (SQuAD) in Italian (SQuAD-it) is used to
  
  | Model                   | F1 Score   | Exact Match (EM) |
  |-------------------------|------------|------------------|
+ | **cerbero-7b-openchat** | **74.09%** | **56.0%**        |
  | **cerbero-7b**          | **72.55%** | **55.6%**        |
  | Fauno                   | 44.46%     | 0.00%            |
  | Camoscio                | 37.42%     | 0.00%            |
@@ -53,6 +56,7 @@ EVALITA benchmarks assess the model's performance in tasks like toxicity detection
  
  | Model                   | Toxicity Detection | Irony Detection | Sentiment Analysis |
  |-------------------------|--------------------|-----------------|--------------------|
+ | **cerbero-7b-openchat** | **63.33%**         | **69.16%**      | **66.89%**         |
  | **cerbero-7b**          | **63.04%**         | **48.51%**      | **61.80%**         |
  | Fauno                   | 33.84%             | 39.17%          | 12.23%             |
  | Camoscio                | 38.18%             | 39.65%          | 13.33%             |
@@ -67,11 +71,107 @@ The name "Cerbero," inspired by the three-headed dog that guards the gates of the
  
  cerbero-7b builds upon the formidable **mistral-7b** as its base model. This choice ensures a robust foundation, leveraging the power and capabilities of a cutting-edge language model.
  
  - **Datasets: Cerbero Dataset** 📚
-   The Cerbero Dataset is a groundbreaking collection specifically curated to enhance the proficiency of cerbero-7b in understanding and generating Italian text. This dataset is a product of an innovative method combining dynamic self-chat mechanisms with advanced Large Language Model (LLM) technology. Refer to the [paper](README.md) for more details.
+   The Cerbero Dataset is a groundbreaking collection specifically curated to enhance the proficiency of cerbero-7b in understanding and generating Italian text. It is the product of an innovative method that combines dynamic self-chat mechanisms with advanced Large Language Model (LLM) technology (see the toy sketch after this list). Refer to the [paper](https://arxiv.org/abs/2311.15698) for more details.
  
  - **Licensing: Apache 2.0** 🕊️
    Released under the **permissive Apache 2.0 license**, cerbero-7b promotes openness and collaboration. This licensing choice empowers developers with the freedom for unrestricted usage, fostering a community-driven approach to advancing AI in Italy and beyond.
  
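+ To make the *dynamic self-chat* idea concrete, here is a deliberately simplified toy sketch (our illustration only; `generate_fn` is a stand-in for any LLM completion call, and the real data-generation pipeline is the one described in the paper):
+ 
+ ```python
+ def self_chat(generate_fn, seed_question, turns=4):
+     """Toy self-chat: one LLM plays both roles, alternating headers."""
+     dialogue = "Questa è una conversazione tra un umano ed un assistente AI.\n"
+     dialogue += f"[|Umano|] {seed_question}\n"
+     for i in range(turns):
+         header = "[|Assistente|]" if i % 2 == 0 else "[|Umano|]"
+         # Complete from the running dialogue, then append the new turn
+         dialogue += header + generate_fn(dialogue + header) + "\n"
+     return dialogue
+ ```
+ 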
+ ## Models 🧬
+ 
+ **cerbero-7b** is available in several flavors, each tailored to specific applications and use cases. The table below lists these versions along with their respective training datasets and base models:
+ 
+ | Model Name          | Training Dataset | Base Model  | Hugging Face Model | Llama.cpp and Quantized Model |
+ |---------------------|------------------|-------------|--------------------|-------------------------------|
+ | cerbero-7b          | Cerbero Dataset  | mistral-7b  | [link](https://huggingface.co/galatolo/cerbero-7b) | [link](https://huggingface.co/galatolo/cerbero-7b-gguf) |
+ | cerbero-7b-openchat | Cerbero Dataset  | openchat3.5 | [link](https://huggingface.co/galatolo/cerbero-7b-openchat) | [link](https://huggingface.co/galatolo/cerbero-7b-openchat-gguf) |
+ 
+ Each of these models brings its unique strengths to the table, making **cerbero-7b** a versatile tool for both research and commercial applications in the Italian-language AI domain.
+ 
+ We are committed to continuously enhancing **cerbero-7b**: we plan to keep training and releasing new models as the 7B state of the art advances, so that **cerbero-7b** remains at the forefront of Italian-language AI.
+ 
+ If you do not have enough RAM to fit the `float32` model (for example, when using Colab), we provide a `float16` version of each model, which you can load by passing the `revision="float16"` argument (add `torch_dtype=torch.float16` so the weights stay in half precision instead of being upcast to `float32` at load time):
+ 
+ ```python
+ model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16", torch_dtype=torch.float16)
+ ```
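+ 
+ To double-check what was actually loaded, 🤗 transformers exposes `get_memory_footprint()`; a quick, optional sanity check (illustrative numbers, assuming the 7B weights in `float16`):
+ 
+ ```python
+ print(model.dtype)  # torch.float16
+ print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 14 GB, about half the float32 footprint
+ ```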
+ 
+ ## Training Details 🚀
+ 
+ **cerbero-7b** is a **fully fine-tuned** LLM, distinguishing itself from LORA or QLORA fine-tunes.
+ The model is trained on expansive synthetic datasets generated through dynamic self-chat, using a large context window of **8192 tokens**.
+ 
+ ### Dataset Composition 📊
+ 
+ > 📢 Details on the **Cerbero Dataset** will be published shortly!
+ 
+ ### Training Setup ⚙️
+ 
+ **cerbero-7b** is trained on an NVIDIA DGX H100:
+ 
+ - **Hardware:** 8x H100 GPUs, each with 80 GB of VRAM. 🖥️
+ - **Parallelism:** DeepSpeed ZeRO stage 1 parallelism for optimal training efficiency (see the sketch below). ✨
+ 
+ The model has been trained for **1 epoch**, ensuring convergence of knowledge and proficiency in handling diverse linguistic tasks.
+ 
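+ For readers who want to reproduce a comparable setup, a minimal DeepSpeed ZeRO stage 1 configuration could look like the following sketch (our illustration with assumed batch sizes and precision, not the exact configuration used for training). Stage 1 shards only the optimizer states across the 8 GPUs, while parameters and gradients stay replicated:
+ 
+ ```python
+ import json
+ 
+ # Illustrative ZeRO stage 1 config: optimizer states are partitioned
+ # across data-parallel ranks; parameters and gradients are not.
+ ds_config = {
+     "train_micro_batch_size_per_gpu": 1,   # assumed value
+     "gradient_accumulation_steps": 8,      # assumed value
+     "bf16": {"enabled": True},             # assumed precision
+     "zero_optimization": {"stage": 1},
+ }
+ 
+ with open("ds_config.json", "w") as f:
+     json.dump(ds_config, f, indent=2)
+ ```
+ 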
+ ## Getting Started 🚀
+ 
+ You can load **cerbero-7b** (or **cerbero-7b-openchat**) using [🤗 transformers](https://huggingface.co/docs/transformers/index):
+ 
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b")
+ tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")
+ 
+ # cerbero-7b expects the [|Umano|]/[|Assistente|] conversation template.
+ # The Italian below reads: "This is a conversation between a human and an
+ # AI assistant. [|Human|] How can I tell an AI from a human? [|Assistant|]"
+ prompt = """Questa è una conversazione tra un umano ed un assistente AI.
+ [|Umano|] Come posso distinguere un AI da un umano?
+ [|Assistente|]"""
+ 
+ input_ids = tokenizer(prompt, return_tensors='pt').input_ids
+ with torch.no_grad():
+     output_ids = model.generate(input_ids, max_new_tokens=128)
+ 
+ generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
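+ 
+ For multi-turn conversations, a small helper keeps the template consistent. This is a sketch of ours based on the format above; `build_prompt` is not part of any library:
+ 
+ ```python
+ def build_prompt(turns):
+     """Format (speaker, text) pairs with the [|Umano|]/[|Assistente|] template."""
+     prompt = "Questa è una conversazione tra un umano ed un assistente AI.\n"
+     for speaker, text in turns:
+         prompt += f"[|{speaker}|] {text}\n"
+     return prompt + "[|Assistente|]"
+ 
+ prompt = build_prompt([("Umano", "Come posso distinguere un AI da un umano?")])
+ ```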
+ 
+ ### GGUF and llama.cpp
+ 
+ **cerbero-7b** is fully **compatible** with [llama.cpp](https://github.com/ggerganov/llama.cpp).
+ 
+ You can find the **original** and **quantized** versions of **cerbero-7b** in the `gguf` format [here](https://huggingface.co/galatolo/cerbero-7b-gguf/tree/main):
+ 
+ ```python
+ from llama_cpp import Llama
+ from huggingface_hub import hf_hub_download
+ 
+ llm = Llama(
+     model_path=hf_hub_download(
+         repo_id="galatolo/cerbero-7b-gguf",
+         filename="ggml-model-f16.gguf",
+     ),
+     n_ctx=4096,
+ )
+ 
+ output = llm("""Questa è una conversazione tra un umano ed un assistente AI.
+ [|Umano|] Come posso distinguere un AI da un umano?
+ [|Assistente|]""", max_tokens=128)
+ print(output["choices"][0]["text"])
+ ```
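+ 
+ When chaining multiple turns, you will usually want generation to stop before the model begins writing the next user message. The `stop` parameter of `llama-cpp-python` handles this; a usage sketch reusing `llm` and a prompt string like the one above:
+ 
+ ```python
+ output = llm(prompt, max_tokens=128, stop=["[|Umano|]"])  # halt at the next user turn
+ print(output["choices"][0]["text"])
+ ```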
+ 
+ ## Citation 📖
+ 
+ If you use **cerbero-7b** in your research, please cite our paper:
+ 
+ ```bibtex
+ @article{galatolo2023cerbero,
+   title={Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation},
+   author={Galatolo, Federico A and Cimino, Mario GCA},
+   journal={arXiv preprint arXiv:2311.15698},
+   year={2023}
+ }
+ ```
  ## Training Details 🚀
  
  **cerbero-7b** is a **fully fine-tuned** LLM, distinguishing itself from LORA or QLORA fine-tunes.