---
license: apache-2.0
---
# Model Card for Zamba v2 2.7B

Zamba2-2.7B is a hybrid model that combines state-space models and transformers. It broadly follows the [Zamba architecture](https://huggingface.co/Zyphra/Zamba-7B-v1), which consists of a Mamba backbone alternating with shared transformer blocks. Zamba2-2.7B introduces three major improvements over Zamba1:

1. Mamba1 blocks have been replaced with Mamba2 blocks.
2. Instead of a single shared attention block, we utilize two shared attention blocks, which are interleaved in an ABAB pattern through the network.
3. We apply a LoRA projector to each shared MLP block, allowing the network to specialize the MLPs at each shared layer with a minimal increase in total parameter count (sketched below).
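The sketch below is a minimal, hypothetical illustration of improvements 2 and 3: two shared transformer blocks reused in an ABAB pattern, each reuse paired with its own small LoRA adapter. Module names, sizes, and the residual form of the adapter are assumptions, not the actual Zamba2 implementation.

```python
import torch.nn as nn

# Illustrative sketch only: two shared blocks reused in an ABAB pattern, with a
# per-depth LoRA adapter so the shared weights can specialize at each position.
# Names and sizes are hypothetical, not the actual Zamba2 implementation.

class LoRAAdapter(nn.Module):
    """Low-rank projector that specializes a shared block at one depth."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return x + self.up(self.down(x))

dim, num_shared_positions = 2560, 6  # assumed sizes
shared_a = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)  # shared block "A"
shared_b = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)  # shared block "B"

# A-B-A-B-...: the same two blocks are reused through the depth of the network,
# while each position gets its own lightweight LoRA adapter.
shared_schedule = [shared_a if i % 2 == 0 else shared_b for i in range(num_shared_positions)]
depth_adapters = nn.ModuleList(LoRAAdapter(dim) for _ in range(num_shared_positions))
```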

Zamba2-2.7B was trained using next-token prediction and uses the Mistral v0.1 tokenizer. It was pre-trained on 3T tokens of text and code data sourced from open web datasets. Subsequently, in a second phase, the model was annealed on a mixture of 100B high-quality tokens.

Note: this is a temporary HuggingFace implementation of Zamba2-2.7B and is designed for specific use cases. It may not be fully compatible with all frameworks and tools intended to interface with HuggingFace models.

## Quick start

### Prerequisites

To run Zamba2-2.7B, clone and install Zyphra's fork of transformers:
1. `git clone https://github.com/Zyphra/transformers_zamba2.git`
2. `cd transformers_zamba2`
3. Install the repository: `pip install -e .`


You can run the model without the optimized Mamba kernels, but this is **not** recommended, as it results in significantly higher latency.

To run on CPU, specify `use_mamba_kernels=False` when loading the model with `AutoModelForCausalLM.from_pretrained`.
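
For example, a CPU-only load might look like the following; `use_mamba_kernels=False` is the flag described above, while the `device_map` and `torch_dtype` values are illustrative choices:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B")

# Disable the optimized Mamba kernels for CPU-only inference (slower than GPU).
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-2.7B",
    use_mamba_kernels=False,
    device_map="cpu",
    torch_dtype=torch.float32,
)
```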


### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-2.7B", device_map="cuda", torch_dtype=torch.bfloat16)

input_text = "A funny prompt would be "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

## Model Details [to update!]

Zamba utilizes a unique hybrid SSM architecture: a backbone of Mamba layers interspersed with a shared attention layer. The attention weights are shared across invocations to minimize the parameter cost of the model. We find that concatenating the original model embeddings with the input to this attention block improves performance, likely by better maintaining information across depth.
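
A minimal sketch of that embedding concatenation, assuming hypothetical dimensions and a standard attention module (not the actual Zamba code):

```python
import torch
import torch.nn as nn

# Illustrative sketch only: the original token embeddings are concatenated with the
# current hidden state before the shared attention block, as described above.
# Names and dimensions are hypothetical, not the actual Zamba implementation.

hidden_dim = 2560
shared_attention = nn.MultiheadAttention(embed_dim=2 * hidden_dim, num_heads=8, batch_first=True)

hidden_states = torch.randn(1, 16, hidden_dim)        # current backbone activations
original_embeddings = torch.randn(1, 16, hidden_dim)  # token embeddings from the input layer

attn_input = torch.cat([hidden_states, original_embeddings], dim=-1)  # [1, 16, 2 * hidden_dim]
attn_output, _ = shared_attention(attn_input, attn_input, attn_input)
```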


<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/IGK562oVTFSOQbpLavu7E.png" width="300" alt="Zamba architecture">
</center>


## Performance [to update!]

We find that Zamba performs significantly better than existing open models (those with open datasets and training details) at this scale. It performs slightly worse than the leading open-weight models at the 7B scale, with most of the difference coming from MMLU and reasoning evaluations. However, Zamba is trained on significantly fewer tokens than these models and is the most sample-efficient in terms of performance per training token.


<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/FG73iXpiDGSX_opbDJxKo.png" width="700" alt="Zamba performance">
</center>


Due to its SSM architecture, Zamba is extremely efficient at inference, substantially outperforming comparable 7B and 8B models in both inference latency and the memory cost of generation thanks to its much smaller KV cache.

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/cghYPnDbdzweT1b2RyiXA.png" width="400" alt="Zamba performance">
</center>

## Citation

If you find Zamba useful in your work, please cite it as:

```
@article{glorioso2024zamba,
  title={Zamba: A Compact 7B SSM Hybrid Model},
  author={Glorioso, Paolo and Anthony, Quentin and Tokpanov, Yury and Whittington, James and Pilault, Jonathan and Ibrahim, Adam and Millidge, Beren},
  journal={arXiv preprint arXiv:2405.16712},
  year={2024}
}
```

## Notice

Zamba2-2.7B is a pretrained base model and therefore does not have any moderation mechanism. In addition, one should not expect good chat performance, as this model was not fine-tuned for chat.