---
library_name: transformers
tags: []
---

# Falcon-11B-Base-V1.1
The Falcon-11B-Base-V1.1 Large Language Model (LLM) is a pretrained generative text model with 11.1 billion parameters.

## Model Specifications
- Base model (not instruction-tuned)
- Flash Attention 2
- Untied LM head and word embeddings (untying adds roughly 300M parameters on top of the 10.8B tied configuration; see the config check below)
- 11.1B parameters
- RoPE theta: 500,042
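
These settings can be verified from the published config. A minimal sketch, assuming the remote config exposes the usual Hugging Face attribute names (`tie_word_embeddings`, `rope_theta`); the custom modeling code may name them differently:

```python
from transformers import AutoConfig

# Load the remote config; trust_remote_code is required for the custom architecture.
config = AutoConfig.from_pretrained("ruliadai/falcon-base-v1.1", trust_remote_code=True)

# Attribute names below follow common HF conventions and are assumptions here;
# getattr with a default keeps the check safe if the custom config differs.
print("tie_word_embeddings:", getattr(config, "tie_word_embeddings", "n/a"))  # expect False (untied)
print("rope_theta:", getattr(config, "rope_theta", "n/a"))                    # expect 500,042
```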



### Inference Script
Load the model with `trust_remote_code=True` so that our modeling code is used. The example below uses the most basic hyperparameters.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
base_model_id = "ruliadai/falcon-base-v1.1"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
)
tokenizer.pad_token = tokenizer.eos_token

# Run inference
while True:
    prompt = input("Instruction: ")
    model_input = tokenizer(
        prompt, return_tensors="pt", return_token_type_ids=False
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **model_input,
            max_new_tokens=800,
            do_sample=False,          # greedy decoding
            repetition_penalty=1.15,
            use_cache=True,
        )
    print(tokenizer.decode(output[0]))
```
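
The tokenizer is created with `padding_side="left"` so that batched prompts generate correctly (right padding would insert pad tokens between a short prompt and its continuation). A small batched-generation sketch, reusing the `model` and `tokenizer` objects from the script above:

```python
# Batched generation sketch; `model` and `tokenizer` come from the script above.
prompts = ["The capital of France is", "Water boils at a temperature of"]
batch = tokenizer(
    prompts, return_tensors="pt", padding=True, return_token_type_ids=False
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```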

### How to run inference

Set up and activate a virtual environment (venv or conda):

```bash
python3 -m venv env \
  && source env/bin/activate
```

Install torch:
```bash
pip3 install torch torchvision torchaudio
```
Note that you may need to install torch differently depending on your system requirements and drivers (see https://pytorch.org/get-started/locally/).


Install requirements:
```bash
pip3 install --upgrade --force-reinstall transformers accelerate flash-attn hf_transfer
```
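
Before launching, it can help to confirm the environment actually supports the settings used by the script (`bfloat16` and Flash Attention 2 both require a reasonably recent NVIDIA GPU). A quick sanity-check sketch:

```python
import torch

# bfloat16 and Flash Attention 2 both need a CUDA GPU (Ampere or newer).
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())

try:
    import flash_attn  # noqa: F401
    print("flash-attn import: OK")
except ImportError as err:
    print("flash-attn import failed:", err)
```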

Run the script (the inference example above, saved as `inference.py`):

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 HF_TOKEN=<YOUR_HF_TOKEN> python3 inference.py
```


If flash-attn is broken, reinstall it from a clean cache:
```bash
pip3 uninstall flash-attn
pip3 cache purge
pip3 install flash-attn
```
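
If flash-attn cannot be installed at all on your hardware, a possible workaround (an assumption on our part; the remote modeling code may still require flash-attn) is to load the model with a different attention backend:

```python
# Fallback sketch: PyTorch SDPA attention instead of Flash Attention 2.
# Whether this works depends on the remote modeling code.
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="sdpa",  # or "eager"
)
```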


## Model Evaluation

### Measured Benchmarks (by Ruliad)

| MODEL            | AVERAGE | MMLU (5-shot) | TQA (0-shot) | ARC (25-shot) | GSM8K (5-shot) | HS (10-shot) | WG (5-shot) |
| ---------------- | ------- | ------------- | ------------ | ------------- | -------------- | ------------ | ----------- |
| Falcon-Base-v1.1 | 0.6440  | 0.5683        | 0.5263       | 0.6041        | 0.5542         | 0.8280       | 0.7806      |
| Llama-3-8B       | 0.6300  | 0.6513        | 0.4385       | 0.5904        | 0.5034         | 0.8223       | 0.7751      |
| Mistral-7B-v0.1  | 0.6130  | 0.6233        | 0.4258       | 0.6220        | 0.3859         | 0.8332       | 0.7861      |

TQA = TruthfulQA (mc2), HS = HellaSwag, WG = Winogrande.

### Evaluation Replication

**Install Eval Harness**

To install the `lm-eval` package from the GitHub repository, run:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install hf_transfer accelerate transformers flash_attn
```
**Benchmarking**

To evaluate our model:

Evaluating MMLU, GSM8K, and WG (5-shot):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 HF_TOKEN=<YOUR_HF_TOKEN> accelerate launch -m lm_eval --model hf-auto \
    --model_args pretrained=ruliadai/falcon-base-v1.1,trust_remote_code=True \
    --tasks mmlu,gsm8k,winogrande \
    --device cuda:0 \
    --num_fewshot 5 \
    --batch_size 1
```

Evaluating TQA (0-shot):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 HF_TOKEN=<YOUR_HF_TOKEN> accelerate launch -m lm_eval --model hf-auto \
    --model_args pretrained=ruliadai/falcon-base-v1.1,trust_remote_code=True \
    --tasks truthfulqa_mc2 \
    --device cuda:0 \
    --batch_size 1
```

Evaluating HS (10-shot):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 HF_TOKEN=<YOUR_HF_TOKEN> accelerate launch -m lm_eval --model hf-auto \
    --model_args pretrained=ruliadai/falcon-base-v1.1,trust_remote_code=True \
    --tasks hellaswag \
    --device cuda:0 \
    --num_fewshot 10 \
    --batch_size 1
```

Evaluating ARC (25-shot):
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 HF_TOKEN=<YOUR_HF_TOKEN> accelerate launch -m lm_eval --model hf-auto \
    --model_args pretrained=ruliadai/falcon-base-v1.1,trust_remote_code=True \
    --tasks arc_challenge \
    --device cuda:0 \
    --num_fewshot 25 \
    --batch_size 1
```
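
To reproduce the AVERAGE column of the table above, add `--output_path results/` to each command and average the primary metric per task. A minimal parsing sketch (the JSON layout and metric key names vary across lm-eval versions, so the names below are assumptions in the style of v0.4 output):

```python
import glob
import json

# Assumes each run was launched with `--output_path results/`, which writes one
# results_*.json per run. Metric keys like "acc,none" follow lm-eval v0.4-style
# output and may differ in other versions.
scores = []
for path in glob.glob("results/**/results_*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    for task, metrics in data.get("results", {}).items():
        # Take the first accuracy-like metric reported for the task.
        for key, value in metrics.items():
            if key.startswith(("acc", "exact_match")) and isinstance(value, float):
                scores.append(value)
                print(f"{task}: {key} = {value:.4f}")
                break

if scores:
    print(f"average over {len(scores)} task scores: {sum(scores) / len(scores):.4f}")
```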