amanrangapur committed
Commit 43f5c98
1 Parent(s): 627e96b
Update README.md

README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
-- allenai/
+- allenai/dolmino-mix-1124
 language:
 - en
 ---
@@ -16,7 +16,7 @@ language:
 OLMo2 7B November 2024 is an updated version of the original [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) model, rocking a ____ point increase in ____, among other evaluation improvements, from an improved version of the Dolma dataset and staged training.

 OLMo is a series of **O**pen **L**anguage **Mo**dels designed to enable the science of language models.
-The OLMo models are trained on the [
+The OLMo models are trained on the [Dolmino](https://huggingface.co/datasets/allenai/dolmino-mix-1124) dataset.
 We release all code, checkpoints, logs (coming soon), and details involved in training these models.


@@ -27,6 +27,26 @@ The core models released in this batch are the following:
 | [OLMo2-7B July 2024](https://huggingface.co/allenai/OLMo2-7B-1124) | 4 Trillion | 32 | 4096 | 32 | 4096 |
 | [OLMo2-13B July 2024](https://huggingface.co/allenai/OLMo2-13B-1124) | 5 Trillion | 40 | 5120 | 42 | 4096 |

+## Inference
+
+Proceed as usual with HuggingFace:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo2-7B-1124")
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo2-7B-1124")
+message = ["Language modeling is "]
+inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
+# optional: move inputs and model to CUDA
+# inputs = {k: v.to('cuda') for k,v in inputs.items()}
+# olmo = olmo.to('cuda')
+response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
+print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
+>> 'Language modeling is the first step to build natural language generation...'
+```
+
+Or, you can make this slightly faster by quantizing the model, e.g. `AutoModelForCausalLM.from_pretrained("allenai/OLMo2-7B-1124", torch_dtype=torch.float16, load_in_8bit=True)` (requires `bitsandbytes`).
+The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as `inputs.input_ids.to('cuda')` to avoid potential issues.
+
 We have released checkpoints for these models, for every 1000 training steps.
 The naming convention is `stepXXX-tokensYYYB`.

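A note on the quantized loading tip in the Inference section added above: the following is a minimal end-to-end sketch, assuming `bitsandbytes` is installed and a CUDA device is available, and simply combining the calls already shown in this diff (prompt and sampling settings reused from the snippet above).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in 8-bit precision (requires the bitsandbytes package and a CUDA device).
olmo = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo2-7B-1124",
    torch_dtype=torch.float16,
    load_in_8bit=True,
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo2-7B-1124")

inputs = tokenizer(["Language modeling is "], return_tensors="pt", return_token_type_ids=False)
# The quantized model is more sensitive to input placement, so pass input_ids moved to CUDA.
response = olmo.generate(
    inputs.input_ids.to("cuda"),
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```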
@@ -42,6 +62,20 @@ out = list_repo_refs("allenai/OLMo2-7B-1124")
 branches = [b.name for b in out.branches]
 ```

+### Fine-tuning
+Model fine-tuning can be done from the final checkpoint (the `main` revision of this model) or many intermediate checkpoints. Two recipes for tuning are available.
+1. Fine-tune with the OLMo repository:
+```bash
+torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
+  --data.paths=[{path_to_data}/input_ids.npy] \
+  --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
+  --load_path={path_to_checkpoint} \
+  --reset_trainer_state
+```
+For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?tab=readme-ov-file#fine-tuning).
+
+2. Further fine-tuning support is being developed in AI2's Open Instruct repository. Details are [here](https://github.com/allenai/open-instruct).
+
 ### Model Description

 - **Developed by:** Allen Institute for AI (Ai2)
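The checkpoint branches listed by `list_repo_refs` above follow the `stepXXX-tokensYYYB` naming convention, and fine-tuning "from intermediate checkpoints" amounts to loading one of those branches. A minimal sketch of that step with `transformers` (the branch name `step1000-tokens5B` is purely illustrative, not a name confirmed by this commit; pick a real one from `branches`):

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

# Enumerate the released checkpoint branches (naming convention: stepXXX-tokensYYYB).
out = list_repo_refs("allenai/OLMo2-7B-1124")
branches = [b.name for b in out.branches]
print(branches[:5])

# Load a specific intermediate checkpoint by passing its branch name as `revision`.
# "step1000-tokens5B" is a hypothetical example; substitute an entry from `branches`.
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo2-7B-1124", revision="step1000-tokens5B")
```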
@@ -65,42 +99,6 @@ branches = [b.name for b in out.branches]
 - **W&B Logs:** [pretraining](https://wandb.ai/ai2-llm/OLMo-7B/groups/OLMo-1.7-7B), [annealing](https://wandb.ai/ai2-llm/OLMo-7B/groups/OLMo-1.7-7B-anneal)


-## Uses
-
-### Inference
-
-Proceed as usual with HuggingFace:
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo2-7B-1124")
-tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo2-7B-1124")
-message = ["Language modeling is "]
-inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
-# optional verifying cuda
-# inputs = {k: v.to('cuda') for k,v in inputs.items()}
-# olmo = olmo.to('cuda')
-response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
-print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
->> 'Language modeling is the first step to build natural language generation...'
-```
-
-Or, you can make this slightly faster by quantizing the model, e.g. `AutoModelForCausalLM.from_pretrained("allenai/OLMo2-7B-1124", torch_dtype=torch.float16, load_in_8bit=True)` (requires `bitsandbytes`).
-The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as `inputs.input_ids.to('cuda')` to avoid potential issues.
-
-### Fine-tuning
-Model fine-tuning can be done from the final checkpoint (the `main` revision of this model) or many intermediate checkpoints. Two recipes for tuning are available.
-1. Fine-tune with the OLMo repository:
-```bash
-torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
-  --data.paths=[{path_to_data}/input_ids.npy] \
-  --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
-  --load_path={path_to_checkpoint} \
-  --reset_trainer_state
-```
-For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?tab=readme-ov-file#fine-tuning).
-
-2. Further fine-tuning support is being developing in AI2's Open Instruct repository. Details are [here](https://github.com/allenai/open-instruct).
-
 <!-- TODO -->
 ## Evaluation

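The OLMo-repository fine-tuning command in the diff above points the trainer at `input_ids.npy` and `label_mask.npy` files. The sketch below only illustrates the general idea of producing such arrays with `numpy`; the exact dtypes, shapes, and packing the OLMo trainer expects are not specified in this commit, so treat every detail here as an assumption and defer to the linked GitHub readme for the authoritative format.

```python
# Illustrative only: the real OLMo data format may differ (see the OLMo GitHub readme).
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo2-7B-1124")

texts = [
    "Example fine-tuning document one.",
    "Example fine-tuning document two.",
]
seq_len = 32  # hypothetical sequence length for this sketch

input_ids = np.zeros((len(texts), seq_len), dtype=np.int32)  # dtype is an assumption
label_mask = np.zeros((len(texts), seq_len), dtype=bool)     # True where loss should be computed

for row, text in enumerate(texts):
    ids = tokenizer(text)["input_ids"][:seq_len]
    input_ids[row, : len(ids)] = ids
    label_mask[row, : len(ids)] = True

np.save("input_ids.npy", input_ids)
np.save("label_mask.npy", label_mask)
```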