---
license: mit
language:
- en
---

# 🚀 Nucleus-22B-token-350B

**Nucleus-22B-token-350B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the MIT license.**

*A 1T-token model is coming soon* 😊.

## What about Nucleus-22B-token-350B?

* **It performs well compared to similar-size open-source models** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1)), thanks to being trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
* **It is made available under the MIT license.**
* **It was trained by a small team of four people passionate about open source.**

⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**

# Model Card for Nucleus-22B-token-350B

## Model Details

### Model Description

- **Developed by:** NucleusAI;
- **Model type:** Causal decoder-only;
- **Language(s) (NLP):** English;
- **License:** MIT.

### Model Source

- **Paper:** *coming soon*.

## Uses

### Direct Use

Research on large language models; as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbots).

### Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

## Bias, Risks, and Limitations

Nucleus-22B-token-350B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

### Recommendations

We recommend that users of Nucleus-22B-token-350B finetune it for their specific set of tasks of interest, and apply guardrails and appropriate precautions for any production use.

## How to Get Started with the Model
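
The snippet below is a minimal usage sketch with 🤗 Transformers. The repository id and generation settings are assumptions and may need adjusting to the actual published checkpoint.

```python
# Minimal usage sketch (assumption: the weights are published at the repo id
# below and load with AutoModelForCausalLM; adjust the path as needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NucleusAI/Nucleus-22B-token-350B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",           # shard the 22B weights across available GPUs
)

inputs = tokenizer("The nucleus of an atom is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At 22B parameters the weights take roughly 44 GB in `bfloat16`, so multiple GPUs or CPU offloading will typically be needed.
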
## Training Details

### Training Data

Nucleus-22B-token-350B was trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.

| **Data source**    | **Fraction** | **Tokens** | **Sources**               |
|--------------------|--------------|------------|---------------------------|
| [RefinedWeb-English](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 75% | 200B | massive web crawl |
| Books              | 7%           | 21B        |                           |
| Code               | 7%           | 21B        | Big Code, CodeNet         |
| Technical          | 6%           | 19B        | arXiv                     |
| Math               | 5%           | 17B        | Mathematica, Khan Academy |

The data was tokenized with a tokenizer similar to that of [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b).

### Training Procedure

Nucleus-22B-token-350B was trained on 256 A100 80GB GPUs, using Fully Sharded Data Parallel (FSDP).
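
For illustration only, the sketch below shows one way to set up a causal LM with PyTorch Fully Sharded Data Parallel (FSDP) and `bfloat16` mixed precision. The actual wrapping policy, sharding strategy, and launcher used for Nucleus-22B-token-350B are not published, so every detail here is an assumption.

```python
# Illustrative FSDP sketch (assumptions throughout); launch with torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoConfig, AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Hypothetical repo id; the real architecture hyperparameters are not listed in this card.
config = AutoConfig.from_pretrained("NucleusAI/Nucleus-22B-token-350B")
model = AutoModelForCausalLM.from_config(config)

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)

# The optimizer is built after wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
```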

#### Training Hyperparameters

| **Hyperparameter** | **Value**  | **Comment**                              |
|--------------------|------------|------------------------------------------|
| Precision          | `bfloat16` |                                          |
| Optimizer          | AdamW      |                                          |
| Learning rate      | 2e-4       | 8B tokens warm-up, cosine decay to 1e-5  |
| Weight decay       | 1e-1       |                                          |
| Batch size         | 2048       | constant                                 |
| Context length     | 2048       | constant                                 |
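
As a worked reading of the table: with a batch of 2048 sequences of 2048 tokens, each step processes about 4.2M tokens, so the 8B-token warm-up corresponds to roughly 2,000 steps and 350B tokens to roughly 83,000 steps. The sketch below wires these values into a standard PyTorch optimizer and scheduler; the step counts are back-of-the-envelope estimates, not published numbers.

```python
# Sketch of the stated recipe: AdamW, lr 2e-4, weight decay 0.1,
# linear warm-up then cosine decay to 1e-5. Step counts are estimates.
import torch

model = torch.nn.Linear(2048, 2048)  # stand-in module; the real model has 22B parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)

warmup_steps, total_steps = 2_000, 83_000  # ≈8B and ≈350B tokens at ~4.2M tokens/step
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-2, total_iters=warmup_steps),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_steps - warmup_steps, eta_min=1e-5),
    ],
    milestones=[warmup_steps],
)
```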

#### Speeds, Sizes, Times

Training happened in early August 2023 and took about two weeks.