sippycoder committed on
Commit
2e48d83
1 Parent(s): 527a0bf

initial commit

Files changed (1)
  1. README.md +90 -0
README.md CHANGED
@@ -1,3 +1,93 @@
  ---
  license: mit
+ language:
+ - en
  ---
+
+ # 🚀 Nucleus-22B-token-350B
+
+ **Nucleus-22B-token-350B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the MIT license.**
+
+ *1T-token model coming soon* 😊.
+
+ ## What about Nucleus-22B-token-350B?
+
+ * **It performs well compared to similar-size open-source models** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1), etc.), thanks to being trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+ * **It is made available under an MIT license**.
+ * **It was trained by a small team of four people passionate about open source.**
+
+ ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
+
+ # Model Card for Nucleus-22B-token-350B
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** NucleusAI;
+ - **Model type:** Causal decoder-only;
+ - **Language(s) (NLP):** English;
+ - **License:** MIT.
+
+ ### Model Source
+
+ - **Paper:** *coming soon*.
+
+ ## Uses
+
+ ### Direct Use
+
+ Research on large language models; as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbot, etc.).
+
+ ### Out-of-Scope Use
+
+ Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
+
+ ## Bias, Risks, and Limitations
+
+ Nucleus-22B-token-350B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
+
+ ### Recommendations
+
+ We recommend that users of Nucleus-22B-token-350B consider finetuning it for the specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
+
+ ## How to Get Started with the Model
+
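+ The snippet below is a minimal generation sketch with 🤗 `transformers`; the repo id is an assumption (replace it with the actual Hub id of this checkpoint), and loading the 22B parameters in `bfloat16` takes roughly 44 GB of accelerator memory, hence `device_map="auto"`.
+
+ ```python
+ # Minimal generation sketch; the repo id below is an assumption.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "NucleusAI/Nucleus-22B-token-350B"  # assumed Hub id
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ inputs = tokenizer("The nucleus of an atom is", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+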
+ ## Training Details
+
+ ### Training Data
+
+ Nucleus-22B-token-350B was trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.
+
+ | **Data source** | **Fraction** | **Tokens** | **Sources** |
+ |--------------------|--------------|------------|-----------------------------------|
+ | [RefinedWeb-English](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 75% | 200B | massive web crawl |
+ | Books | 7% | 21B | |
+ | Code | 7% | 21B | Big Code, CodeNet |
+ | Technical | 6% | 19B | arXiv |
+ | Math | 5% | 17B | Mathematica, Khan Academy |
+
+ The data was tokenized with a tokenizer similar to that of [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b).
+
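+ As a quick sanity check of token counts, the tokenizer shipped with the checkpoint can be loaded on its own (the repo id is again an assumption):
+
+ ```python
+ # Sketch: load the checkpoint's tokenizer (Llama-2-style, see above) and count tokens.
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("NucleusAI/Nucleus-22B-token-350B")  # assumed Hub id
+ ids = tokenizer("RefinedWeb is a large-scale, filtered web corpus.")["input_ids"]
+ print(len(ids), ids[:8])
+ ```
+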
+ ### Training Procedure
+
+ Nucleus-22B-token-350B was trained on 256 A100 80GB GPUs, using FSDP (Fully Sharded Data Parallel).
+
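+ The training code is not released; the sketch below only illustrates how a causal decoder can be wrapped with PyTorch FSDP in `bfloat16`, under the assumption of a Llama-style decoder block and a standard `torchrun` launch, and is not the actual Nucleus training setup.
+
+ ```python
+ # Illustrative FSDP wrapping sketch (not the actual Nucleus-22B training code).
+ # Assumes a torchrun launch so that rank/world-size environment variables are set.
+ import functools
+ import torch
+ import torch.distributed as dist
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
+ from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
+ from transformers import AutoModelForCausalLM
+ from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # assumed block class
+
+ dist.init_process_group("nccl")
+ torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
+
+ model = AutoModelForCausalLM.from_pretrained("NucleusAI/Nucleus-22B-token-350B")  # assumed Hub id
+
+ model = FSDP(
+     model,
+     auto_wrap_policy=functools.partial(
+         transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
+     ),
+     mixed_precision=MixedPrecision(
+         param_dtype=torch.bfloat16,
+         reduce_dtype=torch.bfloat16,
+         buffer_dtype=torch.bfloat16,
+     ),
+     device_id=torch.cuda.current_device(),
+ )
+ optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
+ ```
+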
+ #### Training Hyperparameters
+
+ | **Hyperparameter** | **Value** | **Comment** |
+ |--------------------|------------|-------------------------------------------|
+ | Precision | `bfloat16` | |
+ | Optimizer | AdamW | |
+ | Learning rate | 2e-4 | 8B tokens warm-up, cosine decay to 1e-5 |
+ | Weight decay | 1e-1 | |
+ | Batch size | 2048 | constant |
+ | Context length | 2048 | constant |
+
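+ For reference, the learning-rate row corresponds to a linear warm-up followed by cosine decay; the sketch below derives approximate step counts from the batch size and context length in the table (2048 × 2048 ≈ 4.2M tokens per step) and is an illustration, not the released schedule code.
+
+ ```python
+ # Sketch of the schedule from the table: linear warm-up over 8B tokens,
+ # then cosine decay from 2e-4 down to 1e-5 over the remaining tokens.
+ import math
+
+ TOKENS_PER_STEP = 2048 * 2048                      # batch size x context length
+ WARMUP_STEPS = 8_000_000_000 // TOKENS_PER_STEP    # ~1.9k steps for 8B tokens
+ TOTAL_STEPS = 350_000_000_000 // TOKENS_PER_STEP   # ~83k steps for 350B tokens
+ PEAK_LR, MIN_LR = 2e-4, 1e-5
+
+ def lr_at(step: int) -> float:
+     if step < WARMUP_STEPS:
+         return PEAK_LR * step / WARMUP_STEPS
+     progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
+     return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
+
+ print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))  # 0.0, 2e-4, 1e-5
+ ```
+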
+ #### Speeds, Sizes, Times
+
+ Training happened in early August 2023 and took about two weeks.