---
license: mit
language:
- en
---
# 🚀 Nucleus-22B-token-500B
**Nucleus-22B-token-500B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the MIT license.**
*1T-token model coming soon* 😊.
## Why use Nucleus-22B-token-500B?
* **It performs well compared to similar-size open-source models** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1) etc.), thanks to being trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
* **It is made available under an MIT license**.
* **It was trained by a small team of four people passionate about open source.**
⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
# Model Card for Nucleus-22B-token-500B
## Model Details
### Model Description
- **Developed by:** NucleusAI;
- **Model type:** Causal decoder-only;
- **Language(s) (NLP):** English;
- **License:** MIT.
### Model Source
- **Paper:** *coming soon*.
## Uses
### Direct Use
Research on large language models; use as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbots).
### Out-of-Scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
## Bias, Risks, and Limitations
Nucleus-22B-token-500B was trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it was trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
### Recommendations
We recommend that users of Nucleus-22B-token-500B finetune it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
## How to Get Started with the Model
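Below is a minimal sketch of running the model with 🤗 `transformers`. The repository id `NucleusAI/nucleus-22B-token-500B` is an assumption and may differ from the published checkpoint name; the model is loaded through the standard `text-generation` pipeline.

```python
# Minimal sketch (assumption: the checkpoint is published as "NucleusAI/nucleus-22B-token-500B"
# and is loadable through the standard AutoModelForCausalLM / pipeline interface).
import torch
import transformers
from transformers import AutoTokenizer

model_id = "NucleusAI/nucleus-22B-token-500B"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",           # shard the 22B parameters across available GPUs
)

sequences = pipeline(
    "Nucleus-22B is a causal decoder-only model. It can be used for",
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```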
## Training Details
### Training Data
Nucleus-22B-token-500B was trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.
| **Data source** | **Fraction** | **Tokens** | **Sources** |
|--------------------|--------------|------------|-----------------------------------|
| [RefinedWeb-English](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 75% | 200B | massive web crawl |
| Books | 7% | 21B | |
| Code | 7% | 21B | Big Code, CodeNet |
| Technical | 6% | 19B | arXiv |
| Math | 5% | 17B | Mathematica, Khan Academy |
The data was tokenized with a tokenizer similar to that of [Llama-7B](https://huggingface.co/meta-llama/Llama-2-7b).
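As an illustration only, the sketch below inspects how such a Llama-style SentencePiece BPE tokenizer splits text; the repository id is the same hypothetical one as above, and the exact vocabulary shipped with the model may differ.

```python
# Sketch: inspect tokenization with the model's (assumed Llama-style) tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NucleusAI/nucleus-22B-token-500B")  # hypothetical id

text = "Nucleus-22B-token-500B was trained on 500B tokens of RefinedWeb."
ids = tokenizer(text)["input_ids"]

print(len(ids), "tokens")                          # token count for this sentence
print(tokenizer.convert_ids_to_tokens(ids)[:10])   # first few sub-word pieces
```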
### Training Procedure
Nucleus-22B-token-500B was trained on 256 A100 80GB GPUs, using Fully Sharded Data Parallel (FSDP).
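The training code is not published; the snippet below is only an illustrative sketch of wrapping a decoder-only model in PyTorch FSDP with `bfloat16` mixed precision (the stand-in module and launch setup are assumptions, not the actual Nucleus training stack).

```python
# Illustrative FSDP sketch (not the actual Nucleus training code).
# Assumes torchrun launches one process per GPU.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in module; in practice this would be the 22B-parameter decoder-only transformer.
model = torch.nn.Sequential(
    torch.nn.Embedding(32_000, 1024),
    torch.nn.Linear(1024, 32_000),
).cuda()

model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,   # matches the bfloat16 precision in the table below
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)

# Optimizer settings mirror the hyperparameter table below.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-1)
```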
#### Training Hyperparameters
| **Hyperparameter** | **Value** | **Comment** |
|--------------------|------------|-------------------------------------------|
| Precision | `bfloat16` | |
| Optimizer | AdamW | |
| Learning rate      | 2e-4       | 8B tokens warm-up, cosine decay to 1e-5   |
| Weight decay | 1e-1 | |
| Batch size | 2048 | constant |
| Context length | 2048 | constant |
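As a worked example of the schedule above: with a batch of 2048 sequences of 2048 tokens, one optimizer step consumes about 4.2M tokens, so the 8B-token warm-up corresponds to roughly 1,900 steps and the full 500B-token run to roughly 119,000 steps. The sketch below assumes a linear warm-up followed by cosine decay from the 2e-4 peak to 1e-5; these step counts are derived from the table, not published by the authors.

```python
# Sketch of the schedule implied by the table (assumed linear warm-up, then cosine decay
# from 2e-4 to 1e-5). Step counts are back-of-the-envelope: 2048 x 2048 ≈ 4.2M tokens/step.
import math

PEAK_LR, MIN_LR = 2e-4, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 1_900, 119_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:                       # linear warm-up to the peak LR
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # cosine decay from PEAK_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS - 1))  # ramp-up, peak, ≈ MIN_LR
```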
#### Speeds, Sizes, Times
Training ran in early August 2023 and took about two weeks.