license: mit
language:
- en
π Nucleus-22B-token-500B
Nucleus-22B-token-500B is a 22B parameters causal decoder-only model built by Nucleus.AI and trained on 500B tokens of RefinedWeb along with curated corpora. It is made available under the MIT license.
1T-token model coming soon π.
What about Nucleus-22B-token-500B?
- It performs well compared to similar-size open-source models (e.g., MPT-7B, StableLM, RedPajama etc.), thanks to being trained on 500B tokens of RefinedWeb enhanced with curated corpora. See the OpenLLM Leaderboard.
- It is made available under an MIT license.
- It is trained by a small team of four passionate for Open Source
β οΈ This is a raw, pretrained model, which should be further finetuned for most usecases.
Model Card for Nucleus-22B-token-500B
Model Details
Model Description
- Developed by: NucleusAI;
- Model type: Causal decoder-only;
- Language(s) (NLP): English;
- License: MIT.
Model Source
- Paper: coming soon.
Uses
Direct Use
Research on large language models; as a foundation for further specialization and finetuning for specific usecases (e.g., summarization, text generation, chatbot, etc.)
Out-of-Scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
Bias, Risks, and Limitations
Nucleus-22B-token-500B is trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
Recommendations
We recommend users of Nucleus-22B-token-500B to consider finetuning it for the specific set of tasks of interest, and for guardrails and appropriate precautions to be taken for any production use.
How to Get Started with the Mode
Training Details
Training Data
Nucleus-22B-token-500B was trained on 500B tokens of RefinedWeb, along with other corpora.
Data source | Fraction | Tokens | Sources |
---|---|---|---|
RefinedWeb-English | 75% | 200B | massive web crawl |
Books | 7% | 21B | |
Code | 7% | 21B | Big Code, CodeNet |
Technical | 6% | 19B | arXiv |
Math | 5% | 17B | Mathematica, Khan Academy |
The data was tokenized with the tokenizer similar to Llama-7B.
Training Procedure
Nucleus-22B-token-500B was trained on 256 A100 80GB GPUs, using a FSDP
Training Hyperparameters
Hyperparameter | Value | Comment |
---|---|---|
Precision | bfloat16 |
|
Optimizer | AdamW | |
Learning rate | 2e-4 | 8B tokens warm-up, cosine decay to 1.e-5 |
Weight decay | 1e-1 | |
Batch size | 2048 | constant |
Context length | 2048 | constant |
Speeds, Sizes, Times
Training happened in early August 2023 and took about two weeks.