sippycoder committed on
Commit
2e48d83
1 Parent(s): 527a0bf

initial commit

Files changed (1)
  1. README.md +90 -0
README.md CHANGED
@@ -1,3 +1,93 @@
  ---
  license: mit
+ language:
+ - en
  ---
+
+ # 🚀 Nucleus-22B-token-350B
+
+ **Nucleus-22B-token-350B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the MIT license.**
+
+ *1T-token model coming soon* 😊.
+
+ ## What about Nucleus-22B-token-350B?
+
+ * **It performs well compared to similar-size open-source models** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1), etc.), thanks to being trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+ * **It is made available under an MIT license**.
+ * **It was trained by a small team of four people passionate about open source.**
+
+ ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
+
+ # Model Card for Nucleus-22B-token-350B
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** NucleusAI;
+ - **Model type:** Causal decoder-only;
+ - **Language(s) (NLP):** English;
+ - **License:** MIT.
+
+ ### Model Source
+
+ - **Paper:** *coming soon*.
+
+ ## Uses
+
+ ### Direct Use
+
+ Research on large language models; as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbot, etc.).
+
+ ### Out-of-Scope Use
+
+ Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
+
+ ## Bias, Risks, and Limitations
+
+ Nucleus-22B-token-350B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
+
+ ### Recommendations
+
+ We recommend that users of Nucleus-22B-token-350B consider finetuning it for the specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
+
+ ## How to Get Started with the Model
+
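+ The snippet below is a minimal generation sketch with 🤗 `transformers`; the repo id is an assumption (replace it with the actual Hub id of this checkpoint), and loading the 22B parameters in `bfloat16` takes roughly 44 GB of accelerator memory, hence `device_map="auto"`.
+
+ ```python
+ # Minimal generation sketch; the repo id below is an assumption.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "NucleusAI/Nucleus-22B-token-350B"  # assumed Hub id
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ inputs = tokenizer("The nucleus of an atom is", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+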
+ ## Training Details
+
+ ### Training Data
+
+ Nucleus-22B-token-350B was trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.
+
+ | **Data source** | **Fraction** | **Tokens** | **Sources** |
+ |--------------------|--------------|------------|-----------------------------------|
+ | [RefinedWeb-English](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 75% | 200B | massive web crawl |
+ | Books | 7% | 21B | |
+ | Code | 7% | 21B | Big Code, CodeNet |
+ | Technical | 6% | 19B | arXiv |
+ | Math | 5% | 17B | Mathematica, Khan Academy |
+
+ The data was tokenized with a tokenizer similar to that of [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b).
+
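+ As a quick sanity check of token counts, the tokenizer shipped with the checkpoint can be loaded on its own (the repo id is again an assumption):
+
+ ```python
+ # Sketch: load the checkpoint's tokenizer (Llama-2-style, see above) and count tokens.
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("NucleusAI/Nucleus-22B-token-350B")  # assumed Hub id
+ ids = tokenizer("RefinedWeb is a large-scale, filtered web corpus.")["input_ids"]
+ print(len(ids), ids[:8])
+ ```
+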
+ ### Training Procedure
+
+ Nucleus-22B-token-350B was trained on 256 A100 80GB GPUs, using FSDP (Fully Sharded Data Parallel).
+
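+ The training code is not released; the sketch below only illustrates how a causal decoder can be wrapped with PyTorch FSDP in `bfloat16`, under the assumption of a Llama-style decoder block and a standard `torchrun` launch, and is not the actual Nucleus training setup.
+
+ ```python
+ # Illustrative FSDP wrapping sketch (not the actual Nucleus-22B training code).
+ # Assumes a torchrun launch so that rank/world-size environment variables are set.
+ import functools
+ import torch
+ import torch.distributed as dist
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
+ from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
+ from transformers import AutoModelForCausalLM
+ from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # assumed block class
+
+ dist.init_process_group("nccl")
+ torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
+
+ model = AutoModelForCausalLM.from_pretrained("NucleusAI/Nucleus-22B-token-350B")  # assumed Hub id
+
+ model = FSDP(
+     model,
+     auto_wrap_policy=functools.partial(
+         transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
+     ),
+     mixed_precision=MixedPrecision(
+         param_dtype=torch.bfloat16,
+         reduce_dtype=torch.bfloat16,
+         buffer_dtype=torch.bfloat16,
+     ),
+     device_id=torch.cuda.current_device(),
+ )
+ optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
+ ```
+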
+ #### Training Hyperparameters
+
+ | **Hyperparameter** | **Value** | **Comment** |
+ |--------------------|------------|-------------------------------------------|
+ | Precision | `bfloat16` | |
+ | Optimizer | AdamW | |
+ | Learning rate | 2e-4 | 8B tokens warm-up, cosine decay to 1e-5 |
+ | Weight decay | 1e-1 | |
+ | Batch size | 2048 | constant |
+ | Context length | 2048 | constant |
+
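+ For reference, the learning-rate row corresponds to a linear warm-up followed by cosine decay; the sketch below derives approximate step counts from the batch size and context length in the table (2048 × 2048 ≈ 4.2M tokens per step) and is an illustration, not the released schedule code.
+
+ ```python
+ # Sketch of the schedule from the table: linear warm-up over 8B tokens,
+ # then cosine decay from 2e-4 down to 1e-5 over the remaining tokens.
+ import math
+
+ TOKENS_PER_STEP = 2048 * 2048                      # batch size x context length
+ WARMUP_STEPS = 8_000_000_000 // TOKENS_PER_STEP    # ~1.9k steps for 8B tokens
+ TOTAL_STEPS = 350_000_000_000 // TOKENS_PER_STEP   # ~83k steps for 350B tokens
+ PEAK_LR, MIN_LR = 2e-4, 1e-5
+
+ def lr_at(step: int) -> float:
+     if step < WARMUP_STEPS:
+         return PEAK_LR * step / WARMUP_STEPS
+     progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
+     return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
+
+ print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))  # 0.0, 2e-4, 1e-5
+ ```
+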
+ #### Speeds, Sizes, Times
+
+ Training happened in early August 2023 and took about two weeks.