---
language:
- en
license: apache-2.0
tags:
- llama
- entigraph
- synthetic-continued-pretraining
---
|
|
|
# EntiGraph CPT Model (based on Llama 3 8B)
|
|
|
## Model Description
|
|
|
The EntiGraph CPT model is a continued-pretraining (CPT) adaptation of the Llama 3 8B base model, trained with the EntiGraph algorithm from [Synthetic Continued Pretraining (Yang et al., 2024)](https://arxiv.org/pdf/2409.07431). It was trained on a synthetic corpus generated from the QuALITY dataset so that it acquires domain-specific knowledge efficiently.

The code used to train the model is available in the [Synthetic Continued Pretraining GitHub repo](https://github.com/ZitongYang/Synthetic_Continued_Pretraining).
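A minimal usage sketch with the Hugging Face `transformers` library is shown below. The repository id is a placeholder (substitute this card's actual id), and the prompt is purely illustrative:

```python
# Minimal loading-and-generation sketch using Hugging Face transformers.
# NOTE: the repo id below is a placeholder; substitute this card's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/entigraph-cpt-llama-3-8b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 8B model on one GPU
    device_map="auto",           # requires the `accelerate` package
)

# Completion-style prompt: this is a base model, not a chat model.
prompt = "Question: What does the QuALITY benchmark evaluate?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```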
|
|
|
### Model Details
|
|
|
- **Developed by:** Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
- **Model type:** Causal language model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Llama 3 8B
|
|
|
## Uses

### Intended Use
|
|
|
This model is intended for research purposes and for applications requiring domain-specific knowledge related to the QuALITY dataset. It can be used for tasks such as closed-book question answering, summarization, and other NLP tasks within the domain of the training data.
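Because this is a base (completion-style) model rather than an instruction-tuned chat model, closed-book QA is most naturally posed as a completion. Below is a hedged sketch reusing the `model` and `tokenizer` from the usage example above; the prompt template is an illustrative assumption, not the paper's exact format:

```python
# Illustrative closed-book QA prompt for a base (non-chat) model.
# The template is an assumption, not the exact format used in the paper.
question = "Why does the narrator leave the colony at the end of the story?"
prompt = (
    "Answer the following question about the QuALITY articles.\n"
    f"Question: {question}\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer.strip())
```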
|
|
|
### Out-of-Scope Use
|
|
|
This model should not be used for generating factual information outside the scope of its training data or for any malicious purposes.
|
|
|
## Training Details

### Training Data
|
|
|
The model was trained on a 455M-token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.
|
|
|
### Training Procedure

- **Pretraining:** Continued pretraining on the EntiGraph synthetic corpus
- **Hyperparameters:**
  - Learning rate: 5e-06
  - Batch size: 16
  - Weight decay: 0.01
  - Warmup ratio: 0.05
  - Epochs: 2
  - RedPajama replay rate: 0.1
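As a rough illustration only (this is not the authors' training script; see the GitHub repo above for the real one), the hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows, with the replay rate implemented as a 10% sampling probability for RedPajama data:

```python
# Hedged sketch of the listed hyperparameters in Hugging Face terms.
# NOT the authors' script; the corpus path below is a placeholder.
from datasets import interleave_datasets, load_dataset
from transformers import TrainingArguments

# EntiGraph synthetic corpus (placeholder path, assumed to have a "text"
# field) plus RedPajama data for replay.
entigraph = load_dataset("json", data_files="entigraph_corpus.jsonl",
                         split="train")
redpajama = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                         split="train").select_columns(["text"])

# A 0.1 replay rate ~ sample RedPajama examples with probability 0.1.
train_ds = interleave_datasets([entigraph, redpajama],
                               probabilities=[0.9, 0.1], seed=42)

args = TrainingArguments(
    output_dir="entigraph-cpt-llama-3-8b",
    learning_rate=5e-6,
    per_device_train_batch_size=16,  # or smaller + gradient accumulation
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    bf16=True,  # assumption: mixed-precision training
)
```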
|
|
|
## Evaluation
|
|
|
The model has been evaluated on the QuALITY question-answering dataset, where it achieves higher closed-book QA accuracy than the base Llama 3 8B model.
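QuALITY questions are multiple-choice, so one common way to score a closed-book model is to pick the option with the highest average token log-likelihood under the model. Below is a hedged sketch of that approach (an illustration, not the paper's evaluation harness), again reusing the loaded `model` and `tokenizer`:

```python
# Length-normalized log-likelihood scoring for multiple-choice QA.
# Illustrative only; not the paper's exact evaluation harness.
import torch
import torch.nn.functional as F

def option_logprob(question: str, option: str) -> float:
    prompt = f"Question: {question}\nAnswer:"
    full = prompt + " " + option
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(full, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift targets by one).
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Average only over the option's tokens (assumes the prompt's tokens are
    # a prefix of the full sequence's tokens, which holds for typical text).
    return token_lp[0, prompt_len - 1:].mean().item()

question = "Why does the narrator return to Earth?"  # illustrative
options = ["He is exiled.", "He is promoted.", "He falls ill.", "He escapes."]
prediction = max(options, key=lambda o: option_logprob(question, o))
print(prediction)
```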
|
|
|
## Limitations and Biases
|
|
|
While the EntiGraph CPT model shows improved performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and the QuALITY dataset. Users should be aware of potential limitations in generating content outside its training domain.
|
|
|
## Citation

If you use this model, please cite the original paper:
|
|
|
```bibtex
@misc{yang2024syntheticcontinuedpretraining,
      title={Synthetic continued pretraining},
      author={Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.07431},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.07431},
}
```
|
|
|
## Ethical Considerations

Users of this model should be aware of the ethical implications of using large language models and ensure responsible use in applications.