---
language:
- en
license: apache-2.0
tags:
- llama
- entigraph
- synthetic-continued-pretraining
---
# EntiGraph CPT Model (based on Llama 3 8B)
## Model Description
The EntiGraph CPT model is obtained by continued pretraining of the Llama 3 8B base model, following the [Synthetic Continued Pretraining](https://arxiv.org/pdf/2409.07431) approach of Yang et al. (2024) with the EntiGraph synthetic data augmentation algorithm. The model was trained on a synthetic corpus generated from the QuALITY dataset so that it efficiently acquires the domain-specific knowledge contained in those documents.
The code used to train the model is available at the [Synthetic Continued Pretraining GitHub repo](https://github.com/ZitongYang/Synthetic_Continued_Pretraining).
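For reference, the checkpoint should load like any other Llama-family causal LM through the Hugging Face `transformers` API. The snippet below is a minimal sketch under that assumption; `MODEL_ID` is a placeholder for this repository's id on the Hub.

```python
# Minimal loading sketch, assuming the standard transformers causal-LM API.
# MODEL_ID is a placeholder; replace it with this repository's id on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<this-repo-id>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; drop this argument to load on CPU
)
```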
### Model Details
- **Developed by:** Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Llama 3 8B
## Uses
### Intended Use
This model is intended for research purposes and applications requiring domain-specific knowledge related to the QuALITY dataset. It can be used for tasks such as closed-book question answering, summarization, and other NLP tasks within the domain of the training data.
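In the closed-book setting, the model is prompted with a question but not the source article, and must answer from the knowledge internalized during continued pretraining. The sketch below reuses `model` and `tokenizer` from the loading example above; the question text is purely illustrative.

```python
# Closed-book QA sketch: the model answers from parametric knowledge,
# without the source article in the prompt. The question is illustrative only.
prompt = (
    "Answer the following question about the QuALITY articles.\n"
    "Question: What does the protagonist decide to do at the end of the story?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer.strip())
```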
### Out-of-Scope Use
This model should not be used for generating factual information outside the scope of its training data or for any malicious purposes.
## Training Details
### Training Data
The model was trained on a 455M token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.
### Training Procedure
- **Pretraining:** Continued pretraining on the EntiGraph synthetic corpus
- **Hyperparameters** (a configuration sketch follows this list):
- Learning rate: 5e-06
- Batch size: 16
- Weight decay: 0.01
  - Warmup ratio: 0.05
- Epochs: 2
- RedPajama replay rate: 0.1
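
For orientation, the hyperparameters above roughly correspond to the following `TrainingArguments` sketch. This is not the training script itself (see the GitHub repository linked above for that); the per-device batch size / gradient accumulation split, scheduler, and precision are assumptions.

```python
# Hypothetical sketch of the hyperparameters above as TrainingArguments.
# The actual training code lives in the Synthetic_Continued_Pretraining repo;
# the batch-size split, scheduler, and precision below are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entigraph-cpt",
    learning_rate=5e-6,
    per_device_train_batch_size=1,   # assumption: 1 x 16 accumulation = effective batch size 16
    gradient_accumulation_steps=16,
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    lr_scheduler_type="cosine",      # assumption; the scheduler is not stated above
    bf16=True,                       # assumption; common for Llama 3 8B training
)
```

The RedPajama replay rate refers to mixing a fraction of generic RedPajama text back into the training stream to mitigate forgetting of general-domain knowledge; that mixing is handled by the data pipeline rather than by `TrainingArguments`.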
## Evaluation
The model has been evaluated on the QuALITY question answering dataset, demonstrating improved performance in closed-book QA tasks compared to the base model.
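QuALITY is a multiple-choice benchmark, so a common way to score it in the closed-book setting is to rank the answer options by the model's likelihood given the question alone. The sketch below illustrates that length-normalized log-likelihood approach; it is not the evaluation harness used in the paper, and it assumes the prompt tokenization is a prefix of the full-sequence tokenization, which holds approximately for Llama tokenizers.

```python
# Hypothetical scoring sketch for closed-book multiple-choice QA:
# pick the option whose tokens receive the highest average log-likelihood.
# This is illustrative only, not the evaluation harness from the paper.
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, question: str, option: str) -> float:
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probabilities, shifted so position i predicts token i + 1.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    # Average log-likelihood over the option tokens only.
    return token_lp[prompt_len - 1:].mean().item()

def predict(model, tokenizer, question: str, options: list[str]) -> int:
    scores = [score_option(model, tokenizer, question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])
```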
## Limitations and Biases
While the EntiGraph CPT model shows improved performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and the QuALITY dataset. Users should be aware of potential limitations in generating content outside its training domain.
## Citation
If you use this model, please cite the original paper:
```
@misc{yang2024syntheticcontinuedpretraining,
  title         = {Synthetic continued pretraining},
  author        = {Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
  year          = {2024},
  eprint        = {2409.07431},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2409.07431},
}
```
## Ethical Considerations
Users of this model should be aware of the ethical implications of using large language models and ensure responsible use in applications.