---
language:
- en
license: apache-2.0
tags:
- llama
- entigraph
- synthetic-continued-pretraining
---

# EntiGraph CPT Model (based on Llama 3 8B)

## Model Description

The EntiGraph CPT model is a continued-pretraining (CPT) adaptation of the Llama 3 8B base model, trained with the [Synthetic Continued Pretraining approach of Yang et al. (2024)](https://arxiv.org/pdf/2409.07431) using the EntiGraph synthetic data augmentation algorithm. The model was trained on a synthetic corpus generated from the QuALITY dataset in order to acquire that domain's knowledge efficiently.

The code used to train the model is available in the [Synthetic Continued Pretraining GitHub repo](https://github.com/ZitongYang/Synthetic_Continued_Pretraining).

### Model Details

- **Developed by:** Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
- **Model type:** Causal language model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Llama 3 8B

## Uses

### Intended Use

The model is intended for research purposes and for applications that require domain-specific knowledge related to the QuALITY dataset. It can be used for tasks such as closed-book question answering, summarization, and other NLP tasks within the domain of the training data (a minimal inference sketch is provided at the end of this card).

### Out-of-Scope Use

This model should not be used to generate factual claims outside the scope of its training data or for any malicious purpose.

## Training Details

### Training Data

The model was trained on a 455M-token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.

### Training Procedure

- **Pretraining:** Continued pretraining on the EntiGraph synthetic corpus
- **Hyperparameters** (an illustrative mapping to `TrainingArguments` appears at the end of this card):
  - Learning rate: 5e-06
  - Batch size: 16
  - Weight decay: 0.01
  - Warmup ratio: 0.05
  - Epochs: 2
  - RedPajama replay rate: 0.1

## Evaluation

The model has been evaluated on the QuALITY question answering dataset and shows improved closed-book QA accuracy compared to the base model.

## Limitations and Biases

While the EntiGraph CPT model improves performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and in the QuALITY dataset. Users should be aware of its limitations when generating content outside its training domain.

## Citation

If you use this model, please cite the original paper:

```
@misc{yang2024syntheticcontinuedpretraining,
      title={Synthetic continued pretraining},
      author={Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.07431},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.07431},
}
```

## Ethical Considerations

Users of this model should be aware of the ethical implications of large language models and ensure responsible use in downstream applications.
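## How to Use (Illustrative Sketch)

The snippet below is a minimal inference sketch using the Hugging Face `transformers` library, assuming a standard causal-LM checkpoint. The model identifier, the example question, and the generation settings are placeholders rather than part of the original release; because this is a base (non-instruct) model, a plain completion-style prompt is used instead of a chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifier: replace with the actual repository id or local path of this model.
MODEL_ID = "path/to/entigraph-cpt-llama-3-8b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)

# Closed-book QA style prompt. The question below is a generic placeholder;
# ask about articles covered by the QuALITY corpus.
prompt = "Question: What motivates the main character in the story?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```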
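## Training Hyperparameters as a `TrainingArguments` Sketch

For orientation only, the hyperparameters listed under Training Procedure map roughly onto the `transformers.TrainingArguments` configuration below. This is not the authors' training script (see the [GitHub repo](https://github.com/ZitongYang/Synthetic_Continued_Pretraining) for that); the batch-size split, scheduler, precision, and the handling of the 0.1 RedPajama replay rate are assumptions noted in the comments.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entigraph-cpt-llama-3-8b",  # hypothetical output directory
    learning_rate=5e-6,
    per_device_train_batch_size=1,          # assumption: the global batch size of 16 is
    gradient_accumulation_steps=16,         # reached via accumulation and/or multiple GPUs
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    lr_scheduler_type="cosine",             # assumption; the actual schedule may differ
    bf16=True,                              # assumption: bfloat16 mixed-precision training
)
# The 0.1 RedPajama replay rate describes the data mixture (roughly 10% generic pretraining
# text mixed into the EntiGraph corpus) and is implemented in the dataset pipeline,
# not through TrainingArguments.
```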