
EntiGraph CPT Model (based on Llama 3 8B)

Model Description

The EntiGraph CPT model is a continually pretrained version of the Llama 3 8B base model, produced with the Synthetic Continued Pretraining approach of Yang et al. (2024) using the EntiGraph synthetic data augmentation algorithm. The model was trained on a synthetic corpus generated from the QuALITY dataset so that it acquires domain-specific knowledge efficiently. The code used to train the model is available in the Synthetic Continued Pretraining GitHub repository.
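
As a quick orientation, a minimal loading sketch using the Hugging Face transformers library is shown below. The repository ID is a placeholder rather than the model's actual path, and device_map="auto" assumes the accelerate package is installed.

# Minimal loading sketch; the repository ID below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/entigraph-cpt-llama3-8b"  # placeholder: replace with the actual repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires the accelerate package
)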

Model Details

  • Developed by: Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
  • Model type: Causal Language Model
  • Language(s): English
  • License: Apache 2.0
  • Finetuned from model: Llama 3 8B

Uses

Intended Use

This model is intended for research purposes and for applications requiring domain-specific knowledge of the QuALITY dataset. It can be used for closed-book question answering, summarization, and other NLP tasks within the domain of the training data.
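
As an illustration of the intended closed-book QA use, the sketch below prompts the model directly, without providing the source article; the repository ID, prompt template, and question are placeholders, not a prescribed interface.

# Closed-book QA sketch: the model answers from knowledge acquired during
# continued pretraining; no source passage is included in the prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/entigraph-cpt-llama3-8b"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Question: <a question about one of the QuALITY articles>\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))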

Out-of-Scope Use

This model should not be used for generating factual information outside the scope of its training data or for any malicious purposes.

Training Details

Training Data

The model was trained on a 455M-token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.
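
At a high level, EntiGraph extracts salient entities from each source document and then prompts a language model to write text analyzing relations among those entities, grounded in the document. The sketch below illustrates only that two-step structure: llm_generate is a hypothetical helper standing in for whichever LLM API is used, and the prompts are simplified paraphrases, not the templates from the paper.

from itertools import combinations

def llm_generate(prompt: str) -> str:
    # Hypothetical helper: call whichever LLM API is used for synthetic data generation.
    raise NotImplementedError

def entigraph_texts(document: str) -> list[str]:
    # Step 1: extract salient entities mentioned in the document.
    entities = llm_generate(
        "List the salient entities (people, places, objects, concepts) in the "
        f"document below, one per line.\n\n{document}"
    ).splitlines()
    # Step 2: for each entity pair, synthesize text discussing their relation,
    # grounded in the source document.
    texts = []
    for e1, e2 in combinations(entities, 2):
        texts.append(llm_generate(
            f"Using only the document below, discuss the relationship between {e1} and {e2}.\n\n{document}"
        ))
    return texts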

Training Procedure

  • Pretraining: Continued pretraining on the EntiGraph synthetic corpus
  • Hyperparameters (summarized in the configuration sketch after this list):
    • Learning rate: 5e-06
    • Batch size: 16
    • Weight decay: 0.01
    • Warmup ratio: 0.05
    • Epochs: 2
    • RedPajama replay rate: 0.1
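
For reference, the listed hyperparameters map roughly onto a transformers TrainingArguments configuration as sketched below. This is an illustrative assumption, not the authors' training script; in particular, the batching scheme and the 10% RedPajama replay mixing are handled by the code in the GitHub repository.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entigraph-cpt-llama3-8b",  # placeholder output path
    learning_rate=5e-6,
    per_device_train_batch_size=16,        # assumes the batch size of 16 is per device
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    bf16=True,                             # assumption: bfloat16 mixed-precision training
)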

Evaluation

The model has been evaluated on the QuALITY question answering dataset, demonstrating improved performance in closed-book QA tasks compared to the base model.
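
A common way to run such a closed-book evaluation on QuALITY's multiple-choice questions is to score each answer option by its log-likelihood under the model and select the highest-scoring one. The sketch below illustrates that scoring approach; it is an assumption about the evaluation setup, not the exact protocol behind the reported results.

import torch

def score_option(model, tokenizer, question: str, option: str) -> float:
    # Average log-likelihood of an answer option given only the question (closed book).
    prompt = f"Question: {question}\nAnswer: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..L-1
    option_ids = full_ids[0, prompt_ids.shape[1]:]          # token ids of the answer option
    option_log_probs = log_probs[prompt_ids.shape[1] - 1:].gather(1, option_ids.unsqueeze(1))
    return option_log_probs.mean().item()

def predict(model, tokenizer, question: str, options: list[str]) -> int:
    # Index of the option with the highest average log-likelihood.
    return max(range(len(options)), key=lambda i: score_option(model, tokenizer, question, options[i]))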

Limitations and Biases

While the EntiGraph CPT model shows improved performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and the QuALITY dataset. Users should be aware of potential limitations in generating content outside its training domain.

Citation

If you use this model, please cite the original paper:

@misc{yang2024syntheticcontinuedpretraining,
      title={Synthetic continued pretraining}, 
      author={Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.07431},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.07431}, 
}

Ethical Considerations

Users of this model should be aware of the ethical implications of using large language models and ensure responsible use in applications.
