---
language:
- en
license: apache-2.0
tags:
- llama
- entigraph
- synthetic-continued-pretraining
---
|
|
|
# EntiGraph CPT Model (based on Llama 3 8B)
|
|
|
## Model Description
|
|
|
The EntiGraph CPT model is a continued-pretraining (CPT) adaptation of the Llama 3 8B base model, trained with the EntiGraph algorithm from [Synthetic Continued Pretraining (Yang et al., 2024)](https://arxiv.org/pdf/2409.07431). It was trained on a synthetic corpus generated from the QuALITY dataset so that it acquires domain-specific knowledge efficiently.

The code used to train the model is available in the [Synthetic Continued Pretraining GitHub repo](https://github.com/ZitongYang/Synthetic_Continued_Pretraining).
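A minimal usage sketch with the Hugging Face `transformers` library is shown below. The repository id is a placeholder (substitute this card's actual id), and the prompt is purely illustrative:

```python
# Minimal loading-and-generation sketch using Hugging Face transformers.
# NOTE: the repo id below is a placeholder; substitute this card's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/entigraph-cpt-llama-3-8b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 8B model on one GPU
    device_map="auto",           # requires the `accelerate` package
)

# Completion-style prompt: this is a base model, not a chat model.
prompt = "Question: What does the QuALITY benchmark evaluate?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```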
|
|
|
### Model Details
|
|
|
- **Developed by:** Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
- **Model type:** Causal language model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Llama 3 8B
|
|
|
## Uses

### Intended Use
|
|
|
This model is intended for research purposes and for applications requiring domain-specific knowledge related to the QuALITY dataset. It can be used for tasks such as closed-book question answering, summarization, and other NLP tasks within the domain of the training data.
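Because this is a base (completion-style) model rather than an instruction-tuned chat model, closed-book QA is most naturally posed as a completion. Below is a hedged sketch reusing the `model` and `tokenizer` from the usage example above; the prompt template is an illustrative assumption, not the paper's exact format:

```python
# Illustrative closed-book QA prompt for a base (non-chat) model.
# The template is an assumption, not the exact format used in the paper.
question = "Why does the narrator leave the colony at the end of the story?"
prompt = (
    "Answer the following question about the QuALITY articles.\n"
    f"Question: {question}\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer.strip())
```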
|
|
|
### Out-of-Scope Use
|
|
|
This model should not be used for generating factual information outside the scope of its training data or for any malicious purposes.
|
|
|
## Training Details

### Training Data
|
|
|
The model was trained on a 455M-token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.
|
|
|
### Training Procedure

- **Pretraining:** Continued pretraining on the EntiGraph synthetic corpus
- **Hyperparameters:**
  - Learning rate: 5e-06
  - Batch size: 16
  - Weight decay: 0.01
  - Warmup ratio: 0.05
  - Epochs: 2
  - RedPajama replay rate: 0.1
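As a rough illustration only (this is not the authors' training script; see the GitHub repo above for the real one), the hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows, with the replay rate implemented as a 10% sampling probability for RedPajama data:

```python
# Hedged sketch of the listed hyperparameters in Hugging Face terms.
# NOT the authors' script; the corpus path below is a placeholder.
from datasets import interleave_datasets, load_dataset
from transformers import TrainingArguments

# EntiGraph synthetic corpus (placeholder path, assumed to have a "text"
# field) plus RedPajama data for replay.
entigraph = load_dataset("json", data_files="entigraph_corpus.jsonl",
                         split="train")
redpajama = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                         split="train").select_columns(["text"])

# A 0.1 replay rate ~ sample RedPajama examples with probability 0.1.
train_ds = interleave_datasets([entigraph, redpajama],
                               probabilities=[0.9, 0.1], seed=42)

args = TrainingArguments(
    output_dir="entigraph-cpt-llama-3-8b",
    learning_rate=5e-6,
    per_device_train_batch_size=16,  # or smaller + gradient accumulation
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    bf16=True,  # assumption: mixed-precision training
)
```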
|
|
|
## Evaluation
|
|
|
The model has been evaluated on the QuALITY question-answering dataset, where it achieves higher closed-book QA accuracy than the base Llama 3 8B model.
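QuALITY questions are multiple-choice, so one common way to score a closed-book model is to pick the option with the highest average token log-likelihood under the model. Below is a hedged sketch of that approach (an illustration, not the paper's evaluation harness), again reusing the loaded `model` and `tokenizer`:

```python
# Length-normalized log-likelihood scoring for multiple-choice QA.
# Illustrative only; not the paper's exact evaluation harness.
import torch
import torch.nn.functional as F

def option_logprob(question: str, option: str) -> float:
    prompt = f"Question: {question}\nAnswer:"
    full = prompt + " " + option
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(full, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift targets by one).
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Average only over the option's tokens (assumes the prompt's tokens are
    # a prefix of the full sequence's tokens, which holds for typical text).
    return token_lp[0, prompt_len - 1:].mean().item()

question = "Why does the narrator return to Earth?"  # illustrative
options = ["He is exiled.", "He is promoted.", "He falls ill.", "He escapes."]
prediction = max(options, key=lambda o: option_logprob(question, o))
print(prediction)
```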
|
|
|
## Limitations and Biases
|
|
|
While the EntiGraph CPT model shows improved performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and the QuALITY dataset. Users should be aware of potential limitations in generating content outside its training domain.
|
|
|
## Citation

If you use this model, please cite the original paper:
|
|
|
```bibtex
@misc{yang2024syntheticcontinuedpretraining,
      title={Synthetic continued pretraining},
      author={Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.07431},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.07431},
}
```
|
|
|
## Ethical Considerations

Users of this model should be aware of the ethical implications of using large language models and ensure responsible use in applications.