|
--- |
|
license: apache-2.0 |
|
tags: |
|
- merge |
|
- mergekit |
|
- lazymergekit |
|
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc |
|
language: |
|
- en |
|
base_model: |
|
- princeton-nlp/Llama-3-8B-ProLong-512k-Instruct |
|
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc |
|
library_name: transformers |
|
--- |
|
Disclaimer: This model merge is experimental and has not been thoroughly tested. Expect further versions, with improvements, in the coming days.
|
|
|
# ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K
|
|
|
**ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** is a versatile merged model that combines the long-context capabilities of Princeton's ProLong model with the rich, immersive roleplay features of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the `mergekit` library using a della configuration tuned to balance efficiency, roleplay fidelity, and long-context performance, with the goal of supporting coherent, extended interactions.
|
|
|
## Model Components and Sources |
|
|
|
This model is a merge of the following: |
|
|
|
1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)** |
|
*Developed by Princeton NLP, ProLong extends long-context capability up to 512,000 tokens and is optimized for detailed, extended conversations. Continued long-context training equips it for high-quality retrieval and coherent responses even over very long inputs.*
|
|
|
2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)** |
|
*This model contributes immersive roleplay and storytelling, drawing on creative datasets to produce compelling interactions. Role-specific configurations support vibrant, in-depth character simulations.*
|
|
|
## 🧩 Configuration and Merge Details |
|
|
|
The merge was executed with MergeKit using the YAML configuration below, which is crafted to preserve each component's strengths while optimizing performance in complex, long-context scenarios.
|
|
|
### YAML Configuration |
|
```yaml
models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5      # Emphasizes roleplay elements without overshadowing the base
      density: 0.6     # Retains 60% of the significant parameters from the roleplay model

merge_method: della    # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05        # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0          # Harmonizes parameter influence from both models
  normalize: true      # Ensures stable alignment of merged parameters
  int8_mask: true      # Enhances memory efficiency during the merge itself

dtype: float32
out_dtype: bfloat16    # Balances precision and efficiency for versatile deployments
```
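
To reproduce the merge from this configuration, MergeKit can be driven from Python. The sketch below follows the usage example in the MergeKit README and assumes the YAML above has been saved as `config.yaml`; option names may differ slightly between MergeKit versions, and the output path is an arbitrary choice.

```python
# Minimal sketch of running the merge configuration above with MergeKit's Python API.
# Assumes the YAML above is saved as ./config.yaml; option names follow the MergeKit
# README and may vary across versions.
import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./Llama-3-8B-ProLong-SAO-Roleplay-512K",  # output directory (arbitrary name)
    options=MergeOptions(
        cuda=torch.cuda.is_available(),  # use a GPU for the merge if one is available
        copy_tokenizer=True,             # copy the base model's tokenizer into the output
        lazy_unpickle=False,
        low_cpu_memory=False,
    ),
)
```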
|
|
|
## Intended Usage |
|
|
|
The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** model is designed for the following use cases (a minimal loading sketch follows the list):
|
|
|
- **Extended Conversations**: With a 512K token context window, it is ideal for scenarios requiring sustained, cohesive dialogue. |
|
- **Roleplay and Storytelling**: The integration of SAO-themed and roleplay-focused datasets creates a rich and immersive storytelling experience, perfect for applications in interactive fiction, virtual characters, and creative writing. |
|
- **General Instruction Following**: Fine-tuned on UltraChat, the model maintains a helpful and instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation. |
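
Below is a minimal, illustrative loading and generation sketch using 🤗 Transformers. The repository id is assumed from this model card's title, and the sampling settings are arbitrary examples rather than recommended defaults.

```python
# Illustrative only: load the merged model and run a short roleplay-style generation.
# The repo id is assumed from this model card; sampling settings are arbitrary examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merge's out_dtype
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an immersive roleplay narrator."},
    {"role": "user", "content": "Open the scene: a duelist steps onto the 75th floor of Aincrad."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```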
|
|
|
--- |
|
|
|
## 📚 Dataset Details for ProLong 8B Training |
|
|
|
The **ProLong-8B** model was rigorously trained with a carefully curated dataset, ensuring versatility across long-context scenarios. |
|
|
|
### Continued Long-context Training |
|
1. **Data Composition**: |
|
- **30% Code Repositories**: This includes diverse sources to enhance technical comprehension and code-related dialogue. |
|
- **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities. |
|
- **3% Textbooks**: Technical textbooks for specialized and academic context handling. |
|
- **37% ShortMix**: A balanced blend of various online sources for comprehensive topic coverage. |
|
- **ShortMix Components**: |
|
- 27% FineWeb-Edu |
|
- 27% FineWeb |
|
- 11% Tulu-v2 |
|
- 11% StackExchange |
|
- 8% Wikipedia |
|
- 8% OpenWebMath |
|
- 8% ArXiv |
|
|
|
2. **Training Stages**: |
|
- **Stage 1 (64K Context Window)**: |
|
- Utilized code repositories, books, and textbooks. |
|
- Training Steps: 20B tokens over approximately 2.2K H100 GPU hours. |
|
- **Stage 2 (512K Context Window)**: |
|
- Code repositories (50% at 512K length and 50% at 64K length). |
|
- Books (17% at 512K and 83% at 64K). |
|
- Textbooks primarily focused on a 512K length. |
|
- Training Steps: 20B tokens over approximately 12.2K H100 GPU hours. |
|
|
|
3. **Optimization and Model Configuration** (an illustrative optimizer setup is sketched after this list):
|
- **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95. |
|
- **Learning Rate**: |
|
- Stage 1: Initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6. |
|
- **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2. |
|
- **Attention Mechanism**: Full attention with cross-document attention masking to effectively handle extensive context windows. |
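
For illustration only, the following sketch reproduces the optimizer and learning-rate schedule described above with standard PyTorch and Transformers utilities. It is not the actual ProLong training code: the stand-in model and step count are placeholders, and the stock cosine scheduler decays to 0 rather than to the 1e-6 floor noted above.

```python
# Illustrative sketch of the optimizer/schedule described above (not ProLong's code).
# A tiny dummy model stands in for the 8B model; total_steps is a placeholder, and
# get_cosine_schedule_with_warmup decays to 0 rather than the 1e-6 floor noted above.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(16, 16)  # stand-in for the actual 8B model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # Stage 1 peak learning rate
    betas=(0.9, 0.95),   # β₁ and β₂ as listed above
    weight_decay=0.1,
)

total_steps = 5_000      # placeholder: actual value = token budget / batch size in tokens
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup
    num_training_steps=total_steps,
)
```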
|
|
|
### Supervised Fine-tuning (SFT) |
|
1. **Data Source**: |
|
- **UltraChat**: A robust dataset with 1B tokens specifically selected to enhance conversational depth and responsiveness. |
|
2. **Optimization**: |
|
- **Optimizer**: AdamW with parameters as above. |
|
- **Learning Rate**: 2e-5 with a 5% warmup and cosine decay to 2e-6. |
|
- **Batch Size**: 4M tokens for efficient training on high-context tasks. |
|
|
|
--- |
|
|
|
|
|
## Key Features |
|
|
|
- **Long Context Capability**: Leveraging Princeton’s ProLong model, this model can handle up to 512K tokens, enabling consistent and detailed responses even in lengthy interactions. |
|
- **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions. |
|
- **Memory-Efficient Merging**: The merge was configured with `int8_mask`, which reduces memory requirements during the merging process itself, making it practical on limited hardware.
|
|
|
## Acknowledgments |
|
|
|
- **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring unprecedented long-context handling capabilities to the Llama series. |
|
- **Casual-Autopsy**: For providing the FP32 weights of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay model that adds thematic depth and interaction diversity.
|
- **Bluuwhale**: For creating the [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1) merge.
|
- **Sao10K**: For creating the wonderful original [Sao10K](https://huggingface.co/Sao10K) models, which add rich roleplay flavor, thematic depth, and character continuity.
|
|
|
## Citation |
|
|
|
If you use this model, please consider citing the work of the ProLong developers: |
|
```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  journal={arXiv preprint arXiv:2410.02660},
  year={2024}
}
```