|
--- |
|
license: apache-2.0 |
|
tags: |
|
- merge |
|
- mergekit |
|
- lazymergekit |
|
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc |
|
language: |
|
- en |
|
base_model: |
|
- princeton-nlp/Llama-3-8B-ProLong-512k-Instruct |
|
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc |
|
library_name: transformers |
|
--- |
|
Disclaimer: This model merge is experimental and has not been thoroughly tested. Expect further versions, with improvements, in the coming days.
|
|
|
# ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K
|
|
|
**ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** is a versatile merged model that combines the long-context capabilities of Princeton's ProLong model with the rich, immersive roleplay features of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the `mergekit` library using a della configuration tuned to balance efficiency, roleplay fidelity, and long-context performance, with the goal of supporting coherent, extended interactions.
|
|
|
## Model Components and Sources |
|
|
|
This model is a merge of the following: |
|
|
|
1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)** |
|
*Developed by Princeton NLP, ProLong extends long-context capability up to 512,000 tokens and is optimized for detailed, extended conversations. Continued long-context training equips it for high-quality retrieval and coherent responses even over very long inputs.*
|
|
|
2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)** |
|
*This model contributes immersive roleplay and storytelling, drawing on creative datasets to produce compelling interactions. Role-specific configurations support vibrant, in-depth character simulations.*
|
|
|
## 🧩 Configuration and Merge Details |
|
|
|
The merge was executed with MergeKit using the YAML configuration below, which is crafted to preserve each component's strengths while optimizing performance in complex, long-context scenarios.
|
|
|
### YAML Configuration |
|
```yaml
models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5      # Emphasizes roleplay elements without overshadowing the base
      density: 0.6     # Retains 60% of the significant parameters from the roleplay model

merge_method: della    # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05        # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0          # Harmonizes parameter influence from both models
  normalize: true      # Ensures stable alignment of merged parameters
  int8_mask: true      # Enhances memory efficiency during the merge itself

dtype: float32
out_dtype: bfloat16    # Balances precision and efficiency for versatile deployments
```
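
To reproduce the merge from this configuration, MergeKit can be driven from Python. The sketch below follows the usage example in the MergeKit README and assumes the YAML above has been saved as `config.yaml`; option names may differ slightly between MergeKit versions, and the output path is an arbitrary choice.

```python
# Minimal sketch of running the merge configuration above with MergeKit's Python API.
# Assumes the YAML above is saved as ./config.yaml; option names follow the MergeKit
# README and may vary across versions.
import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./Llama-3-8B-ProLong-SAO-Roleplay-512K",  # output directory (arbitrary name)
    options=MergeOptions(
        cuda=torch.cuda.is_available(),  # use a GPU for the merge if one is available
        copy_tokenizer=True,             # copy the base model's tokenizer into the output
        lazy_unpickle=False,
        low_cpu_memory=False,
    ),
)
```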
|
|
|
## Intended Usage |
|
|
|
The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** model is designed for the following use cases (a minimal loading sketch follows the list):
|
|
|
- **Extended Conversations**: With a 512K token context window, it is ideal for scenarios requiring sustained, cohesive dialogue. |
|
- **Roleplay and Storytelling**: The integration of SAO-themed and roleplay-focused datasets creates a rich and immersive storytelling experience, perfect for applications in interactive fiction, virtual characters, and creative writing. |
|
- **General Instruction Following**: Fine-tuned on UltraChat, the model maintains a helpful and instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation. |
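
Below is a minimal, illustrative loading and generation sketch using 🤗 Transformers. The repository id is assumed from this model card's title, and the sampling settings are arbitrary examples rather than recommended defaults.

```python
# Illustrative only: load the merged model and run a short roleplay-style generation.
# The repo id is assumed from this model card; sampling settings are arbitrary examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merge's out_dtype
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an immersive roleplay narrator."},
    {"role": "user", "content": "Open the scene: a duelist steps onto the 75th floor of Aincrad."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```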
|
|
|
--- |
|
|
|
## 📚 Dataset Details for ProLong 8B Training |
|
|
|
The **ProLong-8B** model was rigorously trained with a carefully curated dataset, ensuring versatility across long-context scenarios. |
|
|
|
### Continued Long-context Training |
|
1. **Data Composition**: |
|
- **30% Code Repositories**: This includes diverse sources to enhance technical comprehension and code-related dialogue. |
|
- **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities. |
|
- **3% Textbooks**: Technical textbooks for specialized and academic context handling. |
|
- **37% ShortMix**: A balanced blend of various online sources for comprehensive topic coverage. |
|
- **ShortMix Components**: |
|
- 27% FineWeb-Edu |
|
- 27% FineWeb |
|
- 11% Tulu-v2 |
|
- 11% StackExchange |
|
- 8% Wikipedia |
|
- 8% OpenWebMath |
|
- 8% ArXiv |
|
|
|
2. **Training Stages**: |
|
- **Stage 1 (64K Context Window)**: |
|
- Utilized code repositories, books, and textbooks. |
|
- Training Steps: 20B tokens over approximately 2.2K H100 GPU hours. |
|
- **Stage 2 (512K Context Window)**: |
|
- Code repositories (50% at 512K length and 50% at 64K length). |
|
- Books (17% at 512K and 83% at 64K). |
|
- Textbooks primarily focused on a 512K length. |
|
- Training Steps: 20B tokens over approximately 12.2K H100 GPU hours. |
|
|
|
3. **Optimization and Model Configuration** (an illustrative optimizer setup is sketched after this list):
|
- **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95. |
|
- **Learning Rate**: |
|
- Stage 1: Initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6. |
|
- **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2. |
|
- **Attention Mechanism**: Full attention with cross-document attention masking to effectively handle extensive context windows. |
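
For illustration only, the following sketch reproduces the optimizer and learning-rate schedule described above with standard PyTorch and Transformers utilities. It is not the actual ProLong training code: the stand-in model and step count are placeholders, and the stock cosine scheduler decays to 0 rather than to the 1e-6 floor noted above.

```python
# Illustrative sketch of the optimizer/schedule described above (not ProLong's code).
# A tiny dummy model stands in for the 8B model; total_steps is a placeholder, and
# get_cosine_schedule_with_warmup decays to 0 rather than the 1e-6 floor noted above.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(16, 16)  # stand-in for the actual 8B model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # Stage 1 peak learning rate
    betas=(0.9, 0.95),   # β₁ and β₂ as listed above
    weight_decay=0.1,
)

total_steps = 5_000      # placeholder: actual value = token budget / batch size in tokens
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup
    num_training_steps=total_steps,
)
```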
|
|
|
### Supervised Fine-tuning (SFT) |
|
1. **Data Source**: |
|
- **UltraChat**: A robust dataset with 1B tokens specifically selected to enhance conversational depth and responsiveness. |
|
2. **Optimization**: |
|
- **Optimizer**: AdamW with parameters as above. |
|
- **Learning Rate**: 2e-5 with a 5% warmup and cosine decay to 2e-6. |
|
- **Batch Size**: 4M tokens for efficient training on high-context tasks. |
|
|
|
--- |
|
|
|
|
|
## Key Features |
|
|
|
- **Long Context Capability**: Leveraging Princeton’s ProLong model, this model can handle up to 512K tokens, enabling consistent and detailed responses even in lengthy interactions. |
|
- **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions. |
|
- **Memory-Efficient Merging**: The merge was configured with `int8_mask`, which reduces memory requirements during the merging process itself, making it practical on limited hardware.
|
|
|
## Acknowledgments |
|
|
|
- **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring unprecedented long-context handling capabilities to the Llama series. |
|
- **Casual-Autopsy**: For providing the FP32 weights of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay model that adds thematic depth and interaction diversity.
|
- **Bluuwhale**: For creating the [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1) merge.
|
- **Sao10K**: For creating the wonderful original [Sao10K](https://huggingface.co/Sao10K) models, which add rich roleplay flavor, thematic depth, and character continuity.
|
|
|
## Citation |
|
|
|
If you use this model, please consider citing the work of the ProLong developers: |
|
```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  journal={arXiv preprint arXiv:2410.02660},
  year={2024}
}
```