---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
language:
- en
base_model:
- princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
library_name: transformers
---
Disclaimer: This model merge is experimental and has not been thoroughly tested. Expect improved versions in the coming days.
# ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512
**ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** is a powerful, versatile merged model combining the long-context capabilities of Princeton's ProLong model with the rich, immersive roleplay features of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the `mergekit` library using an advanced configuration that balances efficiency, roleplay fidelity, and long-context capability, aiming to provide a seamless experience for extended interactions.
## Model Components and Sources
This model is a merge of the following:
1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)**
*Developed by Princeton NLP, ProLong brings long-context capabilities up to 512,000 tokens, optimized for detailed and extended conversations. Continued training on extensive datasets equips it for high-quality retrieval, while offering coherent responses even in lengthy contexts.*
2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)**
*This model introduces roleplay and immersive storytelling, building on creative datasets to create compelling interactions. Role-specific configurations support vibrant and in-depth character simulations.*
## 🧩 Configuration and Merge Details
The model merge was executed with MergeKit using a carefully crafted YAML configuration. Key aspects of the configuration ensure that each component's strengths are preserved while optimizing for performance in complex, long-context scenarios.
### YAML Configuration
```yaml
models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5    # Emphasizes roleplay elements without overshadowing the base
      density: 0.6   # Retains 60% of the significant parameters from the roleplay model
merge_method: della  # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05      # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0        # Harmonizes parameter influence from both models
  normalize: true    # Ensures stable alignment of merged parameters
  int8_mask: true    # Enhances memory efficiency for extended contexts
dtype: float32
out_dtype: bfloat16  # Balances precision and efficiency for versatile deployments
```
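The `della` merge method prunes low-magnitude delta parameters from the donor model before blending them into the base. The snippet below is a minimal, self-contained sketch of that core idea using the `weight`, `density`, and `lambda` values from the configuration above. It is an illustration only, not mergekit's actual implementation, and it omits the `epsilon`-controlled probabilistic drop step.

```python
def della_merge_sketch(base, donor, weight=0.5, density=0.6, lam=1.0):
    """Toy sketch of a DELLA-style merge on one flat weight vector.

    Computes the donor's delta from the base, keeps only the top
    `density` fraction of entries by magnitude, rescales the survivors
    by 1/density to preserve expected magnitude, then adds
    lam * weight * delta back onto the base.
    """
    delta = [d - b for b, d in zip(base, donor)]
    # Magnitude threshold that keeps roughly `density` of the entries.
    k = max(1, round(density * len(delta)))
    threshold = sorted((abs(x) for x in delta), reverse=True)[k - 1]
    pruned = [x / density if abs(x) >= threshold else 0.0 for x in delta]
    return [b + lam * weight * p for b, p in zip(base, pruned)]

# Example on a toy 5-parameter "layer"
base = [0.1, -0.4, 0.25, 0.0, 0.9]
donor = [0.2, -0.1, 0.2, 0.05, 0.7]
merged = della_merge_sketch(base, donor)
```

With `density: 1.0` this reduces to a plain weighted delta merge; lowering `density` zeroes out the smallest donor deltas, which is what keeps the roleplay model from overwriting the base's long-context behavior wholesale.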
## Intended Usage
The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** model is designed for:
- **Extended Conversations**: With a 512K token context window, it is ideal for scenarios requiring sustained, cohesive dialogue.
- **Roleplay and Storytelling**: The integration of SAO-themed and roleplay-focused datasets creates a rich and immersive storytelling experience, perfect for applications in interactive fiction, virtual characters, and creative writing.
- **General Instruction Following**: Fine-tuned on UltraChat, the model maintains a helpful and instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation.
---
## 📚 Dataset Details for ProLong 8B Training
The **ProLong-8B** model was rigorously trained with a carefully curated dataset, ensuring versatility across long-context scenarios.
### Continued Long-context Training
1. **Data Composition**:
- **30% Code Repositories**: This includes diverse sources to enhance technical comprehension and code-related dialogue.
- **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities.
- **3% Textbooks**: Technical textbooks for specialized and academic context handling.
- **37% ShortMix**: A balanced blend of various online sources for comprehensive topic coverage.
- **ShortMix Components**:
- 27% FineWeb-Edu
- 27% FineWeb
- 11% Tulu-v2
- 11% StackExchange
- 8% Wikipedia
- 8% OpenWebMath
- 8% ArXiv
2. **Training Stages**:
- **Stage 1 (64K Context Window)**:
- Utilized code repositories, books, and textbooks.
- Training volume: 20B tokens over approximately 2.2K H100 GPU hours.
- **Stage 2 (512K Context Window)**:
- Code repositories (50% at 512K length and 50% at 64K length).
- Books (17% at 512K and 83% at 64K).
- Textbooks primarily focused on a 512K length.
- Training volume: 20B tokens over approximately 12.2K H100 GPU hours.
3. **Optimization and Model Configuration**:
- **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95.
- **Learning Rate**:
- Stage 1: Initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6.
- **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2.
- **Attention Mechanism**: Full attention with cross-document attention masking to effectively handle extensive context windows.
### Supervised Fine-tuning (SFT)
1. **Data Source**:
- **UltraChat**: A robust dataset with 1B tokens specifically selected to enhance conversational depth and responsiveness.
2. **Optimization**:
- **Optimizer**: AdamW with parameters as above.
- **Learning Rate**: 2e-5 with a 5% warmup and cosine decay to 2e-6.
- **Batch Size**: 4M tokens for efficient training on high-context tasks.
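Both schedules described above (Stage 1 pre-training and SFT) share the same warmup-then-cosine shape, differing only in peak rate, floor, and warmup fraction. A minimal sketch of that schedule, with the hyperparameters taken from the description (the step counts are arbitrary placeholders):

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, final_lr, warmup_frac):
    """Linear warmup to peak_lr, then cosine decay down to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Stage 1 schedule: 1e-5 peak, 10% warmup, cosine decay to 1e-6
stage1 = [warmup_cosine_lr(s, 1000, 1e-5, 1e-6, 0.10) for s in range(1001)]
# SFT schedule: 2e-5 peak, 5% warmup, cosine decay to 2e-6
sft = [warmup_cosine_lr(s, 1000, 2e-5, 2e-6, 0.05) for s in range(1001)]
```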
---
## Key Features
- **Long Context Capability**: Leveraging Princeton’s ProLong model, this model can handle up to 512K tokens, enabling consistent and detailed responses even in lengthy interactions.
- **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions.
- **Enhanced Memory Efficiency**: Configured to utilize `int8_mask`, which aids in managing larger context sizes efficiently on limited hardware resources.
## Acknowledgments
- **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring unprecedented long-context handling capabilities to the Llama series.
- **Casual-Autopsy**: For providing F32 quants of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay model that adds thematic depth and interaction diversity.
- **Bluuwhale**: For merging [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1).
- **[Sao10K](https://huggingface.co/Sao10K)**: For creating the wonderful roleplay models that add thematic depth and character continuity.
## Citation
If you use this model, please consider citing the work of the ProLong developers:
```bibtex
@article{gao2024prolong,
title={How to Train Long-Context Language Models (Effectively)},
author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
journal={arXiv preprint arXiv:2410.02660},
year={2024}
}
```