# ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512
**ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** is a versatile merged model that combines the long-context capabilities of Princeton NLP's ProLong model with the rich, immersive roleplay strengths of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the `mergekit` library using a configuration tuned to balance efficiency, roleplay fidelity, and long-context performance in extended interactions.

## Model Components and Sources

This model is a merge of the following:

1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)**
   *Developed by Princeton NLP, ProLong extends the context window to 512,000 tokens and is optimized for detailed, extended conversations. Continued long-context training equips it for high-quality retrieval and coherent responses even over lengthy contexts.*

2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)**
   *This model contributes roleplay and immersive storytelling, building on creative datasets to produce compelling interactions. Role-specific configurations support vibrant, in-depth character simulations.*

## 🧩 Configuration and Merge Details

The merge was executed with MergeKit using the YAML configuration below. Key settings preserve each component's strengths while optimizing for performance in complex, long-context scenarios.

### YAML Configuration
```yaml
models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5    # Emphasizes roleplay elements without overshadowing the base
      density: 0.6   # Retains 60% of the significant parameters from the roleplay model

merge_method: della  # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05      # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0        # Harmonizes parameter influence from both models
  normalize: true    # Ensures stable alignment of merged parameters
  int8_mask: true    # Enhances memory efficiency for extended contexts

dtype: float32
out_dtype: bfloat16  # Balances precision and efficiency for versatile deployments
```
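To reproduce a merge like this, the configuration above can be saved to a file (e.g. `config.yaml`) and run with mergekit's command-line entry point, typically along the lines of `mergekit-yaml config.yaml ./Llama-3-8B-ProLong-SAO-Roleplay-512`. The exact flags depend on your mergekit version and hardware, so treat this as an illustrative invocation rather than the exact command used for this release.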
## Intended Usage

The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** model is designed for:

- **Extended Conversations**: With a 512K-token context window, it is well suited to scenarios that require sustained, cohesive dialogue.
- **Roleplay and Storytelling**: The integration of SAO-themed, roleplay-focused datasets creates a rich and immersive storytelling experience, suited to interactive fiction, virtual characters, and creative writing.
- **General Instruction Following**: Fine-tuned on UltraChat, the model maintains a helpful, instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation.
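As a quick-start illustration, the following minimal sketch loads the model with the Hugging Face `transformers` library and generates a short roleplay reply. The repository id comes from this card; the chat-template usage, sampling settings, and prompt are illustrative assumptions rather than recommended defaults.

```python
# Minimal usage sketch (assumes transformers, accelerate, and a CUDA-capable GPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merge's out_dtype
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an immersive roleplay partner."},
    {"role": "user", "content": "Set the scene: the gates of a floating castle at dawn."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```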
---
## 📚 Dataset Details for ProLong 8B Training

The **ProLong-8B** base model was trained on a carefully curated data mixture to ensure versatility across long-context scenarios.

### Continued Long-context Training

1. **Data Composition**:
   - **30% Code Repositories**: Diverse sources to enhance technical comprehension and code-related dialogue.
   - **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities.
   - **3% Textbooks**: Technical textbooks for specialized and academic context handling.
   - **37% ShortMix**: A balanced blend of shorter online sources for broad topic coverage.
     - **ShortMix Components**:
       - 27% FineWeb-Edu
       - 27% FineWeb
       - 11% Tulu-v2
       - 11% StackExchange
       - 8% Wikipedia
       - 8% OpenWebMath
       - 8% ArXiv

2. **Training Stages**:
   - **Stage 1 (64K Context Window)**:
     - Utilized code repositories, books, and textbooks.
     - Training budget: 20B tokens over approximately 2.2K H100 GPU hours.
   - **Stage 2 (512K Context Window)**:
     - Code repositories (50% at 512K length and 50% at 64K length).
     - Books (17% at 512K and 83% at 64K).
     - Textbooks primarily at 512K length.
     - Training budget: 20B tokens over approximately 12.2K H100 GPU hours.

3. **Optimization and Model Configuration** (a minimal optimizer sketch follows this list):
   - **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95.
   - **Learning Rate**: Stage 1 used an initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6.
   - **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2.
   - **Attention Mechanism**: Full attention with cross-document attention masking to handle extensive context windows effectively.
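For concreteness, here is a minimal PyTorch sketch of the Stage 1 optimizer and learning-rate schedule described above (AdamW with weight decay 0.1, β₁ = 0.9, β₂ = 0.95, 10% linear warmup, cosine decay from 1e-5 to 1e-6). The model, step count, and schedule implementation are placeholders, not the original ProLong training code.

```python
# Illustrative optimizer/schedule setup only; `model` and `total_steps` are placeholders.
import math
import torch

model = torch.nn.Linear(16, 16)        # stand-in for the actual 8B model
peak_lr, floor_lr = 1e-5, 1e-6         # Stage 1 learning-rate range
total_steps = 5000                     # hypothetical number of optimizer steps
warmup_steps = int(0.10 * total_steps) # 10% linear warmup

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay down to floor_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor_lr / peak_lr + (1.0 - floor_lr / peak_lr) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

The supervised fine-tuning stage described below follows the same pattern with a 2e-5 peak, 5% warmup, and a 2e-6 floor.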
### Supervised Fine-tuning (SFT)
1. **Data Source**:
   - **UltraChat**: A 1B-token conversational dataset selected to enhance conversational depth and responsiveness.

2. **Optimization**:
   - **Optimizer**: AdamW with the same settings as above.
   - **Learning Rate**: 2e-5 with a 5% warmup and cosine decay to 2e-6.
   - **Batch Size**: 4M tokens for efficient training on high-context tasks.

---

## Key Features

- **Long Context Capability**: Leveraging Princeton's ProLong model, this merge can handle up to 512K tokens, enabling consistent, detailed responses even in lengthy interactions.
- **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions.
- **Enhanced Memory Efficiency**: The merge uses `int8_mask`, which helps manage larger context sizes efficiently on limited hardware.

## Acknowledgments

- **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring long-context handling up to 512K tokens to the Llama series.
- **Casual-Autopsy**: For providing F32 quants of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay merge that adds thematic depth and interaction diversity.
- **Bluuwhale**: For merging [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1).
- **[Sao10K](https://huggingface.co/Sao10K)**: For creating the original roleplay models that add thematic depth and character continuity.

## Citation

If you use this model, please consider citing the work of the ProLong developers:

```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  journal={arXiv preprint arXiv:2410.02660},
  year={2024}
}
```