---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
language:
- en
base_model:
- princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
- Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
library_name: transformers
---
Disclaimer: This model merge has not been thoroughly tested and is experimental. Expect further versions, with improvements, in the coming days.

# ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512

**ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** is a powerful, versatile merged model that combines the long-context capabilities of Princeton's ProLong model with the rich, immersive roleplay features of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the `mergekit` library, using a configuration chosen to balance efficiency, roleplay fidelity, and long-context performance in extended interactions.

## Model Components and Sources

This model is a merge of the following:

1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)**  
   *Developed by Princeton NLP, ProLong extends long-context capability up to 512,000 tokens and is optimized for detailed, extended conversations. Continued long-context training equips it for high-quality retrieval and coherent responses even in very long contexts.*

2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)**  
   *This model contributes roleplay and immersive storytelling, drawing on creative datasets to produce compelling interactions. Role-specific configurations support vibrant, in-depth character simulations.*

## 🧩 Configuration and Merge Details

The merge was executed with MergeKit using a carefully crafted YAML configuration. Key aspects of the configuration preserve each component's strengths while optimizing for performance in complex, long-context scenarios.

### YAML Configuration
```yaml
models:
  - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
    # Base model: optimized for long-context interactions
  - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
    parameters:
      weight: 0.5  # Emphasizes roleplay elements without overshadowing the base
      density: 0.6  # Retains 60% of the significant parameters from the roleplay model

merge_method: della  # Ensures balanced integration of long-context and roleplay features
base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
parameters:
  epsilon: 0.05  # Fine-tunes the granularity of pruning, maintaining key model features
  lambda: 1.0  # Harmonizes parameter influence from both models
  normalize: true  # Ensures stable alignment of merged parameters
  int8_mask: true  # Stores merge masks in 8-bit form to reduce memory overhead during merging

dtype: float32
out_dtype: bfloat16  # Balances precision and efficiency for versatile deployments
```
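
For reference, the sketch below shows one way to run this configuration through mergekit's Python API, following the usage pattern documented in the mergekit README. The `config.yaml` path, output directory, and option values are placeholders rather than settings taken from this card; the `mergekit-yaml` CLI is an equivalent route.

```python
# Hedged sketch: running the YAML configuration above with mergekit's Python API.
# Paths are placeholders; the options mirror common defaults, not values from this card.
import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Load the merge configuration shown above from a local file.
with open("config.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./Llama-3-8B-ProLong-SAO-Roleplay-merged",  # placeholder output directory
    options=MergeOptions(
        cuda=torch.cuda.is_available(),  # run the merge on GPU if one is available
        copy_tokenizer=True,             # copy the base model's tokenizer into the output
        lazy_unpickle=False,             # experimental low-memory loader
        low_cpu_memory=False,
    ),
)
```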

## Intended Usage

The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512K** model is designed for:

- **Extended Conversations**: With a 512K token context window, it is ideal for scenarios requiring sustained, cohesive dialogue.
- **Roleplay and Storytelling**: The integration of SAO-themed and roleplay-focused datasets creates a rich and immersive storytelling experience, perfect for applications in interactive fiction, virtual characters, and creative writing.
- **General Instruction Following**: Fine-tuned on UltraChat, the model maintains a helpful and instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation.
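
Below is a minimal inference sketch using 🤗 Transformers. The repository id is assumed to match this card's title, and the prompt and sampling settings are illustrative only; adjust them for your use case.

```python
# Hedged sketch: loading and prompting the merged model with Transformers.
# The repo id is assumed from this card's title; change it if the hosted name differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # matches out_dtype in the merge config
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a vivid, in-character storyteller."},
    {"role": "user", "content": "Continue the scene where the party regroups at the tavern."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```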

---

## 📚 Dataset Details for ProLong 8B Training

The **ProLong-8B** model was rigorously trained with a carefully curated dataset, ensuring versatility across long-context scenarios.

### Continued Long-context Training
1. **Data Composition**:
   - **30% Code Repositories**: This includes diverse sources to enhance technical comprehension and code-related dialogue.
   - **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities.
   - **3% Textbooks**: Technical textbooks for specialized and academic context handling.
   - **37% ShortMix**: A balanced blend of various online sources for comprehensive topic coverage.
     - **ShortMix Components**:
       - 27% FineWeb-Edu
       - 27% FineWeb
       - 11% Tulu-v2
       - 11% StackExchange
       - 8% Wikipedia
       - 8% OpenWebMath
       - 8% ArXiv

2. **Training Stages**:
   - **Stage 1 (64K Context Window)**:
     - Utilized code repositories, books, and textbooks.
     - Training budget: 20B tokens over approximately 2.2K H100 GPU hours.
   - **Stage 2 (512K Context Window)**:
     - Code repositories (50% at 512K length and 50% at 64K length).
     - Books (17% at 512K and 83% at 64K).
     - Textbooks primarily focused on a 512K length.
     - Training budget: 20B tokens over approximately 12.2K H100 GPU hours.

3. **Optimization and Model Configuration**:
   - **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95.
   - **Learning Rate**:
     - Stage 1: Initial rate of 1e-5 with 10% warmup and cosine decay to 1e-6 (a minimal sketch of this schedule follows the list).
   - **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2.
   - **Attention Mechanism**: Full attention with cross-document attention masking to handle extensive context windows effectively.
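
For readers who want to mirror the schedule above, here is a minimal sketch of linear warmup followed by cosine decay to a floor, written with plain PyTorch. The peak and floor rates follow the Stage 1 numbers in the list; the total step count and the tiny stand-in model are placeholders, so this only illustrates the schedule's shape, not the original training setup.

```python
# Hedged sketch: warmup + cosine decay to a floor learning rate, per the Stage 1 settings.
# total_steps and the stand-in model are placeholders, not values from the model card.
import math
import torch

peak_lr, floor_lr = 1e-5, 1e-6
total_steps = 10_000                      # placeholder step count
warmup_steps = int(0.10 * total_steps)    # 10% warmup

model = torch.nn.Linear(8, 8)             # stand-in for the language model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, weight_decay=0.1, betas=(0.9, 0.95)
)

def lr_lambda(step: int) -> float:
    """Return a multiplier on peak_lr: linear warmup, then cosine decay to floor_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = floor_lr / peak_lr            # decay to 1e-6 rather than to zero
    return floor + (1.0 - floor) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# During training, call scheduler.step() after each optimizer.step().
```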

### Supervised Fine-tuning (SFT)
1. **Data Source**:
   - **UltraChat**: A robust dataset with 1B tokens specifically selected to enhance conversational depth and responsiveness.
2. **Optimization**:
   - **Optimizer**: AdamW with parameters as above.
   - **Learning Rate**: 2e-5 with a 5% warmup and cosine decay to 2e-6.
   - **Batch Size**: 4M tokens for efficient training on high-context tasks.

--- 


## Key Features

- **Long Context Capability**: Leveraging Princeton’s ProLong model, this model can handle up to 512K tokens, enabling consistent and detailed responses even in lengthy interactions.
- **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, with support for a variety of personalities and nuanced interactions.
- **Memory-Efficient Merging**: The merge used the `int8_mask` option, which stores merge masks in 8-bit form to reduce memory overhead during the merge computation.

## Acknowledgments

- **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring unprecedented long-context handling capabilities to the Llama series.
- **Casual-Autopsy**: For providing F32 quants of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay model that adds thematic depth and interaction diversity.
- **Bluuwhale**: For merging [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1). 
- **Sao10K**: For creating the original [Sao10K](https://huggingface.co/Sao10K) models, whose rich roleplay tuning adds thematic depth and character continuity.
  
## Citation

If you use this model, please consider citing the work of the ProLong developers:
```bibtex
@article{gao2024prolong,
  title={How to Train Long-Context Language Models (Effectively)},
  author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
  journal={arXiv preprint arXiv:2410.02660},
  year={2024}
}
```