ZeroXClem committed
Commit a4cf20c
1 Parent(s): 4b8adea

Update README.md

Files changed (1)
  1. README.md +106 -13
README.md CHANGED
@@ -9,29 +9,122 @@ tags:

 # ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512

- ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512 is a merge of the following models using [mergekit](https://github.com/cg123/mergekit):
- * [Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)

- ## 🧩 Configuration

 ```yaml
 models:
   - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
-     # Base model: no additional parameters necessary
   - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
     parameters:
-       weight: 0.5 # Adjusts influence of roleplay features from L3-bluuwhale
-       density: 0.6 # Preserves around 60% of significant parameters from the roleplay model

- merge_method: della
 base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
 parameters:
-   epsilon: 0.05 # Fine-tunes the granularity of pruning
-   lambda: 1.0 # Scaling factor to harmonize parameter influence
-   normalize: true # Ensures parameters align without large deviations
-   int8_mask: true # Uses an efficient format to handle larger context

 dtype: float32
- out_dtype: bfloat16 # Output type to balance precision and efficiency

- ```


 # ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512

+ **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** is a merged model that combines the long-context capabilities of Princeton NLP's ProLong with the immersive roleplay strengths of Casual-Autopsy's L3-bluuwhale-SAO-MIX. The merge was performed with the [mergekit](https://github.com/cg123/mergekit) library, using a configuration tuned to balance efficiency, roleplay fidelity, and long-context performance in extended interactions.
+
+ ## Model Components and Sources
+
+ This model is a merge of the following:
+
+ 1. **[princeton-nlp/Llama-3-8B-ProLong-512k-Instruct](https://huggingface.co/princeton-nlp/Llama-3-8B-ProLong-512k-Instruct)**
+    *Developed by Princeton NLP, ProLong extends the context window to 512,000 tokens and is optimized for detailed, extended conversations. Continued long-context training equips it for high-quality retrieval and coherent responses even over very long inputs.*
+
+ 2. **[Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc)**
+    *This model contributes roleplay and immersive storytelling, built on creative datasets, with role-specific configurations that support vibrant, in-depth character simulations.*
+
+ ## 🧩 Configuration and Merge Details
+
+ The merge was executed with mergekit using the YAML configuration below. The settings are chosen to preserve each component's strengths while keeping the merged model efficient in complex, long-context scenarios.
+
+ ### YAML Configuration
 ```yaml
 models:
   - model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
+     # Base model: optimized for long-context interactions
   - model: Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc
     parameters:
+       weight: 0.5 # Emphasizes roleplay elements without overshadowing the base
+       density: 0.6 # Retains 60% of the significant parameters from the roleplay model

+ merge_method: della # Ensures balanced integration of long-context and roleplay features
 base_model: princeton-nlp/Llama-3-8B-ProLong-512k-Instruct
 parameters:
+   epsilon: 0.05 # Fine-tunes the granularity of pruning, maintaining key model features
+   lambda: 1.0 # Harmonizes parameter influence from both models
+   normalize: true # Ensures stable alignment of merged parameters
+   int8_mask: true # Enhances memory efficiency for extended contexts

 dtype: float32
+ out_dtype: bfloat16 # Balances precision and efficiency for versatile deployments
+ ```
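+
+ As a rough illustration (not part of the original model card), a configuration like the one above can typically be run either with the `mergekit-yaml` command-line tool or through mergekit's Python API. The file name and output path below are hypothetical, and option names may vary slightly between mergekit versions:
+
+ ```python
+ # Sketch: executing the DELLA merge config above with mergekit's Python API.
+ # Assumes the YAML is saved as "prolong_sao_della.yaml" (hypothetical name)
+ # and that mergekit is installed (pip install mergekit).
+ import torch
+ import yaml
+
+ from mergekit.config import MergeConfiguration
+ from mergekit.merge import MergeOptions, run_merge
+
+ with open("prolong_sao_della.yaml", "r", encoding="utf-8") as f:
+     merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))
+
+ run_merge(
+     merge_config,
+     "./Llama-3-8B-ProLong-SAO-Roleplay-512",  # output directory
+     options=MergeOptions(
+         cuda=torch.cuda.is_available(),  # merge on GPU when one is available
+         copy_tokenizer=True,             # carry over the base model's tokenizer
+         lazy_unpickle=False,
+         low_cpu_memory=False,
+     ),
+ )
+ ```
+
+ The equivalent CLI call would be along the lines of `mergekit-yaml prolong_sao_della.yaml ./Llama-3-8B-ProLong-SAO-Roleplay-512 --cuda`.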
+
+ ## Intended Usage
+
+ The **ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512** model is designed for the following uses (a loading sketch follows the list):
+
+ - **Extended Conversations**: With a 512K-token context window, it is well suited to scenarios that require sustained, cohesive dialogue.
+ - **Roleplay and Storytelling**: The SAO-themed, roleplay-focused components provide rich, immersive storytelling for interactive fiction, virtual characters, and creative writing.
+ - **General Instruction Following**: Thanks to the base model's supervised fine-tuning on UltraChat, it keeps a helpful, instructive demeanor, making it suitable for Q&A, assistance, and knowledge generation.
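+
+ A minimal usage sketch with 🤗 Transformers is shown below. The repo id is assumed to match the model name above, and the system prompt and sampling settings are purely illustrative:
+
+ ```python
+ # Sketch: loading the merged model for a roleplay-style chat with Transformers.
+ # Repo id, prompt, and generation settings are illustrative assumptions.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # matches the merge's bfloat16 output dtype
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a calm, resourceful guide in a vast virtual world. Stay in character."},
+     {"role": "user", "content": "We just reached the 75th floor of the tower. What's our plan?"},
+ ]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output_ids = model.generate(
+     input_ids, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.9
+ )
+ print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```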
+
+ ---
+
+ ## 📚 Dataset Details for ProLong 8B Training
+
+ The **ProLong-8B** base model was trained on a carefully curated data mixture designed for versatility across long-context scenarios.
+
+ ### Continued Long-context Training
+ 1. **Data Composition**:
+    - **30% Code Repositories**: Diverse sources to strengthen technical comprehension and code-related dialogue.
+    - **30% Books**: A mix of general and specialized literature to improve narrative and comprehension abilities.
+    - **3% Textbooks**: Technical textbooks for specialized and academic contexts.
+    - **37% ShortMix**: A balanced blend of shorter online sources for broad topic coverage, composed of:
+      - 27% FineWeb-Edu
+      - 27% FineWeb
+      - 11% Tulu-v2
+      - 11% StackExchange
+      - 8% Wikipedia
+      - 8% OpenWebMath
+      - 8% ArXiv
+
+ 2. **Training Stages**:
+    - **Stage 1 (64K context window)**:
+      - Trained on code repositories, books, and textbooks.
+      - 20B tokens over approximately 2.2K H100 GPU hours.
+    - **Stage 2 (512K context window)**:
+      - Code repositories (50% at 512K length, 50% at 64K).
+      - Books (17% at 512K, 83% at 64K).
+      - Textbooks primarily at 512K length.
+      - 20B tokens over approximately 12.2K H100 GPU hours.
+
+ 3. **Optimization and Model Configuration**:
+    - **Optimizer**: AdamW with a weight decay of 0.1, β₁ = 0.9, and β₂ = 0.95.
+    - **Learning Rate**: Stage 1 used a peak rate of 1e-5 with 10% warmup and cosine decay to 1e-6 (see the schedule sketch after this list).
+    - **Batch Size**: 4M tokens for Stage 1 and 8M tokens for Stage 2.
+    - **Attention Mechanism**: Full attention with cross-document attention masking to handle the extended context windows effectively.
+
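+ The following is a minimal sketch, not code from the ProLong release, of the linear-warmup plus cosine-decay schedule described above, using Stage 1's numbers (peak 1e-5, 10% warmup, floor 1e-6); the total step count is an arbitrary illustration:
+
+ ```python
+ # Sketch of the warmup + cosine-decay learning-rate schedule described above.
+ # Only the peak/minimum rates and the 10% warmup fraction come from this section;
+ # the step counts are illustrative.
+ import math
+
+ def lr_at(step: int, total_steps: int, peak_lr: float = 1e-5,
+           min_lr: float = 1e-6, warmup_frac: float = 0.10) -> float:
+     """Linear warmup to peak_lr, then cosine decay down to min_lr."""
+     warmup_steps = max(1, int(total_steps * warmup_frac))
+     if step < warmup_steps:
+         return peak_lr * (step + 1) / warmup_steps
+     progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+     return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
+
+ total = 5000  # illustrative number of optimizer steps
+ for s in (0, 250, 499, 2500, 4999):
+     print(f"step {s:>4}: lr = {lr_at(s, total):.2e}")
+ ```
+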
+ ### Supervised Fine-tuning (SFT)
+ 1. **Data Source**:
+    - **UltraChat**: a 1B-token dataset selected to enhance conversational depth and responsiveness.
+ 2. **Optimization**:
+    - **Optimizer**: AdamW with the same settings as above.
+    - **Learning Rate**: 2e-5 with 5% warmup and cosine decay to 2e-6.
+    - **Batch Size**: 4M tokens.
+
+ ---
+
+ ## Key Features
+
+ - **Long-Context Capability**: Built on Princeton's ProLong model, it handles up to 512K tokens, enabling consistent, detailed responses even in lengthy interactions.
+ - **Immersive Roleplay Dynamics**: The influence of L3-bluuwhale-SAO-MIX adds depth to character responses, supporting a variety of personalities and nuanced interactions.
+ - **Memory-Efficient Merging**: The `int8_mask` option reduces memory overhead during the merge, which helps when working with models configured for such large contexts.
+
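+ As a quick check of the context-window claim (repo id assumed to match the model name above), the advertised limit can be read from the published config without downloading the weights:
+
+ ```python
+ # Sketch: inspect the context window advertised in the model's config.json.
+ # The repo id is an assumption based on the model name in this card.
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("ZeroXClem/Llama-3-8B-ProLong-SAO-Roleplay-512")
+ print(config.max_position_embeddings)  # maximum context length in tokens
+ print(config.rope_theta)               # RoPE base frequency used for long-context scaling
+ ```
+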
+ ## Acknowledgments
+
+ - **Princeton NLP**: For creating the [ProLong](https://huggingface.co/princeton-nlp) models, which bring exceptional long-context handling to the Llama series.
+ - **Casual-Autopsy**: For providing the FP32 merge of [L3-bluuwhale-SAO-MIX](https://huggingface.co/Casual-Autopsy/L3-bluuwhale-SAO-MIX-8B-V1_fp32-merge-calc), a rich roleplay model that adds thematic depth and interaction diversity.
+ - **Bluuwhale**: For merging [L3-SAO-MIX-8B-V1](https://huggingface.co/bluuwhale/L3-SAO-MIX-8B-V1).
+ - **[Sao10K](https://huggingface.co/Sao10K)**: For creating the original roleplay models that give this merge its thematic depth and character continuity.
+
+ ## Citation
+
+ If you use this model, please consider citing the work of the ProLong developers:
+ ```bibtex
+ @article{gao2024prolong,
+   title={How to Train Long-Context Language Models (Effectively)},
+   author={Gao, Tianyu and Wettig, Alexander and Yen, Howard and Chen, Danqi},
+   journal={arXiv preprint arXiv:2410.02660},
+   year={2024}
+ }
+ ```