Crystalcareai committed on
Commit
437725d
1 Parent(s): 07a7bf9

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full set of changes.
Files changed (50)
  1. README.md +786 -0
  2. added_tokens.json +4 -0
  3. config.json +39 -0
  4. configuration_dbrx.py +264 -0
  5. generation_config.json +5 -0
  6. model-00001-of-00054.safetensors +3 -0
  7. model-00002-of-00054.safetensors +3 -0
  8. model-00003-of-00054.safetensors +3 -0
  9. model-00004-of-00054.safetensors +3 -0
  10. model-00005-of-00054.safetensors +3 -0
  11. model-00006-of-00054.safetensors +3 -0
  12. model-00007-of-00054.safetensors +3 -0
  13. model-00008-of-00054.safetensors +3 -0
  14. model-00009-of-00054.safetensors +3 -0
  15. model-00010-of-00054.safetensors +3 -0
  16. model-00011-of-00054.safetensors +3 -0
  17. model-00012-of-00054.safetensors +3 -0
  18. model-00013-of-00054.safetensors +3 -0
  19. model-00014-of-00054.safetensors +3 -0
  20. model-00015-of-00054.safetensors +3 -0
  21. model-00016-of-00054.safetensors +3 -0
  22. model-00017-of-00054.safetensors +3 -0
  23. model-00018-of-00054.safetensors +3 -0
  24. model-00019-of-00054.safetensors +3 -0
  25. model-00020-of-00054.safetensors +3 -0
  26. model-00021-of-00054.safetensors +3 -0
  27. model-00022-of-00054.safetensors +3 -0
  28. model-00023-of-00054.safetensors +3 -0
  29. model-00024-of-00054.safetensors +3 -0
  30. model-00025-of-00054.safetensors +3 -0
  31. model-00026-of-00054.safetensors +3 -0
  32. model-00027-of-00054.safetensors +3 -0
  33. model-00028-of-00054.safetensors +3 -0
  34. model-00029-of-00054.safetensors +3 -0
  35. model-00030-of-00054.safetensors +3 -0
  36. model-00031-of-00054.safetensors +3 -0
  37. model-00032-of-00054.safetensors +3 -0
  38. model-00033-of-00054.safetensors +3 -0
  39. model-00034-of-00054.safetensors +3 -0
  40. model-00035-of-00054.safetensors +3 -0
  41. model-00036-of-00054.safetensors +3 -0
  42. model-00037-of-00054.safetensors +3 -0
  43. model-00038-of-00054.safetensors +3 -0
  44. model-00039-of-00054.safetensors +3 -0
  45. model-00040-of-00054.safetensors +3 -0
  46. model-00041-of-00054.safetensors +3 -0
  47. model-00042-of-00054.safetensors +3 -0
  48. model-00043-of-00054.safetensors +3 -0
  49. model-00044-of-00054.safetensors +3 -0
  50. model-00045-of-00054.safetensors +3 -0
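Note that the checkpoint is sharded into 54 safetensors files, of which only the first 45 appear in this truncated listing. As a minimal sketch (the repository id below is a placeholder, not taken from this commit), the uploaded folder can be pulled back down in one call with `huggingface_hub`, the same library used for the upload:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- substitute the actual model repository.
repo_id = "your-org/your-dbrx-finetune"

# Downloads every file in the repo (all 54 model-*.safetensors shards,
# config.json, tokenizer files, ...) into the local HF cache and returns
# the path of the local snapshot directory.
local_dir = snapshot_download(repo_id=repo_id)
print(local_dir)
```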
README.md ADDED
@@ -0,0 +1,786 @@
+ ---
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: out
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
+ <details><summary>See axolotl config</summary>
+
+ axolotl version: `0.4.0`
+ ```yaml
+ base_model: /workspace/axolotl/dbrx-checkpoint
+ model_type: AutoModelForCausalLM
+ tokenizer_type: AutoTokenizer
+ trust_remote_code: true
+
+ load_in_8bit: false
+ # load_in_4bit: true
+ strict: false
+
+ # adapter: qlora
+ # lora_modules_to_save: [embed_tokens, lm_head]
+
+ # lora_r: 32
+ # lora_alpha: 16
+ # lora_dropout: 0.05
+ # lora_target_linear: false
+ # lora_fan_in_fan_out:
+
+ datasets:
+ - path: /workspace/datasets/dolphin-2.9/dolphin201-sharegpt2.jsonl
+ type: sharegpt
+ conversation: chatml
+ # - path: /workspace/datasets/dolphin-2.9/Ultrachat200kunfiltered.jsonl
+ # type: sharegpt
+ # conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/dolphin-coder-translate-sharegpt2.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/dolphin-coder-codegen-sharegpt2.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/m-a-p_Code-Feedback-sharegpt-unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/m-a-p_CodeFeedback-Filtered-Instruction-sharegpt-unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/not_samantha_norefusals.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/Orca-Math-resort-unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/agent_instruct_react_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/toolbench_instruct_j1s1_3k_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/toolbench_negative_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/toolbench_react_10p_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/toolbench_tflan_cot_30p_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ - path: /workspace/datasets/dolphin-2.9/openhermes200k_unfiltered.jsonl
+ type: sharegpt
+ conversation: chatml
+ # - path: /workspace/datasets/dolphin-2.9/SystemConversations.jsonl
+ # type: sharegpt
+ # conversation: chatml
+
+ chat_template: chatml
+
84
+ unfrozen_parameters:
85
+ - ^lm_head.weight$
86
+ # ffn.experts.mlp_experts.0.v1 layers
87
+ - transformer.blocks.30.ffn.experts.mlp_experts.0.v1
88
+ - transformer.blocks.32.ffn.experts.mlp_experts.0.v1
89
+ - transformer.blocks.25.ffn.experts.mlp_experts.0.v1
90
+ - transformer.blocks.15.ffn.experts.mlp_experts.0.v1
91
+ - transformer.blocks.22.ffn.experts.mlp_experts.0.v1
92
+ - transformer.blocks.31.ffn.experts.mlp_experts.0.v1
93
+ - transformer.blocks.7.ffn.experts.mlp_experts.0.v1
94
+ - transformer.blocks.21.ffn.experts.mlp_experts.0.v1
95
+ - transformer.blocks.8.ffn.experts.mlp_experts.0.v1
96
+ - transformer.blocks.23.ffn.experts.mlp_experts.0.v1
97
+ # ffn.experts.mlp_experts.0.w1 layers
98
+ - transformer.blocks.7.ffn.experts.mlp_experts.0.w1
99
+ - transformer.blocks.8.ffn.experts.mlp_experts.0.w1
100
+ - transformer.blocks.30.ffn.experts.mlp_experts.0.w1
101
+ - transformer.blocks.4.ffn.experts.mlp_experts.0.w1
102
+ - transformer.blocks.0.ffn.experts.mlp_experts.0.w1
103
+ - transformer.blocks.32.ffn.experts.mlp_experts.0.w1
104
+ - transformer.blocks.6.ffn.experts.mlp_experts.0.w1
105
+ - transformer.blocks.3.ffn.experts.mlp_experts.0.w1
106
+ - transformer.blocks.25.ffn.experts.mlp_experts.0.w1
107
+ - transformer.blocks.5.ffn.experts.mlp_experts.0.w1
108
+ # ffn.experts.mlp_experts.0.w2 layers
109
+ - transformer.blocks.25.ffn.experts.mlp_experts.0.w2
110
+ - transformer.blocks.22.ffn.experts.mlp_experts.0.w2
111
+ - transformer.blocks.27.ffn.experts.mlp_experts.0.w2
112
+ - transformer.blocks.26.ffn.experts.mlp_experts.0.w2
113
+ - transformer.blocks.4.ffn.experts.mlp_experts.0.w2
114
+ - transformer.blocks.29.ffn.experts.mlp_experts.0.w2
115
+ - transformer.blocks.32.ffn.experts.mlp_experts.0.w2
116
+ - transformer.blocks.5.ffn.experts.mlp_experts.0.w2
117
+ - transformer.blocks.7.ffn.experts.mlp_experts.0.w2
118
+ - transformer.blocks.3.ffn.experts.mlp_experts.0.w2
119
+ # ffn.experts.mlp_experts.1.v1 layers
120
+ - transformer.blocks.27.ffn.experts.mlp_experts.1.v1
121
+ - transformer.blocks.25.ffn.experts.mlp_experts.1.v1
122
+ - transformer.blocks.29.ffn.experts.mlp_experts.1.v1
123
+ - transformer.blocks.33.ffn.experts.mlp_experts.1.v1
124
+ - transformer.blocks.23.ffn.experts.mlp_experts.1.v1
125
+ - transformer.blocks.30.ffn.experts.mlp_experts.1.v1
126
+ - transformer.blocks.6.ffn.experts.mlp_experts.1.v1
127
+ - transformer.blocks.21.ffn.experts.mlp_experts.1.v1
128
+ - transformer.blocks.15.ffn.experts.mlp_experts.1.v1
129
+ - transformer.blocks.7.ffn.experts.mlp_experts.1.v1
130
+ # ffn.experts.mlp_experts.1.w1 layers
131
+ - transformer.blocks.0.ffn.experts.mlp_experts.1.w1
132
+ - transformer.blocks.6.ffn.experts.mlp_experts.1.w1
133
+ - transformer.blocks.7.ffn.experts.mlp_experts.1.w1
134
+ - transformer.blocks.4.ffn.experts.mlp_experts.1.w1
135
+ - transformer.blocks.8.ffn.experts.mlp_experts.1.w1
136
+ - transformer.blocks.29.ffn.experts.mlp_experts.1.w1
137
+ - transformer.blocks.33.ffn.experts.mlp_experts.1.w1
138
+ - transformer.blocks.27.ffn.experts.mlp_experts.1.w1
139
+ - transformer.blocks.1.ffn.experts.mlp_experts.1.w1
140
+ - transformer.blocks.10.ffn.experts.mlp_experts.1.w1
141
+ # ffn.experts.mlp_experts.1.w2 layers
142
+ - transformer.blocks.25.ffn.experts.mlp_experts.1.w2
143
+ - transformer.blocks.23.ffn.experts.mlp_experts.1.w2
144
+ - transformer.blocks.27.ffn.experts.mlp_experts.1.w2
145
+ - transformer.blocks.29.ffn.experts.mlp_experts.1.w2
146
+ - transformer.blocks.31.ffn.experts.mlp_experts.1.w2
147
+ - transformer.blocks.4.ffn.experts.mlp_experts.1.w2
148
+ - transformer.blocks.32.ffn.experts.mlp_experts.1.w2
149
+ - transformer.blocks.30.ffn.experts.mlp_experts.1.w2
150
+ - transformer.blocks.21.ffn.experts.mlp_experts.1.w2
151
+ - transformer.blocks.33.ffn.experts.mlp_experts.1.w2
152
+ # ffn.experts.mlp_experts.10.v1 layers
153
+ - transformer.blocks.28.ffn.experts.mlp_experts.10.v1
154
+ - transformer.blocks.34.ffn.experts.mlp_experts.10.v1
155
+ - transformer.blocks.33.ffn.experts.mlp_experts.10.v1
156
+ - transformer.blocks.26.ffn.experts.mlp_experts.10.v1
157
+ - transformer.blocks.32.ffn.experts.mlp_experts.10.v1
158
+ - transformer.blocks.30.ffn.experts.mlp_experts.10.v1
159
+ - transformer.blocks.36.ffn.experts.mlp_experts.10.v1
160
+ - transformer.blocks.24.ffn.experts.mlp_experts.10.v1
161
+ - transformer.blocks.20.ffn.experts.mlp_experts.10.v1
162
+ - transformer.blocks.35.ffn.experts.mlp_experts.10.v1
163
+ # ffn.experts.mlp_experts.10.w1 layers
164
+ - transformer.blocks.24.ffn.experts.mlp_experts.10.w1
165
+ - transformer.blocks.33.ffn.experts.mlp_experts.10.w1
166
+ - transformer.blocks.8.ffn.experts.mlp_experts.10.w1
167
+ - transformer.blocks.7.ffn.experts.mlp_experts.10.w1
168
+ - transformer.blocks.34.ffn.experts.mlp_experts.10.w1
169
+ - transformer.blocks.28.ffn.experts.mlp_experts.10.w1
170
+ - transformer.blocks.30.ffn.experts.mlp_experts.10.w1
171
+ - transformer.blocks.1.ffn.experts.mlp_experts.10.w1
172
+ - transformer.blocks.3.ffn.experts.mlp_experts.10.w1
173
+ - transformer.blocks.5.ffn.experts.mlp_experts.10.w1
174
+ # ffn.experts.mlp_experts.10.w2 layers
175
+ - transformer.blocks.24.ffn.experts.mlp_experts.10.w2
176
+ - transformer.blocks.28.ffn.experts.mlp_experts.10.w2
177
+ - transformer.blocks.23.ffn.experts.mlp_experts.10.w2
178
+ - transformer.blocks.30.ffn.experts.mlp_experts.10.w2
179
+ - transformer.blocks.32.ffn.experts.mlp_experts.10.w2
180
+ - transformer.blocks.3.ffn.experts.mlp_experts.10.w2
181
+ - transformer.blocks.33.ffn.experts.mlp_experts.10.w2
182
+ - transformer.blocks.26.ffn.experts.mlp_experts.10.w2
183
+ - transformer.blocks.2.ffn.experts.mlp_experts.10.w2
184
+ - transformer.blocks.20.ffn.experts.mlp_experts.10.w2
185
+ # ffn.experts.mlp_experts.11.w1 layers
186
+ - transformer.blocks.6.ffn.experts.mlp_experts.11.w1
187
+ - transformer.blocks.8.ffn.experts.mlp_experts.11.w1
188
+ - transformer.blocks.9.ffn.experts.mlp_experts.11.w1
189
+ - transformer.blocks.0.ffn.experts.mlp_experts.11.w1
190
+ - transformer.blocks.10.ffn.experts.mlp_experts.11.w1
191
+ - transformer.blocks.28.ffn.experts.mlp_experts.11.w1
192
+ - transformer.blocks.3.ffn.experts.mlp_experts.11.w1
193
+ - transformer.blocks.5.ffn.experts.mlp_experts.11.w1
194
+ - transformer.blocks.33.ffn.experts.mlp_experts.11.w1
195
+ - transformer.blocks.13.ffn.experts.mlp_experts.11.w1
196
+ # ffn.experts.mlp_experts.11.w2 layers
197
+ - transformer.blocks.27.ffn.experts.mlp_experts.11.w2
198
+ - transformer.blocks.24.ffn.experts.mlp_experts.11.w2
199
+ - transformer.blocks.29.ffn.experts.mlp_experts.11.w2
200
+ - transformer.blocks.30.ffn.experts.mlp_experts.11.w2
201
+ - transformer.blocks.22.ffn.experts.mlp_experts.11.w2
202
+ - transformer.blocks.6.ffn.experts.mlp_experts.11.w2
203
+ - transformer.blocks.25.ffn.experts.mlp_experts.11.w2
204
+ - transformer.blocks.7.ffn.experts.mlp_experts.11.w2
205
+ - transformer.blocks.28.ffn.experts.mlp_experts.11.w2
206
+ - transformer.blocks.5.ffn.experts.mlp_experts.11.w2
207
+ # ffn.experts.mlp_experts.12.v1 layers
208
+ - transformer.blocks.30.ffn.experts.mlp_experts.12.v1
209
+ - transformer.blocks.21.ffn.experts.mlp_experts.12.v1
210
+ - transformer.blocks.27.ffn.experts.mlp_experts.12.v1
211
+ - transformer.blocks.28.ffn.experts.mlp_experts.12.v1
212
+ - transformer.blocks.29.ffn.experts.mlp_experts.12.v1
213
+ - transformer.blocks.8.ffn.experts.mlp_experts.12.v1
214
+ - transformer.blocks.10.ffn.experts.mlp_experts.12.v1
215
+ - transformer.blocks.23.ffn.experts.mlp_experts.12.v1
216
+ - transformer.blocks.6.ffn.experts.mlp_experts.12.v1
217
+ - transformer.blocks.20.ffn.experts.mlp_experts.12.v1
218
+ # ffn.experts.mlp_experts.12.w1 layers
219
+ - transformer.blocks.8.ffn.experts.mlp_experts.12.w1
220
+ - transformer.blocks.1.ffn.experts.mlp_experts.12.w1
221
+ - transformer.blocks.0.ffn.experts.mlp_experts.12.w1
222
+ - transformer.blocks.6.ffn.experts.mlp_experts.12.w1
223
+ - transformer.blocks.9.ffn.experts.mlp_experts.12.w1
224
+ - transformer.blocks.2.ffn.experts.mlp_experts.12.w1
225
+ - transformer.blocks.10.ffn.experts.mlp_experts.12.w1
226
+ - transformer.blocks.17.ffn.experts.mlp_experts.12.w1
227
+ - transformer.blocks.29.ffn.experts.mlp_experts.12.w1
228
+ - transformer.blocks.21.ffn.experts.mlp_experts.12.w1
229
+ # ffn.experts.mlp_experts.12.w2 layers
230
+ - transformer.blocks.6.ffn.experts.mlp_experts.12.w2
231
+ - transformer.blocks.25.ffn.experts.mlp_experts.12.w2
232
+ - transformer.blocks.27.ffn.experts.mlp_experts.12.w2
233
+ - transformer.blocks.8.ffn.experts.mlp_experts.12.w2
234
+ - transformer.blocks.31.ffn.experts.mlp_experts.12.w2
235
+ - transformer.blocks.21.ffn.experts.mlp_experts.12.w2
236
+ - transformer.blocks.2.ffn.experts.mlp_experts.12.w2
237
+ - transformer.blocks.29.ffn.experts.mlp_experts.12.w2
238
+ - transformer.blocks.32.ffn.experts.mlp_experts.12.w2
239
+ - transformer.blocks.30.ffn.experts.mlp_experts.12.w2
240
+ # ffn.experts.mlp_experts.13.v1 layers
241
+ - transformer.blocks.31.ffn.experts.mlp_experts.13.v1
242
+ - transformer.blocks.24.ffn.experts.mlp_experts.13.v1
243
+ - transformer.blocks.30.ffn.experts.mlp_experts.13.v1
244
+ - transformer.blocks.29.ffn.experts.mlp_experts.13.v1
245
+ - transformer.blocks.8.ffn.experts.mlp_experts.13.v1
246
+ - transformer.blocks.10.ffn.experts.mlp_experts.13.v1
247
+ - transformer.blocks.11.ffn.experts.mlp_experts.13.v1
248
+ - transformer.blocks.27.ffn.experts.mlp_experts.13.v1
249
+ - transformer.blocks.25.ffn.experts.mlp_experts.13.v1
250
+ - transformer.blocks.36.ffn.experts.mlp_experts.13.v1
251
+ # ffn.experts.mlp_experts.13.w1 layers
252
+ - transformer.blocks.4.ffn.experts.mlp_experts.13.w1
253
+ - transformer.blocks.10.ffn.experts.mlp_experts.13.w1
254
+ - transformer.blocks.6.ffn.experts.mlp_experts.13.w1
255
+ - transformer.blocks.0.ffn.experts.mlp_experts.13.w1
256
+ - transformer.blocks.3.ffn.experts.mlp_experts.13.w1
257
+ - transformer.blocks.24.ffn.experts.mlp_experts.13.w1
258
+ - transformer.blocks.8.ffn.experts.mlp_experts.13.w1
259
+ - transformer.blocks.1.ffn.experts.mlp_experts.13.w1
260
+ - transformer.blocks.30.ffn.experts.mlp_experts.13.w1
261
+ - transformer.blocks.11.ffn.experts.mlp_experts.13.w1
262
+ # ffn.experts.mlp_experts.13.w2 layers
263
+ - transformer.blocks.24.ffn.experts.mlp_experts.13.w2
264
+ - transformer.blocks.20.ffn.experts.mlp_experts.13.w2
265
+ - transformer.blocks.25.ffn.experts.mlp_experts.13.w2
266
+ - transformer.blocks.27.ffn.experts.mlp_experts.13.w2
267
+ - transformer.blocks.3.ffn.experts.mlp_experts.13.w2
268
+ - transformer.blocks.4.ffn.experts.mlp_experts.13.w2
269
+ - transformer.blocks.29.ffn.experts.mlp_experts.13.w2
270
+ - transformer.blocks.6.ffn.experts.mlp_experts.13.w2
271
+ - transformer.blocks.30.ffn.experts.mlp_experts.13.w2
272
+ - transformer.blocks.31.ffn.experts.mlp_experts.13.w2
273
+ # ffn.experts.mlp_experts.14.v1 layers
274
+ - transformer.blocks.28.ffn.experts.mlp_experts.14.v1
275
+ - transformer.blocks.26.ffn.experts.mlp_experts.14.v1
276
+ - transformer.blocks.29.ffn.experts.mlp_experts.14.v1
277
+ - transformer.blocks.35.ffn.experts.mlp_experts.14.v1
278
+ - transformer.blocks.24.ffn.experts.mlp_experts.14.v1
279
+ - transformer.blocks.8.ffn.experts.mlp_experts.14.v1
280
+ - transformer.blocks.32.ffn.experts.mlp_experts.14.v1
281
+ - transformer.blocks.15.ffn.experts.mlp_experts.14.v1
282
+ - transformer.blocks.11.ffn.experts.mlp_experts.14.v1
283
+ - transformer.blocks.22.ffn.experts.mlp_experts.14.v1
284
+ # ffn.experts.mlp_experts.14.w1 layers
285
+ - transformer.blocks.8.ffn.experts.mlp_experts.14.w1
286
+ - transformer.blocks.4.ffn.experts.mlp_experts.14.w1
287
+ - transformer.blocks.5.ffn.experts.mlp_experts.14.w1
288
+ - transformer.blocks.7.ffn.experts.mlp_experts.14.w1
289
+ - transformer.blocks.3.ffn.experts.mlp_experts.14.w1
290
+ - transformer.blocks.13.ffn.experts.mlp_experts.14.w1
291
+ - transformer.blocks.29.ffn.experts.mlp_experts.14.w1
292
+ - transformer.blocks.6.ffn.experts.mlp_experts.14.w1
293
+ - transformer.blocks.28.ffn.experts.mlp_experts.14.w1
294
+ - transformer.blocks.9.ffn.experts.mlp_experts.14.w1
295
+ # ffn.experts.mlp_experts.14.w2 layers
296
+ - transformer.blocks.26.ffn.experts.mlp_experts.14.w2
297
+ - transformer.blocks.24.ffn.experts.mlp_experts.14.w2
298
+ - transformer.blocks.29.ffn.experts.mlp_experts.14.w2
299
+ - transformer.blocks.28.ffn.experts.mlp_experts.14.w2
300
+ - transformer.blocks.31.ffn.experts.mlp_experts.14.w2
301
+ - transformer.blocks.5.ffn.experts.mlp_experts.14.w2
302
+ - transformer.blocks.4.ffn.experts.mlp_experts.14.w2
303
+ - transformer.blocks.32.ffn.experts.mlp_experts.14.w2
304
+ - transformer.blocks.6.ffn.experts.mlp_experts.14.w2
305
+ - transformer.blocks.22.ffn.experts.mlp_experts.14.w2
306
+ # ffn.experts.mlp_experts.15.v1 layers
307
+ - transformer.blocks.33.ffn.experts.mlp_experts.15.v1
308
+ - transformer.blocks.26.ffn.experts.mlp_experts.15.v1
309
+ - transformer.blocks.31.ffn.experts.mlp_experts.15.v1
310
+ - transformer.blocks.28.ffn.experts.mlp_experts.15.v1
311
+ - transformer.blocks.9.ffn.experts.mlp_experts.15.v1
312
+ - transformer.blocks.34.ffn.experts.mlp_experts.15.v1
313
+ - transformer.blocks.29.ffn.experts.mlp_experts.15.v1
314
+ - transformer.blocks.7.ffn.experts.mlp_experts.15.v1
315
+ - transformer.blocks.17.ffn.experts.mlp_experts.15.v1
316
+ - transformer.blocks.15.ffn.experts.mlp_experts.15.v1
317
+ # ffn.experts.mlp_experts.15.w1 layers
318
+ - transformer.blocks.6.ffn.experts.mlp_experts.15.w1
319
+ - transformer.blocks.9.ffn.experts.mlp_experts.15.w1
320
+ - transformer.blocks.0.ffn.experts.mlp_experts.15.w1
321
+ - transformer.blocks.7.ffn.experts.mlp_experts.15.w1
322
+ - transformer.blocks.14.ffn.experts.mlp_experts.15.w1
323
+ - transformer.blocks.33.ffn.experts.mlp_experts.15.w1
324
+ - transformer.blocks.34.ffn.experts.mlp_experts.15.w1
325
+ - transformer.blocks.10.ffn.experts.mlp_experts.15.w1
326
+ - transformer.blocks.5.ffn.experts.mlp_experts.15.w1
327
+ - transformer.blocks.29.ffn.experts.mlp_experts.15.w1
328
+ # ffn.experts.mlp_experts.15.w2 layers
329
+ - transformer.blocks.28.ffn.experts.mlp_experts.15.w2
330
+ - transformer.blocks.26.ffn.experts.mlp_experts.15.w2
331
+ - transformer.blocks.27.ffn.experts.mlp_experts.15.w2
332
+ - transformer.blocks.29.ffn.experts.mlp_experts.15.w2
333
+ - transformer.blocks.6.ffn.experts.mlp_experts.15.w2
334
+ - transformer.blocks.31.ffn.experts.mlp_experts.15.w2
335
+ - transformer.blocks.7.ffn.experts.mlp_experts.15.w2
336
+ - transformer.blocks.33.ffn.experts.mlp_experts.15.w2
337
+ - transformer.blocks.32.ffn.experts.mlp_experts.15.w2
338
+ - transformer.blocks.25.ffn.experts.mlp_experts.15.w2
339
+ # ffn.experts.mlp_experts.2.v1 layers
340
+ - transformer.blocks.31.ffn.experts.mlp_experts.2.v1
341
+ - transformer.blocks.27.ffn.experts.mlp_experts.2.v1
342
+ - transformer.blocks.28.ffn.experts.mlp_experts.2.v1
343
+ - transformer.blocks.30.ffn.experts.mlp_experts.2.v1
344
+ - transformer.blocks.23.ffn.experts.mlp_experts.2.v1
345
+ - transformer.blocks.32.ffn.experts.mlp_experts.2.v1
346
+ - transformer.blocks.35.ffn.experts.mlp_experts.2.v1
347
+ - transformer.blocks.7.ffn.experts.mlp_experts.2.v1
348
+ - transformer.blocks.21.ffn.experts.mlp_experts.2.v1
349
+ - transformer.blocks.15.ffn.experts.mlp_experts.2.v1
350
+ # ffn.experts.mlp_experts.2.w1 layers
351
+ - transformer.blocks.7.ffn.experts.mlp_experts.2.w1
352
+ - transformer.blocks.6.ffn.experts.mlp_experts.2.w1
353
+ - transformer.blocks.1.ffn.experts.mlp_experts.2.w1
354
+ - transformer.blocks.4.ffn.experts.mlp_experts.2.w1
355
+ - transformer.blocks.5.ffn.experts.mlp_experts.2.w1
356
+ - transformer.blocks.29.ffn.experts.mlp_experts.2.w1
357
+ - transformer.blocks.0.ffn.experts.mlp_experts.2.w1
358
+ - transformer.blocks.9.ffn.experts.mlp_experts.2.w1
359
+ - transformer.blocks.31.ffn.experts.mlp_experts.2.w1
360
+ - transformer.blocks.30.ffn.experts.mlp_experts.2.w1
361
+ # ffn.experts.mlp_experts.2.w2 layers
362
+ - transformer.blocks.26.ffn.experts.mlp_experts.2.w2
363
+ - transformer.blocks.27.ffn.experts.mlp_experts.2.w2
364
+ - transformer.blocks.33.ffn.experts.mlp_experts.2.w2
365
+ - transformer.blocks.5.ffn.experts.mlp_experts.2.w2
366
+ - transformer.blocks.23.ffn.experts.mlp_experts.2.w2
367
+ - transformer.blocks.32.ffn.experts.mlp_experts.2.w2
368
+ - transformer.blocks.28.ffn.experts.mlp_experts.2.w2
369
+ - transformer.blocks.4.ffn.experts.mlp_experts.2.w2
370
+ - transformer.blocks.29.ffn.experts.mlp_experts.2.w2
371
+ - transformer.blocks.30.ffn.experts.mlp_experts.2.w2
372
+ # ffn.experts.mlp_experts.3.v1 layers
373
+ - transformer.blocks.28.ffn.experts.mlp_experts.3.v1
374
+ - transformer.blocks.33.ffn.experts.mlp_experts.3.v1
375
+ - transformer.blocks.36.ffn.experts.mlp_experts.3.v1
376
+ - transformer.blocks.29.ffn.experts.mlp_experts.3.v1
377
+ - transformer.blocks.30.ffn.experts.mlp_experts.3.v1
378
+ - transformer.blocks.7.ffn.experts.mlp_experts.3.v1
379
+ - transformer.blocks.14.ffn.experts.mlp_experts.3.v1
380
+ - transformer.blocks.10.ffn.experts.mlp_experts.3.v1
381
+ - transformer.blocks.31.ffn.experts.mlp_experts.3.v1
382
+ - transformer.blocks.21.ffn.experts.mlp_experts.3.v1
383
+ # ffn.experts.mlp_experts.3.w1 layers
384
+ - transformer.blocks.7.ffn.experts.mlp_experts.3.w1
385
+ - transformer.blocks.0.ffn.experts.mlp_experts.3.w1
386
+ - transformer.blocks.10.ffn.experts.mlp_experts.3.w1
387
+ - transformer.blocks.9.ffn.experts.mlp_experts.3.w1
388
+ - transformer.blocks.29.ffn.experts.mlp_experts.3.w1
389
+ - transformer.blocks.5.ffn.experts.mlp_experts.3.w1
390
+ - transformer.blocks.30.ffn.experts.mlp_experts.3.w1
391
+ - transformer.blocks.4.ffn.experts.mlp_experts.3.w1
392
+ - transformer.blocks.33.ffn.experts.mlp_experts.3.w1
393
+ - transformer.blocks.1.ffn.experts.mlp_experts.3.w1
394
+ # ffn.experts.mlp_experts.3.w2 layers
395
+ - transformer.blocks.28.ffn.experts.mlp_experts.3.w2
396
+ - transformer.blocks.5.ffn.experts.mlp_experts.3.w2
397
+ - transformer.blocks.24.ffn.experts.mlp_experts.3.w2
398
+ - transformer.blocks.31.ffn.experts.mlp_experts.3.w2
399
+ - transformer.blocks.30.ffn.experts.mlp_experts.3.w2
400
+ - transformer.blocks.21.ffn.experts.mlp_experts.3.w2
401
+ - transformer.blocks.32.ffn.experts.mlp_experts.3.w2
402
+ - transformer.blocks.29.ffn.experts.mlp_experts.3.w2
403
+ - transformer.blocks.26.ffn.experts.mlp_experts.3.w2
404
+ - transformer.blocks.2.ffn.experts.mlp_experts.3.w2
405
+ # ffn.experts.mlp_experts.4.v1 layers
406
+ - transformer.blocks.34.ffn.experts.mlp_experts.4.v1
407
+ - transformer.blocks.31.ffn.experts.mlp_experts.4.v1
408
+ - transformer.blocks.26.ffn.experts.mlp_experts.4.v1
409
+ - transformer.blocks.24.ffn.experts.mlp_experts.4.v1
410
+ - transformer.blocks.14.ffn.experts.mlp_experts.4.v1
411
+ - transformer.blocks.32.ffn.experts.mlp_experts.4.v1
412
+ - transformer.blocks.7.ffn.experts.mlp_experts.4.v1
413
+ - transformer.blocks.6.ffn.experts.mlp_experts.4.v1
414
+ - transformer.blocks.20.ffn.experts.mlp_experts.4.v1
415
+ - transformer.blocks.9.ffn.experts.mlp_experts.4.v1
416
+ # ffn.experts.mlp_experts.4.w1 layers
417
+ - transformer.blocks.6.ffn.experts.mlp_experts.4.w1
418
+ - transformer.blocks.4.ffn.experts.mlp_experts.4.w1
419
+ - transformer.blocks.7.ffn.experts.mlp_experts.4.w1
420
+ - transformer.blocks.9.ffn.experts.mlp_experts.4.w1
421
+ - transformer.blocks.0.ffn.experts.mlp_experts.4.w1
422
+ - transformer.blocks.5.ffn.experts.mlp_experts.4.w1
423
+ - transformer.blocks.14.ffn.experts.mlp_experts.4.w1
424
+ - transformer.blocks.34.ffn.experts.mlp_experts.4.w1
425
+ - transformer.blocks.8.ffn.experts.mlp_experts.4.w1
426
+ - transformer.blocks.29.ffn.experts.mlp_experts.4.w1
427
+ # ffn.experts.mlp_experts.4.w2 layers
428
+ - transformer.blocks.25.ffn.experts.mlp_experts.4.w2
429
+ - transformer.blocks.24.ffn.experts.mlp_experts.4.w2
430
+ - transformer.blocks.26.ffn.experts.mlp_experts.4.w2
431
+ - transformer.blocks.5.ffn.experts.mlp_experts.4.w2
432
+ - transformer.blocks.6.ffn.experts.mlp_experts.4.w2
433
+ - transformer.blocks.32.ffn.experts.mlp_experts.4.w2
434
+ - transformer.blocks.4.ffn.experts.mlp_experts.4.w2
435
+ - transformer.blocks.36.ffn.experts.mlp_experts.4.w2
436
+ - transformer.blocks.29.ffn.experts.mlp_experts.4.w2
437
+ - transformer.blocks.27.ffn.experts.mlp_experts.4.w2
438
+ # ffn.experts.mlp_experts.5.v1 layers
439
+ - transformer.blocks.35.ffn.experts.mlp_experts.5.v1
440
+ - transformer.blocks.30.ffn.experts.mlp_experts.5.v1
441
+ - transformer.blocks.28.ffn.experts.mlp_experts.5.v1
442
+ - transformer.blocks.32.ffn.experts.mlp_experts.5.v1
443
+ - transformer.blocks.27.ffn.experts.mlp_experts.5.v1
444
+ - transformer.blocks.26.ffn.experts.mlp_experts.5.v1
445
+ - transformer.blocks.33.ffn.experts.mlp_experts.5.v1
446
+ - transformer.blocks.29.ffn.experts.mlp_experts.5.v1
447
+ - transformer.blocks.8.ffn.experts.mlp_experts.5.v1
448
+ - transformer.blocks.7.ffn.experts.mlp_experts.5.v1
449
+ # ffn.experts.mlp_experts.5.w1 layers
450
+ - transformer.blocks.0.ffn.experts.mlp_experts.5.w1
451
+ - transformer.blocks.6.ffn.experts.mlp_experts.5.w1
452
+ - transformer.blocks.7.ffn.experts.mlp_experts.5.w1
453
+ - transformer.blocks.9.ffn.experts.mlp_experts.5.w1
454
+ - transformer.blocks.8.ffn.experts.mlp_experts.5.w1
455
+ - transformer.blocks.12.ffn.experts.mlp_experts.5.w1
456
+ - transformer.blocks.3.ffn.experts.mlp_experts.5.w1
457
+ - transformer.blocks.5.ffn.experts.mlp_experts.5.w1
458
+ - transformer.blocks.4.ffn.experts.mlp_experts.5.w1
459
+ - transformer.blocks.33.ffn.experts.mlp_experts.5.w1
460
+ # ffn.experts.mlp_experts.5.w2 layers
461
+ - transformer.blocks.26.ffn.experts.mlp_experts.5.w2
462
+ - transformer.blocks.28.ffn.experts.mlp_experts.5.w2
463
+ - transformer.blocks.6.ffn.experts.mlp_experts.5.w2
464
+ - transformer.blocks.33.ffn.experts.mlp_experts.5.w2
465
+ - transformer.blocks.5.ffn.experts.mlp_experts.5.w2
466
+ - transformer.blocks.27.ffn.experts.mlp_experts.5.w2
467
+ - transformer.blocks.3.ffn.experts.mlp_experts.5.w2
468
+ - transformer.blocks.29.ffn.experts.mlp_experts.5.w2
469
+ - transformer.blocks.25.ffn.experts.mlp_experts.5.w2
470
+ - transformer.blocks.7.ffn.experts.mlp_experts.5.w2
471
+ # ffn.experts.mlp_experts.6.v1 layers
472
+ - transformer.blocks.34.ffn.experts.mlp_experts.6.v1
473
+ - transformer.blocks.31.ffn.experts.mlp_experts.6.v1
474
+ - transformer.blocks.30.ffn.experts.mlp_experts.6.v1
475
+ - transformer.blocks.26.ffn.experts.mlp_experts.6.v1
476
+ - transformer.blocks.35.ffn.experts.mlp_experts.6.v1
477
+ - transformer.blocks.20.ffn.experts.mlp_experts.6.v1
478
+ - transformer.blocks.15.ffn.experts.mlp_experts.6.v1
479
+ - transformer.blocks.29.ffn.experts.mlp_experts.6.v1
480
+ - transformer.blocks.10.ffn.experts.mlp_experts.6.v1
481
+ - transformer.blocks.24.ffn.experts.mlp_experts.6.v1
482
+ # ffn.experts.mlp_experts.6.w1 layers
483
+ - transformer.blocks.0.ffn.experts.mlp_experts.6.w1
484
+ - transformer.blocks.10.ffn.experts.mlp_experts.6.w1
485
+ - transformer.blocks.9.ffn.experts.mlp_experts.6.w1
486
+ - transformer.blocks.30.ffn.experts.mlp_experts.6.w1
487
+ - transformer.blocks.4.ffn.experts.mlp_experts.6.w1
488
+ - transformer.blocks.34.ffn.experts.mlp_experts.6.w1
489
+ - transformer.blocks.26.ffn.experts.mlp_experts.6.w1
490
+ - transformer.blocks.2.ffn.experts.mlp_experts.6.w1
491
+ - transformer.blocks.29.ffn.experts.mlp_experts.6.w1
492
+ - transformer.blocks.8.ffn.experts.mlp_experts.6.w1
493
+ # ffn.experts.mlp_experts.6.w2 layers
494
+ - transformer.blocks.24.ffn.experts.mlp_experts.6.w2
495
+ - transformer.blocks.26.ffn.experts.mlp_experts.6.w2
496
+ - transformer.blocks.32.ffn.experts.mlp_experts.6.w2
497
+ - transformer.blocks.30.ffn.experts.mlp_experts.6.w2
498
+ - transformer.blocks.25.ffn.experts.mlp_experts.6.w2
499
+ - transformer.blocks.31.ffn.experts.mlp_experts.6.w2
500
+ - transformer.blocks.20.ffn.experts.mlp_experts.6.w2
501
+ - transformer.blocks.4.ffn.experts.mlp_experts.6.w2
502
+ - transformer.blocks.2.ffn.experts.mlp_experts.6.w2
503
+ - transformer.blocks.9.ffn.experts.mlp_experts.6.w2
504
+ # ffn.experts.mlp_experts.7.v1 layers
505
+ - transformer.blocks.27.ffn.experts.mlp_experts.7.v1
506
+ - transformer.blocks.28.ffn.experts.mlp_experts.7.v1
507
+ - transformer.blocks.33.ffn.experts.mlp_experts.7.v1
508
+ - transformer.blocks.29.ffn.experts.mlp_experts.7.v1
509
+ - transformer.blocks.24.ffn.experts.mlp_experts.7.v1
510
+ - transformer.blocks.11.ffn.experts.mlp_experts.7.v1
511
+ - transformer.blocks.12.ffn.experts.mlp_experts.7.v1
512
+ - transformer.blocks.10.ffn.experts.mlp_experts.7.v1
513
+ - transformer.blocks.23.ffn.experts.mlp_experts.7.v1
514
+ - transformer.blocks.34.ffn.experts.mlp_experts.7.v1
515
+ # ffn.experts.mlp_experts.7.w1 layers
516
+ - transformer.blocks.12.ffn.experts.mlp_experts.7.w1
517
+ - transformer.blocks.0.ffn.experts.mlp_experts.7.w1
518
+ - transformer.blocks.5.ffn.experts.mlp_experts.7.w1
519
+ - transformer.blocks.29.ffn.experts.mlp_experts.7.w1
520
+ - transformer.blocks.10.ffn.experts.mlp_experts.7.w1
521
+ - transformer.blocks.4.ffn.experts.mlp_experts.7.w1
522
+ - transformer.blocks.3.ffn.experts.mlp_experts.7.w1
523
+ - transformer.blocks.8.ffn.experts.mlp_experts.7.w1
524
+ - transformer.blocks.34.ffn.experts.mlp_experts.7.w1
525
+ - transformer.blocks.33.ffn.experts.mlp_experts.7.w1
526
+ # ffn.experts.mlp_experts.7.w2 layers
527
+ - transformer.blocks.23.ffn.experts.mlp_experts.7.w2
528
+ - transformer.blocks.24.ffn.experts.mlp_experts.7.w2
529
+ - transformer.blocks.31.ffn.experts.mlp_experts.7.w2
530
+ - transformer.blocks.28.ffn.experts.mlp_experts.7.w2
531
+ - transformer.blocks.27.ffn.experts.mlp_experts.7.w2
532
+ - transformer.blocks.5.ffn.experts.mlp_experts.7.w2
533
+ - transformer.blocks.25.ffn.experts.mlp_experts.7.w2
534
+ - transformer.blocks.29.ffn.experts.mlp_experts.7.w2
535
+ - transformer.blocks.3.ffn.experts.mlp_experts.7.w2
536
+ - transformer.blocks.33.ffn.experts.mlp_experts.7.w2
537
+ # ffn.experts.mlp_experts.8.v1 layers
538
+ - transformer.blocks.30.ffn.experts.mlp_experts.8.v1
539
+ - transformer.blocks.27.ffn.experts.mlp_experts.8.v1
540
+ - transformer.blocks.20.ffn.experts.mlp_experts.8.v1
541
+ - transformer.blocks.32.ffn.experts.mlp_experts.8.v1
542
+ - transformer.blocks.34.ffn.experts.mlp_experts.8.v1
543
+ - transformer.blocks.33.ffn.experts.mlp_experts.8.v1
544
+ - transformer.blocks.9.ffn.experts.mlp_experts.8.v1
545
+ - transformer.blocks.7.ffn.experts.mlp_experts.8.v1
546
+ - transformer.blocks.6.ffn.experts.mlp_experts.8.v1
547
+ - transformer.blocks.24.ffn.experts.mlp_experts.8.v1
548
+ # ffn.experts.mlp_experts.8.w1 layers
549
+ - transformer.blocks.7.ffn.experts.mlp_experts.8.w1
550
+ - transformer.blocks.6.ffn.experts.mlp_experts.8.w1
551
+ - transformer.blocks.0.ffn.experts.mlp_experts.8.w1
552
+ - transformer.blocks.9.ffn.experts.mlp_experts.8.w1
553
+ - transformer.blocks.3.ffn.experts.mlp_experts.8.w1
554
+ - transformer.blocks.2.ffn.experts.mlp_experts.8.w1
555
+ - transformer.blocks.8.ffn.experts.mlp_experts.8.w1
556
+ - transformer.blocks.30.ffn.experts.mlp_experts.8.w1
557
+ - transformer.blocks.24.ffn.experts.mlp_experts.8.w1
558
+ - transformer.blocks.1.ffn.experts.mlp_experts.8.w1
559
+ # ffn.experts.mlp_experts.8.w2 layers
560
+ - transformer.blocks.32.ffn.experts.mlp_experts.8.w2
561
+ - transformer.blocks.24.ffn.experts.mlp_experts.8.w2
562
+ - transformer.blocks.27.ffn.experts.mlp_experts.8.w2
563
+ - transformer.blocks.30.ffn.experts.mlp_experts.8.w2
564
+ - transformer.blocks.31.ffn.experts.mlp_experts.8.w2
565
+ - transformer.blocks.28.ffn.experts.mlp_experts.8.w2
566
+ - transformer.blocks.2.ffn.experts.mlp_experts.8.w2
567
+ - transformer.blocks.3.ffn.experts.mlp_experts.8.w2
568
+ - transformer.blocks.23.ffn.experts.mlp_experts.8.w2
569
+ - transformer.blocks.29.ffn.experts.mlp_experts.8.w2
570
+ # ffn.experts.mlp_experts.9.v1 layers
571
+ - transformer.blocks.31.ffn.experts.mlp_experts.9.v1
572
+ - transformer.blocks.27.ffn.experts.mlp_experts.9.v1
573
+ - transformer.blocks.29.ffn.experts.mlp_experts.9.v1
574
+ - transformer.blocks.33.ffn.experts.mlp_experts.9.v1
575
+ - transformer.blocks.25.ffn.experts.mlp_experts.9.v1
576
+ - transformer.blocks.14.ffn.experts.mlp_experts.9.v1
577
+ - transformer.blocks.32.ffn.experts.mlp_experts.9.v1
578
+ - transformer.blocks.7.ffn.experts.mlp_experts.9.v1
579
+ - transformer.blocks.9.ffn.experts.mlp_experts.9.v1
580
+ - transformer.blocks.34.ffn.experts.mlp_experts.9.v1
581
+ # ffn.experts.mlp_experts.9.w1 layers
582
+ - transformer.blocks.7.ffn.experts.mlp_experts.9.w1
583
+ - transformer.blocks.1.ffn.experts.mlp_experts.9.w1
584
+ - transformer.blocks.9.ffn.experts.mlp_experts.9.w1
585
+ - transformer.blocks.2.ffn.experts.mlp_experts.9.w1
586
+ - transformer.blocks.27.ffn.experts.mlp_experts.9.w1
587
+ - transformer.blocks.12.ffn.experts.mlp_experts.9.w1
588
+ - transformer.blocks.4.ffn.experts.mlp_experts.9.w1
589
+ - transformer.blocks.6.ffn.experts.mlp_experts.9.w1
590
+ - transformer.blocks.19.ffn.experts.mlp_experts.9.w1
591
+ - transformer.blocks.8.ffn.experts.mlp_experts.9.w1
592
+ # ffn.experts.mlp_experts.9.w2 layers
593
+ - transformer.blocks.26.ffn.experts.mlp_experts.9.w2
594
+ - transformer.blocks.25.ffn.experts.mlp_experts.9.w2
595
+ - transformer.blocks.28.ffn.experts.mlp_experts.9.w2
596
+ - transformer.blocks.27.ffn.experts.mlp_experts.9.w2
597
+ - transformer.blocks.31.ffn.experts.mlp_experts.9.w2
598
+ - transformer.blocks.29.ffn.experts.mlp_experts.9.w2
599
+ - transformer.blocks.7.ffn.experts.mlp_experts.9.w2
600
+ - transformer.blocks.34.ffn.experts.mlp_experts.9.w2
601
+ - transformer.blocks.2.ffn.experts.mlp_experts.9.w2
602
+ - transformer.blocks.33.ffn.experts.mlp_experts.9.w2
603
+ # ffn.router.layer layers
604
+ - transformer.blocks.2.ffn.router.layer
605
+ - transformer.blocks.3.ffn.router.layer
606
+ - transformer.blocks.4.ffn.router.layer
607
+ - transformer.blocks.5.ffn.router.layer
608
+ - transformer.blocks.6.ffn.router.layer
609
+ - transformer.blocks.7.ffn.router.layer
610
+ - transformer.blocks.8.ffn.router.layer
611
+ - transformer.blocks.9.ffn.router.layer
612
+ - transformer.blocks.10.ffn.router.layer
613
+ - transformer.blocks.11.ffn.router.layer
614
+ # norm_attn_norm.attn.Wqkv layers
615
+ - transformer.blocks.16.norm_attn_norm.attn.Wqkv
616
+ - transformer.blocks.15.norm_attn_norm.attn.Wqkv
617
+ - transformer.blocks.11.norm_attn_norm.attn.Wqkv
618
+ - transformer.blocks.14.norm_attn_norm.attn.Wqkv
619
+ - transformer.blocks.12.norm_attn_norm.attn.Wqkv
620
+ - transformer.blocks.20.norm_attn_norm.attn.Wqkv
621
+ - transformer.blocks.10.norm_attn_norm.attn.Wqkv
622
+ - transformer.blocks.9.norm_attn_norm.attn.Wqkv
623
+ - transformer.blocks.19.norm_attn_norm.attn.Wqkv
624
+ - transformer.blocks.18.norm_attn_norm.attn.Wqkv
625
+ # norm_attn_norm.attn.out_proj layers
626
+ - transformer.blocks.1.norm_attn_norm.attn.out_proj
627
+ - transformer.blocks.18.norm_attn_norm.attn.out_proj
628
+ - transformer.blocks.2.norm_attn_norm.attn.out_proj
629
+ - transformer.blocks.16.norm_attn_norm.attn.out_proj
630
+ - transformer.blocks.0.norm_attn_norm.attn.out_proj
631
+ - transformer.blocks.39.norm_attn_norm.attn.out_proj
632
+ - transformer.blocks.23.norm_attn_norm.attn.out_proj
633
+ - transformer.blocks.8.norm_attn_norm.attn.out_proj
634
+ - transformer.blocks.24.norm_attn_norm.attn.out_proj
635
+ - transformer.blocks.19.norm_attn_norm.attn.out_proj
636
+ # norm_attn_norm.norm_1 layers
637
+ - transformer.blocks.0.norm_attn_norm.norm_1
638
+ - transformer.blocks.1.norm_attn_norm.norm_1
639
+ - transformer.blocks.2.norm_attn_norm.norm_1
640
+ - transformer.blocks.3.norm_attn_norm.norm_1
641
+ - transformer.blocks.4.norm_attn_norm.norm_1
642
+ - transformer.blocks.5.norm_attn_norm.norm_1
643
+ - transformer.blocks.6.norm_attn_norm.norm_1
644
+ - transformer.blocks.7.norm_attn_norm.norm_1
645
+ - transformer.blocks.8.norm_attn_norm.norm_1
646
+ - transformer.blocks.9.norm_attn_norm.norm_1
647
+ # norm_attn_norm.norm_2 layers
648
+ - transformer.blocks.0.norm_attn_norm.norm_2
649
+ - transformer.blocks.1.norm_attn_norm.norm_2
650
+ - transformer.blocks.2.norm_attn_norm.norm_2
651
+ - transformer.blocks.3.norm_attn_norm.norm_2
652
+ - transformer.blocks.4.norm_attn_norm.norm_2
653
+ - transformer.blocks.5.norm_attn_norm.norm_2
654
+ - transformer.blocks.6.norm_attn_norm.norm_2
655
+ - transformer.blocks.7.norm_attn_norm.norm_2
656
+ - transformer.blocks.8.norm_attn_norm.norm_2
657
+ - transformer.blocks.9.norm_attn_norm.norm_2
658
+ # transformer.norm_f layers
659
+ # transformer.wte layers
660
+ # ffn.experts.mlp_experts.11.v1 layers
661
+ - transformer.blocks.29.ffn.experts.mlp_experts.11.v1
662
+ - transformer.blocks.27.ffn.experts.mlp_experts.11.v1
663
+ - transformer.blocks.30.ffn.experts.mlp_experts.11.v1
664
+ - transformer.blocks.28.ffn.experts.mlp_experts.11.v1
665
+ - transformer.blocks.22.ffn.experts.mlp_experts.11.v1
666
+ - transformer.blocks.7.ffn.experts.mlp_experts.11.v1
667
+ - transformer.blocks.24.ffn.experts.mlp_experts.11.v1
668
+ - transformer.blocks.8.ffn.experts.mlp_experts.11.v1
669
+ - transformer.blocks.6.ffn.experts.mlp_experts.11.v1
670
+ - transformer.blocks.12.ffn.experts.mlp_experts.11.v1
+
+
+
+ dataset_prepared_path: dbrx2
+ val_set_size: 0.01
+ output_dir: ./out
+
+ sequence_len: 4096
+ sample_packing: true
+ pad_to_sequence_len: true
+
+ wandb_project: dolphin-2.9-Dbrx
+ wandb_watch:
+ wandb_run_id:
+ wandb_log_model:
+
+ gradient_accumulation_steps: 8
+ micro_batch_size: 1
+ num_epochs: 1
+ optimizer: paged_adamw_8bit
+ lr_scheduler: cosine
+ learning_rate: 1e-5
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: true
+
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+ use_reentrant: false
+ early_stopping_patience:
+ # resume_from_checkpoint: /workspace/axolotl/dbrx-checkpoint
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ warmup_steps: 10
+ evals_per_epoch: 4
+ eval_table_size:
+ saves_per_epoch: 4
+ save_total_limit: 2
+ save_steps:
+ debug:
+ deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_params.json
+ weight_decay: 0.05
+ fsdp:
+ fsdp_config:
+ special_tokens:
+ bos_token: "<|endoftext|>"
+ eos_token: "<|im_end|>"
+ pad_token: "<|pad|>"
+ unk_token: "<|endoftext|>"
+ tokens:
+ - "<|im_start|>"
+ - "<|im_end|>"
+
+
+ ```
+
+ </details><br>
+
+ # out
+
+ This model is a fine-tune of the DBRX base checkpoint at `/workspace/axolotl/dbrx-checkpoint` (see the axolotl config above), trained on the Dolphin 2.9 dataset mix listed in that config.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.4336
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 1e-05
+ - train_batch_size: 1
+ - eval_batch_size: 1
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 8
+ - gradient_accumulation_steps: 8
+ - total_train_batch_size: 64 (1 micro-batch × 8 gradient-accumulation steps × 8 GPUs)
+ - total_eval_batch_size: 8
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 10
+ - num_epochs: 1
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:-----:|:----:|:---------------:|
+ | 0.4009 | 0.0 | 1 | 0.4328 |
+ | 0.413 | 0.25 | 587 | 0.4408 |
+ | 0.3626 | 0.5 | 1174 | 0.4368 |
+ | 0.3896 | 0.75 | 1761 | 0.4336 |
+
+
+ ### Framework versions
+
+ - Transformers 4.40.0.dev0
+ - Pytorch 2.2.2+cu121
+ - Datasets 2.15.0
+ - Tokenizers 0.15.0
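### Example usage (inference sketch)

The axolotl config above sets `chat_template: chatml` and adds `<|im_start|>` / `<|im_end|>` as special tokens, so a minimal inference sketch would look like the following. The repository id is a placeholder, `trust_remote_code=True` is needed because the model ships its own `configuration_dbrx.py` / `modeling_dbrx.py`, and this assumes the ChatML template is stored in `tokenizer_config.json` as the config requests:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-dbrx-finetune"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in bfloat16
    device_map="auto",           # needs accelerate; ~260 GB of memory for the bf16 weights
    trust_remote_code=True,
)

# ChatML-style conversation; apply_chat_template renders it with the
# <|im_start|>/<|im_end|> tokens listed in the config above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about mixture-of-experts models."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```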
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "<|im_end|>": 100278,
+ "<|im_start|>": 100277
+ }
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "_name_or_path": "/workspace/axolotl/dbrx-checkpoint",
+ "architectures": [
+ "DbrxForCausalLM"
+ ],
+ "attn_config": {
+ "clip_qkv": 8,
+ "kv_n_heads": 8,
+ "model_type": "",
+ "rope_theta": 500000
+ },
+ "auto_map": {
+ "AutoConfig": "configuration_dbrx.DbrxConfig",
+ "AutoModelForCausalLM": "modeling_dbrx.DbrxForCausalLM"
+ },
+ "d_model": 6144,
+ "emb_pdrop": 0.0,
+ "ffn_config": {
+ "ffn_hidden_size": 10752,
+ "model_type": "",
+ "moe_jitter_eps": 0.01,
+ "moe_loss_weight": 0.05,
+ "moe_num_experts": 16,
+ "moe_top_k": 4
+ },
+ "initializer_range": 0.02,
+ "max_seq_len": 32768,
+ "model_type": "dbrx",
+ "n_heads": 48,
+ "n_layers": 40,
+ "output_router_logits": false,
+ "resid_pdrop": 0.0,
+ "router_aux_loss_coef": 0.05,
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.40.0.dev0",
+ "use_cache": false,
+ "vocab_size": 100352
+ }
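To make the MoE shape in this config concrete, here is a rough back-of-the-envelope parameter count derived only from the fields above (`d_model: 6144`, `n_layers: 40`, 16 experts with `ffn_hidden_size: 10752`, top-4 routing, 8 KV heads, `vocab_size: 100352`, untied embeddings); it lands near the publicly stated ~132B total / ~36B active parameters for DBRX. Norms, the router, and any biases are ignored, so treat the numbers as estimates:

```python
# Rough parameter estimate from the config.json fields above.
d_model, n_layers, n_heads, kv_n_heads = 6144, 40, 48, 8
ffn_hidden, n_experts, top_k = 10752, 16, 4
vocab = 100352
head_dim = d_model // n_heads  # 128

attn = d_model * (d_model + 2 * kv_n_heads * head_dim)  # fused Wqkv
attn += d_model * d_model                                # out_proj

expert = 3 * d_model * ffn_hidden        # w1, v1, w2 per expert
ffn_total = n_experts * expert           # all experts stored
ffn_active = top_k * expert              # experts used per token

emb = 2 * vocab * d_model                # wte + lm_head (untied)

total = n_layers * (attn + ffn_total) + emb
active = n_layers * (attn + ffn_active) + emb
print(f"total  ~ {total / 1e9:.0f}B")    # ~132B
print(f"active ~ {active / 1e9:.0f}B")   # ~36B
```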
configuration_dbrx.py ADDED
@@ -0,0 +1,264 @@
1
+ """Dbrx configuration."""
2
+ from typing import Any, Optional
3
+
4
+ from transformers.configuration_utils import PretrainedConfig
5
+ from transformers.utils import logging
6
+
7
+ logger = logging.get_logger(__name__)
8
+
9
+ DBRX_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
10
+
11
+
12
+ class DbrxAttentionConfig(PretrainedConfig):
13
+ """Configuration class for Dbrx Attention.
14
+
15
+ [`DbrxAttention`] class. It is used to instantiate attention layers
16
+ according to the specified arguments, defining the layers architecture.
17
+
18
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
19
+ documentation from [`PretrainedConfig`] for more information.
20
+
21
+ Args:
22
+ attn_pdrop (`float`, *optional*, defaults to 0.0):
23
+ The dropout probability for the attention layers.
24
+ clip_qkv (`float`, *optional*, defaults to None):
25
+ If not `None`, clip the queries, keys, and values in the attention layer to this value.
26
+ kv_n_heads (Optional[int]): For grouped_query_attention only, allow user to specify number of kv heads.
27
+ rope_theta (float): The base frequency for rope.
28
+ """
29
+
30
+ def __init__(
31
+ self,
32
+ attn_pdrop: float = 0,
33
+ clip_qkv: Optional[float] = None,
34
+ kv_n_heads: int = 1,
35
+ rope_theta: float = 10000.0,
36
+ **kwargs: Any,
37
+ ):
38
+ super().__init__(**kwargs)
39
+ self.attn_pdrop = attn_pdrop
40
+ self.clip_qkv = clip_qkv
41
+ self.kv_n_heads = kv_n_heads
42
+ self.rope_theta = rope_theta
43
+
44
+ for k in ['model_type']:
45
+ if k in kwargs:
46
+ kwargs.pop(k)
47
+ if len(kwargs) != 0:
48
+ raise ValueError(f'Found unknown {kwargs=}')
49
+
50
+ @classmethod
51
+ def from_pretrained(cls, pretrained_model_name_or_path: str,
52
+ **kwargs: Any) -> 'PretrainedConfig':
53
+ cls._set_token_in_kwargs(kwargs)
54
+
55
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path,
56
+ **kwargs)
57
+
58
+ if config_dict.get('model_type') == 'dbrx':
59
+ config_dict = config_dict['attn_config']
60
+
61
+ if 'model_type' in config_dict and hasattr(
62
+ cls,
63
+ 'model_type') and config_dict['model_type'] != cls.model_type:
64
+ logger.warning(
65
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
66
+ +
67
+ f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
68
+ )
69
+
70
+ return cls.from_dict(config_dict, **kwargs)
71
+
72
+
73
+ class DbrxFFNConfig(PretrainedConfig):
74
+ """Configuration class for Dbrx FFN.
75
+
76
+ [`DbrxFFN`] class. It is used to instantiate feedforward layers according to
77
+ the specified arguments, defining the layers architecture.
78
+
79
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
80
+ documentation from [`PretrainedConfig`] for more information.
81
+
82
+ Args:
83
+ ffn_act_fn (dict, optional): A dict specifying activation function for the FFN.
84
+ The dict should have a key 'name' with the value being the name of
85
+ the activation function along with any additional keyword arguments.
86
+ ffn_hidden_size (int, optional): The hidden size of the feedforward network.
87
+ moe_num_experts (int, optional): The number of experts in the mixture of experts layer.
88
+ moe_top_k (int, optional): The number of experts to use in the mixture of experts layer.
89
+ moe_jitter_eps (float, optional): The jitter epsilon for the mixture of experts layer.
90
+ moe_loss_weight (float, optional): The loss weight for the mixture of experts layer.
91
+ moe_normalize_expert_weights (float, optional): The normalization factor for the expert weights.
92
+ uniform_expert_assignment (bool, optional): Whether to use uniform expert assignment.
93
+ This should only be used for benchmarking purposes.
94
+ """
95
+
96
+ def __init__(
97
+ self,
98
+ ffn_act_fn: Optional[dict] = None,
99
+ ffn_hidden_size: int = 3584,
100
+ moe_num_experts: int = 4,
101
+ moe_top_k: int = 1,
102
+ moe_jitter_eps: Optional[float] = None,
103
+ moe_loss_weight: float = 0.01,
104
+ moe_normalize_expert_weights: Optional[float] = 1,
105
+ uniform_expert_assignment: bool = False,
106
+ **kwargs: Any,
107
+ ):
108
+ super().__init__()
109
+ if ffn_act_fn is None:
110
+ ffn_act_fn = {'name': 'silu'}
111
+ self.ffn_act_fn = ffn_act_fn
112
+ self.ffn_hidden_size = ffn_hidden_size
113
+ self.moe_num_experts = moe_num_experts
114
+ self.moe_top_k = moe_top_k
115
+ self.moe_jitter_eps = moe_jitter_eps
116
+ self.moe_loss_weight = moe_loss_weight
117
+ self.moe_normalize_expert_weights = moe_normalize_expert_weights
118
+ self.uniform_expert_assignment = uniform_expert_assignment
119
+
120
+ for k in ['model_type']:
121
+ if k in kwargs:
122
+ kwargs.pop(k)
123
+ if len(kwargs) != 0:
124
+ raise ValueError(f'Found unknown {kwargs=}')
125
+
126
+ @classmethod
127
+ def from_pretrained(cls, pretrained_model_name_or_path: str,
128
+ **kwargs: Any) -> 'PretrainedConfig':
129
+ cls._set_token_in_kwargs(kwargs)
130
+
131
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path,
132
+ **kwargs)
133
+
134
+ if config_dict.get('model_type') == 'dbrx':
135
+ config_dict = config_dict['ffn_config']
136
+
137
+ if 'model_type' in config_dict and hasattr(
138
+ cls,
139
+ 'model_type') and config_dict['model_type'] != cls.model_type:
140
+ logger.warning(
141
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
142
+ +
143
+ f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
144
+ )
145
+
146
+ return cls.from_dict(config_dict, **kwargs)
147
+
148
+
149
+ class DbrxConfig(PretrainedConfig):
150
+ """Configuration class for Dbrx.
151
+
152
+ [`DbrxModel`]. It is used to instantiate a Dbrx model according to the
153
+ specified arguments, defining the model architecture.
154
+
155
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
156
+ documentation from [`PretrainedConfig`] for more information.
157
+
158
+
159
+ Args:
160
+ d_model (`int`, *optional*, defaults to 6144):
161
+ Dimensionality of the embeddings and hidden states.
162
+ n_heads (`int`, *optional*, defaults to 48):
163
+ Number of attention heads for each attention layer in the Transformer encoder.
164
+ n_layers (`int`, *optional*, defaults to 40):
165
+ Number of hidden layers in the Transformer encoder.
166
+ max_seq_len (`int`, *optional*, defaults to 32768):
167
+ The maximum sequence length of the model.
168
+ vocab_size (`int`, *optional*, defaults to 100352):
169
+ Vocabulary size of the Dbrx model. Defines the maximum number of different tokens that can be represented by
170
+ the `inputs_ids` passed when calling [`DbrxModel`].
171
+ resid_pdrop (`float`, *optional*, defaults to 0.0):
172
+ The dropout probability applied to the attention output before combining with residual.
173
+ emb_pdrop (`float`, *optional*, defaults to 0.0):
174
+ The dropout probability for the embedding layer.
175
+ attn_config (`dict`, *optional*):
176
+ A dictionary used to configure the model's attention module.
177
+ ffn_config (`dict`, *optional*):
178
+ A dictionary used to configure the model's FFN module.
179
+ use_cache (`bool`, *optional*, defaults to `False`):
180
+ Whether or not the model should return the last key/values attentions (not used by all models).
181
+ initializer_range (`float`, *optional*, defaults to 0.02):
182
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
183
+ output_router_logits (`bool`, *optional*, defaults to `False`):
184
+ Whether or not the router logits should be returned by the model. Enabling this will also
185
+ allow the model to output the auxiliary loss. See [here]() for more details
186
+ router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
187
+ The aux loss factor for the total loss.
188
+
189
+
190
+ Example:
191
+ ```python
192
+ >>> from transformers import DbrxConfig, DbrxModel
193
+
194
+ >>> # Initializing a Dbrx configuration
195
+ >>> configuration = DbrxConfig()
196
+
197
+ >>> # Initializing a model (with random weights) from the configuration
198
+ >>> model = DbrxModel(configuration)
199
+
200
+ >>> # Accessing the model configuration
201
+ >>> configuration = model.config
202
+ ```
203
+ """
204
+
205
+ model_type = 'dbrx'
206
+ attribute_map = {
207
+ 'num_attention_heads': 'n_heads',
208
+ 'hidden_size': 'd_model',
209
+ 'num_hidden_layers': 'n_layers',
210
+ 'max_position_embeddings': 'max_seq_len'
211
+ }
212
+
213
+ def __init__(
214
+ self,
215
+ d_model: int = 2048,
216
+ n_heads: int = 16,
217
+ n_layers: int = 24,
218
+ max_seq_len: int = 2048,
219
+ vocab_size: int = 32000,
220
+ resid_pdrop: float = 0.0,
221
+ emb_pdrop: float = 0.0,
222
+ attn_config: Optional[DbrxAttentionConfig] = None,
223
+ ffn_config: Optional[DbrxFFNConfig] = None,
224
+ use_cache: bool = True,
225
+ initializer_range: float = 0.02,
226
+ output_router_logits: bool = False,
227
+ router_aux_loss_coef: float = 0.05,
228
+ **kwargs: Any,
229
+ ):
230
+ if attn_config is None:
231
+ self.attn_config = DbrxAttentionConfig()
232
+ elif isinstance(attn_config, dict):
233
+ self.attn_config = DbrxAttentionConfig(**attn_config)
234
+ else:
235
+ self.attn_config = attn_config
236
+
237
+ if ffn_config is None:
238
+ self.ffn_config = DbrxFFNConfig()
239
+ elif isinstance(ffn_config, dict):
240
+ self.ffn_config = DbrxFFNConfig(**ffn_config)
241
+ else:
242
+ self.ffn_config = ffn_config
243
+
244
+ self.d_model = d_model
245
+ self.n_heads = n_heads
246
+ self.n_layers = n_layers
247
+ self.max_seq_len = max_seq_len
248
+ self.vocab_size = vocab_size
249
+ self.resid_pdrop = resid_pdrop
250
+ self.emb_pdrop = emb_pdrop
251
+ self.use_cache = use_cache
252
+ self.initializer_range = initializer_range
253
+ self.output_router_logits = output_router_logits
254
+ self.router_aux_loss_coef = router_aux_loss_coef
255
+
256
+ tie_word_embeddings = kwargs.pop('tie_word_embeddings', False)
257
+ if tie_word_embeddings:
258
+ raise ValueError(
259
+ 'tie_word_embeddings is not supported for Dbrx models.')
260
+
261
+ super().__init__(
262
+ tie_word_embeddings=tie_word_embeddings,
263
+ **kwargs,
264
+ )
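Because config.json maps `AutoConfig` to `configuration_dbrx.DbrxConfig` via `auto_map`, this file is what `transformers` loads when `trust_remote_code=True` is passed, and the nested attention/FFN sub-configs defined above can be inspected directly. A minimal sketch (placeholder repo id):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "your-org/your-dbrx-finetune",  # placeholder repo id
    trust_remote_code=True,         # loads configuration_dbrx.DbrxConfig from the repo
)

# Top-level fields map onto DbrxConfig.__init__ arguments...
print(config.d_model, config.n_layers, config.max_seq_len)
# ...while attention and MoE settings live in the nested sub-configs.
print(config.attn_config.kv_n_heads, config.attn_config.rope_theta)
print(config.ffn_config.moe_num_experts, config.ffn_config.moe_top_k)
```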
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "_from_model_config": true,
+ "do_sample": true,
+ "transformers_version": "4.40.0.dev0"
+ }
model-00001-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:844d6ee310e60f776437f34b532a89764c869474fa771babcc88f457a1a41b49
3
+ size 4976767312
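Each `model-000NN-of-00054.safetensors` entry in this view is a Git LFS pointer (spec v1) rather than the weights themselves: the repository diff records only the shard's sha256 and its size in bytes. As a small sketch, a downloaded shard can be checked against its pointer like this (the filename is whichever shard you fetched; the expected value below is copied from the pointer above):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a large file in 1 MiB chunks and return its hex sha256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# oid from the LFS pointer for model-00001-of-00054.safetensors above.
expected = "844d6ee310e60f776437f34b532a89764c869474fa771babcc88f457a1a41b49"
assert sha256_of("model-00001-of-00054.safetensors") == expected
```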
model-00002-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67edbaf8ce22cef89b551054543eaf80c2a65f71702a0a6e818300e19a7d9883
3
+ size 4932728256
model-00003-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f9432fdf70381f02519e595cc1a32171a24cdff89817bab9b9162d9261df8cb
3
+ size 4932728256
model-00004-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42a92ceadf870e5641c61925094356618bbccb078ee8ddadf6e0fe7000f02a22
3
+ size 4888466376
model-00005-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fbaeb5f57d35a71310802168cc1c7d0c40adff467da82b4a40bd7923a9ee35e4
3
+ size 4932728248
model-00006-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0b968c9e9d8e880db7517ed6d6fce3688ba186bd6bf52813cb5be6d2ca9d94bd
3
+ size 4932728256
model-00007-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc898a1265501c5d7816d0d5a5bec727c82ff398e34b01216d19a08d1276441
+ size 4932728256
model-00008-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a03a9d91c30daf0be095e7fe207f2ec64459f3b84de5354436d66ee7bc87fdb5
+ size 4888466376
model-00009-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:019e332f7c02b72f409d07597a51fe6b9750f7e3c6844e625ff7d5b64fc53dd4
+ size 4932728248
model-00010-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08796964f4ab0f84558720edf6eaa3a84d4284fd8fb46e5efa8c307acee50bfb
+ size 4932728256
model-00011-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b923056be1b7922da2ccde84fc5fbce0e4493e74d075080c9cff6d6d72baccc
+ size 4932728256
model-00012-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4df2fe028ea52491ed31525dd581ecac07e19da6585e066ed00513f956c1e4a2
+ size 4888466376
model-00013-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51cc2f9e11b4d986cf83a0568afb4ae3a0796606bd9bdd8337a4f44f734ca86b
+ size 4932728240
model-00014-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:166b445c266747bfef1b529494a8b64b8014ae9f681d3b31bb279f5ce56148dc
+ size 4932728280
model-00015-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51c4a8745ab117e3993ec7e41933885240759ff5766b61309e6210740e5f0687
+ size 4932728296
model-00016-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:401be63b10132316fd3d16a1a4eb5ab1b41d38dcd24dfba7585a13faf36b1d55
+ size 4888466416
model-00017-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f508fbcfac88c06a1a9c01c6041530cd7bc32261753faa86240ce733f96c335d
+ size 4932728288
model-00018-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba87d72f59502b88a04fa2a145f73712d412fa784e63c816b72f872cccc167e3
+ size 4932728296
model-00019-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:22c0313fb62cb31922e14d2d71b3794d6ca82a6193ca1fced36fb57fc445b0c2
+ size 4932728296
model-00020-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:54b983476b64d4eafadfe50647c307d0bd1114415be8eb1e2f65c612c778bf07
+ size 4888466416
model-00021-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d59190fb228d8490d7c65a4003e7f65877bd8d0f2c527b7b3a1b493026efa88
+ size 4932728288
model-00022-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:af8b37e5d9a55ff376f923cafe175c109ab3329dc0dcd703d1d5110d25f5cd2f
+ size 4932728296
model-00023-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:042526a2936e7fb570abd769da46aedd6fcc65745aaee13f00bcdc70a5e06b81
+ size 4932728296
model-00024-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eebc383a0a528db88d0d4965a02b4cbc81d702991963e4523a6b2dfc3a9151f9
+ size 4888466416
model-00025-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ae5b9fcc37e67350cd83ff8b3db1313d119ed5afc7d594ab8a6918077918eba
+ size 4932728288
model-00026-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:033d368ca1793ae5ba73bb03b8c6cc7256eb4a18f77329c2c8fdeb8b5fbd3411
+ size 4932728296
model-00027-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84f35cfd716d3f28eb636926ab692a0fef1cca2a5fc6df2aaf508895b4d6b8c5
+ size 4932728296
model-00028-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee70a22cc0b03e27154164b623d987837f156c813d3ad3db979859995a9101da
+ size 4888466416
model-00029-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d970806d6867f0808c497eb159a21dcded2194cc676a3da58159c5449a424c8
+ size 4932728288
model-00030-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b2d1b9f8845e95ca6afcd803e158d82807646f35114e06164451d303b9ab9ec8
+ size 4932728296
model-00031-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b116b356f8871f98dcac1130097315bb6e534e9f91e8703b2c5a12ad9ba000f
+ size 4932728296
model-00032-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ca0e4c2054445a7b33a548f45c6b09bf469e97ffb0c48e27e6b277bf04e6037
+ size 4888466416
model-00033-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e18cbafcfc97cecc047fa18f64ff805b61a3976b8b6b01b333c6cae73c3b9797
+ size 4932728288
model-00034-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc7bfcbd66ee533cd39cf2c236ac7a32f249f4b90c6a1d025bd30e3dcba8b37e
+ size 4932728288
model-00035-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3957da1791e004a08595a89a2ea4587c168a1c6b916da521fd4fde3751b68a89
+ size 4932728296
model-00036-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5691f61db31dd0894d272f6a2107e366484825fe1279952f9abfc835421cf16e
+ size 4989142256
model-00037-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21f2dbd599835d83511dd2122d5bea4ad6647f521f950dcb901699c1aa1bcfcb
+ size 4964173160
model-00038-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:500fd82e253552d7283f6bc2dd7287a1cfc524d3a483bd6e525de912238c815c
+ size 4932728288
model-00039-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79307da855fad9bd2377cc36c914938657dfc6554a35edaee4874b6153bef98f
+ size 4932728296
model-00040-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b300507198f682ee40a81b1af4b16169023ae07fc3f45767eea3d0019c8f84f6
+ size 4932728296
model-00041-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3c20d1529859d1a2cd0ba96512ce0dfe4d97137e591febf0998d80a2ee497731
+ size 4888466408
model-00042-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bfdf0dcff1dc6f5da4754dbf6d58f4ec69102b185f95a3116c106597c9fd34b6
+ size 4932728288
model-00043-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d62413dbb7ec0a905a8f03b47f86693ddf0570c35dc8afb83cdc31892708d420
+ size 4932728296
model-00044-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0de53f440d3e537a70891a225e39555f0a730ae2ba92916f98087e86531d330d
+ size 4932728296
model-00045-of-00054.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74006b98dac79bfa8765e97abab9ef348e65c76b581c7810578489ab7c2258cc
+ size 4888466408