---
license: apache-2.0
base_model: allura-org/TQ2.5-14B-Neon-v1
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- llama-cpp
- gguf-my-repo
---
# Triangle104/TQ2.5-14B-Neon-v1-Q4_K_M-GGUF
This model was converted to GGUF format from [`allura-org/TQ2.5-14B-Neon-v1`](https://huggingface.co/allura-org/TQ2.5-14B-Neon-v1) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/allura-org/TQ2.5-14B-Neon-v1) for more details on the model.
---
Model details:
-
RP finetune of Supernova-Medius. Turned out surprisingly nice on its own. I honestly made it only as merge fuel, but it impressed me and Prodeus enough to release it separately (history repeats, I guess; Sugarquill also started out this way). Quite interesting prose, definitely distinct from Supernova, or EVA for that matter. Instruction following is decent as well. Not really much to say about this one, just a decent RP model, tbh. Euryale-inspired, I guess.
Model was trained by Auri.
### Training notes
Model was trained on a dataset consisting of 77M tokens of synthetic RP and short story gen data. Training took around 2 hours on an 8xH100 SXM node. Training config was more or less reused from Sugarquill, and it worked fairly well again. Had the node crash after finishing the training and merging in the LoRA, so I had to merge it with MergeKit on a separate node; otherwise everything was smooth.
Huge thanks to Retis Labs for sponsoring this run!
### Format
Model responds to ChatML instruct formatting, exactly like its base model.
<|im_start|>system
{system message}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
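The template above can be assembled programmatically. The following is a minimal sketch, assuming messages arrive as `role`/`content` dictionaries; the helper name `format_chatml` and the example messages are hypothetical and not part of the original card.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render {"role", "content"} dicts as a single ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is: <|im_start|>role, newline, content, <|im_end|>, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the response.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


# Hypothetical example conversation.
prompt = format_chatml([
    {"role": "system", "content": "You are a narrator."},
    {"role": "user", "content": "Describe the tavern."},
])
print(prompt)
```

Frontends that already apply the ChatML template should not need this step; it is mainly useful when sending raw prompts.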
### Recommended Samplers
My classic stable Qwen setup works quite well:
- Temperature - 0.8
- Min-P - 0.05
- Top-A - 0.3
- Repetition Penalty - 1.03
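As a rough illustration, the sampler values listed above could be mapped onto request fields for a llama.cpp-style backend. The field names `temperature`, `min_p`, and `repeat_penalty` are assumed to match what your backend accepts. Top-A is left out of the mapped payload on the assumption that the backend does not expose it. Check your frontend or backend documentation before relying on either assumption.

```python
import json

# Values copied from the recommended sampler list above.
RECOMMENDED_SAMPLERS = {
    "temperature": 0.8,
    "min_p": 0.05,
    "top_a": 0.3,
    "repetition_penalty": 1.03,
}


def to_llama_cpp_params(samplers: dict) -> dict:
    """Map the card's sampler names onto assumed llama.cpp-style field names.

    Top-A is intentionally dropped (assumption: not supported by the backend).
    """
    return {
        "temperature": samplers["temperature"],
        "min_p": samplers["min_p"],
        "repeat_penalty": samplers["repetition_penalty"],
    }


params = to_llama_cpp_params(RECOMMENDED_SAMPLERS)
print(json.dumps(params, indent=2))
```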
### Training config
Axolotl config (axolotl version 0.6.0):
# Model
base_model: arcee-ai/SuperNova-Medius
strict: false
# Liger Kernels (optimization)
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
# Output and HuggingFace
output_dir: /workspace/axolotl/TQ-2.5-14B-Neon
hub_model_id: allura-org/TQ-2.5-14B-Neon-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"
# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: TQ-2.5-14B-Neon-1
# Data
chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: allura-org/neon-41k
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:
# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: false
# LoRA
peft_use_rslora: true
peft_use_dora: false # better but slower
adapter: lora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
# - embed_tokens
# - lm_head
#loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:
# Training hyperparameters
# max_steps:
num_epochs: 2
# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0
## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.00003
lr_scheduler: cosine
#lr_scheduler_kwargs:
# min_lr: 0.0000024
optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit
## Batch Size
gradient_accumulation_steps: 4 # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 4 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1
# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
  use_reentrant: true
local_rank:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
# fsdp:
# - full_shard
# - auto_wrap
# fsdp_config:
# fsdp_limit_all_gathers: true
# fsdp_sync_module_states: true
# fsdp_offload_params: true
# fsdp_use_orig_params: false
# fsdp_cpu_ram_efficient_loading: true
# fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
# fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
# fsdp_state_dict_type: FULL_STATE_DICT
# fsdp_sharding_strategy: FULL_SHARD
# Misc
early_stopping_patience:
debug:
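Two numbers implied by the config above can be worked out by hand: the LoRA scaling factor and the effective batch size. This sketch assumes the usual rank-stabilized LoRA scaling of `lora_alpha / sqrt(lora_r)` (versus `lora_alpha / lora_r` for standard LoRA) and plain data parallelism across the 8 GPUs mentioned in the training notes. The tokens-per-step figure is an upper bound, since packed sequences may still contain padding. The function names are hypothetical.

```python
import math

# Values copied from the Axolotl config above.
LORA_R = 64
LORA_ALPHA = 32
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
SEQUENCE_LEN = 16384
NUM_GPUS = 8  # from the training notes ("8xH100 SXM node")


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling applied to the LoRA update: alpha/sqrt(r) for rsLoRA, alpha/r otherwise."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def sequences_per_step(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Packed sequences per optimizer step, assuming data parallelism (assumption)."""
    return micro_batch * grad_accum * num_gpus


rslora_scale = lora_scaling(LORA_ALPHA, LORA_R, use_rslora=True)
standard_scale = lora_scaling(LORA_ALPHA, LORA_R, use_rslora=False)
global_batch = sequences_per_step(MICRO_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, NUM_GPUS)
max_tokens_per_step = global_batch * SEQUENCE_LEN

print(f"rsLoRA scaling: {rslora_scale}")              # 32 / sqrt(64) = 4.0
print(f"standard LoRA scaling: {standard_scale}")     # 32 / 64 = 0.5
print(f"sequences per optimizer step: {global_batch}")        # 4 * 4 * 8 = 128
print(f"max tokens per optimizer step: {max_tokens_per_step}")  # 128 * 16384 = 2097152
```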
---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):
```bash
brew install llama.cpp
```
Invoke the llama.cpp server or the CLI.
### CLI:
```bash
llama-cli --hf-repo Triangle104/TQ2.5-14B-Neon-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-neon-v1-q4_k_m.gguf -p "The meaning to life and the universe is"
```
### Server:
```bash
llama-server --hf-repo Triangle104/TQ2.5-14B-Neon-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-neon-v1-q4_k_m.gguf -c 2048
```
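Once the server is running, it can be queried over HTTP. The sketch below assumes the server listens on its default address `http://127.0.0.1:8080` and serves an OpenAI-compatible `/v1/chat/completions` route; adjust both if your setup differs. The helper names and the example message are hypothetical. Only `build_chat_request` runs here; `send_chat_request` needs a live server.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://127.0.0.1:8080", max_tokens=256):
    """Build (but do not send) a POST request for a chat completion."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Hypothetical example message.
request = build_chat_request([{"role": "user", "content": "Describe the tavern."}])
print(request.full_url)
```

With the server started as shown above, `send_chat_request(request)` would return the generated text.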
Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.
Step 1: Clone llama.cpp from GitHub.
```bash
git clone https://github.com/ggerganov/llama.cpp
```
Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).
```bash
cd llama.cpp && LLAMA_CURL=1 make
```
Step 3: Run inference through the main binary.
```bash
./llama-cli --hf-repo Triangle104/TQ2.5-14B-Neon-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-neon-v1-q4_k_m.gguf -p "The meaning to life and the universe is"
```
or
```bash
./llama-server --hf-repo Triangle104/TQ2.5-14B-Neon-v1-Q4_K_M-GGUF --hf-file tq2.5-14b-neon-v1-q4_k_m.gguf -c 2048
```