Quantized version of https://huggingface.co/ausboss/llama-30b-supercot

GPTQ quantization was done with https://github.com/0cc4m/GPTQ-for-LLaMa for compatibility with 0cc4m's fork of KoboldAI.

This one is quantized without a groupsize to save VRAM, so you can enjoy the full 2048 max context on 24GB of VRAM (or at least get a lot closer to it than with the groupsize version).
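As a rough sanity check on that claim, here is some back-of-the-envelope arithmetic (my own estimate, not from the original model card; the parameter count and layer dimensions are the published LLaMA-30B figures, and the KV cache is assumed to be fp16):

```python
# Back-of-the-envelope VRAM estimate for 4-bit LLaMA-30B at full context.
params = 32.5e9                       # LLaMA-30B parameter count (approx.)
weights_gb = params * 4 / 8 / 1e9     # 4-bit weights -> ~16.25 GB
# A groupsize-128 build stores extra per-group scales/zeros on top of
# this, which is the VRAM the no-groupsize build saves.

layers, hidden, ctx = 60, 6656, 2048  # LLaMA-30B layers/hidden size, max context
# K and V caches, fp16 (2 bytes), per layer, per position:
kv_cache_gb = 2 * layers * hidden * ctx * 2 / 1e9  # ~3.3 GB

print(f"~{weights_gb:.1f} GB weights + ~{kv_cache_gb:.1f} GB KV cache")
# ~19.5 GB before activations and framework overhead -> tight on 24GB,
# which matches the "at least get a lot closer" hedge above.
```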
Command used to quantize:

```
python llama.py c:\llama-30b-supercot c4 --wbits 4 --act-order --true-sequential --save_safetensors 4bit.safetensors
```
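For comparison, the 128-groupsize build linked at the bottom of this card would come from adding GPTQ-for-LLaMa's `--groupsize 128` flag, presumably something like the following (the exact command and output filename for that build are my assumption, not taken from its card):

```
python llama.py c:\llama-30b-supercot c4 --wbits 4 --act-order --true-sequential --groupsize 128 --save_safetensors 4bit-128g.safetensors
```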
Evaluation & Score (perplexity; lower is better):

* WikiText2: 4.66
* PTB: 17.64
* C4: 6.50
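These look like the numbers GPTQ-for-LLaMa's llama.py prints after quantizing (its built-in evaluation covers exactly WikiText2, PTB, and C4). The metric itself is just the exponential of the mean per-token negative log-likelihood; a minimal sketch (the function name and example values are illustrative):

```python
import torch

# Perplexity: exp of the mean negative log-likelihood over test tokens.
def perplexity(token_nlls: torch.Tensor) -> float:
    return torch.exp(token_nlls.mean()).item()

# Example: an average NLL of ~1.539 nats per token corresponds to the
# WikiText2 score above.
print(perplexity(torch.full((2048,), 1.539)))  # ~4.66
```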
Groupsize version is here: https://huggingface.co/tsumeone/llama-30b-supercot-4bit-128g-cuda