EXL2 quant of alpindale/goliath-120b (https://huggingface.co/alpindale/goliath-120b), for use with exllamav2. Quantized at 4.25bpw so that CFG can be used comfortably on 72GB of VRAM (GPU split: 20,21,22).
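
As a rough sanity check on the numbers above, the weight footprint can be estimated from the bits per weight. This is only a sketch: the parameter count of about 118 billion is an approximation I am assuming for Goliath 120B, and the estimate ignores the KV cache, activations, and the extra memory CFG needs, which is why headroom below 72GB matters.

```python
# Rough estimate of quantized weight size. APPROX_PARAMS is an assumed,
# approximate parameter count, not a figure taken from this card.
APPROX_PARAMS = 118e9
BITS_PER_WEIGHT = 4.25
GPU_SPLIT_GB = [20, 21, 22]  # the split quoted above, one entry per GPU


def weights_gb(n_params: float, bpw: float) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


estimated_gb = weights_gb(APPROX_PARAMS, BITS_PER_WEIGHT)
split_total_gb = sum(GPU_SPLIT_GB)

print(f"Estimated weights: {estimated_gb:.1f} GB")
print(f"GPU split total:   {split_total_gb} GB")
```

The gap between the split total and the full 72GB is what remains for context and CFG.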

Update 06/01/2024: Updated with the new quant method after some time. Thanks for the measurement here

The calibration dataset is a cleaned, fixed PIPPA RP dataset, which does bias the results in favor of RP usage.

You can find the calibration dataset here

I've added a measurement.json file if you want to do your own quants.

Original model card

Goliath 120B

An auto-regressive causal LM created by combining two finetuned Llama-2 70B models into one.

Please check out the quantized formats provided by @TheBloke and @Panchovix:

GGUF (llama.cpp)
GPTQ (KoboldAI, TGW, Aphrodite)
AWQ (TGW, Aphrodite, vLLM)
Exllamav2 (TGW, KoboldAI)

Prompting Format

Both Vicuna and Alpaca will work, but due to the initial and final layers belonging primarily to Xwin, I expect Vicuna to work best.
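
For reference, here is a minimal sketch of how the two formats are usually assembled. The exact template strings are the commonly circulated Vicuna v1.1 and Alpaca templates, which I am assuming here since the card does not spell them out; the function names are hypothetical helpers.

```python
# Commonly used system preambles (assumed, not taken from this card).
VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
ALPACA_SYSTEM = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def vicuna_prompt(user_message: str) -> str:
    """Build a single-turn Vicuna-style prompt."""
    return f"{VICUNA_SYSTEM} USER: {user_message} ASSISTANT:"


def alpaca_prompt(instruction: str) -> str:
    """Build a single-turn Alpaca-style prompt."""
    return f"{ALPACA_SYSTEM}\n\n### Instruction:\n{instruction}\n\n### Response:\n"


if __name__ == "__main__":
    print(vicuna_prompt("Write a short greeting."))
    print(alpaca_prompt("Write a short greeting."))
```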

Merge process

The models used in the merge are Xwin and Euryale.

The layer ranges used are as follows:

- range 0, 16
  Xwin
- range 8, 24
  Euryale
- range 17, 32
  Xwin
- range 25, 40
  Euryale
- range 33, 48
  Xwin
- range 41, 56
  Euryale
- range 49, 64
  Xwin
- range 57, 72
  Euryale
- range 65, 80
  Xwin
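
To see how the ranges above stack up, the following sketch tallies the layers contributed by each slice. It assumes each range is half-open (start included, end excluded), which is my reading of the listing rather than something the card states; the variable names are illustrative.

```python
# Layer slices as listed above: (start, end, source model).
# Assumption: ranges are half-open, so (0, 16) means layers 0..15.
SLICES = [
    (0, 16, "Xwin"),
    (8, 24, "Euryale"),
    (17, 32, "Xwin"),
    (25, 40, "Euryale"),
    (33, 48, "Xwin"),
    (41, 56, "Euryale"),
    (49, 64, "Xwin"),
    (57, 72, "Euryale"),
    (65, 80, "Xwin"),
]

# Count the layers each source model contributes to the merged stack.
layers_per_model: dict[str, int] = {}
for start, end, model in SLICES:
    layers_per_model[model] = layers_per_model.get(model, 0) + (end - start)

total_layers = sum(layers_per_model.values())

for model, count in layers_per_model.items():
    print(f"{model}: {count} layers")
print(f"Total layers in merged model: {total_layers}")
```

Because consecutive slices overlap, the merged stack is deeper than either 80-layer source model.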

Screenshots

Benchmarks

Coming soon.

Acknowledgements

Credit goes to @chargoddard for developing the framework used to merge the model, mergekit.

Special thanks to @Undi95 for helping with the merge ratios.