Quantization made by Richard Erkhov.

SparseLlama-3-8B-pruned_50.2of4 - GGUF

Model creator: https://huggingface.co/neuralmagic/
Original model: https://huggingface.co/neuralmagic/SparseLlama-3-8B-pruned_50.2of4/

Name	Quant method	Size
SparseLlama-3-8B-pruned_50.2of4.Q2_K.gguf	Q2_K	2.96GB
SparseLlama-3-8B-pruned_50.2of4.IQ3_XS.gguf	IQ3_XS	3.28GB
SparseLlama-3-8B-pruned_50.2of4.IQ3_S.gguf	IQ3_S	3.43GB
SparseLlama-3-8B-pruned_50.2of4.Q3_K_S.gguf	Q3_K_S	3.41GB
SparseLlama-3-8B-pruned_50.2of4.IQ3_M.gguf	IQ3_M	3.52GB
SparseLlama-3-8B-pruned_50.2of4.Q3_K.gguf	Q3_K	3.74GB
SparseLlama-3-8B-pruned_50.2of4.Q3_K_M.gguf	Q3_K_M	3.74GB
SparseLlama-3-8B-pruned_50.2of4.Q3_K_L.gguf	Q3_K_L	4.03GB
SparseLlama-3-8B-pruned_50.2of4.IQ4_XS.gguf	IQ4_XS	4.18GB
SparseLlama-3-8B-pruned_50.2of4.Q4_0.gguf	Q4_0	4.34GB
SparseLlama-3-8B-pruned_50.2of4.IQ4_NL.gguf	IQ4_NL	4.38GB
SparseLlama-3-8B-pruned_50.2of4.Q4_K_S.gguf	Q4_K_S	4.37GB
SparseLlama-3-8B-pruned_50.2of4.Q4_K.gguf	Q4_K	4.58GB
SparseLlama-3-8B-pruned_50.2of4.Q4_K_M.gguf	Q4_K_M	4.58GB
SparseLlama-3-8B-pruned_50.2of4.Q4_1.gguf	Q4_1	4.78GB
SparseLlama-3-8B-pruned_50.2of4.Q5_0.gguf	Q5_0	5.21GB
SparseLlama-3-8B-pruned_50.2of4.Q5_K_S.gguf	Q5_K_S	5.21GB
SparseLlama-3-8B-pruned_50.2of4.Q5_K.gguf	Q5_K	5.34GB
SparseLlama-3-8B-pruned_50.2of4.Q5_K_M.gguf	Q5_K_M	5.34GB
SparseLlama-3-8B-pruned_50.2of4.Q5_1.gguf	Q5_1	5.65GB
SparseLlama-3-8B-pruned_50.2of4.Q6_K.gguf	Q6_K	6.14GB
SparseLlama-3-8B-pruned_50.2of4.Q8_0.gguf	Q8_0	7.95GB
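
One way to choose among the files above is by a size budget. The following is a minimal sketch, not part of any official tooling: `pick_quant` and `gguf_filename` are hypothetical helpers, and the sizes are copied from a subset of the table. File size is only a lower bound on memory use, since the context (KV cache) needs additional room at run time.

```python
# File sizes in GB, copied from a subset of the table above.
QUANT_SIZES_GB = {
    "Q2_K": 2.96,
    "Q3_K_M": 3.74,
    "Q4_K_M": 4.58,
    "Q5_K_M": 5.34,
    "Q6_K": 6.14,
    "Q8_0": 7.95,
}


def pick_quant(budget_gb, sizes=QUANT_SIZES_GB):
    """Return the largest quant whose file fits within budget_gb, or None."""
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def gguf_filename(quant):
    """Build the file name used in this repo for a given quant method."""
    return f"SparseLlama-3-8B-pruned_50.2of4.{quant}.gguf"


choice = pick_quant(5.0)
print(choice, gguf_filename(choice))
```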

Original model description:

base_model: meta-llama/Meta-Llama-3-8B
inference: true
model_type: llama
pipeline_tag: text-generation
tags:
- sparse

SparseLlama-3-8B-pruned_50.2of4

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model. The model was pruned in one shot with SparseGPT and then retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask.
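
To illustrate what a 2:4 pattern means, here is a minimal sketch in plain Python: in every consecutive group of four weights, at most two are nonzero. The helpers `prune_2of4` and `is_2of4` are hypothetical names, and the magnitude-based selection is only a stand-in for illustration. SparseGPT chooses which weights to remove with a different criterion and also updates the remaining weights, which this sketch does not attempt.

```python
def prune_2of4(row):
    """Zero the two smallest-magnitude values in each consecutive group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2of4(row):
    """Check that every group of four has at most two nonzero values."""
    return all(
        sum(1 for v in row[s:s + 4] if v != 0) <= 2
        for s in range(0, len(row), 4)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2of4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Exactly half of the weights are zeroed, which is where the "50" in the model name comes from.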

Note: This is still a work in progress and subject to change. We expect to release new weights with even better accuracy soon.

Running the model

For testing purposes, the model can be run naively in transformers, without exploiting the sparsity:

# pip install transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4")
# device_map="auto" lets accelerate decide where to place the weights.
model = AutoModelForCausalLM.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4", device_map="auto")

input_text = "A poem about Machine Learning goes as follows:"
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Without max_new_tokens, generate() falls back to a short default length.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

To take advantage of the 2:4 sparsity, install nm-vllm for fast inference and low memory usage:

pip install "nm-vllm[sparse]" --extra-index-url https://pypi.neuralmagic.com/simple

from vllm import LLM, SamplingParams

# Load the model with the 2:4 semi-structured sparse kernels enabled.
model = LLM("nm-testing/SparseLlama-3-8B-pruned_50.2of4", sparsity="semi_structured_sparse_w16a16")

prompt = "A poem about Machine Learning goes as follows:"
# temperature=0 gives greedy (deterministic) decoding.
sampling_params = SamplingParams(max_tokens=100, temperature=0)

outputs = model.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Evaluation Benchmark Results

Evaluation results obtained with lm-evaluation-harness, following the Open LLM Leaderboard configuration:

Benchmark	Meta-Llama-3-8B	SparseLlama-3-8B-pruned_50.2of4 (this model)
ARC-c 25-shot	59.47%	57.76%
MMLU 5-shot	65.29%	60.44%
HellaSwag 10-shot	82.14%	79.97%
WinoGrande 5-shot	77.27%	77.19%
GSM8K 5-shot	44.81%	47.92%
TruthfulQA 0-shot	43.96%	41.02%
Average Accuracy	62.16%	60.72%
Recovery	100%	97.68%
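
The last two rows of the table can be reproduced from the per-benchmark scores. The sketch below assumes that "Average Accuracy" is the unweighted mean of the six benchmarks and that "Recovery" is the sparse average divided by the dense average; both assumptions match the reported numbers for this table. The variable names are illustrative.

```python
# Scores in percent, copied from the table above.
dense = {
    "ARC-c": 59.47, "MMLU": 65.29, "HellaSwag": 82.14,
    "WinoGrande": 77.27, "GSM8K": 44.81, "TruthfulQA": 43.96,
}
sparse = {
    "ARC-c": 57.76, "MMLU": 60.44, "HellaSwag": 79.97,
    "WinoGrande": 77.19, "GSM8K": 47.92, "TruthfulQA": 41.02,
}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


dense_avg = mean(dense.values())
sparse_avg = mean(sparse.values())
# Recovery: share of the dense model's average accuracy retained after pruning.
recovery = 100 * sparse_avg / dense_avg

print(f"{dense_avg:.2f} {sparse_avg:.2f} {recovery:.2f}")  # 62.16 60.72 97.68
```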

Evaluation results obtained with Mosaic Eval Gauntlet, following the Eval Gauntlet v0.3 configuration:

Benchmark	Meta-Llama-3-8B	SparseLlama-3-8B-pruned_50.2of4 (this model)
World Knowledge	58.08%	54.61%
Commonsense Reasoning	47.66%	47.62%
Language Understanding	71.13%	67.58%
Symbolic Problem Solving	38.44%	32.15%
Reading Comprehension	57.48%	55.76%
Average Accuracy	54.70%	51.54%
Recovery	100%	94.22%

Help

For further support and for discussions on these models and AI in general, join Neural Magic's Slack Community.

Acknowledgment

This model is built with Meta Llama 3. For more details on its license, please check the model card of Meta-Llama-3-8B.