
Quantization made by Richard Erkhov.

GitHub

Discord

Request more models

SparseLlama-3-8B-pruned_50.2of4 - GGUF

Original model description:

base_model: meta-llama/Meta-Llama-3-8B
inference: true
model_type: llama
pipeline_tag: text-generation
tags:
- sparse

SparseLlama-3-8B-pruned_50.2of4

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model, pruned in one-shot with SparseGPT and then retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask.

Note: This is still a work in progress and subject to change. We expect to release new weights with even better accuracy soon.
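As background, 2:4 sparsity means that in every contiguous group of four weights, at most two are nonzero. A minimal sketch of verifying this pattern on a weight tensor (the helper name is illustrative, not part of any released tooling):

# pip install torch
import torch

def is_two_of_four_sparse(weight: torch.Tensor) -> bool:
    # View the tensor as groups of 4 consecutive values and count nonzeros per group
    # (assumes the number of elements is divisible by 4, as in Llama weight matrices)
    groups = weight.reshape(-1, 4)
    return bool(((groups != 0).sum(dim=1) <= 2).all())

# A 2:4 pattern keeps at most 2 of every 4 consecutive weights
w = torch.tensor([[0.5, 0.0, -0.3, 0.0],
                  [0.0, 1.2, 0.0, -0.7]])
print(is_two_of_four_sparse(w))  # True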

Running the model

It can be run naively in transformers for testing purposes:

# pip install transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places weights on available GPUs
tokenizer = AutoTokenizer.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4")
model = AutoModelForCausalLM.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4", device_map="auto")

input_text = "A poem about Machine Learning goes as follows:"
# Tokenize the prompt and move it to the GPU
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
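Note that generate() defaults to a short continuation; for a longer completion you can pass a generation budget explicitly, e.g. model.generate(**input_ids, max_new_tokens=100).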

To take advantage of the 2:4 sparsity, install nm-vllm for fast inference and low memory usage:

# pip install nm-vllm[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
from vllm import LLM, SamplingParams

# Load the model with nm-vllm's semi-structured sparse FP16 (w16a16) kernels
model = LLM("nm-testing/SparseLlama-3-8B-pruned_50.2of4", sparsity="semi_structured_sparse_w16a16")

prompt = "A poem about Machine Learning goes as follows:"
# temperature=0 selects greedy (deterministic) decoding
sampling_params = SamplingParams(max_tokens=100, temperature=0)

outputs = model.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Evaluation Benchmark Results

Model evaluation results were obtained via lm-evaluation-harness, following the configuration of the Open LLM Leaderboard.
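For reference, individual tasks can be run with the lm-evaluation-harness CLI. A minimal sketch for the ARC-c row, assuming a recent (v0.4+) lm-eval release; the exact harness version and task configuration behind the numbers below are not stated here:

pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=nm-testing/SparseLlama-3-8B-pruned_50.2of4 \
    --tasks arc_challenge \
    --num_fewshot 25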

| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 (this model) |
|---|---|---|
| ARC-c (25-shot) | 59.47% | 57.76% |
| MMLU (5-shot) | 65.29% | 60.44% |
| HellaSwag (10-shot) | 82.14% | 79.97% |
| WinoGrande (5-shot) | 77.27% | 77.19% |
| GSM8K (5-shot) | 44.81% | 47.92% |
| TruthfulQA (0-shot) | 43.96% | 41.02% |
| Average Accuracy | 62.16% | 60.72% |
| Recovery | 100% | 97.68% |

Model evaluation results were obtained via the Mosaic Eval Gauntlet, following the configuration of Eval Gauntlet v0.3.

| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 (this model) |
|---|---|---|
| World Knowledge | 58.08% | 54.61% |
| Commonsense Reasoning | 47.66% | 47.62% |
| Language Understanding | 71.13% | 67.58% |
| Symbolic Problem Solving | 38.44% | 32.15% |
| Reading Comprehension | 57.48% | 55.76% |
| Average Accuracy | 54.70% | 51.54% |
| Recovery | 100% | 94.22% |
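The Recovery rows are simply the sparse model's average accuracy divided by the dense baseline's, which the figures above confirm:

# Recovery = sparse average accuracy / dense average accuracy
print(f"{60.72 / 62.16:.2%}")  # 97.68% (Open LLM Leaderboard suite)
print(f"{51.54 / 54.70:.2%}")  # 94.22% (Eval Gauntlet suite)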

Help

For further support, and for discussion of these models and AI in general, join Neural Magic's Slack Community.

Acknowledgment

This model is built with Meta Llama 3. For more details on its license, please check the model card of Meta-Llama-3-8B.
