
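EXL2 quantization of glaiveai's Llama-3-8B-RAG-v1 (https://huggingface.co/glaiveai/Llama-3-8B-RAG-v1), made with exllamav2. Quantization settings: bpw 6.0 (average bits per weight) and head-bpw 8.0 (bits per weight for the output head layer).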
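The bpw settings translate into storage roughly as parameters × bits ÷ 8 bytes. The sketch below is a back-of-the-envelope estimate, not a measurement of this repo. It assumes Llama-3-8B's published shapes (about 8.03B parameters in total, a 128,256-token vocabulary and a hidden size of 4,096, giving an output head of about 525M weights). It also treats every non-head weight as exactly 6.0 bpw, ignoring tensors that stay at higher precision, so the real download size will differ.

```python
def quantized_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage in GB (1e9 bytes) for num_params at the given bits per weight."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed Llama-3-8B shapes (illustrative values; check config.json)
VOCAB_SIZE = 128_256
HIDDEN_SIZE = 4_096
TOTAL_PARAMS = 8_030_261_248

head_params = VOCAB_SIZE * HIDDEN_SIZE     # output head (lm_head), quantized at head-bpw
body_params = TOTAL_PARAMS - head_params   # everything else, simplified to bpw

estimate = quantized_size_gb(body_params, 6.0) + quantized_size_gb(head_params, 8.0)
print(f"~{estimate:.2f} GB")  # prints "~6.15 GB"
```

Under these assumptions, the 8.0-bit head adds only about 0.13 GB over quantizing it at 6.0 bpw, which is why a higher head precision is a cheap setting.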

Example usage with exllamav2:


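```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2DynamicGenerator

# Local folder containing this repo's files (config, safetensors shards, tokenizer)
model_path = "/path/to/model_folder"

# Load the model; lazy = True defers cache allocation so load_autosplit
# can split layers and cache across the available GPUs while loading
config = ExLlamaV2Config(model_path)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 4096, lazy = True)
model.load_autosplit(cache, progress = True)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)

# Sampling settings: top_p = 0.1 keeps only the most likely tokens,
# so generation is close to greedy
gen_settings = ExLlamaV2Sampler.Settings(
    temperature = 1.0,
    top_p = 0.1,
    token_repetition_penalty = 1.0,  # 1.0 leaves logits unchanged (no penalty)
)

outputs = generator.generate(
    prompt = ["first input", "second input"],  # a string or a list of strings
    max_new_tokens = 1024,
    stop_conditions = [tokenizer.eos_token_id],
    gen_settings = gen_settings,
    add_bos = True,
)

# A list of prompts returns a list of outputs, one per prompt
print(outputs)
```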
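The sampler settings above use top_p = 0.1. Nucleus (top-p) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches top_p, then renormalizes. With a threshold this low, that set is often just the single most likely token. The sketch below is a plain-Python illustration of the technique, not exllamav2's own sampler code, and the tokens and probabilities are made up.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they form a distribution again
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.15}
print(top_p_filter(probs, 0.1))  # {'Paris': 1.0}
print(top_p_filter(probs, 0.8))  # keeps Paris and Lyon, renormalized
```

Because the top token alone already exceeds 0.1 here, sampling becomes deterministic, which suits grounded RAG answers; raising top_p admits more candidates and more variety.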
Format: Safetensors
Model size (as reported by the Hub): 1.97B params
Tensor types: I32, FP16, I16

Dataset used to train KT313/Llama-3-8B-RAG-v1-exl2-6.0bpw