An exllamav2-quantized version of glaiveai's Llama-3-8B-RAG-v1: https://huggingface.co/glaiveai/Llama-3-8B-RAG-v1. Quantized at 6.0 bits per weight (bpw), with the output head kept at 8.0 bpw.

Example usage with exllamav2:
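For a rough sense of what 6.0 bpw means for storage, here is a back-of-envelope estimate. The ~8-billion parameter count is an assumption taken from the model name; the 8.0-bpw output head and file metadata are ignored for simplicity.

```python
# Back-of-envelope disk-size estimate for a quantized model.
# n_params is assumed (~8B from the model name); the 8.0-bpw head
# and container overhead are ignored, so treat this as approximate.
def quantized_size_gb(n_params: float, bpw: float) -> float:
    bits = n_params * bpw     # total bits across all weights
    return bits / 8 / 1e9     # bits -> bytes -> gigabytes

print(f"~{quantized_size_gb(8e9, 6.0):.1f} GB")  # roughly 6 GB
```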
```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2DynamicGenerator

model_path = "/path/to/model_folder"

# Load the model, allocating the cache lazily so load_autosplit can
# spread layers across available GPUs.
config = ExLlamaV2Config(model_path)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 4096, lazy = True)
model.load_autosplit(cache, progress = True)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)

gen_settings = ExLlamaV2Sampler.Settings(
    temperature = 1.0,
    top_p = 0.1,
    token_repetition_penalty = 1.0,
)

outputs = generator.generate(
    prompt = ["first input", "second input"],  # string or list of strings
    max_new_tokens = 1024,
    stop_conditions = [tokenizer.eos_token_id],
    gen_settings = gen_settings,
    add_bos = True,
)

print(outputs)
```
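The `generate` call above takes raw strings, but as a Llama-3 fine-tune the model will usually expect prompts in the Llama-3 chat template. The model card does not show the exact expected format, so the helper below is only a sketch assuming the standard Llama-3 instruct template; verify against the model's `tokenizer_config.json` before relying on it.

```python
# Sketch: wrap a user message in the standard Llama-3 instruct template.
# Assumption: this fine-tune uses the base Llama-3 chat format; it may
# instead expect a RAG-specific layout for documents and questions.
# The <|begin_of_text|> token is omitted here because generate() above
# is called with add_bos = True.
def llama3_prompt(user_message: str, system_message: str = "") -> str:
    prompt = ""
    if system_message:
        prompt += ("<|start_header_id|>system<|end_header_id|>\n\n"
                   f"{system_message}<|eot_id|>")
    prompt += ("<|start_header_id|>user<|end_header_id|>\n\n"
               f"{user_message}<|eot_id|>"
               "<|start_header_id|>assistant<|end_header_id|>\n\n")
    return prompt

print(llama3_prompt("What does the document say about X?"))
```

The formatted string can then be passed as the `prompt` argument to `generator.generate`.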