---
base_model: EpistemeAI/Fireball-Llama-3.1-8B-Instruct-v1-16bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- dpo
---

# Uploaded model

- **Developed by:** EpistemeAI
- **License:** apache-2.0
- **Finetuned from model:** EpistemeAI/Fireball-Llama-3.1-8B-Instruct-v1-16bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

# Fireball-Llama-3.1-V1-Instruct

## How to use

This repository contains Fireball-Llama-3.1-V1-Instruct, for use with `transformers` and with the original Llama codebase.

### Use with transformers

Starting with transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.

Example:

````py
!pip install -U transformers trl peft accelerate bitsandbytes
````

````py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "EpistemeAI/Fireball-Llama-3.1-8B-Instruct-v1dpo"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

sys = "You are a helpful assistant " \
      "(advanced natural-language interaction)."

messages = [
    {"role": "system", "content": sys},
    {"role": "user", "content": "What is DPO and ORPO fine tune?"},
]

# Method 1: apply the chat template manually and call generate()
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt",
                   add_special_tokens=False).to(model.device)

outputs = model.generate(**inputs,
                         max_new_tokens=512,
                         do_sample=True,
                         top_p=0.9,
                         temperature=0.6)
results = tokenizer.batch_decode(outputs)[0]
print(results)

# Method 2: use the text-generation pipeline with chat messages
import transformers

pipe = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=False,  # return only the newly generated text, not the prompt
    max_new_tokens=512,      # max number of tokens to generate in the output
    temperature=0.6,         # lower temperature for less random answers
    do_sample=True,
    top_p=0.9,
)

sequences = pipe(messages)
for seq in sequences:
    print(seq["generated_text"])
````
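The install cell above pulls in `bitsandbytes`, but the example never uses it. As an optional sketch (not part of the original card), the model can also be loaded in 4-bit to reduce GPU memory use; `BitsAndBytesConfig` is the standard `transformers` API for this, and the quantization settings below are illustrative defaults, not values published for this model:

````py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "EpistemeAI/Fireball-Llama-3.1-8B-Instruct-v1dpo"

# Illustrative 4-bit settings; adjust for your hardware/quality trade-off.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
````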
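For interactive use, generation can also be streamed token by token. This is a minimal sketch using `transformers`' `TextStreamer`, reusing `model`, `tokenizer`, and the `inputs` from Method 1 above; it is a convenience addition, not part of the original card:

````py
from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

_ = model.generate(**inputs,
                   streamer=streamer,
                   max_new_tokens=512,
                   do_sample=True,
                   top_p=0.9,
                   temperature=0.6)
````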