EpistemeAI/Fireball-Llama-3.1-8B-v1dpo



legolasyiu committed on Aug 16

Commit 407bac8 (1 parent: 5ec2510)

Update README.md

Files changed (1)

README.md +64 -1

README.md CHANGED

@@ -24,7 +24,70 @@ This llama model was trained 2x faster with [Unsloth](https://github.com/unsloth
 Fireball-Llama-3.1-V1
 <img src="https://huggingface.co/EpistemeAI/Fireball-Llama-3.1-8B-v1dpo/resolve/main/fireball-llama.JPG" width="200"/>
-## For transfomer:

 Fireball-Llama-3.1-V1
 <img src="https://huggingface.co/EpistemeAI/Fireball-Llama-3.1-8B-v1dpo/resolve/main/fireball-llama.JPG" width="200"/>
+# Fireball-Llama-3.11-V1
+## How to use
+This repository contains Fireball-Llama-3.11-V1 for use with transformers and with the original Llama codebase.
+### Use with transformers
+Starting with transformers >= 4.43.0, you can run conversational inference either with the Transformers `pipeline` abstraction or with the Auto classes and the `generate()` function.
+Make sure your transformers installation is up to date: `pip install --upgrade transformers`.
+Example:
+````py
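+# Notebook (Jupyter/Colab) syntax; in a terminal, run the same command without the leading "!".
+# accelerate is needed for device_map="auto" below.
+!pip install -U transformers trl peft accelerate bitsandbytes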
+````
+````py
+import torch
+from transformers import (
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
+base_model = "EpistemeAI/Fireball-Llama-3.1-8B-v1dpo"
+model = AutoModelForCausalLM.from_pretrained(
+    base_model,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(base_model)
+# System prompt that sets the assistant's persona
+system_prompt = (
+    "You are a helpful assistant "
+    "(Advanced Natural-based interaction for the language)."
+)
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": "What are DPO and ORPO fine-tuning?"},
+]
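+# Method 1: render the chat template, then call generate() directly
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+# The chat template already inserts the special tokens, so don't add them a second time
+inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
+# Move the input tensors to the model's device (works on GPU or CPU with device_map="auto")
+inputs = inputs.to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.9, temperature=0.6)
+# Decode the full sequence (prompt + generated answer)
+results = tokenizer.batch_decode(outputs)[0]
+print(results)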
+# Method 2: the text-generation pipeline, which applies the chat template to `messages` itself
+import transformers
+pipe = transformers.pipeline(
+    task="text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    return_full_text=False,  # return only the generated answer, not the prompt
+    max_new_tokens=512,  # maximum number of tokens to generate in the output
+    temperature=0.6,  # lower = more focused answers, higher = more creative answers
+    do_sample=True,
+    top_p=0.9,
+)
+sequences = pipe(messages)
+for seq in sequences:
+    print(seq["generated_text"])
+````
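Both methods above sample with `do_sample=True, temperature=0.6, top_p=0.9`. As a rough illustration of what those two knobs do, here is a toy, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a hand-picked list of logits. The helper `sample_top_p` and the example logits are hypothetical; this shows the general idea only and is not how transformers implements its logits processors.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=None):
    """Toy temperature + nucleus (top-p) sampling over a list of logits.

    Hypothetical helper for illustration only; it mirrors the idea behind
    generate(do_sample=True, temperature=..., top_p=...) and is not the
    transformers implementation. Returns (sampled_index, kept_indices).
    """
    rng = rng or random.Random()
    # 1) Temperature: dividing by a value < 1.0 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # 2) Softmax, subtracting the max logit for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3) Nucleus filter: keep the most likely tokens, in order, until their
    #    cumulative probability reaches top_p (always at least one token).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 4) Renormalize over the kept tokens and draw one index.
    kept_mass = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_mass for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


# Hypothetical logits for a 4-token vocabulary.
example_logits = [2.0, 1.0, 0.5, -1.0]
for t in (1.0, 0.6):
    token, nucleus = sample_top_p(example_logits, temperature=t, top_p=0.9, rng=random.Random(0))
    print(f"temperature={t}: nucleus={nucleus}, sampled={token}")
```

With these example logits, lowering the temperature from 1.0 to 0.6 shrinks the nucleus from three candidate tokens to two, which is why a lower temperature tends to give more focused, less "creative" answers.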