SUMMARY
A model I made while learning to fine-tune 'gpt2-medium':
- on a self-made dataset
- with self-made special tokens
- through multiple fine-tuning rounds on a ~15K-row dataset (still in progress)
If you are interested in how I got to this point and how I created the datasets, you can visit:
Crafting GPT2 for Personalized AI - Preparing Data the Long Way
FINE TUNED - BASE MODEL
I would consider this GPT2-medium-custom-v1.0 the base model to start my Fine Tuning 2.0 on specific datasets.
- The previous models, gpt-special-tokens-medium (1~4), are considered beta checkpoints leading up to this one.
This model is available to test on Ollama as Deeokay/mediumgpt2 (e.g. via "ollama run Deeokay/mediumgpt2"). It is not perfect and I am still working some things out, but I am quite proud that I was able to make it this far. Please note that the actual GGUF file is also included in this repository if you would like to create your own versions (templates, etc.).
DECLARING NEW SPECIAL TOKENS
special_tokens_dict = {
    'eos_token': '<|STOP|>',
    'bos_token': '<|STOP|>',
    'pad_token': '<|PAD|>',
    'additional_special_tokens': ['<|BEGIN_QUERY|>', '<|END_QUERY|>',
                                  '<|BEGIN_ANALYSIS|>', '<|END_ANALYSIS|>',
                                  '<|BEGIN_RESPONSE|>', '<|END_RESPONSE|>',
                                  '<|BEGIN_SENTIMENT|>', '<|END_SENTIMENT|>',
                                  '<|BEGIN_CLASSIFICATION|>', '<|END_CLASSIFICATION|>']
}

# Register the new tokens and grow the embedding matrix to match
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

# Point the special-token IDs at the new tokens
tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids('<|STOP|>')
tokenizer.bos_token_id = tokenizer.convert_tokens_to_ids('<|STOP|>')
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids('<|PAD|>')
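A quick sanity check I would add here (not part of the original flow): confirm the tokenizer actually registered the new tokens and that the embedding matrix grew to match.

# Optional sanity check: new tokens should map to real IDs,
# and the embedding table should match the grown vocabulary
print(tokenizer.eos_token_id, tokenizer.pad_token_id)
print(tokenizer.convert_tokens_to_ids('<|BEGIN_RESPONSE|>'))
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)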
The order of tokens is as follows:
def combine_text(user_prompt, analysis, sentiment, new_response, classification):
    user_q = f"<|STOP|><|BEGIN_QUERY|>{user_prompt}<|END_QUERY|>"
    analysis = f"<|BEGIN_ANALYSIS|>{analysis}<|END_ANALYSIS|>"
    new_response = f"<|BEGIN_RESPONSE|>{new_response}<|END_RESPONSE|>"
    classification = f"<|BEGIN_CLASSIFICATION|>{classification}<|END_CLASSIFICATION|>"
    sentiment = f"<|BEGIN_SENTIMENT|>Sentiment: {sentiment}<|END_SENTIMENT|><|STOP|>"
    return user_q + analysis + new_response + classification + sentiment
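For illustration, here is what a record looks like after combine_text; the field values below are made up, not from the actual dataset.

# Hypothetical record, just to show the resulting training string
sample = combine_text(
    user_prompt="What is the capital of France?",
    analysis="The user is asking a simple geography question.",
    sentiment="Neutral",
    new_response="The capital of France is Paris.",
    classification="Geography",
)
print(sample)
# <|STOP|><|BEGIN_QUERY|>What is the capital of France?<|END_QUERY|><|BEGIN_ANALYSIS|>...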
INFERENCING
I am currently testing two ways; if anyone knows a better one, please let me know!
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
models_folder = "Deeokay/gpt2-medium-custom-v1.0"
model = GPT2LMHeadModel.from_pretrained(models_folder)
tokenizer = GPT2Tokenizer.from_pretrained(models_folder)
# Device configuration <<change as needed>>
device = torch.device("cpu")
model.to(device)
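Since the device line is marked <<change as needed>>, a common alternative is to auto-detect a GPU; a minimal sketch:

# Use CUDA when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)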
OPTION 1 INFERENCE
import time

class Stopwatch:
    def __init__(self):
        self.start_time = None
        self.end_time = None

    def start(self):
        self.start_time = time.time()

    def stop(self):
        self.end_time = time.time()

    def elapsed_time(self):
        if self.start_time is None:
            return "Stopwatch hasn't been started"
        if self.end_time is None:
            return "Stopwatch hasn't been stopped"
        return self.end_time - self.start_time
stopwatch1 = Stopwatch()
def generate_response(input_text, max_length=250):
    stopwatch1.start()

    # Prepare the input
    # input_text = f"<|BEGIN_QUERY|>{input_text}<|END_QUERY|><|BEGIN_ANALYSIS|>{input_text}<|END_ANALYSIS|><|BEGIN_RESPONSE|>"
    input_text = f"<|BEGIN_QUERY|>{input_text}<|END_QUERY|><|BEGIN_ANALYSIS|>"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # Create attention mask
    attention_mask = torch.ones_like(input_ids).to(device)

    # Generate (greedy decoding)
    output = model.generate(
        input_ids,
        max_new_tokens=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,  # pads with the EOS token here
        eos_token_id=tokenizer.convert_tokens_to_ids('<|STOP|>'),
    )

    stopwatch1.stop()
    return tokenizer.decode(output[0], skip_special_tokens=False)
OPTION 2 INFERENCE
# Reuses the Stopwatch class defined in Option 1 above
stopwatch2 = Stopwatch()
def generate_response2(input_text, max_length=250):
    stopwatch2.start()

    # Prepare the input
    # input_text = f"<|BEGIN_QUERY|>{input_text}<|END_QUERY|><|BEGIN_ANALYSIS|>{input_text}<|END_ANALYSIS|><|BEGIN_RESPONSE|>"
    input_text = f"<|BEGIN_QUERY|>{input_text}<|END_QUERY|><|BEGIN_ANALYSIS|>"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # Create attention mask
    attention_mask = torch.ones_like(input_ids).to(device)

    # 2nd option: generate with sampling
    output = model.generate(
        input_ids,
        max_new_tokens=max_length,
        attention_mask=attention_mask,
        do_sample=True,
        temperature=0.4,  # this can be played around with
        top_k=60,         # this can be played around with
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    stopwatch2.stop()
    return tokenizer.decode(output[0], skip_special_tokens=False)
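The practical difference from Option 1 is that this version samples (do_sample=True) instead of decoding greedily: the lower temperature keeps the output focused, while top_k restricts each step to the 60 most likely tokens, so responses vary between runs but stay on topic. It also pads with the dedicated <|PAD|> token instead of reusing the EOS token.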
DECODING ANSWER
For when I need just the response:
def decode(text):
    full_text = text

    # Extract the response part
    start_token = "<|BEGIN_RESPONSE|>"
    end_token = "<|END_RESPONSE|>"
    start_idx = full_text.find(start_token)
    end_idx = full_text.find(end_token)

    if start_idx != -1 and end_idx != -1:
        response = full_text[start_idx + len(start_token):end_idx].strip()
    else:
        response = full_text.strip()

    return response
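A small usage sketch; the generated text below is illustrative, not real model output.

# Illustrative full output; real generations will differ
full_output = ("<|BEGIN_QUERY|>Hi<|END_QUERY|><|BEGIN_ANALYSIS|>Greeting<|END_ANALYSIS|>"
               "<|BEGIN_RESPONSE|>Hello! How can I help?<|END_RESPONSE|><|STOP|>")
print(decode(full_output))  # -> Hello! How can I help?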
MY SETUP
I use the stopwatches to time the responses, and I run both inference options to compare the difference.
input_text = "Who is Steve Jobs and what was his contribution?"
response1_full = generate_response(input_text)
#response1 = decode(response1_full)
print(f"Input: {input_text}")
print("=======================================")
print(f"Response1: {response1_full}")
elapsed1 = stopwatch1.elapsed_time()
print(f"Process took {elapsed1:.4f} seconds")
print("=======================================")
response2_full = generate_response2(input_text)
#response2 = decode(response2_full)
print(f"Response2: {response2_full}")
elapsed2 = stopwatch2.elapsed_time()
print(f"Process took {elapsed2:.4f} seconds")
print("=======================================")
Out-of-Scope Use
Anything that requires factual accuracy: trust at your own risk!
Never tested on mathematical knowledge.
I quite enjoy how the responses feel closer to what I had in mind.