metadata
license: cc-by-nc-4.0
A llama.c model based on Karpathy's Llama2.c project. https://github.com/karpathy/llama2.c
Vocab of 4096, trained on Tinystories, and my custom littlestories dataset (currently unreleased.)
Model uses ↨ as a shift key, instead of using capial letters, this allowed simplification of the tokenizer to avoid duplicates that are uppercase.
To convert normal text to the right format I use:
def add_caseifer(text):
# Using list comprehension for more efficient concatenation
return ''.join(['↨' + char.lower() if char.isupper() else char for char in text])
To return the text to human format I use:
def remove_caseifer(text):
new_text = ""
i = 0
while i < len(text):
if text[i] == "↨":
if i+1 < len(text):
new_text += text[i+1].upper()
i += 1
else:
pass # skip this index
else:
new_text += text[i]
i += 1
return new_text