How to run the Fill-in-the-middle setting
I have been able to get the model to generate autoregressively. However, when I tokenize a sequence containing the special tokens shown in the FIM example in the model card - "<fim-prefix>", "<fim-suffix>", "<fim-middle>" - each token gets split into multiple token ids instead of mapping to a single id. I am also unable to get good generations in the FIM setting, even with the example provided in the model card, and I do not see the FIM tokens among the special symbols in the tokenizer.
Kindly suggest how to use the fill-in-the-middle setting of SantaCoder.
Hi, you need to manually add the FIM special tokens to the vocab, and you also need to pass return_token_type_ids=False when tokenizing so the tokenizer output doesn't include token_type_ids, which would confuse the model during generation. We will try to make the model card clearer about this. Here's a working example; you can also find more details in this notebook.
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FIM-trained checkpoint of SantaCoder (the "fim" revision)
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", revision="fim", trust_remote_code=True)
tokenizer_fim = AutoTokenizer.from_pretrained("bigcode/santacoder", padding_side="left")

# FIM special tokens used during training
FIM_PREFIX = "<fim-prefix>"
FIM_MIDDLE = "<fim-middle>"
FIM_SUFFIX = "<fim-suffix>"
FIM_PAD = "<fim-pad>"
EOD = "<|endoftext|>"

# Register the FIM tokens so each one maps to a single token id
tokenizer_fim.add_special_tokens({
    "additional_special_tokens": [EOD, FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD],
    "pad_token": EOD,
})

# Prompt format: <fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>
# The model then generates the missing middle between prefix and suffix.
input_text = "<fim-prefix>def fib(n):<fim-suffix>    else:\n        return fib(n - 2) + fib(n - 1)<fim-middle>"

# return_token_type_ids=False keeps token_type_ids out of the inputs passed to generate()
inputs = tokenizer_fim(input_text, return_tensors="pt", padding=True, return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=25)
generation = [tokenizer_fim.decode(tensor, skip_special_tokens=False) for tensor in outputs]
print(generation[0])
<fim-prefix>def fib(n):<fim-suffix>    else:
        return fib(n - 2) + fib(n - 1)<fim-middle>
    if n == 0:
        return 0
    elif n == 1:
        return 1
<|endoftext|><fim-prefix>
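If you want the completed function rather than the raw FIM-formatted string, you can splice the generated middle back between the prefix and suffix. Here's a minimal sketch; the fim_splice helper is just for illustration and not part of transformers:

# Illustrative helper: reassemble final code from a decoded FIM generation
# of the form <fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>{middle}<|endoftext|>...
def fim_splice(decoded):
    body = decoded.split("<fim-prefix>")[1]    # keep only the first FIM block
    prefix, rest = body.split("<fim-suffix>")
    suffix, middle = rest.split("<fim-middle>")
    middle = middle.split("<|endoftext|>")[0]  # cut at end-of-document
    return prefix + middle + suffix

print(fim_splice(generation[0]))
# def fib(n):
#     if n == 0:
#         return 0
#     elif n == 1:
#         return 1
#     else:
#         return fib(n - 2) + fib(n - 1)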
FYI the special tokens are now in the tokenizer by default: https://huggingface.co/bigcode/santacoder/discussions/11
And you no longer need to specify return_token_type_ids=False; we've turned it off by default.
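So with a recent transformers release the manual setup above is no longer needed; a quick sanity check, assuming a version that picks up the updated tokenizer config:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
print(tokenizer.additional_special_tokens)     # FIM tokens listed out of the box
print(tokenizer("<fim-prefix>")["input_ids"])  # should be a single token id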