Vocabulary
#4, opened by NEDIX
Since this project has moved to the Llama 2 architecture, I have been trying to convert this model to the LLaMA GGML format.
I am currently at a dead end because the get_vocab and save_vocabulary methods in tokenization_codegen25.py do not work. Calling get_vocab fails because some vocabulary entries use an encoding other than the configured utf-8.
Possible solutions:
a. Change the encoding on line 169 of tokenization_codegen25.py from utf-8 to latin-1.
b. In the next version of this model, filter non-UTF-8 characters out of the vocabulary.
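
To illustrate the difference between the two options, here is a minimal Python sketch. It does not use the actual tokenizer code: sample_tokens and decode_tokens are hypothetical names, and the byte strings are made up for illustration, on the assumption that the vocabulary holds raw byte sequences that are not always valid UTF-8.

```python
# Hypothetical byte-level tokens, NOT taken from the real vocabulary.
# b"\xe2\x82" is a truncated multi-byte UTF-8 sequence; b"\xff" is never valid UTF-8.
sample_tokens = [b"def", b" return", b"\xe2\x82", b"\xff"]


def decode_tokens(tokens, encoding):
    """Decode each token with the given encoding, collecting failures instead of raising."""
    decoded, failed = [], []
    for tok in tokens:
        try:
            decoded.append(tok.decode(encoding))
        except UnicodeDecodeError:
            failed.append(tok)
    return decoded, failed


# Option (b): keep only the tokens that are valid UTF-8.
utf8_ok, utf8_failed = decode_tokens(sample_tokens, "utf-8")

# Option (a): latin-1 maps every byte value 0-255 to one code point, so decoding never fails.
latin1_ok, latin1_failed = decode_tokens(sample_tokens, "latin-1")

# Encoding the latin-1 strings again recovers the original bytes.
round_trip = [s.encode("latin-1") for s in latin1_ok]

print("utf-8 failures:", utf8_failed)
print("latin-1 failures:", latin1_failed)
print("latin-1 round trip intact:", round_trip == sample_tokens)
```

In this sketch, latin-1 keeps every token and preserves the bytes, but the resulting strings are not readable text for multi-byte characters. Filtering keeps the strings clean but drops tokens, which would change the vocabulary.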