---
license: gpl-2.0
language:
- en
- ja
tags:
- tokenizer
- novelai
- sentencepiece
---
# NovelAI Tokenizer v1
This repository is exactly the same as [NovelAI/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1), but the config has been changed to address the following points (the SentencePiece model itself is unchanged):
- Load as `T5Tokenizer`
- Allow digits to be decoded (in the original, digits are registered as `additional_special_tokens`, so if `skip_special_tokens=True` is set when decoding, the digits are skipped as well)
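
For example, with this config, digits survive a round trip through encoding and decoding: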
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mkshing/novelai-tokenizer-v1", use_fast=False)

text = "1+1=3"
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)
# '1+1=3'
```
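
For comparison, here is a minimal sketch of the upstream behavior this config fixes, assuming the original `NovelAI/nerdstash-tokenizer-v1` config still loads through `AutoTokenizer`: because digits are registered there as `additional_special_tokens`, decoding with `skip_special_tokens=True` drops them.

```python
# Sketch of the upstream behavior (assumes NovelAI/nerdstash-tokenizer-v1
# loads via AutoTokenizer and registers digits as additional_special_tokens).
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("NovelAI/nerdstash-tokenizer-v1", use_fast=False)

text = "1+1=3"
original.decode(original.encode(text), skip_special_tokens=True)
# The digits are skipped along with the special tokens, leaving e.g. '+='
```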