---
license: gpl-2.0
language:
  - en
  - ja
tags:
  - tokenizer
  - novelai
  - sentencepiece
---

# NovelAI Tokenizer v1

This repository is identical to NovelAI/nerdstash-tokenizer-v1, except that the tokenizer config has been changed to address the following points (the SentencePiece model itself is unchanged):

- Loadable as `T5Tokenizer`
- Digits are decoded correctly (in the original config, digits are registered as `additional_special_tokens`, so decoding with `skip_special_tokens=True` skips the digits as well)
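The digit-skipping behavior can be illustrated with a minimal sketch of how `skip_special_tokens` works during decoding (the vocabulary and `decode` helper below are hypothetical stand-ins, not the actual SentencePiece model or the `transformers` implementation):

```python
# Sketch: special-token IDs are filtered out before the pieces are joined,
# so registering digits as special tokens makes them disappear on decode.
# Hypothetical toy vocabulary, for illustration only.

def decode(token_ids, id_to_token, special_ids, skip_special_tokens=False):
    pieces = []
    for i in token_ids:
        if skip_special_tokens and i in special_ids:
            continue  # drop special tokens, exactly what happens to digits
        pieces.append(id_to_token[i])
    return "".join(pieces)

id_to_token = {0: "1", 1: "+", 2: "=", 3: "3"}
ids = [0, 1, 0, 2, 3]  # encodes "1+1=3"

# Original-style config: digits 1 and 3 registered as special tokens
print(decode(ids, id_to_token, special_ids={0, 3}, skip_special_tokens=True))
# → '+='  (digits lost)

# Fixed config: digits are ordinary tokens
print(decode(ids, id_to_token, special_ids=set(), skip_special_tokens=True))
# → '1+1=3'
```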

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mkshing/novelai-tokenizer-v1", use_fast=False)

text = "1+1=3"
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)
# '1+1=3'
```