|
--- |
|
license: gpl-2.0 |
|
language: |
|
- en |
|
- ja |
|
tags: |
|
- tokenizer |
|
- novelai |
|
- sentencepiece |
|
--- |
|
|
|
# NovelAI Tokenizer v1 |
|
This repository is exactly the same as [NovelAI/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1), |
|
but its config has been changed to address the following points (the SentencePiece model itself is unchanged). |
|
|
|
- Load as `T5Tokenizer` |

- Allow digits to be decoded (in the original, digits are registered as `additional_special_tokens`, so they are also skipped when decoding with `skip_special_tokens=True`) |
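
The digit-skipping issue can be illustrated with a minimal sketch. This is not the actual `transformers` decode implementation, just a toy filter showing why registering digits as special tokens makes them disappear from the decoded string:

```python
# Toy illustration (not the real transformers code path):
# if digits are registered as special tokens, skip_special_tokens=True
# filters them out of the decoded output along with <pad>, </s>, etc.
special_tokens = {"<pad>", "</s>", "1", "2", "3"}  # digits as specials, as in the original config
tokens = ["1", "+", "1", "=", "3"]                 # decoded pieces for "1+1=3"
decoded = "".join(t for t in tokens if t not in special_tokens)
print(decoded)  # '+=' -- the digits are gone
```

With the updated config, digits are ordinary vocabulary tokens, so they survive decoding.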
|
|
|
```python |
from transformers import AutoTokenizer |

tokenizer = AutoTokenizer.from_pretrained("mkshing/novelai-tokenizer-v1", use_fast=False) |

text = "1+1=3" |
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True) |
# '1+1=3' |
``` |
|
|