---
license: gpl-2.0
language:
- en
- ja
tags:
- tokenizer
- novelai
- sentencepiece
---

# NovelAI Tokenizer v1

This repository is exactly the same as [NovelAI/nerdstash-tokenizer-v1](https://huggingface.co/NovelAI/nerdstash-tokenizer-v1), but the config has been changed to address the following points (the SentencePiece model itself is unchanged):

- Load as `T5Tokenizer`
- Enable decoding of digits (in the original, digits are registered as `additional_special_tokens`, so decoding with `skip_special_tokens=True` skips the digits as well)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mkshing/novelai-tokenizer-v1", use_fast=False)

text = "1+1=3"
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)
# '1+1=3'
```