This is ["naver-clova-ix/donut-base"](https://huggingface.co/naver-clova-ix/donut-base) but with all non-ASCII tokens removed. This makes the model a good fit for basic English use cases where the text is primarily a-zA-Z0-9 and basic punctuation.

The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so that has also been added. The notebook [remove-donut-tokens.ipynb](remove-donut-tokens.ipynb) details the whole process.

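The core idea of the filtering step can be sketched like this. This is a minimal illustration, not the notebook's actual code, and `split_vocab`/`toy_vocab` are hypothetical names standing in for the real tokenizer vocabulary:

```python
# Minimal sketch of the filtering idea: partition the tokenizer's
# token->id mapping by whether the token text is pure ASCII.
# Illustration only, not the code from remove-donut-tokens.ipynb.
def split_vocab(vocab):
    """Split a token->id mapping into ASCII-only and non-ASCII tokens."""
    keep = {tok: idx for tok, idx in vocab.items() if tok.isascii()}
    drop = {tok: idx for tok, idx in vocab.items() if not tok.isascii()}
    return keep, drop

# Toy vocabulary standing in for the real sentencepiece vocab.
toy_vocab = {"hello": 0, "wörld": 1, "123": 2, "日本": 3}
keep, drop = split_vocab(toy_vocab)
print(sorted(keep))  # ['123', 'hello']
print(sorted(drop))  # ['wörld', '日本']
```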
This model has not been trained any further than the original.

I made a whole video about it: https://youtu.be/Uzr553x1gdM

I did a quick generation speed test comparing this model against the default model and against the default model with `bad_words_ids`. The `bad_words_ids` list covered only 12k tokens, instead of the 30k that were removed, and it was still noticeably slower.

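For reference, `bad_words_ids` in `transformers` `generate()` expects a list of token-id lists, one per banned sequence. A rough sketch of how such a list is shaped (illustration only, not the contents of speed_test.py; `toy_vocab` is a stand-in for the real vocabulary):

```python
# Sketch of the bad_words_ids shape expected by transformers' generate():
# a list of token-id lists, here one single-id list per banned token.
# Illustration only; the actual benchmark lives in speed_test.py.
toy_vocab = {"hello": 0, "wörld": 1, "123": 2, "日本": 3}

bad_words_ids = [[idx] for tok, idx in toy_vocab.items() if not tok.isascii()]
print(bad_words_ids)  # [[1], [3]]

# With a real model this would be passed along the lines of:
# model.generate(pixel_values, bad_words_ids=bad_words_ids, max_new_tokens=10)
```

Banning tokens this way filters logits at every decoding step, which is why it adds overhead compared with simply shrinking the vocabulary.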
The speed test script is [here](speed_test.py), launched with [this](run_speed_tests.sh).

| approach | time to generate 10 tokens |
| - | - |
| "naver-clova-ix/donut-base" | 205ms |
| "naver-clova-ix/donut-base" + 12k `bad_words_ids` | 280ms |
| "donut-base-ascii" | 195ms |