nbroad committed
Commit 39129e8
1 Parent(s): cbde782

Update README.md

Files changed (1)
  1. README.md +17 -2
README.md CHANGED
@@ -7,7 +7,22 @@ license: apache-2.0
This is ["naver-clova-ix/donut-base"](https://huggingface.co/naver-clova-ix/donut-base) but with all non-ascii tokens removed. This means the model is good for basic English use cases where the text is primarily a-zA-Z0-9 and basic punctuation.


- The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
+ The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so that has also been added. The notebook [remove-donut-tokens.ipynb](remove-donut-tokens.ipynb) details the whole process.


- This has not been trained any more than the original model.
+ This has not been trained any more than the original model.
+
+ I made a whole video about it: https://youtu.be/Uzr553x1gdM
+
+
+ I ran a quick generation speed test comparing this model against the default model, with and without `bad_words_ids`. The `bad_words_ids` list covered only 12k of the 30k tokens that were removed, and it was still noticeably slower.
+
+ Speed script: [speed_test.py](speed_test.py)
+ Launched with: [run_speed_tests.sh](run_speed_tests.sh)
+
+
+ approach | time to generate 10 tokens
+ --- | ---
+ "naver-clova-ix/donut-base" | 205ms
+ "naver-clova-ix/donut-base" + 12k `bad_words_ids` | 280ms
+ "donut-base-ascii" | 195ms
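For context on what the notebook covers, here is a minimal sketch of how the non-ascii vocabulary entries can be identified. This is an illustration rather than the notebook's actual code, and it assumes the hub repo's tokenizer files load with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("naver-clova-ix/donut-base")

# SentencePiece marks word boundaries with "▁"; strip it before the
# ASCII check so ordinary word-initial tokens are not flagged.
non_ascii = [t for t in tok.get_vocab() if not t.replace("▁", "").isascii()]
print(f"{len(non_ascii)} of {len(tok)} tokens are non-ASCII")
```

Removing those entries and rebuilding the tokenizer and embedding matrix is the fiddly part the notebook walks through.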
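And a rough sketch of the `bad_words_ids` comparison; `speed_test.py` is the authoritative script, and the blank image, generic `<s>` prompt, and token ids below are placeholders:

```python
import time

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

# A blank page stands in for a real document scan.
image = Image.new("RGB", (1920, 2560), "white")
pixel_values = processor(image, return_tensors="pt").pixel_values

# A real run would use a proper task prompt instead of bare "<s>".
decoder_input_ids = processor.tokenizer(
    "<s>", add_special_tokens=False, return_tensors="pt"
).input_ids

# Placeholder ids; the real list held ~12k non-ascii token ids.
banned = [[i] for i in range(100, 12100)]

for label, kwargs in [("plain", {}), ("banned", {"bad_words_ids": banned})]:
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(pixel_values,
                       decoder_input_ids=decoder_input_ids,
                       max_length=10,
                       **kwargs)
    print(label, f"{(time.perf_counter() - start) * 1000:.0f} ms")
```

The timings in the table above came from the linked scripts, not this sketch.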