akeyhero committed
Commit 6eadfbc
1 Parent(s): 0a7cd57

Update README.md

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -45,7 +45,7 @@ model = AutoModelForTokenClassification.from_pretrained(model_name)
 The design keeps the following points in mind:
 
 - No morphological analyzer required at inference time
-- Tokens do not cross word (dictionary: `unidic-cwj-202302`) boundaries
+- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
 - Easy to use with Hugging Face
 - A vocabulary size that is not too large
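For context, the hunk header above carries the README's own usage line. Below is a minimal inference sketch built only on the `transformers` Auto classes visible in that line; the checkpoint id and the sample sentence are placeholders, since this diff does not include them, and no morphological analyzer is involved, in line with the first bullet.

```python
# Minimal inference sketch based on the README's context line above.
# The checkpoint id is a placeholder: this diff does not name the model.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-org/your-japanese-deberta-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# No morphological analyzer at inference time: the tokenizer alone
# turns raw Japanese text into model-ready inputs.
inputs = tokenizer("これはテストです。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```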
@@ -70,8 +70,8 @@ Although the original DeBERTa V3 is characterized by a large vocabulary size, wh
 | WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
 | Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/aozorabunko-clean) | 496MB | x4 |
 | CC-100 | ja | 90GB | x1 |
-| mC4 | ja; extracted 10% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
-| OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
+| mC4 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
+| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
 
 # Training parameters
 - Number of devices: 8
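The last two rows cite DSIR for data selection, and the edit also changes the stated OSCAR fraction from 20% to 10%. As a rough illustration of what such selection looks like, here is a minimal sketch of hashed n-gram importance resampling in the spirit of the cited paper; it is not the model authors' pipeline, every name in it is hypothetical, and it uses character n-grams as a stand-in for the paper's word n-grams because raw Japanese text has no whitespace word boundaries.

```python
# Illustrative sketch of DSIR-style selection (hashed n-gram importance
# resampling, https://arxiv.org/abs/2302.03169). NOT the authors' pipeline;
# bucket count, smoothing, and character n-grams are assumptions.
import hashlib
import math
import random
from collections import Counter

NUM_BUCKETS = 10_000  # hashed feature space; the paper uses a larger one

def hashed_ngrams(text: str, n: int = 4) -> list[int]:
    """Hash character n-grams into a fixed set of buckets."""
    grams = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) % NUM_BUCKETS
            for g in grams]

def fit_distribution(texts: list[str]) -> list[float]:
    """Add-one-smoothed unigram distribution over hash buckets."""
    counts = Counter(b for t in texts for b in hashed_ngrams(t))
    total = sum(counts.values()) + NUM_BUCKETS
    return [(counts.get(i, 0) + 1) / total for i in range(NUM_BUCKETS)]

def dsir_select(raw: list[str], target: list[str], keep: float = 0.1) -> list[str]:
    """Keep the fraction of `raw` that looks most like `target`."""
    p_target = fit_distribution(target)  # e.g. Wikipedia text
    p_raw = fit_distribution(raw)        # e.g. mC4 or OSCAR text
    scored = []
    for text in raw:
        # Log importance weight: how much more likely under the target model.
        w = sum(math.log(p_target[b]) - math.log(p_raw[b])
                for b in hashed_ngrams(text))
        g = -math.log(-math.log(random.random()))  # Gumbel(0, 1) noise
        scored.append((w + g, text))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:int(len(raw) * keep)]]
```

Adding Gumbel noise to the log weights before taking the top fraction makes the selection equivalent to sampling without replacement from the importance-weighted distribution, which is the resampling step the paper describes.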