adno committed on
Commit
ba3be43
1 Parent(s): da999f8

Update README.md

Files changed (1)
  1. README.md +21 -3
README.md CHANGED
@@ -1,3 +1,21 @@
- ---
- license: bsd-3-clause
- ---
+ ---
+ license: bsd-3-clause
+ language:
+ - zh
+ - en
+ - id
+ - ja
+ - es
+ ---
+
+ # TUBELEX FastText Word Embeddings
+
+ FastText Word Embeddings trained on the TUBELEX YouTube subtitle corpora. We use the 300-dimensional [fastText](https://github.com/facebookresearch/fastText) CBOW model with position weights, 10 negative samples, 10 epochs, and character 5-grams (other parameters: default) ([Grave et al., 2018](https://aclanthology.org/L18-1550)).
+
+ # What is TUBELEX?
+
+ TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
+
+ - TODO: paper link
+ - [KenLM n-gram models](https://huggingface.co/naist-nlp/tubelex-kenlm)
+ - [word frequencies and code](https://github.com/naist-nlp/tubelex)
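
The README added in this commit mentions "character 5-grams" without saying what they look like. The sketch below illustrates the idea: fastText wraps each word in `<` and `>` boundary markers and takes every substring of the chosen length. The helper name `char_ngrams` is hypothetical and not part of fastText or TUBELEX. The code is a simplified illustration only; it leaves out the hashing of n-grams into buckets and the separate whole-word vector.

```python
from typing import List


def char_ngrams(word: str, n: int = 5) -> List[str]:
    """Return the character n-grams of `word` with fastText-style boundaries.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    # '<' and '>' mark the word boundaries, so a prefix such as "<subt"
    # is distinct from the same letters occurring inside a longer word.
    wrapped = f"<{word}>"
    # Slide a window of width n over the wrapped word. Words shorter than
    # the window yield an empty list.
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("subtitle"))
    # ['<subt', 'subti', 'ubtit', 'btitl', 'title', 'itle>']
```

Fixing the minimum and maximum n-gram length to the same value (5 here) means each word contributes only substrings of that single length, rather than a range of lengths.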