Update README.md
README.md
CHANGED
@@ -18,7 +18,15 @@ We provide both '\*.bin' files (for fastText) and '\*.vec' files that follow the
 
 TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
 
--
+- [preprint](http://arxiv.org/abs/1908.09283), BibTeX entry:
+```
+@article{nohejl_etal_2024_film,
+title={Beyond {{Film Subtitles}}: {{Is YouTube}} the {{Best Approximation}} of {{Spoken Vocabulary}}?},
+author={Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro},
+year={2024}, eprint={2410.03240}, archiveprefix={arXiv}, primaryclass={cs.CL},
+url={https://arxiv.org/abs/2410.03240v1}, journal={ArXiv preprint}, volume={arXiv:2410.03240v1 [cs]}
+}
+```
 - [KenLM n-gram models](https://huggingface.co/naist-nlp/tubelex-kenlm)
 - [word frequencies and code](https://github.com/naist-nlp/tubelex)
 