adno committed on
Commit a541106
1 Parent(s): cfef1bb

Update README.md

  1. README.md +9 -1
README.md CHANGED
@@ -18,7 +18,15 @@ We provide both '\*.bin' files (for fastText) and '\*.vec' files that follow the
 
  TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
 
- - TODO: paper link
+ - [preprint](http://arxiv.org/abs/1908.09283), BibTeX entry:
+ ```
+ @article{nohejl_etal_2024_film,
+ title={Beyond {{Film Subtitles}}: {{Is YouTube}} the {{Best Approximation}} of {{Spoken Vocabulary}}?},
+ author={Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro},
+ year={2024}, eprint={2410.03240}, archiveprefix={arXiv}, primaryclass={cs.CL},
+ url={https://arxiv.org/abs/2410.03240v1}, journal={ArXiv preprint}, volume={arXiv:2410.03240v1 [cs]}
+ }
+ ```
  - [KenLM n-gram models](https://huggingface.co/naist-nlp/tubelex-kenlm)
  - [word frequencies and code](https://github.com/naist-nlp/tubelex)
 