naist-nlp/tubelex-fasttext


adno committed on Oct 7

Commit a541106 • 1 Parent(s): cfef1bb

Update README.md

Files changed (1)

README.md +9 -1

@@ -18,7 +18,15 @@ We provide both '\*.bin' files (for fastText) and '\*.vec' files that follow the
 TUBELEX is a YouTube subtitle corpus currently available for Chinese, English, Indonesian, Japanese, and Spanish.
-- TODO: paper link
+- [preprint](https://arxiv.org/abs/2410.03240), BibTeX entry:
+```
+@article{nohejl_etal_2024_film,
+  title={Beyond {{Film Subtitles}}: {{Is YouTube}} the {{Best Approximation}} of {{Spoken Vocabulary}}?},
+  author={Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro},
+  year={2024}, eprint={2410.03240}, archiveprefix={arXiv}, primaryclass={cs.CL},
+  url={https://arxiv.org/abs/2410.03240v1}, journal={ArXiv preprint}, volume={arXiv:2410.03240v1 [cs]}
+}
+```
 - [KenLM n-gram models](https://huggingface.co/naist-nlp/tubelex-kenlm)
 - [word frequencies and code](https://github.com/naist-nlp/tubelex)