Icelandic-lt/asr_6gram_lm


danielschnell committed on May 28, 2024

Commit 0cd204b • 1 Parent(s): 9d0ed7d

Copy from Clarin: http://hdl.handle.net/20.500.12537/226


"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
word-level n-gram language model in binary format suitable for
recognizers based on the NVIDIA-NeMo framework.

Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

Files changed (3)

.gitattributes +1 -0
6GRAM_ARPA_MODEL.bin +3 -0
Readme.txt +88 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+6GRAM_ARPA_MODEL.bin filter=lfs diff=lfs merge=lfs -text

6GRAM_ARPA_MODEL.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
+size 5433972976
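
The three added lines are a Git LFS pointer: the repository stores this small text file, while the 5.4 GB model itself lives in LFS storage. As a minimal sketch, the pointer can be parsed and used to check a downloaded copy of the model. The function names and the local file path in the usage note are hypothetical and are not part of the commit.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
size 5433972976
"""


def parse_lfs_pointer(text: str) -> dict:
    """Turn the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file has the size and hash recorded in the pointer."""
    # Cheap check first: a size mismatch avoids hashing several gigabytes.
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algorithm"])
    with path.open("rb") as handle:
        # Read in chunks so the whole model never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

Assuming the model was saved locally as `6GRAM_ARPA_MODEL.bin`, calling `matches_pointer(Path("6GRAM_ARPA_MODEL.bin"), pointer)` would report whether the download is complete and unmodified.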

Readme.txt ADDED

@@ -0,0 +1,88 @@

+-------------------------------------------------------------------------------
+      6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
+-------------------------------------------------------------------------------
+Authors               : Carlos Daniel Hernández Mena (carlosm@ru.is).
+Language              : Icelandic.
+Recommended use       : speech recognition.
+-------------------------------------------------------------------------------
+Description
+-------------------------------------------------------------------------------
+"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
+word-level n-gram language model in binary format suitable for recognizers
+based on the NVIDIA-NeMo framework [1].
+This language model was originally created to be used in the field of
+Automatic Speech Recognition (ASR). Specifically, it was designed for the
+following NeMo recipe, developed by the Language and Voice Lab (LVL) at
+Reykjavík University in 2022:
+  https://github.com/cadia-lvl/samromur-asr/tree/n5_samromur/n5_samromur
+Nevertheless, because resources of this kind are flexible and can be applied
+to other tasks, systems, or code recipes, it was decided to publish this
+model as an independent item.
+-------------------------------------------------------------------------------
+The Language Model
+-------------------------------------------------------------------------------
+The language model was created using the Icelandic Gigaword Corpus [2]. The
+Gigaword corpus contains text from newspaper articles, parliamentary speeches,
+adjudications, books, transcribed radio/television news and more. The
+normalization of the sentences used to generate the language model
+consists of allowing only characters belonging to the Icelandic alphabet,
+expanding numbers and abbreviations, and removing punctuation marks [3]. The
+resulting text is more than 44 million lines long (approximately 5.3 GB),
+and it was used to create a pruned 6-gram language model with
+the SRILM toolkit [4].
+-------------------------------------------------------------------------------
+Citation
+-------------------------------------------------------------------------------
+When publishing results based on this model, please cite:
+   Mena, Carlos; "6-GRAM Language Model in Icelandic for NeMo (Binary Format)
+   22.06". Web Download. Reykjavik University: Language and Voice Lab, 2022.
+Contact: Carlos Mena (carlosm@ru.is)
+License: CC BY 4.0
+-------------------------------------------------------------------------------
+Acknowledgements
+-------------------------------------------------------------------------------
+This initiative was funded by the Language Technology Programme for Icelandic
+2019-2023. The programme, which is managed and coordinated by Almannarómur,
+is funded by the Icelandic Ministry of Education, Science and Culture.
+-------------------------------------------------------------------------------
+References
+-------------------------------------------------------------------------------
+[1] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg,
+    B., ... & Cohen, J. M. (2019). NeMo: a toolkit for building AI
+    applications using neural modules. arXiv preprint arXiv:1909.09577.
+[2] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
+    & Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
+    corpus. In Proceedings of the Eleventh International Conference on
+    Language Resources and Evaluation (LREC 2018).
+[3] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
+    J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
+    In Proceedings of the Eleventh International Conference on Language
+    Resources and Evaluation (LREC 2018).
+[4] Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In
+    Seventh international conference on spoken language processing.
+-------------------------------------------------------------------------------
+-------------------------------------------------------------------------------
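
The Readme lists three normalization steps: allowing only Icelandic letters, expanding numbers and abbreviations, and removing punctuation. The sketch below illustrates only the first and third steps and is not the authors' pipeline. It rests on several assumptions: text is lowercased (the Readme does not mention casing), lines that still contain characters outside the alphabet are dropped rather than repaired, and number and abbreviation expansion is left out because it needs language-specific resources. The function name `normalize_line` is hypothetical.

```python
import unicodedata
from typing import Optional

# The 32 letters of the modern Icelandic alphabet (no c, q, w or z).
ICELANDIC_LETTERS = frozenset("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def normalize_line(line: str) -> Optional[str]:
    """Return a cleaned line, or None if the line should be discarded."""
    # Assumption: lowercase everything; the Readme does not state the casing.
    lowered = line.lower()
    # Replace every Unicode punctuation character with a space.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    tokens = no_punct.split()
    # Assumption: reject lines with leftover digits, symbols or foreign
    # letters, since number and abbreviation expansion is not implemented.
    if any(ch not in ICELANDIC_LETTERS for token in tokens for ch in token):
        return None
    return " ".join(tokens) if tokens else None
```

For instance, `normalize_line("Halló, heimur!")` yields `"halló heimur"`, while a line containing digits such as `"Árið 2022"` is rejected under these assumptions, because the digits would first have to be expanded into words.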