danielschnell committed
Commit 0cd204b · 1 Parent(s): 9d0ed7d

Copy from Clarin: http://hdl.handle.net/20.500.12537/226

"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
word-level n-gram language model in binary format suitable for
recognizers based on the NVIDIA-NeMo framework.

Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

Files changed (3)
  1. .gitattributes +1 -0
  2. 6GRAM_ARPA_MODEL.bin +3 -0
  3. Readme.txt +88 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ 6GRAM_ARPA_MODEL.bin filter=lfs diff=lfs merge=lfs -text
6GRAM_ARPA_MODEL.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
+ size 5433972976
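The three pointer lines above record the hash algorithm, digest and byte size of the real model file. As a minimal sketch (not part of this commit), a downloaded copy of the binary could be checked against that pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers introduced here for illustration, and the local file path is an assumption.

```python
import hashlib

# Pointer content copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
size 5433972976
"""


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer (hypothetical helper)."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file in chunks and compare its size and hash with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]
```

Assuming the model was saved locally as `6GRAM_ARPA_MODEL.bin`, calling `matches_pointer("6GRAM_ARPA_MODEL.bin", parse_lfs_pointer(POINTER))` should return `True` for an intact download. Reading in chunks keeps memory use flat for a file of roughly 5.4 GB.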
Readme.txt ADDED
@@ -0,0 +1,88 @@
+ -------------------------------------------------------------------------------
+ 6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
+ -------------------------------------------------------------------------------
+
+ Authors : Carlos Daniel Hernández Mena (carlosm@ru.is).
+
+ Language : Icelandic.
+
+ Recommended use : speech recognition.
+
+ -------------------------------------------------------------------------------
+ Description
+ -------------------------------------------------------------------------------
+
+ "6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
+ word-level n-gram language model in binary format suitable for recognizers
+ based on the NVIDIA-NeMo framework [1].
+
+ This language model was originally created for use in Automatic Speech
+ Recognition (ASR). Specifically, it was designed for the following NeMo
+ recipe, developed by the Language and Voice Lab (LVL) at Reykjavík
+ University in 2022:
+
+ https://github.com/cadia-lvl/samromur-asr/tree/n5_samromur/n5_samromur
+
+ Nevertheless, because resources of this kind are flexible and can be applied
+ to other tasks, systems or code recipes, it was decided to publish this model
+ as an independent item.
+
+ -------------------------------------------------------------------------------
+ The Language Model
+ -------------------------------------------------------------------------------
+
+ The language model was created using the Icelandic Gigaword Corpus [2]. The
+ Gigaword corpus contains text from newspaper articles, parliamentary speeches,
+ adjudications, books, transcribed radio/television news and more. The
+ sentences used to generate the language model were normalized by allowing
+ only characters belonging to the Icelandic alphabet, expanding numbers and
+ abbreviations, and removing punctuation marks [3]. The resulting text
+ contains more than 44 million lines (approximately 5.3 GB), and it was used
+ to create a pruned 6-gram language model with the SRILM toolkit [4].
+
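The normalization described above can be illustrated with a minimal sketch. This is not the authors' pipeline: lowercasing, dropping whole lines that still contain non-Icelandic characters, and the name `normalize_line` are all assumptions made here for illustration, and the expansion of numbers and abbreviations mentioned in the README is deliberately left out.

```python
import unicodedata

# The 32 letters of the modern Icelandic alphabet (no c, q, w or z).
ICELANDIC_LETTERS = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def normalize_line(line):
    """Return a lowercased, punctuation-free line, or None if it is rejected.

    Assumption: lines containing characters outside the Icelandic alphabet
    (digits, foreign letters) are dropped rather than repaired.
    """
    text = unicodedata.normalize("NFC", line).lower()
    # Replace every punctuation or symbol character with a space.
    text = "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    )
    tokens = text.split()
    if not tokens:
        return None
    # Reject the line if any remaining character is not an Icelandic letter.
    if any(ch not in ICELANDIC_LETTERS for token in tokens for ch in token):
        return None
    return " ".join(tokens)
```

Under these assumptions, `normalize_line("Halló, heimur!")` yields `"halló heimur"`, while a line containing digits or a letter such as `w` is rejected. A pipeline that expands numbers and abbreviations first, as the README describes, would keep many of the lines this sketch discards.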
+ -------------------------------------------------------------------------------
+ Citation
+ -------------------------------------------------------------------------------
+
+ When publishing results based on this model, please refer to:
+
+ Mena, Carlos; "6-GRAM Language Model in Icelandic for NeMo (Binary Format)
+ 22.06". Web Download. Reykjavik University: Language and Voice Lab, 2022.
+
+ Contact: Carlos Mena (carlosm@ru.is)
+
+ License: CC BY 4.0
+
+ -------------------------------------------------------------------------------
+ Acknowledgements
+ -------------------------------------------------------------------------------
+
+ This initiative was funded by the Language Technology Programme for Icelandic
+ 2019-2023. The programme, which is managed and coordinated by Almannarómur,
+ is funded by the Icelandic Ministry of Education, Science and Culture.
+
+ -------------------------------------------------------------------------------
+ References
+ -------------------------------------------------------------------------------
+
+ [1] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg,
+ B., ... & Cohen, J. M. (2019). NeMo: A toolkit for building AI
+ applications using neural modules. arXiv preprint arXiv:1909.09577.
+
+ [2] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
+ & Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
+ corpus. In Proceedings of the Eleventh International Conference on
+ Language Resources and Evaluation (LREC 2018).
+
+ [3] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
+ J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
+ In Proceedings of the Eleventh International Conference on Language
+ Resources and Evaluation (LREC 2018).
+
+ [4] Stolcke, A. (2002). SRILM - An extensible language modeling toolkit. In
+ Seventh International Conference on Spoken Language Processing.
+
+ -------------------------------------------------------------------------------
+ -------------------------------------------------------------------------------
+