Commit 0cd204b (danielschnell)
1 Parent(s): 9d0ed7d
Copy from Clarin: http://hdl.handle.net/20.500.12537/226
"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
word level n-gram language model in binary format suitable for
recognizers based on the NVIDIA-NeMo framework.
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitattributes +1 -0
- 6GRAM_ARPA_MODEL.bin +3 -0
- Readme.txt +88 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+6GRAM_ARPA_MODEL.bin filter=lfs diff=lfs merge=lfs -text
6GRAM_ARPA_MODEL.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
+size 5433972976
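The three lines added for 6GRAM_ARPA_MODEL.bin are a Git LFS pointer, not the model itself: they record the spec version, the SHA-256 of the real file, and its size in bytes (about 5.06 GiB). The following is a minimal sketch of how a downloaded copy could be checked against that pointer. The helper names `parse_lfs_pointer` and `verify_file` are hypothetical and not part of any Git LFS tooling; in practice `git lfs` performs this check itself.

```python
import hashlib

# Pointer contents exactly as added in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:dd3b955a0ddc8f1aa694ecaee76904340cd1bb8bf125cedb6f8560c0342d9c7e
size 5433972976
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file matches the pointer's size and SHA-256 digest."""
    hasher = hashlib.sha256()
    total = 0
    with open(path, "rb") as fh:
        # Read in chunks so a multi-gigabyte model is never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
print(f"{pointer['size'] / 2**30:.2f} GiB")
```

Calling `verify_file("6GRAM_ARPA_MODEL.bin", pointer)` on a local download (the path is an assumption) would then confirm that the file is complete and uncorrupted.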
Readme.txt
ADDED
@@ -0,0 +1,88 @@
+-------------------------------------------------------------------------------
+6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
+-------------------------------------------------------------------------------
+
+Authors : Carlos Daniel Hernández Mena (carlosm@ru.is).
+
+Language : Icelandic.
+
+Recommended use : speech recognition.
+
+-------------------------------------------------------------------------------
+Description
+-------------------------------------------------------------------------------
+
+"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a
+word level n-gram language model in binary format suitable for recognizers
+based on the NVIDIA-NeMo framework [1].
+
+This language model was originally created to be used in the field of
+Automatic Speech Recognition (ASR). Specifically, it was designed for the
+following NeMo recipe, developed by the Language and Voice Lab (LVL) at
+Reykjavík University in 2022:
+
+https://github.com/cadia-lvl/samromur-asr/tree/n5_samromur/n5_samromur
+
+Nevertheless, because this kind of resource is flexible and may be applied
+to other tasks, systems or code recipes, it was decided to publish this
+model as an independent item.
+
+-------------------------------------------------------------------------------
+The Language Model
+-------------------------------------------------------------------------------
+
+The language model was created using the Icelandic Gigaword Corpus [2]. The
+Gigaword corpus contains text from newspaper articles, parliamentary speeches,
+adjudications, books, transcribed radio/television news and more. The
+sentences used to generate the language model were normalized by
+allowing only characters belonging to the Icelandic alphabet, expanding
+numbers and abbreviations, and removing punctuation marks [3]. The
+resulting text has more than 44 million lines (approximately 5.3 GB),
+and it was used to create a pruned 6-gram language model with the SRILM
+toolkit [4].
+
+-------------------------------------------------------------------------------
+Citation
+-------------------------------------------------------------------------------
+
+When publishing results based on this model, please refer to:
+
+Mena, Carlos; "6-GRAM Language Model in Icelandic for NeMo (Binary Format)
+22.06". Web Download. Reykjavik University: Language and Voice Lab, 2022.
+
+Contact: Carlos Mena (carlosm@ru.is)
+
+License: CC BY 4.0
+
+-------------------------------------------------------------------------------
+Acknowledgements
+-------------------------------------------------------------------------------
+
+This initiative was funded by the Language Technology Programme for Icelandic
+2019-2023. The programme, which is managed and coordinated by Almannarómur,
+is funded by the Icelandic Ministry of Education, Science and Culture.
+
+-------------------------------------------------------------------------------
+References
+-------------------------------------------------------------------------------
+
+[1] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg,
+B., ... & Cohen, J. M. (2019). NeMo: a toolkit for building AI
+applications using neural modules. arXiv preprint arXiv:1909.09577.
+
+[2] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
+& Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
+corpus. In Proceedings of the Eleventh International Conference on
+Language Resources and Evaluation (LREC 2018).
+
+[3] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
+J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
+In Proceedings of the Eleventh International Conference on Language
+Resources and Evaluation (LREC 2018).
+
+[4] Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In
+Seventh International Conference on Spoken Language Processing.
+
+-------------------------------------------------------------------------------
+-------------------------------------------------------------------------------
+
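The normalization described in "The Language Model" section of Readme.txt (keep only characters of the Icelandic alphabet, remove punctuation) can be illustrated with a small sketch. This is not the authors' pipeline from [3]. The function name `normalize_line` is hypothetical, lowercasing is an assumption, and dropping whole tokens that contain non-Icelandic characters is only one possible reading of "allowing only characters belonging to the Icelandic alphabet". Number and abbreviation expansion is deliberately left out because it needs language-specific resources, so digits are simply discarded here.

```python
import re

# The 32 letters of the modern Icelandic alphabet, lowercase.
ICELANDIC_ALPHABET = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"

# A token is accepted only if every character is an Icelandic letter.
_TOKEN_OK = re.compile(f"[{ICELANDIC_ALPHABET}]+")


def normalize_line(line):
    """Lowercase, strip punctuation, and keep only purely Icelandic tokens.

    Assumption: tokens with digits or foreign letters are dropped rather
    than expanded or transliterated.
    """
    # Replace punctuation with spaces so adjacent words do not merge.
    no_punct = re.sub(r"[^\w\s]", " ", line.lower())
    tokens = [tok for tok in no_punct.split() if _TOKEN_OK.fullmatch(tok)]
    return " ".join(tokens)


print(normalize_line("Halló, heimur! Þetta er próf."))
# prints: halló heimur þetta er próf
```

One normalized sentence per line is the usual input layout for n-gram toolkits such as SRILM, which is why the sketch returns a single space-separated string.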