-------------------------------------------------------------------------------
      6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
-------------------------------------------------------------------------------

Authors               : Carlos Daniel Hernández Mena (carlosm@ru.is).

Language              : Icelandic.

Recommended use       : speech recognition.

-------------------------------------------------------------------------------
Description
-------------------------------------------------------------------------------

"6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06" is a 
word level n-gram language model in binary format suitable for recognizers 
based on the NVIDIA-NeMo framework [1].
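
As a rough illustration of what "word level 6-gram" means, the sketch below 
lists, for each word in a sentence, the history of at most five preceding 
words that a 6-gram model conditions on. The function name ngram_contexts 
and the <s> and </s> sentence-boundary markers are assumptions made for this 
example; they are not taken from the published model or from NeMo.

```python
from typing import List, Tuple


def ngram_contexts(
    words: List[str], order: int = 6
) -> List[Tuple[Tuple[str, ...], str]]:
    """Pair each word with the history an n-gram model of `order` would use.

    The <s> and </s> boundary markers are an assumption of this sketch.
    """
    padded = ["<s>"] + words + ["</s>"]
    pairs = []
    for i in range(1, len(padded)):
        # A model of order n looks at no more than n - 1 preceding tokens.
        history = tuple(padded[max(0, i - (order - 1)):i])
        pairs.append((history, padded[i]))
    return pairs


if __name__ == "__main__":
    for history, word in ngram_contexts("þetta er stutt setning".split()):
        print(" ".join(history), "->", word)
```

Short sentences never fill the full five-word history, so the higher orders 
only come into play for longer utterances.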

This language model was originally created for use in Automatic Speech 
Recognition (ASR). Specifically, it was designed for the following NeMo 
recipe, developed by the Language and Voice Lab (LVL) at Reykjavík 
University in 2022:

  https://github.com/cadia-lvl/samromur-asr/tree/n5_samromur/n5_samromur

Nevertheless, because resources of this kind are flexible and can be 
applied to other tasks, systems, or code recipes, it was decided to publish 
this model as an independent item.

-------------------------------------------------------------------------------
The Language Model
-------------------------------------------------------------------------------

The language model was created using the Icelandic Gigaword Corpus [2]. The 
Gigaword corpus contains text from newspaper articles, parliamentary speeches, 
adjudications, books, transcribed radio/television news and more. The 
sentences used to generate the language model were normalized by allowing 
only characters belonging to the Icelandic alphabet, expanding numbers and 
abbreviations, and removing punctuation marks [3]. The resulting text 
contains more than 44 million lines (approximately 5.3 GB), and it was used 
to create a pruned 6-gram language model with the SRILM toolkit [4].
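
The character filtering and punctuation removal described above can be 
sketched as follows. This is a minimal illustration and not the actual 
normalization pipeline of [3]: the 32-letter lowercase alphabet string, the 
choice to lowercase the text, and the function name normalize_line are all 
assumptions made for this example. Expanding numbers and abbreviations 
requires language-specific rules and is not attempted here, so digits are 
simply dropped.

```python
import re

# Assumption: the modern 32-letter Icelandic alphabet, lowercase only.
ICELANDIC_LETTERS = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"

_NOT_ALLOWED = re.compile(f"[^{ICELANDIC_LETTERS} ]")
_SPACES = re.compile(r"\s+")


def normalize_line(line: str) -> str:
    """Lowercase a line and keep only Icelandic letters and single spaces."""
    text = line.lower()
    # Turn tabs and newlines into plain spaces before filtering.
    text = _SPACES.sub(" ", text)
    # Punctuation, digits and foreign letters become spaces so that
    # neighbouring words are not glued together.
    text = _NOT_ALLOWED.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_line("Halló, heimur!"))
```

Replacing disallowed characters with a space, rather than deleting them, 
keeps words separated when they were joined only by punctuation.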

-------------------------------------------------------------------------------
Citation
-------------------------------------------------------------------------------

When publishing results based on this model, please refer to:

   Mena, Carlos; "6-GRAM Language Model in Icelandic for NeMo (Binary Format) 
   22.06". Web Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Carlos Mena (carlosm@ru.is)

License: CC BY 4.0

-------------------------------------------------------------------------------
Acknowledgements
-------------------------------------------------------------------------------

This initiative was funded by the Language Technology Programme for Icelandic 
2019-2023. The programme, which is managed and coordinated by Almannarómur, 
is funded by the Icelandic Ministry of Education, Science and Culture.

-------------------------------------------------------------------------------
References
-------------------------------------------------------------------------------

[1] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, 
    B., ... & Cohen, J. M. (2019). NeMo: a toolkit for building AI 
    applications using neural modules. arXiv preprint arXiv:1909.09577.

[2] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S., 
    & Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text 
    corpus. In Proceedings of the Eleventh International Conference on 
    Language Resources and Evaluation (LREC 2018).
    
[3] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason, 
    J. (2018, May). Open ASR for Icelandic: Resources and a baseline system. 
    In Proceedings of the Eleventh International Conference on Language 
    Resources and Evaluation (LREC 2018).
    
[4] Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In 
    Seventh International Conference on Spoken Language Processing.

-------------------------------------------------------------------------------
-------------------------------------------------------------------------------