metadata
license: cc-by-nc-sa-4.0
language:
- ga
- sga
- la
pipeline_tag: feature-extraction
library_name: gensim
Training Data
The Old Irish FastText models were trained on the St. Gall Glosses, the Würzburg Glosses and Old Irish texts from CELT. A text was included in the training dataset if "Old Irish" or the dates "700-900" were explicitly mentioned in its metadata on CELT; this includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish". As a result, the vocabulary of the Old Irish models may contain some Middle and Early Modern Irish words, as well as some Latin due to code-switching.
Available Models
There are three models in this family:
- Cased, 40 364 words:
old_irish_cased_ft_100_5_2.txt
- Lowercase, 38 216 words:
old_irish_lower_ft_100_5_2.txt
- Lowercase with initial mutations removed, 35 946 words:
old_irish_lower_demutated_ft_100_5_2.txt
All models are trained with the same hyperparameters (emb_size=100, window=5, min_count=2, n_epochs=100) and saved as KeyedVectors (see Gensim Documentation).
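The Usage section loads these files with load_word2vec_format(..., binary=False), which suggests they follow the plain word2vec text layout: a header line with the vocabulary size and the vector dimension, then one word and its values per line. The sketch below reads that layout with the standard library only, as a rough way to inspect a file without Gensim. The function name read_word2vec_text and the sample data are made up for illustration and are not taken from the model files.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec text format from an open text stream.

    Assumed layout: first line 'vocab_size dim', then one
    'word v1 v2 ... vN' entry per line, separated by single spaces.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        parts = line.rstrip("\n").split(" ")
        # Skip lines that do not hold exactly one word plus `dim` values.
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Toy input with 2 words and 3 dimensions (hypothetical values).
sample = io.StringIO("2 3\nfer 0.1 0.2 0.3\nben 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
```

For a real file, the same function could be called on open(path, encoding="utf-8"); with the models listed above, dim would be expected to equal 100.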
Usage
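from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the lowercase model file from the Hugging Face Hub
model_path = hf_hub_download(repo_id="ancatmara/old-irish-ft-vectors", filename="old_irish_lower_ft_100_5_2.txt")
# Load the vectors from the word2vec text format
model = KeyedVectors.load_word2vec_format(model_path, binary=False)
# List the nearest neighbours of the query word (10 by default)
model.similar_by_word('conchobar')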
Out:
>>> [('chonchobar', 0.7281773090362549),
('conchobair', 0.7064376473426819),
('fergus', 0.6939500570297241),
('conchulaind', 0.6923369765281677),
('conchobor', 0.6832515597343445),
('óenadaig', 0.6077216863632202),
('dochraidi', 0.5989463329315186),
('choba', 0.5952028632164001),
('conchobur', 0.5945655107498169),
('cúculaind', 0.5888893604278564)]
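The scores in the output above are cosine similarities between the query vector and the other vectors in the vocabulary. As a minimal sketch of that ranking, the pure-Python version below works on a small dictionary of toy vectors. The names cosine and most_similar and the toy data are illustrative assumptions, not part of Gensim or of these models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    if word not in vectors:
        # Mirrors the KeyError raised for words missing from the vocabulary.
        raise KeyError(f"'{word}' is not in the vocabulary")
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional vectors (hypothetical, for illustration only).
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
ranked = most_similar(toy, "a", topn=2)
```

Since the files appear to store whole-word vectors only, a query for a word outside the vocabulary will most likely fail rather than fall back on subword information, so checking membership first (as most_similar does here) is a reasonable precaution.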