A Universal Dependency parser built on top of a Transformer language model
Scores on the pre-tokenized test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.70  | 99.77  | 99.73  |       |
| Sentences | 100.00 | 100.00 | 100.00 |       |
| Words     | 99.62  | 99.61  | 99.61  |       |
| UPOS      | 96.99  | 96.97  | 96.98  | 97.36 |
| XPOS      | 93.65  | 93.64  | 93.65  | 94.01 |
| UFeats    | 91.31  | 91.29  | 91.30  | 91.65 |
| AllTags   | 86.86  | 86.85  | 86.86  | 87.19 |
| Lemmas    | 95.83  | 95.81  | 95.82  | 96.19 |
| UAS       | 89.01  | 89.00  | 89.00  | 89.35 |
| LAS       | 85.72  | 85.70  | 85.71  | 86.04 |
| CLAS      | 81.39  | 80.91  | 81.15  | 81.34 |
| MLAS      | 69.21  | 68.81  | 69.01  | 69.17 |
| BLEX      | 77.44  | 76.99  | 77.22  | 77.40 |
Scores on the untokenized (raw text) test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.50  | 99.66  | 99.58  |       |
| Sentences | 100.00 | 100.00 | 100.00 |       |
| Words     | 99.42  | 99.50  | 99.46  |       |
| UPOS      | 96.80  | 96.88  | 96.84  | 97.37 |
| XPOS      | 93.48  | 93.56  | 93.52  | 94.03 |
| UFeats    | 91.13  | 91.20  | 91.16  | 91.66 |
| AllTags   | 86.71  | 86.78  | 86.75  | 87.22 |
| Lemmas    | 95.66  | 95.74  | 95.70  | 96.22 |
| UAS       | 88.76  | 88.83  | 88.80  | 89.28 |
| LAS       | 85.49  | 85.55  | 85.52  | 85.99 |
| CLAS      | 81.19  | 80.73  | 80.96  | 81.31 |
| MLAS      | 69.06  | 68.67  | 68.87  | 69.16 |
| BLEX      | 77.28  | 76.84  | 77.06  | 77.39 |
To use the model, you need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer model (electra-base-igc-is):
```bash
git submodule update --init --recursive
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5
```
- For Python 3.9, you might need to install Cython first:

  ```bash
  pip install -U pip cython
  ```

- Then you can run the model as it is done in `parse_file.py`; a minimal usage sketch follows this list.
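The sketch below assumes COMBO's `predict` API as documented in the COMBO repository; the archive name `model.tar.gz` and the example sentence are placeholders, not taken from `parse_file.py`.

```python
from combo.predict import COMBO

# Placeholder path: point this at the trained parser archive
# shipped with this repository.
nlp = COMBO.from_pretrained("model.tar.gz")

# Parse one Icelandic sentence (raw text; COMBO tokenizes internally).
sentence = nlp("Hún las bókina í gær.")

# Each token carries the predicted UD annotations.
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag,
          token.head, token.deprel)
```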
For more instructions, see https://gitlab.clarin-pl.eu/syntactic-tools/combo.
The `Tokenizer` directory is a clone of Miðeind's tokenizer; a small usage sketch is given below.
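A minimal sketch of pre-tokenizing Icelandic text with Miðeind's tokenizer (the `tokenizer` package on PyPI; the example sentence is illustrative):

```python
from tokenizer import tokenize  # Miðeind's Icelandic tokenizer

text = "Hún las bókina í gær."

# tokenize() yields token objects; sentence-boundary markers have an
# empty .txt, so filter on it to keep only real tokens.
tokens = [tok.txt for tok in tokenize(text) if tok.txt]
print(tokens)  # ['Hún', 'las', 'bókina', 'í', 'gær', '.']
```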
The `transformer_models/` directory contains the pretrained model electra-base-igc-is, trained by Jón Friðrik Daðason, which supplies the parser with contextual embeddings.
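To inspect the encoder on its own, it can be loaded with the Hugging Face `transformers` library. A minimal sketch, assuming the directory holds a standard `transformers` checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

# Checkpoint directory inside this repository (see above).
path = "transformer_models/electra-base-igc-is"

tok = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)

# Contextual embeddings for one Icelandic sentence.
inputs = tok("Hún las bókina í gær.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```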