## A Universal Dependency parser built on top of a Transformer language model

Scores on pre-tokenized test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.70     | 99.77  | 99.73    |            |
| Sentences | 100.00    | 100.00 | 100.00   |            |
| Words     | 99.62     | 99.61  | 99.61    |            |
| UPOS      | 96.99     | 96.97  | 96.98    | 97.36      |
| XPOS      | 93.65     | 93.64  | 93.65    | 94.01      |
| UFeats    | 91.31     | 91.29  | 91.30    | 91.65      |
| AllTags   | 86.86     | 86.85  | 86.86    | 87.19      |
| Lemmas    | 95.83     | 95.81  | 95.82    | 96.19      |
| UAS       | 89.01     | 89.00  | 89.00    | 89.35      |
| LAS       | 85.72     | 85.70  | 85.71    | 86.04      |
| CLAS      | 81.39     | 80.91  | 81.15    | 81.34      |
| MLAS      | 69.21     | 68.81  | 69.01    | 69.17      |
| BLEX      | 77.44     | 76.99  | 77.22    | 77.40      |

Scores on untokenized test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.50     | 99.66  | 99.58    |            |
| Sentences | 100.00    | 100.00 | 100.00   |            |
| Words     | 99.42     | 99.50  | 99.46    |            |
| UPOS      | 96.80     | 96.88  | 96.84    | 97.37      |
| XPOS      | 93.48     | 93.56  | 93.52    | 94.03      |
| UFeats    | 91.13     | 91.20  | 91.16    | 91.66      |
| AllTags   | 86.71     | 86.78  | 86.75    | 87.22      |
| Lemmas    | 95.66     | 95.74  | 95.70    | 96.22      |
| UAS       | 88.76     | 88.83  | 88.80    | 89.28      |
| LAS       | 85.49     | 85.55  | 85.52    | 85.99      |
| CLAS      | 81.19     | 80.73  | 80.96    | 81.31      |
| MLAS      | 69.06     | 68.67  | 68.87    | 69.16      |
| BLEX      | 77.28     | 76.84  | 77.06    | 77.39      |

To use the model, you need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer language model (electra-base-igc-is):

* `pip install -U pip setuptools wheel`
* `pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5`
* On Python 3.9 you may also need to install Cython: `pip install -U pip cython`
* Once COMBO is installed, you can run the model as is done in `parse_file.py` (see the sketch below).

For further instructions, see https://gitlab.clarin-pl.eu/syntactic-tools/combo

The `Tokenizer/` directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). `transformer_models/` contains the pre-trained transformer model, electra-base-igc-is, from which the parser draws contextual word embeddings and attention. The model was trained by Jón Friðrik Daðason.
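
The snippet below is a minimal sketch of such a parsing script, following the usage documented for COMBO 1.x rather than the contents of `parse_file.py` itself. The model archive name `model.tar.gz` is a placeholder for the trained parser model, and the exact token fields may vary slightly between COMBO versions.

```python
# Minimal sketch of parsing with COMBO, in the spirit of parse_file.py.
# Assumption: "model.tar.gz" stands in for the path to the trained
# parser model distributed with this repository.
from combo.predict import COMBO

nlp = COMBO.from_pretrained("model.tar.gz")

# Parse one raw Icelandic sentence; COMBO returns a sentence object
# whose tokens carry the predicted UD annotations.
sentence = nlp("Hundurinn eltir köttinn.")
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag,
          token.feats, token.head, token.deprel)
```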
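
If you want to tokenize text yourself, for example to produce the kind of pre-tokenized input scored in the first table, the cloned tokenizer can be used directly. A small example, assuming the `tokenizer` package from the cloned repository (or from PyPI) is importable:

```python
# Tokenize Icelandic text with Miðeind's tokenizer.
from tokenizer import tokenize

tokens = [tok.txt
          for tok in tokenize("Hér er stutt íslensk setning.")
          if tok.txt]  # drop sentence-boundary markers, which carry no text
print(tokens)
```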
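
For illustration only, the model in `transformer_models/` can be inspected with Hugging Face `transformers`; COMBO loads these embeddings itself, so this step is not needed for parsing. The local path below is an assumption based on the directory name.

```python
# Hypothetical illustration: load the local ELECTRA model and compute
# contextual subword embeddings for one sentence. The path is an
# assumption based on the repository's directory layout.
from transformers import AutoModel, AutoTokenizer

path = "transformer_models/electra-base-igc-is"
hf_tokenizer = AutoTokenizer.from_pretrained(path)
electra = AutoModel.from_pretrained(path)

inputs = hf_tokenizer("Hér er stutt íslensk setning.", return_tensors="pt")
outputs = electra(**inputs)
print(outputs.last_hidden_state.shape)  # (1, n_subwords, hidden_size)
```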