## A Universal Dependency parser built on top of a Transformer language model

Scores on pre-tokenized test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.70     | 99.77  | 99.73    |            |
| Sentences | 100.00    | 100.00 | 100.00   |            |
| Words     | 99.62     | 99.61  | 99.61    |            |
| UPOS      | 96.99     | 96.97  | 96.98    | 97.36      |
| XPOS      | 93.65     | 93.64  | 93.65    | 94.01      |
| UFeats    | 91.31     | 91.29  | 91.30    | 91.65      |
| AllTags   | 86.86     | 86.85  | 86.86    | 87.19      |
| Lemmas    | 95.83     | 95.81  | 95.82    | 96.19      |
| UAS       | 89.01     | 89.00  | 89.00    | 89.35      |
| LAS       | 85.72     | 85.70  | 85.71    | 86.04      |
| CLAS      | 81.39     | 80.91  | 81.15    | 81.34      |
| MLAS      | 69.21     | 68.81  | 69.01    | 69.17      |
| BLEX      | 77.44     | 76.99  | 77.22    | 77.40      |

Scores on untokenized test data:

| Metric    | Precision | Recall | F1 Score | AlignedAcc |
|-----------|-----------|--------|----------|------------|
| Tokens    | 99.50     | 99.66  | 99.58    |            |
| Sentences | 100.00    | 100.00 | 100.00   |            |
| Words     | 99.42     | 99.50  | 99.46    |            |
| UPOS      | 96.80     | 96.88  | 96.84    | 97.37      |
| XPOS      | 93.48     | 93.56  | 93.52    | 94.03      |
| UFeats    | 91.13     | 91.20  | 91.16    | 91.66      |
| AllTags   | 86.71     | 86.78  | 86.75    | 87.22      |
| Lemmas    | 95.66     | 95.74  | 95.70    | 96.22      |
| UAS       | 88.76     | 88.83  | 88.80    | 89.28      |
| LAS       | 85.49     | 85.55  | 85.52    | 85.99      |
| CLAS      | 81.19     | 80.73  | 80.96    | 81.31      |
| MLAS      | 69.06     | 68.67  | 68.87    | 69.16      |
| BLEX      | 77.28     | 76.84  | 77.06    | 77.39      |

To use the model, you need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer language model (electra-base-igc-is):

* `pip install -U pip setuptools wheel`
* `pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5`
* On Python 3.9 you may also need to install Cython: `pip install -U pip cython`
* Once COMBO is installed, you can run the model as is done in `parse_file.py` (see the sketch below).

For further instructions, see https://gitlab.clarin-pl.eu/syntactic-tools/combo

The `Tokenizer/` directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). `transformer_models/` contains the pre-trained transformer model, electra-base-igc-is, from which the parser draws contextual word embeddings and attention. The model was trained by Jón Friðrik Daðason.
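
The snippet below is a minimal sketch of such a parsing script, following the usage documented for COMBO 1.x rather than the contents of `parse_file.py` itself. The model archive name `model.tar.gz` is a placeholder for the trained parser model, and the exact token fields may vary slightly between COMBO versions.

```python
# Minimal sketch of parsing with COMBO, in the spirit of parse_file.py.
# Assumption: "model.tar.gz" stands in for the path to the trained
# parser model distributed with this repository.
from combo.predict import COMBO

nlp = COMBO.from_pretrained("model.tar.gz")

# Parse one raw Icelandic sentence; COMBO returns a sentence object
# whose tokens carry the predicted UD annotations.
sentence = nlp("Hundurinn eltir köttinn.")
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag,
          token.feats, token.head, token.deprel)
```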
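
If you want to tokenize text yourself, for example to produce the kind of pre-tokenized input scored in the first table, the cloned tokenizer can be used directly. A small example, assuming the `tokenizer` package from the cloned repository (or from PyPI) is importable:

```python
# Tokenize Icelandic text with Miðeind's tokenizer.
from tokenizer import tokenize

tokens = [tok.txt
          for tok in tokenize("Hér er stutt íslensk setning.")
          if tok.txt]  # drop sentence-boundary markers, which carry no text
print(tokens)
```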
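
For illustration only, the model in `transformer_models/` can be inspected with Hugging Face `transformers`; COMBO loads these embeddings itself, so this step is not needed for parsing. The local path below is an assumption based on the directory name.

```python
# Hypothetical illustration: load the local ELECTRA model and compute
# contextual subword embeddings for one sentence. The path is an
# assumption based on the repository's directory layout.
from transformers import AutoModel, AutoTokenizer

path = "transformer_models/electra-base-igc-is"
hf_tokenizer = AutoTokenizer.from_pretrained(path)
electra = AutoModel.from_pretrained(path)

inputs = hf_tokenizer("Hér er stutt íslensk setning.", return_tensors="pt")
outputs = electra(**inputs)
print(outputs.last_hidden_state.shape)  # (1, n_subwords, hidden_size)
```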