Show simple item record

 
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Arnardóttir, Þórunn
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Snæbjarnarson, Vésteinn
dc.date.accessioned 2024-03-22T10:36:52Z
dc.date.available 2024-03-22T10:36:52Z
dc.date.issued 2024-03-06
dc.identifier.uri http://hdl.handle.net/20.500.12537/324
dc.description This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned byT5-base Transformer model for error correction in natural language. It acts as a machine translation model in that it “translates” from deficient Icelandic to correct Icelandic. The model is an improved version of a previous model, which is available here: http://hdl.handle.net/20.500.12537/321. Compared with that model, the improved model is trained on contextual and domain-tagged data, with additional span-masking pre-training and a wider variety of text genres. Three kinds of training data were used: span-masked data, parallel synthetic error data and real error data. The span-masked pre-training data consisted of a wide variety of texts, including forums and texts from the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/254). Synthetic error data was derived from several sources, e.g. IGC (data which was excluded from the span-masked data), MÍM (http://hdl.handle.net/20.500.12537/113), student essays and educational material. This data was scrambled to simulate real grammatical and typographical errors, and some span-masking was included. Fine-tuning data came from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The model can correct a variety of textual errors, even in texts containing many errors, such as those written by people with dyslexia. Measured on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320), the model scores 0.898229 on the GLEU metric (a modified BLEU for grammatical error correction) and 0.07% in TER (translation error rate). Measured on the test set of the Icelandic Error Corpus, the model scores 0.906834 on the GLEU metric and 0.04% in TER.
This correction model for Icelandic is a fine-tuned byT5-base Transformer model. It is in effect a translation model that translates from Icelandic text containing errors into text without errors. The model is an updated version of an earlier model, which can be accessed here: http://hdl.handle.net/20.500.12537/321. The updated model is trained on context and on data tagged by domain, together with span-filling (span-masking) training and training on more varied text. The model is trained on span filling, on parallel synthetic error data and on real error data. The span-filling data was taken from a variety of texts, including forums and texts from the Icelandic Gigaword Corpus (Risamálheildin, http://hdl.handle.net/20.500.12537/254). The synthetic error data was likewise taken from a variety of texts, including the Icelandic Gigaword Corpus (the part that was not used in the span-filling task), MÍM (http://hdl.handle.net/20.500.12537/113), student essays and educational material. The data was scrambled to imitate real grammatical and spelling errors and was partly masked in order to train span filling. The fine-tuning data was taken from the Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (Icelandic as a foreign language: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, children's texts: http://hdl.handle.net/20.500.12537/133). The model can correct a wide range of textual errors, even in text containing very many errors, such as text written by people with dyslexia. The model scores 0.898229 GLEU points (BLEU adapted to proofreading) and has a translation error rate of 0.07% when evaluated on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320). When evaluated on the test set of the Icelandic Error Corpus, the model scores 0.906834 GLEU points and has a translation error rate of 0.04%.
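The description calls the model byte-level. As a rough illustration of what that means for Icelandic input, the sketch below encodes text the way the publicly documented ByT5 tokenizer does: each UTF-8 byte becomes one token id, shifted by 3 to leave room for the pad, end-of-sequence and unknown ids. The offset, the EOS id and the function names are assumptions made for this sketch; nothing here is taken from the `infer.py` shipped with the model.

```python
from typing import List

# Assumed ByT5 convention: ids 0, 1, 2 are pad, eos, unk; byte b maps to id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode(text: str) -> List[int]:
    """Map text to byte-level token ids and append the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Invert encode(): drop special ids and rebuild the UTF-8 string."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    word = "það"  # 3 characters, 5 UTF-8 bytes: þ and ð take two bytes each
    ids = encode(word)
    print(len(word), len(ids), ids)
    print(decode(ids))
```

Because the vocabulary is just bytes, a misspelled or unusual word never falls outside the vocabulary; the cost is longer input sequences, especially for letters such as þ, ð and æ that need two bytes each.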
dc.language.iso isl
dc.publisher Miðeind ehf
dc.relation.replaces http://hdl.handle.net/20.500.12537/321
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject gec
dc.subject ged
dc.subject grammatical error correction
dc.subject grammatical error detection
dc.title Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (24.03)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Svanhvít Lilja Ingólfsdóttir; svanhvit@mideind.is; Miðeind ehf
sponsor Ministry of Education, Science and Culture; Semantic analysis for spell and grammar checking (L13); Language Technology for Icelandic 2019-2023; nationalFunds
files.size 2156551805
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: byt5_M12_clarin.zip
Size: 2.01 GB
Format: application/zip
MD5: cdded50ac9d6bfaea1d2a344f4cfb407
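The MD5 value listed for the archive can be used to check that a roughly 2 GB download arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the local path in the usage comment are hypothetical, and the file is read in chunks so the whole archive never has to fit in memory.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in this record for byt5_M12_clarin.zip.
EXPECTED_MD5 = "cdded50ac9d6bfaea1d2a344f4cfb407"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Usage (hypothetical local path):
#   if not verify_download(Path("byt5_M12_clarin.zip")):
#       raise SystemExit("Checksum mismatch: download the archive again.")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.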
 File Preview  
  • byt5_M12_clarin
    • config.json (876 B)
    • infer.py (1 kB)
    • pytorch_model.bin (2 GB)
    • README (2 kB)
    • requirements.txt (35 B)
    • some_is_sentences.txt (138 B)
