dc.contributor.author | Ingólfsdóttir, Svanhvít Lilja |
dc.contributor.author | Ragnarsson, Pétur Orri |
dc.contributor.author | Jónsson, Haukur Páll |
dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.date.accessioned | 2022-09-23T09:18:27Z |
dc.date.available | 2022-09-23T09:18:27Z |
dc.date.issued | 2022-09-19 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/255 |
dc.description | This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned byT5-base Transformer model for correcting errors in natural-language text. It works like a machine translation model in that it “translates” from erroneous Icelandic into correct Icelandic. The model is trained on parallel synthetic error data and on real error data from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and three specialised error corpora (L2, i.e. Icelandic as a foreign language: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained byT5 model was first trained on the synthetic data and then fine-tuned on the real error data from IceEC. It can correct a wide range of textual errors, even in texts that contain very many errors, such as those written by people with dyslexia. Measured on the IceEC test data, the model scores 0.862917 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.06% in TER (translation error rate). |
dc.language.iso | isl |
dc.publisher | Miðeind ehf |
dc.relation.isreplacedby | http://hdl.handle.net/20.500.12537/321 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://yfirlestur.is |
dc.subject | iec |
dc.subject | ged |
dc.subject | grammatical error detection |
dc.subject | grammatical error correction |
dc.title | Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (22.09) |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://huggingface.co/mideind/yfirlestur-icelandic-correction-byt5 |
contact.person | Svanhvít Lilja Ingólfsdóttir svanhvit@mideind.is Miðeind ehf |
sponsor | Ministry of Education, Science and Culture Spell and grammar checking with neural networks (L14) Language Technology for Icelandic 2019-2023 nationalFunds |
files.size | 2326701060 |
files.count | 6 |
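The description states that the synthetic training data was produced by scrambling clean corpus text to simulate errors. The following is a minimal sketch of that general idea only, not the authors' pipeline: the function names `scramble` and `make_pairs`, the three noise operations (swap, drop, double), the `rate` parameter, and the sample sentence are all assumptions made for illustration. The actual error patterns used for this model are not specified in this record.

```python
import random


def scramble(sentence: str, rate: float, rng: random.Random) -> str:
    """Inject simple character-level noise into a sentence (hypothetical scheme).

    Each alphabetic character is, with probability `rate`, either swapped
    with the following letter, dropped, or doubled.
    """
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("swap", "drop", "double"))
            if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                # Transpose this letter with the next one.
                out.extend((chars[i + 1], c))
                i += 2
                continue
            if op == "drop":
                # Omit the letter entirely.
                i += 1
                continue
            if op == "double":
                # Repeat the letter.
                out.extend((c, c))
                i += 1
                continue
        # No noise applied (or a swap was not possible): keep the character.
        out.append(c)
        i += 1
    return "".join(out)


def make_pairs(sentences, rate=0.1, seed=0):
    """Return (noisy_source, clean_target) pairs for seq2seq training."""
    rng = random.Random(seed)
    return [(scramble(s, rate, rng), s) for s in sentences]


if __name__ == "__main__":
    # Illustrative sentence, not taken from the corpora listed in this record.
    for noisy, clean in make_pairs(["Ég fór í búðina í gær."], rate=0.3, seed=1):
        print(noisy, "->", clean)
```

Each pair keeps the clean sentence as the target and the noised sentence as the source, which matches the "translation" framing in the description. A fixed seed makes the generated data reproducible.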
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name | Size | Format | Description | MD5 |
some_is_sentences.txt | 138 bytes | Text file | Unknown | a8e139902c71c805545ab7241cd975bb |
README | 1.72 KB | Unknown | Unknown | 77363b9e91fb3f614de3bc321fed23b7 |
requirements.txt | 35 bytes | Text file | Unknown | 3aff608c3c35a6d0b9dd503f0f537ca9 |
infer.py | 458 bytes | Unknown | Unknown | dccce3f304edc76d143b36dc3fb3a4a1 |
config.json | 738 bytes | Unknown | Unknown | 36ba274693a96f84ad54c6ddb300a099 |
pytorch_model.bin | 2.17 GB | Unknown | Unknown | 27e6b147a80b5f770b4cc56653265b1e |
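The MD5 checksums listed for the six files can be used to check a download for corruption, which is worthwhile given the 2.17 GB model file. Below is a minimal sketch using only the Python standard library; the helper names `md5_of` and `verify`, and the assumption that all files sit in one local directory, are illustrative and not part of the item.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this item.
EXPECTED_MD5 = {
    "some_is_sentences.txt": "a8e139902c71c805545ab7241cd975bb",
    "README": "77363b9e91fb3f614de3bc321fed23b7",
    "requirements.txt": "3aff608c3c35a6d0b9dd503f0f537ca9",
    "infer.py": "dccce3f304edc76d143b36dc3fb3a4a1",
    "config.json": "36ba274693a96f84ad54c6ddb300a099",
    "pytorch_model.bin": "27e6b147a80b5f770b4cc56653265b1e",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its checksum matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == checksum
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in fixed-size chunks avoids loading the whole model file into memory. MD5 is adequate here for detecting accidental corruption, though it is not a defence against deliberate tampering.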