dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Snæbjarnarson, Vésteinn
dc.date.accessioned 2022-09-23T09:18:27Z
dc.date.available 2022-09-23T09:18:27Z
dc.date.issued 2022-09-19
dc.identifier.uri http://hdl.handle.net/20.500.12537/255
dc.description This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned ByT5-base Transformer model for correcting errors in natural-language text. It works like a machine translation model: it “translates” from flawed Icelandic into correct Icelandic. The model is trained on parallel synthetic error data and on real error data from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained ByT5 model was first trained on the synthetic data and then fine-tuned on the real error data. It can correct a wide range of textual errors, even in texts that contain very many of them, such as texts written by people with dyslexia. Measured on the IceEC test data, the model scores 0.862917 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.06% TER (translation error rate).
dc.language.iso isl
dc.publisher Miðeind ehf
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/321
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://yfirlestur.is
dc.subject iec
dc.subject ged
dc.subject grammatical error detection
dc.subject grammatical error correction
dc.title Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (22.09)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
demo.uri https://huggingface.co/mideind/yfirlestur-icelandic-correction-byt5
contact.person Svanhvít Lilja Ingólfsdóttir svanhvit@mideind.is Miðeind ehf
sponsor Ministry of Education, Science and Culture Spell and grammar checking with neural networks (L14) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 2326701060
files.count 6
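
The title and description call the model byte-level: ByT5 reads text as raw UTF-8 bytes instead of subword tokens, which is why misspelled words never fall outside its vocabulary. The sketch below imitates that encoding with the standard library only. It assumes the usual ByT5 convention that ids 0, 1 and 2 are reserved for padding, end of sequence and unknown, so a byte `b` becomes id `b + 3`; the function names are illustrative and are not part of this item's files.

```python
# Illustrative re-implementation of ByT5-style byte encoding (not the real tokenizer).
# Assumption: ids 0, 1, 2 are reserved (<pad>, </s>, <unk>), so byte b maps to id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode(text: str) -> list[int]:
    """Map text to byte-level ids and append the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Invert encode(); reserved ids and anything above the byte range are skipped."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


# One of the sample sentences from some_is_sentences.txt (errors are intentional).
sentence = "Kristínu hlakkar til að fá ísin."
ids = encode(sentence)

# Accented Icelandic letters take two bytes each, so the id sequence is
# longer than the character count.
print(len(sentence), "characters ->", len(ids), "ids")
print(encode("í"))  # [198, 176, 1]
```

A practical consequence is that input length in ids grows with the number of non-ASCII letters, which matters when choosing a maximum sequence length for Icelandic text.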


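The description states that the synthetic training data was made by scrambling clean text to simulate real errors, but it does not list the error patterns. The toy sketch below shows the general idea of turning a clean sentence into a (noisy, clean) training pair. The three patterns, the `rate` parameter and all function names are hypothetical and chosen only for illustration; they are not the patterns used to build the 35M-line corpus.

```python
import random

# Hypothetical patterns for illustration; the real ones are not given in this record.
ACCENT_SWAPS = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ý": "y"}


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Transpose two neighbouring characters (a common typing slip)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def drop_accent(word: str, rng: random.Random) -> str:
    """Replace one accented vowel with its unaccented counterpart."""
    positions = [i for i, ch in enumerate(word) if ch in ACCENT_SWAPS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + ACCENT_SWAPS[word[i]] + word[i + 1:]


def drop_char(word: str, rng: random.Random) -> str:
    """Delete one character, never emptying the word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


PATTERNS = (swap_adjacent, drop_accent, drop_char)


def corrupt(sentence: str, rate: float = 0.3, seed: int = 0) -> str:
    """Apply a randomly chosen pattern to roughly `rate` of the words."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if rng.random() < rate:
            word = rng.choice(PATTERNS)(word, rng)
        noisy.append(word)
    return " ".join(noisy)


def make_pair(clean: str, rate: float = 0.3, seed: int = 0) -> tuple[str, str]:
    """Return (source with errors, target without errors)."""
    return corrupt(clean, rate, seed), clean


source, target = make_pair("Mig langar í ís með súkulaðisósu.", rate=0.5, seed=1)
print(source)
print(target)
```

Fixing the seed makes the corruption reproducible, which helps when the same clean corpus has to be regenerated with identical noise.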
 Files in this item

Total size of all files: 2.17 GB
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).

Name                   Size       Format     Description  MD5
some_is_sentences.txt  138 bytes  Text file  Unknown      a8e139902c71c805545ab7241cd975bb
README                 1.72 KB    Unknown    Unknown      77363b9e91fb3f614de3bc321fed23b7
requirements.txt       35 bytes   Text file  Unknown      3aff608c3c35a6d0b9dd503f0f537ca9
infer.py               458 bytes  Unknown    Unknown      dccce3f304edc76d143b36dc3fb3a4a1
config.json            738 bytes  Unknown    Unknown      36ba274693a96f84ad54c6ddb300a099
pytorch_model.bin      2.17 GB    Unknown    Unknown      27e6b147a80b5f770b4cc56653265b1e

File preview: some_is_sentences.txt
Hverjum lángar í ís með súkulaðisósu.
Kristínu hlakkar til að fá ísin.
mig langar í ís þig langar í ís alla langar í ís

File preview: requirements.txt
torch==1.12.1
transformers==4.22.0
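
The file listing gives an MD5 checksum for every file, which makes it possible to check a download, especially the 2.17 GB `pytorch_model.bin`. The sketch below is one way to do that with Python's `hashlib`. The checksums are copied from the listing; the function names and the assumption that all six files sit in one directory are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this item.
EXPECTED_MD5 = {
    "some_is_sentences.txt": "a8e139902c71c805545ab7241cd975bb",
    "README": "77363b9e91fb3f614de3bc321fed23b7",
    "requirements.txt": "3aff608c3c35a6d0b9dd503f0f537ca9",
    "infer.py": "dccce3f304edc76d143b36dc3fb3a4a1",
    "config.json": "36ba274693a96f84ad54c6ddb300a099",
    "pytorch_model.bin": "27e6b147a80b5f770b4cc56653265b1e",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so the large model file is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory) -> dict:
    """Return {file name: True (match), False (mismatch) or None (file missing)}."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


# Usage, assuming the files were downloaded into the current directory:
#   for name, ok in verify(".").items():
#       print(name, {True: "OK", False: "MISMATCH", None: "missing"}[ok])
```

MD5 is adequate here for detecting a truncated or corrupted download; it is not a defence against deliberate tampering.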

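The record points to the model on Hugging Face (`demo.uri`) and pins `torch==1.12.1` and `transformers==4.22.0` in `requirements.txt`. The sketch below shows how the model might be run on the sample sentences through the Transformers `text2text-generation` pipeline. It is not the content of `infer.py`, which is not shown in this record. Two assumptions are made: that the stock `google/byt5-base` tokenizer is the right one (the item ships no tokenizer files), and that a `max_length` of 512 is sufficient. The helper names are illustrative.

```python
from typing import Callable, Iterable

MODEL_ID = "mideind/yfirlestur-icelandic-correction-byt5"  # taken from demo.uri
TOKENIZER_ID = "google/byt5-base"  # assumption: stock ByT5 byte-level tokenizer


def load_corrector(device: int = -1):
    """Build the pipeline; this downloads the model weights (about 2.17 GB).

    device=-1 runs on CPU, device=0 on the first GPU.
    """
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline(
        "text2text-generation",
        model=MODEL_ID,
        tokenizer=TOKENIZER_ID,
        device=device,
    )


def read_sentences(text: str) -> list[str]:
    """Split file content into non-empty, stripped lines (one sentence per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def correct(
    sentences: Iterable[str],
    generate: Callable,
    max_length: int = 512,  # assumption; counted in bytes for a byte-level model
) -> list[str]:
    """Run `generate` (for example the pipeline) and collect the corrected strings."""
    outputs = generate(list(sentences), max_length=max_length)
    return [item["generated_text"] for item in outputs]


# Usage, assuming some_is_sentences.txt is in the current directory:
#   corrector = load_corrector()
#   with open("some_is_sentences.txt", encoding="utf-8") as handle:
#       for fixed in correct(read_sentences(handle.read()), corrector):
#           print(fixed)
```

Passing the generator in as a callable keeps the text handling separate from the model, so the same helpers can be exercised with a stub before committing to the large download.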