Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (24.03)

Ingólfsdóttir, Svanhvít Lilja; Arnardóttir, Þórunn; Ragnarsson, Pétur Orri; Jónsson, Haukur Páll; Símonarson, Haukur Barri; Þorsteinsson, Vilhjálmur; Snæbjarnarson, Vésteinn

dc.contributor.author	Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author	Arnardóttir, Þórunn
dc.contributor.author	Ragnarsson, Pétur Orri
dc.contributor.author	Jónsson, Haukur Páll
dc.contributor.author	Símonarson, Haukur Barri
dc.contributor.author	Þorsteinsson, Vilhjálmur
dc.contributor.author	Snæbjarnarson, Vésteinn
dc.date.accessioned	2024-03-22T10:36:52Z
dc.date.available	2024-03-22T10:36:52Z
dc.date.issued	2024-03-06
dc.identifier.uri	http://hdl.handle.net/20.500.12537/324
dc.description	This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned byT5-base Transformer model for error correction in natural language. It acts as a machine translation model in that it “translates” from deficient Icelandic to correct Icelandic. The model is an improved version of a previous model which is accessible here: http://hdl.handle.net/20.500.12537/321. The improved model is trained on contextual and domain-tagged data, with an additional span-masking pre-training, along with a wider variety of text genre. The model is trained on span-masked data, parallel synthetic error data and real error data. The span-masked pre-training data consisted of a wide variety of texts, including forums and texts from the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/254). Synthetic error data was taken from different texts, e.g. from IGC (data which was excluded from the span-masked data), MÍM (http://hdl.handle.net/20.500.12537/113), student essays and educational material. This data was scrambled to simulate real grammatical and typographical errors, and some span-masking was included. Fine-tuning data consisted of data from the iceErrorCorpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The model can correct a variety of textual errors, even in texts containing many errors, such as those written by people with dyslexia. Measured on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320), the model scores 0.898229 on the GLEU metric (modified BLEU for grammatical error correction) and 0.07% in TER (translation error rate). When measured on the Icelandic Error Corpus' test set, the model scores 0.906834 on the GLEU metric and 0.04% in TER. Þetta leiðréttingarlíkan fyrir íslensku er fínþjálfað byT5-base Transformer-líkan. Það er í raun þýðingalíkan sem þýðir úr íslenskum texta með villum yfir í texta án villna. Líkanið er uppfærð útgáfa af fyrra líkani sem má nálgast hér: http://hdl.handle.net/20.500.12537/321. Uppfærða líkanið er þjálfað á samhengi og gögnum sem hafa verið merkt fyrir óðölum ásamt eyðufylllingarþjálfun og þjálfun með fjölbreyttari texta. Líkanið er þjálfað í eyðufyllingu, á samhliða gervivillugögnum og raunverulegum villugögnum. Eyðufyllingargögn voru tekin úr ýmsum texta, m.a. úr spjallborðum og textum úr Risamálheildinni (http://hdl.handle.net/20.500.12537/254). Gervivillugögn voru einnig tekin úr ýmsum texta, m.a. úr Risamálheildinni (þeim hluta sem var ekki í eyðufyllingarverkefninu), MÍM (http://hdl.handle.net/20.500.12537/113), nemendaritgerðum og fræðsluefni. Gögnin voru rugluð til þess að líkja eftir raunverulegum málfræði- og ritunarvillum og voru að hluta til hulin til þess að þjálfa eyðufyllingu. Fínþjálfunargögn voru tekin úr íslensku villumálheildinni (http://hdl.handle.net/20.500.12537/73) og sérhæfðu villumálheildunum þremur (íslenska sem erlent mál: http://hdl.handle.net/20.500.12537/131, lesblinda: http://hdl.handle.net/20.500.12537/132, barnatextar: http://hdl.handle.net/20.500.12537/133). Líkanið getur leiðrétt fjölbreyttar textavillur, jafnvel í texta sem inniheldur mjög margar villur, svo sem frá fólki með lesblindu. Líkanið skorar 0,898229 GLEU-stig (BLEU nema lagað að málrýni) og er með 0,07% villuhlutfall í þýðingu (translation error rate), þegar það er metið á Prófunarmengi fyrir textaleiðréttingar (http://hdl.handle.net/20.500.12537/320). Þegar það er metið á prófunarmengi íslensku villumálheildarinnar skorar líkanið 0,906834 GLEU-stig og er með 0,04% villuhlutfall í þýðingu.
dc.language.iso	isl
dc.publisher	Miðeind ehf
dc.relation.replaces	http://hdl.handle.net/20.500.12537/321
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	gec
dc.subject	ged
dc.subject	grammatical error correction
dc.subject	grammatical error detection
dc.title	Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (24.03)
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	Clarin IS Repository
contact.person	Svanhvít Lilja Ingólfsdóttir svanhvit@mideind.is Miðeind ehf
sponsor	Ministry of Education, Science and Culture Semantic analysis for spell and grammar checking (L13) Language Technology for Icelandic 2019-2023 nationalFunds
files.size	2156551805
files.count	1

Files in this item

This item is

Publicly Available

and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: byt5_M12_clarin.zip
Size: 2.01 GB
Format: application/zip
MD5: cdded50ac9d6bfaea1d2a344f4cfb407

Download file Preview

File Preview

byt5_M12_clarin
- config.json876 B
- infer.py1 kB
- pytorch_model.bin2 GB
- README2 kB
- requirements.txt35 B
- some_is_sentences.txt138 B

Show simple item record

Files in this item

Partners, Coordination, Funding

Repository

More