Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (23.12)

Ingólfsdóttir, Svanhvít Lilja; Arnardóttir, Þórunn; Ragnarsson, Pétur Orri; Jónsson, Haukur Páll; Símonarson, Haukur Barri; Þorsteinsson, Vilhjálmur; Snæbjarnarson, Vésteinn

dc.contributor.author	Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author	Arnardóttir, Þórunn
dc.contributor.author	Ragnarsson, Pétur Orri
dc.contributor.author	Jónsson, Haukur Páll
dc.contributor.author	Símonarson, Haukur Barri
dc.contributor.author	Þorsteinsson, Vilhjálmur
dc.contributor.author	Snæbjarnarson, Vésteinn
dc.date.accessioned	2024-01-30T14:23:47Z
dc.date.available	2024-01-30T14:23:47Z
dc.date.issued	2023-12-31
dc.identifier.uri	http://hdl.handle.net/20.500.12537/321
dc.description	This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned ByT5-base Transformer model for correcting errors in natural-language text. It acts as a machine translation model in that it “translates” from erroneous Icelandic to correct Icelandic. The model is an improved version of a previous model, which is available at http://hdl.handle.net/20.500.12537/255. Compared with that model, it is trained on contextual and domain-tagged data, with an additional span-masking pre-training step and a wider variety of text genres. Training used three kinds of data: span-masked data, parallel synthetic error data and real error data. The span-masking pre-training step used 30 million training examples from a wide variety of texts, including forums and texts from the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/254). The synthetic error data consisted of 8.5 million training examples drawn from various texts, including IGC (the part that was excluded from the span-masked data), MÍM (http://hdl.handle.net/20.500.12537/113), student essays and educational material. This data was scrambled to simulate real grammatical and typographical errors. The fine-tuning data came from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The model can correct a variety of textual errors, even in texts containing many errors, such as those written by people with dyslexia. Measured on the Grammatical Error Correction Test Set, the model scores 0.918975 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.06% in TER (translation error rate).
dc.language.iso	isl
dc.publisher	Miðeind ehf
dc.relation.replaces	http://hdl.handle.net/20.500.12537/255
dc.relation.isreplacedby	http://hdl.handle.net/20.500.12537/324
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	gec
dc.subject	ged
dc.subject	grammatical error correction
dc.subject	grammatical error detection
dc.title	Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (23.12)
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	Clarin IS Repository
contact.person	Svanhvít Lilja Ingólfsdóttir svanhvit@mideind.is Miðeind ehf
sponsor	Ministry of Education, Science and Culture Semantic analysis for spell and grammar checking (L13) Language Technology for Icelandic 2019-2023 nationalFunds
files.size	2156552100
files.count	1
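
The description identifies the model as byte-level (ByT5-base), meaning that it reads raw UTF-8 bytes rather than subword tokens. The following is a minimal sketch of what that input representation looks like for Icelandic text. It assumes the convention used by the ByT5 tokenizer in Hugging Face Transformers (ids 0, 1 and 2 reserved for padding, end-of-sequence and unknown, so each byte value is shifted by 3). The names BYTE_OFFSET, EOS_ID, encode_bytes and decode_bytes are hypothetical and are not taken from the infer.py script shipped with this item.

```python
# Sketch of byte-level encoding in the style of ByT5 (assumed convention:
# ids 0..2 are pad/eos/unk, so byte value b maps to id b + 3).
BYTE_OFFSET = 3   # assumption: number of reserved special ids
EOS_ID = 1        # assumption: end-of-sequence id


def encode_bytes(text):
    """Map a string to a list of byte-level ids, terminated by EOS."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids):
    """Map ids back to text, skipping special ids and anything outside the byte range."""
    raw = bytes(
        i - BYTE_OFFSET
        for i in ids
        if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")
```

Under these assumptions, Icelandic letters such as þ and ð each occupy two ids because they are two bytes long in UTF-8, so a three-letter word like "Það" becomes five byte ids plus the end-of-sequence id. This is one reason byte-level models see longer input sequences than subword models for the same text.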

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: byt5_correction_model.zip
Size: 2.01 GB
Format: application/zip
Description: A zip file containing all files needed for using the model
MD5: e249ce74f94f489c5a87457f41075302
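
Since the record publishes an MD5 checksum for the archive, a download can be checked against it before unpacking. The sketch below uses only the Python standard library; the function names md5_of_file and verify_download are hypothetical, and the local path passed to them is assumed to be wherever the zip file was saved.

```python
import hashlib

# MD5 value as listed in this record for byt5_correction_model.zip
EXPECTED_MD5 = "e249ce74f94f489c5a87457f41075302"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a 2 GB archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("byt5_correction_model.zip") would return True only if the local file matches the checksum above. MD5 is suitable here for detecting a corrupted or truncated download, not as a security guarantee.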

File Preview

byt5_M11_clarin
- config.json (951 B)
- infer.py (1 kB)
- pytorch_model.bin (2 GB)
- README (2 kB)
- requirements.txt (35 B)
- some_is_sentences.txt (138 B)
