dc.contributor.author | Ingólfsdóttir, Svanhvít Lilja |
dc.contributor.author | Ragnarsson, Pétur Orri |
dc.contributor.author | Jónsson, Haukur Páll |
dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.date.accessioned | 2022-09-23T09:18:42Z |
dc.date.available | 2022-09-23T09:18:42Z |
dc.date.issued | 2022-09-20 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/256 |
dc.description | The model is a ByT5-base Transformer fine-tuned for grammatical error detection in Icelandic text. It performs sentence classification and was trained on parallel synthetic error data and on real error data from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained ByT5 model was first trained on the synthetic data and then fine-tuned on the real error data. The objective was a grammatical error detection model that classifies whether a sentence is likely to contain an error. The overall F1 score is 72.8% (precision: 76.3%, recall: 71.7%). |
dc.language.iso | isl |
dc.publisher | Miðeind ehf |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | ged |
dc.subject | grammatical error detection |
dc.title | Binary Error Classifier for Icelandic Sentences (22.09) |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Svanhvít Lilja Ingólfsdóttir; svanhvit@mideind.is; Miðeind ehf |
sponsor | Ministry of Education, Science and Culture; Spell and grammar checking with neural networks (L14); Language Technology for Icelandic 2019-2023; nationalFunds |
files.size | 2326701000 |
files.count | 6 |
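
The dc.description field says the synthetic training data was produced by scrambling clean IGC sentences to imitate grammatical and typographical errors. The sketch below illustrates that general idea with simple character-level noise. It is not the authors' pipeline: the function names `corrupt_sentence` and `make_labelled_examples`, the three noise operations, and the `rate` parameter are all assumptions chosen for illustration.

```python
import random


def corrupt_sentence(sentence: str, rng: random.Random, rate: float = 0.1) -> str:
    """Apply random character-level noise (swap, drop, duplicate) to letters.

    Hypothetical noise model; the real error patterns are not specified here.
    """
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("swap", "drop", "dup"))
            if op == "swap" and i + 1 < len(chars):
                # Transpose this character with the next one.
                out.extend([chars[i + 1], c])
                i += 2
                continue
            if op == "drop":
                # Delete this character.
                i += 1
                continue
            # "dup" (or a "swap" on the final character): double the character.
            out.extend([c, c])
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


def make_labelled_examples(sentences, seed=0, rate=0.1):
    """Yield (text, label) pairs: 0 for a clean sentence, 1 for a corrupted copy."""
    rng = random.Random(seed)
    for s in sentences:
        yield s, 0
        noisy = corrupt_sentence(s, rng, rate)
        if noisy != s:  # skip copies that the noise left unchanged
            yield noisy, 1
```

For example, `list(make_labelled_examples(["Þetta er setning."], seed=1, rate=0.3))` yields the clean sentence with label 0, followed by a corrupted copy with label 1 whenever the noise changed it. A fixed seed keeps the generated data reproducible.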
Files in this item
Name | Size | Format | Description | MD5 |
some_is_sentences.txt | 50 bytes | Text file | Unknown | 749f598487ca6a0bcb1b3e923f4d3a41 |
(unnamed) | 1.79 KB | Unknown | Unknown | 9354cbabaee7674efe232973405a2c08 |
requirements.txt | 35 bytes | Text file | Unknown | 3aff608c3c35a6d0b9dd503f0f537ca9 |
infer.py | 418 bytes | Unknown | Unknown | ab31d0c067ea6135d3dbf1b17a27a83e |
config.json | 738 bytes | Unknown | Unknown | 36ba274693a96f84ad54c6ddb300a099 |
pytorch_model.bin | 2.17 GB | Unknown | Unknown | cad9c37724cf43f163c2ad9619d8e196 |
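
The MD5 values in the file listing can be used to check that a download finished intact, which matters most for the 2.17 GB `pytorch_model.bin`. The sketch below uses only the standard library. The helper names `md5_of_file` and `verify_download` are illustrative and not part of the item, and the code assumes the files sit under their listed names in one directory. The file whose name is missing from the listing is left out.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "some_is_sentences.txt": "749f598487ca6a0bcb1b3e923f4d3a41",
    "requirements.txt": "3aff608c3c35a6d0b9dd503f0f537ca9",
    "infer.py": "ab31d0c067ea6135d3dbf1b17a27a83e",
    "config.json": "36ba274693a96f84ad54c6ddb300a099",
    "pytorch_model.bin": "cad9c37724cf43f163c2ad9619d8e196",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory="."):
    """Map each listed file to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results
```

Calling `verify_download("path/to/download")` returns one entry per listed file. A `False` entry suggests a corrupted or incomplete download. Reading in chunks avoids loading the full model weights into memory.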