dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Snæbjarnarson, Vésteinn
dc.date.accessioned 2022-09-23T09:18:42Z
dc.date.available 2022-09-23T09:18:42Z
dc.date.issued 2022-09-20
dc.identifier.uri http://hdl.handle.net/20.500.12537/256
dc.description The model is a ByT5-base Transformer fine-tuned for grammatical error detection in Icelandic. It performs binary sentence classification, predicting whether a sentence is likely to contain an error. Training used parallel synthetic error data together with real error data from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained ByT5 model was first trained on the synthetic data and then fine-tuned on the real error data. The overall F1 score is 72.8% (precision: 76.3%, recall: 71.7%).
dc.language.iso isl
dc.publisher Miðeind ehf
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject ged
dc.subject grammatical error detection
dc.title Binary Error Classifier for Icelandic Sentences (22.09)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Svanhvít Lilja Ingólfsdóttir; svanhvit@mideind.is; Miðeind ehf
sponsor Ministry of Education, Science and Culture; Spell and grammar checking with neural networks (L14); Language Technology for Icelandic 2019-2023; nationalFunds
files.size 2326701000
files.count 6
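
The description above says the synthetic error data was produced by scrambling clean text to simulate grammatical and typographical errors. The actual error patterns are not given in this record, so the Python sketch below is only an illustration under assumptions: the names `scramble_sentence` and `make_examples`, the three character-level operations (swap, drop, double) and the `error_rate` parameter are hypothetical and are not the authors' pipeline.

```python
import random


def scramble_sentence(sentence: str, rng: random.Random, error_rate: float = 0.3) -> str:
    """Return a copy of `sentence` with simple character-level noise in some words.

    Hypothetical noise model: each word longer than three characters is
    corrupted with probability `error_rate` by one of three operations.
    """
    noisy_words = []
    for word in sentence.split():
        if len(word) > 3 and rng.random() < error_rate:
            # Pick an inner position so the first and last characters stay put.
            i = rng.randrange(1, len(word) - 1)
            op = rng.choice(["swap", "drop", "double"])
            if op == "swap":
                # Transpose two adjacent characters.
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "drop":
                # Delete one character.
                word = word[:i] + word[i + 1:]
            else:
                # Duplicate one character.
                word = word[:i] + word[i] + word[i:]
        noisy_words.append(word)
    return " ".join(noisy_words)


def make_examples(clean_sentences, seed: int = 0):
    """Build (sentence, label) pairs: 0 = no error, 1 = contains an error."""
    rng = random.Random(seed)
    examples = []
    for clean in clean_sentences:
        examples.append((clean, 0))
        noisy = scramble_sentence(clean, rng)
        # Only keep the noisy variant if the noise actually changed something.
        if noisy != clean:
            examples.append((noisy, 1))
    return examples
```

Pairing each clean sentence with a corrupted copy gives a binary classifier both classes from the same source text, which is one plausible reading of "parallel" synthetic data in the description.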


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name                   Size       Format     Description  MD5
some_is_sentences.txt  50 bytes   Text file  Unknown      749f598487ca6a0bcb1b3e923f4d3a41
README                 1.79 KB    Unknown    Unknown      9354cbabaee7674efe232973405a2c08
requirements.txt       35 bytes   Text file  Unknown      3aff608c3c35a6d0b9dd503f0f537ca9
infer.py               418 bytes  Unknown    Unknown      ab31d0c067ea6135d3dbf1b17a27a83e
config.json            738 bytes  Unknown    Unknown      36ba274693a96f84ad54c6ddb300a099
pytorch_model.bin      2.17 GB    Unknown    Unknown      cad9c37724cf43f163c2ad9619d8e196

File preview: some_is_sentences.txt
Þesi setníng er raung.
Þessi setning er rétt.

File preview: requirements.txt
torch==1.12.1
transformers==4.22.0
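
The file listing above gives an MD5 checksum for each of the six files, which can be used to check a download for corruption. The following Python sketch shows one way to do that with the standard library. The helper names `md5_of_file` and `verify_directory` and the assumption that all files sit in one local directory are illustrative and not part of the item itself.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "some_is_sentences.txt": "749f598487ca6a0bcb1b3e923f4d3a41",
    "README": "9354cbabaee7674efe232973405a2c08",
    "requirements.txt": "3aff608c3c35a6d0b9dd503f0f537ca9",
    "infer.py": "ab31d0c067ea6135d3dbf1b17a27a83e",
    "config.json": "36ba274693a96f84ad54c6ddb300a099",
    "pytorch_model.bin": "cad9c37724cf43f163c2ad9619d8e196",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for the 2.17 GB model file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory, expected=EXPECTED_MD5) -> dict:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results
```

Calling `verify_directory("path/to/download")` (a placeholder path) returns a dictionary that flags any missing or mismatched file with `False`. MD5 is adequate here for detecting accidental corruption, but it is not a defence against deliberate tampering.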
