dc.contributor.author Arnardóttir, Þórunn
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Ingvarsson Juto, Garðar
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Einarsson, Hafsteinn
dc.contributor.author Ingason, Anton Karl
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.date.accessioned 2024-04-08T14:05:36Z
dc.date.available 2024-04-08T14:05:36Z
dc.date.issued 2024-04-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/326
dc.description Icelandic GPT-SW3 for spell and grammar checking is a GPT-SW3 model fine-tuned on Icelandic, specifically for the task of spell and grammar checking. The 6.7-billion-parameter GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) was pre-trained on Icelandic texts and then fine-tuned on Icelandic error corpora. The pre-training texts included material from the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/253) and MÍM (http://hdl.handle.net/20.500.12537/195). The following Icelandic error corpora were used for fine-tuning: the Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/105), the Icelandic L2 Error Corpus (http://hdl.handle.net/20.500.12537/280), the Icelandic Dyslexia Error Corpus (http://hdl.handle.net/20.500.12537/281), and the Icelandic Child Language Error Corpus (http://hdl.handle.net/20.500.12537/133).
The model is fine-tuned on three different tasks:
- Task 1: The model evaluates one text with regard to e.g. grammar and spelling, and returns all errors in the input text as a list, with their positions in the text and their corrections.
- Task 2: The model evaluates two texts and chooses which one is better with regard to e.g. grammar and spelling.
- Task 3: The model evaluates one text with regard to e.g. grammar and spelling, and returns a corrected version of the text.
On task 1, the model achieves an F0.5 score of 0.28 on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320). On task 2, it achieves 63.95% accuracy on the same test set. On task 3, it scores 0.925559 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.02 on TER (translation error rate).
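The F0.5 score reported for task 1 is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The record does not say which scorer was used to count matching edits, so the following is only a minimal sketch of the formula itself. The function name and the counts are hypothetical and are not taken from the evaluation.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta < 1 favours precision over recall."""
    if tp == 0:
        # No correct edits: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of proposed edits that are correct
    recall = tp / (tp + fn)     # share of reference edits that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (precision 0.6, recall 0.3).
print(round(f_beta(tp=30, fp=20, fn=70), 3))
```

With these illustrative counts the precision is twice the recall, and F0.5 lands closer to the precision than the balanced F1 would.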
dc.language.iso isl
dc.publisher Miðeind ehf
dc.publisher University of Iceland
dc.rights AI Sweden's LLM AI Model License Agreement
dc.rights.uri https://repository.clarin.is/licenses/AI_Swedens_LLM_License.html
dc.rights.label PUB
dc.subject gec
dc.subject ged
dc.subject grammatical error correction
dc.subject grammatical error detection
dc.subject llm
dc.subject large language model
dc.subject gpt-sw3
dc.title Icelandic GPT-SW3 for spell and grammar checking
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Þórunn Arnardóttir thar@hi.is University of Iceland
sponsor Ministry of Education, Science and Culture; Semantic analysis for spell and grammar checking (L13); Language Technology for Icelandic 2019-2023; nationalFunds
files.size 10986400027
files.count 3
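The `files.size` value is given in bytes, while the file listing shows the model archive as 10.23 GB. The two figures agree only if "GB" is read as a binary unit (1024^3 bytes). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def to_binary_gb(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (1024**3 bytes each)."""
    return num_bytes / 1024**3


total_bytes = 10986400027  # files.size from this record

print(f"{to_binary_gb(total_bytes):.2f}")  # 10.23 (binary units)
print(f"{total_bytes / 10**9:.2f}")        # 10.99 (decimal units)
```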


 Files in this item

This item is publicly available and licensed under AI Sweden's LLM AI Model License Agreement.
Name: README
Size: 3.51 KB
Format: Unknown
Description: The model's README
MD5: 596f07db781672eb11d829bcb4b95d3b

Name: Icelandic-GPT-SW3-1of2.zip
Size: 6.62 KB
Format: application/zip
Description: A zip file containing all relevant files, except the model itself
MD5: 69faba9e592ed05d6d07f0ca011dfed1
File preview:
  • GPT-SW3-M12-1
    • example_outputs
      • task2_example.txt (4 B)
      • task1_example.txt (418 B)
      • task3_example.txt (94 B)
    • run_model.py (6 kB)
    • README (3 kB)
    • requirements.txt (67 B)
    • example_inputs
      • task2_example.jsonl (234 B)
      • task1_example.txt (80 B)
      • task3_example.txt (80 B)

Name: Icelandic-GPT-SW3-2of2.zip
Size: 10.23 GB
Format: application/zip
Description: A zip file containing the model
MD5: 527e2ddc4a5191ea09992c93e2fd00e0
File preview:
  • GPT-SW3-M12-2
    • gpt-sw3-model
      • config.json (1004 B)
      • generation_config.json (154 B)
      • pytorch_model-00003-of-00003.bin (3 GB)
      • pytorch_model-00001-of-00003.bin (4 GB)
      • pytorch_model-00002-of-00003.bin (4 GB)
      • pytorch_model.bin.index.json (28 kB)
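
The MD5 values listed for the three files can be used to check that a download completed without corruption, which is worth doing for the 10.23 GB model archive in particular. The following is a minimal sketch using only the Python standard library. It assumes the files were saved under the names shown in this record, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "README": "596f07db781672eb11d829bcb4b95d3b",
    "Icelandic-GPT-SW3-1of2.zip": "69faba9e592ed05d6d07f0ca011dfed1",
    "Icelandic-GPT-SW3-2of2.zip": "527e2ddc4a5191ea09992c93e2fd00e0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Run from the directory that holds the downloads, the script prints one line per file. A mismatch usually means the download was interrupted and should be repeated.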
