Icelandic GPT-SW3 for spell and grammar checking

Arnardóttir, Þórunn; Ingólfsdóttir, Svanhvít Lilja; Ingvarsson Juto, Garðar; Símonarson, Haukur Barri; Einarsson, Hafsteinn; Ingason, Anton Karl; Þorsteinsson, Vilhjálmur

dc.contributor.author	Arnardóttir, Þórunn
dc.contributor.author	Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author	Ingvarsson Juto, Garðar
dc.contributor.author	Símonarson, Haukur Barri
dc.contributor.author	Einarsson, Hafsteinn
dc.contributor.author	Ingason, Anton Karl
dc.contributor.author	Þorsteinsson, Vilhjálmur
dc.date.accessioned	2024-04-08T14:05:36Z
dc.date.available	2024-04-08T14:05:36Z
dc.date.issued	2024-04-01
dc.identifier.uri	http://hdl.handle.net/20.500.12537/326
dc.description	Icelandic GPT-SW3 for spell and grammar checking is a GPT-SW3 model fine-tuned on Icelandic, and in particular on the task of spell and grammar checking. The 6.7B GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) was pre-trained on Icelandic texts and fine-tuned on Icelandic error corpora. The pre-training data included texts from the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/253) and MÍM (http://hdl.handle.net/20.500.12537/195). The following Icelandic error corpora were used for fine-tuning: the Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/105), the Icelandic L2 Error Corpus (http://hdl.handle.net/20.500.12537/280), the Icelandic Dyslexia Error Corpus (http://hdl.handle.net/20.500.12537/281), and the Icelandic Child Language Error Corpus (http://hdl.handle.net/20.500.12537/133).
The model is fine-tuned on three different tasks:
- Task 1: The model evaluates one text with regard to, e.g., grammar and spelling, and returns all errors in the input text as a list, with their positions in the text and their corrections.
- Task 2: The model evaluates two texts and chooses which one is better with regard to, e.g., grammar and spelling.
- Task 3: The model evaluates one text with regard to, e.g., grammar and spelling, and returns a corrected version of the text.
For task 1, the model achieves an F0.5 score of 0.28 on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320). For task 2, it reaches 63.95% accuracy on the same test set. For task 3, it scores 0.925559 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.02 on TER (translation error rate).
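The F0.5 score reported for task 1 is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the standard formula only. The precision and recall values in the usage lines are hypothetical illustrations and are not results for this model, and `f_beta` is a name chosen for this example.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as 0.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical values, only to show that beta = 0.5 favours precision:
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
print(high_precision > high_recall)
```

With beta = 0.5, a system that proposes fewer but more reliable corrections scores higher than one that finds more errors at the cost of false alarms, which is the usual motivation for using F0.5 in grammatical error correction.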
dc.language.iso	isl
dc.publisher	Miðeind ehf
dc.publisher	University of Iceland
dc.rights	AI Sweden's LLM AI Model License Agreement
dc.rights.uri	https://repository.clarin.is/licenses/AI_Swedens_LLM_License.html
dc.rights.label	PUB
dc.subject	gec
dc.subject	ged
dc.subject	grammatical error correction
dc.subject	grammatical error detection
dc.subject	llm
dc.subject	large language model
dc.subject	gpt-sw3
dc.title	Icelandic GPT-SW3 for spell and grammar checking
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	Clarin IS Repository
contact.person	Þórunn Arnardóttir thar@hi.is University of Iceland
sponsor	Ministry of Education, Science and Culture; Semantic analysis for spell and grammar checking (L13); Language Technology for Icelandic 2019-2023; nationalFunds
files.size	10986400027
files.count	3

Files in this item

This item is publicly available and licensed under AI Sweden's LLM AI Model License Agreement.

Name: README
Size: 3.51 KB
Format: Unknown
Description: The model's README
MD5: 596f07db781672eb11d829bcb4b95d3b
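Each file entry in this record lists an MD5 checksum, so a download can be checked before use. The following is a minimal sketch, assuming the files were saved under the names shown in this record. The helper names `md5_of_file` and `verify_downloads` are chosen for this example and are not part of the distributed package.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file entries in this record.
EXPECTED_MD5 = {
    "README": "596f07db781672eb11d829bcb4b95d3b",
    "Icelandic-GPT-SW3-1of2.zip": "69faba9e592ed05d6d07f0ca011dfed1",
    "Icelandic-GPT-SW3-2of2.zip": "527e2ddc4a5191ea09992c93e2fd00e0",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True/False} for every listed file found in directory."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(name, "OK" if ok else "MISMATCH")
```

Reading in chunks keeps memory use low, which matters for the 10.23 GB model archive.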

Name: Icelandic-GPT-SW3-1of2.zip
Size: 6.62 KB
Format: application/zip
Description: A zip file containing all relevant files, except the model itself
MD5: 69faba9e592ed05d6d07f0ca011dfed1

File Preview

GPT-SW3-M12-1
- example_outputs
  - task2_example.txt (4 B)
  - task1_example.txt (418 B)
  - task3_example.txt (94 B)
- run_model.py (6 kB)
- README (3 kB)
- requirements.txt (67 B)
- example_inputs
  - task2_example.jsonl (234 B)
  - task1_example.txt (80 B)
  - task3_example.txt (80 B)

Name: Icelandic-GPT-SW3-2of2.zip
Size: 10.23 GB
Format: application/zip
Description: A zip file containing the model
MD5: 527e2ddc4a5191ea09992c93e2fd00e0

File Preview

GPT-SW3-M12-2
- gpt-sw3-model
  - config.json (1004 B)
  - generation_config.json (154 B)
  - pytorch_model-00003-of-00003.bin (3 GB)
  - pytorch_model-00001-of-00003.bin (4 GB)
  - pytorch_model-00002-of-00003.bin (4 GB)
  - pytorch_model.bin.index.json (28 kB)
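The model weights are split across three shard files plus an index file. The sketch below assumes that `pytorch_model.bin.index.json` follows the usual Hugging Face Transformers sharded-checkpoint layout, with a top-level `weight_map` that maps each tensor name to the shard that stores it. This record does not confirm the file's contents. The function names and the tensor names in `example_index` are made up for illustration.

```python
import json
from collections import defaultdict


def load_index(path):
    """Read a checkpoint index file (assumed to be JSON with a 'weight_map' key)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def tensors_per_shard(index):
    """Group tensor names by the shard file that is assumed to hold them."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


# Made-up miniature index; the real file is about 28 kB and lists many tensors.
example_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "layer_a.weight": "pytorch_model-00001-of-00003.bin",
        "layer_b.weight": "pytorch_model-00002-of-00003.bin",
        "layer_c.weight": "pytorch_model-00003-of-00003.bin",
        "layer_c.bias": "pytorch_model-00003-of-00003.bin",
    },
}

for shard, names in sorted(tensors_per_shard(example_index).items()):
    print(shard, len(names))
```

If the assumption holds, running `tensors_per_shard(load_index("gpt-sw3-model/pytorch_model.bin.index.json"))` after unpacking is a quick way to confirm that all three shards are referenced and present before loading the model.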
