IGC2024 Filtered-2

Daðason, Jón Friðrik; Steingrímsson, Steinþór; Hafsteinsson, Hinrik

dc.contributor.author	Daðason, Jón Friðrik
dc.contributor.author	Steingrímsson, Steinþór
dc.contributor.author	Hafsteinsson, Hinrik
dc.date.accessioned	2026-04-07T14:48:05Z
dc.date.available	2026-04-07T14:48:05Z
dc.date.issued	2026-03-30
dc.identifier.uri	http://hdl.handle.net/20.500.12537/382
dc.description	[English] This is a JSONL version of the 2024 release of the Icelandic Gigaword Corpus (IGC), prepared for language model training. The archive contains training and validation sets of unannotated documents from the IGC, licensed using the IGC license. The corpus has been filtered, deduplicated, and normalized to remove content unsuitable for training. Documents were excluded if they contained unintended code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character encoding issues, highly repetitive n-gram sequences, or a very low word count, or if they were duplicates or near-duplicates of other documents in the IGC. In addition, recurring boilerplate text, such as lists of related articles and social media sharing links, has been removed where possible, along with author bylines and image captions. The remaining text has been normalized for whitespace, non-printable and control characters, and other similar issues. [Icelandic, translated to English] This is a JSONL-format version of the 2024 Icelandic Gigaword Corpus (Íslensk risamálheild, RMH), intended for training language models. It consists of unannotated documents from the RMH that are released under the Gigaword Corpus license (IGC license). The data has been split into training and development sets. The corpus has been filtered and normalized to remove content that is poorly suited for training. Documents were omitted if they contained programming code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character set problems, a high proportion of repeated n-grams, or if they were very short. Repeated versions of the same document were also removed. In addition, boilerplate text, such as lists of related articles and links for sharing content on social media, as well as author bylines and image captions, has been removed where possible. Finally, the text was normalized with respect to whitespace characters, invisible characters, control characters, and other similar issues.
dc.language.iso	isl
dc.publisher	The Árni Magnússon Institute for Icelandic Studies
dc.rights	Icelandic Gigaword Corpus
dc.rights.uri	https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus
dc.rights.label	PUB
dc.source.uri	https://igc.arnastofnun.is
dc.subject	corpus
dc.subject	Icelandic
dc.subject	filtered
dc.subject	llm training
dc.title	IGC2024 Filtered-2
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	Clarin IS Repository
demo.uri	https://malheildir.arnastofnun.is
contact.person	Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
size.info	894874269 words
files.size	2661425859
files.count	2

Files in this item

This item is Publicly Available and licensed under: Icelandic Gigaword Corpus

Name: igc2024-2-filtered.zip
Size: 2.48 GB
Format: application/zip
Description: Unknown
MD5: 6064af8efd40e6f0c868ec4141b489f0
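
The MD5 value listed above can be used to check that a download completed intact. The following is a minimal sketch in Python, assuming the archive was saved locally under the name shown in this record; the helper names `md5_of_file` and `matches_expected` are illustrative and not part of the corpus distribution.

```python
import hashlib

# MD5 value copied from the file entry in this record.
EXPECTED_MD5 = "6064af8efd40e6f0c868ec4141b489f0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected value, ignoring letter case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive sits in the current working directory.
    print(matches_expected("igc2024-2-filtered.zip"))
```

Reading in chunks keeps memory use flat, which matters for an archive of about 2.48 GB.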

File Preview

- igc2024-2-filtered-val.jsonl (63 MB)
- igc2024-2-filtered-train.jsonl (6 GB)
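
Because the training file is listed at about 6 GB uncompressed, it may be convenient to stream records directly from the archive rather than extracting it first. The sketch below shows one way to do that in Python. It assumes the member names match the preview above, that the files are UTF-8 encoded, and that each line holds one JSON object. The record does not document the field names, so the example only prints whatever keys the first record contains. The function name `iter_jsonl_from_zip` is illustrative.

```python
import io
import json
import zipfile


def iter_jsonl_from_zip(zip_path, member_name):
    """Yield one parsed JSON value per non-empty line of a JSONL archive member."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so lines are decoded lazily as UTF-8 (assumed).
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Assumption: archive and member names as shown in this record's file preview.
    records = iter_jsonl_from_zip(
        "igc2024-2-filtered.zip", "igc2024-2-filtered-val.jsonl"
    )
    first = next(records)
    # The schema is not documented here, so inspect the keys before relying on them.
    print(sorted(first.keys()) if isinstance(first, dict) else type(first))
```

A generator is used so that only one document is held in memory at a time.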

Name: README.txt
Size: 2.82 KB
Format: Text file
MD5: 483679b0394becabedcf026befafe135

File Preview

IGC2024-Filtered-2

Authors: Jón Friðrik Daðason, Steinþór Steingrímsson and Hinrik Hafsteinsson
Item identifier: http://hdl.handle.net/20.500.12537/382
Published by: The Árni Magnússon Institute for Icelandic Studies, March 31, 2026

Description in English
---
This is a JSONL version of the 2024 release of the Icelandic Gigaword Corpus (IGC), prepared for language model training. The archive contains training and validation sets of unannotated documents from the IGC, licensed using the IGC license.
The corpus has been filtered, deduplicated, and normalized to remove content unsuitable for training. Documents were excluded if they contained unintended code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character encoding issues, highly repetitive n-gram sequences, or a very low word count, or if they were duplicates or near-duplicates of other documents in the IGC. In addition, recurring boilerplate text, such as lists of related articles and social media shari . . .
