| dc.contributor.author | Daðason, Jón Friðrik |
| dc.contributor.author | Steingrímsson, Steinþór |
| dc.contributor.author | Hafsteinsson, Hinrik |
| dc.date.accessioned | 2026-04-07T14:18:12Z |
| dc.date.available | 2026-04-07T14:18:12Z |
| dc.date.issued | 2026-03-30 |
| dc.identifier.uri | http://hdl.handle.net/20.500.12537/381 |
| dc.description | [English] This is a JSONL version of the 2024 release of the Icelandic Gigaword Corpus (IGC), prepared for language model training. The archive contains training and validation sets of unannotated, CC-BY-licensed documents from the IGC. The corpus has been filtered, deduplicated, and normalized to remove content unsuitable for training. Documents were excluded if they contained unintended code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character encoding issues, highly repetitive n-gram sequences, or a very low word count, or if they were duplicates or near-duplicates of other documents in the IGC. In addition, recurring boilerplate text, such as lists of related articles and social media sharing links, has been removed where possible, along with author bylines and image captions. The remaining text has been normalized for whitespace, non-printable and control characters, and other similar issues. [Icelandic, translated] This is a release of the Icelandic Gigaword Corpus (Íslensk risamálheild, RMH) from 2024 in JSONL format, intended for training language models. It consists of unannotated documents from the RMH under a CC-BY license, split into training and development data. The corpus has been filtered and normalized to remove content poorly suited to training. Documents were omitted if they contained program code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character set problems, a high proportion of repeated n-grams, or if they were very short. Repeated versions of the same document were also removed. In addition, boilerplate text, such as lists of related articles and links for sharing content on social media, along with author bylines and image captions, has been removed where possible. Finally, the text was normalized with respect to whitespace characters, invisible characters, control characters, and other similar issues. |
| dc.language.iso | isl |
| dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://igc.arnastofnun.is |
| dc.subject | corpus |
| dc.subject | Icelandic |
| dc.subject | filtered |
| dc.subject | llm training |
| dc.title | IGC2024 Filtered-1 |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | Clarin IS Repository |
| demo.uri | https://malheildir.arnastofnun.is |
| contact.person | Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies |
| size.info | 794816184 words |
| files.size | 2154884291 bytes |
| files.count | 2 |
Files in this item
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
| Name | Size | Format | Description | MD5 |
| igc2024-1-filtered.zip | 2.01 GB | application/zip | Unknown | ca61ea2bfdd5c3555d7fd7f039610255 |
| README.txt | 2.76 KB | Text file | Unknown | 6a9c4e2692a5d7df5963b00dd5b5d513 |
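The MD5 values listed for the two files can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_download` are illustrative, and the sketch assumes the downloaded files keep the names shown in the listing.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing for this item.
EXPECTED_MD5 = {
    "igc2024-1-filtered.zip": "ca61ea2bfdd5c3555d7fd7f039610255",
    "README.txt": "6a9c4e2692a5d7df5963b00dd5b5d513",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the 2 GB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No listed checksum for {path.name}")
    return md5_of_file(path) == expected
```

Calling `verify_download("igc2024-1-filtered.zip")` after downloading would return `False` if the file was truncated or corrupted in transit. MD5 is suitable here only as an integrity check, not as protection against deliberate tampering.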
IGC2024-Filtered-1
Authors: Jón Friðrik Daðason, Steinþór Steingrímsson and Hinrik Hafsteinsson
Item identifier: http://hdl.handle.net/20.500.12537/381
Published by: The Árni Magnússon Institute for Icelandic Studies, March 31, 2026
Description in English
---
This is a JSONL version of the 2024 release of the Icelandic Gigaword Corpus (IGC), prepared for language model training. The archive contains training and validation sets of unannotated, CC-BY-licensed documents from the IGC.
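Because the archive is described as holding JSONL training and validation sets, the records can be streamed directly from the zip file without unpacking it. The sketch below is one way to do that with the Python standard library. The names of the files inside the archive and the fields of each record are not stated in this description, so the function name `iter_jsonl_from_zip` and the `.jsonl` suffix filter are assumptions to adjust after inspecting the archive.

```python
import io
import json
import zipfile


def iter_jsonl_from_zip(zip_path, member_suffix=".jsonl"):
    """Yield (member_name, record) for each JSON line in matching archive members.

    The ".jsonl" suffix is an assumption about how the members are named.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(member_suffix):
                continue
            with archive.open(name) as raw:
                # Decode as UTF-8 and read one line at a time.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:  # skip blank lines
                        yield name, json.loads(line)
```

A first look at the data could be as simple as taking one item with `next(iter_jsonl_from_zip("igc2024-1-filtered.zip"))` and printing the record's keys, which shows which fields each document actually carries.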
The corpus has been filtered, deduplicated, and normalized to remove content unsuitable for training. Documents were excluded if they contained unintended code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character encoding issues, highly repetitive n-gram sequences, or a very low word count, or if they were duplicates or near-duplicates of other documents in the IGC. In addition, recurring boilerplate text, such as lists of related articles and social media sharing links, has be . . .
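To illustrate two of the exclusion criteria named above (highly repetitive n-gram sequences and a very low word count), here is a small sketch of how such a filter could look. It is not the authors' implementation: the description gives no method or thresholds, so the function names, the whitespace tokenization, and the values `n=3`, `min_words=20`, and `max_repeated=0.3` are all hypothetical.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-gram occurrences whose n-gram appears more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


def keep_document(text, min_words=20, max_repeated=0.3, n=3):
    """Return False for very short or highly repetitive documents.

    All thresholds are illustrative placeholders, not values from the corpus.
    """
    if len(text.split()) < min_words:
        return False
    return repeated_ngram_fraction(text, n) <= max_repeated
```

A document consisting of one phrase repeated many times scores close to 1.0 and is rejected, while ordinary running text usually scores near 0.0 and is kept.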