Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).

Name: igc2024-1-filtered.zip
Size: 2.01 GB
Format: application/zip
Description: Unknown
MD5: ca61ea2bfdd5c3555d7fd7f039610255
Contents:
    • igc2024-1-filtered-train.jsonl (5 GB)
    • igc2024-1-filtered-val.jsonl (54 MB)

Name: README.txt
Size: 2.76 KB
Format: Text file
Description: Unknown
MD5: 6a9c4e2692a5d7df5963b00dd5b5d513
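
The MD5 values listed for the two files can be used to check that a download completed without corruption. The following is a minimal sketch, assuming the files were saved under their listed names in a single directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and not part of the item.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "igc2024-1-filtered.zip": "ca61ea2bfdd5c3555d7fd7f039610255",
    "README.txt": "6a9c4e2692a5d7df5963b00dd5b5d513",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the 2.01 GB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results
```

Calling `verify_downloads("path/to/downloads")` returns a dictionary keyed by file name; a `False` entry suggests the file should be downloaded again.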

Contents of README.txt:

IGC2024-Filtered-1

Authors: Jón Friðrik Daðason, Steinþór Steingrímsson and Hinrik Hafsteinsson
Item identifier: http://hdl.handle.net/20.500.12537/381
Published by: The Árni Magnússon Institute for Icelandic Studies, March 31, 2026

Description in English
---
This is a JSONL version of the 2024 release of the Icelandic Gigaword Corpus (IGC), prepared for language model training. The archive contains training and validation sets of unannotated, CC-BY-licensed documents from the IGC.
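
Because the training file is listed at 5 GB, it is usually more practical to read it one line at a time than to load it whole. The sketch below assumes the common JSONL convention of one UTF-8 encoded JSON value per line; the field names of the records are not stated here, so the code only reports which keys it encounters rather than assuming any. The function names `iter_jsonl` and `summarize_keys` are hypothetical.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    # Assumption: the file is UTF-8 encoded, one JSON document per line.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


def summarize_keys(path, limit=1000):
    """Count the top-level keys seen in the first `limit` records."""
    key_counts = Counter()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return key_counts
```

Running `summarize_keys("igc2024-1-filtered-val.jsonl")` on the smaller validation file is a quick way to inspect the record structure before processing the training set.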
The corpus has been filtered, deduplicated, and normalized to remove content unsuitable for training. Documents were excluded if they contained unintended code (e.g., HTML, CSS, or JavaScript), optical character recognition errors, character encoding issues, highly repetitive n-gram sequences, or a very low word count, or if they were duplicates or near-duplicates of other documents in the IGC. In addition, recurring boilerplate text, such as lists of related articles and social media sharing links, has be . . .