dc.contributor.advisor | ÍSLENSKA: Pakkinn inniheldur þær málheildir Íslensku risamálheildarinnar sem voru gefnar út með opnu leyfi, á jsonl-sniðmáti sem hentar m.a. við þjálfun mállíkana. Þessi útgáfa inniheldur bæði IGC-2022 (http://hdl.handle.net/20.500.12537/253) og IGC-2024ext (http://hdl.handle.net/20.500.12537/359). Gagnasettið er einnig aðgengilegt á Huggingface: https://huggingface.co/datasets/arnastofnun/IGC-2024. |
dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2025-05-28T08:47:33Z |
dc.date.available | 2025-05-28T08:47:33Z |
dc.date.issued | 2025-05-27 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/364 |
dc.description | English: This package contains the subcorpora of the Icelandic Gigaword Corpus that have been published under an open licence, in JSONL format, which is suitable for purposes such as training language models. This version includes both IGC-2022 (http://hdl.handle.net/20.500.12537/253) and IGC-2024ext (http://hdl.handle.net/20.500.12537/359). The dataset is also available on Huggingface: https://huggingface.co/datasets/arnastofnun/IGC-2024. |
dc.language.iso | isl |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.relation.replaces | http://hdl.handle.net/20.500.12537/335 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://igc.arnastofnun.is |
dc.subject | IGC |
dc.subject | Icelandic Gigaword Corpus |
dc.subject | jsonl |
dc.subject | llms |
dc.title | Icelandic Gigaword Corpus 1 (IGC-2022 + IGC2024ext) - JSONL format |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | false |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://malheildir.arnastofnun.is |
contact.person | Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies |
size.info | 91697694 sentences |
size.info | 1457129329 words |
files.size | 4520140286 |
files.count | 2 |
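
The size figures above can be cross-checked with a little arithmetic. The sketch below copies the three numbers from this record and derives the average sentence length and a human-readable total size. The helper name `human_size` is hypothetical, and the choice of 1024-based units is an assumption; it is made because the byte count then matches the "4.21 GB" shown for the download.

```python
# Figures copied from the size.info and files.size rows of this record.
SENTENCES = 91_697_694
WORDS = 1_457_129_329
TOTAL_BYTES = 4_520_140_286


def human_size(n_bytes: int) -> str:
    """Format a byte count using binary (1024-based) units (hypothetical helper)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} {units[-1]}"  # not reached; keeps type checkers happy


words_per_sentence = WORDS / SENTENCES
print(f"{words_per_sentence:.1f} words per sentence on average")  # 15.9
print(human_size(TOTAL_BYTES))  # 4.21 GiB
```

In decimal units the same byte count would be about 4.52 GB, so the sizes shown in the file listing are presumably computed with 1024-based units.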
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

- Name: IGC-2022_2024_jsonl.zip
- Size: 4.21 GB
- Format: application/zip
- Description: IGC-2022_2024_jsonl
- MD5: 85b7cf92edcac79a2d08fabb16b1e086
- converted-corpora
- igc_wiki
  - igc_wiki.jsonl (103 MB)
- igc_adjud
  - igc_adjud_district.jsonl (530 MB)
  - igc_adjud_appeal.jsonl (61 MB)
  - igc_adjud_supreme.jsonl (108 MB)
- igc_social
  - igc_social_forums_bland.jsonl (4 GB)
  - igc_social_blog_silfuregils.jsonl (33 MB)
  - igc_social_blog_heimur.jsonl (12 MB)
  - igc_social_blog_jonas.jsonl (56 MB)
  - igc_social_forums_hugi.jsonl (1 GB)
  - igc_social_forums_malefnin.jsonl (790 MB)
- igc_law
  - igc_law_proposals.jsonl (136 MB)
  - igc_law_law.jsonl (30 MB)
  - igc_law_bills.jsonl (395 MB)
- igc_journals
  - igc_journals_hr.jsonl (10 MB)
  - igc_journals_aif.jsonl (5 MB)
  - igc_journals_ski.jsonl (5 MB)
  - igc_journals_vv.jsonl (59 MB)
  - igc_journals_tv.jsonl (4 MB)
  - igc_journals_tf.jsonl (1 MB)
  - igc_journals_mf.jsonl (346 kB)
  - igc_journals_tu.jsonl (3 MB)
  - igc_journals_ith.jsonl (5 MB)
  - igc_journals_ss.jsonl (4 MB)
  - igc_journals_tlr.jsonl (697 kB)
  - igc_journals_gr.jsonl (2 MB)
  - igc_journals_lb.jsonl (57 MB)
  - igc_journals_ri.jsonl (13 MB)
  - igc_journals_th.jsonl (5 MB)
  - igc_journals_rg.jsonl (6 MB)
  - igc_journals_tlf.jsonl (19 MB)
  - igc_journals_lf.jsonl (1 MB)
  - igc_journals_ljo.jsonl (750 kB)
  - igc_journals_im.jsonl (3 MB)
  - igc_journals_ne.jsonl (3 MB)
  - igc_journals_vt.jsonl (1 MB)
  - igc_journals_bli.jsonl (1 MB)
- igc_news1
  - igc_news1_ras1_og_2.jsonl (310 MB)
  - igc_news1_trolli.jsonl (21 MB)
  - igc_news1_sjonvarpid.jsonl (218 MB)
  - igc_news1_kopavogsbladid.jsonl (4 MB)
  - igc_news1_kaffid.jsonl (22 MB)
  - igc_news1_ras2.jsonl (5 kB)
  - igc_news1_ras1.jsonl (1 MB)
  - igc_news1_fiskifrettir.jsonl (38 MB)
  - igc_news1_stod2.jsonl (172 MB)
  - igc_news1_frettabladid_is.jsonl (367 MB)
  - igc_news1_viljinn.jsonl (8 MB)
  - igc_news1_mannlif.jsonl (88 MB)
  - igc_news1_eyjafrettir.jsonl (64 MB)
  - igc_news1_huni.jsonl (49 MB)
  - igc_news1_vikudagur.jsonl (54 MB)
  - igc_news1_sunnlenska.jsonl (57 MB)
  - igc_news1_visir.jsonl (2 GB)
  - igc_news1_ruv.jsonl (827 MB)
  - igc_news1_vb.jsonl (333 MB)
  - igc_news1_fjardarfrettir.jsonl (9 MB)
  - igc_news1_eyjar.jsonl (50 MB)
  - igc_news1_siglfirdingur.jsonl (7 MB)
  - igc_news1_bylgjan.jsonl (124 MB)
  - igc_news1_eidfaxi.jsonl (101 MB)
- igc_parla
  - igc_parla.jsonl (1 GB)
- igc_wiki
  - example.py (2 kB)
- datasets-info
  - IGC-Adjud.jsonl (692 B)
  - IGC-Journals.jsonl (3 kB)
  - IGC-News1.jsonl (4 kB)
  - IGC-Law.jsonl (510 B)
  - IGC-Wiki.jsonl (143 B)
  - IGC-Parla.jsonl (162 B)
  - IGC-Social.jsonl (1 kB)
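
Since every corpus file listed above is JSONL (one JSON object per line), the files can be streamed without loading them fully into memory, which matters for the multi-gigabyte ones. The following is a minimal sketch using only the Python standard library. It is not the bundled `example.py`, whose contents are not shown in this record. The function names `iter_jsonl` and `field_names` are hypothetical, and no field names are assumed, because the record does not document the schema.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


def field_names(path: Path, limit: int = 100) -> set[str]:
    """Collect the keys seen in the first `limit` records that are JSON objects."""
    keys: set[str] = set()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        if isinstance(record, dict):
            keys.update(record)
    return keys


if __name__ == "__main__":
    # Assumed location after unzipping; adjust to wherever the file actually is.
    sample = Path("igc_wiki.jsonl")
    if sample.exists():
        print(sorted(field_names(sample)))
```

Inspecting the keys of the first few records is a cheap way to learn the schema before writing any processing code against it.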

- Name: README.md
- Size: 8.22 KB
- Format: Unknown
- Description: Readme file
- MD5: cfd678307943ee63fc75e5eb603e255d
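
The MD5 values listed for the two files can be used to check that a download is complete. Below is a minimal sketch with Python's standard `hashlib`, reading in chunks so that the 4.21 GB archive does not have to fit in memory. The function names `md5_of` and `verify` are hypothetical, and it is assumed that the files are saved locally under the names shown in this record.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file entries of this record.
EXPECTED_MD5 = {
    "IGC-2022_2024_jsonl.zip": "85b7cf92edcac79a2d08fabb16b1e086",
    "README.md": "cfd678307943ee63fc75e5eb603e255d",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Compare a local file against the checksum listed for its file name."""
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        candidate = Path(name)  # assumed to be in the current directory
        if candidate.exists():
            print(name, "OK" if verify(candidate) else "MISMATCH")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a defence against deliberate tampering.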