
 
dc.contributor.advisor ÍSLENSKA (Icelandic): The package contains those subcorpora of the Icelandic Gigaword Corpus that were published under an open licence, in a JSONL format suitable for, among other things, training language models. This release includes both IGC-2022 (http://hdl.handle.net/20.500.12537/253) and IGC-2024ext (http://hdl.handle.net/20.500.12537/359). The dataset is also available on Hugging Face: https://huggingface.co/datasets/arnastofnun/IGC-2024.
dc.contributor.author Barkarson, Starkaður
dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2025-05-28T08:47:33Z
dc.date.available 2025-05-28T08:47:33Z
dc.date.issued 2025-05-27
dc.identifier.uri http://hdl.handle.net/20.500.12537/364
dc.description English: This package contains those subcorpora of the Icelandic Gigaword Corpus that have been published under an open licence, in JSONL format, which is suitable for LLM training. This version includes both IGC-2022 (http://hdl.handle.net/20.500.12537/253) and IGC-2024ext (http://hdl.handle.net/20.500.12537/359). The dataset is also available on Hugging Face: https://huggingface.co/datasets/arnastofnun/IGC-2024.
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf
dc.relation.replaces http://hdl.handle.net/20.500.12537/335
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri http://igc.arnastofnun.is
dc.subject IGC
dc.subject Icelandic Gigaword Corpus
dc.subject jsonl
dc.subject llms
dc.title Icelandic Gigaword Corpus 1 (IGC-2022 + IGC2024ext) - JSONL format
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden false
has.files yes
branding Clarin IS Repository
demo.uri https://malheildir.arnastofnun.is
contact.person Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
size.info 91697694 sentences
size.info 1457129329 words
files.size 4520140286
files.count 2
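
The description states that the corpora are distributed as JSONL. A minimal sketch of reading such a file one record at a time follows, assuming one JSON object per line and UTF-8 encoding. This record does not list the field names, so the sketch only inspects whatever keys are present; `iter_jsonl`, `peek_keys`, and any path passed to them are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Set


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def peek_keys(path: Path, limit: int = 5) -> Set[str]:
    """Collect the keys seen in the first `limit` records (schema is not assumed)."""
    keys: Set[str] = set()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        keys.update(record.keys())
    return keys
```

Reading lazily matters here because several of the files are in the gigabyte range; the generator never holds more than one record in memory.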


 Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: IGC-2022_2024_jsonl.zip
Size: 4.21 GB
Format: application/zip
Description: IGC-2022_2024_jsonl
MD5: 85b7cf92edcac79a2d08fabb16b1e086
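
The record lists an MD5 checksum for the archive, which can be used to confirm that a download completed intact. The following is a sketch under the assumption that the file has been saved locally; the function names and the chunk size are illustrative choices, not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksum listed in this record for IGC-2022_2024_jsonl.zip.
EXPECTED_MD5 = "85b7cf92edcac79a2d08fabb16b1e086"


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so a multi-gigabyte archive fits in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

MD5 is adequate for detecting transfer corruption, which is its role here; it is not a guarantee against deliberate tampering.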
 File Preview  
  • converted-corpora
    • igc_wiki
      • igc_wiki.jsonl (103 MB)
    • igc_adjud
      • igc_adjud_district.jsonl (530 MB)
      • igc_adjud_appeal.jsonl (61 MB)
      • igc_adjud_supreme.jsonl (108 MB)
    • igc_social
      • igc_social_forums_bland.jsonl (4 GB)
      • igc_social_blog_silfuregils.jsonl (33 MB)
      • igc_social_blog_heimur.jsonl (12 MB)
      • igc_social_blog_jonas.jsonl (56 MB)
      • igc_social_forums_hugi.jsonl (1 GB)
      • igc_social_forums_malefnin.jsonl (790 MB)
    • igc_law
      • igc_law_proposals.jsonl (136 MB)
      • igc_law_law.jsonl (30 MB)
      • igc_law_bills.jsonl (395 MB)
    • igc_journals
      • igc_journals_hr.jsonl (10 MB)
      • igc_journals_aif.jsonl (5 MB)
      • igc_journals_ski.jsonl (5 MB)
      • igc_journals_vv.jsonl (59 MB)
      • igc_journals_tv.jsonl (4 MB)
      • igc_journals_tf.jsonl (1 MB)
      • igc_journals_mf.jsonl (346 kB)
      • igc_journals_tu.jsonl (3 MB)
      • igc_journals_ith.jsonl (5 MB)
      • igc_journals_ss.jsonl (4 MB)
      • igc_journals_tlr.jsonl (697 kB)
      • igc_journals_gr.jsonl (2 MB)
      • igc_journals_lb.jsonl (57 MB)
      • igc_journals_ri.jsonl (13 MB)
      • igc_journals_th.jsonl (5 MB)
      • igc_journals_rg.jsonl (6 MB)
      • igc_journals_tlf.jsonl (19 MB)
      • igc_journals_lf.jsonl (1 MB)
      • igc_journals_ljo.jsonl (750 kB)
      • igc_journals_im.jsonl (3 MB)
      • igc_journals_ne.jsonl (3 MB)
      • igc_journals_vt.jsonl (1 MB)
      • igc_journals_bli.jsonl (1 MB)
    • igc_news1
      • igc_news1_ras1_og_2.jsonl (310 MB)
      • igc_news1_trolli.jsonl (21 MB)
      • igc_news1_sjonvarpid.jsonl (218 MB)
      • igc_news1_kopavogsbladid.jsonl (4 MB)
      • igc_news1_kaffid.jsonl (22 MB)
      • igc_news1_ras2.jsonl (5 kB)
      • igc_news1_ras1.jsonl (1 MB)
      • igc_news1_fiskifrettir.jsonl (38 MB)
      • igc_news1_stod2.jsonl (172 MB)
      • igc_news1_frettabladid_is.jsonl (367 MB)
      • igc_news1_viljinn.jsonl (8 MB)
      • igc_news1_mannlif.jsonl (88 MB)
      • igc_news1_eyjafrettir.jsonl (64 MB)
      • igc_news1_huni.jsonl (49 MB)
      • igc_news1_vikudagur.jsonl (54 MB)
      • igc_news1_sunnlenska.jsonl (57 MB)
      • igc_news1_visir.jsonl (2 GB)
      • igc_news1_ruv.jsonl (827 MB)
      • igc_news1_vb.jsonl (333 MB)
      • igc_news1_fjardarfrettir.jsonl (9 MB)
      • igc_news1_eyjar.jsonl (50 MB)
      • igc_news1_siglfirdingur.jsonl (7 MB)
      • igc_news1_bylgjan.jsonl (124 MB)
      • igc_news1_eidfaxi.jsonl (101 MB)
    • igc_parla
      • igc_parla.jsonl (1 GB)
    • example.py (2 kB)
  • datasets-info
    • IGC-Adjud.jsonl (692 B)
    • IGC-Journals.jsonl (3 kB)
    • IGC-News1.jsonl (4 kB)
    • IGC-Law.jsonl (510 B)
    • IGC-Wiki.jsonl (143 B)
    • IGC-Parla.jsonl (162 B)
    • IGC-Social.jsonl (1 kB)
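
Since the preview shows many JSONL files inside a single zip, it can be convenient to read individual members without unpacking the whole archive. The sketch below uses Python's standard `zipfile` module and assumes each member is UTF-8 JSONL with one object per line. The helper names are hypothetical, and the exact member paths inside the real archive (for example whether they start with `converted-corpora/`) are taken from the preview above and should be checked with `namelist()`.

```python
import io
import json
import zipfile
from typing import Any, Dict, Iterator, List


def list_jsonl_members(archive: zipfile.ZipFile) -> List[str]:
    """Return the sorted names of all .jsonl members in an open archive."""
    return sorted(name for name in archive.namelist() if name.endswith(".jsonl"))


def iter_member_records(
    archive: zipfile.ZipFile, member: str
) -> Iterator[Dict[str, Any]]:
    """Stream records from one member without extracting it to disk."""
    with archive.open(member) as raw:
        # Wrap the binary stream so lines are decoded lazily as UTF-8 text.
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)
```

Because `list_jsonl_members` matches on the `.jsonl` suffix alone, it would also return the small files under `datasets-info`; filtering on a path prefix separates the corpus files from the per-dataset information files.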
Name: README.md
Size: 8.22 KB
Format: Unknown
Description: Readme file
MD5: cfd678307943ee63fc75e5eb603e255d
