Files in this item

 Download all files in item (3.84 GB)
This item is Publicly Available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: README.txt
Size: 7.42 KB
Format: Text file
Description: README
MD5: e8878156e13b34a5eefedebd6f70a45e
 File Preview  
*******************************************************************************
************ THE ICELANDIC GIGAWORD CORPUS 1 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/334        ********************
*******************************************************************************


This package contains the subcorpora of the Icelandic Gigaword Corpus, version 
22.10 (http://hdl.handle.net/20.500.12537/253), that have been published under 
an open licence (CC-BY). They are provided in JSONL format, which is suitable 
for LLM training.
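
Because JSONL stores one JSON object per line, a file can be streamed record by record instead of being loaded whole. The following is a minimal sketch of that idea, not the `example.py` shipped in the package. The helper name `iter_jsonl` and the file path are hypothetical, and no record fields are assumed: the sketch only prints the keys of the first record.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the archive was extracted.
    path = Path("IGC1/converted-corpora/IGC-Wiki/IGC-Wiki.jsonl")
    if path.exists():
        first = next(iter_jsonl(path), None)
        if isinstance(first, dict):
            print(sorted(first))  # field names of the first record
```

Streaming matters here because some of the files in the archive are listed at 1 GB or more.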

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 
billion words:

Open licence:
 IGC-Journals    20.9 million words
 IGC-Law         53.3 million words
 IGC-News1      396.7 million words
 IGC-Parla      254.1 million words
 IGC-Social     724.0 million words
 IGC- . . .
                                            
Name: IGC1_jsonl.zip
Size: 3.84 GB
Format: application/zip
Description: IGC1_jsonl
MD5: f7cfe07c77c829dd776819c9cddc8c79
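
The MD5 value listed for the archive can be used to check that the download is complete. A minimal sketch, assuming the archive was saved as `IGC1_jsonl.zip` in the current directory; the function name `md5_of_file` is illustrative and not part of the package:

```python
import hashlib
from pathlib import Path

# MD5 value listed for IGC1_jsonl.zip on this page.
EXPECTED_MD5 = "f7cfe07c77c829dd776819c9cddc8c79"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte file is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("IGC1_jsonl.zip")  # assumed download location
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
```

The same function can be pointed at README.txt and compared with the MD5 listed for that file.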
 File Preview  
  • IGC1
    • README.txt (7 kB)
    • converted-corpora
      • IGC-Wiki
        • IGC-Wiki.jsonl (103 MB)
      • IGC-News1
        • IGC-News1-sjonvarpid.jsonl (201 MB)
        • IGC-News1-ras2.jsonl (5 kB)
        • IGC-News1-ras1.jsonl (1 MB)
        • IGC-News1-ras1_og_2.jsonl (289 MB)
        • IGC-News1-eyjafrettir.jsonl (58 MB)
        • IGC-News1-eidfaxi.jsonl (89 MB)
        • IGC-News1-sunnlenska.jsonl (49 MB)
        • IGC-News1-vikudagur.jsonl (46 MB)
        • IGC-News1-fjardarfrettir.jsonl (8 MB)
        • IGC-News1-viljinn.jsonl (6 MB)
        • IGC-News1-trolli.jsonl (19 MB)
        • IGC-News1-fiskifrettir.jsonl (32 MB)
        • IGC-News1-visir.jsonl (1 GB)
        • IGC-News1-mannlif.jsonl (56 MB)
        • IGC-News1-huni.jsonl (42 MB)
        • IGC-News1-eyjar.jsonl (48 MB)
        • IGC-News1-stod2.jsonl (158 MB)
        • IGC-News1-kaffid.jsonl (17 MB)
        • IGC-News1-kopavogsbladid.jsonl (4 MB)
        • IGC-News1-frettabladid_is.jsonl (322 MB)
        • IGC-News1-bylgjan.jsonl (113 MB)
        • IGC-News1-vb.jsonl (301 MB)
        • IGC-News1-siglfirdingur.jsonl (7 MB)
        • IGC-News1-ruv.jsonl (755 MB)
      • IGC-Journals
        • IGC-Journals-lb.jsonl (57 MB)
        • IGC-Journals-th.jsonl (5 MB)
        • IGC-Journals-ith.jsonl (5 MB)
        • IGC-Journals-rg.jsonl (6 MB)
        • IGC-Journals-vv.jsonl (59 MB)
        • IGC-Journals-bli.jsonl (1 MB)
        • IGC-Journals-tv.jsonl (4 MB)
        • IGC-Journals-mf.jsonl (346 kB)
        • IGC-Journals-tu.jsonl (3 MB)
        • IGC-Journals-tlr.jsonl (697 kB)
        • IGC-Journals-im.jsonl (3 MB)
        • IGC-Journals-ne.jsonl (3 MB)
        • IGC-Journals-ski.jsonl (5 MB)
        • IGC-Journals-ss.jsonl (4 MB)
        • IGC-Journals-gr.jsonl (2 MB)
        • IGC-Journals-ri.jsonl (13 MB)
        • IGC-Journals-tlf.jsonl (19 MB)
        • IGC-Journals-ljo.jsonl (750 kB)
        • IGC-Journals-lf.jsonl (1 MB)
        • IGC-Journals-tf.jsonl (1 MB)
        • IGC-Journals-vt.jsonl (1 MB)
        • IGC-Journals-aif.jsonl (5 MB)
        • IGC-Journals-hr.jsonl (10 MB)
      • IGC-Parla
        • IGC-Parla.jsonl (1 GB)
      • IGC-Law
        • IGC-Law-Proposals.jsonl (128 MB)
        • IGC-Law-Bills.jsonl (376 MB)
        • IGC-Law-Law.jsonl (30 MB)
      • IGC-Social
        • IGC-Social-Blog-jonas.jsonl (56 MB)
        • IGC-Social-Blog-silfuregils.jsonl (33 MB)
        • IGC-Social-Forums-hugi.jsonl (1 GB)
        • IGC-Social-Forums-bland.jsonl (4 GB)
        • IGC-Social-Forums-malefnin.jsonl (790 MB)
        • IGC-Social-Blog-heimur.jsonl (12 MB)
    • example.py (2 kB)
    • datasets-info
      • IGC-Journals.jsonl (3 kB)
      • IGC-News1.jsonl (3 kB)
      • IGC-Wiki.jsonl (143 B)
      • IGC-Law.jsonl (487 B)
      • IGC-Parla.jsonl (155 B)
      • IGC-Social.jsonl (1 kB)
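
The layout above, with one directory per subcorpus under `converted-corpora`, lends itself to grouping the data files by subcorpus after extraction. A minimal sketch, assuming the archive was extracted to a directory named `IGC1`; the function name `jsonl_files_by_subcorpus` is hypothetical and the code is not taken from `example.py`:

```python
from collections import defaultdict
from pathlib import Path


def jsonl_files_by_subcorpus(root: Path) -> dict[str, list[Path]]:
    """Group *.jsonl files under root/converted-corpora by parent directory name."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted((root / "converted-corpora").glob("*/*.jsonl")):
        groups[path.parent.name].append(path)
    return dict(groups)


if __name__ == "__main__":
    root = Path("IGC1")  # assumed extraction directory
    for name, files in jsonl_files_by_subcorpus(root).items():
        total_mb = sum(f.stat().st_size for f in files) / 1e6
        print(f"{name}: {len(files)} files, {total_mb:.0f} MB")
```

Grouping by directory makes it straightforward to select or weight individual subcorpora, for example to leave out the large forum files in IGC-Social.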