Files in this item
Download all files in item (3.84 GB)
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name
- README.txt
- Size
- 7.42 KB
- Format
- Text file
- Description
- README
- MD5
- e8878156e13b34a5eefedebd6f70a45e
*******************************************************************************
************ THE ICELANDIC GIGAWORD CORPUS 1 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/334 ********************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence (CC-BY), in a JSONL format, which is suitable for LLM training.

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 billion words:

Open licence:
IGC-Journals    20.9 million words
IGC-Law         53.3 -
IGC-News1      396.7 -
IGC-Parla      254.1 -
IGC-Social     724.0 -
IGC- . . .
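Since the README describes the files as JSONL (one JSON object per line), a minimal Python sketch for reading such a file is given here. The README preview does not show the record schema, so the code assumes no field names; it only parses each line and counts which top-level keys occur. The helper names `iter_jsonl` and `count_keys` are illustrative and are not part of the package.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error


def count_keys(path: Path, limit: int = 1000) -> Counter:
    """Count the top-level keys seen in the first `limit` records."""
    keys: Counter = Counter()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        if isinstance(record, dict):
            keys.update(record.keys())
    return keys
```

Calling `count_keys(Path("IGC-Wiki.jsonl"))` on an extracted file (the path is a placeholder for wherever the archive was unpacked) gives a quick view of the fields before any training pipeline is built. Reading line by line keeps memory use flat, which matters for the larger files in the archive.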
- Name
- IGC1_jsonl.zip
- Size
- 3.84 GB
- Format
- application/zip
- Description
- IGC1_jsonl
- MD5
- f7cfe07c77c829dd776819c9cddc8c79
- IGC1
- README.txt (7 kB)
- converted-corpora
- IGC-Wiki
- IGC-Wiki.jsonl (103 MB)
- IGC-News1
- IGC-News1-sjonvarpid.jsonl (201 MB)
- IGC-News1-ras2.jsonl (5 kB)
- IGC-News1-ras1.jsonl (1 MB)
- IGC-News1-ras1_og_2.jsonl (289 MB)
- IGC-News1-eyjafrettir.jsonl (58 MB)
- IGC-News1-eidfaxi.jsonl (89 MB)
- IGC-News1-sunnlenska.jsonl (49 MB)
- IGC-News1-vikudagur.jsonl (46 MB)
- IGC-News1-fjardarfrettir.jsonl (8 MB)
- IGC-News1-viljinn.jsonl (6 MB)
- IGC-News1-trolli.jsonl (19 MB)
- IGC-News1-fiskifrettir.jsonl (32 MB)
- IGC-News1-visir.jsonl (1 GB)
- IGC-News1-mannlif.jsonl (56 MB)
- IGC-News1-huni.jsonl (42 MB)
- IGC-News1-eyjar.jsonl (48 MB)
- IGC-News1-stod2.jsonl (158 MB)
- IGC-News1-kaffid.jsonl (17 MB)
- IGC-News1-kopavogsbladid.jsonl (4 MB)
- IGC-News1-frettabladid_is.jsonl (322 MB)
- IGC-News1-bylgjan.jsonl (113 MB)
- IGC-News1-vb.jsonl (301 MB)
- IGC-News1-siglfirdingur.jsonl (7 MB)
- IGC-News1-ruv.jsonl (755 MB)
- IGC-Journals
- IGC-Journals-lb.jsonl (57 MB)
- IGC-Journals-th.jsonl (5 MB)
- IGC-Journals-ith.jsonl (5 MB)
- IGC-Journals-rg.jsonl (6 MB)
- IGC-Journals-vv.jsonl (59 MB)
- IGC-Journals-bli.jsonl (1 MB)
- IGC-Journals-tv.jsonl (4 MB)
- IGC-Journals-mf.jsonl (346 kB)
- IGC-Journals-tu.jsonl (3 MB)
- IGC-Journals-tlr.jsonl (697 kB)
- IGC-Journals-im.jsonl (3 MB)
- IGC-Journals-ne.jsonl (3 MB)
- IGC-Journals-ski.jsonl (5 MB)
- IGC-Journals-ss.jsonl (4 MB)
- IGC-Journals-gr.jsonl (2 MB)
- IGC-Journals-ri.jsonl (13 MB)
- IGC-Journals-tlf.jsonl (19 MB)
- IGC-Journals-ljo.jsonl (750 kB)
- IGC-Journals-lf.jsonl (1 MB)
- IGC-Journals-tf.jsonl (1 MB)
- IGC-Journals-vt.jsonl (1 MB)
- IGC-Journals-aif.jsonl (5 MB)
- IGC-Journals-hr.jsonl (10 MB)
- IGC-Parla
- IGC-Parla.jsonl (1 GB)
- IGC-Law
- IGC-Law-Proposals.jsonl (128 MB)
- IGC-Law-Bills.jsonl (376 MB)
- IGC-Law-Law.jsonl (30 MB)
- IGC-Social
- IGC-Social-Blog-jonas.jsonl (56 MB)
- IGC-Social-Blog-silfuregils.jsonl (33 MB)
- IGC-Social-Forums-hugi.jsonl (1 GB)
- IGC-Social-Forums-bland.jsonl (4 GB)
- IGC-Social-Forums-malefnin.jsonl (790 MB)
- IGC-Social-Blog-heimur.jsonl (12 MB)
- IGC-Wiki
- example.py (2 kB)
- datasets-info
- IGC-Journals.jsonl (3 kB)
- IGC-News1.jsonl (3 kB)
- IGC-Wiki.jsonl (143 B)
- IGC-Law.jsonl (487 B)
- IGC-Parla.jsonl (155 B)
- IGC-Social.jsonl (1 kB)
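
The MD5 value listed for IGC1_jsonl.zip can be used to check that the 3.84 GB download arrived intact. The following is a minimal Python sketch, assuming the archive was saved locally under its listed name; the function names `md5_of_file` and `matches_checksum` are illustrative and do not come from the package.

```python
import hashlib
from pathlib import Path

# MD5 value shown for IGC1_jsonl.zip in the file listing.
EXPECTED_MD5 = "f7cfe07c77c829dd776819c9cddc8c79"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read 1 MiB at a time so a multi-gigabyte archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_checksum(Path("IGC1_jsonl.zip"))` should return `True` for an intact download. MD5 is adequate here for detecting transfer corruption, though it is not a safeguard against deliberate tampering.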