dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2024-08-20T14:48:32Z |
dc.date.available | 2024-08-20T14:48:32Z |
dc.date.issued | 2024-08-20 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/334 |
dc.description | This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence, in a JSONL format suitable for, among other uses, training language models (LLMs). The dataset is also available on Hugging Face: https://huggingface.co/datasets/arnastofnun/IGC-2022-1. |
dc.language.iso | isl |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://igc.arnastofnun.is |
dc.subject | IGC |
dc.subject | unannotated |
dc.subject | jsonl |
dc.subject | json |
dc.title | Icelandic Gigaword Corpus 1 (IGC-2022) - unannotated version - jsonl format |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hasMetadata | false |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://malheildir.arnastofnun.is |
contact.person | Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies |
sponsor | Ministry of Culture and Business (Menningar- og viðskiptaráðuneytið); G10 – Conversion of IGC format for LLMs; Language Technology for Icelandic; nationalFunds |
size.info | 84026131 sentences |
size.info | 1312369457 words |
files.size | 4124419031 |
files.count | 2 |
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
- Name: README.txt
- Size: 7.42 KB
- Format: Text file
- Description: README
- MD5: e8878156e13b34a5eefedebd6f70a45e
*******************************************************************************
************ THE ICELANDIC GIGAWORD CORPUS 1 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/334 ********************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence (CC-BY), in a jsonl format, which is suitable for LLM training.

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 billion words:

Open licence:
IGC-Journals    20.9 million words
IGC-Law         53.3 -
IGC-News1      396.7 -
IGC-Parla      254.1 -
IGC-Social     724.0 -
IGC- . . .
- Name: IGC1_jsonl.zip
- Size: 3.84 GB
- Format: application/zip
- Description: IGC1_jsonl
- MD5: f7cfe07c77c829dd776819c9cddc8c79
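The MD5 value listed for IGC1_jsonl.zip can be used to check that a download completed intact. The following is a minimal sketch, not part of the package: the local file name and location are assumptions, and the helper names `md5_of_file` and `verify_download` are hypothetical. The archive is read in chunks so the 3.84 GB file is never held in memory at once.

```python
import hashlib
from pathlib import Path

# MD5 published in this record for IGC1_jsonl.zip.
EXPECTED_MD5 = "f7cfe07c77c829dd776819c9cddc8c79"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive was saved under its original name
    # in the current working directory.
    archive = Path("IGC1_jsonl.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an interrupted or corrupted download, in which case fetching the archive again is the simplest remedy.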
- IGC1
- README.txt (7 kB)
- converted-corpora
- IGC-Wiki
- IGC-Wiki.jsonl (103 MB)
- IGC-News1
- IGC-News1-sjonvarpid.jsonl (201 MB)
- IGC-News1-ras2.jsonl (5 kB)
- IGC-News1-ras1.jsonl (1 MB)
- IGC-News1-ras1_og_2.jsonl (289 MB)
- IGC-News1-eyjafrettir.jsonl (58 MB)
- IGC-News1-eidfaxi.jsonl (89 MB)
- IGC-News1-sunnlenska.jsonl (49 MB)
- IGC-News1-vikudagur.jsonl (46 MB)
- IGC-News1-fjardarfrettir.jsonl (8 MB)
- IGC-News1-viljinn.jsonl (6 MB)
- IGC-News1-trolli.jsonl (19 MB)
- IGC-News1-fiskifrettir.jsonl (32 MB)
- IGC-News1-visir.jsonl (1 GB)
- IGC-News1-mannlif.jsonl (56 MB)
- IGC-News1-huni.jsonl (42 MB)
- IGC-News1-eyjar.jsonl (48 MB)
- IGC-News1-stod2.jsonl (158 MB)
- IGC-News1-kaffid.jsonl (17 MB)
- IGC-News1-kopavogsbladid.jsonl (4 MB)
- IGC-News1-frettabladid_is.jsonl (322 MB)
- IGC-News1-bylgjan.jsonl (113 MB)
- IGC-News1-vb.jsonl (301 MB)
- IGC-News1-siglfirdingur.jsonl (7 MB)
- IGC-News1-ruv.jsonl (755 MB)
- IGC-Journals
- IGC-Journals-lb.jsonl (57 MB)
- IGC-Journals-th.jsonl (5 MB)
- IGC-Journals-ith.jsonl (5 MB)
- IGC-Journals-rg.jsonl (6 MB)
- IGC-Journals-vv.jsonl (59 MB)
- IGC-Journals-bli.jsonl (1 MB)
- IGC-Journals-tv.jsonl (4 MB)
- IGC-Journals-mf.jsonl (346 kB)
- IGC-Journals-tu.jsonl (3 MB)
- IGC-Journals-tlr.jsonl (697 kB)
- IGC-Journals-im.jsonl (3 MB)
- IGC-Journals-ne.jsonl (3 MB)
- IGC-Journals-ski.jsonl (5 MB)
- IGC-Journals-ss.jsonl (4 MB)
- IGC-Journals-gr.jsonl (2 MB)
- IGC-Journals-ri.jsonl (13 MB)
- IGC-Journals-tlf.jsonl (19 MB)
- IGC-Journals-ljo.jsonl (750 kB)
- IGC-Journals-lf.jsonl (1 MB)
- IGC-Journals-tf.jsonl (1 MB)
- IGC-Journals-vt.jsonl (1 MB)
- IGC-Journals-aif.jsonl (5 MB)
- IGC-Journals-hr.jsonl (10 MB)
- IGC-Parla
- IGC-Parla.jsonl (1 GB)
- IGC-Law
- IGC-Law-Proposals.jsonl (128 MB)
- IGC-Law-Bills.jsonl (376 MB)
- IGC-Law-Law.jsonl (30 MB)
- IGC-Social
- IGC-Social-Blog-jonas.jsonl (56 MB)
- IGC-Social-Blog-silfuregils.jsonl (33 MB)
- IGC-Social-Forums-hugi.jsonl (1 GB)
- IGC-Social-Forums-bland.jsonl (4 GB)
- IGC-Social-Forums-malefnin.jsonl (790 MB)
- IGC-Social-Blog-heimur.jsonl (12 MB)
- IGC-Wiki
- example.py (2 kB)
- datasets-info
- IGC-Journals.jsonl (3 kB)
- IGC-News1.jsonl (3 kB)
- IGC-Wiki.jsonl (143 B)
- IGC-Law.jsonl (487 B)
- IGC-Parla.jsonl (155 B)
- IGC-Social.jsonl (1 kB)
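Because the corpus files are in jsonl format (one JSON object per line), they can be streamed record by record without loading a multi-gigabyte file into memory. The sketch below is an illustration only and is not the example.py shipped in the package. The helper names `iter_jsonl` and `peek` are hypothetical, the file path in the usage part is an assumption based on the listing above, and no field names are assumed: the code only reports which keys the first records contain.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({error})"
                ) from error


def peek(path, limit=3):
    """Return at most `limit` records from the start of the file."""
    records = []
    for record in iter_jsonl(path):
        if len(records) >= limit:
            break
        records.append(record)
    return records


if __name__ == "__main__":
    # Assumption: the archive was unpacked in the current directory and the
    # file sits at this path; adjust it to the actual layout.
    sample = Path("IGC1/converted-corpora/IGC-Wiki/IGC-Wiki.jsonl")
    if sample.exists():
        for record in peek(sample):
            # Show the schema rather than guessing field names.
            keys = sorted(record) if isinstance(record, dict) else type(record).__name__
            print(keys)
    else:
        print(f"{sample} not found")
```

Inspecting the keys of the first few records is a cautious way to learn the schema before writing a training pipeline around it; the README and example.py in the package remain the authoritative description.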