dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2024-08-20T14:54:52Z |
dc.date.available | 2024-08-20T14:54:52Z |
dc.date.issued | 2024-08-31 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/335 |
dc.description | This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published under a restricted licence, in a jsonl format suitable for LLM training. |
dc.description | ICELANDIC DESCRIPTION (translated): The package contains those subcorpora of the Icelandic Gigaword Corpus (version 22.10 - http://hdl.handle.net/20.500.12537/253) that were published under a restricted licence, in a jsonl format suitable for, among other things, training language models. |
dc.language.iso | isl |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.rights | Icelandic Gigaword Corpus |
dc.rights.uri | https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus |
dc.rights.label | PUB |
dc.source.uri | http://igc.arnastofnun.is |
dc.subject | igc |
dc.subject | jsonl |
dc.subject | json |
dc.subject | unannotated |
dc.title | Icelandic Gigaword Corpus 2 (IGC-2022) - unannotated version - jsonl format |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hasMetadata | false |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://malheildir.arnastofnun.is |
contact.person | Steinþór Steingrímsson; steinthor.steingrimsson@arnastofnun.is; The Árni Magnússon Institute for Icelandic Studies |
sponsor | Ministry of Culture and Business (Menningar- og viðskiptaráðuneytið); Project code: G10 – Conversion of IGC format for LLMs; Project name: Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 56195323 sentences |
size.info | 928673799 words |
files.size | 3123720223 |
files.count | 2 |
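The size fields above are raw counts. As a quick sanity check, the sketch below converts the byte count to binary gigabytes and derives an average sentence length from the two `size.info` rows. It assumes the "2.91 GB" figure listed for the archive uses binary units (2^30 bytes); the record does not state this explicitly. The constant and function names are illustrative only.

```python
# Values copied from the files.size and size.info rows of this record.
FILES_SIZE_BYTES = 3_123_720_223
SENTENCES = 56_195_323
WORDS = 928_673_799


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Assumption: the repository's "GB" label means GiB (binary units).
print(f"{bytes_to_gib(FILES_SIZE_BYTES):.2f} GiB")
print(f"{WORDS / SENTENCES:.1f} words per sentence on average")
```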
Files in this item
- Name: README.txt
- Size: 7.47 KB
- Format: Text file
- Description: README
- MD5: 5e2fd58fd910c640c779738262d5111b
*******************************************************************************
************* THE ICELANDIC GIGAWORD CORPUS 2 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/335 ********************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published under a restricted licence, in a jsonl format suitable for LLM training.

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 billion words:

Open licence:
IGC-Journals    20.9 million words
IGC-Law         53.3 -
IGC-News1      396.7 -
IGC-Parla      254.1 -
IGC-Social     724.0 -
IGC-Wik . . .
- Name: IGC2_jsonl.zip
- Size: 2.91 GB
- Format: application/zip
- Description: IGC2_jsonl
- MD5: 7579c71ee2833a202d2e91b1f3865a35
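The MD5 values listed for each file can be used to confirm that a download arrived intact. The following is a minimal sketch using only the Python standard library. The local file name `IGC2_jsonl.zip` in the working directory is an assumption about where the archive was saved, and the helper names are illustrative, not part of the package.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read in chunks so a multi-gigabyte archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a file's MD5 with the value shown in the file listing."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumption: the archive was downloaded to the current directory.
    archive = Path("IGC2_jsonl.zip")
    if archive.exists():
        ok = matches_listed_md5(archive, "7579c71ee2833a202d2e91b1f3865a35")
        print("checksum OK" if ok else "checksum MISMATCH")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a defence against deliberate tampering.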
- IGC2
- README.txt (7 kB)
- converted-corpora
- IGC-News2
- IGC-News2-baendabladid.jsonl (86 MB)
- IGC-News2-dfs.jsonl (15 MB)
- IGC-News2-bbl.jsonl (56 MB)
- IGC-News2-stundin_serblad.jsonl (23 kB)
- IGC-News2-fotbolti.jsonl (644 MB)
- IGC-News2-433.jsonl (69 MB)
- IGC-News2-morgunbladid.jsonl (3 GB)
- IGC-News2-bb.jsonl (39 MB)
- IGC-News2-skessuhorn.jsonl (186 MB)
- IGC-News2-mbl.jsonl (2 GB)
- IGC-News2-dv_is.jsonl (819 MB)
- IGC-News2-fjardarpostur.jsonl (3 MB)
- IGC-News2-bondi.jsonl (3 MB)
- IGC-News2-pressan.jsonl (13 MB)
- IGC-News2-bleikt.jsonl (44 MB)
- IGC-News2-stundin.jsonl (125 MB)
- IGC-News2-stundin_blad.jsonl (67 MB)
- IGC-News2-kylfingur.jsonl (26 MB)
- IGC-News2-frettatiminn.jsonl (33 MB)
- IGC-News2-kjarninn_blad.jsonl (5 MB)
- IGC-News2-kjarninn.jsonl (114 MB)
- IGC-News2-frettatiminn_bl.jsonl (86 MB)
- IGC-News2-eyjan.jsonl (125 MB)
- IGC-News2-vf.jsonl (177 MB)
- IGC-Books
- IGC-Books.jsonl (134 MB)
- IGC-News2
- userlicense_igc_restricted.pdf (61 kB)
- example.py (2 kB)
- datasets-info
- IGC-Books.jsonl (152 B)
- IGC-News2.jsonl (3 kB)
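Several of the `.jsonl` files listed above are multiple gigabytes once unpacked, so they are best read one line at a time rather than loaded whole. The sketch below streams such a file with the standard library and collects the field names it encounters. It makes no assumption about the record schema, which this listing does not show. The function names and the path in the usage note are hypothetical, and this is not the `example.py` shipped in the archive.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from err


def peek_keys(path, limit=100):
    """Return the sorted field names seen in the first `limit` records."""
    keys = set()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

A call such as `peek_keys("IGC-Books.jsonl")` (path assumed, depending on where the archive was extracted) gives a first look at the fields before an LLM data pipeline is built around them.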