
 
dc.contributor.author Barkarson, Starkaður
dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2024-08-20T14:54:52Z
dc.date.available 2024-08-20T14:54:52Z
dc.date.issued 2024-08-31
dc.identifier.uri http://hdl.handle.net/20.500.12537/335
dc.description This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published under a restricted licence, in JSONL format suitable for LLM training.
dc.description ÍSLENSKA: Pakkinn inniheldur þær málheildir Íslensku risamálheildarinnar (útg. 22.10 - http://hdl.handle.net/20.500.12537/253) sem voru gefnar út með takmörkuðu leyfi, á jsonl-sniðmáti sem hentar m.a. við þjálfun mállíkana.
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf
dc.rights Icelandic Gigaword Corpus
dc.rights.uri https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus
dc.rights.label PUB
dc.source.uri http://igc.arnastofnun.is
dc.subject igc
dc.subject jsonl
dc.subject json
dc.subject unannotated
dc.title Icelandic Gigaword Corpus 2 (IGC-2022) - unannotated version - jsonl format
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hasMetadata false
has.files yes
branding Clarin IS Repository
demo.uri https://malheildir.arnastofnun.is
contact.person Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies
sponsor Ministry of Culture and Business (Menningar- og viðskiptaráðuneytið); Project code: G10 – Conversion of IGC format for LLMs; Project name: Language Technology for Icelandic 2019-2023; Funding type: nationalFunds
size.info 56195323 sentences
size.info 928673799 words
files.size 3123720223
files.count 2


 Files in this item

This item is Publicly Available and licensed under: Icelandic Gigaword Corpus
Name: README.txt
Size: 7.47 KB
Format: Text file
Description: README
MD5: 5e2fd58fd910c640c779738262d5111b
 File Preview  
*******************************************************************************
*************THE ICELANDIC GIGAWORD CORPUS 2 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/335        ********************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 
22.10 (http://hdl.handle.net/20.500.12537/253), that have been published under a 
restricted licence, in JSONL format suitable for LLM training.

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 
billion words:

Open licence:
 IGC-Journals     20.9 million words
 IGC-Law          53.3 million words
 IGC-News1       396.7 million words
 IGC-Parla       254.1 million words
 IGC-Social      724.0 million words
 IGC-Wik . . .
                                            
Name: IGC2_jsonl.zip
Size: 2.91 GB
Format: application/zip
Description: IGC2_jsonl
MD5: 7579c71ee2833a202d2e91b1f3865a35
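The MD5 value listed for IGC2_jsonl.zip can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and matches_checksum are illustrative and are not part of the package.

```python
import hashlib

# MD5 listed on this page for IGC2_jsonl.zip.
EXPECTED_MD5 = "7579c71ee2833a202d2e91b1f3865a35"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte archive is never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected hex digest.
    The comparison ignores case and surrounding whitespace."""
    return md5_of_file(path) == expected.strip().lower()
```

Calling matches_checksum("IGC2_jsonl.zip") on the downloaded archive should return True if the file matches the checksum above; the local file name is an assumption about where the download was saved.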
 File Preview  
  • IGC2
    • README.txt (7 kB)
    • converted-corpora
      • IGC-News2
        • IGC-News2-baendabladid.jsonl (86 MB)
        • IGC-News2-dfs.jsonl (15 MB)
        • IGC-News2-bbl.jsonl (56 MB)
        • IGC-News2-stundin_serblad.jsonl (23 kB)
        • IGC-News2-fotbolti.jsonl (644 MB)
        • IGC-News2-433.jsonl (69 MB)
        • IGC-News2-morgunbladid.jsonl (3 GB)
        • IGC-News2-bb.jsonl (39 MB)
        • IGC-News2-skessuhorn.jsonl (186 MB)
        • IGC-News2-mbl.jsonl (2 GB)
        • IGC-News2-dv_is.jsonl (819 MB)
        • IGC-News2-fjardarpostur.jsonl (3 MB)
        • IGC-News2-bondi.jsonl (3 MB)
        • IGC-News2-pressan.jsonl (13 MB)
        • IGC-News2-bleikt.jsonl (44 MB)
        • IGC-News2-stundin.jsonl (125 MB)
        • IGC-News2-stundin_blad.jsonl (67 MB)
        • IGC-News2-kylfingur.jsonl (26 MB)
        • IGC-News2-frettatiminn.jsonl (33 MB)
        • IGC-News2-kjarninn_blad.jsonl (5 MB)
        • IGC-News2-kjarninn.jsonl (114 MB)
        • IGC-News2-frettatiminn_bl.jsonl (86 MB)
        • IGC-News2-eyjan.jsonl (125 MB)
        • IGC-News2-vf.jsonl (177 MB)
      • IGC-Books
        • IGC-Books.jsonl (134 MB)
    • userlicense_igc_restricted.pdf (61 kB)
    • example.py (2 kB)
    • datasets-info
      • IGC-Books.jsonl (152 B)
      • IGC-News2.jsonl (3 kB)
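
Given the layout above, the extracted corpus files can be read one record at a time, since JSONL stores one JSON value per line. The sketch below is not the example.py shipped in the package, whose contents are not shown here. It assumes the archive has been extracted so that a directory containing converted-corpora exists locally, and it makes no assumption about the field names inside each record. The helper names are illustrative.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file.
    Lines are streamed, so large files are not loaded into memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def list_corpus_files(root):
    """Return the sorted .jsonl paths under <root>/converted-corpora/<corpus>/."""
    return sorted(Path(root, "converted-corpora").glob("*/*.jsonl"))


def count_records(root):
    """Map each .jsonl file name to the number of records it contains."""
    return {p.name: sum(1 for _ in iter_jsonl(p)) for p in list_corpus_files(root)}
```

For example, count_records("IGC2") would return a dictionary keyed by file name, assuming the archive was extracted into the current directory. Inspecting the first record from iter_jsonl is a reasonable way to discover the actual field names before building a training pipeline.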
