dc.contributor.author |
Barkarson, Starkaður |
dc.contributor.author |
Steingrímsson, Steinþór |
dc.contributor.author |
Andrésdóttir, Þórdís Dröfn |
dc.contributor.author |
Hafsteinsdóttir, Hildur |
dc.contributor.author |
Ingimundarson, Finnur Ágúst |
dc.contributor.author |
Magnússon, Árni Davíð |
dc.date.accessioned |
2022-09-22T12:40:54Z |
dc.date.available |
2022-09-22T12:40:54Z |
dc.date.issued |
2022-10-01 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/253 |
dc.description |
[ENGLISH]
The IGC project (Icelandic Gigaword Corpus) aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. The project is divided into nine individual corpora, listed below. Each corpus comes in two versions: one contains untokenized, untagged text in which each paragraph is enclosed in a <p> tag, while the other has been tokenized, POS-tagged and lemmatized. The corpora listed below are the unannotated versions. The annotated versions can be found at http://hdl.handle.net/20.500.12537/254. The corpus has also been published in JSONL format, which is suitable for LLM training (http://hdl.handle.net/20.500.12537/334). |
dc.description |
[ICELANDIC]
IGC-verkefnið (Íslenska risamálheildin - Icelandic Gigaword corpus) hefur að markmiði að safna eins miklum texta og mögulegt er sem gefa má út með opnu eða takmörkuðu leyfi. Verkefnið samanstendur af níu sjálfstæðum málheildum sem eru listaðar hér að neðan. Hver málheild er gefin út í tveimur útgáfum. Önnur inniheldur skjöl með hreinum texta, án þess að hann hafi verið tókaður. Hin inniheldur textann tókaðan, markaðan og lemmaðan. Málheildirnar hér að neðan innihalda ómarkaðan texta. Nálgast má mörkuðu málheildirnar á http://hdl.handle.net/20.500.12537/254. Málheildin hefur einnig verið gefin út á JSONL-sniði sem er hentugt fyrir þjálfun stórra mállíkana (http://hdl.handle.net/20.500.12537/334).
Adjud http://hdl.handle.net/20.500.12537/240
Books http://hdl.handle.net/20.500.12537/316
Journals http://hdl.handle.net/20.500.12537/245
Law http://hdl.handle.net/20.500.12537/247
News1 http://hdl.handle.net/20.500.12537/236
News2 http://hdl.handle.net/20.500.12537/238
Parla http://hdl.handle.net/20.500.12537/208
Social http://hdl.handle.net/20.500.12537/242
Wiki http://hdl.handle.net/20.500.12537/251 |
dc.language.iso |
isl |
dc.publisher |
The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby |
https://www.aclweb.org/anthology/L18-1690.pdf |
dc.source.uri |
https://igc.arnastofnun.is |
dc.subject |
igc |
dc.subject |
unannotated |
dc.subject |
corpus |
dc.title |
Icelandic Gigaword Corpus (IGC-2022) - unannotated version |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
hasMetadata |
false |
has.files |
no |
branding |
Clarin IS Repository |
demo.uri |
https://malheildir.arnastofnun.is |
contact.person |
Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies |
sponsor |
Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið) The Icelandic Gigaword Corpus (G1) Language Technology for Icelandic 2019-2023 nationalFunds |
size.info |
2428573565 words |
files.size |
0 |
files.count |
0 |