dc.contributor.author |
Barkarson, Starkaður |
dc.contributor.author |
Steingrímsson, Steinþór |
dc.date.accessioned |
2022-07-20T11:25:11Z |
dc.date.available |
2022-07-20T11:25:11Z |
dc.date.issued |
2022-10-01 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/239 |
dc.description |
ENGLISH:
IGC-News1 and IGC-News2 are part of the IGC project (https://igc.arnastofnun.is), which aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. IGC-News1 has an open licence, while IGC-News2 has a restricted licence. The two news corpora contain texts from news media, both online and written, as well as some from TV and radio. Each text collection is published in two versions, as two independent corpora. One is unannotated, with each paragraph contained in a separate <p> tag; the other is tokenized, lemmatized and morphosyntactically tagged. This corpus is the tokenized and annotated version of IGC-News2, in which each paragraph is contained in a <p> tag. The unannotated version can be found here: http://hdl.handle.net/20.500.12537/238. |
dc.description |
ÍSLENSKA:
IGC-News1 og IGC-News2 eru hluti af IGC-verkefninu (https://igc.arnastofnun.is) sem miðar að því að safna eins miklu og mögulegt er af íslenskum texta sem hægt er að birta með opnu eða takmörkuðu leyfi. IGC-News1 er með opið leyfi á meðan IGC-News2 er með takmarkað leyfi. Þessar tvær fréttamálheildir innihalda texta frá fréttamiðlum, á netinu og ritaða, auk sumra úr sjónvarpi og útvarpi. Hvert textasafn er gefið út í tveimur útgáfum, sem tveir sjálfstæðir hlutar. Önnur er ómörkuð og hver málsgrein er í sérstöku tagi (p) en sú síðari er tókuð, lemmuð og með málfræðilegum mörkum. Þessi málheild inniheldur tókaða og markaða útgáfu af IGC-News2. Ómarkaða útgáfu má finna hér: http://hdl.handle.net/20.500.12537/238. |
dc.language.iso |
isl |
dc.publisher |
The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby |
http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.rights |
Icelandic Gigaword Corpus |
dc.rights.uri |
https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus |
dc.rights.label |
PUB |
dc.source.uri |
http://igc.arnastofnun.is |
dc.subject |
corpora |
dc.subject |
news |
dc.subject |
annotated |
dc.subject |
pos |
dc.subject |
pos-tagged |
dc.subject |
lemmas |
dc.subject |
lemmatized |
dc.subject |
TEI |
dc.title |
IGC-News2-22.10 (annotated version) |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
hasMetadata |
false |
has.files |
yes |
branding |
Clarin IS Repository |
demo.uri |
https://malheildir.arnastofnun.is |
contact.person |
Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies |
sponsor |
Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið) Language Technology for Icelandic 2019-2023 The Icelandic Gigaword Corpus (G1) nationalFunds |
size.info |
3225606 articles |
size.info |
51915950 sentences |
size.info |
899836406 words |
size.info |
1001582774 tokens |
files.size |
16924104569 |
files.count |
5 |