dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2022-07-20T11:24:54Z |
dc.date.available | 2022-07-20T11:24:54Z |
dc.date.issued | 2022-10-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/237 |
dc.description | ENGLISH: IGC-News1 and IGC-News2 are part of the IGC project (https://igc.arnastofnun.is), which aims to collect as many Icelandic texts as possible that can be published under an open or restricted licence. IGC-News1 has an open licence, while IGC-News2 has a restricted licence. The two news corpora contain texts from news media, both online and written, as well as some from TV and radio. Each text collection is published in two versions, as two independent corpora: one is unannotated, with each paragraph contained in a separate <p> tag, and the other is tokenized, lemmatized and morphosyntactically tagged. This corpus is the tokenized and annotated version of IGC-News1, in which each paragraph is contained in a <p> tag. The unannotated version can be found here: http://hdl.handle.net/20.500.12537/236. |
dc.description | ÍSLENSKA: IGC-News1 og IGC-News2 eru hluti af IGC-verkefninu (https://igc.arnastofnun.is) sem miðar að því að safna eins miklu og mögulegt er af íslenskum texta sem hægt er að birta með opnu eða takmörkuðu leyfi. IGC-News1 er með opið leyfi á meðan IGC-News2 er með takmarkað leyfi. Þessar tvær fréttamálheildir innihalda texta frá fréttamiðlum, á netinu og ritaða, auk sumra úr sjónvarpi og útvarpi. Hvert textasafn er gefið út í tveimur útgáfum, sem tveir sjálfstæðir hlutar. Önnur er ómörkuð og hver málsgrein er í sérstöku tagi (p) en sú síðari er tókuð, lemmuð og með málfræðilegum mörkum. Þessi málheild inniheldur tókaða og markaða útgáfu af IGC-News1. Ómarkaða útgáfu má finna hér: http://hdl.handle.net/20.500.12537/236. |
dc.language.iso | isl |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.relation.replaces | http://hdl.handle.net/20.500.12537/141 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://igc.arnastofnun.is |
dc.subject | corpora |
dc.subject | news |
dc.subject | pos |
dc.subject | lemmas |
dc.subject | lemmatized |
dc.subject | pos-tagged |
dc.subject | annotated |
dc.title | IGC-News1 22.10 (annotated version) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hasMetadata | false |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://malheildir.arnastofnun.is |
contact.person | Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies |
sponsor | Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið); The Icelandic Gigaword Corpus (G1); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 1787715 articles |
size.info | 23435693 sentences |
size.info | 396651451 words |
size.info | 436672313 tokens |
files.size | 8038944784 |
files.count | 3 |
Files in this item
Total size of all files in item: 7.49 GB
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | Size | Format | Description | MD5 |
IGC-News1-22.10.ana.zip | 2.49 GB | application/zip | IGC-News1-22.10.ana | 322643e5b8ca0a72be4dd095e4bee387 |
IGC-News1-22.10.ana.z01 | 5 GB | Unknown | IGC-News1-22.10.ana (part 2 of zip-file) | 6b0e740463e996c75e622c53060d3dde |
readme | 715 bytes | Unknown | instructions on how to unzip | d5457a14c854ddb8078d0bd890e99aeb |
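The MD5 values listed for the files above can be used to check that a download completed without corruption before unpacking the split archive. The following is a minimal sketch, not part of the record itself: it assumes the three files have been saved under their listed names in one local directory, and the helper names `md5_of_file` and `verify_downloads` are hypothetical. It does not unpack anything; the readme file in this item describes how to unzip.

```python
import hashlib
from pathlib import Path

# Expected MD5 digests, copied from the file list of this record.
EXPECTED_MD5 = {
    "IGC-News1-22.10.ana.zip": "322643e5b8ca0a72be4dd095e4bee387",
    "IGC-News1-22.10.ana.z01": "6b0e740463e996c75e622c53060d3dde",
    "readme": "d5457a14c854ddb8078d0bd890e99aeb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True (digest matches),
    False (digest differs) or None (file not found)."""
    results = {}
    for name, expected_digest in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected_digest
    return results


if __name__ == "__main__":
    # "." is an assumed download directory; adjust as needed.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, outcome in verify_downloads(".").items():
        print(f"{name}: {labels[outcome]}")
```

A file reported as MISMATCH is best downloaded again, since both parts of the split archive must be intact for extraction to succeed.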