dc.contributor.author Barkarson, Starkaður
dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2025-01-09T09:48:02Z
dc.date.available 2025-01-09T09:48:02Z
dc.date.issued 2024-12-31
dc.identifier.uri http://hdl.handle.net/20.500.12537/358
dc.description [ENGLISH] This version, IGC-2024ext, is an extension to IGC-2022 [http://hdl.handle.net/20.500.12537/254] and in most cases contains only texts from 2022 and 2023 (see the README for details). The IGC project (Icelandic Gigaword Corpus) aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. The project is divided into nine individual corpora, but this version contains new data for only five of them. Each corpus comes in two versions: one contains the texts untokenized and untagged, with each paragraph enclosed in a <p> tag, while the other has been tokenized, POS-tagged, and lemmatized. The corpora listed below are the annotated versions. The unannotated corpora can be found at http://hdl.handle.net/20.500.12537/359.
dc.description ICELANDIC: This release, IGC-2024ext, is a supplement to the IGC-2022 release [http://hdl.handle.net/20.500.12537/254], and in most cases the corpora contain texts from 2022 and 2023 (see the README for details). The IGC project (Íslenska risamálheildin, the Icelandic Gigaword Corpus) [https://igc.arnastofnun.is] aims to collect as much text as possible that may be published under an open or restricted licence. The project consists of nine independent corpora, but this release contains data for only five of them. Each corpus is published in two versions. One contains documents of plain text that has not been tokenized. The other contains the text tokenized, tagged, and lemmatized. The corpora below contain tagged text. The untagged version is available at http://hdl.handle.net/20.500.12537/359.
  Adjud http://hdl.handle.net/20.500.12537/356
  Law http://hdl.handle.net/20.500.12537/357
  News1 http://hdl.handle.net/20.500.12537/339
  News2 http://hdl.handle.net/20.500.12537/346
  Parla http://hdl.handle.net/20.500.12537/355
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf
dc.relation.replaces http://hdl.handle.net/20.500.12537/254
dc.source.uri http://igc.arnastofnun.is
dc.subject igc
dc.subject 2024ext
dc.subject annotated
dc.subject pos-tagged
dc.subject lemmatized
dc.subject gigaword
dc.subject icelandic
dc.subject 2024
dc.title Icelandic Gigaword Corpus (IGC-2024ext) - annotated version
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden false
has.files yes
branding Clarin IS Repository
demo.uri https://malheildir.arnastofnun.is
contact.person Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies
size.info 9688829 sentences
size.info 162171483 words
size.info 179141662 tokens
files.size 2552
files.count 1


 Files in this item

Name README
Size 2.49 KB
Format Unknown
Description Readme file
MD5 3a1adb54729655c439af168f0c28edb4