dc.contributor.author Barkarson, Starkaður
dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2022-07-20T12:46:58Z
dc.date.available 2022-07-20T12:46:58Z
dc.date.issued 2022-10-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/243
dc.description ENGLISH: IGC-Social is part of the IGC project (Icelandic Gigaword Corpus), which aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. IGC-Social contains texts from three blog sites, three forums and Twitter. The corpus comes in two formats: IGC-Social contains untokenized, untagged text in which each paragraph is enclosed in a <p> tag, while IGC-Social.ana is the linguistically annotated version, which has been tokenized, POS-tagged and lemmatized. This item contains the tokenized and annotated version. The unannotated version can be found here: http://hdl.handle.net/20.500.12537/242. The subcorpus IGC-Social3 contains tweets from Twitter. It has been dehydrated: all tweet texts have been removed, but the tweet IDs and POS tags are retained. For information on how to rehydrate the corpus, refer to README.txt in the folder Twitter/scripts_and_data/. The forum texts in the subcorpus IGC-Social1 have been split into sentences and shuffled to comply with copyright law.
dc.description ÍSLENSKA (in English translation): IGC-Social is part of the IGC project (https://igc.arnastofnun.is), which aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. IGC-Social contains texts from three blog sites, three forums (bland.is, hugi.is, malenfnin.com) and Twitter. The corpus is published in two versions. IGC-Social contains untokenized and untagged text, while IGC-Social.ana is both tokenized and tagged. This corpus contains the tagged version of IGC-Social. The untagged version can be found here: http://hdl.handle.net/20.500.12537/242. The subcorpus IGC-Social3 contains tweets from Twitter. All text has been removed, but information about tweet IDs and tags is still present. Information on how to retrieve the texts and insert them into the files can be found in a README file. Texts taken from the forums were split into sentences, which were then shuffled to comply with copyright law.
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf
dc.relation.replaces http://hdl.handle.net/20.500.12537/138
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri http://igc.arnastofnun.is
dc.subject corpora
dc.subject social media
dc.subject twitter
dc.subject blog
dc.subject forums
dc.subject pos
dc.subject pos-tagged
dc.subject lemmas
dc.subject lemmatize
dc.subject annotated
dc.title IGC-Social 22.10 (annotated version)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hasMetadata false
has.files yes
branding Clarin IS Repository
demo.uri https://malheildir.arnastofnun.is
contact.person Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
sponsor Ministry of Education, Science and Culture (Mennta- og menningamálaráðuneytið) Language Technology for Icelandic 2019-2023 The Icelandic Gigaword Corpus (G1) nationalFunds
size.info 724999194 words
size.info 806949613 tokens
size.info 59226104 sentences
files.size 9582890821
files.count 3


 Files in this item

Total size of all files in item: 8.92 GB
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name IGC-Social-22.10.ana.zip
Size 4.04 GB
Format application/zip
Description IGC-Social-22.10.ana zip file 1/2
MD5 9284b5621823d3a1b94834b6c1e2bfbc

Name IGC-Social-22.10.ana.z01
Size 4.88 GB
Format Unknown
Description IGC-Social-22.10.ana zip file 2/2
MD5 61b628c9ce0c0fbd6b72f64051dd77a9

Name readme
Size 718 bytes
Format Unknown
Description Information about how to unzip multiple files
MD5 1f9d4a87d3efc5f13cd66b77bf8173ba
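Because the two archive parts are several gigabytes each, it may be worth checking the downloads against the MD5 values listed for the files in this item before trying to unpack them. The following is a minimal sketch in Python, assuming the three files were saved under their listed names in one directory; the function names `md5_of_file` and `verify_downloads` are illustrative and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this item.
PUBLISHED_MD5 = {
    "IGC-Social-22.10.ana.zip": "9284b5621823d3a1b94834b6c1e2bfbc",
    "IGC-Social-22.10.ana.z01": "61b628c9ce0c0fbd6b72f64051dd77a9",
    "readme": "1f9d4a87d3efc5f13cd66b77bf8173ba",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected in PUBLISHED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the files sit in the current working directory.
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

A "mismatch" result most likely indicates an incomplete or corrupted download, in which case the affected file should be fetched again before following the unpacking instructions in the readme file.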