dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2022-07-20T12:46:58Z |
dc.date.available | 2022-07-20T12:46:58Z |
dc.date.issued | 2022-10-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/243 |
dc.description | ENGLISH: IGC-Social is part of the IGC project (Icelandic Gigaword Corpus), which aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. IGC-Social contains texts from three blog sites, three forums and Twitter. The corpus comes in two formats: IGC-Social contains untokenized, untagged plain text in which each paragraph is enclosed in a <p> tag, while IGC-Social.ana is a linguistically annotated version that has been tokenized, POS-tagged and lemmatized. This item contains the tokenized and annotated version, IGC-Social.ana. The unannotated version can be found here: http://hdl.handle.net/20.500.12537/242. The subcorpus IGC-Social3 contains tweets from Twitter. It has been dehydrated: all texts have been removed, but the tweets' IDs and POS tags are still present. For information on how to rehydrate the corpus, please refer to README.txt in the folder Twitter/scripts_and_data/. The forum texts in the subcorpus IGC-Social1 have been split into sentences and shuffled to comply with copyright law. |
dc.description | ÍSLENSKA (English translation): IGC-Social is part of the IGC project (https://igc.arnastofnun.is), which aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. IGC-Social contains texts from three blog sites, three forums (bland.is, hugi.is, malenfnin.com) and Twitter. The corpus is released in two versions. IGC-Social contains untokenized and untagged text, while IGC-Social.ana is both tokenized and tagged. This corpus contains the tagged version of IGC-Social. The untagged version can be found here: http://hdl.handle.net/20.500.12537/242. The subcorpus IGC-Social3 contains tweets from Twitter. All text has been removed, but information about tweet IDs and tags is still present. Information on how to fetch the texts and insert them into the files can be found in a README file. Texts taken from the forums were split into sentences, which were then shuffled to comply with copyright law. |
dc.language.iso | isl |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf |
dc.relation.replaces | http://hdl.handle.net/20.500.12537/138 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://igc.arnastofnun.is |
dc.subject | corpora |
dc.subject | social media |
dc.subject | blog |
dc.subject | forums |
dc.subject | pos |
dc.subject | pos-tagged |
dc.subject | lemmas |
dc.subject | lemmatized |
dc.subject | annotated |
dc.title | IGC-Social 22.10 (annotated version) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hasMetadata | false |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://malheildir.arnastofnun.is |
contact.person | Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies |
sponsor | Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið) Language Technology for Icelandic 2019-2023 The Icelandic Gigaword Corpus (G1) nationalFunds |
size.info | 724999194 words |
size.info | 806949613 tokens |
size.info | 59226104 sentences |
files.size | 9582890821 |
files.count | 3 |
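The size fields above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming that the "GB" figures shown for the download are binary gigabytes (2^30 bytes); the function names are illustrative and not part of the repository.

```python
# Figures copied from the size.info and files.size fields of this record.
WORDS = 724_999_194
TOKENS = 806_949_613
SENTENCES = 59_226_104
FILES_SIZE_BYTES = 9_582_890_821


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (assumption: 1 GB = 2**30 bytes)."""
    return num_bytes / 2**30


def mean_per_sentence(count: int, sentences: int) -> float:
    """Average number of units (words or tokens) per sentence."""
    return count / sentences


if __name__ == "__main__":
    print(f"total size: {bytes_to_gib(FILES_SIZE_BYTES):.2f} GiB")
    print(f"tokens per sentence: {mean_per_sentence(TOKENS, SENTENCES):.2f}")
    print(f"words per sentence: {mean_per_sentence(WORDS, SENTENCES):.2f}")
```

Under that assumption the byte count in files.size agrees with the 8.92 GB total reported for the download.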
Files in this item
Total download size: 8.92 GB. This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name | Size | Format | Description | MD5 |
IGC-Social-22.10.ana.zip | 4.04 GB | application/zip | IGC-Social-22.10.ana zip file 1/2 | 9284b5621823d3a1b94834b6c1e2bfbc |
IGC-Social-22.10.ana.z01 | 4.88 GB | Unknown | IGC-Social-22.10.ana zip file 2/2 | 61b628c9ce0c0fbd6b72f64051dd77a9 |
readme | 718 bytes | Unknown | Information about how to unzip multiple files | 1f9d4a87d3efc5f13cd66b77bf8173ba |
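Because the archive is split across two large parts, it may be worth checking each download against the MD5 values listed above before unpacking. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify_downloads are hypothetical, and the files are assumed to sit in one local directory under the names shown in the listing.

```python
import hashlib
from pathlib import Path

# Expected MD5 digests, copied from the file listing in this record.
EXPECTED_MD5 = {
    "IGC-Social-22.10.ana.zip": "9284b5621823d3a1b94834b6c1e2bfbc",
    "IGC-Social-22.10.ana.z01": "61b628c9ce0c0fbd6b72f64051dd77a9",
    "readme": "1f9d4a87d3efc5f13cd66b77bf8173ba",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists in `directory`
    and its MD5 digest matches, otherwise False."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == want
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded to the current directory.
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A mismatch usually means an incomplete or corrupted download, in which case the affected part should be fetched again before following the unzip instructions in the readme.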