
Icelandic Gigaword Corpus (IGC-2022) - annotated version

 
Clarin IS Repository
  Authors
Barkarson, Starkaður; Steingrímsson, Steinþór; Andrésdóttir, Þórdís Dröfn; Hafsteinsdóttir, Hildur; Ingimundarson, Finnur Ágúst; Magnússon, Árni Davíð
  Item identifier
http://hdl.handle.net/20.500.12537/254
 Project URL
https://igc.arnastofnun.is
 Demo URL
https://malheildir.arnastofnun.is
 Referenced by
http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.254.pdf
 Date issued
2022-10-01
 Type
corpus, text
 Size
2428573565 words, 156052431 sentences
 Language(s)
Icelandic
 Description
The Icelandic Gigaword Corpus (IGC) project aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. The project is divided into nine individual corpora, listed below. Each corpus comes in two versions: one contains the texts untokenized and untagged, with each paragraph enclosed in a <p> tag; the other has been tokenized, POS-tagged and lemmatized. The corpora listed below are the annotated versions. The unannotated versions can be found at http://hdl.handle.net/20.500.12537/253.

  • Adjud: http://hdl.handle.net/20.500.12537/241
  • Books: http://hdl.handle.net/20.500.12537/317
  • Journals: http://hdl.handle.net/20.500.12537/246
  • Law: http://hdl.handle.net/20.500.12537/248
  • News1: http://hdl.handle.net/20.500.12537/237
  • News2: http://hdl.handle.net/20.500.12537/239
  • Parla: http://hdl.handle.net/20.500.12537/216
  • Social: http://hdl.handle.net/20.500.12537/243
  • Wiki: http://hdl.handle.net/20.500.12537/252
 Publisher
The Árni Magnússon Institute for Icelandic Studies
 Acknowledgement

Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið)

Project code: The Icelandic Gigaword Corpus (G1)

Project name: Language Technology for Icelandic 2019-2023

 Subject(s)
igc annotated pos-tagged lemmatized
 Collection(s)
Clarin IS
 
This item is replaced by a newer submission:
annotated: http://hdl.handle.net/20.500.12537/358

Partners, Coordination, Funding

  • Arni Magnusson Institute for Icelandic Studies
  • Ministry of Culture and Business Affairs


CLARIN-IS is fully supported by the Ministry of Culture and Business Affairs

Copyright (c) 2023. Arni Magnusson Institute for Icelandic Studies. All rights reserved.