
The Icelandic Gigaword Corpus (IGC) 2021

 
Clarin IS Repository
  Authors
Barkarson, Starkaður ; Steingrímsson, Steinþór ; Hafsteinsdóttir, Hildur ; Andrésdóttir, Þórdís Dröfn ; Eiríksdóttir, Inga Guðrún ; Magnússon, Bolli ; Ingimundarson, Finnur
  Item identifier
http://hdl.handle.net/20.500.12537/192
 Project URL
http://igc.arnastofnun.is
 Demo URL
https://malheildir.arnastofnun.is
 Referenced by
https://www.aclweb.org/anthology/L18-1690.pdf
 Date issued
2021-12-31
 Type
corpus, text
 Size
1871000000 words
 Language(s)
Icelandic
 Description
The IGC project (Icelandic Gigaword Corpus, Íslenska risamálheildin) aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. The project is divided into eight individual corpora, listed below. Each corpus comes in two formats: one contains the texts untokenized and untagged, with each paragraph enclosed in a <p> tag; the other has been tokenized, POS-tagged and lemmatized.

IGC-Adjud: http://hdl.handle.net/20.500.12537/101
IGC-Books: http://hdl.handle.net/20.500.12537/126
IGC-Journals: http://hdl.handle.net/20.500.12537/166
IGC-Laws: http://hdl.handle.net/20.500.12537/116
IGC-News1: http://hdl.handle.net/20.500.12537/141
IGC-News2: http://hdl.handle.net/20.500.12537/142
IGC-Parla: http://hdl.handle.net/20.500.12537/179
IGC-Social: http://hdl.handle.net/20.500.12537/138
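The untokenized format, in which each paragraph sits inside a <p> tag, can be read with a standard XML parser. The following is a minimal sketch only: the sample document, its <text> and <body> wrapper elements, and the helper names local_name and extract_paragraphs are all assumptions made for illustration and are not taken from the corpus documentation. The only detail drawn from the description is that paragraphs are enclosed in <p> tags. The sketch matches elements by local name so that it would also work if the real files declare an XML namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Reduce a namespaced tag such as '{http://example.org/ns}p' to 'p'."""
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(xml_text: str) -> list[str]:
    """Return the text of every <p> element, with whitespace normalized."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for element in root.iter():
        # Skip anything whose tag is not a plain string (defensive check).
        if not isinstance(element.tag, str):
            continue
        if local_name(element.tag) == "p":
            # Join nested text and collapse runs of whitespace and newlines.
            text = " ".join("".join(element.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


# Hypothetical sample: the wrapper elements are invented for this example.
SAMPLE = """<text><body><p>Þetta er fyrsta málsgreinin.</p><p>Þetta er   önnur
málsgreinin.</p></body></text>"""

paragraphs = extract_paragraphs(SAMPLE)
word_count = sum(len(p.split()) for p in paragraphs)
```

Splitting on whitespace, as done for word_count here, is only a rough count; the tokenized, POS-tagged and lemmatized format is the better source for word-level statistics.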
 Publisher
The Árni Magnússon Institute for Icelandic Studies
 Acknowledgement

Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið)

Project code: The Icelandic Gigaword Corpus (G1)

Project name: Language Technology for Icelandic 2019-2023

 Subject(s)
igc gigaword corpus lemmatized pos-tagged pos
 Collection(s)
Clarin IS
 
This item is replaced by a newer submission:
http://hdl.handle.net/20.500.12537/254

Partners, Coordination, Funding

  • Arni Magnusson Institute for Icelandic Studies
  • Ministry of Culture and Business Affairs

CLARIN-IS is fully supported by the Ministry of Culture and Business Affairs

Copyright (c) 2023. Arni Magnusson Institute for Icelandic Studies. All rights reserved.