Icelandic Gigaword Corpus (IGC-2022) - unannotated version

 
Clarin IS Repository
  Authors
Barkarson, Starkaður ; Steinþór, Steingrímsson ; Andrésdóttir, Þórdís Dröfn ; Hafsteinsdóttir, Hildur ; Ingimundarson, Finnur Ágúst ; Magnússon, Árni Davíð
  Item identifier
http://hdl.handle.net/20.500.12537/253
 Project URL
https://igc.arnastofnun.is
 Demo URL
https://malheildir.arnastofnun.is
 Referenced by
https://www.aclweb.org/anthology/L18-1690.pdf
 Date issued
2022-10-01
 Type
corpus, text
 Size
2428573565 words
 Language(s)
Icelandic
 Description
 
NOTE: An extension to IGC-2022 is now available, containing data from 2022 and 2023: http://hdl.handle.net/20.500.12537/359. The IGC project (Icelandic Gigaword Corpus) aims to collect as much Icelandic text as possible that can be published under an open or restricted licence. The project is divided into nine individual corpora, listed below. Each corpus comes in two versions: one contains the texts untokenized and untagged, with each paragraph enclosed in a <p> tag; the other has been tokenized, POS-tagged and lemmatized. The corpora listed below are the unannotated versions. The annotated versions can be found at http://hdl.handle.net/20.500.12537/254. The corpus has also been published in a JSONL format suitable for LLM training (http://hdl.handle.net/20.500.12537/334).
 
  • Adjud: http://hdl.handle.net/20.500.12537/240
  • Books: http://hdl.handle.net/20.500.12537/316
  • Journals: http://hdl.handle.net/20.500.12537/245
  • Law: http://hdl.handle.net/20.500.12537/247
  • News1: http://hdl.handle.net/20.500.12537/236
  • News2: http://hdl.handle.net/20.500.12537/238
  • Parla: http://hdl.handle.net/20.500.12537/208
  • Social: http://hdl.handle.net/20.500.12537/242
  • Wiki: http://hdl.handle.net/20.500.12537/251
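The description states that, in the unannotated version, each paragraph is enclosed in a <p> tag. A minimal sketch of pulling those paragraphs out of one file follows. It assumes the files are well-formed XML; the wrapper elements, the TEI namespace and the sample text in SAMPLE are hypothetical illustrations, not taken from the corpus, and the function name extract_paragraphs is likewise made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, turning '{uri}p' into 'p'."""
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(xml_text: str) -> list[str]:
    """Return the text of every <p> element, in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        # Match on the local name so the code works with or without a namespace.
        if isinstance(elem.tag, str) and local_name(elem.tag) == "p":
            # itertext() also collects text nested in child elements;
            # split/join collapses runs of whitespace and line breaks.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


# Hypothetical sample: real IGC files may use different wrappers or namespaces.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>Fyrsta málsgrein.</p>
    <p>Önnur   málsgrein
       á tveimur línum.</p>
  </body>
</text>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

Matching on the local tag name is a deliberate choice here, since the record does not say which namespace, if any, the files declare. For files too large to hold in memory, ET.iterparse from the same module could be used instead of ET.fromstring.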
 
 Publisher
The Árni Magnússon Institute for Icelandic Studies
 Acknowledgement

Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið)

Project code: The Icelandic Gigaword Corpus (G1)

Project name: Language Technology for Icelandic 2019-2023

 Subject(s)
igc unannotated corpus
 Collection(s)
Clarin IS
 
This item is replaced by a newer submission:
http://hdl.handle.net/20.500.12537/359
 
 

Partners, Coordination, Funding

  • Arni Magnusson Institute for Icelandic Studies
  • Ministry of Culture and Business Affairs

CLARIN-IS is fully supported by the Ministry of Culture and Business Affairs

Copyright (c) 2023. Arni Magnusson Institute for Icelandic Studies. All rights reserved.