*******************************************************************************
*************THE ICELANDIC GIGAWORD CORPUS 2 IN JSONL FORMAT ******************
************ http://hdl.handle.net/20.500.12537/335        ********************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 
22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with a 
restricted licence, in JSONL format, which is suitable for LLM training.

-----------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost 2.4 
billion words:

Open licence:
 IGC-Journals     20.9 million words
 IGC-Law          53.3 million words
 IGC-News1       396.7 million words
 IGC-Parla       254.1 million words
 IGC-Social      724.0 million words
 IGC-Wiki          8.5 million words

Restricted licence:
 IGC-Books        13.8 million words
 IGC-News2       899.8 million words

Each corpus may contain two or more subcorpora. IGC-Books, for example, does 
not contain any subcorpora, meaning that all the texts are in one teiCorpus tag 
which contains several TEI tags, one for each book. Each TEI tag is stored in an 
individual XML file. IGC-News2, on the other hand, contains several subcorpora, 
one for each media outlet. In some cases, as in IGC-Social, a subcorpus may 
itself contain further subcorpora.

Since we do not have permission to distribute the data from Twitter (part of 
IGC-Social), users who download IGC have to fetch the original data themselves and 
then use special scripts to insert the text into the TEI files. Due to these 
complications, we do not include Twitter in this package.

For further information please refer to https://igc.arnastofnun.is.

-----------------------------------------------------------------------------
LICENSE:

The two corpora contained in this package are published with a restricted licence,
the so-called "ICG-Corpus License". Refer to the file 'userlicense_igc_restricted.pdf'
for further information.

-----------------------------------------------------------------------------
THE JSONL FORMAT OF IGC:

Each subcorpus has been converted to one JSONL file with the "Icelandic Gigaword 
Corpus JSONL Converter" (http://hdl.handle.net/20.500.12537/332). The files are 
located in the folder 'concerted-corpora'. Each XML file of the subcorpus (each book 
or news article) becomes a single line in the JSONL file. Each line has the following 
content and format:

{
    "document": "all text of the file, with paragraph splits shown as '\n\n'", 
    "uuid": "a randomly generated ID for the json object", 
    "metadata": 
    {
        "author": "the original file's author, if available", 
        "fetch_timestamp": "the date of the conversion", 
        "xml_id": "the ID of the original XML file", 
        "publish_timestamp": "the publishing date of the text in the original XML file", 
        "title": {"offset": None, "length": None},                                                  
             # the offset and length of the text's title
        "paragraphs": [{"offset": None, "length": None}, {"offset": None, "length": None}, ...],    
             # the offset and length of each paragraph
        "sentences": [{"offset": None, "length": None}, {"offset": None, "length": None}, ...],     
             # the offset and length of each sentence 
        "source": "the source of the original text, taken from the XML file"
    }
}
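
The offset/length pairs can be used to cut the title, paragraphs and sentences 
out of the "document" string. The following is a minimal sketch, not the code of 
example.py. It assumes that offsets and lengths are counted in characters from 
the start of "document" and that the title is part of that string; neither is 
stated above, so verify both on real data. The record is made up for illustration.

```python
import json


def span_text(document, span):
    """Return the slice of `document` described by an offset/length pair."""
    start = span["offset"]
    return document[start:start + span["length"]]


# A made-up record in the layout described above (not real corpus data).
# Assumption: offsets/lengths are character positions within "document".
line = json.dumps({
    "document": "Title here\n\nFirst sentence. Second one.",
    "uuid": "00000000-0000-0000-0000-000000000000",
    "metadata": {
        "title": {"offset": 0, "length": 10},
        "paragraphs": [{"offset": 12, "length": 27}],
        "sentences": [
            {"offset": 12, "length": 15},
            {"offset": 28, "length": 11},
        ],
    },
})

# Each line of a JSONL file is one JSON object.
record = json.loads(line)
doc = record["document"]
meta = record["metadata"]

title = span_text(doc, meta["title"])
paragraphs = [span_text(doc, p) for p in meta["paragraphs"]]
sentences = [span_text(doc, s) for s in meta["sentences"]]

print(title)
print(sentences)
```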

Further information about each subcorpus is found in the folder `datasets-info`. The 
file in `datasets-info` has the following content and format:

{
    "subdirectory of the subcorpus, e.g. IGC-Adjud-Appeal": 
    {
        "path": "path to the converted corpus", 
        "quality": "quality categorization, taken from `Flokkun.tsv`, which was 
                    created by the Árni Magnússon Institute for Icelandic Studies", 
        "domain": ["a list of all relevant domains, taken from `Flokkun.tsv`"], 
        "lang": "the language of the corpus, which is 'is' for all current cases", 
        "version": "the IGC version, which is 22.10 by default"
    }
}
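
This information can be used to pick subcorpora by quality or domain before 
loading any text. The sketch below is illustrative only: the subcorpus names, 
paths and values are made up, and it assumes that "quality" is stored as a 
single letter (A, B or C) and that the domain strings match the names listed 
under CATEGORIES - DOMAIN.

```python
import json


def select_subcorpora(info, quality=None, domain=None):
    """Return the names of subcorpora matching the given quality and/or domain."""
    selected = []
    for name, entry in info.items():
        if quality is not None and entry["quality"] != quality:
            continue
        if domain is not None and domain not in entry["domain"]:
            continue
        selected.append(name)
    return selected


# Made-up entries in the layout described above (hypothetical names and paths).
info = json.loads("""
{
    "IGC-Example-One": {
        "path": "some/path/IGC-Example-One.jsonl",
        "quality": "A",
        "domain": ["News"],
        "lang": "is",
        "version": "22.10"
    },
    "IGC-Example-Two": {
        "path": "some/path/IGC-Example-Two.jsonl",
        "quality": "B",
        "domain": ["News", "News - sport"],
        "lang": "is",
        "version": "22.10"
    }
}
""")

high_quality = select_subcorpora(info, quality="A")
sports = select_subcorpora(info, domain="News - sport")
all_news = select_subcorpora(info, domain="News")

print(high_quality, sports, all_news)
```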

Further information about the domains and about how the quality of the texts was 
assessed is given below.

--------------------------------------------------------------------------------
USAGE:

The file example.py shows how information and data can be retrieved. After entering 
the correct path to the downloaded folder in the file, you can run it with
  python3 example.py
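
Because some of the JSONL files are large, it may be preferable to read them one 
line at a time rather than loading a whole file into memory. The sketch below is 
not the content of example.py; the helper name iter_documents is hypothetical, 
and the demo file is a temporary file with made-up content so that the snippet 
runs on its own.

```python
import json
import tempfile
from pathlib import Path


def iter_documents(jsonl_path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo on a temporary file with made-up content; replace the path with a file
# from the downloaded folder to use it on the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        json.dumps({"document": "Fyrsta skjal.", "uuid": "a"}) + "\n"
        + json.dumps({"document": "Annað skjal.", "uuid": "b"}) + "\n",
        encoding="utf-8",
    )
    documents = list(iter_documents(demo_path))

total_chars = sum(len(d["document"]) for d in documents)
print(len(documents), total_chars)
```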
  
--------------------------------------------------------------------------------
CATEGORIES - DOMAIN:

We classified the 86 subcorpora into 13 domains or genres:

Adjudications 
   Judgements from the three levels of jurisdiction in Iceland
   Subcorpus: IGC-Adjud
Blog 
   Three online blogs
   Subcorpus: IGC-Social2
News 
   Texts from general news media (online and printed) 
News - local 
   Texts from local news media (online and printed)
   Selected subcorpora from IGC-News1 and IGC-News2
News - radio
   Transcripts of news on the radio
   Selected subcorpora from IGC-News1
News - specialized 
   Texts from media (online and printed) dedicated to specific issues (business, 
     agriculture …)
   Selected subcorpora from IGC-News1 and IGC-News2
News - sport
   Texts from four websites that deliver sports news
   Selected subcorpora from IGC-News1 and IGC-News2
News - TV 
   Transcripts of news on TV
   Selected subcorpora from IGC-News1
Online forum 
   Three online forums
   Subcorpus: IGC-Social1
Parliamentary data 
   The Icelandic Law corpus, explanatory reports and observations extracted from 
     bills submitted to Althingi, and parliamentary proposals and resolutions.
   Subcorpora: IGC-Parla, IGC-Law
Published books
   Subcorpus: IGC-Books
Scientific journals 
   Mainly printed journals but also a few online journals
   Subcorpus: IGC-Journals
Wikipedia
   Subcorpus: IGC-Wiki

------------------------------------------------------------------------------------
CATEGORIES - QUALITY

We selected random sentences from each subcorpus (at most 50,000 tokens for the bigger 
corpora), which were then corrected using the Byte-Level Neural Error Correction Model 
for Icelandic (http://hdl.handle.net/20.500.12537/255). Each sentence was also analysed 
with Greynir (http://hdl.handle.net/20.500.12537/269), and sentences that the tool 
classified as foreign were marked as such. Finally, the ratio of sentences containing 
errors or marked as foreign to the total number of sentences was calculated. 
We divided the texts into three groups, A - C, where A has the fewest errors/foreign 
sentences and C the most.
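
The ratio described above can be sketched as follows. This is an illustration 
only, not the procedure actually used: the field names has_error and is_foreign 
are hypothetical, and the cut-off values between the groups are invented for the 
example, since the real thresholds are not given here.

```python
def flagged_ratio(sentences):
    """Share of sentences that contain errors or were marked as foreign."""
    if not sentences:
        return 0.0
    # A sentence that is both erroneous and foreign is counted only once.
    flagged = sum(1 for s in sentences if s["has_error"] or s["is_foreign"])
    return flagged / len(sentences)


def quality_group(ratio, a_max=0.1, b_max=0.3):
    """Map a ratio to group A, B or C.

    The thresholds a_max and b_max are invented for this example.
    """
    if ratio <= a_max:
        return "A"
    if ratio <= b_max:
        return "B"
    return "C"


# Made-up sample: one of four sentences is flagged.
sample = [
    {"has_error": False, "is_foreign": False},
    {"has_error": True, "is_foreign": True},
    {"has_error": False, "is_foreign": False},
    {"has_error": False, "is_foreign": False},
]

ratio = flagged_ratio(sample)
group = quality_group(ratio)
print(ratio, group)
```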

As expected, texts from public data, scientific journals and news from the bigger news 
media (generally proofread by professionals) mostly ranked high, and texts from the 
online forums ranked lowest, but some texts that we had expected to rank high did not. 
This is because many of the errors have nothing to do with the quality of 
the original text but with how it was processed. Texts from Morgunblaðið, which you would 
expect to rank high, often had the headlines glued to the main text, which caused errors. 
The texts from many of the scientific journals were read with OCR, which can also lead to 
errors. Finally, the parliamentary speeches, usually of good quality since they have been 
proofread, go back to the beginning of the 20th century, when spelling rules were 
different from now.