*******************************************************************************
************* THE ICELANDIC GIGAWORD CORPUS 2 IN JSONL FORMAT *****************
*************** http://hdl.handle.net/20.500.12537/335 ************************
*******************************************************************************

This package contains those subcorpora of the Icelandic Gigaword Corpus,
version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been
published with a restricted licence. They are provided in a JSONL format that
is suitable for LLM training.

-------------------------------------------------------------------------------
ABOUT THE ICELANDIC GIGAWORD CORPUS (IGC):

Version 22.10 can be downloaded here: http://hdl.handle.net/20.500.12537/253

The Icelandic Gigaword Corpus (IGC) contains 8 corpora, in total almost
2.4 billion words:

Open licence:
    IGC-Journals     20.9 million words
    IGC-Law          53.3       -
    IGC-News1       396.7       -
    IGC-Parla       254.1       -
    IGC-Social      724.0       -
    IGC-Wiki          8.5       -

Restricted licence:
    IGC-Books        13.8       -
    IGC-News2       899.8       -

A corpus may contain two or more subcorpora. IGC-Books, for example, does not
contain any subcorpora: all the texts are in one teiCorpus tag, which contains
several TEI tags, one for each book. Each TEI tag is stored in an individual
XML file. IGC-News2, on the other hand, contains several subcorpora, one for
each media outlet. In some cases, as in IGC-Social, a subcorpus may itself
contain further subcorpora.

We do not have permission to distribute the data from Twitter (part of
IGC-Social). Users who download IGC have to fetch the original data themselves
and then use special scripts to insert the text into the TEI files. Due to
these complications, we do not include Twitter in this package.

For further information please refer to https://igc.arnastofnun.is.
-------------------------------------------------------------------------------
LICENSE:

The two corpora contained in this package are published with a restricted
licence, the so-called "ICG-Corpus License". Refer to the file
'userlicense_igc_restricted.pdf' for further information.

-------------------------------------------------------------------------------
THE JSONL FORMAT OF IGC:

Each subcorpus has been converted to one JSONL file with the "Icelandic
Gigaword Corpus JSONL Converter" (http://hdl.handle.net/20.500.12537/332).
The files are located in the folder 'concerted-corpora'.

Each XML file of the subcorpus (each book or news article) becomes a single
line in the JSONL file. The information and format of a single line is the
following:

{
    "document": "all text of the file, with paragraph splits shown as '\n\n'",
    "uuid": "a randomly generated ID for the json object",
    "metadata": {
        "author": "the original file's author, if available",
        "fetch_timestamp": "the date of the conversion",
        "xml_id": "the ID of the original XML file",
        "publish_timestamp": "the publishing date of the text in the original XML file",
        "title": {"offset": None, "length": None},  # the offset and length of the text's title
        "paragraphs": [{"offset": None, "length": None}, {"offset": None, "length": None}, ...],  # the offset and length of each paragraph
        "sentences": [{"offset": None, "length": None}, {"offset": None, "length": None}, ...],  # the offset and length of each sentence
        "source": "the source of the original text, taken from the XML file"
    }
}

Further information about each subcorpus is found in the folder
`datasets-info`. The information and format of the file in `datasets-info` is
the following:

{
    "subdirectory of the subcorpus, e.g. IGC-Adjud-Appeal": {
        "path": "path to the converted corpus",
        "quality": "quality categorization, taken from `Flokkun.tsv`, which was created by the Árni Magnússon Institute for Icelandic Studies",
        "domain": ["a list of all relevant domains, taken from `Flokkun.tsv`"],
        "lang": "the language of the corpus, which is 'is' for all current cases",
        "version": "the IGC version, which is 22.10 by default"
    }
}

Further information about the domains and about how the quality of the texts
was assessed is given below.

-------------------------------------------------------------------------------
USAGE:

The file example.py shows how information and data can be retrieved. After
setting the correct path to the downloaded folder in the file, you can run it
with

    python3 example.py

-------------------------------------------------------------------------------
CATEGORIES - DOMAIN:

We classified the 86 subcorpora into 13 domains or genres:

Adjudications
    Judgements from the three levels of jurisdiction in Iceland
    Subcorpus: IGC-Adjud
Blog
    Three online blogs
    Subcorpus: IGC-Social2
News
    Texts from general news media (online and printed)
News - local
    Texts from local news media (online and printed)
    Selected subcorpora from IGC-News1 and IGC-News2
News - radio
    Transcripts of news on the radio
    Selected subcorpora from IGC-News1
News - specialized
    Texts from media (online and printed) dedicated to specific issues
    (business, agriculture …)
    Selected subcorpora from IGC-News1 and IGC-News2
News - sport
    Texts from four websites that deliver sports news
    Selected subcorpora from IGC-News1 and IGC-News2
News - TV
    Transcripts of news on TV
    Selected subcorpora from IGC-News1
Online forum
    Three online forums
    Subcorpus: IGC-Social1
Parliamentary data
    Subcorpora: IGC-Parla, IGC-Law
    The Icelandic Law corpus, explanatory reports and observations extracted
    from bills submitted to Althingi, and parliamentary proposals and
    resolutions.
Published books
    Subcorpus: IGC-Books
Scientific journals
    Mainly printed journals but also a few online journals
    Subcorpus: IGC-Journals
Wikipedia
    Subcorpus: IGC-Wiki

-------------------------------------------------------------------------------
CATEGORIES - QUALITY:

We selected random sentences from each subcorpus (at most 50,000 tokens for
the bigger corpora) and corrected them using the Byte-Level Neural Error
Correction Model for Icelandic (http://hdl.handle.net/20.500.12537/255). Each
sentence was also analysed with Greynir
(http://hdl.handle.net/20.500.12537/269), and sentences that the tool
classified as foreign were marked as such. Finally, we calculated the ratio of
sentences containing errors or marked as foreign to the total number of
sentences.

We divided the texts into three groups, A - C, where A has the fewest
errors/foreign sentences and C the most. As expected, texts from public data,
scientific journals and news from the bigger news media (generally proofread
by professionals) mostly ranked high, and texts from the online forums ranked
lowest. Some texts that we had expected to rank high did not, however. The
reason is that many of the errors have nothing to do with the quality of the
original text but with how it was processed:

    - Texts from Morgunblaðið, which would be expected to rank high, often
      had the headlines glued to the main text, which caused errors.
    - The texts from many of the scientific journals were read with OCR,
      which can also lead to errors.
    - The parliamentary speeches, usually of good quality since they have
      been proofread, go back to the beginning of the 20th century, when
      spelling rules were different from those of today.
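
-------------------------------------------------------------------------------
EXAMPLE - USING THE OFFSETS:

The record layout described under THE JSONL FORMAT OF IGC can be used roughly
as sketched here. This is a minimal illustration, not the contents of
example.py. It assumes that "offset" and "length" are character positions
within the "document" string; this README does not say so explicitly, so check
that against the real data. The helper names (read_jsonl, extract_spans,
extract_title) and the hand-built sample record are hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def _slice(text, span):
    """Return text[offset:offset+length], or None if the span is incomplete."""
    if not span or span.get("offset") is None or span.get("length") is None:
        return None
    start = span["offset"]  # ASSUMPTION: character offset into "document"
    return text[start:start + span["length"]]


def extract_spans(record, key):
    """Return the text of every span under metadata[key] ('paragraphs' or 'sentences')."""
    text = record["document"]
    spans = record["metadata"].get(key) or []
    pieces = (_slice(text, span) for span in spans)
    return [piece for piece in pieces if piece is not None]


def extract_title(record):
    """Return the title text, or None when no title span is recorded."""
    return _slice(record["document"], record["metadata"].get("title"))


# Hypothetical sample record, built by hand to mirror the documented layout.
title = "Titill"
paras = ["Fyrsta málsgrein. Önnur setning.", "Síðasta málsgrein."]
sents = ["Fyrsta málsgrein.", "Önnur setning.", "Síðasta málsgrein."]
document = "\n\n".join([title] + paras)


def span_of(piece):
    """Locate a piece of text in the sample document and describe it as a span."""
    return {"offset": document.index(piece), "length": len(piece)}


sample_line = json.dumps({
    "document": document,
    "uuid": "hypothetical-id",
    "metadata": {
        "title": span_of(title),
        "paragraphs": [span_of(p) for p in paras],
        "sentences": [span_of(s) for s in sents],
    },
}, ensure_ascii=False)

record = json.loads(sample_line)
print(extract_title(record))
print(extract_spans(record, "paragraphs"))
print(extract_spans(record, "sentences"))
```

With the real data, read_jsonl would be given the path to one of the JSONL
files in the package, and each record it yields can be passed to
extract_spans in the same way.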