###################################################################################
### Texts from the Icelandic Web of Science and the European Web * JSONL-FORMAT ###
### http://hdl.handle.net/20.500.12537/363                                      ### 
###################################################################################

The dataset contains questions and answers from the Icelandic Web of Science 
(www.visindavefur.is) and the European Web (www.evropuvefur.is), run by the 
University of Iceland. The corpus does not contain all the texts from the 
websites but only those authorized by the authors. The corpus can be found
in CLARIN-IS's repository, both the unannotated version 
(http://hdl.handle.net/20.500.12537/361) and the annotated version 
(http://hdl.handle.net/20.500.12537/362).

In the original corpus, the texts were divided into four parts: 'question', 
'long question' (if available), 'answer' and 'rest' (references, information 
about photos, footnotes, etc.). In this dataset, only the shorter version of 
the question and the answer are included. NOTE that in some cases it was not 
possible to remove all extra text (such as footnotes) from the answer.

-----------------------------------------------------------------------------
## LICENSE:

The dataset contained in this package is published under a restricted license
(https://repository.clarin.is/licenses/userlicense_igc_restricted_download_en.pdf).

-----------------------------------------------------------------------------
## THE JSONL FORMAT:

Each line in the JSONL file contains one article (a question and an answer).
A single line has the following format:
```
  {
      "document":
        {
          "question": "The question",
          "answer": "The answer"
        },

      "uuid": "a randomly generated ID for the json object",
      "metadata":
      {
          "author": "the original file's author, if available",
          "fetch_timestamp": "the date of the conversion",
          "xml_id": "the ID of the original XML file",
          "publish_timestamp": "the publishing date of the text in the original XML file",
          "question":
            {
               "paragraphs": [{"offset": null, "length": null}, {"offset": null, "length": null}, ...],
               # the offset and length of each paragraph in document['question']
               "sentences": [{"offset": null, "length": null}, {"offset": null, "length": null}, ...],
               # the offset and length of each sentence in document['question']
            },
          "answer":
            {
               "paragraphs": [{"offset": null, "length": null}, {"offset": null, "length": null}, ...],
               # the offset and length of each paragraph in document['answer']
               "sentences": [{"offset": null, "length": null}, {"offset": null, "length": null}, ...],
               # the offset and length of each sentence in document['answer']
            },
          "source": "the source of the original text, taken from the XML file"
      }
  }
```
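
As a concrete illustration of how the offsets work, here is a hypothetical record (all values below are invented for demonstration and are not taken from the dataset). Each offset/length pair in the metadata slices directly into the corresponding string in `document`:

```
import json

# A hypothetical JSONL line; all values are invented for illustration.
line = json.dumps({
    "document": {
        "question": "What is CLARIN?",
        "answer": "CLARIN is a research infrastructure.\nIt supports language resources."
    },
    "uuid": "00000000-0000-0000-0000-000000000000",
    "metadata": {
        "author": "N.N.",
        "fetch_timestamp": "2022-01-01",
        "xml_id": "example.xml",
        "publish_timestamp": "2005-01-01",
        "question": {
            "paragraphs": [{"offset": 0, "length": 15}],
            "sentences": [{"offset": 0, "length": 15}]
        },
        "answer": {
            "paragraphs": [{"offset": 0, "length": 36}, {"offset": 37, "length": 31}],
            "sentences": [{"offset": 0, "length": 36}, {"offset": 37, "length": 31}]
        },
        "source": "example"
    }
})

article = json.loads(line)
# slice the first answer paragraph out of the document text
first = article["metadata"]["answer"]["paragraphs"][0]
text = article["document"]["answer"][first["offset"]:first["offset"] + first["length"]]
print(text)  # CLARIN is a research infrastructure.
```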

--------------------------------------------------------------------------------
## USAGE:

You can simply read the file one line at a time and parse each line as a 
JSON object. Below is a complete Python script that prints each paragraph 
of the question and the answer in every article:

```
import json

path2file = "/path/to/VV_EN.jsonl"

# read the file one line (i.e. one article) at a time
with open(path2file, "r", encoding="utf-8") as f:
    for line in f:
        # parse the line as a JSON object
        article = json.loads(line)

        print("## QUESTION ##")
        # each item holds the offset and length of one paragraph in 'question'
        for paragraph in article['metadata']['question']['paragraphs']:
            # print the corresponding substring of article['document']['question']
            print(article['document']['question'][paragraph['offset']:paragraph['offset'] + paragraph['length']])

        print("## ANSWER ##")
        # each item holds the offset and length of one paragraph in 'answer'
        for paragraph in article['metadata']['answer']['paragraphs']:
            # print the corresponding substring of article['document']['answer']
            print(article['document']['answer'][paragraph['offset']:paragraph['offset'] + paragraph['length']])
```
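The same offset/length pattern applies to sentences. As a sketch (using an invented in-memory record rather than the real file), a small helper can collect the answer's sentences into a list:

```
import json

def answer_sentences(article):
    """Return the answer's sentences as a list of strings,
    using the offset/length spans in the metadata."""
    text = article["document"]["answer"]
    spans = article["metadata"]["answer"]["sentences"]
    return [text[s["offset"]:s["offset"] + s["length"]] for s in spans]

# hypothetical record for demonstration (not taken from the dataset)
article = json.loads(json.dumps({
    "document": {"answer": "First sentence. Second sentence."},
    "metadata": {"answer": {"sentences": [
        {"offset": 0, "length": 15},
        {"offset": 16, "length": 16},
    ]}}
}))
print(answer_sentences(article))  # ['First sentence.', 'Second sentence.']
```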