dc.contributor.author Arnardóttir, Þórunn
dc.contributor.author Einarsson, Elías Bjartur
dc.date.accessioned 2024-09-03T14:49:18Z
dc.date.available 2024-09-03T14:49:18Z
dc.date.issued 2024-10-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/337
dc.description A question answering dataset intended to measure a large language model's knowledge of Icelandic culture and history and its ability to answer questions correctly. The dataset is split into two parts, a gold corpus and a silver corpus. The gold corpus consists of 2,000 manually reviewed question-answer pairs, while the silver corpus consists of 10,644 question-answer pairs that have not been manually reviewed. All pairs were originally generated automatically by GPT-4-turbo from Icelandic Wikipedia articles and online news from RÚV, both of which are included in the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/236). In the gold corpus, 1,900 pairs are from Wikipedia articles and 100 pairs are from news texts. In the silver corpus, 9,610 pairs are from Wikipedia articles and 1,034 pairs are from news texts. The dataset is published as JSONL files in a format compatible with many question answering datasets. Additionally, the gold corpus is published in formats compatible with BIG-bench and OpenAI-evals. For more information on the formats, see the README.
dc.language.iso isl
dc.publisher Miðeind ehf.
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject question answering
dc.subject qa
dc.subject large language models
dc.subject llm
dc.subject culture
dc.subject history
dc.title Icelandic Culture and History QA Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Þórunn Arnardóttir thorunn@mideind.is Miðeind ehf.
sponsor Ministry of Education, Science and Culture; Benchmarking datasets for LLMs (G12); Language Technology for Icelandic 2019-2023; nationalFunds
size.info 12644 entries
files.size 12645503
files.count 2
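
The description above states that the dataset is published as JSONL files, that is, one JSON object per line. The following is a minimal sketch of how such a file could be read and its entries counted with the Python standard library. The file name used in the usage note is hypothetical, and no field names are assumed, since the record itself does not list them (the README documents the actual formats).

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_entries(path):
    """Return the number of JSON records in a JSONL file."""
    return sum(1 for _ in read_jsonl(path))
```

For example, `count_entries("gold.jsonl")` (a hypothetical file name) would be expected to return 2000 if that file holds the gold corpus as described in this record.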


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
All files in item: 12.06 MB

Name: README
Size: 2.13 KB
Format: Unknown
Description: The dataset's README
MD5: f1c9617c36d83e3e7c673f09605b6bd2

Name: QA-dataset.zip
Size: 12.06 MB
Format: application/zip
Description: A zip file containing the dataset
MD5: 46dcd3f391cb7a37e8b6e5ff423f1949
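
The MD5 values listed for the two files can be used to check a download for corruption. Below is a minimal sketch using Python's `hashlib`; the function names and the assumption that the files are saved locally under their listed names are illustrative only and not part of the repository.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "README": "f1c9617c36d83e3e7c673f09605b6bd2",
    "QA-dataset.zip": "46dcd3f391cb7a37e8b6e5ff423f1949",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, name):
    """Return True if the file at `path` has the MD5 listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Calling `matches_record("QA-dataset.zip", "QA-dataset.zip")` after downloading would return `True` only if the local file is byte-identical to the published one. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.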
