dc.contributor.author | Arnardóttir, Þórunn |
dc.contributor.author | Einarsson, Elías Bjartur |
dc.date.accessioned | 2024-10-30T14:31:41Z |
dc.date.available | 2024-10-30T14:31:41Z |
dc.date.issued | 2024-10-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/347 |
dc.description | A question-answering dataset intended to measure a large language model's knowledge of Icelandic culture and history and its ability to answer questions correctly. The dataset is split into two parts, a gold corpus and a silver corpus. The gold corpus consists of 2,000 manually reviewed question-answer pairs, while the silver corpus consists of 10,644 question-answer pairs that have not been manually reviewed. All pairs were originally generated automatically by GPT-4-turbo from Icelandic Wikipedia articles and online news from RÚV, both of which are included in the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/236). In the gold corpus, 1,900 pairs are from Wikipedia articles and 100 pairs are from news texts. In the silver corpus, 9,610 pairs are from Wikipedia articles and 1,034 pairs are from news texts. The dataset is published as JSONL files in a format compatible with many question-answering datasets. Additionally, the gold corpus is published in formats compatible with BIG-bench and OpenAI-evals. For more information on the formats, see the README. |
dc.language.iso | isl |
dc.publisher | Miðeind ehf. |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | question answering |
dc.subject | qa |
dc.subject | large language models |
dc.subject | llm |
dc.subject | culture |
dc.subject | history |
dc.title | Icelandic Culture and History QA Dataset 24.10 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Þórunn Arnardóttir; thorunn@mideind.is; Miðeind ehf. |
sponsor | Ministry of Education, Science and Culture; Benchmarking datasets for LLMs (G12); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 12644 entries |
files.size | 12659899 |
files.count | 2 |
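The pair counts given in the description can be checked against `size.info` once the archive is unpacked. The following is a minimal sketch, assuming each JSONL file holds one JSON object per line; the `<split>/<source>.jsonl` layout used by `check_counts` is an assumption about the unpacked archive, and no field names inside the records are assumed.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError on a malformed record
            count += 1
    return count


# Pair counts as stated in the item description.
EXPECTED = {
    ("gold", "wikipedia"): 1900,
    ("gold", "news"): 100,
    ("silver", "wikipedia"): 9610,
    ("silver", "news"): 1034,
}


def expected_total():
    """Sum of the stated counts; should equal the size.info value."""
    return sum(EXPECTED.values())


def check_counts(root):
    """Map (split, source) to (actual, expected) counts.

    Assumes files live at root/<split>/<source>.jsonl, which is a guess
    about the unpacked layout rather than something the record confirms.
    """
    result = {}
    for (split, source), expected in EXPECTED.items():
        path = Path(root) / split / f"{source}.jsonl"
        result[(split, source)] = (count_jsonl_records(path), expected)
    return result
```

The stated counts sum to 1,900 + 100 + 9,610 + 1,034 = 12,644, which agrees with the `size.info` value of 12644 entries.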
Files in this item
- Name:
- Size: 2.13 KB
- Format: Unknown
- Description: The item's README
- MD5: f1c9617c36d83e3e7c673f09605b6bd2
- Name: QA-dataset-v2.zip
- Size: 12.07 MB
- Format: application/zip
- Description: A zip file containing all relevant files
- MD5: 22426fbed14b3ae06cf19ac475348369
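The MD5 values listed for the files can be used to confirm that a download is intact. This is a minimal sketch using Python's standard `hashlib`; the local file path passed to the functions is an assumption, and the helper names are illustrative only.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# MD5 listed for QA-dataset-v2.zip in this item.
LISTED_MD5 = "22426fbed14b3ae06cf19ac475348369"


def matches_listed_md5(path, listed=LISTED_MD5):
    """True if the file at `path` (a hypothetical local copy) has the listed MD5."""
    return md5_of_file(path) == listed.lower()
```

For example, `matches_listed_md5("QA-dataset-v2.zip")` would compare a local copy of the archive against the checksum shown above.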
- QA-dataset-v2
- gold
- news.jsonl (1 B)
- wikipedia.jsonl (1 B)
- OpenAI-evals
- news.jsonl (1 B)
- wikipedia.jsonl (1 B)
- silver
- news.jsonl (1 B)
- wikipedia.jsonl (1 B)
- BIG-bench
- news.jsonl (1 B)
- wikipedia.jsonl (1 B)
- gold