Icelandic WinoGrande v1.0

Snæbjarnarson, Vésteinn; Símonarson, Haukur Barri; Ragnarsson, Pétur Orri; Ingólfsdóttir, Svanhvít Lilja; Jónsson, Haukur Páll; Þorsteinsson, Vilhjálmur; Einarsson, Hafsteinn

Clarin IS Repository

Authors: Snæbjarnarson, Vésteinn; Símonarson, Haukur Barri; Ragnarsson, Pétur Orri; Ingólfsdóttir, Svanhvít Lilja; Jónsson, Haukur Páll; Þorsteinsson, Vilhjálmur; Einarsson, Hafsteinn

Item identifier: http://hdl.handle.net/20.500.12537/170

Referenced by: https://arxiv.org/abs/2201.05601

Date issued: 2022-01-17

Type: corpus, text

Size: 1095 sentences

Language(s): Icelandic

Description: The Icelandic WinoGrande dataset v. 1.0 (Icelandic: Íslenska WinoGrande málheildin útg. 1.0). The WinoGrande dataset (Sakaguchi et al., 2020), used to evaluate the common-sense capabilities of neural language models, is inspired by the original Winograd Schema Challenge dataset (Levesque et al., 2012), but its problems are designed to minimize biases that models could exploit when solving them. We systematically went through the WinoGrande test set (1767 examples), translating and adapting the sentences to work in Icelandic. Although the English WinoGrande problems are not always constructed as pairs, our adaptation creates sentence pairs wherever feasible. Some examples proved to be culture-specific, subjective, or otherwise unsuitable for translation; these were either adjusted or skipped. The result is a dataset of 1095 examples, which makes it closest in size to the small variant of the English dataset (640 examples). The data is released in five parts that form a five-fold split, together with a Python script that should be used to generate the train and development sets.

Publisher: Miðeind ehf

Subject(s): natural language understanding; nlu; winogrande; coreference resolution

Collection(s): Clarin IS
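
The description says the data ships as a five-fold split with a script for generating train and development sets. The sketch below only illustrates what such a split conventionally means: each of the five files is held out once as the development set while the other four form the training set. It is not the bundled create_cross_validation_splits.py, whose exact behavior is not shown in this record. The function names (read_jsonl, cross_validation_folds, load_fold) are hypothetical, and the examples are treated as opaque JSON objects because the record does not list their fields.

```python
import json
from pathlib import Path

# File names as listed in this record. The fold logic below is an assumption
# about a conventional five-fold setup, not the bundled script's behavior.
SPLIT_FILES = [f"data-{i:02d}.jsonl" for i in range(1, 6)]


def read_jsonl(path):
    """Read one JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def cross_validation_folds(files=SPLIT_FILES):
    """Yield (train_files, dev_file) pairs, holding out each file once."""
    for held_out in files:
        train_files = [name for name in files if name != held_out]
        yield train_files, held_out


def load_fold(data_dir, train_files, dev_file):
    """Load the examples of one fold from a local directory of downloads."""
    data_dir = Path(data_dir)
    train = [ex for name in train_files for ex in read_jsonl(data_dir / name)]
    dev = read_jsonl(data_dir / dev_file)
    return train, dev
```

For results that should be comparable with published numbers, the bundled script remains the reference; this sketch is only meant to make the fold structure concrete.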

Files in this item

Total size of all files in item: 305.47 KB

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).

Name: data-01.jsonl
Size: 58.36 KB
Format: Unknown
Description: Split 1
MD5: 31bfcc3e5497341e437b26e2c7ffc205

Name: data-02.jsonl
Size: 60.58 KB
Format: Unknown
Description: Split 2
MD5: 0e33703c2895ad8bd2dd3375b0c6ab7c

Name: data-03.jsonl
Size: 61.4 KB
Format: Unknown
Description: Split 3
MD5: 9435ec65648e27d6ac48e4496912f12b

Name: data-04.jsonl
Size: 60.65 KB
Format: Unknown
Description: Split 4
MD5: e162e5e96e9e01172055dffbf30a37fa

Name: data-05.jsonl
Size: 61.76 KB
Format: Unknown
Description: Split 5
MD5: 12937f1f7eddeee1c0dc60acdf5e49df

Name: create_cross_validation_splits.py
Size: 1.19 KB
Format: Unknown
Description: Train validation generation script
MD5: 57c58221f4c0c5dabf164e3151f11b26

Name: README.txt
Size: 1.53 KB
Format: Text file
Description: Readme
MD5: 9d00689c7d56430f48a06cae4cbb744a
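
The MD5 values listed for each file can be used to check that a download is complete and unmodified. The following is a minimal sketch of such a check, assuming the seven files have been saved under their listed names in one local directory. The helper names (md5_of, verify) are hypothetical and not part of the dataset; the digests are copied from the listing in this record.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing in this record.
EXPECTED_MD5 = {
    "data-01.jsonl": "31bfcc3e5497341e437b26e2c7ffc205",
    "data-02.jsonl": "0e33703c2895ad8bd2dd3375b0c6ab7c",
    "data-03.jsonl": "9435ec65648e27d6ac48e4496912f12b",
    "data-04.jsonl": "e162e5e96e9e01172055dffbf30a37fa",
    "data-05.jsonl": "12937f1f7eddeee1c0dc60acdf5e49df",
    "create_cross_validation_splits.py": "57c58221f4c0c5dabf164e3151f11b26",
    "README.txt": "9d00689c7d56430f48a06cae4cbb744a",
}


def md5_of(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Return {file name: 'ok' | 'mismatch' | 'missing'} for each expected file."""
    directory = Path(directory)
    report = {}
    for name, want in expected.items():
        path = directory / name
        if not path.is_file():
            report[name] = "missing"
        else:
            report[name] = "ok" if md5_of(path) == want else "mismatch"
    return report
```

Calling verify("path/to/download") returns a per-file status, so a truncated or corrupted download shows up as "mismatch" rather than failing silently later during training or evaluation.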
