dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2020-05-28T13:22:46Z |
dc.date.available | 2020-05-28T13:22:46Z |
dc.date.issued | 2020-05-28 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/24 |
dc.description | Three dev/test sets for MT quality estimation, created from subcorpora of ParIce. The sets contain English-Icelandic segment pairs: one is made up of subtitle segments from OpenSubtitles, one of segments from drug descriptions distributed by the European Medicines Agency (EMA), and one of segments from EEA documents. The dev/test sets were manually checked so that all pairs are correct. The goal was to create dev/test sets with a total of at least 3,000 correct translation segments from each subcorpus. All dev/test segments have four or more words on the English side. The OpenSubtitles set contains 1,531/1,532 segment pairs in dev/test. It also includes 2,277 segment pairs with fewer than four words on the English side and 777 segment pairs with incorrect alignments or translations. The training set contains 1,298,489 segment pairs, which have not been manually checked for errors. The OpenSubtitles sets are compiled using a Python script that downloads the segments and creates the splits. The EMA set contains 2,254/2,255 segment pairs in dev/test. It also includes 491 segment pairs with fewer than four words on the English side and 240 segment pairs with incorrect alignments or translations. The training set contains 399,093 segment pairs, which have not been manually checked for errors. The EEA set contains 22 whole documents. Documents with between 100 and 200 sentences were selected at random until more than 3,000 sentence pairs had been reached. Alignments and translations in these documents were manually corrected, and longer sentences were split into smaller parts where possible. The split consists of 2,292/2,396 dev/test segment pairs and 1,697,927 training segment pairs that have not been manually checked. |
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | The Árni Magnússon Institute for Icelandic Studies |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/W19-6115/ |
dc.relation.isreplacedby | http://hdl.handle.net/20.500.12537/146 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://parice.arnastofnun.is |
dc.subject | parallel corpus |
dc.subject | machine translation |
dc.subject | test sets |
dc.title | ParIce Dev/Test/Train Splits 20.05 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Steinþór Steingrímsson; steinthor.steingrimsson@arnastofnun.is; The Árni Magnússon Institute for Icelandic Studies |
sponsor | Ministry of Education, Science and Culture; Parallel Corpora (V2); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 12260 sentences |
files.size | 141948691 |
files.count | 1 |
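The size.info figure can be cross-checked against the description: it should equal the sum of the dev and test counts over the three subcorpora. A minimal sketch in Python, with the counts copied from dc.description (the dictionary name is illustrative only, not part of the dataset):

```python
# Dev/test segment-pair counts as stated in dc.description.
# DEV_TEST_COUNTS is a hypothetical name used only for this check.
DEV_TEST_COUNTS = {
    "opensubtitles": (1531, 1532),
    "ema": (2254, 2255),
    "eea": (2292, 2396),
}

# Sum dev + test over all three subcorpora.
total = sum(dev + test for dev, test in DEV_TEST_COUNTS.values())
print(total)  # 12260, the value given in size.info
```

The short-sentence, "rest" and training segments are not part of this total.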
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

- Name: Parice_dev_test.20.05.zip
- Size: 135.37 MB
- Format: application/zip
- Description: ParIce Dev/Test sets
- MD5: 4369f07b24e4a51630d8860997e070ef
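The MD5 value listed for the archive can be used to check a download for corruption. The following is a minimal sketch using only Python's standard hashlib module; the function names and the local file path are assumptions for illustration, not something the repository provides.

```python
import hashlib

# Checksum copied from the MD5 field of this record.
EXPECTED_MD5 = "4369f07b24e4a51630d8860997e070ef"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved under its listed name in the working directory, `matches_expected("Parice_dev_test.20.05.zip")` should return True for an intact download.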
- Parice_dev_test.20.05
- csv
- ema
- ema_test_short_sentences_en.csv (10 kB)
- ema_test_en.csv (253 kB)
- ema_dev_meta.csv (14 kB)
- ema_train_en.csv (38 MB)
- ema_train_is.csv (41 MB)
- ema_rest_is.csv (16 kB)
- ema_dev_is.csv (264 kB)
- ema_rest_en.csv (15 kB)
- ema_test_short_sentences_meta.csv (3 kB)
- ema_dev_en.csv (243 kB)
- ema_test_is.csv (274 kB)
- ema_test_short_sentences_is.csv (10 kB)
- ema_test_meta.csv (14 kB)
- ema_train_meta.csv (2 MB)
- ema_rest_meta.csv (1 kB)
- opensubtitles
- opensubtitles_dev_en.csv (62 kB)
- opensubtitles_dev_meta.csv (10 kB)
- opensubtitles_test_is.csv (66 kB)
- opensubtitles_test_meta.csv (10 kB)
- opensubtitles_test_short_sentences_is.csv (36 kB)
- opensubtitles_test_en.csv (63 kB)
- opensubtitles_test_short_sentences_en.csv (32 kB)
- opensubtitles_rest_is.csv (29 kB)
- opensubtitles_rest_meta.csv (5 kB)
- opensubtitles_test_short_sentences_meta.csv (15 kB)
- opensubtitles_rest_en.csv (29 kB)
- opensubtitles_dev_is.csv (65 kB)
- eea
- eea_test_is.csv (197 kB)
- eea_dev_is.csv (184 kB)
- eea_dev_meta.csv (45 kB)
- eea_test_en.csv (190 kB)
- eea_test_meta.csv (47 kB)
- eea_train_is.csv (162 MB)
- eea_dev_en.csv (177 kB)
- eea_train_en.csv (171 MB)
- eea_train_meta.csv (29 MB)
- ema
- README.txt (3 kB)
- opensubtitles
- read_opus.py (6 kB)
- links_to_opus
- links_is.txt (108 MB)
- links_en.txt (113 MB)
- csv
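The listing follows a naming pattern in which each split has an English file ending in _en.csv and an Icelandic counterpart ending in _is.csv. As a rough sketch, the matching files in one extracted subfolder could be paired by name as shown here; the function name is hypothetical, and nothing is assumed about the contents of the files, whose format should be taken from README.txt.

```python
from pathlib import Path


def pair_parallel_files(folder):
    """Map each '<subcorpus>_<split>' prefix to its (English, Icelandic) file paths.

    Relies only on the *_en.csv / *_is.csv naming seen in the file listing.
    """
    folder = Path(folder)
    pairs = {}
    for en_path in sorted(folder.glob("*_en.csv")):
        # Strip the language suffix to get the shared prefix, e.g. 'ema_dev'.
        prefix = en_path.name[: -len("_en.csv")]
        is_path = folder / f"{prefix}_is.csv"
        # Keep only prefixes for which both language files are present.
        if is_path.exists():
            pairs[prefix] = (en_path, is_path)
    return pairs
```

For the ema subfolder in the listing, this would be expected to yield the prefixes ema_dev, ema_rest, ema_test, ema_test_short_sentences and ema_train, with the _meta.csv files left out.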