
 
dc.contributor.author Barkarson, Starkaður
dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2020-05-28T13:22:46Z
dc.date.available 2020-05-28T13:22:46Z
dc.date.issued 2020-05-28
dc.identifier.uri http://hdl.handle.net/20.500.12537/24
dc.description Three dev/test sets for MT quality estimation, created from subcorpora of ParIce. The sets contain English-Icelandic segment pairs: one is made up of subtitle segments from OpenSubtitles, one of segments from drug descriptions (package leaflets) distributed by the European Medicines Agency (EMA), and one of segments from EEA documents. The dev/test pairs were manually reviewed to ensure that all of them are correct. The goal was to create dev/test sets with a combined total of at least 3,000 correct translation segments from each subcorpus. In every dev/test pair the English segment contains four or more words. The OpenSubtitles set contains 1,531/1,532 segment pairs in dev/test. In addition, it includes 2,277 segment pairs with fewer than four words on the English side and 777 segment pairs with incorrect alignments or translations. The training set contains 1,298,489 segment pairs, which have not been manually checked for errors. The OpenSubtitles sets are compiled with a Python script that downloads the segments and creates the splits. The EMA set contains 2,254/2,255 segment pairs in dev/test. In addition, it includes 491 segment pairs with fewer than four words on the English side and 240 segment pairs with incorrect alignments or translations. The training set contains 399,093 segment pairs, which have not been manually checked for errors. The EEA set contains 22 whole documents. Documents containing between 100 and 200 sentences were selected at random from the corpus until more than 3,000 sentence pairs had been collected. Alignments and translations in these documents were manually corrected, and longer sentences were split into smaller parts where possible. The split consists of 2,292/2,396 dev/test segment pairs and 1,697,927 training segment pairs that have not been manually checked.
dc.language.iso isl
dc.language.iso eng
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreferencedby https://www.aclweb.org/anthology/W19-6115/
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri http://parice.arnastofnun.is
dc.subject parallel corpus
dc.subject machine translation
dc.subject test sets
dc.title ParIce Dev/Test/Train Splits 20.05
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
sponsor Ministry of Education, Science and Culture Parallel Corpora (V2) Language Technology for Icelandic 2019-2023 nationalFunds
size.info 12260 sentences
files.size 141948691
files.count 1
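
The size.info figure can be cross-checked against the dev/test counts given in dc.description. The following is a minimal Python sketch that uses only the numbers stated in this record; the dictionary layout and variable names are illustrative and not part of the distribution.

```python
# Dev/test segment-pair counts as stated in dc.description.
dev_test_counts = {
    "opensubtitles": (1531, 1532),
    "ema": (2254, 2255),
    "eea": (2292, 2396),
}

# Combined dev + test total for each subcorpus.
totals = {name: dev + test for name, (dev, test) in dev_test_counts.items()}

# Sum over all three subcorpora.
grand_total = sum(totals.values())

print(totals)
print(grand_total)
```

Each per-subcorpus total is above the stated goal of 3,000 pairs, and the grand total agrees with the 12260 sentences reported in size.info.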


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name Parice_dev_test.20.05.zip
Size 135.37 MB
Format application/zip
Description ParIce Dev/Test sets
MD5 4369f07b24e4a51630d8860997e070ef
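
The published MD5 value can be used to check that a download is intact. The sketch below uses Python's standard hashlib module; the function names and the local file path in the usage note are assumptions for illustration, not part of the distribution.

```python
import hashlib

# MD5 value published in this record for Parice_dev_test.20.05.zip.
EXPECTED_MD5 = "4369f07b24e4a51630d8860997e070ef"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `matches_published_md5("Parice_dev_test.20.05.zip")` should return True for an uncorrupted download. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.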
 File Preview  
  • Parice_dev_test.20.05
    • csv
      • ema
        • ema_test_short_sentences_en.csv (10 kB)
        • ema_test_en.csv (253 kB)
        • ema_dev_meta.csv (14 kB)
        • ema_train_en.csv (38 MB)
        • ema_train_is.csv (41 MB)
        • ema_rest_is.csv (16 kB)
        • ema_dev_is.csv (264 kB)
        • ema_rest_en.csv (15 kB)
        • ema_test_short_sentences_meta.csv (3 kB)
        • ema_dev_en.csv (243 kB)
        • ema_test_is.csv (274 kB)
        • ema_test_short_sentences_is.csv (10 kB)
        • ema_test_meta.csv (14 kB)
        • ema_train_meta.csv (2 MB)
        • ema_rest_meta.csv (1 kB)
      • opensubtitles
        • opensubtitles_dev_en.csv (62 kB)
        • opensubtitles_dev_meta.csv (10 kB)
        • opensubtitles_test_is.csv (66 kB)
        • opensubtitles_test_meta.csv (10 kB)
        • opensubtitles_test_short_sentences_is.csv (36 kB)
        • opensubtitles_test_en.csv (63 kB)
        • opensubtitles_test_short_sentences_en.csv (32 kB)
        • opensubtitles_rest_is.csv (29 kB)
        • opensubtitles_rest_meta.csv (5 kB)
        • opensubtitles_test_short_sentences_meta.csv (15 kB)
        • opensubtitles_rest_en.csv (29 kB)
        • opensubtitles_dev_is.csv (65 kB)
      • eea
        • eea_test_is.csv (197 kB)
        • eea_dev_is.csv (184 kB)
        • eea_dev_meta.csv (45 kB)
        • eea_test_en.csv (190 kB)
        • eea_test_meta.csv (47 kB)
        • eea_train_is.csv (162 MB)
        • eea_dev_en.csv (177 kB)
        • eea_train_en.csv (171 MB)
        • eea_train_meta.csv (29 MB)
    • README.txt (3 kB)
    • opensubtitles
      • read_opus.py (6 kB)
      • links_to_opus
        • links_is.txt (108 MB)
        • links_en.txt (113 MB)
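
The CSV names in the listing appear to follow the pattern `<subcorpus>_<split>_<en|is|meta>.csv`. Under that assumption, which is inferred from the file names only and not confirmed by the README, the English, Icelandic and metadata files of each split can be grouped as in the following Python sketch. The function name and regular expression are hypothetical, and nothing is assumed here about the contents of the files.

```python
import re
from collections import defaultdict

# Naming pattern inferred from the file listing (an assumption):
# <subcorpus>_<split>_<en|is|meta>.csv
NAME_PATTERN = re.compile(
    r"^(?P<subcorpus>ema|eea|opensubtitles)_"
    r"(?P<split>test_short_sentences|dev|test|train|rest)_"
    r"(?P<kind>en|is|meta)\.csv$"
)


def group_csv_names(names):
    """Group file names by (subcorpus, split); names that do not match are skipped."""
    groups = defaultdict(dict)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue
        key = (match["subcorpus"], match["split"])
        groups[key][match["kind"]] = name
    return dict(groups)


example = group_csv_names([
    "ema_dev_en.csv",
    "ema_dev_is.csv",
    "ema_dev_meta.csv",
    "opensubtitles_test_short_sentences_en.csv",
    "README.txt",
])
print(example)
```

Grouping by name keeps the two language sides and the metadata of a split together before any file is opened, which makes it easy to spot a split with a missing side.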
