dc.contributor.author Steingrímsson, Steinþór
dc.date.accessioned 2021-10-01T07:46:04Z
dc.date.available 2021-10-01T07:46:04Z
dc.date.issued 2021-10-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/151
dc.description This is a BUCC-style dataset for testing the accuracy of methods that extract parallel sentence pairs from comparable corpora. It has 100 thousand sentences, of which 2,000 are parallel pairs; the other 98 thousand, for each language, are sentences randomly selected from the same domain (news). The parallel sentences are from the WMT 2021 news translation task, the randomly selected Icelandic sentences are from the news subcorpus of the Icelandic Gigaword Corpus, and the randomly selected English sentences are from the English newscrawl.
dc.language.iso isl
dc.language.iso eng
dc.publisher Reykjavík University
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject parallel corpora
dc.subject comparable corpora
dc.subject sentence alignment
dc.subject sentence filtering
dc.subject aligned pairs
dc.subject aligned sentence pairs
dc.subject machine translation
dc.title Icelandic-English Parallel Sentence Extraction Dataset 21.10
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
size.info 100000 sentences
files.size 8925170
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name comparable-extraction.zip
Size 8.51 MB
Format application/zip
Description Unknown
MD5 7d4929b90221014ec142cd4ee1e5143b
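
The MD5 value listed for the archive can be used to check a download for corruption. The following is a minimal sketch in Python using the standard `hashlib` module; the function names `md5_of_file` and `matches_record` and the local file path are illustrative assumptions and not part of the record.

```python
import hashlib

# MD5 checksum as listed in this record for comparable-extraction.zip.
EXPECTED_MD5 = "7d4929b90221014ec142cd4ee1e5143b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_record("comparable-extraction.zip")` on the downloaded file (the path is an assumption about where it was saved) should return `True` for an intact copy. MD5 is suitable here only as a check against transfer errors, not as a security guarantee.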
 File Preview  
    • news.en-is.training.gold-1 B
    • news.en-is.training.en-1 B
    • README-1 B
    • news.en-is.training.is-1 B
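
BUCC-style evaluations typically score an extraction method by precision, recall and F1 of the extracted sentence pairs against a gold list. The sketch below illustrates that computation in Python. The record does not document the layout of the gold file, so `read_pairs` assumes one pair per line with two tab-separated sentence IDs; that format, the function names, and the example IDs are all assumptions to check against the README in the archive.

```python
def score_extraction(predicted, gold):
    """Return (precision, recall, F1) for predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def read_pairs(path):
    """Read sentence-ID pairs from a file.

    Assumption: one pair per line, two tab-separated IDs. The actual
    format of the gold file is not stated in this record.
    """
    pairs = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.add((fields[0], fields[1]))
    return pairs


# Hypothetical IDs, for illustration only.
gold = {("is-1", "en-1"), ("is-2", "en-2")}
predicted = {("is-1", "en-1"), ("is-3", "en-9")}
print(score_extraction(predicted, gold))  # (0.5, 0.5, 0.5)
```

Treating pairs as sets means duplicate predictions are counted once, and an empty prediction list scores zero rather than raising a division error.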
