dc.contributor.author | Steingrímsson, Steinþór |
dc.date.accessioned | 2021-10-01T07:46:04Z |
dc.date.available | 2021-10-01T07:46:04Z |
dc.date.issued | 2021-10-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/151 |
dc.description | This is a BUCC-style dataset for testing the accuracy of parallel sentence extraction from comparable corpora. It contains 100 thousand sentences, of which 2000 are parallel pairs; the other 98 thousand for each language are randomly selected sentences from the same domain (news). The parallel sentences come from the WMT 2021 news translation task, the randomly selected Icelandic sentences from the news subcorpus of the Icelandic Gigaword Corpus, and the randomly selected English sentences from the English newscrawl. |
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | Reykjavík University |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | parallel corpora |
dc.subject | comparable corpora |
dc.subject | sentence alignment |
dc.subject | sentence filtering |
dc.subject | aligned pairs |
dc.subject | aligned sentence pairs |
dc.subject | machine translation |
dc.title | Icelandic-English Parallel Sentence Extraction Dataset 21.10 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies |
size.info | 100000 sentences |
files.size | 8925170 |
files.count | 1 |
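Extraction tasks of this kind are commonly scored by comparing the sentence pairs a system extracts against the known parallel pairs, using precision, recall and F1. The following is a minimal sketch of that scoring, not the dataset's official evaluation script. The function name `score_extraction` and the representation of a pair as an `(icelandic_id, english_id)` tuple are assumptions made for illustration; the actual file layout inside the archive is not described in this record.

```python
def score_extraction(predicted_pairs, gold_pairs):
    """Return (precision, recall, f1) for extracted sentence pairs.

    Both arguments are iterables of (icelandic_id, english_id) tuples.
    The ID format is hypothetical; adapt it to the files in the archive.
    """
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)

    # A predicted pair counts as correct only if it matches a gold pair exactly.
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With the 2000 gold pairs loaded into `gold_pairs`, a call such as `score_extraction(system_output, gold_pairs)` would give one number per metric for a given extraction method.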
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | comparable-extraction.zip |
Size | 8.51 MB |
Format | application/zip |
Description | Unknown |
MD5 | 7d4929b90221014ec142cd4ee1e5143b |
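
The MD5 checksum and byte size listed in this record can be used to check that a download of comparable-extraction.zip is complete. The sketch below uses only the Python standard library; the function names `md5_of_file` and `verify_download` are illustrative, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size).
EXPECTED_MD5 = "7d4929b90221014ec142cd4ee1e5143b"
EXPECTED_SIZE = 8925170  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file matches both the expected size and MD5."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

For example, `verify_download("comparable-extraction.zip")` returns `True` only when both the size and the checksum agree with the values above.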