dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.date.accessioned | 2021-09-28T09:24:19Z |
dc.date.available | 2021-09-28T09:24:19Z |
dc.date.issued | 2021-07-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/127 |
dc.description | Synthetic parallel corpus of back-translated training data for neural machine translation. An mBART25 model fine-tuned on English-Icelandic data was used to create the corpus by translating Icelandic and English sentences. The English sentences (44.7 million) are taken from the Wikipedia, Newscrawl and Europarl corpora. The Icelandic sentences (31.3 million) are sourced from the Icelandic Gigaword Corpus. The source texts are the same as in last year's corpus, https://repository.clarin.is/repository/xmlui/handle/20.500.12537/70 . The files include all 4-5 beams produced by beam search during translation. The file format is tab-separated (line index, text). |
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | Miðeind ehf |
dc.relation.isreferencedby | https://arxiv.org/abs/2109.07343 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://velthyding.is/ |
dc.subject | machine translation |
dc.subject | back translation |
dc.subject | neural machine translation |
dc.subject | parallel corpus |
dc.title | En-Is Synthetic Parallel Corpus (21.07) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://velthyding.is/ |
contact.person | Vésteinn Snæbjarnarson, vesteinn@mideind.is, Miðeind ehf |
sponsor | Ministry of Education, Science and Culture; Back-translation data selection and filtering (V2b); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 76000000 sentences |
files.size | 11568547022 |
files.count | 12 |
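The description says each file is tab-separated (line index, text) and contains all 4-5 beams per translated sentence. The sketch below shows one way such a file could be read and the beams grouped per source line. It rests on assumptions not confirmed by this record: that the line index is an integer, and that all beams for one source sentence share the same index and appear on consecutive rows. The function names `read_rows` and `group_beams` and the sample sentences are hypothetical.

```python
import io
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def read_rows(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_index, text) pairs from tab-separated lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text survive.
        index, _, text = line.partition("\t")
        yield int(index), text  # assumption: the index is an integer


def group_beams(rows: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, List[str]]]:
    """Group consecutive rows that share a line index (assumed to be beams)."""
    for index, group in groupby(rows, key=lambda row: row[0]):
        yield index, [text for _, text in group]


# Made-up sample standing in for a .sys file; real files are far larger.
sample = io.StringIO("0\tHello world.\n0\tHello, world.\n1\tGood morning.\n")
beams = list(group_beams(read_rows(sample)))
```

For a real file, the same functions could be fed an open text file handle (for example `open(path, encoding="utf-8")`, assuming UTF-8), which keeps memory use low because rows are streamed rather than loaded at once.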
Files in this item
Total download size: 10.77 GB
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name | Size | Format | Description | MD5 | Archive contents (file size) |
visindavefur.tar.gz | 36.26 MB | application/gzip | Unknown | 64e8734fedb90e344ef6fb4dfe7503db | visindavefur/: visindavefur.sys (157 MB), visindavefur.src.dup (160 MB) |
iswiki.tar.gz | 54.95 MB | application/gzip | Unknown | 5fa4fa593c3ced75c78cf4b3a2a35342 | iswiki/: iswiki.sys (231 MB), iswiki.src.dup (234 MB) |
sjonvarpid.tar.gz | 100.27 MB | application/gzip | Unknown | e013c01f3f5a4eab5b486e246fa3c37a | sjonvarpid/: sjonvarpid.sys (438 MB), sjonvarpid.src.dup (459 MB) |
ras1_og_ras2.tar.gz | 147.92 MB | application/gzip | Unknown | 22d67290d77a793292c531cb969c5cbb | ras1_og_ras2/: ras1_og_ras2.sys (663 MB), ras1_og_ras2.src.dup (691 MB) |
domstolar.tar.gz | 262.45 MB | application/gzip | Unknown | 8e88102fff9fbe5dcc1d7514c900ea09 | domstolar/: domstolar.src.dup (1 GB), domstolar.sys (1 GB) |
haestirettur.tar.gz | 298.3 MB | application/gzip | Unknown | 690ae941061378d00f5f478a83a89ba2 | haestirettur/: haestirettur.src.dup (1 GB), haestirettur.sys (1 GB) |
europarl.tar.gz | 330.23 MB | application/gzip | Unknown | fc354627e1ffed9ac67b288a5330d278 | europarl/: europarl.src.dup (1 GB), europarl.detok.sys (1 GB) |
althingi.tar.gz | 504.24 MB | application/gzip | Unknown | bcbe15bb6e629691941a38206e44fb1d | althingi/: althingi.sys (2 GB), althingi.src.dup (2 GB) |
visir.tar.gz | 691.33 MB | application/gzip | Unknown | 25d43976ee11635f65c61419a16eaf2c | visir/: visir.sys (2 GB), visir.src.dup (2 GB) |
enwiki.tar.gz | 1.47 GB | application/gzip | Unknown | 723a83b77fd2e0dc788250511ba04ae7 | enwiki/: enwiki.detok.sys (4 GB), enwiki.src.dup (4 GB) |
morgunbladid.tar.gz | 2.01 GB | application/gzip | Unknown | b43158f5323b831b99ab61067956cdf4 | morgunbladid/: morgunbladid.sys (7 GB), morgunbladid.src.dup (7 GB) |
newscrawl.tar.gz | 4.93 GB | application/gzip | Unknown | a4348009c86d6d83246b0a34aa55ebbc | newscrawl/: newscrawl.sys (17 GB), newscrawl.src.dup (15 GB) |
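Each archive listed above comes with an MD5 checksum, which can be used to confirm that a download arrived intact. A minimal sketch using Python's standard `hashlib` follows. The helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever name the archive was saved under.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the checksum published in the record."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("visindavefur.tar.gz", "64e8734fedb90e344ef6fb4dfe7503db")` would be expected to return `True` for an intact copy of the smallest archive. MD5 is suitable here only as a check against corrupted transfers, not as protection against deliberate tampering.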