dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Snæbjarnarson, Vésteinn
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.date.accessioned 2021-09-28T09:24:19Z
dc.date.available 2021-09-28T09:24:19Z
dc.date.issued 2021-07-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/127
dc.description Synthetic back-translated parallel training corpus for neural machine translation. An mBART25 model fine-tuned on English-Icelandic data created the corpus by translating Icelandic and English sentences. The English sentences (44.7 million) are retrieved from the Wikipedia, Newscrawl and Europarl corpora. The Icelandic sentences (31.3 million) are sourced from the Icelandic Gigaword Corpus. The source texts are the same as in last year's corpus, https://repository.clarin.is/repository/xmlui/handle/20.500.12537/70 . The files include all 4-5 beams generated by beam search during translation. The file format is tab-separated (.tsv) with two columns: line index and text.
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf
dc.relation.isreferencedby https://arxiv.org/abs/2109.07343
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://velthyding.is/
dc.subject machine translation
dc.subject back translation
dc.subject neural machine translation
dc.subject parallel corpus
dc.title En-Is Synthetic Parallel Corpus (21.07)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
demo.uri https://velthyding.is/
contact.person Vésteinn Snæbjarnarson vesteinn@mideind.is Miðeind ehf
sponsor Ministry of Education, Science and Culture; Back-translation data selection and filtering (V2b); Language Technology for Icelandic 2019-2023; nationalFunds
size.info 76000000 sentences
files.size 11568547022
files.count 12
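
The description above says each file is tab-separated with a line index and a text column, and that all 4-5 beams are included. A minimal Python sketch of reading such a file and collecting the candidates per index follows. It assumes that all beams for one source sentence share the same line index, which the record does not confirm; `group_beams` and the sample lines are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_beams(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Group candidate translations by their line index.

    Assumes each line is "<line-index>\t<text>" and that beams belonging to
    the same source sentence repeat the same index (an assumption, not
    something the record states).
    """
    beams: Dict[int, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are kept.
        index, _, text = line.partition("\t")
        beams[int(index)].append(text)
    return dict(beams)


if __name__ == "__main__":
    # Hypothetical sample in the assumed layout.
    sample = [
        "0\tHello world.\n",
        "0\tHello, world!\n",
        "1\tGood morning.\n",
    ]
    for index, candidates in group_beams(sample).items():
        print(index, candidates)
```

To run this on a real file, pass an open file handle instead of the sample list, for example `open(path, encoding="utf-8")`, where `path` points at one of the extracted `.sys` files. The larger files unpack to several gigabytes, so keeping every beam in memory may not be practical for them.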


 Files in this item

 Download all files in item (10.77 GB)
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).

Name: visindavefur.tar.gz
Size: 36.26 MB
Format: application/gzip
Description: Unknown
MD5: 64e8734fedb90e344ef6fb4dfe7503db
Contents:
  • visindavefur
    • visindavefur.sys (157 MB)
    • visindavefur.src.dup (160 MB)

Name: iswiki.tar.gz
Size: 54.95 MB
Format: application/gzip
Description: Unknown
MD5: 5fa4fa593c3ced75c78cf4b3a2a35342
Contents:
  • iswiki
    • iswiki.sys (231 MB)
    • iswiki.src.dup (234 MB)

Name: sjonvarpid.tar.gz
Size: 100.27 MB
Format: application/gzip
Description: Unknown
MD5: e013c01f3f5a4eab5b486e246fa3c37a
Contents:
  • sjonvarpid
    • sjonvarpid.sys (438 MB)
    • sjonvarpid.src.dup (459 MB)

Name: ras1_og_ras2.tar.gz
Size: 147.92 MB
Format: application/gzip
Description: Unknown
MD5: 22d67290d77a793292c531cb969c5cbb
Contents:
  • ras1_og_ras2
    • ras1_og_ras2.sys (663 MB)
    • ras1_og_ras2.src.dup (691 MB)

Name: domstolar.tar.gz
Size: 262.45 MB
Format: application/gzip
Description: Unknown
MD5: 8e88102fff9fbe5dcc1d7514c900ea09
Contents:
  • domstolar
    • domstolar.src.dup (1 GB)
    • domstolar.sys (1 GB)

Name: haestirettur.tar.gz
Size: 298.3 MB
Format: application/gzip
Description: Unknown
MD5: 690ae941061378d00f5f478a83a89ba2

Name: europarl.tar.gz
Size: 330.23 MB
Format: application/gzip
Description: Unknown
MD5: fc354627e1ffed9ac67b288a5330d278
Contents:
  • europarl
    • europarl.src.dup (1 GB)
    • europarl.detok.sys (1 GB)

Name: althingi.tar.gz
Size: 504.24 MB
Format: application/gzip
Description: Unknown
MD5: bcbe15bb6e629691941a38206e44fb1d
Contents:
  • althingi
    • althingi.sys (2 GB)
    • althingi.src.dup (2 GB)

Name: visir.tar.gz
Size: 691.33 MB
Format: application/gzip
Description: Unknown
MD5: 25d43976ee11635f65c61419a16eaf2c
Contents:
  • visir
    • visir.sys (2 GB)
    • visir.src.dup (2 GB)

Name: enwiki.tar.gz
Size: 1.47 GB
Format: application/gzip
Description: Unknown
MD5: 723a83b77fd2e0dc788250511ba04ae7
Contents:
  • enwiki
    • enwiki.detok.sys (4 GB)
    • enwiki.src.dup (4 GB)

Name: morgunbladid.tar.gz
Size: 2.01 GB
Format: application/gzip
Description: Unknown
MD5: b43158f5323b831b99ab61067956cdf4

Name: newscrawl.tar.gz
Size: 4.93 GB
Format: application/gzip
Description: Unknown
MD5: a4348009c86d6d83246b0a34aa55ebbc
Contents:
  • newscrawl
    • newscrawl.sys (17 GB)
    • newscrawl.src.dup (15 GB)
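
Each archive in the listing comes with an MD5 checksum, so a download can be checked before it is unpacked. The following is a minimal Python sketch using the standard `hashlib` module. The helper names `md5_of_file` and `matches_listing` are hypothetical, and the local path you pass in is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with a checksum copied from the listing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_listing("visindavefur.tar.gz", "64e8734fedb90e344ef6fb4dfe7503db")` should return `True` for an intact copy of the first archive, provided the file sits in the current directory. MD5 is used here only because it is what the listing provides; it detects transfer corruption but is not a strong integrity guarantee.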
