
 
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Snæbjarnarson, Vésteinn
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.date.accessioned 2020-09-29T15:37:24Z
dc.date.available 2020-09-29T15:37:24Z
dc.date.issued 2020-09-28
dc.identifier.uri http://hdl.handle.net/20.500.12537/70
dc.description Synthetic back-translated parallel training corpus for neural machine translation. The GreynirT2T Transformer network created the corpus by translating English and Icelandic sentences. The English sentences (44.7 million) come from the Wikipedia, Newscrawl and Europarl corpora. The Icelandic sentences (31.3 million) come from the Icelandic Gigaword Corpus (Risamálheildin).
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf.
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/127
dc.rights Icelandic Gigaword Corpus Part1
dc.rights.uri https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus-p1
dc.rights.label PUB
dc.source.uri https://github.com/mideind/GreynirT2T
dc.subject parallel corpus
dc.subject machine translation
dc.subject back translation
dc.subject neural machine translation
dc.title En-Is Synthetic Parallel Corpus (20.09)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson clarin@mideind.is Miðeind ehf.
sponsor Ministry of Education, Science and Culture; Back-translation data selection and filtering (V2b); Language Technology for Icelandic 2019-2023; nationalFunds
size.info 76000000 sentences
files.size 7528925378
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Icelandic Gigaword Corpus Part1
Name: eng-isl-synthetic-corpus-v1.0.tar.gz
Size: 7.01 GB
Format: application/gzip
Description: Backtranslated synthetic En-Is corpus for NMT
MD5: 541612f84c6502f721f2ca6e417436b7
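The MD5 value above can be used to check that a download completed intact. The following is a minimal sketch, not part of the record itself: the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever the archive was saved as.

```python
import hashlib

# MD5 digest as published in this record for eng-isl-synthetic-corpus-v1.0.tar.gz.
EXPECTED_MD5 = "541612f84c6502f721f2ca6e417436b7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("eng-isl-synthetic-corpus-v1.0.tar.gz")` on the saved archive should return `True` if the file matches the published digest. MD5 is adequate here as a transfer-integrity check only; it is not a security guarantee.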
File Preview
  • eng-isl-synthetic-corpus-v1.0
    • monolingual-eng
      • newscrawl.multi-year.en-is.tsv (7 GB)
      • enwiki-20161221.en-is.tsv (2 GB)
      • europarl-v9-en.en-is.tsv (609 MB)
    • monolingual-isl
      • rmh2018-2
        • visir-rmh2018-2.tsv (1 GB)
        • ras1_og_ras2-rmh2018-2.tsv (254 MB)
        • althingi-rmh2018-2.tsv (904 MB)
        • haestirettur-rmh2018-2.tsv (610 MB)
        • iswiki-rmh2018-2.tsv (87 MB)
        • domstolar-rmh2018-2.tsv (533 MB)
        • sjonvarpid-rmh2018-2.tsv (167 MB)
      • rmh2018-1
        • morgunbladid-rmh2018-1.tsv (2 GB)
        • visindavefur-rmh2018-1.tsv (60 MB)
    • README (2 kB)
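The TSV files listed above can be read straight from the compressed archive without unpacking it first. The sketch below rests on assumptions: the record does not state the column layout or the encoding, so the code assumes UTF-8 text and only splits each line on tabs, leaving the meaning of each column to the README inside the archive. The function name `iter_tsv_rows` is hypothetical.

```python
import io
import tarfile


def iter_tsv_rows(archive_path, member_suffix=".tsv", limit=None):
    """Yield (member_name, fields) for each line of every .tsv member
    in a gzip-compressed tar archive.

    Assumptions (not stated in the record): files are UTF-8 encoded and
    tab-separated. `limit` caps the number of lines read per member.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and non-TSV files such as the README.
            if not (member.isfile() and member.name.endswith(member_suffix)):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8")
            for line_number, line in enumerate(text):
                if limit is not None and line_number >= limit:
                    break
                yield member.name, line.rstrip("\n").split("\t")
```

For a first look at the data, `for name, fields in iter_tsv_rows("eng-isl-synthetic-corpus-v1.0.tar.gz", limit=3): print(name, fields)` prints the first three rows of each TSV file. Because the members are several gigabytes each, streaming with a small `limit` is considerably cheaper than extracting everything.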
