dc.contributor.author |
Símonarson, Haukur Barri |
dc.contributor.author |
Snæbjarnarson, Vésteinn |
dc.contributor.author |
Þorsteinsson, Vilhjálmur |
dc.date.accessioned |
2020-09-29T15:37:24Z |
dc.date.available |
2020-09-29T15:37:24Z |
dc.date.issued |
2020-09-28 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/70 |
dc.description |
Synthetic parallel corpus of back-translated training data for neural machine translation. The GreynirT2T Transformer network created the corpus by translating English and Icelandic sentences. The English sentences (44.7 million) come from the Wikipedia, Newscrawl and Europarl corpora. The Icelandic sentences (31.3 million) come from the Icelandic Gigaword Corpus (Risamálheild). |
dc.language.iso |
isl |
dc.language.iso |
eng |
dc.publisher |
Miðeind ehf. |
dc.relation.isreplacedby |
http://hdl.handle.net/20.500.12537/127 |
dc.rights |
Icelandic Gigaword Corpus Part1 |
dc.rights.uri |
https://repository.clarin.is/repository/xmlui/page/license-gigaword-corpus-p1 |
dc.rights.label |
PUB |
dc.source.uri |
https://github.com/mideind/GreynirT2T |
dc.subject |
parallel corpus |
dc.subject |
machine translation |
dc.subject |
back translation |
dc.subject |
neural machine translation |
dc.title |
En-Is Synthetic Parallel Corpus (20.09) |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
Clarin IS Repository |
contact.person |
Vilhjálmur Þorsteinsson; clarin@mideind.is; Miðeind ehf. |
sponsor |
Ministry of Education, Science and Culture; Back-translation data selection and filtering (V2b); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info |
76000000 sentences |
files.size |
7528925378 |
files.count |
1 |
Files in this item
This item is Publicly Available and licensed under: Icelandic Gigaword Corpus Part1
- Name: eng-isl-synthetic-corpus-v1.0.tar.gz
- Size: 7.01 GB
- Format: application/gzip
- Description: Backtranslated synthetic En-Is corpus for NMT
- MD5: 541612f84c6502f721f2ca6e417436b7
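The MD5 value listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch, assuming the archive was saved under its listed name in the current working directory; the function names `md5_of_file` and `verify_md5` and the chunk size are illustrative and not part of the repository record.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing in this record.
ARCHIVE_NAME = "eng-isl-synthetic-corpus-v1.0.tar.gz"
EXPECTED_MD5 = "541612f84c6502f721f2ca6e417436b7"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat, which matters for a ~7 GB archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Compare the file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = Path(ARCHIVE_NAME)  # assumed download location
    if archive.exists():
        print("OK" if verify_md5(archive, EXPECTED_MD5) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch most likely indicates an incomplete or corrupted download. MD5 is suitable here as an integrity check only, not as protection against deliberate tampering.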
Preview
- eng-isl-synthetic-corpus-v1.0
  - monolingual-eng
    - newscrawl.multi-year.en-is.tsv (7 GB)
    - enwiki-20161221.en-is.tsv (2 GB)
    - europarl-v9-en.en-is.tsv (609 MB)
  - monolingual-isl
    - rmh2018-2
      - visir-rmh2018-2.tsv (1 GB)
      - ras1_og_ras2-rmh2018-2.tsv (254 MB)
      - althingi-rmh2018-2.tsv (904 MB)
      - haestirettur-rmh2018-2.tsv (610 MB)
      - iswiki-rmh2018-2.tsv (87 MB)
      - domstolar-rmh2018-2.tsv (533 MB)
      - sjonvarpid-rmh2018-2.tsv (167 MB)
    - rmh2018-1
      - morgunbladid-rmh2018-1.tsv (2 GB)
      - visindavefur-rmh2018-1.tsv (60 MB)
  - README (2 kB)
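Because the extracted files are large, it can be convenient to inspect the archive without unpacking it. The sketch below lists the archive members and reads the first rows of one `.tsv` file directly from the `.tar.gz`. It assumes the files are UTF-8 encoded and tab-separated, as the `.tsv` extension suggests; the meaning of each column is not stated in this record, so the rows are returned as uninterpreted field lists (the README inside the archive is the place to check). The helper names `list_files` and `head_rows` are hypothetical.

```python
import io
import tarfile
from itertools import islice


def list_files(archive_path):
    """Return (name, size_in_bytes) for every regular file in the archive.

    Note: this scans the whole gzip stream, which is slow for a ~7 GB file.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        return [(m.name, m.size) for m in tar if m.isfile()]


def head_rows(archive_path, member_suffix, n=5):
    """Return the first n rows of the first member whose name ends with
    member_suffix, each row split on tabs.

    Assumes UTF-8 text with tab-separated fields; columns are not interpreted.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(member_suffix):
                raw = tar.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return [line.rstrip("\r\n").split("\t") for line in islice(text, n)]
    raise FileNotFoundError(f"no member ending with {member_suffix!r}")
```

For example, `head_rows("eng-isl-synthetic-corpus-v1.0.tar.gz", "iswiki-rmh2018-2.tsv", 3)` would return the first three rows of the smallest Icelandic Wikipedia file, assuming the archive sits in the current directory.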