
 
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Snæbjarnarson, Vésteinn
dc.date.accessioned 2022-09-26T13:09:42Z
dc.date.available 2022-09-26T13:09:42Z
dc.date.issued 2022-09-23
dc.identifier.uri http://hdl.handle.net/20.500.12537/260
dc.description This backtranslation corpus was created by sampling texts with context from publicly available, open datasets. For English, we sampled whole speeches from Europarl V8, whole news articles from NewsCrawl, and whole articles from Wikipedia. For Icelandic, we sampled whole stories from the Icelandic sagas and Icelandic e-books, whole documents from the Icelandic Gigaword Corpus (without sports), and whole articles from Wikipedia. The texts were then translated with a translation system trained to translate with context; the system translated three sentences at a time to produce the backtranslations.

The corpus consists of multiple *.tsv files in which the first column contains text in the source language and the second column contains the corresponding translation. Each segment (usually a single sentence) is on its own line. One empty line follows each paragraph, and two empty lines separate documents (article, story, etc.). Note that not all the corpora contain paragraph information; where it is missing, the whole document is treated as a single paragraph.

# The corpora

| Monolingual dataset | Source language | Tokens (millions) |
| ----- | ---------------- | ----------------- |
| The Icelandic Gigaword Corpus (without sports) (IGC) | Icelandic | 118.4 |
| Wikipedia | Icelandic | 8.9 |
| Icelandic sagas | Icelandic | 1.4 |
| Icelandic e-books | Icelandic | 1.6 |
| NewsCrawl | English | 44.5 |
| Wikipedia | English | 53.3 |
| Europarl V8 | English | 58.4 |

---

This corpus of backtranslations was created by backtranslating open corpora in context. The English data were obtained from speeches in the Europarl V8 corpus, news articles from NewsCrawl, and whole articles from Wikipedia. The Icelandic source data come from the Icelandic sagas, open e-books, the Icelandic Gigaword Corpus (Risamálheildin), and Wikipedia. The texts were translated with a translation model trained to translate longer context, i.e. more than one sentence at a time. The backtranslations were produced by translating three adjacent sentences at a time. The data are in *.tsv format, with the source language in the first column and the generated text in the second column. One empty line separates paragraphs and two empty lines separate documents (article, news story, saga, etc.). Note that not all documents have paragraph divisions.
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://velthyding.is
dc.subject machine translation
dc.subject nmt
dc.subject backtranslations
dc.subject synthetic data
dc.title Long Context Synthetic Translation Pairs for English and Icelandic (22.09)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
demo.uri https://velthyding.is
contact.person Haukur Páll Jónsson haukurpj@mideind.is Miðeind ehf
sponsor Ministry of Education, Science and Culture Back-translation data selection and filtering (V2b) Language Technology for Icelandic 2019-2023 nationalFunds
size.info 17197440 sentences
files.size 1418234853
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: backtranslations_context_2209.zip
Size: 1.32 GB
Format: application/zip
Description: Unknown
MD5: d7b9a8ac4aa461f4f68d2ca1688e5b50
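The MD5 value listed for the archive can be used to check a download. The following is a minimal sketch, assuming the archive has been saved locally; the helper names `md5_of_file` and `matches_record` are illustrative and not part of the record.

```python
import hashlib

# MD5 value as listed in this record for backtranslations_context_2209.zip.
EXPECTED_MD5 = "d7b9a8ac4aa461f4f68d2ca1688e5b50"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading avoids loading the whole 1.32 GB archive into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path).lower() == expected.lower()
```

Calling `matches_record("backtranslations_context_2209.zip")` returns `True` only if the local file's digest equals the listed value; the local path is an assumption about where the download was saved.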
 File Preview
  • backtranslations
    • newscrawl_en_is.tsv (550 MB)
    • README.md (1 kB)
    • europarl_v8_en_is.tsv (695 MB)
    • rafbokavefurinn_is_en.tsv (19 MB)
    • fornsogur_is_en.tsv (16 MB)
    • rmh_filtered_is_en.tsv (1 GB)
    • wikipedia_en_is.tsv (644 MB)
    • wikipedia_is_en.tsv (108 MB)
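The description states that each *.tsv file holds one segment per line (source text, then translation), with one empty line between paragraphs and two empty lines between documents. A reader for that layout could be sketched as follows. This is an illustration under assumptions, not the authors' tooling: the function name `read_documents` is hypothetical, and it assumes that a separator line is empty or whitespace-only and that the files are UTF-8 encoded.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]      # (source segment, translated segment)
Paragraph = List[Pair]
Document = List[Paragraph]


def read_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Group TSV lines into documents, each a list of paragraphs.

    Assumption: one blank line closes a paragraph and a second
    consecutive blank line closes the document.
    """
    document: Document = []
    paragraph: Paragraph = []
    blank_run = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            blank_run += 1
            if blank_run == 1 and paragraph:
                document.append(paragraph)
                paragraph = []
            elif blank_run == 2 and document:
                yield document
                document = []
            continue
        blank_run = 0
        # Split on the first tab only: column 1 is the source text,
        # column 2 is the translation.
        source, _, translation = line.partition("\t")
        paragraph.append((source, translation))
    # Flush whatever remains at end of input.
    if paragraph:
        document.append(paragraph)
    if document:
        yield document


# Tiny in-memory sample: two documents, the first with two paragraphs.
sample = ["a\tA\n", "b\tB\n", "\n", "c\tC\n", "\n", "\n", "d\tD\n"]
documents = list(read_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `open("backtranslations/wikipedia_is_en.tsv", encoding="utf-8")`; that path follows the file preview above and assumes the archive was extracted in the working directory. For corpora without paragraph information, each document comes back as a single paragraph.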
