dc.contributor.author | Jónsson, Haukur Páll |
dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Ragnarsson, Pétur Orri |
dc.contributor.author | Ingólfsdóttir, Svanhvít Lilja |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.date.accessioned | 2022-09-26T13:09:42Z |
dc.date.available | 2022-09-26T13:09:42Z |
dc.date.issued | 2022-09-23 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/260 |
dc.description | This backtranslation corpus was created from publicly available, open datasets whose texts retain their context. For English we sampled whole speeches from Europarl V8, whole news articles from NewsCrawl and whole articles from Wikipedia. For Icelandic we sampled whole stories from the Icelandic sagas and from open Icelandic e-books, whole documents from the Icelandic Gigaword Corpus (Risamálheildin, without sports) and whole articles from Wikipedia. The texts were then translated with a translation system trained to translate with context, that is, more than one sentence at a time. The system translated three adjacent sentences at once to produce the backtranslations.

The corpus consists of multiple *.tsv files in which the first column contains text in the source language and the second column contains the corresponding generated translation. Each segment (usually a single sentence) is on its own line. One blank line separates paragraphs, and two blank lines separate documents (article, news item, story, etc.). Not all corpora contain paragraph information; where it is missing, the whole document is treated as a single paragraph.

# The corpora

| Monolingual dataset | Source language | Tokens (millions) |
| ----- | ---------------- | ----------------- |
| The Icelandic Gigaword Corpus (without sports) (IGC) | Icelandic | 118.4 |
| Wikipedia | Icelandic | 8.9 |
| Icelandic sagas | Icelandic | 1.4 |
| Icelandic e-books | Icelandic | 1.6 |
| NewsCrawl | English | 44.5 |
| Wikipedia | English | 53.3 |
| Europarl V8 | English | 58.4 |
|
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | Miðeind ehf |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://velthyding.is |
dc.subject | machine translation |
dc.subject | nmt |
dc.subject | backtranslations |
dc.subject | synthetic data |
dc.title | Long Context Synthetic Translation Pairs for English and Icelandic (22.09) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://velthyding.is |
contact.person | Haukur Páll Jónsson; haukurpj@mideind.is; Miðeind ehf |
sponsor | Ministry of Education, Science and Culture; Back-translation data selection and filtering (V2b); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 17197440 sentences |
files.size | 1418234853 |
files.count | 1 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: backtranslations_context_2209.zip
- Size: 1.32 GB
- Format: application/zip
- Description: Unknown
- MD5: d7b9a8ac4aa461f4f68d2ca1688e5b50
- backtranslations
  - newscrawl_en_is.tsv (550 MB)
  - README.md (1 kB)
  - europarl_v8_en_is.tsv (695 MB)
  - rafbokavefurinn_is_en.tsv (19 MB)
  - fornsogur_is_en.tsv (16 MB)
  - rmh_filtered_is_en.tsv (1 GB)
  - wikipedia_en_is.tsv (644 MB)
  - wikipedia_is_en.tsv (108 MB)
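
The layout given in dc.description (two tab-separated columns, a blank line between paragraphs, two blank lines between documents) can be read back into documents with a few lines of Python. The following is a minimal sketch, not code shipped with the corpus. The name `read_documents` and the type aliases are hypothetical. It also assumes that fields are not quoted and that only the first tab on a line separates source from translation.

```python
from typing import Iterable, Iterator, List, Tuple

Segment = Tuple[str, str]      # (source text, translation)
Paragraph = List[Segment]
Document = List[Paragraph]


def read_documents(lines: Iterable[str]) -> Iterator[Document]:
    """Group TSV lines into documents, paragraphs and segments.

    Assumed convention (taken from the record's description):
    one blank line ends a paragraph, two or more blank lines end a document.
    """
    document: Document = []
    paragraph: Paragraph = []
    blank_run = 0  # number of consecutive blank lines seen so far

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            blank_run += 1
            continue

        # A non-blank line after at least one blank line starts a new paragraph.
        if blank_run >= 1 and paragraph:
            document.append(paragraph)
            paragraph = []
        # After two or more blank lines it also starts a new document.
        if blank_run >= 2 and document:
            yield document
            document = []
        blank_run = 0

        # Split on the first tab only; a missing tab yields an empty translation.
        source, _, translation = line.partition("\t")
        paragraph.append((source, translation))

    # Flush whatever is left at end of input.
    if paragraph:
        document.append(paragraph)
    if document:
        yield document


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real corpus file.
    sample = ["a\tA\n", "b\tB\n", "\n", "c\tC\n", "\n", "\n", "d\tD\n"]
    docs = list(read_documents(sample))
    print(len(docs), "documents")
```

To read a real file, pass an open file object, for example `open("backtranslations/wikipedia_is_en.tsv", encoding="utf-8")`. The path and the UTF-8 encoding are assumptions based on the file listing above. Because the function is a generator, the larger files can be processed one document at a time without loading them fully into memory.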