dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Jónsson, Haukur Páll |
dc.contributor.author | Ragnarsson, Pétur Orri |
dc.contributor.author | Ingólfsdóttir, Svanhvít Lilja |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.date.accessioned | 2022-09-29T11:21:54Z |
dc.date.available | 2022-09-29T11:21:54Z |
dc.date.issued | 2022-09-23 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/278 |
dc.description | ENGLISH: These models translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors, such as spelling mistakes. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and fine-tuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260). The full backtranslation data includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), Student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and the Jehovah’s Witnesses corpus (JW300). Provided here are model files, a SentencePiece subword-tokenizing model and dictionary files for running the model locally, along with scripts for translating sentences on the command line. See the included README for instructions on running inference. ÍSLENSKA: Þessi líkön geta þýtt á milli ensku og íslensku. Líkönin geta þýtt margar málsgreinar í einu og eru þolin gagnvart villum og smávægilegu fráviki í inntaki. Líkönin eru áframþjálfuð þýðingarlíkön sem voru þjálfuð frá mBART25 líkaninu (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210). Þjálfunargögnin eru samhliða ensk-íslensk gögn ásamt bakþýðingum (m.a. http://hdl.handle.net/20.500.12537/260). 
Einmála gögn sem voru bakþýdd og nýtt í þjálfanir eru fengin úr: Risamálheildinni (án íþróttafrétta), Icelandic Common Crawl Corpus (IC3), ritgerðum af skemman.is, fréttum í fréttagrunni Greynis, Wikipedia, íslendingasögurnar, opnar íslenskar rafbækur, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. Samhliða raungögn eru fengin upp úr European Economic Area (EEA) reglugerðum, samröðuðum útdráttum úr ritgerðum nemenda (IPAC), Stúdentablaðið, Skýrsla Rannsóknarnefndar Alþingis, Biblíunni og samhliða málheild unna úr Varðturninum (JW300). Útgefin eru líkönin sjálf, orðflísunarlíkan og orðabók fyrir flísunina, ásamt skriptum til að keyra þýðingar frá skipanalínu. Nánari leiðbeiningar eru í README skjalinu. |
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | Miðeind ehf |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://velthyding.is |
dc.subject | nmt |
dc.subject | machine translation |
dc.title | Long Context Translation Models for English-Icelandic translations (22.09) |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | https://velthyding.is |
contact.person | Haukur Páll Jónsson haukur@mideind.is Miðeind ehf |
sponsor | Ministry of Education, Science and Culture MT for Icelandic (V4a) Language Technology for Icelandic 2019-2023 nationalFunds |
files.size | 5312686693 |
files.count | 9 |
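The files.size value above and the 4.95 GB total shown with the file listing appear to describe the same quantity in different units. The sketch below assumes that files.size is a byte count (the record does not state the unit) and that the displayed total uses binary units (1 GB = 1024^3 bytes); the helper names are illustrative only.

```python
# Value of files.size in this record, assumed here to be a byte count.
FILES_SIZE_BYTES = 5312686693


def to_binary_gb(n_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB, 1024**3 bytes each)."""
    return n_bytes / 1024**3


def to_decimal_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes each)."""
    return n_bytes / 1000**3


print(f"{to_binary_gb(FILES_SIZE_BYTES):.2f}")   # 4.95
print(f"{to_decimal_gb(FILES_SIZE_BYTES):.2f}")  # 5.31
```

Under these assumptions the binary conversion reproduces the 4.95 GB figure, so roughly 5.3 decimal gigabytes of free disk space are needed for the download.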
Files in this item
Total size of all files in item: 4.95 GB
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | Size | Format | Description | MD5 |
data-bin.zip | 4.51 MB | application/zip | Unknown | f0f440281ad1dc15870f7e0afc7ee7dc |
fairseq_user_dir.zip | 39.14 KB | application/zip | Unknown | d4844faf2005a8e33fb6d5fad4514602 |
infer_en_is.sh | 507 bytes | Unknown | Unknown | 5f321a0c495081daf93664186027a819 |
infer_is_en.sh | 507 bytes | Unknown | Unknown | 2b734e852986039dbd114b40e8d41b76 |
sentence.bpe.model | 4.83 MB | Unknown | Unknown | bf25eb5120ad92ef5c7d8596b5dc4046 |
model_doc_enis.pt.zip | 2.47 GB | application/zip | Unknown | 7abf80f7174cd7faf1f79880335a7654 |
model_doc_isen.pt.zip | 2.47 GB | application/zip | Unknown | 72f8fd43d602cb477ca3e16552462cd2 |
requirements.txt | 52 bytes | Text file | Unknown | 031a5b1ccf830f30f7964b592758b1cd |

Contents of data-bin.zip:
- data-bin
  - dict.txt (3 MB)
  - dict.en_XX.txt (3 MB)
  - dict.is_IS.txt (3 MB)

Contents of fairseq_user_dir.zip:
- fairseq_user_dir
  - watch_sync_ada.sh (403 B)
  - document_utils.py (20 kB)
  - make_merged_sentence_testset.py (2 kB)
  - document_dataset.py (874 B)
  - fragment_noise.py (5 kB)
  - sentencepiece_bpe_sampling.py (1 kB)
  - check_parallel.py (3 kB)
  - scratch_load.py (4 kB)
  - document_translation_from_pretrained_bart.py (10 kB)
  - noised_translation_from_pretrained_bart.py (10 kB)
  - check_pos_dist.py (604 B)
  - indexed_parallel_documents_dataset.py (19 kB)
  - batch_sampler.py (1022 B)
  - indexed_parallel_bt_documents_dataset.py (8 kB)
  - noised_sequence.py (147 B)
  - cached_mmap_jsonl_dataset.py (2 kB)
  - word_noise.py (5 kB)
  - __pycache__
    - __init__.cpython-38.pyc (288 B)
    - sentencepiece_bpe_sampling.cpython-38.pyc (1 kB)
    - document_translation_from_pretrained_bart.cpython-38.pyc (6 kB)
  - spm_segmentation_noise.py (2 kB)
  - check_align.py (5 kB)
  - check_domain.py (1 kB)
  - encoders.py (1 kB)
  - __init__.py (143 B)
  - noiser.py (87 B)
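
The MD5 values listed for each file can be used to check that a download completed without corruption. The following is a minimal sketch, not part of the published scripts: the function names are hypothetical, the checksums are copied from the listing above, and it assumes the files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this item.
EXPECTED_MD5 = {
    "data-bin.zip": "f0f440281ad1dc15870f7e0afc7ee7dc",
    "fairseq_user_dir.zip": "d4844faf2005a8e33fb6d5fad4514602",
    "infer_en_is.sh": "5f321a0c495081daf93664186027a819",
    "infer_is_en.sh": "2b734e852986039dbd114b40e8d41b76",
    "sentence.bpe.model": "bf25eb5120ad92ef5c7d8596b5dc4046",
    "model_doc_enis.pt.zip": "7abf80f7174cd7faf1f79880335a7654",
    "model_doc_isen.pt.zip": "72f8fd43d602cb477ca3e16552462cd2",
    "requirements.txt": "031a5b1ccf830f30f7964b592758b1cd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_downloads(".")` from the download directory returns one status per listed file. Reading in chunks matters here because the two model archives are about 2.47 GB each.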