
 
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.date.accessioned 2022-09-30T09:49:33Z
dc.date.available 2022-09-30T09:49:33Z
dc.date.issued 2022-09-27
dc.identifier.uri http://hdl.handle.net/20.500.12537/283
dc.description ENGLISH: These models are optimized versions of the translation models released in http://hdl.handle.net/20.500.12537/278. Instead of the 24 layers used in the full models, they have been reduced to 7 layers. The computational resources required to run inference are thus significantly lower than for the original models. Performance is comparable to the original models when evaluated on general topics such as news, but for expert knowledge from the training data (e.g. EEA regulations) the original models are more capable. The models translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors, such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and fine-tuned on bilingual EN-IS data and back-translated data (including http://hdl.handle.net/20.500.12537/260). The full back-translation data includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), Student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and the Jehovah’s Witnesses corpus (JW300). Provided here are the model files, a SentencePiece subword-tokenizing model and dictionary files for running the models locally, along with scripts for translating sentences on the command line. See the included README for instructions on running inference.
ÍSLENSKA (English translation): These models are reduced versions of the models available at http://hdl.handle.net/20.500.12537/278. The original models have 24 layers, whereas these versions have 7 layers and are more efficient to run. Their performance is on par with the original models on general text, such as news. On more specialized text found in the training data, e.g. European regulations, they perform worse. These models can translate between English and Icelandic. They can translate several sentences at once and tolerate errors and minor deviations in the input. The models are fine-tuned translation models trained from the mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210). The training data is parallel English-Icelandic data together with back-translations (including http://hdl.handle.net/20.500.12537/260). The monolingual data that was back-translated and used in training comes from: the Icelandic Gigaword Corpus (without sports news), Icelandic Common Crawl Corpus (IC3), theses from skemman.is, news from the Greynir news database, Wikipedia, the Icelandic sagas, open Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel data comes from European Economic Area (EEA) regulations, aligned abstracts from student theses (IPAC), Stúdentablaðið, the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and a parallel corpus built from The Watchtower (JW300). Released are the models themselves, a subword tokenization model and a dictionary for the tokenization, along with scripts for running translations from the command line. Further instructions are in the README file.
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://velthyding.is
dc.subject nmt
dc.subject machine translation
dc.subject translation
dc.title Optimized Long Context Translation Models for English-Icelandic translations (22.09)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
demo.uri https://velthyding.is
contact.person Haukur Páll Jónsson haukurpj@mideind.is Miðeind ehf
sponsor Ministry of Education, Science and Culture Back-translation data selection and filtering (V2b) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 4093860858
files.count 2


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name doc_distil_is_en.zip
Size 1.91 GB
Format application/zip
Description Unknown
MD5 56fc4841f6e314ee426a77972dfe6b6e
File Preview
  • doc_distil_is_en
    • fairseq_user_dir
      • watch_sync_ada.sh (403 B)
      • document_utils.py (20 kB)
      • make_merged_sentence_testset.py (2 kB)
      • document_dataset.py (874 B)
      • fragment_noise.py (5 kB)
      • sentencepiece_bpe_sampling.py (1 kB)
      • check_parallel.py (3 kB)
      • scratch_load.py (4 kB)
      • document_translation_from_pretrained_bart.py (10 kB)
      • noised_translation_from_pretrained_bart.py (10 kB)
      • check_pos_dist.py (604 B)
      • indexed_parallel_documents_dataset.py (19 kB)
      • batch_sampler.py (1022 B)
      • indexed_parallel_bt_documents_dataset.py (8 kB)
      • noised_sequence.py (147 B)
      • cached_mmap_jsonl_dataset.py (2 kB)
      • word_noise.py (5 kB)
      • __pycache__
        • __init__.cpython-38.pyc (288 B)
        • sentencepiece_bpe_sampling.cpython-38.pyc (1 kB)
        • document_translation_from_pretrained_bart.cpython-38.pyc (6 kB)
      • spm_segmentation_noise.py (2 kB)
      • check_align.py (5 kB)
      • check_domain.py (1 kB)
      • encoders.py (1 kB)
      • __init__.py (143 B)
      • noiser.py (87 B)
    • README.md (2 kB)
    • dict.en_XX.txt (3 MB)
    • fairseq_model.pt (3 GB)
    • requirements.txt (52 B)
    • dict.is_IS.txt (3 MB)
    • interactive.sh (477 B)
    • sentencepiece.bpe.model (4 MB)
Name doc_distil_en_is.zip
Size 1.9 GB
Format application/zip
Description Unknown
MD5 f7c1fab1632b2371451bb11edc958d72
File Preview
  • doc_distil_en_is
    • fairseq_user_dir
      • watch_sync_ada.sh (403 B)
      • document_utils.py (20 kB)
      • make_merged_sentence_testset.py (2 kB)
      • document_dataset.py (874 B)
      • fragment_noise.py (5 kB)
      • sentencepiece_bpe_sampling.py (1 kB)
      • check_parallel.py (3 kB)
      • scratch_load.py (4 kB)
      • document_translation_from_pretrained_bart.py (10 kB)
      • noised_translation_from_pretrained_bart.py (10 kB)
      • check_pos_dist.py (604 B)
      • indexed_parallel_documents_dataset.py (19 kB)
      • batch_sampler.py (1022 B)
      • indexed_parallel_bt_documents_dataset.py (8 kB)
      • noised_sequence.py (147 B)
      • cached_mmap_jsonl_dataset.py (2 kB)
      • word_noise.py (5 kB)
      • __pycache__
        • __init__.cpython-38.pyc (292 B)
        • sentencepiece_bpe_sampling.cpython-38.pyc (1 kB)
        • document_translation_from_pretrained_bart.cpython-38.pyc (6 kB)
      • spm_segmentation_noise.py (2 kB)
      • check_domain.py (1 kB)
      • check_align.py (5 kB)
      • encoders.py (1 kB)
      • __init__.py (143 B)
      • noiser.py (87 B)
    • README.md (2 kB)
    • fairseq_model.pt (3 GB)
    • dict.en_XX.txt (3 MB)
    • requirements.txt (52 B)
    • interactive.sh (477 B)
    • dict.is_IS.txt (3 MB)
    • sentencepiece.bpe.model (4 MB)
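
The MD5 values listed for the two archives can be used to check a download before unpacking it. The following is a minimal sketch, not part of the released scripts: the helper names md5_of_file and verify_download are hypothetical, and it assumes the archives were saved under the file names shown in this record.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from this record's file listing.
EXPECTED_MD5 = {
    "doc_distil_is_en.zip": "56fc4841f6e314ee426a77972dfe6b6e",
    "doc_distil_en_is.zip": "f7c1fab1632b2371451bb11edc958d72",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a ~2 GB archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the checksum listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of_file(path) == expected
```

For example, verify_download("doc_distil_en_is.zip") returns True only when the local file matches the checksum above; a False result most likely indicates an incomplete or corrupted download.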
