dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Snæbjarnarson, Vésteinn
dc.date.accessioned 2022-09-29T11:21:54Z
dc.date.available 2022-09-29T11:21:54Z
dc.date.issued 2022-09-23
dc.identifier.uri http://hdl.handle.net/20.500.12537/278
dc.description These models translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors, such as spelling mistakes and other minor deviations in the input. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and fine-tuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260).
The monolingual data that was backtranslated and used in training comes from the following sources: The Icelandic Gigaword Corpus without sports news (IGC), The Icelandic Common Crawl Corpus (IC3), student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review.
The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible, and the Jehovah's Witnesses corpus (JW300).
Provided here are the model files, a SentencePiece subword-tokenizing model, and dictionary files for running the models locally, along with scripts for translating sentences on the command line. See the included README for instructions on running inference.
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://velthyding.is
dc.subject nmt
dc.subject machine translation
dc.title Long Context Translation Models for English-Icelandic translations (22.09)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
demo.uri https://velthyding.is
contact.person Haukur Páll Jónsson haukur@mideind.is Miðeind ehf
sponsor Ministry of Education, Science and Culture MT for Icelandic (V4a) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 5312686693
files.count 9


 Files in this item

Total size of all files in item: 4.95 GB
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
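The 4.95 GB total and the files.size value of 5312686693 bytes agree only if the size is read in binary units (1 GB taken as 1024^3 bytes); in decimal units the same byte count would be about 5.31 GB. The sketch below illustrates that conversion. The function name `format_binary_size` is hypothetical, and the assumption that the repository reports sizes in 1024-based units is inferred from the numbers rather than stated in the record.

```python
def format_binary_size(num_bytes):
    """Format a byte count using 1024-based units (assumed convention)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.2f} {unit}"
        size /= 1024


# files.size from the record above
print(format_binary_size(5312686693))  # 4.95 GB
```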
Name: data-bin.zip
Size: 4.51 MB
Format: application/zip
Description: Unknown
MD5: f0f440281ad1dc15870f7e0afc7ee7dc
Contents:
  • data-bin
    • dict.txt (3 MB)
    • dict.en_XX.txt (3 MB)
    • dict.is_IS.txt (3 MB)

Name: fairseq_user_dir.zip
Size: 39.14 KB
Format: application/zip
Description: Unknown
MD5: d4844faf2005a8e33fb6d5fad4514602
Contents:
  • fairseq_user_dir
    • watch_sync_ada.sh (403 B)
    • document_utils.py (20 kB)
    • make_merged_sentence_testset.py (2 kB)
    • document_dataset.py (874 B)
    • fragment_noise.py (5 kB)
    • sentencepiece_bpe_sampling.py (1 kB)
    • check_parallel.py (3 kB)
    • scratch_load.py (4 kB)
    • document_translation_from_pretrained_bart.py (10 kB)
    • noised_translation_from_pretrained_bart.py (10 kB)
    • check_pos_dist.py (604 B)
    • indexed_parallel_documents_dataset.py (19 kB)
    • batch_sampler.py (1022 B)
    • indexed_parallel_bt_documents_dataset.py (8 kB)
    • noised_sequence.py (147 B)
    • cached_mmap_jsonl_dataset.py (2 kB)
    • word_noise.py (5 kB)
    • __pycache__
      • __init__.cpython-38.pyc (288 B)
      • sentencepiece_bpe_sampling.cpython-38.pyc (1 kB)
      • document_translation_from_pretrained_bart.cpython-38.pyc (6 kB)
    • spm_segmentation_noise.py (2 kB)
    • check_align.py (5 kB)
    • check_domain.py (1 kB)
    • encoders.py (1 kB)
    • __init__.py (143 B)
    • noiser.py (87 B)

Name: infer_en_is.sh
Size: 507 bytes
Format: Unknown
Description: Unknown
MD5: 5f321a0c495081daf93664186027a819

Name: infer_is_en.sh
Size: 507 bytes
Format: Unknown
Description: Unknown
MD5: 2b734e852986039dbd114b40e8d41b76

Name: sentence.bpe.model
Size: 4.83 MB
Format: Unknown
Description: Unknown
MD5: bf25eb5120ad92ef5c7d8596b5dc4046

Name: model_doc_enis.pt.zip
Size: 2.47 GB
Format: application/zip
Description: Unknown
MD5: 7abf80f7174cd7faf1f79880335a7654
Contents:
  • model_doc_enis.pt (4 GB)

Name: model_doc_isen.pt.zip
Size: 2.47 GB
Format: application/zip
Description: Unknown
MD5: 72f8fd43d602cb477ca3e16552462cd2
Contents:
  • model_doc_isen.pt (4 GB)

Name: requirements.txt
Size: 52 bytes
Format: Text file
Description: Unknown
MD5: 031a5b1ccf830f30f7964b592758b1cd
Contents:
fairseq==0.10.2
sentencepiece==0.1.97
wheel==0.37.1

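Because requirements.txt pins exact versions, it can be useful to confirm that a local environment matches before running the inference scripts. The following is a minimal sketch, not part of the published package: the helper names `parse_pins` and `check_environment` are hypothetical, and the pin list is copied from the three lines visible in the file preview above.

```python
from importlib import metadata

# The three pins visible in the requirements.txt preview of this record.
REQUIREMENTS_TXT = """\
fairseq==0.10.2
sentencepiece==0.1.97
wheel==0.37.1
"""


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_environment(pins):
    """Return {name: (pinned, installed)}; installed is None if missing."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report


if __name__ == "__main__":
    for name, (pinned, installed) in check_environment(parse_pins(REQUIREMENTS_TXT)).items():
        status = "ok" if installed == pinned else "mismatch"
        print(f"{name}: pinned {pinned}, installed {installed} ({status})")
```

A mismatch does not necessarily mean inference will fail, but the pinned versions are the only ones this record documents.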
Name: README
Size: 2.41 KB
Format: Unknown
MD5: d78fb30f81e9d0562ebbad9bf19eca21
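
The MD5 values listed for each file can be used to check that the downloads, two of which are about 2.47 GB each, arrived intact. The sketch below is one way to do that with Python's standard library; the function names `md5_of_file` and `verify_downloads` and the idea of keeping all files in one directory are assumptions for illustration, while the checksums are copied from the listing in this record.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "data-bin.zip": "f0f440281ad1dc15870f7e0afc7ee7dc",
    "fairseq_user_dir.zip": "d4844faf2005a8e33fb6d5fad4514602",
    "infer_en_is.sh": "5f321a0c495081daf93664186027a819",
    "infer_is_en.sh": "2b734e852986039dbd114b40e8d41b76",
    "sentence.bpe.model": "bf25eb5120ad92ef5c7d8596b5dc4046",
    "model_doc_enis.pt.zip": "7abf80f7174cd7faf1f79880335a7654",
    "model_doc_isen.pt.zip": "72f8fd43d602cb477ca3e16552462cd2",
    "requirements.txt": "031a5b1ccf830f30f7964b592758b1cd",
    "README": "d78fb30f81e9d0562ebbad9bf19eca21",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Return {name: True, False, or None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, ok in verify_downloads(".").items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {label}")
```

Reading in chunks keeps memory use flat for the multi-gigabyte model archives.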
