dc.contributor.author Snæbjarnarson, Vésteinn
dc.contributor.author Símonarson, Haukur Barri
dc.contributor.author Þorsteinsson, Vilhjálmur
dc.date.accessioned 2020-09-30T11:21:37Z
dc.date.available 2020-09-30T11:21:37Z
dc.date.issued 2020-09-28
dc.identifier.uri http://hdl.handle.net/20.500.12537/74
dc.description A semi-synthetic parallel corpus based on the ParIce corpus. Person names are identified in both the source and target sentence of each pair, then replaced with other names of the same gender, declined in the same case. The replacement names are sourced from Icelandic news websites. Mixing this data with other parallel corpora during training exposes a translation model to many more person names than it would otherwise see, helping it learn to translate Icelandic person names correctly.
dc.language.iso isl
dc.language.iso eng
dc.publisher Miðeind ehf.
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/mideind/GreynirSeq/tree/develop/src/greynirseq/ner
dc.subject machine translation
dc.subject back translation
dc.subject neural machine translation
dc.subject named entity recognition
dc.title En-Is Semi-Synthetic Parallel Name Robustness Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson clarin@mideind.is Miðeind ehf.
sponsor Ministry of Education, Science and Culture Text processing (pre-and postprocessing) (V3b) Language Technology for Icelandic 2019-2023 nationalFunds
size.info 38416 sentences
files.size 9460974
files.count 1


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name parice-name-corpus.post-substitution.v2.tsv
Size 9.02 MB
Format Unknown
Description Semi synthetic corpus with substituted names
MD5 681b64a3efd7ca28fa76b0b1bd28103c
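The MD5 value listed for the file can be used to check that a download arrived intact. The following is a minimal sketch, not part of the repository record: the helper names `md5_of_file` and `verify_download` are hypothetical, and the file is assumed to have been saved locally under its listed name.

```python
import hashlib

# MD5 checksum as listed in this record for
# parice-name-corpus.post-substitution.v2.tsv
EXPECTED_MD5 = "681b64a3efd7ca28fa76b0b1bd28103c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.strip().lower()
```

Assuming the file sits in the working directory, `verify_download("parice-name-corpus.post-substitution.v2.tsv")` returns `True` only when the local copy matches the checksum above. MD5 is adequate here for detecting transfer corruption, not for protection against deliberate tampering.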
