dc.contributor.author | Snæbjarnarson, Vésteinn |
dc.contributor.author | Símonarson, Haukur Barri |
dc.contributor.author | Þorsteinsson, Vilhjálmur |
dc.date.accessioned | 2020-09-30T11:21:37Z |
dc.date.available | 2020-09-30T11:21:37Z |
dc.date.issued | 2020-09-28 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/74 |
dc.description | A semi-synthetic parallel corpus derived from the ParIce corpus. Person names are identified on both the source and target side of each sentence pair and then replaced with other names of the same gender and in the same grammatical case (declension). The replacement names are sourced from Icelandic news websites. Mixing this data with other parallel corpora during training exposes a translation model to many more person names than it would otherwise see, helping it learn to translate Icelandic person names correctly. |
dc.language.iso | isl |
dc.language.iso | eng |
dc.publisher | Miðeind ehf. |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/mideind/GreynirSeq/tree/develop/src/greynirseq/ner |
dc.subject | machine translation |
dc.subject | back translation |
dc.subject | neural machine translation |
dc.subject | named entity recognition |
dc.title | En-Is Semi-Synthetic Parallel Name Robustness Corpus |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Vilhjálmur Þorsteinsson clarin@mideind.is Miðeind ehf. |
sponsor | Ministry of Education, Science and Culture; Text processing (pre- and postprocessing) (V3b); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 38416 sentences |
files.size | 9460974 |
files.count | 1 |
Files in this item
Name | parice-name-corpus.post-substitution.v2.tsv |
Size | 9.02 MB |
Format | Unknown |
Description | Semi-synthetic corpus with substituted names |
MD5 | 681b64a3efd7ca28fa76b0b1bd28103c |
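
The MD5 checksum and byte size listed for the file can be used to check a downloaded copy for corruption. The following is a minimal sketch, assuming the file has been saved locally under its listed name; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of any repository tooling.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size fields).
EXPECTED_MD5 = "681b64a3efd7ca28fa76b0b1bd28103c"
EXPECTED_SIZE = 9460974  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file matches both the expected size and MD5."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5


if __name__ == "__main__":
    # Assumed local path: the file name as listed in this record.
    local_copy = Path("parice-name-corpus.post-substitution.v2.tsv")
    if local_copy.exists():
        print("OK" if verify_download(local_copy) else "MISMATCH")
    else:
        print("file not found")
```

Checking the size first is a cheap way to reject truncated downloads before hashing the whole file.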