dc.contributor.author |
Jasonarson, Atli |
dc.contributor.author |
Steingrímsson, Steinþór |
dc.contributor.author |
Ingimundarson, Finnur Ágúst |
dc.contributor.author |
Magnússon, Árni Davíð |
dc.date.accessioned |
2023-04-14T15:21:50Z |
dc.date.available |
2023-04-14T15:21:50Z |
dc.date.issued |
2023-04-14 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/309 |
dc.description |
ENGLISH
Several OCR post-processing models were trained in project L11 - Error models for OCR, part of the Language Technology Programme 2019-2023. This is the best-performing one. On texts from the 19th century to the early 20th century, it reduces the word error rate from 6.49% to 3.08% and the character error rate from 1.39% to 0.73%. On modern texts, it reduces the word error rate from 5.52% to 3.60% and the character error rate from 1.17% to 1.0%.
More information, such as how to use the model for inference, is available in the README.
ICELANDIC
Í verkefninu L11 - Error models for OCR í Máltækniáætlun 2019-2023 voru nokkur ljóslestrarvilluleiðréttingarlíkön þjálfuð. Þetta er best þeirra. Líkanið lækkar hlutfall orðavillna (e. word error rate) úr 6,49% í 3,08% í textum frá 19. öld og fyrri hluta 20. aldar og hlutfall stafvillna úr 1,39% í 0,73%. Í nútímamálstextum lækkar það hlutfall orðavillna úr 5,52% í 3,60% og hlutfall stafvillna úr 1,17% í 1,0%.
Nánari upplýsingar, svo sem hvernig má nota líkanið, er að finna í meðfylgjandi README-skjali. |
dc.language.iso |
isl |
dc.publisher |
The Árni Magnússon Institute for Icelandic Studies |
dc.rights |
Apache License 2.0 |
dc.rights.uri |
https://opensource.org/license/apache2-0-php/ |
dc.rights.label |
PUB |
dc.subject |
ocr |
dc.title |
OCR Post-Processing Transformer Model 23.04 |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
has.files |
yes |
branding |
Clarin IS Repository |
contact.person |
Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies |
sponsor |
Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið) L11 - Error models for OCR Language Technology for Icelandic 2019-2023 nationalFunds |
files.size |
412424484 |
files.count |
2 |