
 
dc.contributor.author Jasonarson, Atli
dc.contributor.author Steingrímsson, Steinþór
dc.contributor.author Ingimundarson, Finnur Ágúst
dc.contributor.author Magnússon, Árni Davíð
dc.date.accessioned 2023-04-14T15:21:50Z
dc.date.available 2023-04-14T15:21:50Z
dc.date.issued 2023-04-14
dc.identifier.uri http://hdl.handle.net/20.500.12537/309
dc.description During the project L11 - Error models for OCR of The Language Technology Programme 2019-2023, various OCR post-processing models were trained. This is the best-performing one. On texts from the 19th century to the early 20th century, it reduces the word error rate from 6.49% to 3.08% and the character error rate from 1.39% to 0.73%. On modern texts, it reduces the word error rate from 5.52% to 3.60% and the character error rate from 1.17% to 1.0%. More information, such as how to use the model for inference, is in the README.
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/license/apache2-0-php/
dc.rights.label PUB
dc.subject ocr
dc.title OCR Post-Processing Transformer Model 23.04
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Steinþór Steingrímsson steinthor.steingrimsson@arnastofnun.is The Árni Magnússon Institute for Icelandic Studies
sponsor Ministry of Education, Science and Culture (Mennta- og menningamálaráðuneytið) L11 - Error models for OCR Language Technology for Icelandic 2019-2023 nationalFunds
files.size 412424484
files.count 2
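
The error rates quoted in dc.description can also be read as relative reductions. The sketch below only does that arithmetic on the four reported before/after pairs; the function name and the "historical"/"modern" labels are illustrative and do not come from the record or the README.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when an error rate drops from `before` to `after`."""
    return (before - after) / before * 100.0


# (before, after) error rates in percent, copied from dc.description.
REPORTED = {
    ("historical", "word error rate"): (6.49, 3.08),
    ("historical", "character error rate"): (1.39, 0.73),
    ("modern", "word error rate"): (5.52, 3.60),
    ("modern", "character error rate"): (1.17, 1.0),
}

if __name__ == "__main__":
    for (texts, metric), (before, after) in REPORTED.items():
        change = relative_reduction(before, after)
        print(f"{texts} texts, {metric}: {before}% -> {after}% ({change:.1f}% relative reduction)")
```

Under this reading, the model roughly halves the word error rate on the older texts and has a smaller effect on modern texts.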


 Files in this item

This item is publicly available and licensed under the Apache License 2.0.
Name          Size        Format            Description   MD5
README.md     1.8 KB      Unknown           readme        e1190edf8304d3961742cabbfa7bcbab
ocr-p-p.zip   393.32 MB   application/zip   project       aa3969f47265b9180c650796ed95f2fe
 File Preview  
  • frsq
    • data
      • data-bin.3000
        • test.original-corrected.original.bin (18 kB)
        • test.original-corrected.corrected.idx (9 kB)
        • valid.original-corrected.corrected.bin (2 MB)
        • dict.original.txt (34 kB)
        • test.original-corrected.corrected.bin (18 kB)
        • train.original-corrected.corrected.idx (10 MB)
        • valid.original-corrected.original.idx (653 kB)
        • dict.corrected.txt (34 kB)
        • train.original-corrected.original.idx (10 MB)
        • train.original-corrected.corrected.bin (29 MB)
        • valid.original-corrected.original.bin (2 MB)
        • train.original-corrected.original.bin (30 MB)
        • test.original-corrected.original.idx (9 kB)
        • valid.original-corrected.corrected.idx (653 kB)
        • preprocess.log (2 kB)
      • sentencepiece
        • data
          • sentencepiece_3000.bpe.model (274 kB)
          • sentencepiece_3000.bpe.vocab (35 kB)
    • models
      • checkpoint_best.pt (450 MB)
    • infer.py (719 B)
    • test.txt (512 B)
    • requirements.txt (30 B)
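
The MD5 values listed for README.md and ocr-p-p.zip can be used to check that a download arrived intact. The following is a minimal sketch using only the Python standard library. It assumes the files were saved under their listed names in one directory; the function names are illustrative and are not part of the distributed project.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "README.md": "e1190edf8304d3961742cabbfa7bcbab",
    "ocr-p-p.zip": "aa3969f47265b9180c650796ed95f2fe",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the large archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Return {filename: True, False, or None}; None means the file was not found."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in check_downloads().items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually means the download was truncated or corrupted and should be repeated. MD5 is adequate here as an integrity check, not as a security guarantee.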
