OCR Post-Processing Transformer Model 23.04

Jasonarson, Atli; Steingrímsson, Steinþór; Ingimundarson, Finnur Ágúst; Magnússon, Árni Davíð

dc.contributor.author	Jasonarson, Atli
dc.contributor.author	Steingrímsson, Steinþór
dc.contributor.author	Ingimundarson, Finnur Ágúst
dc.contributor.author	Magnússon, Árni Davíð
dc.date.accessioned	2023-04-14T15:21:50Z
dc.date.available	2023-04-14T15:21:50Z
dc.date.issued	2023-04-14
dc.identifier.uri	http://hdl.handle.net/20.500.12537/309
dc.description	During the project L11 - Error models for OCR of The Language Technology Programme 2019-2023, several OCR post-processing models were trained. This is the best-performing one. On texts from the 19th century to the early 20th century, it reduces the word error rate from 6.49% to 3.08% and the character error rate from 1.39% to 0.73%. On modern texts, it reduces the word error rate from 5.52% to 3.60% and the character error rate from 1.17% to 1.0%. The accompanying README contains further information, including how to use the model for inference.
dc.language.iso	isl
dc.publisher	The Árni Magnússon Institute for Icelandic Studies
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/license/apache2-0-php/
dc.rights.label	PUB
dc.subject	ocr
dc.title	OCR Post-Processing Transformer Model 23.04
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	Clarin IS Repository
contact.person	Steinþór Steingrímsson, steinthor.steingrimsson@arnastofnun.is, The Árni Magnússon Institute for Icelandic Studies
sponsor	Ministry of Education, Science and Culture (Mennta- og menningarmálaráðuneytið); project code: L11 - Error models for OCR; project name: Language Technology for Icelandic 2019-2023; funding type: nationalFunds
files.size	412424484
files.count	2
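
The error rates reported in dc.description can also be read as relative reductions. The following is a minimal sketch of that arithmetic, not part of the published project; the function name `relative_reduction` and the labels "historical" and "modern" are illustrative assumptions, and the input figures are copied from the description.

```python
def relative_reduction(before, after):
    """Relative reduction of an error rate, as a fraction of the original rate."""
    return (before - after) / before


# (before, after) error rates in percent, copied from dc.description.
# The keys are illustrative labels, not names used by the project.
REPORTED = {
    ("historical", "WER"): (6.49, 3.08),
    ("historical", "CER"): (1.39, 0.73),
    ("modern", "WER"): (5.52, 3.60),
    ("modern", "CER"): (1.17, 1.0),
}

for (texts, metric), (before, after) in REPORTED.items():
    drop = 100 * relative_reduction(before, after)
    print(f"{texts:10s} {metric}: {before:.2f}% to {after:.2f}% ({drop:.1f}% relative reduction)")
```

With the reported figures this gives relative reductions of roughly 52.5% (WER) and 47.5% (CER) on the older texts, and roughly 34.8% (WER) and 14.5% (CER) on modern texts.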

Files in this item

This item is publicly available and licensed under the Apache License 2.0.

Name: README.md
Size: 1.8 KB
Format: Unknown
Description: readme
MD5: e1190edf8304d3961742cabbfa7bcbab

Name: ocr-p-p.zip
Size: 393.32 MB
Format: application/zip
Description: project
MD5: aa3969f47265b9180c650796ed95f2fe
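
The MD5 values listed for the two files can be used to check that a download arrived intact. The following is a minimal sketch, assuming the files were saved under their listed names in a local directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of the project.

```python
import hashlib
import os

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "README.md": "e1190edf8304d3961742cabbfa7bcbab",
    "ocr-p-p.zip": "aa3969f47265b9180c650796ed95f2fe",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file to True/False (checksum match) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        results[name] = (md5_of_file(path) == expected) if os.path.exists(path) else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks keeps memory use flat for the 393 MB archive. MD5 is adequate here for detecting a corrupted download, although it is not a defence against deliberate tampering.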

File Preview

frsq
- data
  - data-bin.3000
    - test.original-corrected.original.bin (18 kB)
    - test.original-corrected.corrected.idx (9 kB)
    - valid.original-corrected.corrected.bin (2 MB)
    - dict.original.txt (34 kB)
    - test.original-corrected.corrected.bin (18 kB)
    - train.original-corrected.corrected.idx (10 MB)
    - valid.original-corrected.original.idx (653 kB)
    - dict.corrected.txt (34 kB)
    - train.original-corrected.original.idx (10 MB)
    - train.original-corrected.corrected.bin (29 MB)
    - valid.original-corrected.original.bin (2 MB)
    - train.original-corrected.original.bin (30 MB)
    - test.original-corrected.original.idx (9 kB)
    - valid.original-corrected.corrected.idx (653 kB)
    - preprocess.log (2 kB)
  - sentencepiece
    - data
      - sentencepiece_3000.bpe.model (274 kB)
      - sentencepiece_3000.bpe.vocab (35 kB)
- models
  - checkpoint_best.pt (450 MB)
- infer.py (719 B)
- test.txt (512 B)
- requirements.txt (30 B)
