Show simple item record
dc.contributor.author: Þorsteinsson, Vilhjálmur
dc.contributor.author: Óladóttir, Hulda
dc.contributor.author: Þórðarson, Sveinbjörn
dc.contributor.author: Ragnarsson, Pétur Orri
dc.contributor.author: Jónsson, Haukur Páll
dc.contributor.author: Eyjólfsson, Logi
dc.date.accessioned: 2025-10-10T13:03:02Z
dc.date.available: 2025-10-10T13:03:02Z
dc.date.issued: 2024-08-15
dc.identifier.uri: http://hdl.handle.net/20.500.12537/371
dc.description: Tokenizer is a compact pure-Python 3 executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer (a short usage sketch follows the metadata fields below).
dc.language.iso: isl
dc.publisher: Miðeind ehf.
dc.relation.replaces: http://hdl.handle.net/20.500.12537/367
dc.rights: The MIT License (MIT)
dc.rights.uri: https://opensource.org/licenses/mit-license.php
dc.rights.label: PUB
dc.source.uri: https://github.com/mideind/Tokenizer/releases/tag/v3.5.3
dc.subject: tokenization
dc.subject: sentence detection
dc.subject: token detection
dc.title: Tokenizer for Icelandic text (3.5.3) (2025-10-07)
dc.type: toolService
metashare.ResourceInfo#ContentInfo.detailedType: tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent: true
has.files: yes
branding: Clarin IS Repository
contact.person: Vilhjálmur Þorsteinsson, mideind@mideind.is, Miðeind ehf.
sponsor: Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds
files.size: 562797 (bytes)
files.count: 2
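As a rough illustration of the tokenization described in dc.description above, the sketch below follows the usage shown in the project README (https://github.com/mideind/Tokenizer). The tokenize() entry point, the TOK constants and the token attributes are assumed from that documentation and may differ between versions.

# Rough usage sketch, following the project README at
# https://github.com/mideind/Tokenizer (API names assumed from there).
from tokenizer import TOK, tokenize

text = "Ég keypti rafbíl 3. janúar sl. Hann kostaði 4,9 m.kr."

for token in tokenize(text):
    # Each token reports its kind (word, punctuation, number/amount, date,
    # e-mail, URL, sentence begin/end, ...), the matched text and, where
    # applicable, a normalized value.
    print(f"{TOK.descr[token.kind]:12} {token.txt or '':20} {token.val or ''}")

According to the same README, installing the package also provides a tokenize command-line tool that reads text and writes one tokenized sentence per line.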
Files in this item
Download all files in item (549.61 KB)
This item is Publicly Available and licensed under: The MIT License (MIT)
Name: Tokenizer-3.5.3.tar.gz
Size: 265.82 KB
Format: application/gzip
Description: Unknown
MD5: 502d504f47732f7089b736f817978496
Preview (archive contents): Tokenizer-3.5.3, src, tokenizer, abbrev.py (13 kB), main.py (9 kB), definitions.py (28 kB), tokenizer.py (128 kB), __init__.py (2 kB), Abbrev.conf (46 kB), README.md (34 kB), .gitignore (1 kB), pyproject.toml (2 kB), CLAUDE.md (3 kB), test, toktest_edgecases.txt (6 kB), toktest_large_gold_perfect.txt (576 kB), test_composite_glyphs.py (8 kB), test_helper_functions.py (2 kB), Overview.txt (34 kB), toktest_large_gold_acceptable.txt (583 kB), toktest_edgecases_gold_expected.txt (6 kB), test_index_calculation.py (22 kB), toktest_edgecases_diff.txt (751 B), test_cli.py (7 kB), toktest_sentences.txt (21 kB), test_abbrev.py (984 B), toktest_large.txt (559 kB), test_tokenizer.py (103 kB), toktest_normal.txt (12 kB), toktest_normal_gold_expected.txt (13 kB), example.txt (3 kB), test_detokenize.py (2 kB), test_tokenizer_tok.py (18 kB), .github, LICENSE.txt (1 kB), perf.py (977 B), MANIFEST.in (104 B)
Name: Tokenizer-3.5.3.zip
Size: 283.79 KB
Format: application/zip
Description: Unknown
MD5: 1199c654b1a8e4fe437c9a2e03f48577
Preview (archive contents): identical file listing to Tokenizer-3.5.3.tar.gz above
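The MD5 values listed for the two archives can be checked after download. A minimal sketch in Python, assuming Tokenizer-3.5.3.tar.gz has been saved to the current directory; the expected checksum is the value from the record above.

# Sketch: verify a downloaded archive against the MD5 listed in the record.
# Filename and expected checksum are taken from the Tokenizer-3.5.3.tar.gz
# entry above; adjust both for the .zip archive as needed.
import hashlib

EXPECTED_MD5 = "502d504f47732f7089b736f817978496"

digest = hashlib.md5()
with open("Tokenizer-3.5.3.tar.gz", "rb") as f:
    # Read in chunks so large archives do not have to fit in memory at once.
    for chunk in iter(lambda: f.read(1 << 16), b""):
        digest.update(chunk)

if digest.hexdigest() == EXPECTED_MD5:
    print("Checksum OK")
else:
    print("Checksum mismatch; re-download the file")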