dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Óladóttir, Hulda
dc.contributor.author Þórðarson, Sveinbjörn
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Eyjólfsson, Logi
dc.date.accessioned 2025-10-10T13:03:02Z
dc.date.available 2025-10-10T13:03:02Z
dc.date.issued 2024-08-15
dc.identifier.uri http://hdl.handle.net/20.500.12537/371
dc.description Tokenizer is a compact, pure-Python 3 package and command-line tool for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date/time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, taking into account corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
dc.language.iso isl
dc.publisher Miðeind ehf.
dc.relation.replaces http://hdl.handle.net/20.500.12537/367
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://github.com/mideind/Tokenizer/releases/tag/v3.5.3
dc.subject tokenization
dc.subject sentence detection
dc.subject token detection
dc.title Tokenizer for Icelandic text (3.5.3) (2025-10-07)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson mideind@mideind.is Miðeind ehf.
sponsor Ministry of Education, Science and Culture Tokenizer (I3) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 562797
files.count 2
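
The description field above says that Tokenizer turns text into typed tokens and then groups them into sentences without being fooled by abbreviations or dates. The sketch below is a toy illustration of that idea only. It is not the package's implementation or API: the names `toy_tokenize`, `toy_sentences`, `TOKEN_RE` and the three-entry `ABBREVIATIONS` set are hypothetical, and the real package relies on a much larger abbreviation file (`Abbrev.conf`) and far more rules.

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical, tiny abbreviation list used only for this illustration.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "þ.e."}

# Alternatives are tried in order, so more specific kinds come first.
TOKEN_RE = re.compile(
    r"""
    (?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
  | (?P<URL>https?://\S*[^\s.,;:!?)])
  | (?P<DATE>\d{1,2}\.\d{1,2}\.\d{4}(?!\d))
  | (?P<NUMBER>\d+(?:[.,]\d+)*)
  | (?P<WORD>\w+(?:\.\w+)*\.?)
  | (?P<PUNCT>[^\w\s])
    """,
    re.VERBOSE,
)


def toy_tokenize(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text) pairs for each token found in the input."""
    for match in TOKEN_RE.finditer(text):
        kind, txt = match.lastgroup, match.group()
        # A trailing period belongs to the word only for known abbreviations;
        # otherwise it is split off as a separate punctuation token.
        if kind == "WORD" and txt.endswith(".") and txt.lower() not in ABBREVIATIONS:
            yield "WORD", txt[:-1]
            yield "PUNCT", "."
        else:
            yield str(kind), txt


def toy_sentences(text: str) -> Iterator[str]:
    """Group tokens into sentences, ending each at a '.', '!' or '?' token."""
    current: List[str] = []
    for kind, txt in toy_tokenize(text):
        current.append(txt)
        if kind == "PUNCT" and txt in {".", "!", "?"}:
            yield " ".join(current)
            current = []
    if current:
        yield " ".join(current)


sample = (
    "Fundurinn var 15.8.2024 í Reykjavík, t.d. í Hörpu. "
    "Sendu póst á mideind@mideind.is!"
)
for sentence in toy_sentences(sample):
    print(sentence)
```

Because the periods inside `15.8.2024` and `t.d.` stay inside their tokens, only the final period and the exclamation mark end a sentence in this sample. A known limitation of the sketch is that an abbreviation at the very end of a sentence does not close it.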


 Files in this item

This item is Publicly Available and licensed under: The MIT License (MIT)
Name Tokenizer-3.5.3.tar.gz
Size 265.82 KB
Format application/gzip
Description Unknown
MD5 502d504f47732f7089b736f817978496
File preview:
  • Tokenizer-3.5.3
    • src
      • tokenizer
        • abbrev.py (13 kB)
        • main.py (9 kB)
        • definitions.py (28 kB)
        • tokenizer.py (128 kB)
        • __init__.py (2 kB)
        • Abbrev.conf (46 kB)
    • README.md (34 kB)
    • .gitignore (1 kB)
    • pyproject.toml (2 kB)
    • CLAUDE.md (3 kB)
    • test
      • toktest_edgecases.txt (6 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • test_composite_glyphs.py (8 kB)
      • test_helper_functions.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_edgecases_gold_expected.txt (6 kB)
      • test_index_calculation.py (22 kB)
      • toktest_edgecases_diff.txt (751 B)
      • test_cli.py (7 kB)
      • toktest_sentences.txt (21 kB)
      • test_abbrev.py (984 B)
      • toktest_large.txt (559 kB)
      • test_tokenizer.py (103 kB)
      • toktest_normal.txt (12 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • example.txt (3 kB)
      • test_detokenize.py (2 kB)
      • test_tokenizer_tok.py (18 kB)
    • .github
    • LICENSE.txt (1 kB)
    • perf.py (977 B)
    • MANIFEST.in (104 B)
    • pax_global_header (52 B)
Name Tokenizer-3.5.3.zip
Size 283.79 KB
Format application/zip
Description Unknown
MD5 1199c654b1a8e4fe437c9a2e03f48577
File preview:
  • Tokenizer-3.5.3
    • src
      • tokenizer
        • abbrev.py (13 kB)
        • main.py (9 kB)
        • definitions.py (28 kB)
        • tokenizer.py (128 kB)
        • __init__.py (2 kB)
        • Abbrev.conf (46 kB)
    • README.md (34 kB)
    • .gitignore (1 kB)
    • pyproject.toml (2 kB)
    • CLAUDE.md (3 kB)
    • test
      • toktest_edgecases.txt (6 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • test_composite_glyphs.py (8 kB)
      • test_helper_functions.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_edgecases_gold_expected.txt (6 kB)
      • test_index_calculation.py (22 kB)
      • toktest_edgecases_diff.txt (751 B)
      • test_cli.py (7 kB)
      • toktest_sentences.txt (21 kB)
      • test_abbrev.py (984 B)
      • toktest_large.txt (559 kB)
      • test_tokenizer.py (103 kB)
      • toktest_normal.txt (12 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • example.txt (3 kB)
      • test_detokenize.py (2 kB)
      • test_tokenizer_tok.py (18 kB)
    • .github
    • LICENSE.txt (1 kB)
    • perf.py (977 B)
    • MANIFEST.in (104 B)
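
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are hypothetical, and it assumes the archives were saved under the file names shown in this record. MD5 is suitable here only as an integrity check, not as a security guarantee.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# File names and MD5 digests as listed in this record.
EXPECTED = {
    "Tokenizer-3.5.3.tar.gz": "502d504f47732f7089b736f817978496",
    "Tokenizer-3.5.3.zip": "1199c654b1a8e4fe437c9a2e03f48577",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Union[str, Path] = ".") -> Dict[str, bool]:
    """Compare each archive found in `directory` against its listed digest.

    Archives that are not present are skipped rather than reported as failures.
    """
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```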
