dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Óladóttir, Hulda
dc.contributor.author Þórðarson, Sveinbjörn
dc.date.accessioned 2020-09-28T14:55:24Z
dc.date.available 2020-09-28T14:55:24Z
dc.date.issued 2020-09-25
dc.identifier.uri http://hdl.handle.net/20.500.12537/65
dc.description Tokenizer is a compact, pure-Python (2.7 and 3) command-line tool and module for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number or amount, date or time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, taking into account corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
dc.language.iso isl
dc.publisher Miðeind ehf.
dc.relation.replaces http://hdl.handle.net/20.500.12537/11
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/136
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://github.com/mideind/Tokenizer/releases/tag/2.3.1
dc.subject tokenization
dc.subject sentence detection
dc.subject token detection
dc.title Tokenizer for Icelandic text (2.3.1)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson; mideind@mideind.is; Miðeind ehf.
sponsor Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds
files.size 493897
files.count 2
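
The behaviour summarized in dc.description (typed tokens, plus sentence segmentation that does not break on abbreviations or dates) can be illustrated with a small sketch. This is not the Tokenizer package's code or API: the function names `tokenize` and `split_sentences`, the token kind labels, and the three-entry `ABBREVIATIONS` set are hypothetical and chosen only for illustration, whereas the real package ships a much larger Abbrev.conf and handles many more cases.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption for this
# sketch only; the real package reads a large Abbrev.conf file).
ABBREVIATIONS = {"t.d.", "o.s.frv.", "þ.e."}

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def classify(chunk):
    """Return a coarse token kind for a chunk with no trailing punctuation."""
    # Order matters: a date such as 25.9.2020 would also match NUMBER_RE.
    for kind, pattern in (("URL", URL_RE), ("EMAIL", EMAIL_RE),
                          ("DATE", DATE_RE), ("NUMBER", NUMBER_RE)):
        if pattern.fullmatch(chunk):
            return kind
    return "WORD"


def tokenize(text):
    """Split text on whitespace and emit (kind, text) pairs."""
    tokens = []
    for chunk in text.split():
        # Known abbreviations keep their periods and stay one token.
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(("ABBREV", chunk))
            continue
        # Peel trailing punctuation off into separate tokens.
        trailing = []
        while chunk and chunk[-1] in ".,!?;:":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append((classify(chunk), chunk))
        tokens.extend(("PUNCT", p) for p in reversed(trailing))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, ending each at a '.', '!' or '?' token."""
    sentences, current = [], []
    for kind, txt in tokens:
        current.append((kind, txt))
        if kind == "PUNCT" and txt in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    text = "Hann kom 25.9.2020, t.d. með strætó. Hún fór heim."
    for sentence in split_sentences(tokenize(text)):
        print(sentence)
```

Because the periods inside the date and the abbreviation never become standalone punctuation tokens, the sample text is split into two sentences rather than at every period.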


 Files in this item

This item is publicly available and licensed under: The MIT License (MIT)
Name: Tokenizer-2.3.1.zip
Size: 247.28 KB
Format: application/zip
Description: Tokenizer - zip
MD5: eb66ff0e05901ee831c875cc83a854d9
File preview:
  • Tokenizer-2.3.1
    • src
      • tokenizer
        • abbrev.py (11 kB)
        • main.py (6 kB)
        • definitions.py (26 kB)
        • tokenizer.pyi (7 kB)
        • tokenizer.py (90 kB)
        • __init__.py (1 kB)
        • py.typed (0 B)
        • Abbrev.conf (45 kB)
    • setup.py (3 kB)
    • .gitignore (1 kB)
    • README.rst (35 kB)
    • setup.cfg (64 B)
    • .travis.yml (230 B)
    • test
      • toktest_large.txt (559 kB)
      • toktest_normal.txt (12 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • test_detokenize.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_sentences.txt (21 kB)
      • example.txt (3 kB)
      • test_tokenizer.py (53 kB)
    • release.sh (246 B)
    • LICENSE (1 kB)
    • MANIFEST.in (74 B)
Name: Tokenizer-2.3.1.tar.gz
Size: 235.04 KB
Format: application/gzip
Description: Tokenizer - tar.gz
MD5: 296fca3864470b8b07e7047669b17968
File preview:
  • Tokenizer-2.3.1
    • src
      • tokenizer
        • abbrev.py (11 kB)
        • main.py (6 kB)
        • definitions.py (26 kB)
        • tokenizer.pyi (7 kB)
        • tokenizer.py (90 kB)
        • __init__.py (1 kB)
        • py.typed (0 B)
        • Abbrev.conf (45 kB)
    • setup.py (3 kB)
    • .gitignore (1 kB)
    • README.rst (35 kB)
    • setup.cfg (64 B)
    • .travis.yml (230 B)
    • test
      • toktest_large.txt (559 kB)
      • toktest_normal.txt (12 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • test_detokenize.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_sentences.txt (21 kB)
      • example.txt (3 kB)
      • test_tokenizer.py (53 kB)
    • release.sh (246 B)
    • LICENSE (1 kB)
    • MANIFEST.in (74 B)
    • pax_global_header (52 B)
