dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Óladóttir, Hulda
dc.contributor.author Þórðarson, Sveinbjörn
dc.contributor.author Ragnarsson, Pétur Orri
dc.date.accessioned 2021-09-28T17:19:15Z
dc.date.available 2021-09-28T17:19:15Z
dc.date.issued 2021-09-28
dc.identifier.uri http://hdl.handle.net/20.500.12537/136
dc.description Tokenizer is a compact pure-Python (2.7 and 3) executable program and module for tokenizing Icelandic text. It converts input text into streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences; a brief usage sketch is given after the metadata fields below. More information at: https://github.com/mideind/Tokenizer
dc.language.iso isl
dc.publisher Miðeind ehf.
dc.relation.replaces http://hdl.handle.net/20.500.12537/65
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/178
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://github.com/mideind/Tokenizer/releases/tag/3.3.2
dc.subject tokenization
dc.subject sentence detection
dc.subject token detection
dc.title Tokenizer for Icelandic text (3.3.2)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson, mideind@mideind.is, Miðeind ehf.
sponsor Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds
files.size 539198
files.count 2
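
Usage sketch (not part of the archived record): a minimal example, assuming the tokenize() and split_into_sentences() generators and the TOK.descr lookup table described in the project README at https://github.com/mideind/Tokenizer; the sample Icelandic sentence is illustrative only.

    from tokenizer import TOK, tokenize, split_into_sentences

    text = "Þetta gerðist 10. janúar 2021 kl. 11:30. Sjá nánar á www.greynir.is."

    # Token stream: each token has a kind (word, punctuation, number/amount,
    # date, e-mail, URL, ...) and the matched text; sentence boundaries appear
    # as S_BEGIN/S_END marker tokens that carry no text.
    for tok in tokenize(text):
        if tok.kind == TOK.S_BEGIN:
            print("-- sentence start --")
        elif tok.txt:
            print(TOK.descr[tok.kind], tok.txt)

    # Sentence segmentation only: one sentence per output string, with corner
    # cases such as the abbreviation "kl." and the mid-sentence date handled.
    for sentence in split_into_sentences(text):
        print(sentence)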


 Files in this item

This item is Publicly Available and licensed under: The MIT License (MIT)
Name: Tokenizer-3.3.2.tar.gz
Size: 254.99 KB
Format: application/gzip
Description: Unknown
MD5: ceca503cfa798eb43225163fc56aad76
 File Preview  
  • Tokenizer-3.3.2
    • src
      • tokenizer
        • abbrev.py (13 kB)
        • main.py (9 kB)
        • definitions.py (26 kB)
        • tokenizer.py (120 kB)
        • __init__.py (2 kB)
        • version.py (22 B)
        • py.typed (0 B)
        • Abbrev.conf (46 kB)
    • setup.py (3 kB)
    • .gitignore (1 kB)
    • README.rst (39 kB)
    • setup.cfg (64 B)
    • test
      • toktest_large.txt (559 kB)
      • test_tokenizer_tok.py (12 kB)
      • toktest_normal.txt (12 kB)
      • test_index_calculation.py (20 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • toktest_edgecases.txt (6 kB)
      • toktest_edgecases_gold_expected.txt (6 kB)
      • test_detokenize.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_sentences.txt (21 kB)
      • example.txt (3 kB)
      • test_tokenizer.py (97 kB)
      • toktest_edgecases_diff.txt (751 B)
    • .github
    • release.sh (251 B)
    • LICENSE (1 kB)
    • MANIFEST.in (74 B)
    • pax_global_header (52 B)
Name: Tokenizer-3.3.2.zip
Size: 271.57 KB
Format: application/zip
Description: Unknown
MD5: b3a2d787e407dd8e4cf3fb58dc0746fc
 File Preview  
  • Tokenizer-3.3.2
    • src
      • tokenizer
        • abbrev.py (13 kB)
        • main.py (9 kB)
        • definitions.py (26 kB)
        • tokenizer.py (120 kB)
        • __init__.py (2 kB)
        • version.py (22 B)
        • py.typed (0 B)
        • Abbrev.conf (46 kB)
    • setup.py (3 kB)
    • .gitignore (1 kB)
    • README.rst (39 kB)
    • setup.cfg (64 B)
    • test
      • toktest_large.txt (559 kB)
      • test_tokenizer_tok.py (12 kB)
      • toktest_normal.txt (12 kB)
      • test_index_calculation.py (20 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • toktest_edgecases.txt (6 kB)
      • toktest_edgecases_gold_expected.txt (6 kB)
      • test_detokenize.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_sentences.txt (21 kB)
      • example.txt (3 kB)
      • test_tokenizer.py (97 kB)
      • toktest_edgecases_diff.txt (751 B)
    • .github
    • release.sh (251 B)
    • LICENSE (1 kB)
    • MANIFEST.in (74 B)
