dc.contributor.author Þorsteinsson, Vilhjálmur
dc.contributor.author Óladóttir, Hulda
dc.contributor.author Þórðarson, Sveinbjörn
dc.contributor.author Ragnarsson, Pétur Orri
dc.contributor.author Jónsson, Haukur Páll
dc.contributor.author Eyjólfsson, Logi
dc.date.accessioned 2022-09-26T13:14:54Z
dc.date.available 2022-09-26T13:14:54Z
dc.date.issued 2022-09-23
dc.identifier.uri http://hdl.handle.net/20.500.12537/262
dc.description Tokenizer is a compact pure-Python 3 package and command-line tool for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date/time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, taking into account corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
dc.language.iso isl
dc.publisher Miðeind ehf.
dc.relation.replaces http://hdl.handle.net/20.500.12537/219
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://github.com/mideind/Tokenizer/releases/tag/3.4.2
dc.subject tokenization
dc.subject sentence detection
dc.subject token detection
dc.title Tokenizer for Icelandic text (3.4.2) (22.10)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Vilhjálmur Þorsteinsson mideind@mideind.is Miðeind ehf.
sponsor Ministry of Education, Science and Culture Tokenizer (I3) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 281084
files.count 1


Files in this item

This item is publicly available and licensed under: The MIT License (MIT)

Name Tokenizer-3.4.2.zip
Size 274.5 KB
Format application/zip
Description Unknown
MD5 ec6d525b50278b069275a312b21dc7b5
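
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_expected` and the local path `Tokenizer-3.4.2.zip` are assumptions made for illustration and are not part of the repository or of the Tokenizer package.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record for Tokenizer-3.4.2.zip.
EXPECTED_MD5 = "ec6d525b50278b069275a312b21dc7b5"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded archive; adjust as needed.
    archive = Path("Tokenizer-3.4.2.zip")
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is suitable here only for detecting accidental corruption during download; it does not protect against deliberate tampering.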
 File Preview  
  • Tokenizer-3.4.2
    • src
      • tokenizer
        • abbrev.py (13 kB)
        • main.py (9 kB)
        • definitions.py (27 kB)
        • tokenizer.py (125 kB)
        • __init__.py (2 kB)
        • version.py (22 B)
        • py.typed (0 B)
        • Abbrev.conf (46 kB)
    • setup.py (3 kB)
    • .gitignore (1 kB)
    • README.rst (39 kB)
    • setup.cfg (64 B)
    • test
      • toktest_large.txt (559 kB)
      • test_tokenizer_tok.py (18 kB)
      • toktest_normal.txt (12 kB)
      • test_index_calculation.py (20 kB)
      • toktest_normal_gold_expected.txt (13 kB)
      • toktest_edgecases.txt (6 kB)
      • toktest_edgecases_gold_expected.txt (6 kB)
      • test_detokenize.py (2 kB)
      • Overview.txt (34 kB)
      • toktest_large_gold_perfect.txt (576 kB)
      • toktest_large_gold_acceptable.txt (583 kB)
      • toktest_sentences.txt (21 kB)
      • example.txt (3 kB)
      • test_tokenizer.py (102 kB)
      • toktest_edgecases_diff.txt (751 B)
    • .github
    • release.sh (251 B)
    • LICENSE (1 kB)
    • MANIFEST.in (74 B)
