Show simple item record
dc.contributor.author
Þorsteinsson, Vilhjálmur
dc.contributor.author
Óladóttir, Hulda
dc.contributor.author
Þórðarson, Sveinbjörn
dc.date.accessioned
2020-09-28T14:55:24Z
dc.date.available
2020-09-28T14:55:24Z
dc.date.issued
2020-09-25
dc.identifier.uri
http://hdl.handle.net/20.500.12537/65
dc.description
Tokenizer is a compact pure-Python (2.7 and 3) package and command-line tool for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a single word, punctuation mark, number or amount, date or time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
dc.language.iso
isl
dc.publisher
Miðeind ehf.
dc.relation.replaces
http://hdl.handle.net/20.500.12537/11
dc.relation.isreplacedby
http://hdl.handle.net/20.500.12537/136
dc.rights
The MIT License (MIT)
dc.rights.uri
https://opensource.org/licenses/mit-license.php
dc.rights.label
PUB
dc.source.uri
https://github.com/mideind/Tokenizer/releases/tag/2.3.1
dc.subject
tokenization
dc.subject
sentence detection
dc.subject
token detection
dc.title
Tokenizer for Icelandic text (2.3.1)
dc.type
toolService
metashare.ResourceInfo#ContentInfo.detailedType
tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent
true
has.files
yes
branding
Clarin IS Repository
contact.person
Vilhjálmur Þorsteinsson, mideind@mideind.is, Miðeind ehf.
sponsor
Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds
files.size
493897
files.count
2
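The description in this record says the tool classifies tokens (words, punctuation, numbers, dates, e-mail addresses, URLs) and keeps abbreviation periods from ending a sentence. The following is a toy sketch of that idea using only the Python standard library. It is not the Tokenizer package's implementation or API: the names `toy_tokenize`, `toy_split_sentences` and `ABBREVIATIONS`, the regular expressions, and the three-entry abbreviation set are all assumptions made for illustration. It does not handle Icelandic ordinal dates such as "3. janúar".

```python
import re

# Hypothetical, tiny abbreviation set for this sketch only.
ABBREVIATIONS = {"skv.", "t.d.", "o.s.frv."}

# Ordered alternatives: the first pattern that matches at a position wins.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://\S+)"
    r"|(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<DATE>\d{4}-\d{2}-\d{2})"
    r"|(?P<NUMBER>\d+(?:[.,]\d+)*)"
    r"|(?P<WORD>\w+(?:\.\w+)*)"
    r"|(?P<PUNCTUATION>[^\w\s])"
)


def toy_tokenize(text):
    """Yield (kind, text) pairs for each token found in text."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            pos += 1  # whitespace: skip one character
            continue
        kind, txt, end = match.lastgroup, match.group(), match.end()
        # Keep the trailing period attached when it completes a known
        # abbreviation, so it is not mistaken for a sentence terminator.
        if (
            kind == "WORD"
            and text[end:end + 1] == "."
            and (txt + ".").lower() in ABBREVIATIONS
        ):
            txt += "."
            end += 1
        yield kind, txt
        pos = end


def toy_split_sentences(tokens):
    """Group token texts into sentences, closing each at '.', '!' or '?'."""
    sentence = []
    for kind, txt in tokens:
        sentence.append(txt)
        if kind == "PUNCTUATION" and txt in ".!?":
            yield sentence
            sentence = []
    if sentence:
        yield sentence  # trailing tokens without a terminator


if __name__ == "__main__":
    sample = (
        "Sjá skv. reglum t.d. https://github.com/mideind/Tokenizer "
        "eða mideind@mideind.is. Útgáfa kom 2020-09-25!"
    )
    for sent in toy_split_sentences(toy_tokenize(sample)):
        print(" ".join(sent))
```

Because the abbreviation check runs before the period is emitted as punctuation, "skv." and "t.d." stay inside the first sentence of the sample. The real package draws on a much larger abbreviation list (the archive includes an Abbrev.conf file) and should be consulted for actual behaviour.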
Files in this item
This item is Publicly Available and licensed under: The MIT License (MIT)
Name
Tokenizer-2.3.1.zip
Size
247.28 KB
Format
application/zip
Description
Tokenizer - zip
MD5
eb66ff0e05901ee831c875cc83a854d9
Preview
Tokenizer-2.3.1/
  src/
    tokenizer/
      abbrev.py (11 kB)
      main.py (6 kB)
      definitions.py (26 kB)
      tokenizer.pyi (7 kB)
      tokenizer.py (90 kB)
      __init__.py (1 kB)
      py.typed (0 B)
      Abbrev.conf (45 kB)
  setup.py (3 kB)
  .gitignore (1 kB)
  README.rst (35 kB)
  setup.cfg (64 B)
  .travis.yml (230 B)
  test/
    toktest_large.txt (559 kB)
    toktest_normal.txt (12 kB)
    toktest_normal_gold_expected.txt (13 kB)
    test_detokenize.py (2 kB)
    Overview.txt (34 kB)
    toktest_large_gold_perfect.txt (576 kB)
    toktest_large_gold_acceptable.txt (583 kB)
    toktest_sentences.txt (21 kB)
    example.txt (3 kB)
    test_tokenizer.py (53 kB)
  release.sh (246 B)
  LICENSE (1 kB)
  MANIFEST.in (74 B)
Name
Tokenizer-2.3.1.tar.gz
Size
235.04 KB
Format
application/gzip
Description
Tokenizer - tar.gz
MD5
296fca3864470b8b07e7047669b17968
Preview
Same file listing as for Tokenizer-2.3.1.zip above.
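The MD5 values listed for the two archives can be used to check a download for corruption. A minimal sketch, assuming the archives have been saved locally under the file names shown in this record; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of the repository or the Tokenizer package.

```python
import hashlib

# MD5 digests copied from this record.
EXPECTED_MD5 = {
    "Tokenizer-2.3.1.zip": "eb66ff0e05901ee831c875cc83a854d9",
    "Tokenizer-2.3.1.tar.gz": "296fca3864470b8b07e7047669b17968",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at path matches the MD5 listed for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("Tokenizer-2.3.1.zip", "Tokenizer-2.3.1.zip")` would return True for an intact copy in the current directory. MD5 detects accidental corruption but is not a safeguard against deliberate tampering.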