Show simple item record
dc.contributor.author
Þorsteinsson, Vilhjálmur
dc.contributor.author
Óladóttir, Hulda
dc.contributor.author
Þórðarson, Sveinbjörn
dc.contributor.author
Ragnarsson, Pétur Orri
dc.date.accessioned
2021-09-28T17:19:15Z
dc.date.available
2021-09-28T17:19:15Z
dc.date.issued
2021-09-28
dc.identifier.uri
http://hdl.handle.net/20.500.12537/136
dc.description
Tokenizer is a compact pure-Python (2.7 and 3) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
Tokenizer is a package for Python 2.7 and 3, together with a command-line tool, that handles tokenization of Icelandic text. The package converts input text into a token stream. Each token is a single word, punctuation mark, number/amount, date/time, e-mail address, URL, etc. The tool also splits the token stream into sentences and takes into account edge cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
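The description above names two steps: classifying spans of text as typed tokens, and then cutting the token stream into sentences without being misled by periods inside abbreviations or dates. The following is a minimal standard-library sketch of that idea, not the Tokenizer package's own code or API. The `Token` type, the kind labels, the regular expressions and the three-entry abbreviation list are all assumptions made for illustration; the released archive ships a much larger `Abbrev.conf`.

```python
import re
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical label, not one of the package's own token kinds
    text: str


# Tiny illustrative abbreviation list (assumption, for the sketch only).
ABBREVIATIONS = ("o.s.frv.", "t.d.", "þ.e.")
_ABBREV_PATTERN = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)

# Ordered from most to least specific: the first alternative that matches wins.
TOKEN_SPEC = [
    ("URL", r"https?://\S*[^\s.,;:!?]"),
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("ABBREV", _ABBREV_PATTERN),
    ("DATE", r"\d{4}-\d{2}-\d{2}"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ("WORD", r"\w+"),
    ("PUNCT", r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text: str) -> Iterator[Token]:
    """Yield typed tokens; whitespace between tokens is skipped."""
    for match in TOKEN_RE.finditer(text):
        yield Token(match.lastgroup, match.group())


def split_sentences(text: str) -> List[List[Token]]:
    """Group tokens into sentences, ending a sentence at '.', '!' or '?'.

    Periods inside abbreviations, dates and numbers belong to those tokens,
    so they never reach this check as separate PUNCT tokens.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for tok in tokenize(text):
        current.append(tok)
        if tok.kind == "PUNCT" and tok.text in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = "Ég keypti t.d. 3 bækur 2021-09-28. Sendu póst á mideind@mideind.is!"
    for sentence in split_sentences(sample):
        print([(tok.kind, tok.text) for tok in sentence])
```

A known limitation of this sketch is that an abbreviation standing at the very end of a sentence keeps its final period, so no sentence break is detected there. Handling such corner cases is the kind of work the description attributes to the real package.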
dc.language.iso
isl
dc.publisher
Miðeind ehf.
dc.relation.replaces
http://hdl.handle.net/20.500.12537/65
dc.relation.isreplacedby
http://hdl.handle.net/20.500.12537/178
dc.rights
The MIT License (MIT)
dc.rights.uri
https://opensource.org/licenses/mit-license.php
dc.rights.label
PUB
dc.source.uri
https://github.com/mideind/Tokenizer/releases/tag/3.3.2
dc.subject
tokenization
dc.subject
sentence detection
dc.subject
token detection
dc.title
Tokenizer for Icelandic text (3.3.2)
dc.type
toolService
metashare.ResourceInfo#ContentInfo.detailedType
tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent
true
has.files
yes
branding
Clarin IS Repository
contact.person
Vilhjálmur Þorsteinsson mideind@mideind.is Miðeind ehf.
sponsor
Ministry of Education, Science and Culture Tokenizer (I3) Language Technology for Icelandic 2019-2023 nationalFunds
files.size
539198
files.count
2
Files in this item
Download all files in item (526.56 KB)
This item is Publicly Available and licensed under: The MIT License (MIT)
Name
Tokenizer-3.3.2.tar.gz
Size
254.99 KB
Format
application/gzip
Description
Unknown
MD5
ceca503cfa798eb43225163fc56aad76
Preview
Tokenizer-3.3.2/
  src/
    tokenizer/
      abbrev.py  13 kB
      main.py  9 kB
      definitions.py  26 kB
      tokenizer.py  120 kB
      __init__.py  2 kB
      version.py  22 B
      py.typed  0 B
      Abbrev.conf  46 kB
  setup.py  3 kB
  .gitignore  1 kB
  README.rst  39 kB
  setup.cfg  64 B
  test/
    toktest_large.txt  559 kB
    test_tokenizer_tok.py  12 kB
    toktest_normal.txt  12 kB
    test_index_calculation.py  20 kB
    toktest_normal_gold_expected.txt  13 kB
    toktest_edgecases.txt  6 kB
    toktest_edgecases_gold_expected.txt  6 kB
    test_detokenize.py  2 kB
    Overview.txt  34 kB
    toktest_large_gold_perfect.txt  576 kB
    toktest_large_gold_acceptable.txt  583 kB
    toktest_sentences.txt  21 kB
    example.txt  3 kB
    test_tokenizer.py  97 kB
    toktest_edgecases_diff.txt  751 B
  .github/
  release.sh  251 B
  LICENSE  1 kB
  MANIFEST.in  74 B
Name
Tokenizer-3.3.2.zip
Size
271.57 KB
Format
application/zip
Description
Unknown
MD5
b3a2d787e407dd8e4cf3fb58dc0746fc
Preview
Same file listing as Tokenizer-3.3.2.tar.gz
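The record publishes an MD5 value for each archive, which can be used to check that a download arrived intact. The sketch below shows one way to do that with Python's standard `hashlib` module. The function names `md5_of_file` and `verify_download` are hypothetical helpers, and the code assumes the archive has already been saved locally under its original file name. MD5 is suitable here for detecting corrupted transfers, not as a security guarantee.

```python
import hashlib
from pathlib import Path

# MD5 values copied from this record.
PUBLISHED_MD5 = {
    "Tokenizer-3.3.2.tar.gz": "ceca503cfa798eb43225163fc56aad76",
    "Tokenizer-3.3.2.zip": "b3a2d787e407dd8e4cf3fb58dc0746fc",
}


def md5_of_file(path, chunk_size: int = 1 << 16) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path) -> bool:
    """Compare a local archive against the MD5 value listed in the record.

    Hypothetical helper: the file name must match one of the two archives.
    """
    path = Path(path)
    expected = PUBLISHED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No published MD5 for {path.name!r}")
    return md5_of_file(path) == expected


if __name__ == "__main__":
    archive = Path("Tokenizer-3.3.2.tar.gz")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```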