Show simple item record
dc.contributor.author
Þorsteinsson, Vilhjálmur
dc.contributor.author
Óladóttir, Hulda
dc.contributor.author
Þórðarson, Sveinbjörn
dc.contributor.author
Ragnarsson, Pétur Orri
dc.contributor.author
Jónsson, Haukur Páll
dc.contributor.author
Eyjólfsson, Logi
dc.date.accessioned
2022-06-01T09:16:54Z
dc.date.available
2022-06-01T09:16:54Z
dc.date.issued
2022-05-31
dc.identifier.uri
http://hdl.handle.net/20.500.12537/219
dc.description
Tokenizer is a compact pure-Python (2.7 and 3) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
dc.description
Tokenizer is a package for Python 2.7 and 3, together with a command-line tool, that handles tokenization of Icelandic text. The package converts input text into a token stream. Each token is a single word, punctuation mark, number/amount, date/time, e-mail address, URL, etc. The tool also splits the token stream into sentences, taking into account edge cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
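The description above can be illustrated with a minimal, self-contained sketch. It is not the package's implementation or API: the names `ABBREVIATIONS`, `TOKEN_RE`, `tokenize` and `split_sentences` below are hypothetical, the abbreviation set is a three-entry stand-in for the much larger Abbrev.conf shipped with the package, and the patterns cover only a few token kinds. It only shows the general idea of classifying tokens first so that dots inside dates, numbers and abbreviations do not end a sentence.

```python
import re

# Hypothetical abbreviation set, for illustration only.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "þ.e."}

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
  | (?P<url>https?://\S*[^\s.,;:!?)])
  | (?P<date>\d{1,2}\.\d{1,2}\.\d{4}\b)
  | (?P<number>\d+(?:[.,]\d+)*)
  | (?P<word>\w+(?:\.\w+)*\.?)
  | (?P<punct>[^\w\s])
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Yield (kind, text) pairs for each token found in text."""
    for match in TOKEN_RE.finditer(text):
        kind, txt = match.lastgroup, match.group()
        if kind == "word" and txt.endswith("."):
            if txt.lower() in ABBREVIATIONS:
                # Known abbreviation: keep the trailing dot attached.
                yield ("abbrev", txt)
            else:
                # Ordinary word followed by a full stop: split them.
                yield ("word", txt[:-1])
                yield ("punct", ".")
        else:
            yield (kind, txt)


def split_sentences(tokens):
    """Group tokens into sentences, ending at '.', '!' or '?' punctuation."""
    sentence = []
    for kind, txt in tokens:
        sentence.append((kind, txt))
        if kind == "punct" and txt in ".!?":
            yield sentence
            sentence = []
    if sentence:
        yield sentence


if __name__ == "__main__":
    sample = "Ég kom 31.5.2022, t.d. með 3,5 kg. Sendu póst á mideind@mideind.is!"
    for sent in split_sentences(tokenize(sample)):
        print(sent)
```

In this sketch the sample text yields two sentences, because the dots in the date and in the abbreviation are consumed by their own tokens before sentence splitting happens. The actual package handles many more token kinds and edge cases; see its README for the supported interface.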
dc.language.iso
isl
dc.publisher
Miðeind ehf.
dc.relation.replaces
http://hdl.handle.net/20.500.12537/178
dc.relation.isreplacedby
http://hdl.handle.net/20.500.12537/262
dc.rights
The MIT License (MIT)
dc.rights.uri
https://opensource.org/licenses/mit-license.php
dc.rights.label
PUB
dc.source.uri
https://github.com/mideind/Tokenizer/releases/tag/3.4.1
dc.subject
tokenization
dc.subject
sentence detection
dc.subject
token detection
dc.title
Tokenizer for Icelandic text (3.4.1) (2022-05-31)
dc.type
toolService
metashare.ResourceInfo#ContentInfo.detailedType
tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent
true
has.files
yes
branding
Clarin IS Repository
contact.person
Vilhjálmur Þorsteinsson mideind@mideind.is Miðeind ehf.
sponsor
Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds
files.size
545063
files.count
2
Files in this item
Download all files in item (532.29 KB)
This item is Publicly Available and licensed under: The MIT License (MIT)
Name
Tokenizer-3.4.1.tar.gz
Size
258.02 KB
Format
application/gzip
Description
Unknown
MD5
91bb9fa5417e9aca9041953a9b3c486a
Preview
Tokenizer-3.4.1/
  src/
    tokenizer/
      abbrev.py (13 kB)
      main.py (9 kB)
      definitions.py (27 kB)
      tokenizer.py (125 kB)
      __init__.py (2 kB)
      version.py (22 B)
      py.typed (0 B)
      Abbrev.conf (46 kB)
  setup.py (3 kB)
  .gitignore (1 kB)
  README.rst (39 kB)
  setup.cfg (64 B)
  test/
    toktest_large.txt (559 kB)
    test_tokenizer_tok.py (18 kB)
    toktest_normal.txt (12 kB)
    test_index_calculation.py (20 kB)
    toktest_normal_gold_expected.txt (13 kB)
    toktest_edgecases.txt (6 kB)
    toktest_edgecases_gold_expected.txt (6 kB)
    test_detokenize.py (2 kB)
    Overview.txt (34 kB)
    toktest_large_gold_perfect.txt (576 kB)
    toktest_large_gold_acceptable.txt (583 kB)
    toktest_sentences.txt (21 kB)
    example.txt (3 kB)
    test_tokenizer.py (102 kB)
    toktest_edgecases_diff.txt (751 B)
  .github/
  release.sh (251 B)
  LICENSE (1 kB)
  MANIFEST.in (74 B)
Name
Tokenizer-3.4.1.zip
Size
274.26 KB
Format
application/zip
Description
Unknown
MD5
1578d146018d976c5eb4be0e346784f5
Preview
Tokenizer-3.4.1/
  src/
    tokenizer/
      abbrev.py (13 kB)
      main.py (9 kB)
      definitions.py (27 kB)
      tokenizer.py (125 kB)
      __init__.py (2 kB)
      version.py (22 B)
      py.typed (0 B)
      Abbrev.conf (46 kB)
  setup.py (3 kB)
  .gitignore (1 kB)
  README.rst (39 kB)
  setup.cfg (64 B)
  test/
    toktest_large.txt (559 kB)
    test_tokenizer_tok.py (18 kB)
    toktest_normal.txt (12 kB)
    test_index_calculation.py (20 kB)
    toktest_normal_gold_expected.txt (13 kB)
    toktest_edgecases.txt (6 kB)
    toktest_edgecases_gold_expected.txt (6 kB)
    test_detokenize.py (2 kB)
    Overview.txt (34 kB)
    toktest_large_gold_perfect.txt (576 kB)
    toktest_large_gold_acceptable.txt (583 kB)
    toktest_sentences.txt (21 kB)
    example.txt (3 kB)
    test_tokenizer.py (102 kB)
    toktest_edgecases_diff.txt (751 B)
  .github/
  release.sh (251 B)
  LICENSE (1 kB)
  MANIFEST.in (74 B)