dc.contributor.author |
Þorsteinsson, Vilhjálmur |
dc.date.accessioned |
2020-01-14T09:45:05Z |
dc.date.available |
2020-01-14T09:45:05Z |
dc.date.issued |
2020-01-14 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/11 |
dc.description |
Tokenizer is a compact pure-Python (Python 2 and 3) executable program and module for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a word, punctuation mark, number or amount, date, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences. |
dc.language.iso |
isl |
dc.publisher |
Miðeind ehf. |
dc.relation.isreplacedby |
http://hdl.handle.net/20.500.12537/65 |
dc.rights |
The MIT License (MIT) |
dc.rights.uri |
http://opensource.org/licenses/mit-license.php |
dc.rights.label |
PUB |
dc.source.uri |
https://github.com/mideind/Tokenizer/releases/tag/2.0.3 |
dc.subject |
tokenization |
dc.subject |
token detection |
dc.subject |
sentence detection |
dc.title |
Tokenizer for Icelandic text (2.0.3) |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
hidden |
false |
hasMetadata |
false |
has.files |
yes |
branding |
Clarin IS Repository |
contact.person |
Vilhjálmur Þorsteinsson, vthorsteinsson@mideind.is, Miðeind ehf. |
files.size |
245368 |
files.count |
1 |
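
The dc.description field above says the tool turns text into typed tokens and then groups them into sentences, without ending a sentence at an abbreviation or a date. The sketch below illustrates that idea only. It is a toy written for this record, not the Tokenizer package's code or API: the names `Token`, `toy_tokenize`, `toy_sentences`, the token kinds, and the three-entry abbreviation list are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical kind label, e.g. "WORD" or "DATE"
    text: str


# Hypothetical, tiny abbreviation list; a real tokenizer needs a far larger one.
ABBREVIATIONS = {"t.d.", "þ.e.", "o.s.frv."}
TRAILING_PUNCTUATION = set(".,;:!?")
SENTENCE_ENDERS = {".", "!", "?"}


def classify(chunk: str) -> str:
    """Assign a kind to a whitespace-delimited chunk (toy rules only)."""
    if re.fullmatch(r"https?://\S+", chunk):
        return "URL"
    if re.fullmatch(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", chunk):
        return "EMAIL"
    if re.fullmatch(r"\d{1,2}\.\d{1,2}\.\d{4}", chunk):
        return "DATE"
    if re.fullmatch(r"\d+(?:[.,]\d+)?", chunk):
        return "NUMBER"
    if chunk in ABBREVIATIONS:
        return "ABBREVIATION"
    return "WORD"


def toy_tokenize(text: str) -> Iterator[Token]:
    """Yield tokens, peeling trailing punctuation off each chunk.

    Known abbreviations keep their final period, so that period is
    never emitted as a separate sentence-ending token.
    """
    for chunk in text.split():
        trailing: List[str] = []
        while (chunk and chunk not in ABBREVIATIONS
               and chunk[-1] in TRAILING_PUNCTUATION):
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            yield Token(classify(chunk), chunk)
        for mark in reversed(trailing):
            yield Token("PUNCTUATION", mark)


def toy_sentences(tokens: Iterable[Token]) -> Iterator[List[Token]]:
    """Group tokens into sentences, ending at '.', '!' or '?' tokens."""
    sentence: List[Token] = []
    for token in tokens:
        sentence.append(token)
        if token.kind == "PUNCTUATION" and token.text in SENTENCE_ENDERS:
            yield sentence
            sentence = []
    if sentence:  # text that does not end with a sentence ender
        yield sentence


if __name__ == "__main__":
    sample = ("Fundurinn var haldinn 14.1.2020 í Reykjavík, t.d. um máltækni. "
              "Sendu póst á info@example.is!")
    for sentence in toy_sentences(toy_tokenize(sample)):
        print([(t.kind, t.text) for t in sentence])
```

The periods inside the date and the abbreviation stay attached to their tokens, so only the two standalone sentence-ending marks close a sentence. The sketch does not handle Icelandic ordinal dates such as "14. janúar", which need context beyond a single chunk.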