dc.contributor.author |
Þorsteinsson, Vilhjálmur |
dc.contributor.author |
Óladóttir, Hulda |
dc.contributor.author |
Þórðarson, Sveinbjörn |
dc.contributor.author |
Ragnarsson, Pétur Orri |
dc.date.accessioned |
2021-09-28T17:19:15Z |
dc.date.available |
2021-09-28T17:19:15Z |
dc.date.issued |
2021-09-28 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/136 |
dc.description |
Tokenizer is a compact pure-Python (2.7 and 3) package and command-line tool for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a single word, punctuation mark, number/amount, date/time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, taking into account corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer |
dc.language.iso |
isl |
dc.publisher |
Miðeind ehf. |
dc.relation.replaces |
http://hdl.handle.net/20.500.12537/65 |
dc.relation.isreplacedby |
http://hdl.handle.net/20.500.12537/178 |
dc.rights |
The MIT License (MIT) |
dc.rights.uri |
https://opensource.org/licenses/mit-license.php |
dc.rights.label |
PUB |
dc.source.uri |
https://github.com/mideind/Tokenizer/releases/tag/3.3.2 |
dc.subject |
tokenization |
dc.subject |
sentence detection |
dc.subject |
token detection |
dc.title |
Tokenizer for Icelandic text (3.3.2) |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
has.files |
yes |
branding |
Clarin IS Repository |
contact.person |
Vilhjálmur Þorsteinsson; mideind@mideind.is; Miðeind ehf. |
sponsor |
Ministry of Education, Science and Culture; Tokenizer (I3); Language Technology for Icelandic 2019-2023; nationalFunds |
files.size |
539198 |
files.count |
2 |