Show simple item record
dc.contributor.author | Þorsteinsson, Vilhjálmur
dc.contributor.author | Óladóttir, Hulda
dc.contributor.author | Þórðarson, Sveinbjörn
dc.contributor.author | Ragnarsson, Pétur Orri
dc.contributor.author | Jónsson, Haukur Páll
dc.contributor.author | Eyjólfsson, Logi
dc.date.accessioned | 2022-09-26T13:14:54Z
dc.date.available | 2022-09-26T13:14:54Z
dc.date.issued | 2022-09-23
dc.identifier.uri | http://hdl.handle.net/20.500.12537/262
dc.description | Tokenizer is a compact, pure-Python 3 executable program and module for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
Tokenizer is a package for Python 3, together with a command-line tool, that tokenizes Icelandic text. The package converts input text into a token stream. Each token is a single word, punctuation mark, number/amount, date/time, e-mail address, URL, etc. The tool also splits the token stream into sentences, taking into account edge cases such as abbreviations and dates in the middle of sentences. Further information at: https://github.com/mideind/Tokenizer
dc.language.iso | isl
dc.publisher | Miðeind ehf.
dc.relation.replaces | http://hdl.handle.net/20.500.12537/219
dc.rights | The MIT License (MIT)
dc.rights.uri | https://opensource.org/licenses/mit-license.php
dc.rights.label | PUB
dc.source.uri | https://github.com/mideind/Tokenizer/releases/tag/3.4.2
dc.subject | tokenization
dc.subject | sentence detection
dc.subject | token detection
dc.title | Tokenizer for Icelandic text (3.4.2) (22.10)
dc.type | toolService
metashare.ResourceInfo#ContentInfo.detailedType | tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true
has.files | yes
branding | Clarin IS Repository
contact.person | Vilhjálmur Þorsteinsson mideind@mideind.is Miðeind ehf.
sponsor | Ministry of Education, Science and Culture Tokenizer (I3) Language Technology for Icelandic 2019-2023 nationalFunds
files.size | 281084
files.count | 1
Files in this item
This item is Publicly Available and licensed under: The MIT License (MIT)
- Name: Tokenizer-3.4.2.zip
- Size: 274.5 KB
- Format: application/zip
- Description: Unknown
- MD5: ec6d525b50278b069275a312b21dc7b5
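The size and MD5 checksum listed for the archive can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. It assumes the archive was saved in the current directory under the listed name Tokenizer-3.4.2.zip; the helper names md5_of_file and verify_download are illustrative and not part of the Tokenizer package. The expected values are copied from this record (files.size and the MD5 field).

```python
import hashlib
from pathlib import Path

# Values copied from this repository record.
EXPECTED_MD5 = "ec6d525b50278b069275a312b21dc7b5"
EXPECTED_SIZE = 281084  # bytes (files.size), shown on the page as 274.5 KB


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file has the expected byte size and MD5 digest."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("Tokenizer-3.4.2.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is suitable here only as an integrity check against accidental corruption; it does not authenticate the file.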
Preview
- Tokenizer-3.4.2
- src
- tokenizer
- abbrev.py (13 kB)
- main.py (9 kB)
- definitions.py (27 kB)
- tokenizer.py (125 kB)
- __init__.py (2 kB)
- version.py (22 B)
- py.typed (0 B)
- Abbrev.conf (46 kB)
- setup.py (3 kB)
- .gitignore (1 kB)
- README.rst (39 kB)
- setup.cfg (64 B)
- test
- toktest_large.txt (559 kB)
- test_tokenizer_tok.py (18 kB)
- toktest_normal.txt (12 kB)
- test_index_calculation.py (20 kB)
- toktest_normal_gold_expected.txt (13 kB)
- toktest_edgecases.txt (6 kB)
- toktest_edgecases_gold_expected.txt (6 kB)
- test_detokenize.py (2 kB)
- Overview.txt (34 kB)
- toktest_large_gold_perfect.txt (576 kB)
- toktest_large_gold_acceptable.txt (583 kB)
- toktest_sentences.txt (21 kB)
- example.txt (3 kB)
- test_tokenizer.py (102 kB)
- toktest_edgecases_diff.txt (751 B)
- .github
- release.sh (251 B)
- LICENSE (1 kB)
- MANIFEST.in (74 B)