dc.contributor.author | Nikulásdóttir, Anna Björk |
dc.date.accessioned | 2022-09-29T11:59:43Z |
dc.date.available | 2022-09-29T11:59:43Z |
dc.date.issued | 2022-10-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/279 |
dc.description | ENGLISH: This project provides a TTS text-processing pipeline for Icelandic. The pipeline includes modules for HTML parsing, text cleaning, text normalization for TTS, spell and grammar correction, phrasing, and grapheme-to-phoneme (g2p) conversion. Before a text can be fed into a TTS system, it has to be converted into the format that was used when training that system. The format can be grapheme-based (i.e. alphabetic characters of the language in question are used as input) or phoneme-based (i.e. a phonetic alphabet such as IPA or SAMPA is used as input). The TTS Textprocessing Pipeline for Icelandic offers both possibilities. ICELANDIC (translated to English): This software package contains a text-processing pipeline for Icelandic speech synthesizers. The pipeline consists of processing of HTML documents for audiobooks, text cleaning, text normalization, spelling correction, pause insertion, and automatic phonetic transcription. Before a text can be sent to a speech synthesizer it has to be preprocessed, e.g. by writing out digits and abbreviations, marking pauses, and finally bringing the text into the same form as the training data of the synthesizer that is to read it. Speech synthesizers are usually trained on phonetically transcribed texts, transcribed according to phonetic alphabets such as IPA or SAMPA, but they can also be trained directly on texts written in ordinary letters. The text-processing pipeline offers both options and also allows the text to be processed only partially. |
dc.language.iso | isl |
dc.publisher | Grammatek ehf. |
dc.rights | Apache License 2.0 |
dc.rights.uri | https://opensource.org/license/apache2-0-php/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/grammatek/tts-frontend/releases/tag/v1.0.1 |
dc.subject | text-to-speech |
dc.subject | text processing |
dc.subject | frontend processing |
dc.subject | text normalization |
dc.subject | grapheme-to-phoneme |
dc.title | TTS Text Processing (22.10) |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | suiteOfTools |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Anna Björk Nikulásdóttir; anna@grammatek.com; Grammatek ehf. |
sponsor | Ministry of Education, Science and Culture; Text preprocessing, normalization and phrasing (T9); Language Technology for Icelandic 2019-2023; nationalFunds |
files.size | 19761635 |
files.count | 1 |
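The description above lists the pipeline stages and notes that the output can be grapheme-based or phoneme-based, and that text can be processed only partially. The sketch below illustrates that staged design in general terms. It is not the tts-frontend API: every function name, the two-entry normalization table, and the letter-splitting "g2p" step are hypothetical stand-ins chosen for illustration. A real normalizer also has to choose inflected forms from context, which this toy table does not do.

```python
import re
from typing import Callable, List

# NOTE: all names here are hypothetical; none of them come from tts-frontend.
Stage = Callable[[str], str]


def strip_html(text: str) -> str:
    """Toy stand-in for HTML parsing: replace tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def clean(text: str) -> str:
    """Toy stand-in for text cleaning: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


# Illustrative table only; a real normalizer is context-sensitive.
TOY_NORMALIZATIONS = {"t.d.": "til dæmis", "2": "tveir"}


def normalize(text: str) -> str:
    """Toy stand-in for normalization: expand known tokens."""
    return " ".join(TOY_NORMALIZATIONS.get(tok, tok) for tok in text.split())


def toy_g2p(text: str) -> str:
    """Placeholder for g2p: one symbol per letter, NOT real phonetics."""
    return " _ ".join(" ".join(word) for word in text.split())


def build_stages(phoneme_output: bool, from_html: bool = False) -> List[Stage]:
    """Select stages; leaving stages out models partial processing."""
    stages: List[Stage] = []
    if from_html:
        stages.append(strip_html)
    stages += [clean, normalize]
    if phoneme_output:
        stages.append(toy_g2p)
    return stages


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order to the output of the previous one."""
    for stage in stages:
        text = stage(text)
    return text


if __name__ == "__main__":
    html = "<p>t.d.  2</p>"
    print(run_pipeline(html, build_stages(phoneme_output=False, from_html=True)))
    print(run_pipeline(html, build_stages(phoneme_output=True, from_html=True)))
```

Keeping each stage as a plain string-to-string function makes it straightforward to stop after normalization for a grapheme-based system or to append a transcription step for a phoneme-based one.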
Files in this item
Name | tts-frontend-1.0.1.tar.gz |
Size | 18.85 MB |
Format | application/gzip |
Description | Unknown |
MD5 | 00bcbc3f2ee485643404044a3648e3c9 |
- tts-frontend-1.0.1
- src
- regina_normalizer
- phrasing-tool
- ice-g2p
- manager
- __init__.py (0 B)
- settings.py (5 kB)
- phrasing_manager.py (5 kB)
- normalizer_manager.py (16 kB)
- resources
- ice_pron_dict_north_clear.csv (1 MB)
- ice_pron_dict_english_clear.csv (1 MB)
- dmii_abbr.txt (23 kB)
- abbreviations_general.txt (450 B)
- abbreviations_nonending.txt (63 B)
- ice_pron_dict_standard_clear.csv (1 MB)
- g2p_manager.py (4 kB)
- linked_tokens.py (1 kB)
- cleaner_manager.py (3 kB)
- tts_tokenizer.py (13 kB)
- IceNLP
- ngrams
- corpus.txt (10 kB)
- computeNgrams (173 B)
- buildDictTagFreq (296 B)
- buildDictTagFreq.pl (4 kB)
- computeNgrams.pl (5 kB)
- train (215 B)
- models
- corpus.lex (7 kB)
- otb.ngram (2 MB)
- otb.lex (1 MB)
- corpus.orig.lex (7 kB)
- corpus.ngram (19 kB)
- corpus.lambda (143 B)
- otb.lambda (143 B)
- corpus.txt.freq (10 kB)
- bat
- iceparser
- iceparser.bat (540 B)
- lawtagged.txt (7 kB)
- wordpl2sentpl.sh (359 B)
- iceparserOutOld.sh (4 kB)
- 200sent.txt (45 kB)
- iceparser.sh (152 B)
- iceparserOut.bat (540 B)
- 5.sent (1 kB)
- errorSearch
- pp_errors.sh (102 B)
- vp_errors.sh (102 B)
- np_errors.sh (102 B)
- testData
- test.tags (106 kB)
- dev.tags.sent.parsed (1 MB)
- dev.tags (1007 kB)
- dev.tags.sent.parsed.orig (1 MB)
- test.gold.sent (106 kB)
- dev.tags.sent (1002 kB)
- test.gold.sent.gold (173 kB)
- test.gold (106 kB)
- test.gold.sent.parsed (173 kB)
- readme.txt (355 B)
- prufa.out (134 B)
- 200sent_func.gdc (90 kB)
- iceparserOut.sh (155 B)
- srxsegmentizer
- testinput.txt (1 kB)
- srxsegmentizer.sh (186 B)
- srxsegmentizer.bat (185 B)
- readme.txt (471 B)
- lemmald
- testinput.txt (46 B)
- lemmatize.sh (87 B)
- plaintext.txt (49 B)
- readme.txt (1 kB)
- lemmatize.bat (87 B)
- iceNER
- prufa.txt (561 B)
- iceNER.sh (2 kB)
- iceparser
- doc
- Tagset.pdf (211 kB)
- IceNLP.pdf (473 kB)
- lib
- junit-4.8.2.jar (231 kB)
- commons-io-1.4.jar (106 kB)
- segment-1.3.3.jar (164 kB)
- commons-logging-1.1.1.jar (59 kB)
- commons-cli-1.2.jar (40 kB)
- xerces.jar (1 MB)
- dist
- IceNLPCore.jar (8 MB)
- dict
- icetagger
- otb.verbObj.dict (51 kB)
- otbTags.freq.dict (5 kB)
- otb.verbPrep.dict (129 kB)
- prefixes.dict (160 B)
- baseEndings.dict (39 kB)
- otb.dict (1 MB)
- otb.endingsProper.dict (102 kB)
- otb.apertium.dict (28 kB)
- otb.verbAdverb.dict (5 kB)
- baseDict.dict (79 kB)
- idioms.dict (6 kB)
- otb.endings.dict (220 kB)
- BIN
- bin2Otb.sh (498 B)
- buildDictFromBin.pl (2 kB)
- bin2Icetagger.sh (370 B)
- combineDicts.pl (2 kB)
- README (195 B)
- combineFreqDicts.pl (1 kB)
- bin2Tritagger.sh (529 B)
- bin2Stagger.sh (149 B)
- extractBinData.sh (145 B)
- bin2Otb.pl (12 kB)
- combineOtbFreqBinTri (263 B)
- tokenizer
- lexicon.txt (7 kB)
- tritagger
- idioms.dict (6 kB)
- baseDict.dict (75 kB)
- lemmald
- rule_database_utf8.txt (6 MB)
- postfixRules.txt (175 B)
- rule_hand_written_utf8.txt (753 B)
- readme.txt (249 B)
- settings.txt (306 B)
- makeRules.sh (437 B)
- rule_database_utf8.dat (2 MB)
- iceNER
- location.txt (12 kB)
- formald
- segment.srx (92 kB)
- icetagger
- ngrams
- tokens.py (4 kB)
- spellchecker_manager.py (1 kB)
- textprocessing_manager.py (14 kB)
- tokens_manager.py (12 kB)
- emoji_dictionary.py (323 kB)
- unicode_maps.py (7 kB)
- main.py (1 kB)
- GreynirCorrect4LT
- text-cleaner
- regina_normalizer
- setup.py (3 kB)
- README.md (5 kB)
- .gitmodules (543 B)
- pyproject.toml (103 B)
- test
- test_spellchecker.py (1 kB)
- test_normalizer.py (25 kB)
- test_cleaner.py (18 kB)
- test_tts_tokenizer.py (487 B)
- data
- Akranes_10.txt (1 kB)
- HBS-2022-06-30
- FST_Toflu_test_7.html (4 kB)
- FST_Toflu_test_6.html (5 kB)
- FST_Toflu_test_5.html (4 kB)
- FST_Toflu_test_4.html (4 kB)
- FST_Toflu_test_3.html (5 kB)
- FST_Toflu_test_2.html (2 kB)
- Textatalgervilsprofun_ur_bok_Fjarmal.html (4 kB)
- test_manager.py (1 kB)
- test_transcriber.py (11 kB)
- .github
- workflows
- python-app.yml (1 kB)
- workflows
- LICENSE (11 kB)
- MANIFEST.in (152 B)
- src
- pax_global_header (52 B)