dc.contributor.author Nikulásdóttir, Anna Björk
dc.date.accessioned 2022-09-29T11:59:43Z
dc.date.available 2022-09-29T11:59:43Z
dc.date.issued 2022-10-01
dc.identifier.uri http://hdl.handle.net/20.500.12537/279
dc.description ENGLISH: This project provides a TTS text-processing pipeline for Icelandic. The pipeline includes modules for HTML parsing, text cleaning, text normalization for TTS, spell and grammar correction, phrasing, and grapheme-to-phoneme (g2p) conversion. Before a text can be fed into a TTS system, it has to be converted into the format that was used when training that system. The format can be grapheme-based (i.e. alphabetic characters of the language in question are used as input) or phoneme-based (i.e. a phonetic alphabet such as IPA or SAMPA is used as input). The TTS Textprocessing Pipeline for Icelandic offers both possibilities. ÍSLENSKA (English translation): This software package contains a text-processing pipeline for Icelandic speech synthesizers. The pipeline consists of processing of HTML documents for audiobooks, text cleaning, text normalization, spelling correction, pause insertion, and automatic phonetic transcription. Before a text can be sent to a speech synthesizer it has to be preprocessed, e.g. by writing out digits and abbreviations and marking pauses, and finally brought into the same format as the training data of the synthesizer that is to read the text. Speech synthesizers are usually trained on phonetically transcribed texts, where the texts are transcribed according to phonetic alphabets such as IPA or SAMPA, but they can also be trained directly on texts written with conventional letters. The text-processing pipeline offers both options, and can also process the text only partially.
dc.language.iso isl
dc.publisher Grammatek ehf.
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/license/apache2-0-php/
dc.rights.label PUB
dc.source.uri https://github.com/grammatek/tts-frontend/releases/tag/v1.0.1
dc.subject text-to-speech
dc.subject text processing
dc.subject frontend processing
dc.subject text normalization
dc.subject grapheme-to-phoneme
dc.title TTS Text Processing (22.10)
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType suiteOfTools
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding Clarin IS Repository
contact.person Anna Björk Nikulásdóttir anna@grammatek.com Grammatek ehf.
sponsor Ministry of Education, Science and Culture Text preprocessing, normalization and phrasing (T9) Language Technology for Icelandic 2019-2023 nationalFunds
files.size 19761635
files.count 1
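
The stage order named in dc.description (HTML parsing, cleaning, normalization, then optional g2p) can be illustrated with a minimal sketch. This is not the API of tts-frontend: every function name, the `output` parameter, the one-entry abbreviation table, and the letter-splitting stand-in for g2p are assumptions made for illustration only. Spell correction and phrasing are left out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """HTML parsing stage: keep only the text between tags."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def clean(text):
    """Cleaning stage: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical one-entry table; a real normalizer also handles digits, dates, etc.
TOY_ABBREVIATIONS = {"t.d.": "til dæmis"}


def normalize(text):
    """Normalization stage: write out known abbreviations."""
    return " ".join(TOY_ABBREVIATIONS.get(tok, tok) for tok in text.split(" "))


def toy_g2p(text):
    """Stand-in for g2p: splits words into letters. NOT real phonetics."""
    words = re.findall(r"\w+", text.lower())
    return " | ".join(" ".join(word) for word in words)


def run_pipeline(raw, output="graphemes"):
    """Run the stages in order and return grapheme- or phoneme-style output."""
    text = normalize(clean(strip_html(raw)))
    if output == "graphemes":
        return text
    if output == "phonemes":
        return toy_g2p(text)
    raise ValueError(f"unknown output format: {output!r}")
```

Calling `run_pipeline(raw)` returns normalized text for a grapheme-based TTS system, while `run_pipeline(raw, output="phonemes")` sends the same normalized text through the transcription step, which mirrors the two input formats the description mentions.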


 Files in this item

This item is Publicly Available and licensed under: Apache License 2.0
Name: tts-frontend-1.0.1.tar.gz
Size: 18.85 MB
Format: application/gzip
Description: Unknown
MD5: 00bcbc3f2ee485643404044a3648e3c9
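
The MD5 value listed for the archive can be used to check a download. A minimal sketch using Python's standard `hashlib` follows; the function names and the chunk size are illustrative choices, and the local file path is whatever the archive was saved as.

```python
import hashlib

# Checksum as listed in this record for tts-frontend-1.0.1.tar.gz
EXPECTED_MD5 = "00bcbc3f2ee485643404044a3648e3c9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `matches_record("tts-frontend-1.0.1.tar.gz")` should return `True` for an intact download. MD5 detects accidental corruption but is not a safeguard against deliberate tampering.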
 File Preview  
  • tts-frontend-1.0.1
    • src
      • regina_normalizer
        • phrasing-tool
          • ice-g2p
            • manager
              • __init__.py 0 B
              • settings.py 5 kB
              • phrasing_manager.py 5 kB
              • normalizer_manager.py 16 kB
              • resources
                • ice_pron_dict_north_clear.csv 1 MB
                • ice_pron_dict_english_clear.csv 1 MB
                • dmii_abbr.txt 23 kB
                • abbreviations_general.txt 450 B
                • abbreviations_nonending.txt 63 B
                • ice_pron_dict_standard_clear.csv 1 MB
              • g2p_manager.py 4 kB
              • linked_tokens.py 1 kB
              • cleaner_manager.py 3 kB
              • tts_tokenizer.py 13 kB
              • IceNLP
                • ngrams
                  • corpus.txt 10 kB
                  • computeNgrams 173 B
                  • buildDictTagFreq 296 B
                  • buildDictTagFreq.pl 4 kB
                  • computeNgrams.pl 5 kB
                  • train 215 B
                  • models
                    • corpus.lex 7 kB
                    • otb.ngram 2 MB
                    • otb.lex 1 MB
                    • corpus.orig.lex 7 kB
                    • corpus.ngram 19 kB
                    • corpus.lambda 143 B
                    • otb.lambda 143 B
                  • corpus.txt.freq 10 kB
                • bat
                  • iceparser
                    • iceparser.bat 540 B
                    • lawtagged.txt 7 kB
                    • wordpl2sentpl.sh 359 B
                    • iceparserOutOld.sh 4 kB
                    • 200sent.txt 45 kB
                    • iceparser.sh 152 B
                    • iceparserOut.bat 540 B
                    • 5.sent 1 kB
                    • errorSearch
                      • pp_errors.sh 102 B
                      • vp_errors.sh 102 B
                      • np_errors.sh 102 B
                    • testData
                      • test.tags 106 kB
                      • dev.tags.sent.parsed 1 MB
                      • dev.tags 1007 kB
                      • dev.tags.sent.parsed.orig 1 MB
                      • test.gold.sent 106 kB
                      • dev.tags.sent 1002 kB
                      • test.gold.sent.gold 173 kB
                      • test.gold 106 kB
                      • test.gold.sent.parsed 173 kB
                      • readme.txt 355 B
                      • prufa.out 134 B
                    • 200sent_func.gdc 90 kB
                    • iceparserOut.sh 155 B
                  • srxsegmentizer
                    • testinput.txt 1 kB
                    • srxsegmentizer.sh 186 B
                    • srxsegmentizer.bat 185 B
                    • readme.txt 471 B
                  • lemmald
                    • testinput.txt 46 B
                    • lemmatize.sh 87 B
                    • plaintext.txt 49 B
                    • readme.txt 1 kB
                    • lemmatize.bat 87 B
                  • iceNER
                    • prufa.txt 561 B
                    • iceNER.sh 2 kB
                • doc
                  • Tagset.pdf 211 kB
                  • IceNLP.pdf 473 kB
                • lib
                  • junit-4.8.2.jar 231 kB
                  • commons-io-1.4.jar 106 kB
                  • segment-1.3.3.jar 164 kB
                  • commons-logging-1.1.1.jar 59 kB
                  • commons-cli-1.2.jar 40 kB
                  • xerces.jar 1 MB
                • dist
                  • IceNLPCore.jar 8 MB
                • dict
                  • icetagger
                    • otb.verbObj.dict 51 kB
                    • otbTags.freq.dict 5 kB
                    • otb.verbPrep.dict 129 kB
                    • prefixes.dict 160 B
                    • baseEndings.dict 39 kB
                    • otb.dict 1 MB
                    • otb.endingsProper.dict 102 kB
                    • otb.apertium.dict 28 kB
                    • otb.verbAdverb.dict 5 kB
                    • baseDict.dict 79 kB
                    • idioms.dict 6 kB
                    • otb.endings.dict 220 kB
                  • BIN
                    • bin2Otb.sh 498 B
                    • buildDictFromBin.pl 2 kB
                    • bin2Icetagger.sh 370 B
                    • combineDicts.pl 2 kB
                    • README 195 B
                    • combineFreqDicts.pl 1 kB
                    • bin2Tritagger.sh 529 B
                    • bin2Stagger.sh 149 B
                    • extractBinData.sh 145 B
                    • bin2Otb.pl 12 kB
                    • combineOtbFreqBinTri 263 B
                  • tokenizer
                    • lexicon.txt 7 kB
                  • tritagger
                    • idioms.dict 6 kB
                    • baseDict.dict 75 kB
                  • lemmald
                    • rule_database_utf8.txt 6 MB
                    • postfixRules.txt 175 B
                    • rule_hand_written_utf8.txt 753 B
                    • readme.txt 249 B
                    • settings.txt 306 B
                    • makeRules.sh 437 B
                    • rule_database_utf8.dat 2 MB
                  • iceNER
                    • location.txt 12 kB
                  • formald
                    • segment.srx 92 kB
              • tokens.py 4 kB
              • spellchecker_manager.py 1 kB
              • textprocessing_manager.py 14 kB
              • tokens_manager.py 12 kB
              • emoji_dictionary.py 323 kB
              • unicode_maps.py 7 kB
              • main.py 1 kB
            • GreynirCorrect4LT
              • text-cleaner
              • setup.py 3 kB
              • README.md 5 kB
              • .gitmodules 543 B
              • pyproject.toml 103 B
              • test
                • test_spellchecker.py 1 kB
                • test_normalizer.py 25 kB
                • test_cleaner.py 18 kB
                • test_tts_tokenizer.py 487 B
                • data
                  • Akranes_10.txt 1 kB
                  • HBS-2022-06-30
                    • FST_Toflu_test_7.html 4 kB
                    • FST_Toflu_test_6.html 5 kB
                    • FST_Toflu_test_5.html 4 kB
                    • FST_Toflu_test_4.html 4 kB
                    • FST_Toflu_test_3.html 5 kB
                    • FST_Toflu_test_2.html 2 kB
                    • Textatalgervilsprofun_ur_bok_Fjarmal.html 4 kB
                • test_manager.py 1 kB
                • test_transcriber.py 11 kB
              • .github
              • LICENSE 11 kB
              • MANIFEST.in 152 B
              • pax_global_header 52 B
