-------------------------------------------------------------------------------
Faroese Language Models with Pronunciations
-------------------------------------------------------------------------------

Authors         : Carlos Daniel Hernández Mena, Sandra Saxov Lamhauge,
                  Iben Nyholm Debess, Annika Simonsen.
Language        : Faroese.
Recommended use : speech recognition.

-------------------------------------------------------------------------------
Description
-------------------------------------------------------------------------------

In the context of Automatic Speech Recognition (ASR), an n-gram language model
is a plain-text file containing the probabilities of word sequences of
distinct lengths, or "n-grams" (for example, a sequence of one word is a
1-gram, a sequence of two words is a 2-gram, and so on).

Accordingly, the "Faroese Language Models with Pronunciations" is a set of
n-gram language models in ARPA format, along with pronunciation dictionaries
containing the words that are present in those language models.

This set was originally created to be used in the ASR field. Specifically, it
was designed for the following Kaldi [1] recipe:

- Kaldi Recipe for Faroese:
  https://github.com/CarlosDanielMena/Kaldi_Recipe_for_Faroese

Nevertheless, because this kind of resource is flexible and can be applied to
other tasks, systems or code recipes, it was decided to publish these models
as an independent item.

-------------------------------------------------------------------------------
The Language Models
-------------------------------------------------------------------------------

The language models were created using the Faroese text provided in the
"Basic Language Resource Kit 1.0" (BLARK 1.0) [2] developed by the Ravnur
Project from the Faroe Islands [3]. The BLARK contains text from newspaper
articles, parliamentary speeches, books and more.
The sentences used to generate the language models were normalized by keeping
only characters that belong to the Faroese alphabet and removing punctuation
marks.

The resulting text has more than half a million lines (approximately
106.3 MB). It was used to create a 3-gram language model (recommended for
decoding) and a 4-gram language model (recommended for re-scoring) with the
SRILM toolkit [4]. Both the 3-gram and the 4-gram models come in pruned and
unpruned versions.

Also included is a 6-gram language model in binary format, suitable for ASR
experiments with the NeMo toolkit [5]. This model was created using
KenLM [6].

-------------------------------------------------------------------------------
Pronouncing Dictionaries
-------------------------------------------------------------------------------

In order to preserve the origin of the pronunciations, a number of
pronouncing dictionaries are provided:

- Central_Faroese.dic : It contains pronunciations of the variant of Faroese
  which is considered the most common in the Faroe Islands.

- East_Faroese.dic : It contains pronunciations of the East variant of
  Faroese.

- Ravnursson_Composite_Words.dic : It contains words with hyphens and/or
  underscores that are present in the Ravnursson Corpus [7]. This type of
  composite word can be problematic for a G2P tool.

- BLARK.dic : It contains pronunciations of words that are present in the
  BLARK 1.0 but are not present in any other dictionary of the set.

- FAROESE_ASR.dic : This dictionary is recommended for ASR experiments in
  Kaldi or any other ASR system based on phonemes. The dictionary is the mix
  of Central_Faroese.dic, East_Faroese.dic and
  Ravnursson_Composite_Words.dic. It is important to clarify that the
  dictionary can contain words with multiple pronunciations, which is normal
  in Kaldi-like systems.
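
As an illustration of how several dictionaries can be mixed while keeping
multiple pronunciations per word, the following Python sketch may help. It is
not the procedure used by the authors. It assumes a Kaldi-style layout in
which each line holds a word followed by its phonemes, separated by
whitespace; the actual layout of the .dic files should be checked before use.
The function names and the sample entries ("word_a", "p1", ...) are
hypothetical placeholders, not real Faroese words or phonemes.

```python
def parse_dictionary(lines):
    """Map each word to its list of distinct pronunciations.

    Assumed line layout (hypothetical): WORD PHONE1 PHONE2 ...
    """
    lexicon = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without phonemes
        word, phones = fields[0], tuple(fields[1:])
        variants = lexicon.setdefault(word, [])
        if phones not in variants:
            variants.append(phones)
    return lexicon


def merge_dictionaries(*lexicons):
    """Combine lexicons; a word keeps every distinct pronunciation."""
    merged = {}
    for lexicon in lexicons:
        for word, variants in lexicon.items():
            target = merged.setdefault(word, [])
            for phones in variants:
                if phones not in target:
                    target.append(phones)
    return merged


# Placeholder entries standing in for two dialect dictionaries.
central = parse_dictionary(["word_a p1 p2", "word_b p3"])
east = parse_dictionary(["word_a p1 p4", "word_b p3"])
merged = merge_dictionaries(central, east)

for word, variants in merged.items():
    for phones in variants:
        print(word, " ".join(phones))
```

With real files, the lists passed to parse_dictionary could be replaced by
open file handles, for example open("Central_Faroese.dic", encoding="utf-8"),
assuming the files are UTF-8 encoded.
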
-------------------------------------------------------------------------------
Lists of Phonemes
-------------------------------------------------------------------------------

The sets of phonemes present in the Central and East variants of Faroese are
similar, but the East variant has 3 more phonemes than the Central variant.
For this reason, the list of phonemes provided in the set is divided into
3 files:

- central.phones : List of the 60 phonemes of the Central variant.

- east.phones : List of the 63 phonemes of the East variant.

- asr.phones : The mix of the East and Central phonemes with no repetitions.
  In fact, the list of the Central variant is a subset of the list of the
  East variant. This list is recommended for ASR experiments in Kaldi or any
  other system based on phonemes.

-------------------------------------------------------------------------------
Citation
-------------------------------------------------------------------------------

When publishing results based on the models, please refer to:

Hernández Mena, Carlos Daniel; Lamhauge, Sandra Saxov; Nyholm Debess, Iben;
Simonsen, Annika. "Faroese Language Models with Pronunciations". Web
Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Carlos Daniel Hernández Mena (carlos.mena@ciempiess.org)

License: CC BY 4.0

-------------------------------------------------------------------------------
Acknowledgements
-------------------------------------------------------------------------------

The authors want to thank Jón Guðnason, head of the Language and Voice Lab,
for providing computational power to make these models possible. We also
want to thank the "Language Technology Programme for Icelandic 2019-2023",
which is managed and coordinated by Almannarómur and funded by the Icelandic
Ministry of Education, Science and Culture.

Special thanks to the Ravnur Project for making their "Basic Language
Resource Kit" (BLARK 1.0) publicly available through the research paper
"Creating a Basic Language Resource Kit for Faroese":
https://aclanthology.org/2022.lrec-1.495.pdf

-------------------------------------------------------------------------------
References
-------------------------------------------------------------------------------

[1] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.,
    ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE
    2011 workshop on automatic speech recognition and understanding. IEEE
    Signal Processing Society.

[2] Simonsen, A., Debess, I. N., Lamhauge, S. S., & Henrichsen, P. J.
    Creating a basic language resource kit for Faroese. In LREC 2022. 13th
    International Conference on Language Resources and Evaluation.

[3] Website. The Project Ravnur under the Talutøkni Foundation.
    https://maltokni.fo/en/the-ravnur-project

[4] Stolcke, A. (2002). SRILM-an extensible language modeling toolkit. In
    Seventh international conference on spoken language processing.

[5] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B.,
    ... & Cohen, J. M. (2019). Nemo: a toolkit for building AI applications
    using neural modules. arXiv preprint arXiv:1909.09577.

[6] Heafield, K. (2011, July). KenLM: Faster and smaller language model
    queries. In Proceedings of the sixth workshop on statistical machine
    translation (pp. 187-197).

[7] Hernández Mena, Carlos Daniel; Simonsen, Annika. "Ravnursson Faroese
    Speech and Transcripts". Web Downloading:
    http://hdl.handle.net/20.500.12537/276

-------------------------------------------------------------------------------
-------------------------------------------------------------------------------