-------------------------------------------------------------------------------
                 Faroese Language Models with Pronunciations
-------------------------------------------------------------------------------

Authors         : Carlos Daniel Hernández Mena, Sandra Saxov Lamhauge,
                  Iben Nyholm Debess, Annika Simonsen.

Language        : Faroese.

Recommended use : speech recognition.

-------------------------------------------------------------------------------
Description
-------------------------------------------------------------------------------

In the context of Automatic Speech Recognition (ASR), an n-gram language model 
is a plain-text file containing the probabilities of word sequences of 
distinct lengths, or "n-grams" (for example, a sequence of one word is a 
1-gram, a sequence of two words is a 2-gram, and so on). Accordingly, the 
"Faroese Language Models with Pronunciations" is a set of n-gram language 
models in ARPA format, along with pronunciation dictionaries containing the 
words that are present in those language models.
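
As an illustration of the ARPA format, the following is a minimal sketch that
reads the header of an ARPA file and reports how many n-grams of each order
it declares. The function name and the example lines are made up for this
illustration; they are not taken from the models in this set.

```python
import re
from typing import Dict, Iterable


def read_arpa_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {order: number of n-grams} from the header of an ARPA file."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True  # the header starts at the "\data\" marker
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # first section marker such as "\1-grams:" ends the header
    return counts


# Invented header, only to show the layout of the format.
example = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=3",
    "ngram 3=1",
    "",
    "\\1-grams:",
    "-0.60 <s> -0.30",
    "\\end\\",
]
print(read_arpa_counts(example))
```

To inspect a real model, the same function can be given an open file handle
(opened with encoding="utf-8") instead of the example list.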

This set was originally created for use in the ASR field. Specifically, it 
was designed for the following Kaldi [1] recipe:

	- Kaldi Recipe for Faroese:
	https://github.com/CarlosDanielMena/Kaldi_Recipe_for_Faroese

Nevertheless, because this kind of resource is flexible and may be applied 
to other tasks, systems or code recipes, it was decided to publish these 
models as an independent item.

-------------------------------------------------------------------------------
The Language Models
-------------------------------------------------------------------------------

The language models were created using the Faroese text provided in the "Basic 
Language Resource Kit 1.0" (BLARK 1.0) [2] developed by the Ravnur Project 
from the Faroe Islands [3]. The BLARK contains text from newspaper articles, 
parliamentary speeches, books and more. The sentences used to generate the 
language models were normalized by allowing only characters that belong to 
the Faroese alphabet and by removing punctuation marks.
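
The normalization described above could be sketched as follows. This is not
the script used to build the models: lowercasing the text and discarding a
whole sentence when it still contains a character outside the Faroese
alphabet are assumptions made for this example, and the function name is
hypothetical.

```python
import re
from typing import Optional

# The 29 letters of the Faroese alphabet, in lowercase.
FAROESE_LETTERS = "aábdðefghiíjklmnoóprstuúvyýæø"
_ALLOWED = set(FAROESE_LETTERS + " ")
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(sentence: str) -> Optional[str]:
    """Return the cleaned sentence, or None if it must be discarded."""
    text = sentence.lower()              # assumption: models are lowercase
    text = _PUNCTUATION.sub(" ", text)   # remove punctuation marks
    text = " ".join(text.split())        # collapse repeated whitespace
    if not text or any(ch not in _ALLOWED for ch in text):
        return None                      # assumption: drop such sentences
    return text


print(normalize("Góðan dag, Føroyar!"))
print(normalize("Tórshavn 2022"))
```

The first call returns the sentence without punctuation; the second returns
None because digits are not part of the alphabet.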

The resulting text is more than half a million lines long (approximately 
106.3 MB). It was used to create a 3-gram language model (recommended for 
decoding) and a 4-gram language model (recommended for re-scoring) with the 
SRILM toolkit [4]. Both the 3-gram and 4-gram models come in pruned and 
unpruned versions.

A 6-gram language model in binary format, suitable for ASR experiments with 
the NeMo toolkit [5], is also included. This model was created using 
KenLM [6].

-------------------------------------------------------------------------------
Pronouncing Dictionaries
-------------------------------------------------------------------------------

In order to preserve the origin of the pronunciations, a number of pronouncing
dictionaries are provided:

 - Central_Faroese.dic            : It contains pronunciations of the variant 
                                    of Faroese which is considered the most 
                                    common in the Faroe Islands.
 
 - East_Faroese.dic               : It contains pronunciations of the East 
                                    variant of Faroese.
 
 - Ravnursson_Composite_Words.dic : It contains words with hyphens and/or
                                    underscores that are present in the
                                    Ravnursson Corpus [7]. This type of
                                    composite word can be problematic for
                                    a G2P tool.
 
 - BLARK.dic                      : It contains pronunciations of words that
                                    are present in the BLARK 1.0 but that are
                                    not present in any other dictionary of the
                                    set.
 
 - FAROESE_ASR.dic                : This dictionary is recommended for ASR
                                    experiments in Kaldi or any other ASR 
                                    system based on phonemes. The dictionary
                                    is a merge of Central_Faroese.dic,
                                    East_Faroese.dic and
                                    Ravnursson_Composite_Words.dic. Note that
                                    the dictionary can contain words with
                                    multiple pronunciations, which is normal
                                    in Kaldi-like systems.
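
A merge of this kind, which keeps every pronunciation variant of a word, can
be sketched as shown below. The sketch assumes the common Kaldi-style layout
of one entry per line, with the word followed by its phones separated by
whitespace; the actual layout of the .dic files should be checked before
reuse. The entries are invented placeholders, not real Faroese
pronunciations.

```python
from typing import Dict, Iterable, List, Tuple

Lexicon = Dict[str, List[Tuple[str, ...]]]


def add_entries(lexicon: Lexicon, lines: Iterable[str]) -> Lexicon:
    """Add "WORD PHONE PHONE ..." lines to the lexicon, keeping variants."""
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], tuple(fields[1:])
        variants = lexicon.setdefault(word, [])
        if phones not in variants:  # avoid exact duplicates when merging
            variants.append(phones)
    return lexicon


# Placeholder entries: the words and phone symbols are invented.
central = ["word_a P1 P2", "word_b P3"]
east = ["word_a P1 P4", "word_b P3"]

merged: Lexicon = {}
add_entries(merged, central)
add_entries(merged, east)
for word, variants in merged.items():
    for phones in variants:
        print(word, " ".join(phones))
```

Here word_a ends up with two pronunciations, one from each source, while the
identical entry for word_b is stored only once.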

-------------------------------------------------------------------------------
Lists of Phonemes
-------------------------------------------------------------------------------

The sets of phonemes present in the Central and East variants of Faroese
are similar, but the East variant has 3 more phonemes than the Central one.

For this reason, the lists of phonemes provided in the set are divided into 3
files:

- central.phones : List of the 60 phonemes of the Central variant.

- east.phones    : List of the 63 phonemes of the East variant.

- asr.phones     : The merge of the East and Central phonemes with no 
                   repetitions. In fact, the list of the Central variant
                   is a subset of the East list. This list is recommended
                   for ASR experiments in Kaldi or any other system based
                   on phonemes.
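
The relation between the three lists can be sketched with a few lines of
Python. The symbols below are placeholders, not the real phoneme inventory,
and the function name is hypothetical. With the real files, each list would
be read from its .phones file, assuming one symbol per line.

```python
from typing import List, Sequence


def merge_phones(central: Sequence[str], east: Sequence[str]) -> List[str]:
    """Return the union of both lists, in first-seen order, without repeats."""
    merged: List[str] = []
    seen = set()
    for phone in list(central) + list(east):
        if phone not in seen:
            seen.add(phone)
            merged.append(phone)
    return merged


# Placeholder symbols, not the real phoneme inventory.
central = ["p1", "p2", "p3"]
east = ["p1", "p2", "p3", "p4"]

asr = merge_phones(central, east)
print(asr)
print(set(central) <= set(east))  # True when Central is a subset of East
```

Because the Central list is a subset of the East list, the merged list
contains the same symbols as the East list.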

-------------------------------------------------------------------------------
Citation
-------------------------------------------------------------------------------

When publishing results based on the models, please refer to:

   Hernández Mena, Carlos Daniel; Lamhauge, Sandra Saxov; Nyholm Debess, 
   Iben; Simonsen, Annika. "Faroese Language Models with Pronunciations".
   Web Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Carlos Daniel Hernández Mena (carlos.mena@ciempiess.org)

License: CC BY 4.0

-------------------------------------------------------------------------------
Acknowledgements
-------------------------------------------------------------------------------

The authors want to thank Jón Guðnason, head of the Language and Voice Lab, 
for providing computational power to make these models possible. We also want 
to thank the "Language Technology Programme for Icelandic 2019-2023", which 
is managed and coordinated by Almannarómur and funded by the Icelandic 
Ministry of Education, Science and Culture.

Special thanks to The Ravnur Project for making their "Basic Language Resource 
Kit" (BLARK 1.0) publicly available through the research paper "Creating a 
Basic Language Resource Kit for Faroese":
https://aclanthology.org/2022.lrec-1.495.pdf

-------------------------------------------------------------------------------
References
-------------------------------------------------------------------------------

[1] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, 
    N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In 
    IEEE 2011 workshop on automatic speech recognition and understanding. 
    IEEE Signal Processing Society.

[2] Simonsen, A., Debess, I. N., Lamhauge, S. S., & Henrichsen, P. J. (2022). 
    Creating a basic language resource kit for Faroese. In LREC 2022. 13th 
    International Conference on Language Resources and Evaluation.
    
[3] Website. The Project Ravnur under the Talutøkni Foundation
    https://maltokni.fo/en/the-ravnur-project    
    
[4] Stolcke, A. (2002). SRILM-an extensible language modeling toolkit. In 
    Seventh international conference on spoken language processing.

[5] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, 
    B., ... & Cohen, J. M. (2019). Nemo: a toolkit for building AI 
    applications using neural modules. arXiv preprint arXiv:1909.09577.

[6] Heafield, K. (2011, July). KenLM: Faster and smaller language model 
    queries. In Proceedings of the sixth workshop on statistical machine 
    translation (pp. 187-197).

[7] Hernández Mena, Carlos Daniel; Simonsen, Annika. "Ravnursson Faroese 
    Speech and Transcripts". Web Download: 
    http://hdl.handle.net/20.500.12537/276

-------------------------------------------------------------------------------
-------------------------------------------------------------------------------