--------------------------------------------------------------------------------
Raddrómur Icelandic Speech 22.09
--------------------------------------------------------------------------------

Language        : Icelandic

Authors         : Carlos Daniel Hernández Mena, Staffan Hedström,
                  Ragnheiður Þórhallsdóttir, Judy Y. Fong,
                  Þorsteinn Daði Gunnarsson, Helga Svala Sigurðardóttir,
                  Helga Lára Þorsteinsdóttir and Jón Guðnason.

Recommended use : Speech Recognition

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

The "Raddrómur Icelandic Speech 22.09" ("Raddrómur Corpus" for short) is an
Icelandic corpus created by the Language and Voice Laboratory (LVL) at
Reykjavík University (RU) in 2022.

The Raddrómur Corpus is intended for the speech recognition field and is made
up of radio podcasts, mostly taken from RÚV (ruv.is). These podcasts were
selected because each one comes with a text script that matches, with a certain
degree of fidelity, what is said during the show.

After automatic segmentation of the episodes, the transcriptions were inferred
using the scripts along with a forced alignment technique. To help identify the
transcriptions with fewer expected mistakes, a quality measure called
"MAFIA Score" is included in the metadata file distributed with the corpus.
A MAFIA Score close to zero implies a better quality transcription.

--------------------------------------------------------------------------------
Disclaimer and Terms of Use
--------------------------------------------------------------------------------

"Raddrómur Icelandic Speech 22.09" by the Language and Voice Laboratory (LVL)
from Reykjavík University (RU) is licensed under a Creative Commons
Attribution 4.0 International (CC BY 4.0) License with the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
To view a copy of this license visit:
https://creativecommons.org/licenses/by/4.0/

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The corpus was automatically segmented using the tool inaSpeechSegmenter [1].

- The forced alignment was performed using the tool MAFIA aligner [2].

- The corpus comes with a metadata file which is in TSV format. This file
  contains the normalized transcription of the corpus and the filenames, among
  other relevant information.

- The corpus consists of speech utterances from professional podcasters.

- The corpus contains 13,030 utterances, totalling 49 hours and 09 minutes.

- The corpus is not split into train/dev/test portions.

- The corpus is distributed in the following format: flac, 16kHz@16bits mono.

- The column "mafia_score" in the metadata file indicates the expected
  precision of the transcription. Zero is the highest precision.

--------------------------------------------------------------------------------
Speech Sources
--------------------------------------------------------------------------------

The Raddrómur Corpus is composed of different radio podcasts in Icelandic.
More information about the origin of these podcasts is given below:

- Rokkland
  Author: Ólafur Páll Gunnarsson
  Podcast/Radio show hosted by RUV.

- A Tonsvidinu
  Author: Una Margrét Jónsdóttir
  Podcast/Radio show hosted by RUV.

- I ljosu Sogunnar
  Author: Vera Illugadóttir
  Podcast/Radio show hosted by RUV.

- Nedanmals
  Authors: Elísabet Rún Þorsteinsdóttir and Marta Eir Sigurðardóttir.

- Leikfangavelin
  Author: Atla Hergeirssonar
  Independent Podcast/Radio show.
--------------------------------------------------------------------------------
The Metadata File (metadata.tsv)
--------------------------------------------------------------------------------

The metadata file is a "tab-separated values" (TSV) file containing all the
relevant information of the corpus. This file can be read using the Python
library called "Pandas" [3].

The metadata.tsv file comprises the following 11 columns:

01.- id            : Filename as explained in the section "Audio Filenames"
                     without the extension ".flac".

02.- filename      : Filename as explained in the section "Audio Filenames"
                     with the extension ".flac".

03.- podcast_id    : An id used to identify the original raw podcast.

04.- segment_num   : Segment number. Every podcast episode was given to us as
                     one single audio file that is very long from the ASR
                     perspective. Therefore, every podcast is subdivided into
                     segments of around 10 seconds using the tool
                     inaSpeechSegmenter, so that it suits most modern ASR
                     engines.

05.- start_time    : The timestamp indicating the beginning of a segment with
                     respect to the original podcast episode. The format is
                     Hour:Minute:Second.tenths of a second.

06.- sentence_norm : The normalized transcription: no punctuation marks, no
                     digits, lower case letters, one single space between
                     words.

07.- language      : "Icelandic" in all cases.

08.- created_at    : The date when the audio file was segmented to be part of
                     the corpus. The format is year-month-day.

09.- mafia_score   : A measure of the quality of the transcription. It behaves
                     like the Word Error Rate: the lower the number, the more
                     precise the transcription. For more information please
                     see the section "The MAFIA Score".

10.- duration      : The absolute duration of a segment. The format is
                     Hour:Minute:Second.tenths of a second.

11.- sample_rate   : 16kHz in all cases.
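
As an illustration of how the metadata file might be used, the sketch below
reads TSV rows and keeps only the segments whose "mafia_score" does not exceed
a chosen threshold. It uses the csv module from the Python standard library so
that it runs without extra dependencies; with Pandas [3], the equivalent read
would be pandas.read_csv("metadata.tsv", sep="\t"). The function names, the
sample rows (which show only a few of the 11 columns) and the threshold values
are made up for this example, and it assumes that "mafia_score" is stored as a
plain number.

```python
import csv
import io


def load_metadata(tsv_file):
    """Read metadata rows from an open TSV file object into a list of dicts."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Assumption: mafia_score is stored as a plain number.
        row["mafia_score"] = float(row["mafia_score"])
        rows.append(row)
    return rows


def select_by_mafia_score(rows, max_score):
    """Keep the rows whose mafia_score is at most max_score (lower is better)."""
    return [row for row in rows if row["mafia_score"] <= max_score]


# Made-up sample with only a few of the 11 columns, for illustration only.
SAMPLE_TSV = (
    "id\tsentence_norm\tmafia_score\n"
    "example_000001-0001\thalló heimur\t0.0\n"
    "example_000001-0002\tgóðan daginn\t12.5\n"
    "example_000001-0003\ttakk fyrir\t50.0\n"
)

sample_rows = load_metadata(io.StringIO(SAMPLE_TSV))
exact_matches = select_by_mafia_score(sample_rows, 0.0)
```

For the real file, the same function can be applied to a file opened with
open("metadata.tsv", encoding="utf-8", newline=""). Which threshold is
appropriate depends on the scale of the scores in the actual metadata and on
how much transcription noise the downstream task tolerates.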
--------------------------------------------------------------------------------
Audio Filenames
--------------------------------------------------------------------------------

Every audio file in the Raddrómur Corpus has an individual filename with the
following format:

    nedanmals_000003-0011-00:03:2062-00:00:0628.flac

    nedanmals_000003 : Podcast Id

    0011             : Segment number

    00:03:2062       : Starting time, as Hour:Minute:Second, where the last
                       field holds the seconds followed by two digits for the
                       fraction of a second (no decimal point).
                       In this particular example, the segment starts at
                       minute 3, second 20.62.

    00:00:0628       : Duration, in the same format as the starting time.
                       In this particular example, the segment lasts 6.28
                       seconds.

    .flac            : The corpus is distributed in flac format.

--------------------------------------------------------------------------------
The MAFIA Score
--------------------------------------------------------------------------------

The MAFIA aligner is designed to take a podcast episode along with a text
script reflecting what is spoken in the podcast, then segment the podcast and
find the transcription that best fits what is in the script. When the script is
not accurate, MAFIA is able to infer a transcription using Automatic Speech
Recognition.

In order to find a transcription using the vocabulary of the text script, MAFIA
creates a 3-gram language model with SRILM [4] using the text of all the
podcasts available at the moment of running it. After this, MAFIA transcribes
all the segments using a speech recognizer based on NVIDIA-NeMo [5].

In order to calculate the MAFIA Score, a second pass of speech recognition is
run over all the segments, this time using a considerably more robust 6-gram
language model with a size of 5GB [6]. The MAFIA Score is then obtained by
measuring the Word Error Rate between the first pass transcriptions (reference)
and the second pass transcriptions (hypothesis). Accordingly, a MAFIA Score of
0 means that the transcription is identical in both passes and is therefore
considered a high quality transcription.
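
The comparison behind the MAFIA Score can be sketched as a standard word-level
Word Error Rate: the minimum number of word substitutions, deletions and
insertions needed to turn the reference into the hypothesis, divided by the
number of words in the reference. The code below is a minimal illustration of
that textbook definition and is not taken from the MAFIA aligner itself; the
function name is hypothetical, and whether the scores in the corpus are stored
as a fraction or as a percentage is not stated in this document.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# Identical first pass and second pass transcriptions give a score of 0.
same = word_error_rate("a b c d", "a b c d")

# One substituted word and one missing word out of four reference words.
different = word_error_rate("a b c d", "a x c")
```

In this sketch the first pass transcription plays the role of the reference and
the second pass transcription the role of the hypothesis, following the
description above.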
--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

Mena, Carlos et al. "Raddrómur Icelandic Speech 22.09". Web Download.
Reykjavik University: Language and Voice Lab, 2022.

Contact: Jón Guðnason (jg@ru.is)

License: CC BY 4.0

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic
2019-2022. The programme, which is managed and coordinated by Almannarómur, is
funded by the Icelandic Ministry of Education, Science and Culture.

Special thanks to the podcasters and to Aron Berg from RÚV.

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Software. inaSpeechSegmenter. CNN-based audio segmentation toolkit.
    https://pypi.org/project/inaSpeechSegmenter/

[2] Software. Match-Finder Aligner (MAFIA). Software tool destined to
    automatically create ASR corpora out of speech files along with scripts
    reflecting what is spoken in such speech files.
    http://hdl.handle.net/20.500.12537/215

[3] Software. Pandas (Python Library).
    https://pandas.pydata.org

[4] Software. SRI Language Modeling toolkit.
    https://www.sri.com/platform/srilm

[5] Software. NVIDIA-NeMo. An open-source framework for developers to build
    and train state-of-the-art conversational AI models.
    https://developer.nvidia.com/nvidia-nemo

[6] Resource. 6-GRAM Language Model in Icelandic for NeMo (Binary Format)
    22.06. A word level n-gram language model in binary format suitable for
    recognizers based on the NVIDIA-NeMo framework.
    http://hdl.handle.net/20.500.12537/226

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------