--------------------------------------------------------------------------------
                        Raddrómur Icelandic Speech 22.09
--------------------------------------------------------------------------------

Language        : Icelandic

Authors         : Carlos Daniel Hernández Mena, Staffan Hedström, 
                  Ragnheiður Þórhallsdóttir, Judy Y. Fong, Þorsteinn Daði 
                  Gunnarsson, Helga Svala Sigurðardóttir, Helga Lára 
                  Þorsteinsdóttir and Jón Guðnason.

Recommended use : Speech Recognition

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

The "Raddrómur Icelandic Speech 22.09" ("Raddrómur Corpus" for short) is an 
Icelandic corpus created by the Language and Voice Laboratory (LVL) at 
Reykjavík University (RU) in 2022. 

The Raddrómur Corpus is intended for the speech recognition field and is 
built from radio podcasts, mostly taken from RÚV (ruv.is). These podcasts were 
selected because they come with a text script that matches what is said during 
the show with a certain degree of fidelity. After automatic segmentation of the
episodes, the transcriptions were inferred using the scripts along with a
forced alignment technique.

In order to distinguish the transcriptions with fewer expected mistakes, a
quality measure called "MAFIA Score" was added in the metadata file included
with the corpus. A MAFIA Score close to zero implies a better quality
transcription.

--------------------------------------------------------------------------------
Disclaimer and Terms of Use
--------------------------------------------------------------------------------

"Raddrómur Icelandic Speech 22.09" by the Language and Voice Laboratory (LVL) 
from Reykjavík University (RU) is licensed under a Creative Commons Attribution 
4.0 International (CC BY 4.0) License with the hope that it will be useful, 
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 
or FITNESS FOR A PARTICULAR PURPOSE.  

To view a copy of this license visit:
https://creativecommons.org/licenses/by/4.0/

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The corpus was automatically segmented using the tool inaSpeechSegmenter [1].
  
- The forced alignment was performed using the tool MAFIA aligner [2].
  
- The corpus comes with a metadata file in TSV format. This file contains the
  normalized transcriptions of the corpus and the filenames, among other
  relevant information.

- The corpus consists of speech utterances from professional podcasters.
 
- The corpus contains 13,030 utterances, totalling 49 hours and 09 minutes.

- The corpus is not split into train/dev/test portions.

- The corpus is distributed in the following format: flac, 16kHz@16bits mono.

- The column "mafia_score" in the metadata file indicates the expected
  precision of the transcription. Zero is the highest precision.
  
--------------------------------------------------------------------------------
Speech Sources
--------------------------------------------------------------------------------

The Raddrómur Corpus is composed of different radio podcasts in Icelandic. More 
information about the origin of these podcasts is given below:

- Rokkland
Author: Ólafur Páll Gunnarsson
Podcast/Radio show hosted by RUV.

- A Tonsvidinu
Author: Una Margrét Jónsdóttir
Podcast/Radio show hosted by RUV.

- I ljosu Sogunnar
Author: Vera Illugadóttir
Podcast/Radio show hosted by RUV.

- Nedanmals
Authors: Elísabet Rún Þorsteinsdóttir and Marta Eir Sigurðardóttir.

- Leikfangavelin
Author: Atla Hergeirssonar
Independent Podcast/Radio show.

--------------------------------------------------------------------------------
The Metadata File (metadata.tsv)
--------------------------------------------------------------------------------

The metadata file is a tab-separated values (TSV) file containing all the 
relevant information about the corpus. This file can be read using the Python 
library Pandas [3]. The metadata.tsv file comprises the following 11 columns:


01.- id              : Filename as explained in the section "Audio Filenames"
                       without the extension ".flac".
                       
02.- filename        : Filename as explained in the section "Audio Filenames"
                       with the extension ".flac".

03.- podcast_id      : An id used to identify the original raw podcast.

04.- segment_num     : Segment number. Every podcast episode was given to us 
                       as one single audio file, which is very long from the
                       ASR perspective. Therefore, every episode was subdivided
                       into segments of around 10 seconds using the tool
                       inaSpeechSegmenter, so that they fit most modern ASR
                       engines.
                       
05.- start_time      : The timestamp indicating the beginning of a segment with 
                       respect to the original podcast episode. The format is 
                       Hour:Minute:Second.tenths of a second.
                       
06.- sentence_norm   : The normalized transcription: no punctuation marks, no 
                       digits, lower case letters, one single space between
                       words.                       
                       
07.- language        : "Icelandic" in all the cases.                       
                          
08.- created_at      : The date when the audio file was segmented to be part
                       of the corpus. The format is year-month-day.

09.- mafia_score     : This is a measure of the quality of the transcription. 
                       It works similarly to the Word Error Rate: the lower
                       this number, the better the precision of the 
                       transcription. For more information please see
                       the section "The MAFIA Score".
                   
10.- duration        : The absolute duration of a segment. The format is 
                       Hour:Minute:Second.tenths of a second              

11.- sample_rate     : 16kHz in all cases.
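
As a minimal sketch of how the columns above might be used, the following
Python snippet keeps only the rows whose mafia_score does not exceed a chosen
threshold. It uses the standard csv module instead of Pandas so that it has no
dependencies; Pandas works equally well. The function name, the threshold of
0.1, the presence of a header row with the column names listed above, and the
assumption that mafia_score is stored as a plain decimal number are all
illustrative assumptions, not facts taken from the corpus.

```python
import csv
import io


def select_by_mafia_score(tsv_file, max_score=0.1):
    """Return the metadata rows whose mafia_score is <= max_score.

    tsv_file is any open text file object. Hypothetical helper: it assumes
    a header row containing the column names listed above.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    selected = []
    for row in reader:
        # Assumption: mafia_score is stored as a plain decimal number.
        if float(row["mafia_score"]) <= max_score:
            selected.append(row)
    return selected


# Tiny made-up sample with only three of the eleven columns, for illustration.
sample = io.StringIO(
    "id\tsentence_norm\tmafia_score\n"
    "example_0001\ttakk fyrir\t0.0\n"
    "example_0002\tkomdu inn\t0.35\n"
)
kept = select_by_mafia_score(sample, max_score=0.1)
print([row["id"] for row in kept])
```

With the real file, the same function could be called on a handle opened with
open("metadata.tsv", encoding="utf-8", newline=""). A suitable threshold
depends on how much transcription noise the intended application tolerates.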

--------------------------------------------------------------------------------
Audio Filenames
--------------------------------------------------------------------------------

Every audio file in the Raddrómur Corpus has an individual filename with the 
following format:

                nedanmals_000003-0011-00:03:2062-00:00:0628.flac

nedanmals_000003 : Podcast Id

0011             : Segment number

00:03:2062       : Starting time in the format Hour:Minute:Second. The
                   seconds field carries two decimal digits and the decimal
                   point is omitted in the filename. In this particular
                   example, the segment starts at minute 3, second 20.62.

00:00:0628       : Duration, in the same format as the starting time.
                   In this particular example, the segment lasts 
                   6.28 seconds.
                  
.flac            : The corpus is distributed in flac format.
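
The filename layout described above can be split back into its fields with a
short Python sketch such as the one below. The names FILENAME_RE, to_seconds
and parse_filename are hypothetical, and the pattern assumes that every
filename follows the example exactly, with the seconds written as two integer
digits followed by two decimal digits and no decimal point.

```python
import re

# Pattern for names such as
# "nedanmals_000003-0011-00:03:2062-00:00:0628.flac".
# Assumption: seconds always have two integer and two decimal digits,
# written without a decimal point, as in the documented example.
FILENAME_RE = re.compile(
    r"^(?P<podcast_id>.+)-(?P<segment>\d+)"
    r"-(?P<start>\d+:\d{2}:\d{4})"
    r"-(?P<duration>\d+:\d{2}:\d{4})\.flac$"
)


def to_seconds(stamp):
    """Convert a stamp such as '00:03:2062' to seconds as a float."""
    hours, minutes, centiseconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(centiseconds) / 100


def parse_filename(filename):
    """Split a corpus filename into its four documented fields."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return {
        "podcast_id": match["podcast_id"],
        "segment_num": int(match["segment"]),
        "start_seconds": to_seconds(match["start"]),
        "duration_seconds": to_seconds(match["duration"]),
    }


info = parse_filename("nedanmals_000003-0011-00:03:2062-00:00:0628.flac")
print(info)
```

Converting the times to seconds makes it straightforward to locate a segment
inside the original episode or to sum segment durations per podcast.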

--------------------------------------------------------------------------------
The MAFIA Score
--------------------------------------------------------------------------------

The MAFIA aligner is designed to take a podcast episode along with a text script
reflecting what is spoken in the podcast, then segment the podcast and find the 
transcription that best fits what is in the script. When the script is not
accurate, MAFIA is able to infer a transcription using Automatic Speech 
Recognition. 

In order to find a transcription using the vocabulary of the text script, MAFIA
creates a 3-gram language model with SRILM [4] using the text of all the 
podcasts available at the moment of running it. After this, MAFIA transcribes 
all the segments using a speech recognizer based on NVIDIA-NeMo [5].

In order to calculate the MAFIA Score, a second round of speech recognition
is run on all the segments, this time using a much more robust 6-gram language 
model with a size of 5GB [6]. The MAFIA Score is then obtained by measuring
the Word Error Rate between the first-pass transcriptions (reference) and the
second-pass transcriptions (hypothesis). Accordingly, a MAFIA Score of 0 means
that the transcription is identical in both passes and is therefore considered
a high-quality transcription.
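
The Word Error Rate comparison between the two passes can be illustrated with
the following Python sketch. It implements the standard definition of WER
(word-level edit distance divided by the number of reference words); whether
the MAFIA aligner computes or normalizes the score in exactly this way is an
assumption, and the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


first_pass = "this is a small example"      # reference (made-up text)
second_pass = "this is small example here"  # hypothesis (made-up text)
print(word_error_rate(first_pass, first_pass))   # identical passes
print(word_error_rate(first_pass, second_pass))  # passes that disagree
```

Identical passes yield a score of 0, while every word that has to be
substituted, deleted or inserted to turn one pass into the other raises the
score.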

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

   Mena, Carlos et al. "Raddrómur Icelandic Speech 22.09". Web 
   Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Jón Guðnason (jg@ru.is)

License: CC BY 4.0

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic 
2019-2022. The programme, which is managed and coordinated by Almannarómur, 
is funded by the Icelandic Ministry of Education, Science and Culture.

Special thanks to the podcasters and to Aron Berg from RÚV.

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Software. inaSpeechSegmenter. CNN-based audio segmentation toolkit.
              https://pypi.org/project/inaSpeechSegmenter/

[2] Software. Match-Finder Aligner (MAFIA). Software tool designed to 
              automatically create ASR corpora out of speech files along with 
              scripts reflecting what is spoken in such speech files.
              http://hdl.handle.net/20.500.12537/215

[3] Software. Pandas (Python Library). https://pandas.pydata.org

[4] Software. SRI Language Modeling toolkit. https://www.sri.com/platform/srilm
    
[5] Software. NVIDIA-NeMo. An open-source framework for developers to build 
              and train state-of-the-art conversational AI models.
              https://developer.nvidia.com/nvidia-nemo
              
[6] Resource. 6-GRAM Language Model in Icelandic for NeMo (Binary Format) 
              22.06. A word level n-gram language model in binary format 
              suitable for recognizers based on the NVIDIA-NeMo framework. 
              http://hdl.handle.net/20.500.12537/226
              
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------