Icelandic broadcast speech

About the Icelandic broadcast speech corpus
---------------------------
The Icelandic broadcast speech corpus is 193 hours of radio and TV data from RÚV. The radio data consists of episodes of Spegillinn, morning news, evening news, Morgunútvarpið, Morgunvaktin and Samfélagið. The TV data consists of episodes of Kastljós. All the data is from episodes broadcast in the period from January 2020 to August 2021. The data contains 40,746 utterances from 1,360 speakers. The data is aligned and segmented, ready for ASR training. The data set includes both prompted speech (e.g. from the news) and conversational speech (e.g. Morgunvaktin and Kastljós).

This data set is published by RÚV, transcribed by Creditinfo and aligned at Reykjavík University with the help of Tiro's automatic speech recognizer. Special thanks to Tiro for supplying transcriptions with per-word timestamps from their automatic speech recognizer, which were essential in the alignment process.

This work is licensed under the Creative Commons Attribution 4.0 International License.

This is a broadcast dataset collected from RÚV by Creditinfo and supplied to Reykjavík University in 2020-2021, so all episodes within this dataset aired in 2021 at the latest. All episodes were recorded as digital originals. The text originates from transcription done at Creditinfo. Audio files are 16 kHz single-channel FLAC files created from the original .ts, .mp3 and .mp4 episodes. The alignment method used was developed at Reykjavík University. This dataset was released in February 2022 (2022-02).

The dataset contains data from the following 6 shows:
    Fréttir kl. 19:00 - prime time news
    Morgunfréttir kl. 08:00 - prime time news
    Kastljós - news commentary
    Spegillinn - news commentary
    Morgunvaktin - news commentary
    Samfélagið - An informed and critical discussion on social issues
---------------------------
To unzip the data set, the .z0x files must first be combined using the command:
    zip -s 0 cut_audio_RELEASE.zip --out unsplit.zip
The unsplit file can then be unzipped with:
    unzip unsplit.zip
---------------------------
Files of the corpus
---------------------------
- README - This file
- cut_audio_release.z0x - Split zip files. Must be combined to be able to unzip the data set (see above)
- cut_audio_RELEASE - Directory with the audio files once the zip files have been unzipped.
- metadata.tsv - A tab-separated file containing segment_id, utterance_id, speaker_id, duration and text. The path of an audio file can be constructed from the utterance_id (called utt_id in the metadata file): data/audio/utt_id. Within each show, the episode.
- speaker_information.tsv - A tab-separated file containing name, initials, gender and speaker_id for all speakers in the corpus.
---------------------------
Statistics
---------------------------
281 hrs
221766 utterances
1360 speakers
---------------------------
Authors
---------------------------
Reykjavík University
    Ragnar Pálsson - ragnarp@ru.is
    Michal Borsky
    Inga Rún Helgadóttir
    Helga Svala Sigurðardóttir
    Þorsteinn Daði Gunnarsson
    Judy Y Fong
    Ragnheiður Þórhallsdóttir
    Jón Guðnason - jg@ru.is
The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
    Helga Lára Þorsteinsdóttir
Creditinfo Fjölmiðlavaktin ehf.
    Hilmar Daníelsson
---------------------------
Acknowledgements
---------------------------
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
---------------------------
License
---------------------------
This dataset is licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
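---------------------------
Example: reading metadata.tsv
---------------------------
The description of metadata.tsv under "Files of the corpus" can be turned into a small loader. The following is a minimal sketch, not part of the official release, and it rests on several assumptions that should be checked against the actual files: that the columns appear in the order segment_id, utt_id, speaker_id, duration, text; that the file has no header row (pass has_header=True if it does); that duration is given in seconds; and that audio files live at data/audio/<utt_id> with a .flac suffix. The function names and the sample rows are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed column order, based on the README description of metadata.tsv.
COLUMNS = ["segment_id", "utt_id", "speaker_id", "duration", "text"]


def read_metadata(handle, has_header=False):
    """Yield one dict per well-formed row of a tab-separated metadata file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        record["duration"] = float(record["duration"])  # assumed to be seconds
        yield record


def audio_path(utt_id, root="data/audio", suffix=".flac"):
    """Build the audio path for an utterance (root and suffix are assumptions)."""
    return PurePosixPath(root) / f"{utt_id}{suffix}"


def hours_per_speaker(records):
    """Sum the duration of all utterances per speaker, in hours."""
    totals = {}
    for record in records:
        speaker = record["speaker_id"]
        totals[speaker] = totals.get(speaker, 0.0) + record["duration"] / 3600
    return totals


if __name__ == "__main__":
    # Made-up rows for illustration only; they are not taken from the corpus.
    sample = (
        "seg_1\tutt_0001\tspk_a\t3.5\thalló heimur\n"
        "seg_2\tutt_0002\tspk_b\t1.5\tgóðan dag\n"
    )
    for record in read_metadata(io.StringIO(sample)):
        print(audio_path(record["utt_id"]), record["duration"], record["text"])
```

To use this with the real file, open it with open("metadata.tsv", encoding="utf-8", newline="") and pass the handle to read_metadata. Quoting is disabled in the reader so that quotation marks inside transcripts are kept as plain text.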