dc.contributor.author Pálsson, Ragnar
dc.contributor.author Sigurðardóttir, Helga Svala
dc.contributor.author Þórhallsdóttir, Ragnheiður
dc.contributor.author Borsky, Michal
dc.contributor.author Þorsteinsdóttir, Helga Lára
dc.contributor.author Helgadóttir, Inga Rún
dc.contributor.author Guðnason, Jón
dc.contributor.author Daníelsson, Hilmar
dc.contributor.author Fong, Judy Y
dc.contributor.author Gunnarsson, Þorsteinn Daði
dc.date.accessioned 2022-02-22T14:31:36Z
dc.date.available 2022-02-22T14:31:36Z
dc.date.issued 2022-02-17
dc.identifier.uri http://hdl.handle.net/20.500.12537/193
dc.description The Icelandic broadcast speech corpus is 193 hours of radio and TV data from RÚV. The radio data consists of episodes of Spegillinn, morning news, evening news, Morgunútvarpið, Morgunvaktin and Samfélagið. The TV data consists of episodes of Kastljós. All the data is from episodes broadcast in the period from January 2020 to August 2021. The data contains 40,746 utterances from 1,360 speakers. The text is normalized. The data is aligned and segmented, ready for ASR training. The data set includes both prompted speech (e.g. from the news) and conversational speech (e.g. Morgunvaktin and Kastljós). This data set is published by RÚV, transcribed by Creditinfo and aligned at Reykjavík University with the help of Tiro's automatic speech recognizer. Special thanks to Tiro for supplying transcriptions with per-word timestamps, which were essential in the alignment process. This work is licensed under the Creative Commons Attribution 4.0 International License.
dc.language.iso isl
dc.publisher Reykjavík University
dc.publisher The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
dc.publisher Creditinfo Fjölmiðlavaktin ehf.
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject broadcast
dc.subject radio broadcast
dc.subject tv broadcast
dc.subject automatic speech recognition
dc.subject ruv radio
dc.subject ruv tv
dc.title Icelandic broadcast speech
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding Clarin IS Repository
contact.person Ragnar Pálsson ragnarp@ru.is Reykjavík University
contact.person Jón Guðnason jg@ru.is Reykjavík University
sponsor Ministry of Education, Science and Culture Transcribe and align radio and TV material (H2) Language Technology for Icelandic 2019-2023 nationalFunds
size.info 66 gb
size.info 193 hours
size.info 40746 utterances
size.info 1966745 words
files.size 27663814912
files.count 10
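
The size fields above use different units, so a short calculation helps relate them. The sketch below is illustrative only: it copies the values from this record, converts files.size from bytes to binary gigabytes (which matches the 25.76 GB download size shown for this item), and derives rough per-utterance averages. The 193 hours figure is rounded, so the averages are approximate, and the record does not explain how the 66 gb figure in size.info relates to the download size. The function names are hypothetical.

```python
# Values copied from the metadata record above.
FILES_SIZE_BYTES = 27_663_814_912  # files.size
HOURS = 193                        # size.info (rounded in the record)
UTTERANCES = 40_746                # size.info
WORDS = 1_966_745                  # size.info


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


def summarize() -> dict:
    """Derive approximate summary figures from the record's size fields."""
    return {
        "download_gib": round(bytes_to_gib(FILES_SIZE_BYTES), 2),
        "avg_utterance_seconds": round(HOURS * 3600 / UTTERANCES, 2),
        "avg_words_per_utterance": round(WORDS / UTTERANCES, 2),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```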


 Files in this item

 Total size of all files in this item: 25.76 GB
 This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name        Size     Format     Description  MD5
README.txt  3.83 KB  Text file  Unknown      bee3bc2439355fee77156ab87078562a

File preview (README.txt):
Icelandic broadcast speech

About the Icelandic broadcast speech corpus
---------------------------
The Icelandic broadcast speech corpus is 193 hours of radio and TV data from RÚV. 
The radio data consists of episodes of Spegillinn, morning news, evening news, 
Morgunútvarpið, Morgunvaktin and Samfélagið. The TV data consists of episodes of Kastljós. 
All the data is from episodes broadcast in the period from January 
2020 to August 2021. The data contains 40,746 utterances from 1,360 speakers. 
The data is aligned and segmented, ready for ASR training. The data set includes 
both prompted speech (e.g. from the News) and conversational speech (e.g. Morgunvaktin and Kastljósið).
This data set is published by RÚV, transcribed by Creditinfo and aligned at Reykjavik University with the help of Tiro's automatic speech recognizer. 
Special thanks to Tiro for supplying transcriptions with per-word timestamps from their automatic speech recognizer, which were essential in the alig . . .

Name                     Size      Format           Description  MD5
metadata.tsv             14.31 MB  Unknown          Unknown      43c1705145d20497f3bad1560502206a
speaker_information.tsv  49.51 KB  Unknown          Unknown      512657b60add3e71843b85ab0d7ce0c7
cut_audio_RELEASE.z01    3.91 GB   Unknown          Unknown      6b3ab042fa8ad276fced07352676f406
cut_audio_RELEASE.z02    3.91 GB   Unknown          Unknown      459be745d78ac230ed815eb5f00d3e06
cut_audio_RELEASE.z03    3.91 GB   Unknown          Unknown      aa5de9a1fef84ae23f3782584d124c9b
cut_audio_RELEASE.z04    3.91 GB   Unknown          Unknown      436e5ccac3187b72d021888739979eac
cut_audio_RELEASE.z05    3.91 GB   Unknown          Unknown      141bdc0612d9d9cfcae1aafb553a507f
cut_audio_RELEASE.z06    3.91 GB   Unknown          Unknown      420de02e960b61d27b7045a8a1c1c45c
cut_audio_RELEASE.zip    2.31 GB   application/zip  Unknown      096ae66512323726f3b3ecaffd9673af
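
Since the audio is split across several multi-gigabyte parts, it can be worth checking each download against the MD5 values listed above before extracting. The following is a minimal sketch using only the Python standard library; the file names and checksums are copied from this record, while the function names and the idea of a single local download directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this item record.
EXPECTED_MD5 = {
    "README.txt": "bee3bc2439355fee77156ab87078562a",
    "metadata.tsv": "43c1705145d20497f3bad1560502206a",
    "speaker_information.tsv": "512657b60add3e71843b85ab0d7ce0c7",
    "cut_audio_RELEASE.z01": "6b3ab042fa8ad276fced07352676f406",
    "cut_audio_RELEASE.z02": "459be745d78ac230ed815eb5f00d3e06",
    "cut_audio_RELEASE.z03": "aa5de9a1fef84ae23f3782584d124c9b",
    "cut_audio_RELEASE.z04": "436e5ccac3187b72d021888739979eac",
    "cut_audio_RELEASE.z05": "141bdc0612d9d9cfcae1aafb553a507f",
    "cut_audio_RELEASE.z06": "420de02e960b61d27b7045a8a1c1c45c",
    "cut_audio_RELEASE.zip": "096ae66512323726f3b3ecaffd9673af",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the downloads.
    for name, status in verify_downloads(".").items():
        print(f"{status:9s} {name}")
```

Any file reported as "mismatch" is most likely an incomplete or corrupted download and should be fetched again before the archive parts are combined.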
