Spjallromur - Icelandic Conversational Speech

About the Spjallrómur corpus
----------------------------
Spjallrómur is an open-source conversational speech corpus for speech
technology development. The corpus is 21 hrs 20 mins long and contains 54
conversations with 102 speakers. The data was collected over one year
(September 2020 - September 2021) by Reykjavík University. The corpus has two
parts: the first contains full conversations and the second contains half
conversations.

The dataset was created primarily for automatic speech recognition, but its
conversational nature also makes it suitable for other speech technology
fields such as speaker identification, speaker diarization, and conversational
language modeling.

Spjallrómur was collected using a custom-made online chat platform called
spjall, which is Icelandic for chat. Each speaker used their own microphone
and device. Some microphones picked up background noise, such as neighboring
or other speakers. The audio from each microphone was saved to a separate
.wav file. There are two speakers per conversation. The speaker set contains
both native and non-native Icelandic speakers. All speakers are adults. Each
conversation has two sets of files, one per speaker: demographics metadata, an
audio file, and a transcript. Because of network lag, the two audio files in a
conversation sometimes differ slightly in length. As there was a limited
number of participants, some speakers may appear in more than one
conversation. The text has not been aligned with the audio.
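The length difference between the two recordings of a conversation can be
measured directly from the WAV headers. The following is a minimal sketch using
Python's standard wave module; the function names are illustrative and not part
of the corpus tooling.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_gap_seconds(path_a, path_b):
    """Absolute difference in length between two speakers' recordings."""
    return abs(wav_duration_seconds(path_a) - wav_duration_seconds(path_b))
```

Calling duration_gap_seconds on the speaker_a and speaker_b files of one session
gives the offset in seconds. How that offset is distributed over the recording
is not stated here, so treat it as a rough indicator only.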

The full conversations comprise 48 conversations with 96 speakers, totaling 19 hrs.
The half conversations comprise 6 partial conversations with 6 speakers, totaling
2 hrs 20 mins.

Personally identifiable information has been redacted in the audio with a
400 Hz beep and replaced with XXX in the transcript.

Non-words are marked with () or []. Partial words are marked with [HIK: ..].
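For tasks that need plain word sequences, these markers can be stripped with a
few regular expressions. This is a minimal sketch under assumptions: it treats
any text enclosed in () or [] as a non-word, any [HIK: ...] span as a partial
word, and XXX as a standalone token. The exact marker contents in the corpus
may differ, and the sample words in the usage note are made up.

```python
import re

# Partial words, e.g. "[HIK: hva]". Must be removed before generic brackets.
HIK_RE = re.compile(r"\[HIK:[^\]]*\]")
# Non-words enclosed in parentheses or square brackets.
NON_WORD_RE = re.compile(r"\([^)]*\)|\[[^\]]*\]")
# Placeholder for redacted personally identifiable information.
REDACTED_RE = re.compile(r"\bXXX\b")


def clean_transcript_text(text, drop_redacted=True):
    """Remove annotation markers and collapse the remaining whitespace."""
    text = HIK_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    if drop_redacted:
        text = REDACTED_RE.sub(" ", text)
    return " ".join(text.split())
```

For example, clean_transcript_text("já [HIK: hva] hvað (hlátur) segir XXX")
returns "já hvað segir". Pass drop_redacted=False to keep the XXX placeholder.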

The structure of the corpus
---------------------------
<corpus root>
    |
    . - docs/
            |
            . - spjallromur_README.txt
            |
            . - manual_transcripts.json
    |
    . - data/
            |
            . - half_conversations/
                    |
                    . - <session_id>/
                            |
                            . - speaker_x_convo_<session_id>_demographics.json
                            |
                            . - speaker_x_convo_<session_id>_transcript.json
                            |
                            . - speaker_x_convo_<session_id>.wav
            |
            . - full_conversations/
                    |
                    . - <session_id>/
                            |
                            . - speaker_a_convo_<session_id>_demographics.json
                            |
                            . - speaker_a_convo_<session_id>_transcript.json
                            |
                            . - speaker_a_convo_<session_id>.wav
                            |
                            . - speaker_b_convo_<session_id>_demographics.json
                            |
                            . - speaker_b_convo_<session_id>_transcript.json
                            |
                            . - speaker_b_convo_<session_id>.wav

Session IDs are 8-character hexadecimal identifiers (e.g. 0f2c315c, 3ac74ae1).
Half conversations have one speaker per session (speaker_a or speaker_b).
Full conversations have two speakers per session (speaker_a and speaker_b).
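
The layout above can be indexed with a short script. The following is a sketch,
not an official loader: it assumes session IDs are lowercase hexadecimal, as in
the examples, and that data_dir points at the data/ folder. The function name
index_corpus is illustrative.

```python
import re
from pathlib import Path

# Matches the three file kinds found in each session folder.
FILE_RE = re.compile(
    r"^speaker_(?P<speaker>[ab])_convo_(?P<session>[0-9a-f]{8})"
    r"(?:_(?P<kind>demographics|transcript)\.json|\.wav)$"
)


def index_corpus(data_dir):
    """Map (subset, session_id, speaker) to a dict of file kind -> Path.

    File kinds are "wav", "transcript", and "demographics".
    """
    index = {}
    for path in sorted(Path(data_dir).glob("*/*/speaker_*")):
        match = FILE_RE.match(path.name)
        if match is None:
            continue  # Skip anything that does not follow the naming scheme.
        subset = path.parent.parent.name  # half_conversations or full_conversations
        key = (subset, match["session"], match["speaker"])
        kind = match["kind"] or "wav"
        index.setdefault(key, {})[kind] = path
    return index
```

Each full conversation should yield two keys (speakers a and b) and each half
conversation one key, which gives a quick completeness check on a download.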

* speaker_x_convo_<session_id>.wav - Each audio file is 16 bit, 16000 Hz, single
channel WAVE. It contains the voice of speaker x in that conversation.

* speaker_x_convo_<session_id>_transcript.json - JSON file of the transcript
and corresponding metadata, generated within the Tiro text editor
(https://tal.tiro.is). It contains word-level segments with timing. Metadata
includes: name (filename), fileType, languageCode, recordingDuration (float
seconds), speakers, audio_file, session_id.

* speaker_x_convo_<session_id>_demographics.json - Contains session_id (8-char
identifier), the speaker's age, gender, and audio duration in seconds. Gender
and age are in Icelandic. Here's the mapping to English:

gender
id: 'kona', name: 'female'
id: 'karl', name: 'male'
id: 'annad', name: 'other'

age group
id: 'unglingur', name: '18-19'
id: 'tvitugt', name: '20-29'
id: 'thritugt', name: '30-39'
id: 'fertugt', name: '40-49'
id: 'fimmtugt', name: '50-59'
id: 'sextugt', name: '60-69'
id: 'sjotugt', name: '70-79'
id: 'attraett', name: '80-89'
id: 'niraett', name: '90+'
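
The mapping above can be applied when loading a demographics file. This is a
minimal sketch: the JSON key names "gender" and "age" are assumptions, since
only session_id is named explicitly in this README, so they are exposed as
parameters. Unknown values are passed through unchanged.

```python
GENDER_EN = {"kona": "female", "karl": "male", "annad": "other"}

AGE_GROUP_EN = {
    "unglingur": "18-19",
    "tvitugt": "20-29",
    "thritugt": "30-39",
    "fertugt": "40-49",
    "fimmtugt": "50-59",
    "sextugt": "60-69",
    "sjotugt": "70-79",
    "attraett": "80-89",
    "niraett": "90+",
}


def translate_demographics(record, gender_key="gender", age_key="age"):
    """Return a copy of a demographics record with English gender and age labels.

    gender_key and age_key are assumed key names; adjust them to the actual files.
    """
    translated = dict(record)
    if gender_key in record:
        translated[gender_key] = GENDER_EN.get(record[gender_key], record[gender_key])
    if age_key in record:
        translated[age_key] = AGE_GROUP_EN.get(record[age_key], record[age_key])
    return translated
```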

Transcript formats
------------------
Two transcript formats are provided:

1. Automatic transcripts (*_transcript.json) - One file per speaker per
   session, in each session folder. Word-level granularity with segments
   containing timed words. All 54 sessions (48 full + 6 half) have individual
   transcripts.

2. Manual transcripts (docs/manual_transcripts.json) - A subset of full
   conversations (21 sessions, ~7.9 hrs) has been manually transcribed by
   three transcribers (Baldur, Hrafnhildur, Selma). Seven sessions (~3.0 hrs)
   have two or more transcriptions by different transcribers; use transcript_id
   (session_id_transcriber) to distinguish entries.

   Each conversation entry contains: session_id, transcript_id, speakers,
   speaker_a and speaker_b (each with recordingDuration, audio_file, fileType),
   transcriber, and turns (speaker, text, startTime, endTime in float seconds).
   When multiple transcribers transcribed the same session, they appear as
   separate entries in the consolidated file.

   Transcription methodology differed by transcriber:
   - Selma transcribed by listening to each channel (speaker A, speaker B)
     separately. Speaker identity was thus known in advance.
   - Baldur and Hrafnhildur transcribed from a mix of the two channels. They
     performed transcription and speaker diarization at the same time.

   There is no word-level segmentation; turn boundaries reflect each
   transcriber's own judgment of where one utterance ends and the next begins.
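
   The fields listed above are enough to find sessions with more than one
   transcription and to sum speaking time per speaker. The sketch below works
   on an iterable of conversation entries. How those entries are nested inside
   manual_transcripts.json is not specified here, so loading the file is left
   to the caller, and the function names are illustrative.

```python
from collections import defaultdict


def group_by_session(entries):
    """Group manual transcript entries by their session_id."""
    by_session = defaultdict(list)
    for entry in entries:
        by_session[entry["session_id"]].append(entry)
    return dict(by_session)


def multiply_transcribed(entries):
    """Return {session_id: [transcriber, ...]} for sessions with 2+ transcriptions."""
    return {
        session_id: sorted(entry["transcriber"] for entry in group)
        for session_id, group in group_by_session(entries).items()
        if len(group) >= 2
    }


def speaking_time(entry):
    """Total turn duration in seconds per speaker for one transcript entry."""
    totals = defaultdict(float)
    for turn in entry["turns"]:
        totals[turn["speaker"]] += turn["endTime"] - turn["startTime"]
    return dict(totals)
```

   Because turn boundaries reflect each transcriber's judgment, speaking_time
   may differ between two transcriptions of the same session.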

Authors
-------

Reykjavík University

Judy Y Fong - judy@judyyfong.xyz
Staffan Hedström
Ólafur Helgi Jónsson
Lára Margrét H. Hólmfriðardóttir
Sunneva Þorsteinsdóttir
Málfriður Anna Eiríksdóttir
David Erik Mollberg
Eydís Huld Magnúsdóttir
Ragnheiður Þórhallsdóttir
Jon Gudnason - jg@ru.is

Acknowledgements
----------------
Special thanks to the other members of the Language and Voice Lab
(https://lvl.ru.is), the student employees, Róbert Kjaran, and Magnús Teitsson.

This project was funded by the Language Technology Programme for Icelandic
2019-2023. The programme, which is managed and coordinated by Almannarómur, is
funded by the Icelandic Ministry of Education, Science and Culture.

This project was funded in part by the Icelandic Directorate of Labour's
student summer job program in 2021.

Citations
---------
 @misc{fong-spjallromur,
 title = {Spjallromur - Icelandic Conversational Speech},
 author = {Fong, Judy Y and Hedstr{\"o}m, Staffan and J{\'o}nsson, {\'O}lafur
Helgi and H{\'o}lmfri{\dh}ard{\'o}ttir, L{\'a}ra Margr{\'e}t H. and
{\TH}orsteinsd{\'o}ttir, Sunneva and Eir{\'{\i}}ksd{\'o}ttir, M{\'a}lfri{\dh}ur
Anna and Mollberg, David Erik and Magn{\'u}sd{\'o}ttir, Eyd{\'{\i}}s Huld and
{\TH}{\'o}rhallsd{\'o}ttir, Ragnhei{\dh}ur and Gudnason, Jon},
 url = {},
 note = {{CLARIN}-{IS}},
 copyright = {Creative Commons - Attribution 4.0 International ({CC} {BY} 4.0)},
 year = {2022} }

License
-------
This dataset is released under a Creative Commons Attribution 4.0 International
(CC BY 4.0) license. (https://creativecommons.org/licenses/by/4.0/)