--------------------------------------------------------------------------------
          Parallel Speech Recordings for Icelandic L1 and L2 speakers
--------------------------------------------------------------------------------

Language        : Icelandic

Authors         : Caitlin Richter, Þorsteinn Daði Gunnarsson, Bjarni Barkarson, 
Kolbrún Friðriksdóttir, Branislav Bédi, Jón Guðnason

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

This release of data is a parallel learner speech corpus containing the same
material read aloud by both native and non-native Icelandic speakers. The corpus
contains 28,747 mostly unverified speech recordings (26.2 hours) in Icelandic.

The corpus is the result of a crowd-sourcing effort run by the Language and
Voice Lab (LVL) at Reykjavik University, in cooperation with the University of
Iceland and the Árni Magnússon Institute for Icelandic Studies. Recording took
place from October 2021 through October 2024.

The present edition of the corpus was authorized for release in November
2024. The aim is to create an open-source speech corpus to enable research and
development for Icelandic Language Technology and Computer Assisted Language 
Learning. The corpus consists of audio recordings and a metadata file containing
the prompts read by the participants.

For more open resources developed by the Language and Voice Lab (LVL), see the
GitHub and Hugging Face repositories at
https://github.com/cadia-lvl/samromur-asr and
https://huggingface.co/language-and-voice-lab

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- Speech in the corpus has not been validated for adherence to text prompts.

- The utterances were recorded by a smartphone or web app.

- Participants self-reported their age group, gender, native language, and 
  Icelandic proficiency level.

- Participants' ages range from 8 years up to the 70-79 age group. 26 speakers
  are under 18 and 144 are 18 or older.

- The corpus contains 28,747 utterances from 170 speakers, totalling 26.2 hours.

- The number of female speakers is 112, and the number of male speakers is 58. 
  No speakers in this collection had other or unknown gender information.

- The number of utterances from female speakers is 17,994, and the number from
  male speakers is 10,753.

- The corpus is NOT split into train, dev, and test sets.

- If any of the information in the metadata is unavailable, this is indicated
  with NAN in the metadata file.
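
Because the corpus is not split into train, dev, and test sets, users who need
such sets have to create them. The following is a minimal sketch of one possible
approach, a deterministic speaker-disjoint split. The function names, the split
fractions, and the example IDs are assumptions made for illustration and are not
part of the release.

```python
import hashlib
from collections import defaultdict


def assign_split(speaker_id: str, dev_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a speaker ID to 'train', 'dev' or 'test' deterministically."""
    # Hash the speaker ID so the assignment is stable across runs and machines.
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Turn the first 4 bytes into a number in [0, 1).
    position = int.from_bytes(digest[:4], "big") / 2**32
    if position < test_frac:
        return "test"
    if position < test_frac + dev_frac:
        return "dev"
    return "train"


def split_utterances(utterances):
    """Group (speaker_id, utterance_id) pairs into speaker-disjoint sets."""
    splits = defaultdict(list)
    for speaker_id, utterance_id in utterances:
        # Every utterance follows its speaker, so no speaker spans two sets.
        splits[assign_split(speaker_id)].append((speaker_id, utterance_id))
    return dict(splits)


# Hypothetical IDs, for illustration only.
example = [("spk001", "utt0001"), ("spk001", "utt0002"), ("spk002", "utt0003")]
splits = split_utterances(example)
```

The fractions apply to speakers rather than utterances, so with 170 speakers the
resulting proportions will only approximate them.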

--------------------------------------------------------------------------------
Collection Procedure
--------------------------------------------------------------------------------

The data was collected using the website https://samromur.is, the code for which
is available at https://github.com/cadia-lvl/samromur. The collection procedure
is described in "Samrómur: Crowd-sourcing Data Collection for Icelandic
Speech Recognition" [1].

The first time a device visits the website, it is assigned a client id. This
client id, together with the combination of gender, age and native language,
was used to assign the speaker id. If any of these variables was changed, a new
speaker id was created. The corpus is distributed with a
metadata file with detailed information on each utterance and speaker. The
metadata file is encoded as UTF-8 Unicode.
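
The speaker id rule described above can be sketched as follows. This is an
illustration only, not the code used by the Samrómur website: the class name,
the sequential integer IDs, and the example values are assumptions.

```python
from typing import Dict, Tuple

# (client id, gender, age, native language)
SpeakerKey = Tuple[str, str, str, str]


class SpeakerRegistry:
    """Assigns one speaker ID per (client id, gender, age, native language)."""

    def __init__(self) -> None:
        self._ids: Dict[SpeakerKey, int] = {}

    def speaker_id(self, client_id: str, gender: str, age: str,
                   native_language: str) -> int:
        key = (client_id, gender, age, native_language)
        # A combination not seen before gets the next free ID
        # (hypothetical numbering scheme).
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


registry = SpeakerRegistry()
first = registry.speaker_id("client-a", "female", "20-29", "polish")
same = registry.speaker_id("client-a", "female", "20-29", "polish")
changed = registry.speaker_id("client-a", "female", "30-39", "polish")
# first == same, while changed is a new ID because the age group differs.
```

Under such a rule, one person who changes a self-reported value or switches
device would plausibly appear under more than one speaker id.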

The original audio was collected at 16, 44.1, 48, or 96 kHz sampling rate as
*.wav files, according to participants' devices. Each recording contains one
read prompt from a script. The script contains 483 unique prompts consisting of 
1901 tokens and 897 word types. The corpus contains at least 50 recordings from 
each speaker, and up to 461. There are at least 2 recordings of each prompt, 
and up to 117.
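
The per-speaker and per-prompt counts quoted above can be checked with a short
script along the following lines. It is a sketch that assumes the metadata has
already been loaded into (speaker id, prompt) pairs; the function names and the
toy records are hypothetical.

```python
from collections import Counter


def coverage(records):
    """Count recordings per speaker and per prompt.

    `records` is an iterable of (speaker_id, prompt) pairs.
    """
    per_speaker = Counter()
    per_prompt = Counter()
    for speaker_id, prompt in records:
        per_speaker[speaker_id] += 1
        per_prompt[prompt] += 1
    return per_speaker, per_prompt


def min_max(counter):
    """Return the smallest and largest count in a Counter."""
    values = counter.values()
    return min(values), max(values)


# Toy records with made-up IDs and placeholder prompt text.
records = [("spk001", "prompt a"), ("spk001", "prompt b"), ("spk002", "prompt a")]
per_speaker, per_prompt = coverage(records)
speaker_range = min_max(per_speaker)  # (1, 2)
prompt_range = min_max(per_prompt)    # (1, 2)
```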

The prompts were produced by linguists and teachers of Icelandic as a second
language at the University of Iceland and the Árni Magnússon Institute for
Icelandic Studies. The prompt list covers the main phonetic exercises currently
presented to students at the University of Iceland and has been considered with
respect to both phonetics and pedagogy.

--------------------------------------------------------------------------------
Data Format Specifics
--------------------------------------------------------------------------------

- Text : The corpus does not contain separate transcription or prompt files.
         The metadata file contains the prompts in their original text form,
         as the participants saw them, and also in their normalized form.

- Audio: The distributed audio files are encoded as 16 bit linear PCM, 1
         channel, *.wav format. The sampling rate is 16 kHz, 44.1 kHz, 48 kHz,
         or 96 kHz, depending on the device used for recording; the original
         audio is distributed in this release. The audio is located in the
         audio folder, which contains one folder per speaker ID. The audio
         files inside use the following naming convention:
         {speaker_ID}-{utterance_ID}.wav.
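
Since the sampling rate differs between files, it can be useful to inspect each
file before processing. The sketch below builds and parses paths following the
naming convention above and reads the header with Python's standard wave
module. The function names are hypothetical, and the parser assumes that the
utterance ID itself contains no hyphen.

```python
import wave
from pathlib import Path
from typing import Tuple


def utterance_path(audio_root: Path, speaker_id: str, utterance_id: str) -> Path:
    """Build {audio_root}/{speaker_ID}/{speaker_ID}-{utterance_ID}.wav."""
    return audio_root / speaker_id / f"{speaker_id}-{utterance_id}.wav"


def parse_filename(path: Path) -> Tuple[str, str]:
    """Split '{speaker_ID}-{utterance_ID}.wav' into its two IDs.

    Assumes the utterance ID contains no hyphen, so the last hyphen
    in the file name separates the two parts.
    """
    speaker_id, utterance_id = path.stem.rsplit("-", 1)
    return speaker_id, utterance_id


def describe_wav(path: Path) -> dict:
    """Read sampling rate, channel count, bit depth and duration."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }
```

Tools that expect a single sampling rate will need the files resampled to a
common rate first.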

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

   Richter et al. "Parallel Speech Recordings for Icelandic L1 and L2 
   speakers". Web Download. Reykjavik University: Language and Voice Lab, 2024.

Contact: Jon Gudnason (jg@ru.is)

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode)

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Ministry of Culture and Business Affairs.
Project code: Parallel Speech Recordings for Icelandic L1 and L2 speakers
Project name: Language Technology for Icelandic

Special thanks to the assisting LVL members and summer students for all the hard
work.

--------------------------------------------------------------------------------
Stats for the dataset
--------------------------------------------------------------------------------

Language background and gender split:
|                  | Speakers | Recordings |
| ---------------- | -------- | ---------- |
| L1-Icelandic:    |     79   |   13987    |
| L2-Icelandic:    |     91   |   14760    |
| ---------------- | -------- | ---------- |
| Female:          |    112   |   17994    |
| Male:            |     58   |   10753    |
| ---------------- | -------- | ---------- |


Detailed language background of speakers:
| -------- | ---------- | -------------- |
| Speakers | Recordings | Language       |
| -------- | ---------- | -------------- |
|    79    |   13987    | Icelandic      |
|    24    |    3706    | English        |
|     7    |    1374    | Russian        |
|     5    |    1106    | French         |
|     8    |    1098    | Polish         |
|     4    |     985    | Italian        |
|     8    |     902    | Spanish        |
|     7    |     888    | German         |
|     4    |     830    | Hungarian      |
|     2    |     486    | Latvian        |
|     2    |     474    | Persian        |
|     1    |     460    | other          |
|     1    |     425    | Ukrainian      |
|     3    |     392    | Serbo-Croatian |
|     2    |     354    | Czech          |
|     1    |     297    | Romanian       |
|     2    |     210    | Filipino       |
|     2    |     188    | Slovak         |
|     1    |     172    | Norwegian      |
|     2    |     124    | Danish         |
|     1    |      70    | Vietnamese     |
|     1    |      60    | Portuguese     |
|     1    |      56    | Catalan        |
|     1    |      53    | Greek          |
|     1    |      50    | Turkish        |
| -------- | ---------- | -------------- |


Total speakers and utterances:
Speakers: 170
Utterances: 28,747

Average utterance length: 3.28s
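
The average utterance length is consistent with the totals given above, as the
following small check shows. The function name is hypothetical; the figures are
the ones stated in this document.

```python
def mean_utterance_seconds(total_hours: float, n_utterances: int) -> float:
    """Average utterance length in seconds from total duration and count."""
    return total_hours * 3600 / n_utterances


# 26.2 hours spread over 28,747 utterances.
average = mean_utterance_seconds(26.2, 28_747)
print(f"{average:.2f} s")  # prints "3.28 s"
```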

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Mollberg et al. "Samrómur: Crowd-sourcing Data Collection for Icelandic 
    Speech Recognition," 12th International Conference on Language Resources and
    Evaluation (LREC), France, 2020.

--------------------------------------------------------------------------------