--------------------------------------------------------------------------------
                  Samrómur L2 24.10
--------------------------------------------------------------------------------

Language        : Icelandic

Authors         : Luke O'Brien, Þorsteinn Daði Gunnarsson, Eydís Huld 
                  Magnúsdóttir, Jon Gudnason

Recommended use : speech recognition, speaker verification, speaker
                  identification and speaker enrollment

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

This release of data from the Samrómur collection focuses on speakers whose
native language is not Icelandic. The corpus contains 50.1 hours of speech
recordings in Icelandic.

The corpus is a result of the crowd-sourcing effort run by the Language and
Voice Lab (LVL) at Reykjavik University, in cooperation with Almannarómur, the
Icelandic Center for Language Technology. Recording started in June 2024 and
ended in October 2024.

The present edition of the corpus was authorized for release in November 2024.
The aim is to expand the open-source data for Icelandic to include more
non-native speakers. This corpus consists of audio recordings and a metadata
file containing the prompts read by the participants.

For more open resources developed by the Language and Voice Lab (LVL), see the
GitHub repository at https://github.com/cadia-lvl/samromur-asr or the Icelandic
Clarin repository at https://repository.clarin.is/

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The utterances were recorded using a smartphone or the web app.

- Participants self-reported their age group, gender, native language and
  Icelandic proficiency.

- The corpus contains 36,891 automatically verified utterances from 273 speakers,
  totalling 50.1 hours.

- Female speakers account for 35h35m of the data, male speakers for 14h27m,
  and speakers with unknown gender information for 0h7m.

- The number of female speakers is 185, and the number of male speakers is 85.
  The number of speakers with unknown gender information is 3.

- The number of utterances from female speakers is 25,867; from male speakers,
  10,944; and from speakers with unknown gender information, 80.

- Icelandic proficiency was split into 3 options:
  - "Beginner", which made up 4,627 utterances
  - "Intermediate", which made up 20,184 utterances
  - "Advanced", which made up 7,943 utterances

- For similar datasets, see Samrómur L2 22.09, Samrómur 21.05,
  Samrómur Queries 21.12 or Samrómur Children 21.09.

- Any information that is unavailable is indicated with NAN in the metadata
  file.
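
As a rough illustration of handling the NAN marker, the sketch below reads a
tab-separated table and maps NAN to None. The column names, the tab delimiter
and the sample rows are assumptions made for this example; they are not taken
from the distributed metadata file.

```python
import csv
import io

# Hypothetical sample: column names and values are invented for illustration.
SAMPLE = (
    "speaker_id\tgender\tage\n"
    "000001\tfemale\t20-29\n"
    "000002\tNAN\t30-39\n"
)


def read_metadata(handle, missing="NAN"):
    """Return rows as dicts, replacing the missing-value marker with None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


rows = read_metadata(io.StringIO(SAMPLE))
# Speakers whose gender field carried the NAN marker.
unknown_gender = [row["speaker_id"] for row in rows if row["gender"] is None]
```

Converting the marker on read keeps later filters from treating "NAN" as a
real category.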

--------------------------------------------------------------------------------
Collection Procedure
--------------------------------------------------------------------------------

The data was collected using the website https://samromur.is, the code of which
is available at https://github.com/cadia-lvl/samromur. The collection procedure
is described in "Samrómur: Crowd-sourcing Data Collection for Icelandic
Speech Recognition" [1] and "Samrómur: Crowd-sourcing large amount of data" [2].

The original audio was collected at a 44.1 kHz or 48 kHz sampling rate as *.wav
files, which were down-sampled to 16 kHz and converted to *.flac. Each recording
contains one read prompt from a script.
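
A conversion of this kind could be reproduced with a tool such as ffmpeg. The
sketch below only builds the command line. The choice of ffmpeg and the file
names are assumptions, and this is not necessarily how the corpus was prepared;
the mono, 16-bit target follows the format listed under Data Format Specifics.

```python
import subprocess


def build_flac_command(src_wav, dst_flac, rate_hz=16000):
    """Build an ffmpeg command that resamples a WAV file to mono 16-bit FLAC."""
    return [
        "ffmpeg",
        "-i", src_wav,          # input recording (44.1 kHz or 48 kHz WAV)
        "-ar", str(rate_hz),    # output sampling rate in Hz
        "-ac", "1",             # one channel
        "-sample_fmt", "s16",   # 16-bit samples
        dst_flac,               # output format is inferred from the extension
    ]


def convert(src_wav, dst_flac):
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_flac_command(src_wav, dst_flac), check=True)
```

For example, build_flac_command("in.wav", "out.flac") returns the argument
list without touching any files, which makes it easy to inspect before running.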

The first time a device visits the website, it is assigned a client ID. This
client ID, together with a combination of gender, age and native language, was
used to assign the speaker ID. The corpus is distributed with a metadata file
with detailed information on each utterance and speaker. The metadata file is
encoded as UTF-8 Unicode.
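
One way to picture the speaker ID assignment is a lookup keyed on the client
ID and the self-reported fields. This is a minimal sketch, not the collection
site's code: the class name, the sequential numbering and the six-digit
zero-padded format are all assumptions.

```python
from itertools import count


class SpeakerRegistry:
    """Assign one speaker ID per (client ID, gender, age, native language)."""

    def __init__(self):
        self._ids = {}
        self._next = count(1)

    def speaker_id(self, client_id, gender, age, native_language):
        key = (client_id, gender, age, native_language)
        if key not in self._ids:
            # Hypothetical format: sequential, zero-padded to six digits.
            self._ids[key] = f"{next(self._next):06d}"
        return self._ids[key]


registry = SpeakerRegistry()
first = registry.speaker_id("client-a", "female", "20-29", "Polish")
again = registry.speaker_id("client-a", "female", "20-29", "Polish")
other = registry.speaker_id("client-a", "male", "30-39", "English")
```

Under this scheme two people who share a device but report different
demographic details receive different speaker IDs, while the same details
from a different device also produce a new ID.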

The prompts were gathered from The Icelandic Gigaword Corpus, which is available
at http://clarin.is/en/resources/gigaword. The prompts were taken specifically
from the books for children and teenagers in the Icelandic Gigaword Corpus.
Only a few sentences were taken from each source, and the prompts were then
randomly shuffled.

Prompts were pulled from these sources if they met the following criteria:
- Total sentence length between 2 and 13 (exclusive)
- Maximum word length of 15 characters
- Starts with a capital letter
- Ends with a punctuation mark (".", "!", "?", "“", """)
Sentences with any of the following were excluded:
- Numbers
- Abbreviations
- Punctuation marks other than: ":", "„", """, "“", "”", "„", "‟", ",", ";", "!", ".", "?".
- Odd number of quotation marks ("„", """, "“", "”", "„", "‟")
- A period (.) somewhere in the middle of a sentence
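
The criteria above can be sketched as a single filter function. This is an
illustration under stated assumptions, not the script used for the corpus:
sentence length is assumed to be counted in words, "punctuation" is taken to
mean any character in a Unicode punctuation category, a period "in the middle"
is any period before the final mark, and abbreviations are matched against a
caller-supplied set (a hypothetical parameter).

```python
import unicodedata

QUOTES = set('„"“”‟')
ALLOWED_PUNCT = set(":,;!.?") | QUOTES
END_MARKS = set('.!?“"')
PUNCT_CHARS = "".join(ALLOWED_PUNCT)


def is_valid_prompt(sentence, abbreviations=frozenset()):
    """Return True if the sentence passes every listed criterion."""
    words = sentence.split()
    # Length strictly between 2 and 13 (assumed to be a word count).
    if not 2 < len(words) < 13:
        return False
    # No word longer than 15 characters, ignoring attached punctuation.
    if any(len(word.strip(PUNCT_CHARS)) > 15 for word in words):
        return False
    if not sentence[0].isupper() or sentence[-1] not in END_MARKS:
        return False
    if any(ch.isdigit() for ch in sentence):
        return False
    if any(word.lower() in abbreviations for word in words):
        return False
    # Reject punctuation outside the allowed set.
    for ch in sentence:
        if unicodedata.category(ch).startswith("P") and ch not in ALLOWED_PUNCT:
            return False
    # Quotation marks must come in pairs.
    if sum(ch in QUOTES for ch in sentence) % 2 == 1:
        return False
    # A period is only allowed as the final mark (before closing quotes).
    body = sentence.rstrip("".join(QUOTES))
    return "." not in body[:-1]
```

With these assumptions, is_valid_prompt("Hún fór heim í gær.") is accepted,
while "Hún fór." is rejected for being too short and "Hún á 3 ketti heima."
for containing a number.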

--------------------------------------------------------------------------------
Data Format Specifics
--------------------------------------------------------------------------------

- Text : The metadata file contains the prompts in their original text form,
         as the participants saw them.

- Audio: The distributed audio files are encoded at 16 kHz sampling rate, 16 bit
         linear PCM, 1 channel, *.flac format. The corpus is split into train,
         dev and test subsets. Each subset contains folders that correspond to
         speaker IDs and the audio files inside use the following naming
         convention: {speaker_ID}-{utterance_ID}.flac.
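
The layout described above can be expressed as two small helpers, one that
builds a file path and one that splits a file name back into its parts. This
is a sketch only: the function names, the root directory name and the example
IDs are invented, and IDs are assumed to contain no hyphen.

```python
from pathlib import Path

SUBSETS = {"train", "dev", "test"}


def utterance_path(root, subset, speaker_id, utterance_id):
    """Build {root}/{subset}/{speaker_ID}/{speaker_ID}-{utterance_ID}.flac."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset}")
    return Path(root) / subset / speaker_id / f"{speaker_id}-{utterance_id}.flac"


def parse_name(path):
    """Split a file name into (speaker_ID, utterance_ID)."""
    speaker_id, sep, utterance_id = Path(path).stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {path}")
    return speaker_id, utterance_id


# Example IDs are made up; real IDs come from the metadata file.
example = utterance_path("corpus", "dev", "000123", "0045678")
```

Here example.as_posix() gives "corpus/dev/000123/000123-0045678.flac", and
parse_name(example) recovers the two IDs.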

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

   O'Brien et al. "Samrómur L2 24.10". Web Download. Reykjavik
   University: Language and Voice Lab, 2024.

Contact: Jon Gudnason (jg@ru.is)

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode)

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic.
The programme, which is managed and coordinated by Almannarómur,
is funded by the Icelandic Ministry of Education, Science and Culture.

A big thank you to the volunteers who gave their time to this project.

--------------------------------------------------------------------------------
Stats for the dataset
--------------------------------------------------------------------------------


Age and gender split:
|                  | Total |
| ---------------- | ----- |
| 0-19:            | 0%    |
| 20-29:           | 45.9% |
| 30-39:           | 28.3% |
| 40-49:           | 18.2% |
| 50-59:           | 6.9%  |
| 60-69:           | 0.5%  |
| 70-79:           | 0.2%  |
| 80+:             | 0%    |
| ---------------- | ----- |
| Female:          | 70.1% |
| Male:            | 29.7% |
| Other:           | 0.2%  |
| ---------------- | ----- |
| Duration (h):    | 50.1  |
| Unique speakers: | 273   |


Total speakers and utterances:
Speakers: 273
Utterances: 36,891

Average utterance length: 4.89s
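
The figures above can be cross-checked with a few lines of arithmetic. The
sketch uses only numbers stated in this document; the variable names are
arbitrary. The gender percentages in the table appear to correspond to shares
of utterances rather than shares of speakers.

```python
TOTAL_UTTERANCES = 36_891
TOTAL_HOURS = 50.1
UTTERANCES_BY_GENDER = {"female": 25_867, "male": 10_944, "unknown": 80}

# Average utterance length in seconds: total duration / number of utterances.
average_seconds = TOTAL_HOURS * 3600 / TOTAL_UTTERANCES

# Share of utterances per gender, in percent, rounded to one decimal.
shares = {
    gender: round(100 * n / TOTAL_UTTERANCES, 1)
    for gender, n in UTTERANCES_BY_GENDER.items()
}

print(f"{average_seconds:.2f} s")  # 4.89 s
print(shares)  # {'female': 70.1, 'male': 29.7, 'unknown': 0.2}
```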

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Mollberg et al. "Samrómur: Crowd-sourcing Data Collection for Icelandic 
    Speech Recognition," 12th International Conference on Language Resources and
    Evaluation (LREC), France, 2020.

[2] Hedström et al. "Samrómur: Crowd-sourcing large amount of data,"
    13th International Conference on Language Resources and Evaluation (LREC), 
    France, 2022.

--------------------------------------------------------------------------------