# Talrómur 2

Talrómur 2 is an openly licensed speech corpus for text-to-speech (TTS) research and development.
The corpus consists of 56,225 audio clips of forty different speakers reading short sentences. 
The audio was recorded in 2021 by Reykjavík University and The Icelandic National Broadcasting Service (RÚV) as part of The Icelandic Language Technology Program.

The speakers in the corpus are divided into four cohorts. Each cohort includes 10 speakers with similar voice characteristics. 

Each set shares the same structure and format. Each voice also includes 29 additional long sentences intended for modelling higher-level prosody.
These are located in separate folders but share the same speaker ID and format.


## Format

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
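
As a quick sanity check after downloading, the stated audio format can be verified with Python's standard `wave` module. This is a minimal sketch, not part of the corpus tooling; the function name `check_wav_format` is made up for illustration.

```python
import wave

# Expected format as stated in this README.
EXPECTED_CHANNELS = 1         # single-channel (mono)
EXPECTED_SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16-bit PCM
EXPECTED_SAMPLE_RATE = 22050  # Hz


def check_wav_format(path):
    """Return a list of mismatch descriptions; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width (bytes): {wav.getsampwidth()}")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate: {wav.getframerate()}")
    return problems
```
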
Transcriptions and file names are provided in `index.tsv`. The file contains one tab-delimited record per line, with the following fields:

 - **ID:** the name of the corresponding audio file (without extension)
 - **Transcription:** the sentence spoken in the recording
 - **Normalized text:** the normalized form of the transcription
 - **Transcription ID:** a unique ID for the utterance (many utterances are shared by multiple voices)
 - **Bad recording:** 1 if the recording has flaws, else 0
 - **Comments:** zero or more comments, separated by tab characters
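
The record layout above can be read with a few lines of Python. The following is a sketch under stated assumptions: the fields appear in the order listed, the file has no header row, it is UTF-8 encoded, and any columns after the fifth are comments. The names `Record`, `parse_index_line`, and `read_index` are hypothetical and not part of the corpus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    audio_id: str          # audio file name without extension
    transcription: str
    normalized_text: str
    transcription_id: str
    bad_recording: bool    # True if the flag field is "1"
    comments: List[str] = field(default_factory=list)


def parse_index_line(line):
    """Parse one tab-delimited line, assuming the field order listed above."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}")
    audio_id, transcription, normalized, transcription_id, bad = parts[:5]
    # Everything after the fifth field is treated as a comment (assumption).
    comments = [c for c in parts[5:] if c]
    return Record(audio_id, transcription, normalized,
                  transcription_id, bad.strip() == "1", comments)


def read_index(path):
    """Read all non-empty lines of an index.tsv file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return [parse_index_line(line) for line in f if line.strip()]
```

For example, `read_index("s146/index.tsv")` would return a list of `Record` objects, assuming a speaker folder with that name exists in your copy of the corpus.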

Things to note:

 - Transcription IDs coincide between speakers. Furthermore, the IDs coincide with IDs from the Talrómur corpus, meaning that if a sentence occurs in both corpora, it has the same transcription ID in each
 - Comments are either a remark about the quality of the recording or a hint on how to normalize the utterance where the pronunciation is ambiguous (e.g. phone numbers)
 - You can filter good recordings by running `awk -F'\t' '$5 == 0' */index.tsv` (the Bad recording flag is the fifth field in the list above)
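
Because transcription IDs are shared between speakers, good recordings of the same sentence can be grouped across voices. The sketch below assumes the field order listed above (transcription ID in the fourth field, Bad recording flag in the fifth) and UTF-8 encoding; the function name `good_ids_by_transcription` is hypothetical.

```python
import csv
from collections import defaultdict


def good_ids_by_transcription(index_paths):
    """Map each transcription ID to the audio IDs of its good recordings."""
    groups = defaultdict(list)
    for path in index_paths:
        with open(path, encoding="utf-8", newline="") as f:
            # QUOTE_NONE keeps quotation marks in the text as literal characters.
            for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                if len(row) < 5:
                    continue  # skip blank or malformed lines
                audio_id, transcription_id, bad = row[0], row[3], row[4]
                if bad.strip() == "0":
                    groups[transcription_id].append(audio_id)
    return dict(groups)
```

A typical call would be `good_ids_by_transcription(glob.glob("*/index.tsv"))` after `import glob`; entries with more than one audio ID are sentences with several good recordings.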

 
## Voices

The corpus includes forty different voices in four cohorts. 

| Cohort | Id | Gender | # of recordings | Total duration (h) | 
| ------ | -- | ------ | --------------: | -----------------: |
| 1 | s146 | f | 1403 | 1.05 |
| 1 | s180 | f | 1579 | 2.25 |
| 1 | s186 | f | 1098 | 2.15 |
| 1 | s208 | f | 1400 | 2.11 |
| 1 | s209 | f | 1500 | 1.99 |
| 1 | s214 | f | 1549 | 2.54 |
| 1 | s215 | f | 1324 | 2.32 |
| 1 | s221 | f | 1044 | 1.63 |
| 1 | s264 | f | 1450 | 2.24 |
| 1 | s268 | f | 1500 | 2.43 |
| 2 | s169 | f | 1875 | 2.60 |
| 2 | s185 | f | 1555 | 2.64 |
| 2 | s187 | f | 1700 | 2.56 |
| 2 | s200 | f | 1752 | 2.60 |
| 2 | s226 | f | 1507 | 2.26 |
| 2 | s228 | f | 1650 | 2.49 |
| 2 | s247 | f | 1550 | 2.20 |
| 2 | s251 | f | 1499 | 2.49 |
| 2 | s256 | f | 1278 | 1.84 |
| 2 | s258 | f | 1400 | 2.08 |
| 3 | s124 | m | 1335 | 2.22 |
| 3 | s176 | m | 1131 | 2.13 |
| 3 | s178 | m | 1322 | 2.32 |
| 3 | s181 | m | 1149 | 1.75 |
| 3 | s188 | m | 1257 | 2.31 |
| 3 | s206 | m | 1600 | 2.18 |
| 3 | s220 | m | 1494 | 2.26 |
| 3 | s225 | m | 1015 | 1.88 |
| 3 | s234 | m | 1023 | 1.61 |
| 3 | s235 | m |  955 | 1.63 |
| 4 | s157 | m | 1195 | 2.01 |
| 4 | s162 | m | 1442 | 1.90 |
| 4 | s216 | m | 1348 | 2.33 |
| 4 | s222 | m | 1399 | 2.61 |
| 4 | s223 | m | 1649 | 2.65 |
| 4 | s231 | m | 1498 | 2.32 |
| 4 | s236 | m | 1350 | 2.13 |
| 4 | s240 | m |  950 | 1.57 |
| 4 | s250 | m | 1900 | 2.84 |
| 4 | s273 | m | 1600 | 2.29 |


## Text and audio

The audio was recorded in a soundproof radio studio at RÚV. The speaker was alone in the room and read prompts off a screen controlled by a director in another room.
The text for the reading list was chosen to provide good coverage of the diphones that occur in Icelandic.
The text is obtained from [The Icelandic Gigaword Corpus](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/15), with two exceptions: a list of sentences containing rare triphones and a set of programmatically generated sentences containing digits and phone numbers were added to the script.


## License

This corpus is published under the CC BY 4.0 public licence.


## Authors

Reykjavik University
Þorsteinn Daði Gunnarsson <thorsteinng@ru.is>
Gunnar Thor Örnólfsson <gunnaro@ru.is>
Ragnheiður Þórhallsdóttir <ragnheidurth@ru.is>
Atli Þór Sigurgeirsson <atlithors@ru.is>
Jón Guðnason <jg@ru.is>


### Acknowledgements

This project was funded by The Icelandic Language Technology Program 2019-2023. The program, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.


## Versions

### V2

 - Normalized text added for all utterances.