# Talrómur

Talrómur is an openly licensed speech corpus for text-to-speech research and development.
The corpus consists of 122,417 short audio clips of eight different speakers reading short sentences.
The audio was recorded in 2020 by Reykjavík University and The Icelandic National Broadcasting Service (RÚV) as part of The Icelandic Language Technology Program.

The corpus is divided into eight smaller sets, one for each voice. Each set shares the same structure and format.

## Format

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
Transcriptions and file names are provided in `index.tsv`. The file contains one record per line, with the following tab-delimited fields:

 - **ID:** the name of the corresponding audio file (without extension)
 - **Transcription:** the sentence spoken in the recording
 - **Transcription ID:** unique ID for the utterance (many utterances are shared by multiple voices)
 - **Normalized Transcription:** the transcription with non-standard words, such as numbers and abbreviations, spelled out
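
The format described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes that `index.tsv` is UTF-8 encoded, has no header row, and has exactly four fields per record. The names `Utterance`, `read_index`, and `has_expected_format` are hypothetical helpers introduced for illustration.

```python
import csv
import wave
from typing import NamedTuple


class Utterance(NamedTuple):
    clip_id: str          # audio file name without extension
    text: str             # sentence as spoken in the recording
    text_id: str          # utterance ID, may be shared across voices
    normalized_text: str  # numbers and abbreviations spelled out


def read_index(index_path):
    """Read a tab-delimited index file into a list of Utterance records.

    Assumes UTF-8 text, no header row, and exactly four fields per line.
    """
    with open(index_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [Utterance(*row) for row in reader if row]


def has_expected_format(wav_path):
    """Return True if the file is single-channel, 16-bit, 22050 Hz PCM."""
    with wave.open(str(wav_path), "rb") as w:
        # getsampwidth() is in bytes, so 2 corresponds to 16-bit samples.
        return (w.getnchannels(), w.getsampwidth(), w.getframerate()) == (1, 2, 22050)
```

Where the audio files sit relative to `index.tsv` is not specified here, so building the path from `clip_id` is left to the caller.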

## Voices

The corpus includes eight different voices, four male and four female. 

| ID | Name | Age | Dialect | # of recordings | Total duration | Size | 
| -- | ---- | --- | ------- | --------------: | ------------: | ---: | 
| A | Rósa (f) | 60 | Linmæli | 9,899 | 16h32m12s | 2.6GB | 
| B | Bjartur (m) | 70 | Linmæli | 12,048 | 25h43m05s | 3.9GB | 
| C | Diljá (f) | 71 | Linmæli | 13,691 | 27h57m33s | 4.3GB | 
| D | Búi (m) | 50 | Linmæli | 12,357 | 22h32m58s | 3.5GB | 
| E | Ugla (f) | 26 | Linmæli | 20,050 | 31h28m04s | 4.8GB | 
| F | Álfur (m) | 35 | Linmæli | 19,849 | 29h07m18s | 4.5GB | 
| G | Salka (f) | 33 | Harðmæli | 16,886 | 30h09m38s | 4.6GB | 
| H | Steinn (m) | 39 | Harðmæli | 17,637 | 29h49m01s | 4.6GB | 
| **Total** | | | | **122,417** | **213h19m49s** | **32.8GB** | 

## Statistics

| ID | Recordings | Words | Characters | Distinct words | Min. duration | Max. duration | Mean duration |
| -- | ---------: | ----: | ---------: | -------------: | --: | --: | ---: | 
| A |  9,899 |  93,002 |  556,767 | 19,272 | 1.30s | 14.38s | 6.01s |
| B | 12,048 | 118,564 |  713,578 | 22,617 | 2.22s | 18.68s | 7.68s |
| C | 13,443 | 139,636 |  843,530 | 25,492 | 2.71s | 17.76s | 7.48s |
| D | 12,357 | 126,814 |  766,037 | 23,857 | 0.91s | 15.97s | 6.57s |
| E | 20,050 | 215,176 | 1,298,318 | 33,629 | 1.86s | 14.46s | 5.65s |
| F | 19,849 | 212,979 | 1,284,508 | 33,401 | 1.78s | 12.96s | 5.28s |
| G | 16,886 | 178,818 | 1,078,978 | 29,966 | 2.26s | 14.82s | 6.43s |
| H | 17,637 | 187,868 | 1,134,244 | 30,977 | 1.44s | 14.57s | 6.09s |


| ID | Words per min.\* | Chars per min.\* | 
| -- | -------------: | -------------: | 
| A | 140 | 822 |
| B | 104 | 612 |
| C | 114 | 672 |
| D | 135 | 797 |
| E | 176 | 1,043 |
| F | 196 | 1,160 |
| G | 143 | 848 |
| H | 156 | 926 |

\* approximate calculations assuming 1 second padding on each audio file
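
As a worked example, the words-per-minute column can be approximately reproduced from the tables above. The sketch below assumes the padding amounts to one second at each end of a clip (two seconds per file). That interpretation is not stated explicitly, but it matches the listed rates for voices A and B. The characters-per-minute figures come out a few percent higher under the same assumption, so the character count used for that column may differ. The function names are hypothetical.

```python
def parse_duration(text):
    """Convert a duration string such as '16h32m12s' to seconds."""
    hours, rest = text.split("h")
    minutes, rest = rest.split("m")
    seconds = rest.rstrip("s")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def words_per_minute(words, recordings, total_duration, padding_per_clip=2.0):
    """Speaking rate after subtracting the assumed padding from every clip."""
    speech_seconds = parse_duration(total_duration) - padding_per_clip * recordings
    return words / (speech_seconds / 60)


# Word counts, recording counts, and durations taken from the tables above.
print(round(words_per_minute(93_002, 9_899, "16h32m12s")))    # voice A -> 140
print(round(words_per_minute(118_564, 12_048, "25h43m05s")))  # voice B -> 104
```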

## Text and audio

The audio was recorded in a soundproof radio studio at RÚV. The speaker was alone in the room and read prompts off a screen controlled by a director in another room.
The text for the reading list was chosen to give good coverage of the diphones that occur in Icelandic.
The text was obtained from [The Icelandic Gigaword Corpus](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/15), with two exceptions: a list of sentences containing rare triphones and a set of programmatically generated sentences containing digits and phone numbers were added to the script.

## License

This corpus is published under the CC BY 4.0 public licence.

## Authors

Reykjavík University

 - Atli Þór Sigurgeirsson <atlithors@ru.is>
 - Þorsteinn Daði Gunnarsson <thorsteinng@ru.is>
 - Gunnar Thor Örnólfsson <gunnaro@ru.is>
 - Ragnheiður Þórhallsdóttir <ragnheidurth@ru.is>
 - Eydís Huld Magnúsdóttir <eydishm@ru.is>
 - Jón Guðnason <jg@ru.is>

### Acknowledgements

This project was funded by The Icelandic Language Technology Program 2019-2023. The program, which is managed and coordinated by [Almannarómur](https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.