#############################################################################
#########            Althingi Parliamentary Speech Corpus           #########
#########            http://hdl.handle.net/20.500.12537/277         #########
#############################################################################

THE FILES

The corpus contains two packages, delivered in two compressed files:

A) althingi_texts.tar.gz:

This archive contains the training, evaluation and development sets and two 
language models: a pruned trigram model, used in decoding, and an unpruned 
constant ARPA 5-gram model, used for rescoring decoding results.

The three sets are located in the folders train, dev and eval. Each folder 
contains five files:

- segments: links each text segment to its place in the audio files 
- spk2gender: lists all the speakers and their gender 
- spk2utt: lists all the speakers and their utterances/segments
- text: lists the ID of each segment and its text
- utt2spk: maps each segment to its speaker
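
As an illustration of how these five files relate to each other, the Python 
sketch below parses three of them and rebuilds the speaker-to-segment view 
that spk2utt stores. It assumes the files follow the usual Kaldi 
data-directory layout (whitespace-separated fields, with segments holding 
segment ID, recording ID, start time and end time in seconds), which this 
README does not spell out. The function names and the sample lines are 
hypothetical.

```python
from collections import defaultdict


def parse_segments(lines):
    """Map segment ID -> (recording ID, start, end); Kaldi layout assumed."""
    table = {}
    for line in lines:
        if not line.strip():
            continue
        seg_id, rec_id, start, end = line.split()
        table[seg_id] = (rec_id, float(start), float(end))
    return table


def parse_text(lines):
    """Map segment ID -> transcript (everything after the first field)."""
    table = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if parts:
            table[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return table


def parse_utt2spk(lines):
    """Map segment ID -> speaker ID."""
    return dict(line.split() for line in lines if line.strip())


def invert_to_spk2utt(utt2spk):
    """Group segment IDs by speaker, as spk2utt does."""
    spk2utt = defaultdict(list)
    for seg_id, spk in sorted(utt2spk.items()):
        spk2utt[spk].append(seg_id)
    return dict(spk2utt)


# Made-up sample lines, not taken from the corpus.
segments = parse_segments(["spkA-rad001_00001 rad001 0.00 9.80"])
text = parse_text(["spkA-rad001_00001 virðulegi forseti"])
utt2spk = parse_utt2spk(["spkA-rad001_00001 spkA"])
```

Each parser accepts any iterable of lines, so an open file handle (for 
example one opened on train/segments with encoding="utf-8") can be passed in 
place of the sample lists.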

The two models are located in the folders lang_3gsmall and lang_5glarge.

The file metadata.csv is a comma-separated file listing, for each speech, the 
speaker, the timestamp of the beginning of the speech, the timestamp of the 
end of the speech, the number of the parliamentary session, the status of 
the speaker and the ID of the file.
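
A minimal Python sketch for loading this file is given below. The column 
order is taken from the description above, but the field names, the 
timestamp format in the sample row and the absence of a header row are all 
assumptions; check them against the actual file before relying on this.

```python
import csv
import io

# Assumed column order, following the description in this README. The real
# file may differ (for example, it may begin with a header row).
FIELDS = ["speaker", "speech_start", "speech_end",
          "session", "status", "file_id"]


def read_metadata(handle):
    """Return one dict per speech, keyed by the assumed field names."""
    return list(csv.DictReader(handle, fieldnames=FIELDS))


# Made-up row for illustration only; the timestamp format is a guess.
sample = io.StringIO(
    "Speaker Name,2016-01-19T13:31:05,2016-01-19T13:33:12,145,member,rad001\n"
)
rows = read_metadata(sample)
```

All values are kept as strings, so nothing is lost if the real timestamps 
or session numbers use a different format than the sample row.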

The file name_id_gender.tsv contains the name, abbreviation and gender of 
each speaker.

The file pron_dict.txt contains the pronunciation dictionary. It is based on 
an edited version of Hjal’s pronunciation dictionary 
(http://hdl.handle.net/20.500.12537/198), plus common words from the Althingi 
texts and from Málrómur. It currently contains ~181,000 words.
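
The sketch below shows one way to load such a dictionary in Python. It 
assumes a Kaldi-style lexicon layout, with each line holding a word followed 
by its phonemes, separated by whitespace; the README does not state the 
layout, so treat this as a guess. The function name is hypothetical and the 
phoneme symbols in the sample are placeholders, not the corpus's phoneme set.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map word -> list of pronunciations, each a list of phoneme symbols."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = fields[0], fields[1:]
        lexicon[word].append(phonemes)  # a word may have several variants
    return dict(lexicon)


# Made-up entries for illustration only.
lexicon = parse_lexicon(["þing T i N k", "forseti f O r s E t I"])
```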

B) althingi_upptokur.zip:

The folder 'audio' contains the recordings of all the speeches.
The folder 'text_bb' contains the texts of all the speeches in XML format.
The folder 'text_endanlegt' contains the texts after cleaning.
The file metadata.csv contains the same information as the file metadata.csv 
in althingi_texts.tar.gz.

Due to the size of this package, it was split into four parts during 
compression. To extract it, first use the zip command to combine the split 
zip files into a single zip archive:

 zip -F althingi_upptokur.zip --out althingi_upptokur.single-archive.zip

Then use unzip to extract the combined archive:

 unzip althingi_upptokur.single-archive.zip
 

ABOUT THE ALTHINGI PARLIAMENTARY SPEECH CORPUS

This is an aligned and segmented corpus of 6493 Althingi recordings with 196 
speakers. The recordings consist of 199,614 segments, with an average 
duration of 9.8 s. A file called segments links each text segment to its 
place in the audio files. The total duration of the data set is 542 hours and 
25 minutes, and it contains 4,583,751 word tokens. The corpus is split into a 
training set, a development set and an evaluation set. The training set 
contains speeches from 2005 to 2015, with a total duration of 514.5 hours. 
The speeches from 2016 were split evenly between the development and 
evaluation sets, with 14 hours in each. The evaluation set is cleaner than 
the development set, and both are cleaner than the training set.
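
The figures above are rounded, so they agree with each other only 
approximately. The short Python check below makes this explicit; it uses 
only the numbers quoted in this README, and the variable names are 
illustrative.

```python
segments = 199_614
mean_duration_s = 9.8            # rounded to one decimal in this README
total_s = (542 * 60 + 25) * 60   # 542 h 25 min, in seconds

# Total implied by the rounded mean: about 543.4 h, slightly above the
# stated total because 9.8 s is itself a rounded value.
implied_hours = segments * mean_duration_s / 3600

# Mean implied by the stated total: about 9.78 s, which rounds to 9.8 s.
implied_mean_s = total_s / segments

# Sum of the three subsets as quoted (514.5 h + 14 h + 14 h = 542.5 h),
# about five minutes more than the stated total.
subset_hours = 514.5 + 14 + 14

print(f"{implied_hours:.1f} h, {implied_mean_s:.2f} s, {subset_hours} h")
```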

The pronunciation dictionary is based on an edited version of Hjal’s 
pronunciation dictionary (E. Rögnvaldsson, 2003), which is available at 
Málföng, plus common words from the Althingi texts and from Málrómur (J. 
Guðnason et al., 2012). It currently contains ~181,000 words. Sequitur’s 
grapheme-to-phoneme converter (M. Bisani et al., 2008), trained on the edited 
pronunciation dictionary from Hjal plus the Málrómur data, was used to 
generate the pronunciations of the new words from the Althingi data.

The language models were built using transcripts of Althingi speeches dating 
back to 2003, excluding speeches from 2016. One is a pruned trigram model, 
used in decoding. The other is an unpruned constant ARPA 5-gram model, used 
for rescoring decoding results.

Using this data, the pronunciation dictionary and the language models, an 
automatic speech recognizer with a 10.23% word error rate has been developed. 
This error rate was obtained using an acoustic model based on a lattice-free 
maximum mutual information neural network architecture with both time-delay 
and long short-term memory layers. Training is based on the Switchboard 
recipe in the Kaldi toolkit (D. Povey et al., 2011) 
(https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd). Our training 
recipe from start to finish will be made public soon.
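
For readers unfamiliar with the metric, the Python sketch below computes 
word error rate in its standard form: the word-level edit distance between 
a reference transcript and a recognizer hypothesis, divided by the number of 
reference words. This is a generic illustration, not the scoring procedure 
used to obtain the figure above (which this README does not describe); the 
function name and example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
wer = word_error_rate("a b c d", "a x c")
```

On a whole test set, the edit counts are normally summed over all segments 
and divided by the total number of reference words, rather than averaging 
per-segment rates.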

-----------

When publishing results based on the texts in the corpus, please refer to:

Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón 
Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary 
Speeches. Proceedings of Interspeech 2017.

____________________________________________________________

[ICELANDIC]

THE FILES:

See the section THE FILES above.

ABOUT THE ALTHINGI DATA

The data consist of 6493 Althingi speeches by 196 speakers. They are aligned 
and split into segments of a suitable size for training. The average segment 
duration is 9.8 s. A file called segments links each text segment to its 
place in the audio files. The total duration of the audio is 542 hours and 25 
minutes, and the text contains just under 4.6 million words.

The corpus is split into a training set and two test sets, “dev” and “eval”. 
The training set is 514.5 hours long and contains data from 2005 to 2015. 
The speech segments from 2016 were split evenly between the two test sets, 
with 14 hours in each. The “eval” set is cleaner than the “dev” set, and both 
are cleaner than the training data.

The pronunciation dictionary is an improved version of the pronunciation 
dictionary of the Hjal project (E. Rögnvaldsson, 2003), which is available 
here at Málföng, with common words added from Althingi speeches from 2003 to 
2015 and from the Málrómur corpus (J. Guðnason et al., 2012). It contains 
about 181,000 words. The pronunciations of the new words were obtained with a 
Sequitur G2P model (M. Bisani et al., 2008) that had been trained on the 
Hjal pronunciation dictionary plus the Málrómur data.

Althingi texts from 2003 to 2015 were used to build the language models. One 
is a small 3-gram model, used in decoding. The other is a large 5-gram model 
in “constant-arpa” format, used to rescore the initial decoding.

These data, the pronunciation dictionary and the language models were used to 
train a speech recognizer with a 10.23% word error rate. The acoustic model 
was based on a combination of time-delay deep neural networks (TD-DNN) and 
long short-term memory deep neural networks (LSTM-DNN). The Switchboard 
recipe in the Kaldi toolkit (D. Povey et al., 2011) 
(https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd) was used to train 
the acoustic model. The Althingi recipe will be made publicly available soon.

------

When publishing results obtained using the data, please cite:

Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón 
Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary 
Speeches. Proceedings of Interspeech 2017.