dc.contributor.author Friðriksdóttir, Steinunn Rut
dc.contributor.author Ingason, Anton Karl
dc.date.accessioned 2020-02-05T14:24:37Z
dc.date.available 2020-02-05T14:24:37Z
dc.date.issued 2020-01-28
dc.identifier.uri http://hdl.handle.net/20.500.12537/13
dc.description The Icelandic Confusion Set Corpus (ICoSC) is available under a CC-BY licence. It was compiled over three months in 2019 by Steinunn Rut Friðriksdóttir and Anton Karl Ingason of the language technology department at the University of Iceland.

Included in the ICoSC are CSV spreadsheets containing all collected confusion sets of each category and their frequencies. The spreadsheets are organized so that for each set, the total frequency of each candidate is given along with the frequency of each possible PoS tag for that candidate. The seventh and eighth columns of the tables contain binary values indicating whether the confusion set is grammatically disjoint (all PoS tags differ for the two candidates) or grammatically identical (all PoS tags are identical for the two candidates). The final column shows the frequency of the less frequent candidate of the set, which can be used to determine which sets are viable in an experiment.

Also included are text files containing the list of words from each category and text files containing all sentence examples from the Icelandic Gigaword Corpus (IGC) that contain the words for each category. As the n/nn examples are by far the most frequent confusion sets, the corpus also includes a word list and sentence examples for the 55 most frequent sets. All files have UTF-8 encoding.

The ICoSC consists of the following categories of confusion sets, selected for their linguistic properties as homophones, separated orthographically by a single letter. The categories are:
- 196 pairs containing y/i (leyti ‘extent’ / leiti ‘search’).
- 150 pairs containing ý/í (sýn ‘vision’ / sín ‘theirs (possessive reflexive)’).
- 1203 pairs containing nn/n (forvitinn ‘curious (masc.)’ / forvitin ‘curious (fem.)’).
- 24 pairs containing hv/kv (hvað ‘what’ / kvað ‘chanted’).
- 42 pairs containing rð/ðr (verði ‘will become’ / veðri ‘weather (dative)’).
- 110 pairs containing rr/r (klárri ‘smart (indef. fem. dative)’ / klári ‘smart (def. masc. nominative)’).
- 8 pairs commonly confused by Icelandic speakers, e.g. mig/mér (mig ‘me (accusative)’ / mér ‘me (dative)’).
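The description of the frequency tables suggests a simple way to pick confusion sets for an experiment: keep only the sets whose final column (the frequency of the less frequent candidate) reaches some threshold, optionally restricted to grammatically disjoint sets. The Python sketch below illustrates this under several assumptions that the record does not confirm: that the tables are comma-delimited, that the binary flags are written as `1`/`0` (or `true`/`false`), and that any header row can be recognised by a non-numeric final column. The function names `load_rows` and `viable_sets` are hypothetical.

```python
import csv

# Column positions taken from the description (0-based indices).
DISJOINT_COL = 6    # seventh column: grammatically disjoint flag
IDENTICAL_COL = 7   # eighth column: grammatically identical flag

# Assumption: how a "true" binary value is written in the tables.
TRUE_VALUES = {"1", "true"}


def load_rows(path, delimiter=","):
    """Read a frequency table into a list of rows (delimiter is an assumption)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


def viable_sets(rows, min_freq, disjoint_only=False):
    """Keep rows whose final column (rarer candidate's frequency) >= min_freq."""
    selected = []
    for row in rows:
        if len(row) <= IDENTICAL_COL:
            continue  # too short to be a data row
        try:
            rarer_freq = int(row[-1])
        except ValueError:
            continue  # header row or non-numeric value
        if rarer_freq < min_freq:
            continue
        if disjoint_only and row[DISJOINT_COL].strip().lower() not in TRUE_VALUES:
            continue
        selected.append(row)
    return selected
```

With the archive unpacked, something like `viable_sets(load_rows("ICoSC/Frequency tables/y_freq.csv"), min_freq=100)` would return the y/i sets whose rarer candidate occurs at least 100 times, provided the assumptions above hold for the actual files.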
dc.language.iso isl
dc.publisher Háskóli Íslands
dc.relation.isreferencedby https://www.insticc.org/Primoris/Resources/PaperPdf.ashx?idPaper=93715
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/19
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/steinunnfridriks/ICoSC
dc.subject homophones
dc.subject confusion sets
dc.subject context dependency
dc.subject rich morphology
dc.subject disambiguation
dc.title The Icelandic Confusion Set Corpus (ICoSC) 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Steinunn Rut Friðriksdóttir srf2@hi.is Háskóli Íslands
files.size 225339230
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: ICoSC.zip
Size: 214.9 MB
Format: application/zip
Description: zip folder containing word lists, sentence examples and frequency tables
MD5: c52924f781f87beccd60ecb518069cd7
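The MD5 value listed for ICoSC.zip can be used to check that a download arrived intact. A minimal Python sketch follows; the function names `md5_of_file` and `matches_record` are hypothetical, and the local file path is whatever the archive was saved as.

```python
import hashlib

# MD5 checksum as listed in this record for ICoSC.zip.
EXPECTED_MD5 = "c52924f781f87beccd60ecb518069cd7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 equals the checksum listed above."""
    return md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use low for an archive of this size. Calling `matches_record("ICoSC.zip")` on the downloaded file should return `True` if the file is identical to the one described here.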
 File Preview  
  • ICoSC
    • Frequency tables
      • rr_freq.csv (11 kB)
      • y_freq.csv (17 kB)
      • various_freq.csv (1 kB)
      • nn_freq.csv (101 kB)
      • y_elongated_freq.csv (13 kB)
      • rddr_freq.csv (4 kB)
      • nn_mostfreq_freq.csv (7 kB)
      • hvkv_freq.csv (2 kB)
    • Wordlists
      • y_wordlist.txt (3 kB)
      • y_elongated_wordlist.txt (2 kB)
      • hvkv_wordlist.txt (464 B)
      • rr_wordlist.txt (2 kB)
      • rddr_wordlist.txt (890 B)
      • nn_mostfreq_wordlist.txt (836 B)
      • various_wordlist.txt (99 B)
      • nn_wordlist.txt (27 kB)
    • README.md (4 kB)
    • Most common words - frequency tables
      • Grammatically identical.csv (652 B)
      • NeitherGInorGD.csv (4 kB)
      • Grammaticaly disjoint.csv (10 kB)
    • Sentence examples
      • y_sent.txt (107 MB)
      • nn_mostfreq_sent.txt (147 MB)
      • hvkv_sent.txt (16 MB)
      • various_sent.txt (87 MB)
      • ylong_sent.txt (96 MB)
      • nn_sent.txt (389 MB)
      • rddr_sent.txt (42 MB)
      • rr_sent.txt (69 MB)
