Alexia: Lexicon Acquisition Tool for Icelandic (Orðtökutól) 1.0

Friðriksdóttir, Steinunn Rut; Jasonarson, Atli

dc.contributor.author	Friðriksdóttir, Steinunn Rut
dc.contributor.author	Jasonarson, Atli
dc.date.accessioned	2020-09-30T16:06:30Z
dc.date.available	2020-09-30T16:06:30Z
dc.date.issued	2020-09-30
dc.identifier.uri	http://hdl.handle.net/20.500.12537/79
dc.description	The purpose of the lexicon acquisition tool is to facilitate the development and expansion of online dictionaries and glossaries, particularly the Database of Modern Icelandic Inflection (DMII/BÍN) and ISLEX. The tool is designed around the Icelandic Gigaword Corpus (IGC) and the information contained in its TEI-formatted documents; it performs best when it can use the part-of-speech tags, lemmas and word forms defined in the IGC. The tool can, however, take any corpus as input, provided it uses either the same TEI format as the IGC or a plain text file format, depending on the user's preference. The output files, examples of which are included, are the following: (1) Frequency per word form with no extra information added. Useful for picking candidates for the online dictionaries and glossaries. (2) Frequency per lemma with no extra information added. Useful for picking candidates for the online dictionaries and glossaries. (3) Frequency per word form, including all possible lemmas for each word form. Shows whether a word form can belong to more than one word class and whether the automatic lemmatization is working correctly. (4) Frequency per lemma, including all possible word forms for each lemma. Shows whether a certain word form appears much more or less frequently than the others, and thus whether that word form is only used as part of a certain expression. (5) Frequency per lemma, including the types of text in which the lemma appears. The frequency for each individual text type can also be examined in descending order. Facilitates the creation of a specialized glossary (e.g. a glossary of sport-related words). Also included is a list of approximately 60 thousand stop words, manually chosen from the IGC. These include foreign words, typos, misspelled words, lemmatization errors and acronyms.
The purpose of the lexicon acquisition tool is to simplify the development and construction of online dictionaries and glossaries, in particular the Database of Modern Icelandic Inflection (BÍN) and the dictionary of contemporary Icelandic, Nútímamálsorðabókin (ISLEX). The tool is built largely around the Icelandic Gigaword Corpus (Risamálheildin, RMH) and the information defined in the TEI format it uses, chiefly the part-of-speech tags, lemmas and word forms found there. The tool can, however, be used with any corpus, provided it is either in the same TEI format as the RMH or in plain txt format. Examples of the tool's output files can be found in the accompanying folder. They are the following: (1) Frequency lists containing lemmas and their frequency in the input corpus. These can be used to decide which words are candidates for addition to dictionaries and glossaries. (2) Frequency lists containing word forms and their frequency in the input corpus. These can be used to decide which words are candidates for addition to dictionaries and glossaries. (3) Frequency lists containing lemmas and their frequency in the input corpus, together with all attested word forms of each lemma. Useful for checking whether a particular word form is much more common than the others, and thus whether the word belongs mainly to a particular expression. (4) Frequency lists containing word forms and their frequency in the input corpus, together with all attested lemmas of each word form. Shows whether a particular word form can belong to more than one word class and whether the automatic lemmatization returns correct results. (5) Frequency lists containing lemmas and their frequency in the input corpus, along with the frequency of each lemma within a particular type of text (e.g. news, mathematics or football). Can be used when building terminology glossaries. Also included is a list of about 60 thousand stop words, collected manually from the RMH. These are foreign words, spelling and typing errors, lemmatization errors and abbreviations.
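The frequency lists described above can be illustrated with a minimal Python sketch. It is not taken from the Alexia source code: the function names `tokens_from_tei` and `build_frequency_lists`, the sample document, and the stop-word set are hypothetical, and the sketch assumes that each token is a TEI `<w>` element carrying a `lemma` attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace


def tokens_from_tei(xml_text, text_type):
    """Yield (word_form, lemma, text_type) for every <w> element.

    Assumption: tokens are <w> elements with a `lemma` attribute;
    `text_type` is supplied by the caller (e.g. from the folder name).
    """
    root = ET.fromstring(xml_text)
    for w in root.iter(TEI_NS + "w"):
        lemma = w.get("lemma")
        if w.text and lemma:
            yield w.text, lemma, text_type


def build_frequency_lists(tokens, stop_words=frozenset()):
    """Build the five kinds of frequency list, skipping stop words."""
    wordform_freq = Counter()
    lemma_freq = Counter()
    lemmas_per_wordform = defaultdict(Counter)
    wordforms_per_lemma = defaultdict(Counter)
    texttypes_per_lemma = defaultdict(Counter)

    for word_form, lemma, text_type in tokens:
        if word_form in stop_words or lemma in stop_words:
            continue  # filtered out, like entries in the stop-word list
        wordform_freq[word_form] += 1
        lemma_freq[lemma] += 1
        lemmas_per_wordform[word_form][lemma] += 1
        wordforms_per_lemma[lemma][word_form] += 1
        texttypes_per_lemma[lemma][text_type] += 1

    return {
        "wordform": wordform_freq,
        "lemma": lemma_freq,
        "lemmas_per_wordform": lemmas_per_wordform,
        "wordforms_per_lemma": wordforms_per_lemma,
        "texttypes_per_lemma": texttypes_per_lemma,
    }


# Hypothetical miniature document, for illustration only.
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p><s>'
    '<w lemma="hestur">hestur</w>'
    '<w lemma="hestur">hesti</w>'
    '<w lemma="á">á</w>'
    '<w lemma="eiga">á</w>'
    '<w lemma="the">the</w>'
    "</s></p></body></text></TEI>"
)

lists = build_frequency_lists(
    tokens_from_tei(SAMPLE_TEI, "news"), stop_words={"the"}
)

# Print lemmas in descending order of frequency with their word forms.
for lemma, count in lists["lemma"].most_common():
    print(lemma, count, dict(lists["wordforms_per_lemma"][lemma]))
```

In this sketch the word form "á" maps to two lemmas, which is the kind of ambiguity the third output list is meant to expose, and the foreign word "the" is dropped by the stop-word filter.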
dc.language.iso	isl
dc.publisher	The Árni Magnússon Institute for Icelandic Studies
dc.relation.isreplacedby	http://hdl.handle.net/20.500.12537/95
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/license/apache2-0-php/
dc.rights.label	PUB
dc.source.uri	https://github.com/steinunnfridriks/ALEXIA_ordtokutol
dc.subject	lexicon acquisition
dc.subject	lexicon acquisition tool
dc.subject	corpus harvesting
dc.subject	gigaword corpus
dc.subject	dmii
dc.subject	islex dictionary
dc.title	Alexia: Lexicon Acquisition Tool for Icelandic (Orðtökutól) 1.0
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	Clarin IS Repository
contact.person	Steinunn Rut Friðriksdóttir srf2@hi.is The Árni Magnússon Institute for Icelandic Studies
contact.person	Atli Jasonarson atlijas@simnet.is The Árni Magnússon Institute for Icelandic Studies
sponsor	Ministry of Education, Science and Culture Lexicon Acquisition Tool (I2) Language Technology for Icelandic 2019-2023 nationalFunds
files.size	297878
files.count	1

Files in this item

This item is Publicly Available and licensed under: Apache License 2.0

Name: ordtokutol_clarin.zip
Size: 290.9 KB
Format: application/zip
Description: zip containing tool
MD5: 29ca899e7d14cd0e218e4c4fc8ed797b
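
The MD5 value above can be used to check that the archive was downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the file path is assumed to point at a local copy of `ordtokutol_clarin.zip`.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "29ca899e7d14cd0e218e4c4fc8ed797b"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("ordtokutol_clarin.zip")` returns `False` when the local file differs from the published one, in which case the download should be repeated.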

File Preview

ordtokutol_clarin
- all_filters.txt-1 B
- setup.py-1 B
- README.md-1 B
- add_word_to_filter_database.py-1 B
- databases
  - temp.txt-1 B
- requirements.txt-1 B
- run.py-1 B
- ordtaka
  - lemmabase_wordforms.py-1 B
  - rmh_extractor.py-1 B
  - __pycache__
    - txt_to_data.cpython-38.pyc-1 B
    - prepare_data.cpython-38.pyc-1 B
    - request_file.cpython-38.pyc-1 B
    - rmh_extractor.cpython-38.pyc-1 B
    - find_texttype_freqs.cpython-38.pyc-1 B
    - lemmabase_wordforms.cpython-38.pyc-1 B
    - compare_rmh_islex.cpython-38.pyc-1 B
    - base_output.cpython-38.pyc-1 B
    - compare_rmh_bin.cpython-38.pyc-1 B
  - prepare_data.py-1 B
  - find_texttype_freqs.py-1 B
  - request_file.py-1 B
  - txt_to_data.py-1 B
  - compare_rmh_islex.py-1 B
  - base_output.py-1 B
  - compare_rmh_bin.py-1 B
  - sql
    - corpus_to_sql.py-1 B
    - sql_lookup.py-1 B
    - __pycache__
      - corpus_to_sql.cpython-38.pyc-1 B
      - sql_lookup.cpython-38.pyc-1 B
    - word_to_db.py-1 B
- corpora
  - teimalheild
    - prufuskjal.xml-1 B
    - undirmappa
      - prufuskjal.xml-1 B
  - RMH
    - MIM
      - prufuskjal.xml-1 B
      - undirmappa
        - prufuskjal.xml-1 B
    - CC_BY
      - prufuskjal.xml-1 B
      - undirmappa
        - prufuskjal.xml-1 B
  - txtmalheild
    - undirmappa
      - txt_example.txt-1 B
    - txt_example.txt-1 B
- uttaksskjol
  - islex
    - RMH_texttypes_ISLEX.csv-1 B
    - .gitkeep-1 B
    - teimalheild_texttypes_ISLEX.csv-1 B
    - teimalheild_wordform_ISLEX.freq-1 B
    - txtcorpus_ISLEX.csv-1 B
    - RMH_wordform_ISLEX.freq-1 B
  - bin
    - txtcorpus_BIN.csv-1 B
    - RMH_texttypes_BIN.csv-1 B
    - RMH_lemmabase_bin.freq-1 B
    - teimalheild_texttypes_BIN.csv-1 B
    - RMH_lemma_BIN.csv-1 B
    - RMH_wordform_BIN.freq-1 B
    - RMH_wf_BIN.csv-1 B
    - teimalheild_lemmabase_bin.freq-1 B
    - .gitkeep-1 B
    - teimalheild_lemma_BIN.csv-1 B
