
 
dc.contributor.author Friðriksdóttir, Steinunn Rut
dc.contributor.author Jasonarson, Atli
dc.date.accessioned 2021-08-12T13:25:35Z
dc.date.available 2021-08-12T13:25:35Z
dc.date.issued 2021-08-12
dc.identifier.uri http://hdl.handle.net/20.500.12537/124
dc.description The complete stop-word list contains 59,664 words or non-words that were handpicked from the Icelandic Gigaword Corpus (IGC, Risamálheildin). The sublists are as follows: - 6,576 abbreviations and similar short forms, including both shortenings such as Alþingisfrv (frumvarp) and A-Skaftafellssýsla (austur) and acronyms such as LHÍ (Listaháskóli Íslands). - 27,144 foreign words (especially proper names). - 588 function words (e.g. sér, hann, í, hvenær). - 147 last names (patronymics, some abbreviated) or company names (e.g. Friðleifsd, hannesson, Essó). - 978 mislemmatized words (e.g. guðspjallur, notönd, allsher). - 9,736 outdated words (e.g. íslenzkir, rjettur). - 12,473 typos and OCR errors (e.g. klukkka, komuþeir, skattakerfl). The list was compiled from the 2019 version of the IGC and should not be considered exhaustive.
dc.language.iso isl
dc.publisher The Árni Magnússon Institute for Icelandic Studies
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/license/apache2-0-php/
dc.rights.label PUB
dc.source.uri https://github.com/steinunnfridriks/rmh_filters
dc.subject stop-words
dc.subject word list
dc.subject filters
dc.title Stopporðalisti fyrir Risamálheildina / Stop-words for the Icelandic Gigaword Corpus (21.08)
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Steinunn Rut Friðriksdóttir srf2@hi.is The Árni Magnússon Institute for Icelandic Studies
files.size 1024789
files.count 1


 Files in this item

This item is publicly available and licensed under the Apache License 2.0.
Name rmh_filters.zip
Size 1000.77 KB
Format application/zip
Description zipped txt files
MD5 3b020d6f334f027440a325abacc17b1c
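
The MD5 value above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, assuming the archive was saved locally as rmh_filters.zip; the helper names md5_of_file and matches_record are hypothetical and not part of the resource.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record for rmh_filters.zip.
EXPECTED_MD5 = "3b020d6f334f027440a325abacc17b1c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("rmh_filters.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum mismatch")
    else:
        print("archive not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download rather than a problem with the word lists themselves.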
 File Preview  
  • rmh_filters
    • README.md
    • LICENSE
    • IGC_filters_all.txt
    • mislemmatized.txt
    • lastnames_companies.txt
    • other.txt
    • abbrevs.txt
    • function_words.txt
    • foreign.txt
    • typos_ocr.txt
    • outdated.txt
    • .git (Git repository metadata: config, refs, hooks, objects)
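
As an illustration of how the filter files might be applied, the sketch below loads one list into a set and removes matching tokens from a token sequence. It assumes that each txt file is UTF-8 encoded with one entry per line, which this record does not state; check the README.md in the archive before relying on it. The function names load_stop_words and filter_tokens are hypothetical.

```python
def load_stop_words(path, encoding="utf-8"):
    """Read a filter file into a set, ASSUMING one entry per line."""
    words = set()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            entry = line.strip()
            if entry:  # skip blank lines
                words.add(entry)
    return words


def filter_tokens(tokens, stop_words, ignore_case=False):
    """Return the tokens that do not appear in stop_words.

    Matching is case-sensitive by default, since the lists contain
    both lowercase entries (hannesson) and capitalised ones (Essó).
    """
    if ignore_case:
        lowered = {word.lower() for word in stop_words}
        return [tok for tok in tokens if tok.lower() not in lowered]
    return [tok for tok in tokens if tok not in stop_words]
```

For example, after extracting the archive, load_stop_words("rmh_filters/typos_ocr.txt") would give a set that can be passed to filter_tokens together with a tokenised sentence. Whether to match case-insensitively depends on how the corpus text was normalised.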
