dc.contributor.author |
Friðriksdóttir, Steinunn Rut |
dc.contributor.author |
Jasonarson, Atli |
dc.date.accessioned |
2021-08-12T13:25:35Z |
dc.date.available |
2021-08-12T13:25:35Z |
dc.date.issued |
2021-08-12 |
dc.identifier.uri |
http://hdl.handle.net/20.500.12537/124 |
dc.description |
The total list of stop words includes 59,664 words or non-words that were handpicked from the Icelandic Gigaword Corpus (IGC). The sublists are as follows:
- 6,576 abbreviations.
- 27,144 foreign words (especially proper names).
- 588 function words.
- 147 last names or company names.
- 978 mislemmatized words.
- 9,736 outdated words.
- 12,473 typos and OCR errors.
The list is compiled from the 2019 version of the IGC and should not be considered exhaustive. |
dc.description |
ICELANDIC (in English translation):
The full list contains 59,664 words or non-words that were handpicked from the Icelandic Gigaword Corpus (Risamálheild). The sublists are as follows:
- 6,576 shortenings, abbreviations and the like. Includes both shortenings such as Alþingisfrv (frumvarp, "bill") and A-Skaftafellssýsla (austur, "east") and abbreviations such as LHÍ (Listaháskóli Íslands).
- 27,144 foreign words (mainly proper names).
- 588 function words (e.g. sér, hann, í, hvenær...).
- 147 patronymics (some shortened) or company names (e.g. Friðleifsd, hannesson, Essó).
- 978 mislemmatized words (e.g. guðspjallur, notönd, allsher).
- 9,736 outdated words (e.g. íslenzkir, rjettur).
- 12,473 misspelled words and OCR errors (e.g. klukkka, komuþeir, skattakerfl).
The list is compiled from the 2019 version of the Icelandic Gigaword Corpus and should not be considered exhaustive. |
dc.language.iso |
isl |
dc.publisher |
The Árni Magnússon Institute for Icelandic Studies |
dc.rights |
Apache License 2.0 |
dc.rights.uri |
https://opensource.org/license/apache2-0-php/ |
dc.rights.label |
PUB |
dc.source.uri |
https://github.com/steinunnfridriks/rmh_filters |
dc.subject |
stop-words |
dc.subject |
word list |
dc.subject |
filters |
dc.title |
Stopporðalisti fyrir Risamálheildina / Stop-words for the Icelandic Gigaword Corpus (21.08) |
dc.type |
lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType |
wordList |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
Clarin IS Repository |
contact.person |
Steinunn Rut Friðriksdóttir srf2@hi.is The Árni Magnússon Institute for Icelandic Studies |
files.size |
1024789 |
files.count |
1 |
Files in this item
This item is Publicly Available and licensed under: Apache License 2.0
- Name: rmh_filters.zip
- Size: 1000.77 KB
- Format: application/zip
- Description: zipped txt files
- MD5: 3b020d6f334f027440a325abacc17b1c
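The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, not part of the record itself: the function names and the local file path are hypothetical, and only the expected digest comes from this page.

```python
import hashlib

# Digest copied from the MD5 field of this record.
EXPECTED_MD5 = "3b020d6f334f027440a325abacc17b1c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected digest."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved locally as `rmh_filters.zip` (a hypothetical path), `matches_record("rmh_filters.zip")` would report whether the download matches the listed checksum.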
Preview
- rmh_filters
  - README.md
  - IGC_filters_all.txt
  - mislemmatized.txt
  - lastnames_companies.txt
  - other.txt
  - abbrevs.txt
  - function_words.txt
  - foreign.txt
  - typos_ocr.txt
  - .git
    - logs
    - info
    - config
    - packed-refs
    - index
    - HEAD
    - refs
    - description
    - hooks
      - applypatch-msg.sample
      - pre-push.sample
      - commit-msg.sample
      - post-update.sample
      - pre-rebase.sample
      - pre-receive.sample
      - update.sample
      - pre-applypatch.sample
      - pre-commit.sample
      - pre-merge-commit.sample
      - fsmonitor-watchman.sample
      - prepare-commit-msg.sample
    - objects
      - 07
        - bd9e6357e7e34dfd4df3b787e62abec5b59392
      - b8
        - 9266c8f7b08ca7a072c29ca72241b63fe5be31
      - 0e
        - 749894ac2f90add66ee52201a5865031b3f65b
      - b6
        - 0a614e950e07d8d5066300ba22f797ae83cf5b
      - e5
        - e8f5a20a8e9f12ea4f23025da3c4e125defb08
      - 65
        - 844a95975e8d4d017913d559467a630ad520e0
      - b4
        - 0b4774d332f472f49f7b579e9d70ef11647255
      - e3
        - b1d8c33af1146b36d0cd48ba280334894631fd
      - 32
        - 3a72387050e6638c5c4fcfd7f0e53011fdb09f
      - e0
        - 8d81f25df70034dbe0b04c6df8392a8b10fb20
      - 19
        - fadb6dc170e93fe257f5ebba065395c57e789c
      - 15
        - 388d65afd9b10d45b29a2a9285f9109db78c64
      - f5
        - 597c39490a4c420a61ceca496873cc3bf0556c
      - pack
      - info
      - d7
        - a1a900303f7053da925f98d81ac4f7eec92dfb
      - a6
        - 0b9c85c72e83227f859df9a2e2efa82d6bcc15
      - 26
        - 1eeb9e9f8b2b4b0d119366dda99c6fd7d35c64
      - ad
        - 8453dd0e8552e3ec60980af6e103f5a941bdc8
      - 84
        - 4954303fa8cd07349b4394554b8ef2da40c2a5
    - branches
  - LICENSE
  - outdated.txt
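The word lists in the archive can be loaded into a set and used to filter tokens. The sketch below rests on assumptions the record does not confirm: that each txt file holds one entry per line in UTF-8, and that members sit under an `rmh_filters/` prefix as the listing above suggests. The function names are hypothetical. Matching is exact and case-sensitive, since the examples in the description mix lower-case and capitalized entries.

```python
import zipfile


def read_word_list(lines):
    """Build a set of entries from an iterable of lines, skipping blank lines.

    Assumes one entry per line (not confirmed by the record).
    """
    return {line.strip() for line in lines if line.strip()}


def load_list_from_zip(zip_path, member):
    """Read one word-list file from the archive without unpacking it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            # utf-8-sig decodes plain UTF-8 and drops a leading BOM if present.
            text = handle.read().decode("utf-8-sig")
    return read_word_list(text.splitlines())


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set (exact match)."""
    return [token for token in tokens if token not in stop_words]
```

Under those assumptions, `load_list_from_zip("rmh_filters.zip", "rmh_filters/function_words.txt")` would return the function-word sublist as a set, which can then be passed to `remove_stop_words` together with a tokenized sentence.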