dc.contributor.author Daðason, Jón Friðrik
dc.contributor.author Loftsson, Hrafn
dc.contributor.author Sigurðardóttir, Salome Lilja
dc.contributor.author Björnsson, Þorsteinn
dc.date.accessioned 2022-09-30T13:34:35Z
dc.date.available 2022-09-30T13:34:35Z
dc.date.issued 2021-01-11
dc.identifier.uri http://hdl.handle.net/20.500.12537/285
dc.description IceSum is a collection of 1,000 Icelandic news articles from mbl.is that have been manually annotated with summaries. The corpus contains local (50%), world (26%), business (14%) and sports (10%) news articles published between 1998 and 2019. The summaries are extractive and consist of sentences and sentence fragments from the original articles. Each article in the corpus is listed with a unique ID, title, original text, extractive summary, category, publication date and source URL. This version of IceSum includes a script for generating training data for the TransformerSum library with the same training, validation and test splits that were used in the original paper. Jón Friðrik Daðason, Salome Lilja Sigurðardóttir and Þorsteinn Björnsson contributed to this project under the supervision of Hrafn Loftsson.
dc.language.iso isl
dc.publisher Reykjavik University
dc.relation.isreferencedby https://aclanthology.org/2021.naacl-srw.2/
dc.relation.replaces http://hdl.handle.net/20.500.12537/96
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/cadia-lvl/icesum
dc.subject text corpus
dc.subject summarization
dc.subject news articles
dc.title IceSum - Icelandic Text Summarization Corpus (22.09)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Jón Friðrik Daðason jond19@ru.is Reykjavik University
contact.person Hrafn Loftsson hrafn@ru.is Reykjavik University
sponsor Strategic Research and Development Programme for Language Technology 180037-5301 Sjálfvirk samantekt íslensks texta ("Automatic Text Summarization for Icelandic") nationalFunds
size.info 1000 articles
files.size 784369
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name icesum.zip
Size 765.99 KB
Format application/zip
MD5 188e54da0a1347d3d086e8c128b7a1a1
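
The published MD5 value can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library; the local file name icesum.zip in the working directory is an assumption, and the helper name md5_of_file is illustrative rather than part of the corpus tooling.

```python
import hashlib
from pathlib import Path

# MD5 value as published in this record.
EXPECTED_MD5 = "188e54da0a1347d3d086e8c128b7a1a1"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the archive was saved as icesum.zip in the working directory.
archive_path = Path("icesum.zip")
if archive_path.exists():
    actual = md5_of_file(archive_path)
    if actual == EXPECTED_MD5:
        print("checksum OK")
    else:
        print(f"checksum mismatch: {actual}")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a guarantee against deliberate tampering.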
 File Preview  
  • icesum
    • readme.txt (4 kB)
    • requirements.txt (22 B)
    • process_transformersum.py (6 kB)
    • icesum.json (2 MB)
    • splits.json (15 kB)
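
Once the archive is extracted, the category shares stated in the description (local 50%, world 26%, business 14%, sports 10%) could be checked against icesum.json. The sketch below is not the project's own code and makes several assumptions that should be verified against readme.txt: the extracted path icesum/icesum.json, a top level that is either a list of records or a dict keyed by article ID, and a per-article field named "category". All function names are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_json(path):
    """Load a UTF-8 encoded JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_records(data):
    """Yield article records from a list, or from a dict keyed by article ID."""
    if isinstance(data, dict):
        yield from data.values()
    else:
        yield from data


def category_shares(records, key="category"):
    """Return {category: percentage} over records that are dicts containing `key`."""
    counts = Counter(r[key] for r in records if isinstance(r, dict) and key in r)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100 * n / total for name, n in counts.items()}


# Assumption: the archive was extracted into the working directory.
corpus_path = Path("icesum") / "icesum.json"
if corpus_path.exists():
    shares = category_shares(iter_records(load_json(corpus_path)))
    if not shares:
        # The assumed "category" field was not found; inspect the file manually.
        print("no 'category' field found; check readme.txt for the actual schema")
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name}: {share:.0f}%")
```

If the field name or layout differs from these assumptions, category_shares returns an empty dict rather than raising, which signals that the schema should be checked before going further.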
