Bilingual-Serbian-English-KE-Dataset
(Bilingual Serbian-English Dataset for Keyword Extraction Task)

Version: 1.0
Release date: November 25, 2016

*********************
**   DESCRIPTION   **
*********************
The Bilingual-Serbian-English-KE-Dataset comprises 50 abstracts from the scientific journal "Underground Mining Engineering", published by the University of Belgrade, Faculty of Mining and Geology. The abstracts were collected from issues of the journal published between 2004 and 2012. The journal published papers bilingually, in Serbian and in English. These papers are available online as aligned parallel text in the Biblisha digital library (http://jerteh.rs/biblisha/ListaDokumenata.aspx?JCID=2&lng=en).

For the research presented in the paper:

Beliga, S., Kitanović, O., Stanković, R., & Martinčić-Ipšić, S. (2018). Keyword Extraction from Parallel Abstracts of Scientific Publications. IKC 2017, Semantic Keyword-Based Search on Structured Data Sources, pages 44-55.

we used a collection of 50 bilingual documents with approximately 4,800 aligned sentences. Statistics for the English and Serbian parallel abstracts are presented in that paper. All documents are supplied with keywords annotated by human experts (the authors of the articles). The number of annotated keywords ranges from 3 to 18 in the Serbian texts and from 3 to 15 in the English texts (the average in both is 7). For details on the dataset construction and on the benchmark keyword extraction results obtained with the fully unsupervised, graph-based Selectivity-Based Keyword Extraction Method (SBKE), please refer to the paper.

The dataset contains parallel texts written in English and in Serbian, stored in separate folders ("Serbian" and "English").
Each text file contains the title of the paper on the first line; the rest of the file is the abstract.
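
The first-line-is-title layout can be read with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, and the handling of a byte-order mark and of blank lines is an assumption about the files, not something this README specifies.

```python
def split_title_and_abstract(text: str) -> tuple[str, str]:
    """Return (title, abstract), treating the first line as the title."""
    # Assumption: a leading byte-order mark or stray blank lines may occur
    # and carry no meaning, so they are dropped before splitting.
    lines = text.lstrip("\ufeff").strip().splitlines()
    if not lines:
        return "", ""
    title = lines[0].strip()
    # Everything after the first line is treated as the abstract;
    # line breaks inside it are collapsed to single spaces.
    abstract = " ".join(line.strip() for line in lines[1:] if line.strip())
    return title, abstract
```

A file would typically be read first with something like open(path, encoding="utf-8").read(); the UTF-8 encoding is likewise an assumption.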
The two files of a Serbian-English text pair have the same number (from 1 to 50) in their file names. The same holds for the files with keyword annotations, which are in the "keywords" folders, kept separately for English and Serbian. Authors sometimes use keywords that do not appear in the text of their abstract, so the author-provided keywords include out-of-vocabulary words. For details, please refer to Section 3, "Textual resources", of the paper.
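
Pairing the files by number and checking keywords against the abstract text could be sketched in Python as below. The helper names are hypothetical, and several details are assumptions: that the files carry a ".txt" extension, that the first run of digits in a file name is the document number, and that keywords have already been read into a list of strings. The out-of-vocabulary check is a plain case-insensitive substring match, which is only an approximation (Serbian is inflected, so a keyword may appear in a different word form); the paper may define it differently.

```python
import re
from pathlib import Path


def index_by_number(folder: Path) -> dict[int, Path]:
    """Map the number found in each file name to the file's path."""
    index = {}
    # Assumption: documents are *.txt files directly inside the folder.
    for path in sorted(folder.glob("*.txt")):
        match = re.search(r"\d+", path.stem)  # first run of digits
        if match:
            index[int(match.group())] = path
    return index


def pair_documents(english_dir, serbian_dir) -> dict[int, tuple[Path, Path]]:
    """Return {number: (english_path, serbian_path)} for numbers in both folders."""
    english = index_by_number(Path(english_dir))
    serbian = index_by_number(Path(serbian_dir))
    shared = sorted(english.keys() & serbian.keys())
    return {number: (english[number], serbian[number]) for number in shared}


def out_of_vocabulary(keywords: list[str], text: str) -> list[str]:
    """Return the keywords that do not occur verbatim in the text."""
    lowered = text.lower()
    missing = []
    for keyword in keywords:
        cleaned = keyword.strip()
        # Case-insensitive substring match; no stemming or lemmatization.
        if cleaned and cleaned.lower() not in lowered:
            missing.append(cleaned)
    return missing
```

The same pairing function can be pointed at the two "keywords" folders, since their files follow the same numbering.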


If you use Bilingual-Serbian-English-KE-Dataset, please cite the paper. The BibTeX citation is:

@InProceedings{10.1007/978-3-319-74497-1_5,
author="Beliga, Slobodan
and Kitanovi{\'{c}}, Olivera
and Stankovi{\'{c}}, Ranka
and Martin{\v{c}}i{\'{c}}-Ip{\v{s}}i{\'{c}}, Sanda",
editor="Szyma{\'{n}}ski, Julian
and Velegrakis, Yannis",
title="Keyword Extraction from Parallel Abstracts of Scientific Publications",
booktitle="Semantic Keyword-Based Search on Structured Data Sources",
year="2018",
publisher="Springer International Publishing",
address="Cham",
pages="44--55",
isbn="978-3-319-74497-1"
}



LINK: https://link.springer.com/chapter/10.1007/978-3-319-74497-1_5 

*********************
**       DATA      **
*********************

The dataset is derived from the scientific journal "Underground Mining Engineering", published by the University of Belgrade, Faculty of Mining and Geology. The texts were downloaded in their original form from the Biblisha digital library: http://jerteh.rs/biblisha/ListaDokumenata.aspx?JCID=2&lng=en.


*********************
**      LICENSE    **
*********************

Bilingual Serbian-English Dataset for Keyword Extraction Task: all files made available by the LangNet team from the Department of Informatics, University of Rijeka, and the Human Language Technology Group from the University of Belgrade are licensed under the CC BY-NC-SA 4.0 license.