Scientific digital libraries (e.g. arXiv, dblp, HAL) play a critical role in the development and dissemination of scientific literature. Despite dedicated search engines, retrieving relevant publications from the ever-growing body of scientific literature remains challenging and time-consuming. Indexing scientific articles is difficult, and current models rely solely on a small portion of each article (title and abstract) and on author-assigned keyphrases when available. There is thus a pressing need for better indexing models, and an important step in this direction is to develop automatic keyphrase extraction models that operate efficiently on the full text. However, existing keyphrase extraction models perform poorly on scientific articles because of the large number of candidates (i.e. phrases that are relevant for indexing) and the error-prone content (e.g. mathematical formulas, references) that they have to cope with.


In this collaborative research project, we aim to address the issue of indexing scientific articles with robust, neural-network-based keyphrase extraction models. More precisely, we will focus on improving the automatic recognition of keyphrase candidates, which is a prerequisite for accurate and consistent indexing. To this end, we will turn our attention to recently proposed neural architectures and study how they can be applied to weed out spurious candidates.