Natural Language Processing

The Natural Language Processing group has been involved in human language technology research since 1998 and has become one of the leading centres of Hungarian computational linguistics. The group mostly processes Hungarian and English texts. Its general objective is to develop language-independent or easily adaptable technologies.

Recently, members of the group have focused on three main research topics. First, they have been working on the development of magyarlanc, a morphological and syntactic parser for Hungarian. Currently, the analysis of non-standard texts is the main challenge in this field, and novel datasets and techniques have been developed to improve performance. Second, the group has been investigating several topics in computational semantics for some years. For instance, it has developed machine-learning-based solutions for identifying non-factual (e.g. negated or speculative) text spans. The group's research interests also include multiword expressions (linguistic units that consist of several tokens but have special syntactic and semantic characteristics): several datasets and machine learning tools have been created within the framework of an international cooperation involving more than 20 countries. Third, the group has for some years been investigating the role of natural language processing tools in the detection of dementia, together with speech technology experts and psychiatrists.

The group has organized several national and international conferences, for instance, the annual Hungarian Computational Linguistics Conferences, the Second International Workshop on Computational Linguistics for Uralic Languages in 2016 and the PARSEME Shared Tasks in 2017 and 2018.


Veronika Vincze (contact), László Vidács, Gábor Berend, László Tóth, András Kicsi, Péter Pusztai


The IC1207 COST Action, PARSEME, is an interdisciplinary scientific network devoted to the role of multiword expressions (MWEs) in parsing. It brings together experts (linguists, computational linguists, computer scientists, psycholinguists, and industry practitioners) from 31 countries. It addresses different methodologies (symbolic, probabilistic and hybrid parsing) and language technology applications (machine translation, information retrieval, etc.). The most important results of the action include the creation of a manually annotated database of verbal multiword expressions for 20 languages and the organization of several workshops and two shared tasks.

e-magyar is a new toolset for the analysis of Hungarian texts. It was produced as a collaborative effort of the Hungarian language technology community, integrating the best state-of-the-art tools, enhancing them where necessary, making them interoperable and releasing them under a clear license. It is a free, open, modular text processing pipeline that is integrated in the GATE system, which offers further prospects of interoperability. It analyses Hungarian texts from tokenization to syntactic parsing, together with named entity recognition. Members of the Szeged NLP team were involved in the development of the morphological and syntactic parsing modules.

Information retrieval from Hungarian radiology reports
In this project, information is retrieved from MR reports using manual annotations of anonymized reports. Machine learning is applied to the annotated reports to label body parts, changes and their properties in free text. In the near future, we plan to apply deep learning and ontology-based methods to make the analysis more reliable. Building on this phase, the project will seek answers to questions related to specific illnesses.

Classification of non-functional requirements
Requirements engineering is one of the very first tasks in the software development process, and it fundamentally influences the quality of the software under development. Requirements, which can be either functional or non-functional, are mostly given in natural language. Non-functional requirements are the foundation of the quality aspects of the software, such as security, usability and reliability. Classifying non-functional requirements is one of the most important tasks of software engineering. The objective of the project is to develop machine-learning-based (and deep-learning-based) methods and tools that can support system analysts in classifying non-functional requirements given in natural language. The collection of classified non-functional requirements can be used in both the analysis and design phases.

Multi-label classification for tagging user feedback given in natural language form
When users or customers express their expectations of the software, they use natural language. Sentences of feedback or requirements often cover more than one aspect of these expectations, so they can be assigned to more than one class. The objective of the project is to develop machine-learning-based (and deep-learning-based) methods that can be applied to multi-label classification, and to develop a tagger tool based on these methods. The method is also to be extended to support the multi-label tagging of sentences.
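
To make the idea of multi-label tagging concrete, the following is a minimal sketch of one common approach, binary relevance, in which a separate yes/no classifier is trained for each label and a sentence receives every label whose classifier fires. It is an illustration only and not the project's method: the perceptron learner, the whitespace tokenizer, the class names, the label names and the four toy sentences are all assumptions made for this example.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (a placeholder for a real tokenizer)."""
    return set(text.lower().split())


class BinaryPerceptron:
    """Bag-of-words perceptron that answers one yes/no question per sentence."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, tokens):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def fit(self, examples, epochs=10):
        for _ in range(epochs):
            for tokens, target in examples:
                # Update only when the current prediction is wrong.
                if (self.score(tokens) > 0) != target:
                    step = 1.0 if target else -1.0
                    self.bias += step
                    for t in tokens:
                        self.weights[t] = self.weights.get(t, 0.0) + step


class BinaryRelevanceTagger:
    """Multi-label tagger: one independent binary model per label."""

    def fit(self, sentences, label_sets, epochs=10):
        self.labels = sorted(set().union(*label_sets))
        self.models = {}
        for label in self.labels:
            # Each label gets its own binary view of the same training data.
            examples = [(tokenize(s), label in labels)
                        for s, labels in zip(sentences, label_sets)]
            model = BinaryPerceptron()
            model.fit(examples, epochs)
            self.models[label] = model
        return self

    def predict(self, sentence):
        tokens = tokenize(sentence)
        return {label for label, model in self.models.items()
                if model.score(tokens) > 0}


# Hypothetical toy feedback; a sentence may carry zero, one or several labels.
SENTENCES = [
    "the app crashes on login",
    "the menu is hard to navigate",
    "login is confusing and the app crashes",
    "great update thanks",
]
LABEL_SETS = [
    {"reliability"},
    {"usability"},
    {"reliability", "usability"},
    set(),
]

tagger = BinaryRelevanceTagger().fit(SENTENCES, LABEL_SETS)
print(sorted(tagger.predict(SENTENCES[2])))
```

Binary relevance ignores correlations between labels, which is why it is shown here only as a simple baseline; with so little data the sketch merely reproduces the labels of its own training sentences and says nothing about unseen feedback.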

Selected publications