Machine learning of syntax rules (application of machine learning methods for the generation of Hungarian syntactic rules)
- University of Szeged, Department of Informatics, HLT Group (coordinator)
- MorphoLogic Ltd. Budapest
- Research Institute for Linguistics at HAS, Department of Corpus Linguistics
Brief project summary
Parsing, or the syntactic analysis of texts, plays a key role in natural language processing (NLP). Like many other languages, Hungarian relies heavily on the use and interrelation of suffixes (morphemes) and elementary word structures (syntagmas). Recognizing syntagmas and identifying their relations to each other is essential in NLP systems: without it, semantic analysis of natural language sentences would not be feasible. Artificial intelligence programs could also work much more efficiently if they incorporated a thorough syntactic analysis. Promising fields of application include machine translation, automatic information extraction, and text analysis for scientific or commercial purposes.
Research groups studying the structure of Hungarian sentences have made great efforts to produce a consistent syntax rule system, yet the resulting systems have not so far been adaptable to practical, computer-related purposes. There is therefore a strong demand for a technology that can divide a Hungarian sentence into syntactic segments, recognize their structure, and, based on this recognition, assign an annotated tree representation to each sentence. Such so-called treebank representations have already been developed for most West European languages and for some Central and East European languages as well.
Accordingly, the project's main goal was twofold. On the one hand, we aimed to create a gold standard database, a treebank that represents the characteristics of Hungarian syntax. On the other hand, we aimed to develop a generally applicable syntactic parser for the Hungarian language, with the development of the parser supported by machine learning algorithms. The results delivered by the parsing system are expected to match the level of human annotation.
Results of the IKTA 27/2000 project (Development of a Part-of-Speech Tagging Method for Hungarian by using Machine Learning Algorithms) and the NKFP 2/017/2001 national RTD project (Information Extraction from Short Business News) were used in the project described here.
Creation of the Szeged Treebank 2.0 - full syntactic annotation
Initially, the emphasis was on preparing the texts to be annotated and on defining the annotation model to be used. The Szeged Treebank 1.0 (developed in the previously mentioned projects) served as the basis for further annotation. The corpus already contained morpho-syntactic annotation and POS tagging, as well as the labelling of noun phrases and clause boundaries. Before further analysis began, the consortium developed a technology for the management of special tokens. An intermediate version of the treebank was built with this technology, marking hard-to-handle units such as codes, mathematical formulas, e-mail and web addresses, and named entities.
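The marking of special tokens described above can be illustrated with a minimal Python sketch. The category names, the regular expressions, and the function `mark_special_tokens` are assumptions made for illustration only; the consortium's actual inventory and rules are not given here. Named entities are left out, since recognizing them generally needs more than surface patterns.

```python
import re

# Hypothetical categories and patterns (illustrative assumptions, not the
# project's real rules). The first matching pattern determines the label.
SPECIAL_TOKEN_PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("formula", re.compile(r"^\d+([+\-*/=]\d+)+$")),
    # A "code" here means an all-caps token that mixes letters and digits.
    ("code", re.compile(r"^(?=.*\d)(?=.*[A-Z])[A-Z0-9/-]+$")),
]


def mark_special_tokens(tokens):
    """Return (token, label) pairs; ordinary tokens get the label 'word'."""
    marked = []
    for token in tokens:
        label = "word"
        for name, pattern in SPECIAL_TOKEN_PATTERNS:
            if pattern.match(token):
                label = name
                break
        marked.append((token, label))
    return marked


if __name__ == "__main__":
    sample = ["Írjon", "az", "info@example.hu", "címre", "www.example.hu",
              "IKTA-27/2000", "2+2=4"]
    for token, label in mark_special_tokens(sample):
        print(f"{token}\t{label}")
```

Marking such units before parsing lets later stages treat each of them as a single opaque token instead of trying to analyse its internal characters.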
In the following step, the syntactic annotation model was defined with automation in mind. The theoretical background of the model adapts the sentence analysis mechanism of Hungarian generative syntax. The annotation scheme based on this theory was applied to mark adjectival, adverbial, and postpositional phrases, preverbs, infinitive and verb phrase structures, negations, and conjunctions.
Following the Penn Treebank building scheme, the insertion of the new syntactic labels consisted of an automatic pre-analysis phase followed by a manual validation and correction phase. Annotation and correction primarily aimed at the inclusion of the syntactic units listed above, but noun phrases, already annotated in previous projects, were also revised, and their inner structure was defined in more detail. To support the manual work, the consortium developed a software package called Syntax Editor, which allows easy management of sentences and syntactic units.
The resulting Szeged Treebank 2.0 serves as a reference database for further NLP activities. The files of the treebank are stored in XML format (http://www.xml.org); their inner structure is described by the TEI (xLite and P4) DTD (Document Type Definition) (http://www.tei-c.org).
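As a rough illustration of how an XML-encoded syntactic annotation can be read programmatically, the following Python sketch converts a small fragment into a bracketed tree. The fragment and its element and attribute names (`s`, `NP`, `VP`, `w`, `pos`) are hypothetical and chosen only for readability; they are not taken from the treebank's actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element and attribute names are illustrative
# assumptions, not the real Szeged Treebank markup.
SAMPLE = """
<s>
  <NP><w pos="N">Péter</w></NP>
  <VP><w pos="V">olvas</w><NP><w pos="Det">egy</w><w pos="N">könyvet</w></NP></VP>
</s>
"""


def to_brackets(element):
    """Render an annotated element as a bracketed tree string."""
    if element.tag == "w":
        # Word level: use the part-of-speech attribute as the node label.
        return f"({element.get('pos')} {element.text})"
    children = " ".join(to_brackets(child) for child in element)
    return f"({element.tag.upper()} {children})"


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    print(to_brackets(root))
```

A bracketed form like this is convenient as an intermediate representation because the nesting of phrases is visible on a single line.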
Machine learning of syntax rules and the development of the automatic parser
After the completion of the treebank annotation, machine learning methods were applied to learn syntax rules for the automated analysis of Hungarian natural language sentences.
Consortium members developed the syntactic analyzer on two foundations. On the one hand, they investigated the available literature and the theories already developed in the research area. By studying and comparing existing results, the linguistic experts of the consortium outlined a consistent system of syntax rules. On the other hand, partners used machine learning algorithms to automatically retrieve regularities from the annotated treebank, thereby improving the quality of the manually defined rule set. One of the most significant results of the project is precisely this technology, which combines human creativity with machine learning methods to achieve higher accuracy and better coverage of the rules.
One of the basic requirements for the automated syntax rule system is that it should be easily modifiable and extensible. As natural languages change continuously and at a relatively fast pace, the technology should be able to follow these changes. A further aim of the work, therefore, was to develop the technology in a flexible and language-independent way. This ensures that the technology can be easily adapted to other languages as well, creating links between Hungarian and other languages.
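The idea of combining a manually defined rule set with regularities retrieved from an annotated treebank can be sketched as follows. This is a deliberately simple frequency-count stand-in, not the consortium's learning method, which is not specified here. The toy trees, the threshold `min_count`, and all function names are assumptions made for illustration.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        assert tokens[position] == "("
        label = tokens[position + 1]
        position += 2
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1  # skip the closing bracket
        return (label, children)

    return read_node()


def count_rules(node, counts):
    """Count phrase-level productions (label -> child labels) in one tree."""
    label, children = node
    phrase_children = [c for c in children if isinstance(c, tuple)]
    if phrase_children:
        counts[(label, tuple(c[0] for c in phrase_children))] += 1
        for child in phrase_children:
            count_rules(child, counts)


def induce_rules(trees, manual_rules, min_count=2):
    """Merge manual rules with productions seen at least min_count times."""
    counts = Counter()
    for tree in trees:
        count_rules(parse_brackets(tree), counts)
    learned = {rule for rule, n in counts.items() if n >= min_count}
    return set(manual_rules) | learned, counts


# Toy treebank (hypothetical sentences and labels).
TREES = [
    "(S (NP (N Péter)) (VP (V olvas) (NP (Det egy) (N könyvet))))",
    "(S (NP (N Mari)) (VP (V alszik)))",
    "(S (NP (Det a) (N fiú)) (VP (V lát) (NP (Det egy) (N kutyát))))",
]
MANUAL_RULES = {("NP", ("Det", "N"))}

if __name__ == "__main__":
    rules, counts = induce_rules(TREES, MANUAL_RULES)
    for lhs, rhs in sorted(rules):
        print(f"{lhs} -> {' '.join(rhs)}  (seen {counts[(lhs, rhs)]} times)")
```

In this sketch the manual rules are always kept, while learned productions are added only when they recur, so a rare pattern seen once in the data does not enter the rule set unless an expert has listed it.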
Direct application of the developed syntax rule system was realized in the form of a syntactic parser prototype. The technology was integrated into the so-called MetaMorpho analyser of MorphoLogic Ltd. MetaMorpho supports a variety of natural language applications such as machine translation, text understanding, information extraction and the like.