Construction of the Hungarian WordNet Ontology and its Application in Information Extraction Systems
- University of Szeged, Department of Informatics, HLT Group (coordinator)
- MorphoLogic Ltd. Budapest
- Research Institute for Linguistics at HAS, Department of Corpus Linguistics
Brief project summary
Developing computer applications for the Hungarian language requires a Hungarian vocabulary database that automated processes can handle. In computational linguistics, an ontology can be defined as a data structure of formally defined concepts and relations, by means of which semantic inferences can be drawn. So-called language ontologies form an important subclass of computational ontologies.
The objective of the project was to create a semantically structured, general-purpose Hungarian concept set on the basis of the results and formalism of the EuroWordNet language ontology. A further aim was to supplement the resulting ontology with a special sublanguage already examined by the consortium and with a domain-specific ontology including expressions of business language. Finally, we wanted to present a potential application of the resulting concept network in the field of information extraction.
The main result of the project is a large, strictly structured natural language concept set (ontology), which will help in finding solutions to several important scientific and technological problems. Regarding scientific achievements, it is important to emphasize that the developments concern the semantics of Hungarian, a language that differs significantly, both typologically and morphologically, from the other European languages investigated.
Further scientific and technical objectives of the project include:
- research and development of machine learning algorithms to support automatic, heuristic-based ontology building (algorithms help reduce manual work to validation);
- research in the fields of word sense disambiguation and anaphora resolution;
- development of an ontology-based information extraction software prototype for the domain of business news, capable of demonstrating the advantages of applying the concept network.
As the structure of WordNet ontologies is much more complex than that of any simple lexicon or thesaurus, their potential applications are far richer. As a mental encyclopedia of native speakers of Hungarian, a Hungarian WordNet ontology could greatly assist language teaching in schools. Its standardized interconnection with the other WordNets makes it applicable in teaching foreign languages as well. The proper acquisition of the lexical material of the foreign language studied, for example, may significantly contribute to the learner's clear understanding of the differences and similarities between his/her native language and the target language. Apart from this, the concept network of WordNet may play an important role in psycholinguistic experiments concerning Hungarian.
Beyond purely scientific applicability, a Hungarian WordNet may also open new vistas for language technology applications. Search engines become considerably more effective when they have reliable access to the semantic environment of the search expression, which may lead to future search engines that satisfy user needs to a greater extent. A Hungarian WordNet may also increase the efficiency of information extraction and machine translation technologies by providing information about the semantic attributes of the analysed text. Automated methods supported by ontologies can handle the context of the information to be extracted or translated, and are therefore likely to produce more reliable results than mere pattern matching or word-by-word translation.
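To illustrate how access to the semantic environment of a search expression can help, the following is a minimal sketch of synonym-based query expansion. The tiny synonym table and the function name `expand_query` are assumptions made for this example only; they are not part of the project's software.

```python
# Toy synonym sets; a real system would read these from the WordNet database.
SYNSETS = [
    {"autó", "gépkocsi", "kocsi"},   # "car"
    {"orvos", "doktor"},             # "doctor"
]


def expand_query(terms, synsets=SYNSETS):
    """Return the query terms together with all their known synonyms."""
    expanded = set()
    for term in terms:
        expanded.add(term)           # always keep the original term
        for synset in synsets:
            if term in synset:
                expanded |= synset   # add every synonym from the matching set
    return expanded


query = expand_query(["autó", "bérlés"])
```

A term without an entry in the table, such as "bérlés" here, is passed through unchanged, so the expansion never narrows the original query.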
Larger-scale language technology developments benefit greatly from a general, well-structured lexical network. The most important forerunner of language ontologies is the Princeton WordNet, which arranges lexical patterns of the English language into a concept network. Princeton WordNet was originally intended to be a mere theoretical model of the mental encyclopedia, and it was only later that it became popular among application developers. The basic units of Princeton WordNet are the so-called synsets ("synonym sets"), each of which groups together words that are synonyms of one another. Synsets are interconnected by semantic relations, the most important of which are hypernymy/hyponymy and meronymy/holonymy.
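The synset structure described above can be sketched as a small data type. This is only an illustration: the class name `Synset`, the field names and the identifiers such as `hu-n-001` are hypothetical and do not reflect the storage format used by Princeton WordNet or by the project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synset:
    """One concept node: a set of synonymous words plus typed links."""
    synset_id: str                                            # hypothetical id scheme
    lemmas: List[str]
    hypernyms: List["Synset"] = field(default_factory=list)   # more general concepts
    holonyms: List["Synset"] = field(default_factory=list)    # wholes this is a part of


def hypernym_chain(synset: Synset) -> List[str]:
    """Follow the first hypernym link upwards and collect the ids visited."""
    chain = []
    current = synset
    while current.hypernyms:
        current = current.hypernyms[0]
        chain.append(current.synset_id)
    return chain


animal = Synset("hu-n-001", ["állat"])
dog = Synset("hu-n-002", ["kutya", "eb"], hypernyms=[animal])
tail = Synset("hu-n-003", ["farok"], holonyms=[dog])   # a tail is part of a dog

print(hypernym_chain(dog))
```

Only the upward links are stored here; the inverse relations (hyponymy and meronymy) could be derived by scanning the network rather than being stored twice.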
The EuroWordNet ontologies were created as an extension of the Princeton WordNet; as a result, the WordNet system presently supports eight languages (English, Dutch, Spanish, Italian, French, German, Czech and Estonian). An explicit objective of EuroWordNet is to support multilingualism, and this objective is ensured by a common linguistic interface guaranteeing total mutual accessibility. It is characteristic of the advanced state of EuroWordNet that its top, language-independent layer (Base Concepts) has become the standard for candidate languages. This allowed the Balkan languages to join the system in the framework of the BalkaNet project, which has connected five further languages (Greek, Romanian, Serbian, Turkish and Bulgarian) to WordNet.
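One way to picture the mutual accessibility between languages is a shared table of language-independent concept records, each listing the words that express the concept in every connected language. The sketch below assumes such a table; the record identifiers, the `INDEX` name and the `translate` function are invented for illustration and are not EuroWordNet's actual data format.

```python
# Hypothetical shared concept index: one record per language-independent concept.
INDEX = {
    "c-0001": {"en": ["dog"], "hu": ["kutya", "eb"], "nl": ["hond"]},
    "c-0002": {"en": ["house"], "hu": ["ház"], "nl": ["huis"]},
}


def translate(lemma, source, target, index=INDEX):
    """Return target-language words sharing a concept record with `lemma`."""
    results = []
    for record in index.values():
        if lemma in record.get(source, []):
            results.extend(record.get(target, []))
    return results


dutch = translate("kutya", "hu", "nl")
hungarian = translate("dog", "en", "hu")
```

Because every language is linked to the shared records rather than to each other, connecting a new language requires only one mapping instead of one per language pair.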
Joining EuroWordNet and BalkaNet would have serious, long-term advantages for research and development in Hungarian language technology, since the system provides an elaborate interface to the semantic networks of numerous languages. EuroWordNet, as an intellectual product that takes multilingualism into consideration, integrates the favourable features and theoretical achievements of independent computer-ontology research of the past decades. Apart from this, the formalism of EuroWordNet provides a high-level, cost-saving starting point for the development of a Hungarian language ontology, which constituted the objective of the current project.
Prior to this project, some results had already been achieved in basic ontology research. These efforts were supported by MorphoLogic Ltd. The results published in international forums, as well as the arrangement of 10,000 Hungarian nouns into a network according to Princeton WordNet principles, were financed from its own resources.
Apart from international WordNet initiatives and preliminary research, a former project of the consortium also contributes to the current work. Within the framework of the NKFP 2/17/2001 project, the consortium had already experimented with automated semantic analysis of Hungarian. Within the same project, a prototype of an information extraction system was developed, which already made use of semantic information identified and marked during the analyses. At that time, however, an event-describing technology was employed that relied only on primary semantic attributes and not on a well-structured concept network. The consortium expects that the development of a Hungarian WordNet will greatly improve the efficiency of the formerly developed information extraction system, since the ontology contains far more information than the semantic attributes utilized previously.
The participants defined the development principles and methodology of the Hungarian version of WordNet and worked out a technique for handling occasional deviations from the original structure. This task involved creating the necessary working environment, databases and software technology, and designing the data storage model. The consortium examined the software used in former EuroWordNet projects and selected the most suitable programs, which were then adapted to Hungarian.
The partners also created the database of the information extraction system, which was compiled according to previously defined criteria from the daily economic and business news of the National News Agency. This database serves as the training database for semantic analysis methods and allows for continuous validation.
At the same time, the building of the Hungarian ontology started, using methods standardized in the course of the EuroWordNet developments. First, Hungarian equivalents of the most basic EuroWordNet concepts were identified, thereby creating the core of the concept network. Afterwards, the concept network was gradually expanded with a top-down approach. Correspondence with the EuroWordNet ontologies means filling the available conceptual nodes (synsets) with Hungarian words (nouns, verbs, adjectives, adverbs), while taking the semantic features of Hungarian into consideration. This is especially important since the differentiation of meanings is the basis of word sense disambiguation.
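The top-down expansion from a core of basic concepts can be sketched as a breadth-first walk over the source network. This is an illustrative assumption about how such a procedure might be organized, not the consortium's actual tooling: the function `expand_top_down`, the toy hyponym table and the candidate translations are all hypothetical.

```python
from collections import deque


def expand_top_down(base_concepts, hyponyms, translations, max_depth):
    """Walk the source network level by level, starting from the base concepts.

    Returns (covered, pending): concepts with a Hungarian equivalent, and
    concepts that still need manual work.
    """
    covered = {}
    pending = []
    seen = set(base_concepts)
    queue = deque((concept, 0) for concept in base_concepts)
    while queue:
        concept, depth = queue.popleft()
        if concept in translations:
            covered[concept] = translations[concept]
        else:
            pending.append(concept)          # no equivalent found yet
        if depth < max_depth:
            for child in hyponyms.get(concept, []):
                if child not in seen:        # visit each concept only once
                    seen.add(child)
                    queue.append((child, depth + 1))
    return covered, pending


# Toy source network (more general concept -> more specific concepts).
hyponyms = {
    "entity": ["animal", "artifact"],
    "animal": ["dog"],
    "artifact": ["vehicle"],
    "vehicle": ["car"],
}
# Toy candidate Hungarian equivalents, e.g. from a bilingual dictionary.
translations = {
    "entity": ["entitás"],
    "animal": ["állat"],
    "dog": ["kutya", "eb"],
    "vehicle": ["jármű"],
}

covered, pending = expand_top_down(["entity"], hyponyms, translations, max_depth=2)
```

Processing the network level by level means that general concepts are settled before their more specific descendants, and the depth limit lets the work proceed in stages.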
Another achievement was the compilation of a test database, selected from the short news collection described above and linguistically pre-processed. The test database contains manual annotation of semantic information and identifies significant events. The annotation was carried out on the basis of the previously created ontology model.