Uncertainty and negation detection

In computational linguistics, especially in information extraction and retrieval, it is essential to distinguish uncertain statements from factual information. In most cases the user needs factual information, so uncertain and negated propositions should be treated in a special way: depending on the exact task, the system should either ignore such texts or separate them from factual information.
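As a rough illustration of what separating such texts from factual information can look like, the sketch below sorts sentences into three groups by simple cue-word lookup. The cue lists and the function names classify_sentence and separate are hypothetical and chosen only for this example; they are not taken from the corpora or software described on this page, and a real detector would be trained on annotated data rather than rely on a fixed word list.

```python
import re

# Hypothetical, deliberately tiny cue lists (assumption for illustration only).
UNCERTAINTY_CUES = {"may", "might", "possibly", "perhaps", "suggest", "suggests", "likely"}
NEGATION_CUES = {"not", "no", "never", "without", "neither", "nor"}


def classify_sentence(sentence):
    """Return 'uncertain', 'negated' or 'factual' based on cue-word lookup."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    # Uncertainty is checked first; this ordering is an arbitrary choice of the sketch.
    if tokens & UNCERTAINTY_CUES:
        return "uncertain"
    if tokens & NEGATION_CUES:
        return "negated"
    return "factual"


def separate(sentences):
    """Group sentences by label so factual ones can be processed on their own."""
    buckets = {"factual": [], "uncertain": [], "negated": []}
    for sentence in sentences:
        buckets[classify_sentence(sentence)].append(sentence)
    return buckets


if __name__ == "__main__":
    examples = [
        "The protein binds DNA.",
        "The protein may bind DNA.",
        "The protein does not bind DNA.",
    ]
    for label, group in separate(examples).items():
        print(label, group)
```

A lookup of this kind ignores context (for instance, "may" can also be a month name), which is one reason why corpora with manually annotated cues and scopes are needed.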

Corpora

Our researchers developed the BioScope corpus, in which biomedical texts are annotated for uncertainty and negation cues and their scopes. The objective of the CoNLL-2010 Shared Task, organized by members of the human language technology group, was the automatic identification of uncertainty cues and their scopes. Following the annotation principles applied in constructing the databases used in the shared task, we created a database of Hungarian Wikipedia articles annotated for uncertainty cues known as weasels. This corpus can play an essential role in implementing and evaluating uncertainty detectors for Hungarian.
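To make the notions of cue and scope more concrete, here is a minimal sketch of how one such annotation could be held in memory. The class CueAnnotation, its field names and the token-index convention are assumptions made for this example and do not reproduce the actual file format of the corpora; the sketch also assumes that the scope span includes the cue itself.

```python
from dataclasses import dataclass


@dataclass
class CueAnnotation:
    """One annotated cue with its scope, as token index spans (end exclusive)."""
    cue_type: str  # e.g. "speculation" or "negation" (labels assumed for this sketch)
    cue: tuple     # (start, end) token indices of the cue
    scope: tuple   # (start, end) token indices of the scope


def span_text(tokens, span):
    """Return the text covered by a (start, end) token span."""
    start, end = span
    return " ".join(tokens[start:end])


if __name__ == "__main__":
    tokens = "These results suggest that the gene is expressed".split()
    ann = CueAnnotation(cue_type="speculation", cue=(2, 3), scope=(2, 8))
    print("cue:  ", span_text(tokens, ann.cue))
    print("scope:", span_text(tokens, ann.scope))
```

Storing spans as token indices rather than as strings keeps the annotation unambiguous when the same word occurs more than once in a sentence.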

Several corpora annotated for uncertainty have been constructed for different genres and domains (BioScope, FactBank, WikiWeasel and MPQA, to name just a few). However, these corpora cover different aspects of uncertainty and are grounded in different linguistic models, which makes it hard to exploit cross-domain knowledge in applications. These differences stem in part from the varied needs of application domains, since different types of uncertainty and classes of linguistic expressions are relevant for different domains. A fine-grained categorization of semantic uncertainty enables each subclass to be treated individually, which is less dependent on domain differences than using one coarse-grained uncertainty class.

Based on the above fine-grained categorization, we manually harmonized the uncertainty annotations of three corpora, yielding the Szeged Uncertainty Corpus:

  • BioScope (Vincze et al., 2008)
  • FactBank (Saurí and Pustejovsky, 2009)
  • WikiWeasel (Farkas et al., 2010)

Experiments

The feasibility of this categorization of uncertainty phenomena was supported by training an accurate semantic uncertainty detector on the above corpora, i.e. texts from several domains and genres. Our experiments with domain adaptation techniques also highlight that the unified subcategorization and domain adaptation, taken together, offer an efficient solution for cross-domain and cross-genre semantic uncertainty recognition. Our results are reported in Szarvas et al. (2012).

Downloads

  • Databases used in the CoNLL-2010 Shared Task: BioScope 1.5 and WikiWeasel 1.0
  • BioScope 1.0
  • Hungarian weasel corpus
  • Szeged Uncertainty Corpus (including BioScope 2.0, WikiWeasel 2.0 and FactBank 2.0)
  • software used for experiments and data files containing the features
  • trained models for uncertainty detection [coming soon...]
  • WikiWeasel 3.0 [coming soon...]
  • hUnCertainty 1.0 [coming soon...]
  • Hungarian Facebook posts annotated for uncertainty
  • The resources can be used free of charge under the Creative Commons Attribution Share Alike licence.

References