English Disambiguation Grammar

Revision as of 16:08, 28 July 2012

The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve tokenization results and to control the application of t-rules. They follow the formalism described at UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).

UNLization

In natural language analysis, the d-grammar controls the tokenization of English sentences, i.e., it prevents wrong lexical choices and induces the best matches. The d-grammar comprises two types of rules:

  • Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices.
    For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot come before a verb.
  • Positive rules, where the probability is greater than 0, force lexical choices.
    For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would simply select the first one appearing in the dictionary with the highest frequency.
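
The distinction between negative and positive rules can be illustrated with a small Python sketch. It is not the parser used by IAN: the function names `parse_d_rule` and `rule_type` are hypothetical, and the sketch assumes a simplified reading of the notation in which each pair of parentheses is one node, conditions inside a node are separated by commas, and the number after `=` is the probability.

```python
import re


def parse_d_rule(text):
    """Parse '(A,B)(C)=P;' into (nodes, probability).

    Each node is the list of comma-separated conditions found inside
    one pair of parentheses. Simplified reading of the notation, not
    the official UNL parser.
    """
    body, sep, prob = text.strip().rstrip(";").rpartition("=")
    if not sep:
        raise ValueError("missing '=probability': " + text)
    nodes = [node.split(",") for node in re.findall(r"\(([^()]*)\)", body)]
    if not nodes:
        raise ValueError("no nodes found: " + text)
    return nodes, float(prob)


def rule_type(probability):
    """Negative (blocking) if the probability is 0, positive otherwise."""
    return "negative" if probability == 0 else "positive"


# The two example rules quoted in the list above.
for rule in ["(D)(BLK)(V)=0;", "(['s],V)(BLK)(GER)=1;"]:
    nodes, probability = parse_d_rule(rule)
    print(rule_type(probability), nodes)
```

The splitting on commas is deliberately naive and would need refinement for conditions that themselves contain commas.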

File

The English d-grammar for the Corpus500 may be downloaded from eng_ana_dgrammar.txt (http://www.unlweb.net/resources/corpus500/eng_ana_dgrammar.txt). The complete English d-grammar may be exported from the UNLarium: UNLWEB>UNLARIUM>GRAMMAR>ENGLISH>EXPORT.

How to use d-grammars

The d-grammar must be uploaded to, or entered directly in, the d-rules tab of IAN.

Examples of disambiguation rules

Preventing hyper-segmentation of temporary entries
"asdfg" must be tokenized as [asdfg] (one single temporary entry) instead of [as][dfg], which would otherwise be the case because [as] is in the dictionary.
(^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
This rule states that words must be isolated by blank spaces, punctuation signs, or SHEAD and STAIL, except in the case of subwords, prefixes and suffixes.
(^DIGIT)({[st]|[nd]|[rd]|[th]})=0;
The subwords [st], [nd], [rd] and [th] may only appear after a number ("1st" or "10th"). This prevents hyper-segmentation as in "asdfgst" = [asdfg][st], which would not be blocked by the rule above because [st] is a suffix.
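
The effect of the two rules above can be traced with a minimal Python sketch. It is an illustration under stated assumptions, not IAN's matcher: a token is modelled as a surface form plus a set of features, `^` negates a condition, `{A|B}` is a disjunction, and `[x]` or `"x"` compares against the surface form. The helper names, the `LEX` feature and the feature sets given to each token are hypothetical, and `W` is treated as an ordinary feature that none of the example tokens carries.

```python
import re


def parse(rule):
    """Return (nodes, probability) for a rule such as '(A,^B)(C)=0;'."""
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    return [n.split(",") for n in re.findall(r"\(([^()]*)\)", body)], float(prob)


def holds(cond, token):
    """Check one condition against one token (simplified semantics)."""
    if cond.startswith("^"):                      # negation
        return not holds(cond[1:], token)
    if cond.startswith("{"):                      # disjunction {A|B}
        return any(holds(alt, token) for alt in cond[1:-1].split("|"))
    if cond[:1] in ("[", '"'):                    # surface form
        return token["form"] == cond[1:-1]
    return cond in token["feats"]                 # plain feature


def fires(nodes, tokens):
    """True if the node sequence matches some run of adjacent tokens."""
    n = len(nodes)
    return any(
        all(all(holds(c, t) for c in node)
            for node, t in zip(nodes, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def blocked(tokens, rules):
    """True if any probability-0 rule matches the candidate."""
    return any(p == 0 and fires(nodes, tokens) for nodes, p in map(parse, rules))


def tok(form, *feats):
    return {"form": form, "feats": set(feats)}


def candidate(*tokens):
    """Wrap a candidate segmentation in sentence head and tail markers."""
    return [tok("", "SHEAD"), *tokens, tok("", "STAIL")]


RULES = [
    '(^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;',
    "(^DIGIT)({[st]|[nd]|[rd]|[th]})=0;",
]

# Feature sets below are hypothetical stand-ins for dictionary entries.
CANDIDATES = {
    "[as][dfg]": candidate(tok("as", "LEX"), tok("dfg", "TEMP")),
    "[asdfg]": candidate(tok("asdfg", "TEMP")),
    "[asdfg][st]": candidate(tok("asdfg", "TEMP"), tok("st", "SBW", "SFX")),
    "[1][st]": candidate(tok("1", "DIGIT"), tok("st", "SBW", "SFX")),
}

for label, tokens in CANDIDATES.items():
    print(label, "blocked" if blocked(tokens, RULES) else "allowed")
```

Under these assumptions, [as][dfg] is rejected by the first rule and [asdfg][st] by the second, while [asdfg] and [1][st] pass both.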
Preventing the generation of two temporary words in sequence
"asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] instead of [asdfg][ ][hijkl], which would otherwise be the case because [ ] is in the dictionary.
(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
A temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence.
(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
A temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence.
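
The two boundary conditions on temporary words can be checked with a short Python sketch. It only illustrates the glosses given for the two rules (what may precede and follow a temporary word) and does not model how IAN arrives at the single entry [asdfg hijkl]. The matcher is a simplified assumption: tokens are a surface form plus a feature set, `^` negates, and `"x"` compares against the surface form. The helper names and the features `D` and `BLK` assigned to the example tokens are hypothetical.

```python
import re


def parse(rule):
    """Return (nodes, probability) for a rule such as '(A,^B)(C)=0;'."""
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    return [n.split(",") for n in re.findall(r"\(([^()]*)\)", body)], float(prob)


def holds(cond, token):
    """Check one condition against one token (simplified semantics)."""
    if cond.startswith("^"):
        return not holds(cond[1:], token)
    if cond[:1] in ("[", '"'):
        return token["form"] == cond[1:-1]
    return cond in token["feats"]


def fires(nodes, tokens):
    """True if the node sequence matches some run of adjacent tokens."""
    n = len(nodes)
    return any(
        all(all(holds(c, t) for c in node)
            for node, t in zip(nodes, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def blocked(tokens, rules):
    return any(p == 0 and fires(nodes, tokens) for nodes, p in map(parse, rules))


def tok(form, *feats):
    return {"form": form, "feats": set(feats)}


def candidate(*tokens):
    """Wrap a candidate segmentation in sentence head and tail markers."""
    return [tok("", "SHEAD"), *tokens, tok("", "STAIL")]


RULES = [
    '(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;',
    '(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;',
]

temp = tok("asdfg", "TEMP")      # word not found in the dictionary
the = tok("the", "D")            # hypothetical dictionary entry
blank = tok(" ", "BLK")

CANDIDATES = {
    "[asdfg][the]": candidate(temp, the),          # TEMP followed by a word
    "[the][asdfg]": candidate(the, temp),          # TEMP preceded by a word
    "[the][ ][asdfg]": candidate(the, blank, temp),
    "[asdfg]": candidate(temp),
}

for label, tokens in CANDIDATES.items():
    print(label, "blocked" if blocked(tokens, RULES) else "allowed")
```

With these inputs, a temporary word that directly touches a dictionary word is blocked on either side, whereas one delimited by a blank space or by the sentence boundaries is allowed.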
Determiners vs. pronouns
There are many ambiguities in the dictionary between determiners and pronouns. The string "that", for instance, is represented in the dictionary as a demonstrative determiner ("that book") and as a demonstrative pronoun ("that is the book"). The following rules help differentiate them:
(D,^AFT)({PUT|STAIL})=0;
Determiners cannot come at the end of the sentence unless their distribution is AFT (after): in "He said that", "that" cannot be a determiner, whereas in "There are books enough" the determiner "enough" is allowed.
(D)(BLK)(V)=0;
Determiners cannot come before a verb: in "That is the book", "that" cannot be a determiner.
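
As a rough illustration of how these two rules separate the readings of "that", the following Python sketch tests each reading against both rules. It is a simplified model, not IAN's behaviour: tokens are a surface form plus a feature set, `^` negates a condition and `{A|B}` is a disjunction. The helper names and the feature labels `PRO`, `N` and `A` are hypothetical; `D`, `V`, `AFT`, `BLK` and `STAIL` are taken from the rules themselves.

```python
import re


def parse(rule):
    """Return (nodes, probability) for a rule such as '(A,^B)(C)=0;'."""
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    return [n.split(",") for n in re.findall(r"\(([^()]*)\)", body)], float(prob)


def holds(cond, token):
    """Check one condition against one token (simplified semantics)."""
    if cond.startswith("^"):
        return not holds(cond[1:], token)
    if cond.startswith("{"):
        return any(holds(alt, token) for alt in cond[1:-1].split("|"))
    if cond[:1] in ("[", '"'):
        return token["form"] == cond[1:-1]
    return cond in token["feats"]


def fires(nodes, tokens):
    """True if the node sequence matches some run of adjacent tokens."""
    n = len(nodes)
    return any(
        all(all(holds(c, t) for c in node)
            for node, t in zip(nodes, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def blocked(tokens, rules):
    return any(p == 0 and fires(nodes, tokens) for nodes, p in map(parse, rules))


def tok(form, *feats):
    return {"form": form, "feats": set(feats)}


def sentence(*words):
    """SHEAD, the words separated by blank tokens, then STAIL."""
    tokens = [tok("", "SHEAD")]
    for i, word in enumerate(words):
        if i:
            tokens.append(tok(" ", "BLK"))
        tokens.append(word)
    return tokens + [tok("", "STAIL")]


RULES = ["(D,^AFT)({PUT|STAIL})=0;", "(D)(BLK)(V)=0;"]

# Try both dictionary readings of "that" in two of the example sentences.
for reading in ("D", "PRO"):
    that = tok("that", reading)
    s1 = sentence(that, tok("is", "V"), tok("the", "D"), tok("book", "N"))
    s2 = sentence(tok("he", "PRO"), tok("said", "V"), that)
    print(reading, "That is the book:", "blocked" if blocked(s1, RULES) else "allowed")
    print(reading, "He said that:", "blocked" if blocked(s2, RULES) else "allowed")

# "enough" is a determiner whose distribution is AFT, so it may end a sentence.
s3 = sentence(tok("there", "A"), tok("are", "V"), tok("books", "N"),
              tok("enough", "D", "AFT"))
print("There are books enough:", "blocked" if blocked(s3, RULES) else "allowed")
```

In both sentences only the pronoun reading of "that" survives, and the sentence-final "enough" is not blocked because it carries AFT.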
Auxiliary verbs vs. main verbs
Many auxiliary verbs may also play the role of main verbs: "He has done that" ("has" is an auxiliary) vs. "He has a car" ("has" is the main verb). The following rule helps differentiate them:
(AUX,^COP)(BLK)(^V,^[not])=0;
Auxiliary verbs that are not copulas must be followed by a verb or the adverb [not].
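
The rule above can be exercised with a small Python sketch that tests the auxiliary reading of "has" in different contexts. It is a simplified model under stated assumptions, not IAN's matcher: tokens are a surface form plus a feature set, `^` negates a condition, and `[x]` compares against the surface form. The helper names and the feature labels `PRO`, `D`, `N` and `A` are hypothetical; `AUX`, `COP`, `V` and `BLK` come from the rule.

```python
import re


def parse(rule):
    """Return (nodes, probability) for a rule such as '(A,^B)(C)=0;'."""
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    return [n.split(",") for n in re.findall(r"\(([^()]*)\)", body)], float(prob)


def holds(cond, token):
    """Check one condition against one token (simplified semantics)."""
    if cond.startswith("^"):
        return not holds(cond[1:], token)
    if cond[:1] in ("[", '"'):
        return token["form"] == cond[1:-1]
    return cond in token["feats"]


def fires(nodes, tokens):
    """True if the node sequence matches some run of adjacent tokens."""
    n = len(nodes)
    return any(
        all(all(holds(c, t) for c in node)
            for node, t in zip(nodes, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def blocked(tokens, rules):
    return any(p == 0 and fires(nodes, tokens) for nodes, p in map(parse, rules))


def tok(form, *feats):
    return {"form": form, "feats": set(feats)}


def sentence(*words):
    """SHEAD, the words separated by blank tokens, then STAIL."""
    tokens = [tok("", "SHEAD")]
    for i, word in enumerate(words):
        if i:
            tokens.append(tok(" ", "BLK"))
        tokens.append(word)
    return tokens + [tok("", "STAIL")]


RULES = ["(AUX,^COP)(BLK)(^V,^[not])=0;"]

he = tok("he", "PRO")
has_aux = tok("has", "AUX")      # auxiliary reading
has_main = tok("has", "V")       # main-verb reading

EXAMPLES = {
    "He has[AUX] done that": sentence(he, has_aux, tok("done", "V"), tok("that", "PRO")),
    "He has[AUX] a car": sentence(he, has_aux, tok("a", "D"), tok("car", "N")),
    "He has[V] a car": sentence(he, has_main, tok("a", "D"), tok("car", "N")),
    "He has[AUX] not done that": sentence(he, has_aux, tok("not", "A"),
                                          tok("done", "V"), tok("that", "PRO")),
}

for label, tokens in EXAMPLES.items():
    print(label, "->", "blocked" if blocked(tokens, RULES) else "allowed")
```

Only the auxiliary reading in "He has a car" is blocked, which leaves the main-verb reading as the remaining choice for that sentence.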
Software