English Disambiguation Grammar: Difference between revisions

From UNLwiki
imported>Martins

Revision as of 14:08, 28 July 2012

The English disambiguation grammars, or English d-grammars, are part of the English grammar. They are used to improve the results of tokenization and to control the application of t-rules. They follow the formalism described at UNL Grammar Specs and are used both in natural language analysis (UNLization) and in natural language generation (NLization).

UNLization

In natural language analysis, the d-grammar is used to control the tokenization of English sentences, i.e., to prevent wrong lexical choices and to induce the best matches. The d-grammar comprises two types of rules:

  • Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices.
    For instance, the rule (D)(BLK)(V)=0; states that the sequence determiner + blank space + verb is not allowed, i.e., a determiner cannot come before a verb.
  • Positive rules, where the probability is greater than 0, force lexical choices.
    For instance, the rule (['s],V)(BLK)(GER)=1; states that the entry ['s] as a verb is to be preferred before a gerund. There are three entries ['s] in the dictionary: the contracted form of "is", the particle used to form the genitive, and a plural suffix. Without this rule, the system would simply select the first one appearing in the dictionary with the highest frequency.
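
The two rule types can be told apart mechanically from the probability after the equals sign. The following is a minimal Python sketch, assuming only the flat form shown above (a sequence of parenthesized nodes, then "=", a probability and ";"). The function names parse_d_rule and rule_kind are hypothetical and are not part of any UNL tool; the full formalism in the UNL Grammar Specs is not covered.

```python
import re

def parse_d_rule(rule: str):
    """Split a flat d-rule such as "(D)(BLK)(V)=0;" into node patterns and a probability.

    Assumption: the rule is a plain sequence of parenthesized nodes
    followed by "=<probability>;" as in the examples on this page.
    """
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    nodes = re.findall(r"\(([^()]*)\)", body)
    return nodes, float(prob)

def rule_kind(rule: str) -> str:
    """Return "negative" for blocking rules (probability 0), "positive" otherwise."""
    _, prob = parse_d_rule(rule)
    return "negative" if prob == 0 else "positive"

if __name__ == "__main__":
    for r in ["(D)(BLK)(V)=0;", "(['s],V)(BLK)(GER)=1;"]:
        print(r, parse_d_rule(r), rule_kind(r))
```

The sketch only classifies rules; how the probabilities are weighed against dictionary frequency during tokenization is left to the engine.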

File

The English d-grammar for the Corpus500 may be downloaded from eng_ana_dgrammar.txt (http://www.unlweb.net.br/resources/corpus500/eng_ana_dgrammar.txt). The complete English d-grammar may be exported from the UNLarium: UNLWEB > UNLARIUM > GRAMMAR > ENGLISH > EXPORT.

How to use d-grammars

The d-grammar must be uploaded to, or entered directly in, the d-rules tab of IAN.

Examples of disambiguation rules

Preventing hyper-segmentation of temporary entries
"asdfg" must be tokenized as [asdfg] (a single temporary entry) instead of [as][dfg], which would otherwise be the result because [as] is in the dictionary.
(^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
This rule states that words must be isolated by blank spaces, punctuation signs, or the sentence boundaries SHEAD and STAIL, except in the case of subwords, digits, prefixes and suffixes.
(^DIGIT)({[st]|[nd]|[rd]|[th]})=0;
The subwords [st], [nd], [rd] and [th] may only appear after a number ("1st" or "tenth"). This prevents hyper-segmentation as in "asdfgst" = [asdfg][st], which would not be blocked by the rule above, because [st] is a suffix.
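
The effect of the first blocking rule on "asdfg" can be illustrated with a toy model. The Python sketch below is not the IAN tokenizer: it simply enumerates every split of a string into dictionary entries and temporary words, then discards any split in which two tokens touch without a separator between them. The names segmentations, is_blocked and SEPARATORS are hypothetical, and the exceptions for subwords, digits, prefixes and suffixes are deliberately ignored.

```python
SEPARATORS = {" ", ",", "."}  # assumption: stand-ins for blank space and punctuation

def segmentations(text, dictionary):
    """Yield every split of text into (string, kind) tokens.

    kind is "DICT" if the piece is in the dictionary, otherwise "TEMP".
    """
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        kind = "DICT" if head in dictionary else "TEMP"
        for rest in segmentations(text[end:], dictionary):
            yield [(head, kind)] + rest

def is_blocked(segmentation):
    """Simplified blocking rule: two adjacent tokens, neither of which is a separator."""
    pairs = zip(segmentation, segmentation[1:])
    return any(a[0] not in SEPARATORS and b[0] not in SEPARATORS for a, b in pairs)

candidates = list(segmentations("asdfg", {"as"}))
allowed = [seg for seg in candidates if not is_blocked(seg)]

if __name__ == "__main__":
    print(len(candidates), "candidates;", "allowed:", allowed)
```

Under these assumptions the candidate [as][dfg] is generated but blocked, and the single temporary entry [asdfg] is the only split that survives.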
Preventing the generation of two temporary words in sequence
"asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] instead of [asdfg][ ][hijkl], which would otherwise be the result because [ ] is in the dictionary.
(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
A temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence.
(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
A temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence.
Determiners vs. pronouns
There are many ambiguities in the dictionary between determiners and pronouns. The string "that", for instance, is represented in the dictionary both as a demonstrative determiner ("that book") and as a demonstrative pronoun ("that is the book"). The following rules help differentiate them:
(D,^AFT)({PUT|STAIL})=0;
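Determiners cannot come at the end of the sentence, unless their distribution is AFT (after). In "He said that", "that" is therefore not a determiner, whereas in "There are books enough" the determiner "enough" is allowed because it follows the noun.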
(D)(BLK)(V)=0;
Determiners cannot come before a verb. In "That is the book", "that" is therefore not a determiner.
Auxiliary verbs vs. main verbs
Many auxiliary verbs may also play the role of main verbs: in "He has done that", "has" is an auxiliary, whereas in "He has a car", "has" is the main verb. The following rule helps differentiate them:
(AUX,^COP)(BLK)(^V,^[not])=0;
Auxiliary verbs that are not copulas must be followed by a verb or the adverb [not].
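
The determiner and auxiliary rules above can be tried out on hand-tagged tokens with a small matcher. The Python sketch below reads the rule notation the way the explanations on this page suggest, which is an assumption rather than the official semantics: a comma joins conditions that must all hold, "^" negates a condition, {a|b} lists alternatives, and [x] refers to the entry x itself. Quoted strings such as " " are not handled. The helper names and the pronoun tag "PRO" are hypothetical.

```python
import re

def matches(pattern, token):
    """Check one node pattern against a token {"word": str, "feats": set}."""
    pattern = pattern.strip()
    if pattern.startswith("{") and pattern.endswith("}"):
        # {a|b}: any alternative may match
        return any(matches(alt, token) for alt in pattern[1:-1].split("|"))
    for cond in pattern.split(","):
        cond = cond.strip()
        negated = cond.startswith("^")
        if negated:
            cond = cond[1:]
        if cond.startswith("[") and cond.endswith("]"):
            hit = token["word"] == cond[1:-1]  # [x]: the entry itself
        else:
            hit = cond in token["feats"]       # plain feature name
        if hit == negated:                     # condition violated
            return False
    return True

def is_blocked(rule, tokens):
    """True if a probability-0 rule matches some window of consecutive tokens."""
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    nodes = re.findall(r"\(([^()]*)\)", body)
    if float(prob) != 0:
        return False
    for start in range(len(tokens) - len(nodes) + 1):
        window = tokens[start:start + len(nodes)]
        if all(matches(n, t) for n, t in zip(nodes, window)):
            return True
    return False

def tok(word, *feats):
    return {"word": word, "feats": set(feats)}

BLANK = tok(" ", "BLK")
DET_RULE = "(D)(BLK)(V)=0;"
AUX_RULE = "(AUX,^COP)(BLK)(^V,^[not])=0;"

if __name__ == "__main__":
    # "That is ...": the determiner reading of "that" is blocked, the pronoun reading is not
    print(is_blocked(DET_RULE, [tok("that", "D"), BLANK, tok("is", "V")]))
    print(is_blocked(DET_RULE, [tok("that", "PRO"), BLANK, tok("is", "V")]))
    # "has a ...": the auxiliary reading of "has" is blocked; "has done" and "has not" are fine
    print(is_blocked(AUX_RULE, [tok("has", "AUX"), BLANK, tok("a", "D")]))
    print(is_blocked(AUX_RULE, [tok("has", "AUX"), BLANK, tok("done", "V")]))
    print(is_blocked(AUX_RULE, [tok("has", "AUX"), BLANK, tok("not")]))
```

A blocked reading is simply discarded in this sketch, which leaves the competing dictionary entry (pronoun or main verb) as the remaining choice.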