English Disambiguation Grammar

From UNL Wiki

Revision as of 15:42, 27 July 2012

The English Disambiguation Grammar is used to control the tokenization of English sentences. It comprises two types of rules:

  • Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
  • Positive rules, where the probability is more than 0, force lexical choices
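The distinction between the two rule types can be sketched in a few lines of Python. This is only an illustration, assuming that a rule is written as two parenthesized nodes followed by "=" and a probability, as in the rules listed on this page. The function name classify_rule is hypothetical, and the rule (X)(Y)=1; uses placeholder features X and Y that are not part of the grammar.

```python
def classify_rule(rule: str) -> str:
    """Return 'negative' if the rule's probability is 0, otherwise 'positive'."""
    # Split on the last "=": everything before it is the node pattern,
    # everything after it is the probability (the trailing ";" is dropped).
    body, _, prob = rule.strip().rstrip(";").rpartition("=")
    if not body:
        raise ValueError(f"no '=' found in rule: {rule!r}")
    return "negative" if float(prob) == 0 else "positive"


# A rule taken from this page (probability 0, so it blocks a lexical choice).
print(classify_rule('(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;'))
# A made-up rule with placeholder features, only to show the positive case.
print(classify_rule('(X)(Y)=1;'))
```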

Negative rules

The most important negative rules are used to avoid hyper-segmentation of temporary entries:

Preventing the hyper-segmentation of temporary entries
"asdfg" must be tokenized as [asdfg] (a single temporary entry) rather than as [as][dfg], which would otherwise happen because [as] is in the dictionary
(^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
This rule states that words must be delimited by blank spaces, punctuation signs, SHEAD or STAIL, except in the case of subwords, prefixes and suffixes.
Preventing the generation of two temporary words in sequence
"asdfg hijkl" will be represented as a single temporary word "asdfg hijkl" instead of two temporary words "asdfg" and "hijkl" separated by a blank space
(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
(a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
(a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
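How such a blocking rule might be checked against two adjacent nodes can be sketched as follows. This is not the actual tokenizer, only a minimal reading of the rules above under stated assumptions: the conditions inside one pair of parentheses are taken to hold together, "^" is taken to negate a condition, and a quoted string is compared with the node's surface form. The node representation (a surface string plus a set of features), the function names and the DICT feature are hypothetical and serve only the illustration.

```python
import re

# Two parenthesized condition lists followed by "=probability" and an optional ";".
RULE_RE = re.compile(r'^\(([^()]*)\)\(([^()]*)\)=(\d+(?:\.\d+)?);?$')


def parse_rule(rule):
    """Split a rule into (left conditions, right conditions, probability)."""
    match = RULE_RE.match(rule.strip())
    if match is None:
        raise ValueError(f"unrecognized rule: {rule!r}")
    left, right, prob = match.groups()
    # Assumes quoted strings never contain a comma or a parenthesis.
    return left.split(","), right.split(","), float(prob)


def satisfies(node, conditions):
    """Check one node, given as (surface, features), against all conditions."""
    surface, features = node
    for cond in conditions:
        negated = cond.startswith("^")
        name = cond[1:] if negated else cond
        if name.startswith('"') and name.endswith('"'):
            holds = surface == name[1:-1]   # quoted string: compare surface form
        else:
            holds = name in features        # bare name: look up the feature
        if holds == negated:                # condition violated
            return False
    return True


def is_blocked(left_node, right_node, rule):
    """True if the rule has probability 0 and both nodes match its conditions."""
    left, right, prob = parse_rule(rule)
    return prob == 0 and satisfies(left_node, left) and satisfies(right_node, right)


RULE = '(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;'

# [as][dfg]: a temporary entry directly after a non-blank node.
print(is_blocked(("as", {"DICT"}), ("dfg", {"TEMP"}), RULE))
# A temporary entry preceded by a blank space.
print(is_blocked((" ", set()), ("dfg", {"TEMP"}), RULE))
```

Under these assumptions the first call reports the [as][dfg] segmentation as blocked, because the temporary entry is not preceded by a blank space, a punctuation sign or SHEAD, while the second call does not match the rule.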

Positive rules
