English Disambiguation Grammar
From UNLwiki
				
				
Revision as of 13:42, 27 July 2012
The English Disambiguation Grammar controls the tokenization of English sentences. It comprises two types of rules:
- Negative (blocking) rules, where the probability is equal to 0, prevent lexical choices
- Positive rules, where the probability is greater than 0, force lexical choices
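The distinction between the two rule types can be illustrated with a toy selector. This is only a sketch under stated assumptions, not the engine's actual procedure: rules are modelled as hypothetical (predicate, probability) pairs over a whole candidate tokenization, a matching rule with probability 0 discards the candidate, and among the survivors the one with the highest matched probability is preferred. The function name `choose` and the example rules are invented for illustration.

```python
def choose(candidates, rules):
    """Pick one candidate tokenization under a toy reading of the two rule types.

    candidates: list of tokenizations, each a list of strings
    rules: list of (predicate, probability) pairs (hypothetical representation)
    """
    best, best_score = None, -1
    for cand in candidates:
        # Probabilities of every rule whose predicate matches this candidate.
        probs = [p for pred, p in rules if pred(cand)]
        if 0 in probs:
            # A negative (blocking) rule matched: the candidate is discarded.
            continue
        # Assumption: positive rules rank candidates by their highest probability.
        score = max(probs, default=0)
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical rules, for illustration only.
example_rules = [
    (lambda toks: "dfg" in toks, 0),        # negative: blocks this segmentation
    (lambda toks: toks == ["asdfg"], 1),    # positive: favours this one
]

selected = choose([["as", "dfg"], ["asdfg"]], example_rules)
```

In this sketch the segmented candidate is dropped by the blocking rule, so only the unsegmented one can be selected.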
Negative rules
The most important negative rules are used to avoid hyper-segmentation of temporary entries:
- Preventing the hyper-segmentation of temporary entries
- "asdfg" must be tokenized as [asdfg] (a single temporary entry) instead of [as][dfg], which would otherwise be produced because [as] is in the dictionary
- (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
- This rule states that words must be delimited by blank spaces, punctuation signs, or the sentence boundaries (SHEAD and STAIL), except in the case of subwords, prefixes and suffixes.
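The effect of the rule above on the "asdfg" example can be traced with a small model. This is a minimal sketch, assuming that each token carries a set of feature labels, that a quoted condition such as " " is compared with the token's surface text, that a leading ^ negates a condition, and that a rule applies to any pair of adjacent tokens. The `Token` class, the helper names and the feature sets assigned to the example tokens (including the `DICT` label) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    features: frozenset = frozenset()


def matches(token, conditions):
    """True if the token satisfies every condition in the list."""
    for cond in conditions:
        negated = cond.startswith("^")
        atom = cond[1:] if negated else cond
        if atom.startswith('"') and atom.endswith('"'):
            holds = token.text == atom[1:-1]   # quoted string: compare surface text
        else:
            holds = atom in token.features     # otherwise: feature label
        if holds == negated:
            return False
    return True


def is_blocked(tokens, rule):
    """True if any adjacent pair of tokens matches both sides of the rule."""
    left, right = rule
    return any(matches(a, left) and matches(b, right)
               for a, b in zip(tokens, tokens[1:]))


# The rule from the text, written as two condition lists (probability 0 implied).
RULE = (
    ['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^PFX', '^SHEAD'],
    ['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^SFX', '^STAIL'],
)

# Hypothetical feature assignments for the example tokens.
SHEAD = Token("", frozenset({"SHEAD"}))
STAIL = Token("", frozenset({"STAIL"}))
segmented = [SHEAD, Token("as", frozenset({"DICT"})),
             Token("dfg", frozenset({"TEMP"})), STAIL]
whole = [SHEAD, Token("asdfg", frozenset({"TEMP"})), STAIL]

segmented_blocked = is_blocked(segmented, RULE)
whole_blocked = is_blocked(whole, RULE)
```

Under these assumptions the pair [as][dfg] matches both sides of the rule and is blocked, while [asdfg] only ever sits next to SHEAD or STAIL, which the negated conditions exclude.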
 
- Preventing the generation of two temporary words in sequence
- "asdfg hijkl" will be represented as a single temporary word "asdfg hijkl" instead of two temporary words, "asdfg" and "hijkl", separated by a blank space
- (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence)
 
- (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
- (a temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence)
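The rules listed on this page share one textual shape: two parenthesised condition lists followed by = and a probability, terminated by a semicolon. The sketch below splits a rule of that shape into its parts. It is an illustration, not the grammar's official parser: the function name `parse_rule` is hypothetical, and the comma split assumes that quoted strings contain no commas or parentheses, which holds for the rules shown here.

```python
import re

# Two parenthesised groups, "=", an integer probability, and a closing ";".
RULE_RE = re.compile(r'^\((?P<left>.*)\)\((?P<right>.*)\)=(?P<prob>\d+);$')


def parse_rule(line):
    """Split a rule line into (left conditions, right conditions, probability)."""
    m = RULE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a two-node rule: {line!r}")
    left = m["left"].split(",")
    right = m["right"].split(",")
    return left, right, int(m["prob"])


followed_by = parse_rule('(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;')
preceded_by = parse_rule('(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;')
```

Each condition keeps its leading ^ so that a later matching step can tell negated conditions from plain ones.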