English Disambiguation Grammar: Difference between revisions
From UNLwiki
Both revisions by imported>Martins (no edit summary)
Revision as of 13:44, 27 July 2012
The English Disambiguation Grammar controls the tokenization of English sentences. It comprises two types of rules:
- Negative (blocking) rules, whose probability is equal to 0, prevent lexical choices
- Positive rules, whose probability is greater than 0, force lexical choices
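The distinction between the two rule types can be sketched in a few lines of Python. This is only an illustration, not the UNL engine's implementation: the `Rule` class and `parse_rule` helper are hypothetical names, the sketch assumes every rule has the shape `pattern=probability;`, and the positive rule `(A)(B)=1;` used in the usage note is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str        # rule body, kept in the grammar's own notation
    probability: float  # value written after the final "="

    @property
    def kind(self) -> str:
        # Probability 0 blocks a lexical choice; anything above 0 favours one.
        return "negative" if self.probability == 0 else "positive"


def parse_rule(text: str) -> Rule:
    # Assumption: the probability is whatever follows the last "=".
    body, _, prob = text.strip().rstrip(";").rpartition("=")
    return Rule(pattern=body, probability=float(prob))


blocking = parse_rule('(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;')
favouring = parse_rule("(A)(B)=1;")  # hypothetical positive rule
```

Here `blocking.kind` evaluates to `"negative"` and `favouring.kind` to `"positive"`.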
Negative rules
The most important negative rules are used to avoid hyper-segmentation of temporary entries:
- Preventing the hyper-segmentation of temporary entries
  - "asdfg" must be tokenized as [asdfg] (a single temporary entry) rather than [as][dfg], which would otherwise be produced because [as] is in the dictionary
  - (^W,^" ",^PUT,^SBW,^DIGIT,^PFX,^SHEAD)(^W,^" ",^PUT,^SBW,^DIGIT,^SFX,^STAIL)=0;
    - This rule states that words must be isolated by blank spaces, punctuation signs, or the sentence boundaries SHEAD and STAIL, except in the case of subwords, prefixes and suffixes.
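One way to read the hyper-segmentation rule is as a test on a pair of adjacent nodes. The sketch below is a simplified model rather than the actual UNL matcher, and it rests on several assumptions: each node is a plain set of feature labels, the blank space `" "` is treated as one more label, a comma-separated list means that all conditions must hold, and `^X` means "does not have X". The label `DICT` is a hypothetical stand-in for whatever features the dictionary entry [as] really carries.

```python
# Conditions copied from the rule; a leading "^" means "does not have".
LEFT = ['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^PFX', '^SHEAD']
RIGHT = ['^W', '^" "', '^PUT', '^SBW', '^DIGIT', '^SFX', '^STAIL']


def satisfies(features, conditions):
    """Return True when the node's feature set meets every condition."""
    for cond in conditions:
        label = cond.lstrip("^").strip('"')
        negated = cond.startswith("^")
        # A negated condition fails if the label is present;
        # a plain condition fails if the label is absent.
        if negated == (label in features):
            return False
    return True


def is_blocked(left, right):
    """True when the pair matches the rule, i.e. gets probability 0."""
    return satisfies(left, LEFT) and satisfies(right, RIGHT)


as_entry = {"DICT"}   # hypothetical features of the dictionary entry [as]
dfg_temp = {"TEMP"}   # [dfg], not found in the dictionary

split_is_blocked = is_blocked(as_entry, dfg_temp)      # [as][dfg]
head_is_blocked = is_blocked({"SHEAD"}, {"TEMP"})      # SHEAD + [asdfg]
tail_is_blocked = is_blocked({"TEMP"}, {"STAIL"})      # [asdfg] + STAIL
prefix_is_blocked = is_blocked({"PFX"}, {"TEMP"})      # prefix exception
```

Under these assumptions only the [as][dfg] split is blocked; the pairs involving SHEAD, STAIL, or a prefix are left alone.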
 
- Preventing the generation of two temporary words in sequence
  - "asdfg hijkl" must be represented as a single temporary word [asdfg hijkl] rather than [asdfg][ ][hijkl], which would otherwise be produced because [ ] is in the dictionary
  - (TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;
    - A temporary word, i.e., a word not found in the dictionary, must be followed by a blank space or the end of the sentence.
  - (^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;
    - A temporary word, i.e., a word not found in the dictionary, must be preceded by a blank space, a punctuation sign or the beginning of the sentence.
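The two rules on temporary words can also be applied mechanically to a whole candidate segmentation. The following is a rough sketch under stated assumptions, not the real tokenizer: rules are parsed from their text with a regular expression, nodes are sets of feature labels with `" "` treated as a label, and the helper names `parse_rule`, `holds` and `violations` are hypothetical. It only illustrates the literal reading of the two rules, namely that a temporary node directly adjacent to another temporary node receives probability 0.

```python
import re

RULES = [
    '(TEMP,^" ",^DIGIT,^W)(^" ",^STAIL)=0;',
    '(^" ",^PUT,^SHEAD)(TEMP,^" ",^W)=0;',
]


def parse_rule(text):
    """Split '(a,b)(c,d)=P;' into (left conditions, right conditions, P)."""
    groups = re.findall(r"\(([^()]*)\)", text)
    prob = float(text.strip().rstrip(";").rpartition("=")[2])
    left, right = ([c.strip() for c in g.split(",")] for g in groups)
    return left, right, prob


def holds(cond, features):
    """'X' requires label X; '^X' requires its absence."""
    negated = cond.startswith("^")
    label = cond.lstrip("^").strip('"')
    return (label in features) != negated


def violations(nodes, rules):
    """List (position, rule index) for each adjacent pair hit by a blocking rule."""
    parsed = [parse_rule(r) for r in rules]
    found = []
    for i, (a, b) in enumerate(zip(nodes, nodes[1:])):
        for r, (left, right, prob) in enumerate(parsed):
            if prob == 0 and all(holds(c, a) for c in left) \
                    and all(holds(c, b) for c in right):
                found.append((i, r))
    return found


SHEAD, STAIL, TEMP = {"SHEAD"}, {"STAIL"}, {"TEMP"}

single = violations([SHEAD, TEMP, STAIL], RULES)        # one temporary word
adjacent = violations([SHEAD, TEMP, TEMP, STAIL], RULES)  # two in a row
```

In this model `single` is empty, while `adjacent` reports the middle pair once for each of the two rules.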