N-rule
An N-rule, or normalization rule, is a special type of transformation rule used to prepare natural language input for automatic processing. N-rules constitute the pre-processing module, which applies to the input as a string and runs prior to tokenization. The set of N-rules forms the Normalization Grammar, or N-Grammar.
When to use N-rules
N-rules are used to normalize the input string PRIOR to the processing, i.e., before any dictionary search. They have two roles:
- to normalize the input text (to replace abbreviations by their extended forms, to expand contractions, etc.)
- to segment the natural language text into sentences (i.e., to create the tags <SHEAD> (beginning of a sentence), <STAIL> (end of a sentence), <CHEAD> (beginning of a scope) and <CTAIL> (end of a scope) inside the input text). These tags are used as sentence and clause boundaries, and define the units of processing of the Transformation and Disambiguation grammars.
When not to use N-rules
N-rules cannot be used when a rule depends on information extracted from the dictionary (such as part-of-speech, number, gender, etc.), because they apply before any dictionary search.
Syntax
N-rules comply with the syntax below:
(<NODE>)(<NODE>)...(<NODE>) := (<NODE>)(<NODE>)...(<NODE>);
Where:
- <NODE> is a string or a regular expression. Strings are always represented between "quotes"; regular expressions (for strings) between "/forward slashes inside quotes/".
- the left side of the operator := states the condition
- the right side of the operator := states the action to be performed over each string of the condition.
Types
N-rules are used to:
- replace strings: "axb" > "ayb"
- delete strings: "axb" > "ab"
- create strings: "ab" > "axb"
- reorder strings: "ab" > "ba"
- assign sentence boundaries: "ab" > "a"<STAIL>"b"
Examples
- Replacement of strings
- ("Mr."):=("Mister"); (replace "Mr." by "Mister")
- ("Mr")("."):=("Mister"); (the same as above)
- ("doctor"):=("dr."); (replace "doctor" by "dr.")
- ("an "):=("a "); (replace "an " by "a ")
- ("don't"):=("do not"); (replace "don't" by "do not")
- ("don't"):=("do")(" ")("not"); (the same as above)
- Deletion of strings
- ("/[A-Z]/",%x)(".",%y):=(%x); (deletes the "." after capital letters)
- Creation of strings
- (SHEAD,%x)(^" ",%y):=(%x)(" ",%z)(%y); (add a blank space after the beginning of the sentence)
- Reordering of strings
- ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x); (reorder "Am I" as "I Am")
- Segmentation (see below)
- (".",%x):=(%x)(+STAIL,%y); (creates an STAIL node after a ".";[1])
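The examples above can be approximated outside the UNL environment. The following is a minimal Python sketch, not the UNL engine: it assumes that each N-rule can be mimicked by a regular-expression substitution and that rules are applied in the order listed. The names RULES and normalize are hypothetical and chosen only for illustration.

```python
import re

# Each entry mimics one N-rule from the examples above as a (pattern, replacement) pair.
# This is an analogy built on Python's `re` module, not the UNL engine itself.
RULES = [
    (re.compile(r"Mr\."), "Mister"),         # ("Mr."):=("Mister");
    (re.compile(r"don't"), "do not"),        # ("don't"):=("do not");
    (re.compile(r"([A-Z])\."), r"\1"),       # ("/[A-Z]/",%x)(".",%y):=(%x);
    (re.compile(r"(Am)( )(I)"), r"\3\2\1"),  # ("Am",%x)(" ",%y)("I",%z):=(%z)(%y)(%x);
]


def normalize(text):
    """Apply every rule, in the listed order, over the whole input string."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize("Mr. Smith, Am I late?"))  # Mister Smith, I Am late?
```

The capture groups play the role of the indexes %x, %y and %z: a group that is not referenced in the replacement is deleted, and the order of the references defines the new order of the strings.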
Segmentation
In the UNL framework, natural language segmentation is done through the following tags:
- <SHEAD> indicates the beginning of a sentence
- <STAIL> indicates the end of a sentence
- <CHEAD> indicates the beginning of a scope (any portion of text smaller than a sentence)
- <CTAIL> indicates the end of a scope (any portion of text smaller than a sentence)
The tags <SHEAD> and <STAIL> define the sentence boundaries and are automatically assigned by the system according to line breaks and paragraph breaks. No punctuation sign is used as a sentence boundary by default. In order to break the input text into other portions, the corresponding N-rules must be provided. This is done by appending empty nodes with the features SHEAD, STAIL, CHEAD or CTAIL to the left or to the right of existing strings.
- Original text: <SHEAD>abcde<STAIL>
- Rule: ("c",%x):=(%x)(STAIL);
- Modified text: <SHEAD>abc<STAIL><SHEAD>de<STAIL>
- Observations
- The tag <SHEAD> is assigned automatically after <STAIL>
- The tag <STAIL> is assigned automatically before <SHEAD>
- The tag <CHEAD> is assigned automatically after <CTAIL>
- The tag <CTAIL> is assigned automatically before <CHEAD>
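The effect of the segmentation rule above, together with the automatic assignment of <SHEAD> after <STAIL>, can be sketched in Python. This is only an illustration under simplifying assumptions: tags are treated as literal substrings of the text, and split_after is a hypothetical helper, not part of any UNL tool. The negative lookahead skips markers that are already followed by <STAIL>.

```python
import re

SHEAD, STAIL = "<SHEAD>", "<STAIL>"


def split_after(text, marker):
    """Insert <STAIL><SHEAD> after each `marker` not already followed by <STAIL>."""
    pattern = re.compile(re.escape(marker) + "(?!" + re.escape(STAIL) + ")")
    # A function is used as the replacement so that `marker` is copied literally.
    return pattern.sub(lambda m: m.group(0) + STAIL + SHEAD, text)


print(split_after("<SHEAD>abcde<STAIL>", "c"))
# <SHEAD>abc<STAIL><SHEAD>de<STAIL>
```

Because the sketch works on raw substrings, a marker that also occurs inside a tag name (for instance "S") would be split as well; a real implementation would have to keep tags apart from the text.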
Properties
- N-rules can only manipulate strings or regular expressions. Features (such as N, NOU, MCL, etc.) cannot be used in N-rules.
- ("Mr."):=("Mister"); (string manipulation)
- ("/[A-Z]/",%x)(".",%y):=(%x); (regular expression manipulation)
- ("Mr.",ABB):=("Mister"); (this is not an N-rule, because it involves a non-string element, i.e., ABB)
- Regular expressions may only be used in the left side.
- ("/[A-Z]/",%x)(".",%y):=(%x);
- ("/[A-Z]/")("."):=("/[A-Z]/"); (invalid, because the right side contains a regular expression)
- N-rules are recursive: rules will apply while conditions are true:
- The rule "(" "):=("-");" will transform "a b c d e" into "a-b-c-d-e" (and not only into "a-b c d e")
- The symbol ^ is used for negation and may be used to prevent infinite loops:
- The rule (".",%x):=(%x)(+STAIL,%y); contains a loop, and will lead to (".")(STAIL)(STAIL)(STAIL)(STAIL).... In order to prevent that, we have to indicate that STAIL must be added if it does not exist yet, i.e.: (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z);
- On the right side, changes inside each node may be expressed in the same way as on the right side of A-rules. The default is replacement.
- The rule "("a")(" ")("/[aeiou].*/"):=("an")( )( );" could also be expressed as "("a")(" ")("/[aeiou].*/"):=(0>"n")( )( );", i.e., the change from "a" to "an" could be expressed either by "an" or 0>"n".
- Rules apply only if all conditions are true.
- The rule "("a")(" ")("/[aeiou].*/"):=("an")( )( );" will apply only in case of "a" before a blank and a node starting with "a", "e", "i", "o" or "u".
- Nodes may be deleted through replacement by zero:
- (" "):=; (deletes all the blank spaces)
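The recursive behaviour and the need for a negated condition can be illustrated with a short Python sketch. It assumes that "rules apply while conditions are true" can be modelled as repeating a single substitution until the text stops changing; apply_until_stable and its max_passes safety limit are hypothetical and exist only to keep the unguarded rule from running forever.

```python
import re


def apply_until_stable(text, pattern, replacement, max_passes=100):
    """Apply one substitution per pass until nothing changes or the limit is hit.

    Returns the final text and the number of passes that changed it.
    """
    for passes in range(max_passes):
        new_text = re.sub(pattern, replacement, text, count=1)
        if new_text == text:
            return text, passes
        text = new_text
    return text, max_passes


# (" "):=("-"); keeps applying while a blank remains.
print(apply_until_stable("a b c d e", " ", "-"))

# (".",%x):=(%x)(+STAIL,%y); the condition stays true, so only the limit stops it.
print(apply_until_stable("a.", r"\.", ".<STAIL>", max_passes=5))

# (".",%x)(^STAIL,%z):=(%x)(+STAIL,%y)(%z); the negated condition ends the loop.
print(apply_until_stable("a.", r"\.(?!<STAIL>)", ".<STAIL>"))
```

In the third call the negative lookahead plays the role of ^STAIL: once the tag has been added, the condition is no longer true and the rule stops applying.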
Indexes
Indexes are used to associate nodes in the left side of the rule (CONDITION) to nodes in the right side of the rule (ACTION):
- (%a)(%b)(%c):=(%b); (delete the first and the third nodes, and keep the second)
- (%a)(%b)(%c):=(%c)(%b)(%a); (reverse the order)
Indexation is done automatically by the machine, as follows:
- if the number of nodes is the same in the left and in the right side, NODES ARE CO-INDEXED
- ("a")("b")("c"):=("d")("e")("f"); is the same as ("a",%01)("b",%02)("c",%03):=("d",%01)("e",%02)("f",%03); (i.e., "a" will be replaced by "d", "b" by "e", and "c" by "f")
- if the number of nodes is not the same in both sides, NODES ARE NOT CO-INDEXED
- ("a")("b")("c"):=("d")("e"); is the same as ("a",%01)("b",%02)("c",%03):=("d",%04)("e",%05); (i.e., "a", "b" and "c" will be deleted, and "d" and "e" will be created)
In order to avoid ambiguities, it is highly recommended that indexes are replaced by user-defined labels made of any sequence of alphabetic characters and underscore:
- (A,%a)(B,%b):=(C,%a)(D,%b);
Numeric characters cannot be used as user-defined indexes:
- (A,%03)(B,%05):=(C,%03)(D,%05); (invalid, because the indexes are numeric)
Nodes from the left side that are not co-indexed to nodes on the right side are deleted; nodes from the right side that are not co-indexed to nodes on the left side are created.
- ("a",%x)("b",%y):=(%x); (the node %y is deleted)
- ("a")("b")("c"):=("a")("c"); (all the nodes from the left side are deleted, and the nodes from the right side are created in their place)
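The automatic indexation described above can be sketched as a small Python model. It is a simplified illustration that assumes no node carries a user-defined index; auto_index and effects are hypothetical names, and nodes are represented as plain strings.

```python
def auto_index(left, right):
    """Return both sides as lists of (string, index) pairs."""
    left_indexed = [(s, f"%{i:02d}") for i, s in enumerate(left, start=1)]
    if len(left) == len(right):
        # Same number of nodes: the right side reuses the left-side indexes.
        right_indexed = [(s, idx) for s, (_, idx) in zip(right, left_indexed)]
    else:
        # Different number of nodes: the right side receives fresh indexes.
        first_new = len(left) + 1
        right_indexed = [(s, f"%{i:02d}") for i, s in enumerate(right, start=first_new)]
    return left_indexed, right_indexed


def effects(left_indexed, right_indexed):
    """List the strings that are deleted and the strings that are created."""
    left_ids = {idx for _, idx in left_indexed}
    right_ids = {idx for _, idx in right_indexed}
    deleted = [s for s, idx in left_indexed if idx not in right_ids]
    created = [s for s, idx in right_indexed if idx not in left_ids]
    return deleted, created


print(effects(*auto_index(["a", "b", "c"], ["d", "e", "f"])))  # ([], [])
print(effects(*auto_index(["a", "b", "c"], ["d", "e"])))       # (['a', 'b', 'c'], ['d', 'e'])
```

In the first call every node is co-indexed, so nothing is deleted or created and each string is simply replaced. In the second call no index is shared, so all left-side nodes are deleted and all right-side nodes are created.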
Common mistakes
- "Mr":="Mister"; (wrong) Conditions and actions must always come between parentheses: ("Mr"):=("Mister");
- (Mr):=(Mister); (wrong) Strings must come between quotes (inside the parentheses): ("Mr"):=("Mister");
- ("Mr"):=("Mister") (wrong) Rules must end in a semicolon: ("Mr"):=("Mister");
- ("a")(" ")("/[aeiou].*/"):=("an"); (wrong) "a adjective">"a": the blank and the following form are deleted because they are not present on the right side.
- ("de")(" ")("/[aeiou].*/"):=("d'")("/[aeiou].*/"); (wrong) "de avoir">"d' ": co-indexation is based on ordering and not on features. The third form is deleted because it is not present on the right side; the second form, which is BLK, receives the feature VOW.
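The first three mistakes are purely formal and could be caught before a grammar is loaded. The following Python sketch shows one possible check; it is not an existing UNL tool, check_rule is a hypothetical name, and the list of system features is assumed to be limited to the four segmentation tags mentioned in this page. It does not parse quoted strings, so a string that itself contains "(" followed by a letter, ":=" or ";" may be misreported.

```python
import re

# Assumption: only the segmentation tags may appear unquoted as the head of a node.
SYSTEM_FEATURES = {"SHEAD", "STAIL", "CHEAD", "CTAIL"}
# A bare (unquoted) word right after "(", optionally preceded by ^ or +.
BARE_WORD = re.compile(r"\(\s*[\^+]?([A-Za-z_]\w*)")


def check_rule(rule):
    """Return a list of formal problems found in `rule` (empty if none detected)."""
    problems = []
    text = rule.strip()
    if not text.endswith(";"):
        problems.append("rule must end in a semicolon")
    body = text.rstrip(";")
    if ":=" not in body:
        return problems + ["missing := operator"]
    for side in body.split(":=", 1):
        side = side.strip()
        # An empty right side is allowed: it deletes the matched nodes.
        if side and not (side.startswith("(") and side.endswith(")")):
            problems.append("conditions and actions must be in parentheses")
        for word in BARE_WORD.findall(side):
            if word not in SYSTEM_FEATURES:
                problems.append(f"string {word!r} must be quoted")
    return list(dict.fromkeys(problems))  # drop duplicates, keep order


for candidate in ['"Mr":="Mister";', '(Mr):=(Mister);', '("Mr"):=("Mister")']:
    print(candidate, check_rule(candidate))
```

The last two mistakes in the list concern co-indexation rather than form, so a check of this kind would not detect them.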
N-rules and L-rules
N-rules and L-rules are basically the same. The only difference is that L-rules are part of the Transformation Grammar and, therefore, apply after tokenization, whereas N-rules constitute the N-Grammar and apply before tokenization. This means that N-rules may only deal with strings or regular expressions, whereas L-rules may also deal with other elements (such as features and UWs):
- L-rule
- ("I")(BLK)("am"):=("I'm"); (I am>I'm)
- ("a",PRE)(BLK)("a",ART):=("à",+ART,+CTC); (a a>à)
- ("de",PRE)(BLK)("le",ART):=("du",+ART,+CTC); (de le>du)
- N-rule
- ("I")(" ")("am"):=("I'm"); (replace "I am" by "I'm")
Note, in the above, that we may use dictionary features (such as BLK, PRE, ART) in L-rules, but we cannot use any dictionary feature in N-rules. The only features available in N-rules are the system-defined features, such as SHEAD (beginning of the sentence) and STAIL (end of the sentence).