Node
Revision as of 17:47, 16 August 2013
A node is the most elementary unit in the grammar. It is the result of the tokenization process, and corresponds to the notion of "lexical item". At the surface level, a natural language sentence is considered a list of nodes, and a UNL graph a set of relations between nodes.
Basic symbols
Symbol | Definition | Example |
---|---|---|
( ) | node | (%a) |
" " | string | "went" |
[ ] | natural language entry (headword) | [go] |
[[ ]] | UW | [[to go(icl>to move)]] |
// | regular expression | /a{2,3}/ = aa,aaa |
^ | not | ^a = not a |
{ \| } | or | {a\|b} = a or b |
% | index for nodes, attributes and values | %x |
# | index for sub-NLWs | #01 |
= | attribute-value assignment | POS=NOU |
! | rule trigger | !PLR |
& | merge operator | %x&%y |
? | dictionary lookup operator | ?[a] |
Elements
Any node is a vector (one-dimensional array) containing the following necessary elements:
- a string, to be represented between "quotes", which expresses the actual state of the node;
- a headword, to be represented between [square brackets], which expresses the original value of the node in the dictionary;
- a UW, to be represented between [[double square brackets]], which expresses the UW value of the node;
- a feature or set of features, expressed as simple attributes (e.g. NOU) or attribute-value pairs (e.g. POS=NOU), which describe the properties of the node;
- an index, preceded by the symbol %, which is used to reference the node.
The elements of a node can be:
- native, if defined in the dictionary; or
- non-native, if assigned by transformation rules.
Example
Consider the input string "an apple" and the dictionary[1] below:
- [an]{111}""(LEX=D,POS=ART)<eng,0,0>;
- [ ]{3333}""(LEX=O,POS=PUT,BLK)<eng,0,0>;
- [apple]{222}"apple(icl>fruit)"(LEX=N,POS=NOU)<eng,0,0>;
In the tokenization process, the input string is segmented into nodes according to the dictionary. This means that the input string above is analyzed as a list of three nodes:
- ("an",[an],[[]],LEX=D,POS=ART)
- (" ",[ ],[[]],LEX=O,POS=PUT,BLK)
- ("apple",[apple],[[apple(icl>fruit)]],LEX=N,POS=NOU)
Each node consists of:
- a string, between quotes ("an", " ", "apple");
- a headword, between brackets ([an],[ ],[apple]);
- a UW, between double brackets ([[]],[[]],[[apple(icl>fruit)]]);
- a set of features (LEX=D, POS=ART, LEX=O, ..., BLK)
These elements are said to be native because they are inherited from the dictionary.
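The segmentation above can be sketched in Python. This is only an illustration, not the actual tokenizer: the `Node` class, the `DICTIONARY` mapping and the `tokenize` function are hypothetical names, and the longest-match strategy is an assumption made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    string: str      # actual state of the node
    headword: str    # original value in the dictionary
    uw: str          # UW value ("" when the entry has none)
    features: set = field(default_factory=set)


# Hypothetical in-memory version of the three dictionary entries above:
# headword -> (UW, features)
DICTIONARY = {
    "an": ("", {"LEX=D", "POS=ART"}),
    " ": ("", {"LEX=O", "POS=PUT", "BLK"}),
    "apple": ("apple(icl>fruit)", {"LEX=N", "POS=NOU"}),
}


def tokenize(text):
    """Segment text into nodes, trying the longest headword first (assumed strategy)."""
    nodes = []
    pos = 0
    while pos < len(text):
        match = None
        for headword in sorted(DICTIONARY, key=len, reverse=True):
            if text.startswith(headword, pos):
                match = headword
                break
        if match is None:
            raise ValueError(f"no dictionary entry matches at position {pos}")
        uw, features = DICTIONARY[match]
        # All elements are native here: they are copied from the dictionary entry.
        nodes.append(Node(string=match, headword=match, uw=uw, features=set(features)))
        pos += len(match)
    return nodes


nodes = tokenize("an apple")
for node in nodes:
    print(node)
```

Right after tokenization, the string and the headword of each node are identical; they only diverge once rules start changing the string.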
During processing, any of these elements may be changed by T-rules. The resulting elements are said to be non-native because they are assigned by rules rather than inherited from the dictionary.
In any case, we may use any of these elements (native or non-native) to refer to a node:
- ("apple") (if the only relevant information is the string)
- ([apple]) (if the only relevant information is the headword retrieved in the dictionary)
- ([[apple(icl>fruit)]]) (if the only relevant information is the UW)
- (NOU) (if the only relevant information is the feature NOU)
- (POS=NOU) (if the only relevant information is the pair POS=NOU)
- ("apple",[apple],[[apple(icl>fruit)]],LEX=N,POS=NOU) (if all the elements of the node are important)
This is necessary in rules. Consider, for instance, the example below:
- INITIAL STATE: ("an",[an],[[]],LEX=D,POS=ART)
- RULE APPLIED: ("an"):=("a");[2]
- FINAL STATE: ("a",[an],[[]],LEX=D,POS=ART)
Note, in the above, that the node changed its string value from "an" to "a". As this was the only change intended, the rule referred only to the string value of the node ("an").
In addition to changing the string, we could have changed any element of the node:
- (ART):=(-ART); (delete the feature ART from the nodes having the feature ART)
- (ART,^NDEF):=(+NDEF); (add the feature NDEF to the nodes having the feature ART and not having the feature NDEF)
- ("an",ART,^NDEF):=("a",-ART,+NDEF); (for the nodes whose string is "an" and which have the feature ART but not the feature NDEF: set the string to "a", remove the feature ART and add the feature NDEF).
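The effect of such element-level rules can be mimicked with a small Python sketch. The `apply_rule` function and its parameters are hypothetical and not part of the grammar formalism. Features are modelled as a flat set of names, on the assumption (stated under Properties) that an attribute-value pair such as POS=ART is the same as having the two features POS and ART.

```python
def apply_rule(nodes, string=None, has=(), lacks=(),
               new_string=None, remove=(), add=()):
    """Apply one element-level rule, in place, to every node that matches."""
    for node in nodes:
        if string is not None and node["string"] != string:
            continue                      # the string condition is not met
        if not set(has) <= node["features"]:
            continue                      # a required feature is missing
        if set(lacks) & node["features"]:
            continue                      # a negated feature (^X) is present
        if new_string is not None:
            node["string"] = new_string   # the headword is left untouched
        node["features"] -= set(remove)   # -X
        node["features"] |= set(add)      # +X
    return nodes


node = {"string": "an", "headword": "an", "uw": "",
        "features": {"LEX", "D", "POS", "ART"}}

# ("an",ART,^NDEF):=("a",-ART,+NDEF);
apply_rule([node], string="an", has={"ART"}, lacks={"NDEF"},
           new_string="a", remove={"ART"}, add={"NDEF"})
print(node)
```

Applying the same call a second time leaves the node unchanged, because the node no longer has the string "an" nor the feature ART.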
Indexation
In most cases, we have to assign an index to the node. This happens when we want to perform operations over nodes instead of elements of nodes.
Consider, for instance, reversing the order of the input string "an apple" in order to generate "apple an". Suppose we write the rule:
- ("an")(" ")("apple"):=("apple")(" ")("an");
We would have the following output:
- ("apple",[an],[[]],LEX=D,POS=ART)
- (" ",[ ],[[]],LEX=O,POS=PUT,BLK)
- ("an",[apple],[[apple(icl>fruit)]],LEX=N,POS=NOU)
Note, in the above, that the rule has simply swapped the strings "an" and "apple" while leaving every other element of the nodes in place, which is not the intended behavior: after the rule, the node whose string is "apple" is still an ART, and the node whose string is "an" is still a NOU.
In order to manipulate entire nodes (and not only some elements), we have to create indexes such as:
- ("an",%index1)(" ",%index2)("apple",%index3):=(%index3)(%index2)(%index1);
In this case, the output would be the expected one:
- ("apple",[apple],[[apple(icl>fruit)]],LEX=N,POS=NOU)
- (" ",[ ],[[]],LEX=O,POS=PUT,BLK)
- ("an",[an],[[]],LEX=D,POS=ART)
Indexes, which are introduced by the symbol %, are always temporary (they are valid only within rules using them) and are used for co-indexing nodes. For further information on indexes, see indexation.
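The difference between the two rules can be illustrated with a short Python sketch. The dictionaries and the function names `swap_strings` and `reorder_by_index` are hypothetical; the sketch only mirrors the two outputs shown above and makes no claim about how the rule engine works internally.

```python
import copy

# The three nodes produced by tokenizing "an apple".
NODES = [
    {"string": "an", "headword": "an", "uw": "",
     "features": ["LEX=D", "POS=ART"]},
    {"string": " ", "headword": " ", "uw": "",
     "features": ["LEX=O", "POS=PUT", "BLK"]},
    {"string": "apple", "headword": "apple", "uw": "apple(icl>fruit)",
     "features": ["LEX=N", "POS=NOU"]},
]


def swap_strings(nodes):
    """("an")(" ")("apple"):=("apple")(" ")("an");  only the strings change."""
    result = copy.deepcopy(nodes)
    reversed_strings = [node["string"] for node in reversed(nodes)]
    for node, new_string in zip(result, reversed_strings):
        node["string"] = new_string
    return result


def reorder_by_index(nodes):
    """("an",%index1)(" ",%index2)("apple",%index3):=(%index3)(%index2)(%index1);"""
    # Each index refers to a whole node, so the whole node moves.
    indexed = {f"%index{i}": node for i, node in enumerate(nodes, start=1)}
    return [indexed["%index3"], indexed["%index2"], indexed["%index1"]]


swapped = swap_strings(NODES)
reordered = reorder_by_index(NODES)
print(swapped[0])    # string "apple", but headword and features of "an"
print(reordered[0])  # the complete "apple" node
```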
Properties
- Nodes are enclosed between (parentheses)
- ("a") is a node
- "a" is not a node
- The elements of a node are separated by commas
- ("a",[a],[[a]],A,B,A=C,%a)
- The order of elements inside a node is not relevant.
- ("a",[a],[[a]],A,B,A=C,%a) is the same as ([[a]],B,A,"a",[a],A=C,%a)
- A node may have only one string, one headword, one UW and one index, but as many features as necessary
- ("a","b") (a node may not contain more than one string)
- ([a],[b]) (a node may not contain more than one headword)
- ([[a]],[[b]]) (a node may not contain more than one UW)
- (%a,%b) (a node may not contain more than one index)
- (A,B,C,D,...,Z) (a node may contain as many features as necessary)
- A node may be referred to by any of its elements, but only the index makes it unique
- ("a") refers to all nodes where actual string = "a"
- ([a]) refers to all nodes where headword = [a]
- ([[a]]) refers to all nodes where UW = [[a]]
- (A) refers to all nodes having the feature A
- ("a",[a],[[a]],A) refers to all nodes having the feature A where string = "a" and headword = [a] and UW = [[a]]
- (%a) refers to the specific node with the index %a
- Nodes are automatically indexed according to a position-based system if no explicit index is provided (see Indexation)
- ("a")("b") is actually ("a",%01)("b",%02)
- Regular expressions may be used to make reference to any element of the node, except the index
- ("/a{2,3}/") refers to all nodes where string is a sequence of 2 to 3 characters "a"
- ([/a{2,3}/]) refers to all nodes where headword is a sequence of 2 to 3 characters "a"
- ([[/a{2,3}/]]) refers to all nodes where UW is a sequence of 2 to 3 characters "a"
- (/a{2,3}/) refers to all nodes having a feature that is a sequence of 2 to 3 characters "a"
- Nodes may contain alternative features enclosed between {braces} and separated by a vertical bar
- ({A|B}) refers to all nodes having the feature A OR B
- Node features may be expressed as simple attributes, or attribute-value pairs
- (MCL) - feature as an attribute: refers to all nodes having the feature MCL
- (GEN=MCL) - feature as an attribute-value pair, which is the same as (GEN,MCL): refers to all nodes having the features GEN and MCL.
- Attribute-value pairs may be used to create co-reference between different nodes (as in agreement)
- (%x,GEN)(%y,GEN=%x) - the value of the attribute GEN of the node %y is the same as the value of the attribute GEN of the node %x (see Indexation)
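Several of the properties above (reference by any element, regular expressions, alternative features) can be sketched as a matching function in Python. The names `element_matches` and `node_matches` are hypothetical. Python's `re` module stands in for the grammar's regular expressions, whose exact dialect is not specified here, and features are again modelled as a flat set of names.

```python
import re


def element_matches(pattern, value):
    """Compare one element; a pattern written as /.../ is treated as a regular expression."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.fullmatch(pattern[1:-1], value) is not None
    return pattern == value


def node_matches(node, string=None, headword=None, uw=None, features=()):
    """Return True if the node satisfies every condition that is given."""
    for key, pattern in (("string", string), ("headword", headword), ("uw", uw)):
        if pattern is not None and not element_matches(pattern, node[key]):
            return False
    for wanted in features:
        # A tuple of alternatives models {A|B}; a plain string is a single feature.
        alternatives = (wanted,) if isinstance(wanted, str) else tuple(wanted)
        if not any(element_matches(alt, feature)
                   for alt in alternatives
                   for feature in node["features"]):
            return False
    return True


node = {"string": "aa", "headword": "a", "uw": "a",
        "features": {"A", "GEN", "MCL"}}

print(node_matches(node, string="/a{2,3}/"))        # ("/a{2,3}/")
print(node_matches(node, headword="/a{2,3}/"))      # ([/a{2,3}/])
print(node_matches(node, features=[("A", "B")]))    # ({A|B})
print(node_matches(node, features=["GEN", "MCL"]))  # (GEN=MCL), i.e. (GEN,MCL)
```

The order in which the conditions are passed does not matter, which mirrors the property that the order of elements inside a node is not relevant.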
Strings, headwords and UWs
During the tokenization
- [a] will match the node associated with the entry [a] retrieved from the dictionary, regardless of its current string, which may have been changed by other rules (the original string may have been replaced, for instance, by "b", but the node will still be linked to the entry [a])
- "Double quotes" are always used to represent strings: "a" will match only the string "a"
- [Simple square brackets] are always used to represent natural language entries (headwords) in the dictionary
- [[Double square brackets]] are always used to represent UWs: [[a]] will match the node associated with the UW [[a]]
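The distinction between the string and the headword can be shown with a minimal Python sketch. The `matches` helper is hypothetical and only compares elements for equality.

```python
def matches(node, **elements):
    """Return True if every given element equals the node's current value."""
    return all(node[name] == value for name, value in elements.items())


node = {"string": "an", "headword": "an", "uw": ""}

# Effect of the rule ("an"):=("a"); only the string changes.
node["string"] = "a"

print(matches(node, string="an"))    # ("an") no longer refers to this node
print(matches(node, string="a"))     # ("a") now does
print(matches(node, headword="an"))  # ([an]) still does
```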
Notes
- [1] For the structure of the dictionary, please consult dictionary.
- [2] This is a T-rule and means: change the string value from "an" to "a". For further information on T-rules, please consult T-rule.