Node

From UNL Wiki
Revision as of 15:43, 16 August 2013 by Martins (Talk | contribs)

A node is the most elementary unit in the grammar. It is the result of the tokenization process, and corresponds to the notion of "lexical item". At the surface level, a natural language sentence is considered a list of nodes, and a UNL graph a set of relations between nodes.

Elements

Any node is a vector (one-dimensional array) containing the following necessary elements:

  • a string, to be represented between "quotes", which expresses the actual state of the node;
  • a headword, to be represented between [square brackets], which expresses the original value of the node in the dictionary;
  • a UW, to be represented between [[double square brackets]], which expresses the UW value of the node;
  • a feature or set of features, which express the properties of the node (for example, NUM or POS=NOU);
  • an index, preceded by the symbol %, which is used to reference the node.
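The five elements listed above can be pictured as the fields of a small record. The following is only a sketch: the class name `Node`, its field names and the feature `POS=VER` are assumptions made for illustration and are not part of the UNL specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical container for the five elements of a node."""
    string: Optional[str] = None            # actual state, written "..."
    headword: Optional[str] = None          # dictionary value, written [...]
    uw: Optional[str] = None                # UW value, written [[...]]
    features: FrozenSet[str] = frozenset()  # e.g. {"NUM", "POS=NOU"}
    index: Optional[str] = None             # reference, written %...


# The surface form "went" of the entry [go]; the feature name is illustrative.
went = Node(
    string="went",
    headword="go",
    uw="to go(icl>to move)",
    features=frozenset({"POS=VER"}),
    index="x",
)
```

Keeping the string and the headword in separate fields reflects the distinction made above between the actual state of the node and its original value in the dictionary.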

Basic symbols

Basic symbols used in the UNL framework
Symbol   Definition                                Example
( )      node                                      (%a)
" "      string                                    "went"
[ ]      natural language entry (headword)         [go]
[[ ]]    UW                                        [[to go(icl>to move)]]
/ /      regular expression                        /a{2,3}/ = aa,aaa
^        not                                       ^a = not a
{ | }    or                                        {a|b} = a or b
%        index for nodes, attributes and values    %x
#        index for sub-NLWs                        #01
=        attribute-value assignment                POS=NOU
!        rule trigger                              !PLR
&        merge operator                            %x&%y
?        dictionary lookup operator                ?[a]

Examples

Examples of nodes:

  • ("ing") (a node making reference only to its actual string value)
  • ([book]) (a node making reference only to its headword, i.e., its original state in the dictionary)
  • ([[book(icl>document)]]) (a node making reference only to its UW value)
  • (NUM) (a node making reference only to one of its features)
  • (POS=NOU) (a node making reference only to one of its features in the attribute-value pair format)
  • (%x) (a node making reference only to its unique index)
  • ("string",[headword],[[UW]],feature1,feature2,...,attribute1=value1,attribute2=value2,...,%x) (complete node)
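As an illustration of the notation used in the examples above, the sketch below assembles the written form of a node from its elements. The function `format_node` and its parameter names are hypothetical helpers, not part of any UNL tool.

```python
def format_node(string=None, headword=None, uw=None, features=(), index=None):
    """Render the given elements in node notation (illustrative helper)."""
    parts = []
    if string is not None:
        parts.append(f'"{string}"')      # string between quotes
    if headword is not None:
        parts.append(f"[{headword}]")    # headword between square brackets
    if uw is not None:
        parts.append(f"[[{uw}]]")        # UW between double square brackets
    parts.extend(features)               # features and attribute=value pairs
    if index is not None:
        parts.append(f"%{index}")        # index preceded by %
    return "(" + ",".join(parts) + ")"


only_string = format_node(string="ing")
only_uw = format_node(uw="book(icl>document)")
complete = format_node("a", "a", "a", ["A", "B", "A=C"], "a")
```

The fixed output order is only a convenience of this sketch.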

Properties

Nodes are enclosed between (parentheses).
  • ("a") is a node
  • "a" is not a node
The elements of a node are separated by commas.
  • ("a",[a],[[a]],A,B,A=C,%a)
The order of elements inside a node is not relevant.
  • ("a",[a],[[a]],A,B,A=C,%a) is the same as ([[a]],B,A,"a",[a],A=C,%a)
A node may have at most one string, one headword, one UW and one index, but as many features as necessary.
  • ("a","b") is invalid (a node may not contain more than one string)
  • ([a],[b]) is invalid (a node may not contain more than one headword)
  • ([[a]],[[b]]) is invalid (a node may not contain more than one UW)
  • (%a,%b) is invalid (a node may not contain more than one index)
  • (A,B,C,D,...,Z) is valid (a node may contain as many features as necessary)
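The rules above (parentheses, comma separation, free order, a single string, headword, UW and index) can be checked mechanically. The parser below is a minimal sketch under stated assumptions: `parse_node` and the `TOKEN` pattern are hypothetical, and the sketch does not handle regular expressions such as /a{2,3}/, whose body contains a comma.

```python
import re

# One alternative per kind of element; the order matters ([[ ]] before [ ]).
TOKEN = re.compile(r'''
    "(?P<string>[^"]*)"
  | \[\[(?P<uw>.*?)\]\]
  | \[(?P<headword>[^\]]*)\]
  | %(?P<index>\w+)
  | (?P<feature>[^,\s]+)
''', re.VERBOSE)


def parse_node(text):
    """Split a written node into its elements, enforcing the uniqueness rules."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a node must be enclosed between parentheses")
    node = {"string": None, "headword": None, "uw": None,
            "index": None, "features": set()}
    for match in TOKEN.finditer(text[1:-1]):
        kind = match.lastgroup           # name of the alternative that matched
        value = match.group(kind)
        if kind == "feature":
            node["features"].add(value)  # any number of features is allowed
        elif node[kind] is not None:
            raise ValueError(f"a node may not contain more than one {kind}")
        else:
            node[kind] = value
    return node


same = parse_node('("a",[a],[[a]],A,B,A=C,%a)') == parse_node('([[a]],B,A,"a",[a],A=C,%a)')
```

Because features are collected in a set and the other elements are stored by kind, two spellings of the same node that differ only in element order compare equal.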
A node may be referred to by any of its elements, but only the index makes it unique.
  • ("a") refers to all nodes where actual string = "a"
  • ([a]) refers to all nodes where headword = [a]
  • ([[a]]) refers to all nodes where UW = [[a]]
  • (A) refers to all nodes having the feature A
  • ("a",[a],[[a]],A) refers to all nodes having the feature A where string = "a" and headword = [a] and UW = [[a]]
  • (%a) refers to the specific node with the index %a
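Referring to nodes by their elements amounts to filtering a list of nodes with a partial description. The sketch below assumes nodes are stored as plain dictionaries; `matches`, `select` and the sample data are hypothetical.

```python
def matches(pattern, node):
    """True when every element given in the pattern is found in the node."""
    for key in ("string", "headword", "uw", "index"):
        if pattern.get(key) is not None and pattern[key] != node.get(key):
            return False
    # Every feature named in the pattern must be present in the node.
    return set(pattern.get("features", ())) <= set(node.get("features", ()))


def select(pattern, nodes):
    """Return all nodes the pattern refers to."""
    return [node for node in nodes if matches(pattern, node)]


nodes = [
    {"string": "a", "headword": "a", "uw": "a", "features": {"A"}, "index": "a"},
    {"string": "a", "headword": "b", "uw": "b", "features": {"B"}, "index": "b"},
]

by_string = select({"string": "a"}, nodes)   # both nodes share the string "a"
by_index = select({"index": "a"}, nodes)     # the index picks out a single node
```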
Nodes are automatically indexed according to a position-based system if no explicit index is provided (see Indexation).
  • ("a")("b") is actually ("a",%01)("b",%02)
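The default indexing shown in the example can be sketched as follows. This assumes that the position is counted from left to right starting at 1 and written with two digits, as in %01 and %02; the full scheme is described under Indexation and may differ. The function `auto_index` is hypothetical.

```python
def auto_index(nodes):
    """Give a position-based index to every node that lacks an explicit one."""
    indexed = []
    for position, node in enumerate(nodes, start=1):
        node = dict(node)                      # leave the input untouched
        if node.get("index") is None:
            node["index"] = f"{position:02d}"  # 1 -> "01", 2 -> "02"
        indexed.append(node)
    return indexed


sentence = auto_index([{"string": "a"}, {"string": "b"}])
```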
Regular expressions may be used to make reference to any element of the node, except the index.
  • ("/a{2,3}/") refers to all nodes where the string is a sequence of two or three "a" characters
  • ([/a{2,3}/]) refers to all nodes where the headword is a sequence of two or three "a" characters
  • ([[/a{2,3}/]]) refers to all nodes where the UW is a sequence of two or three "a" characters
  • (/a{2,3}/) refers to all nodes having a feature that is a sequence of two or three "a" characters
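A pattern written between slashes can be tested against an element as sketched below. Two assumptions are made: the regular expression syntax is approximated by Python's `re` module, and the expression must cover the whole element, which is consistent with /a{2,3}/ = aa,aaa. The function `element_matches` is hypothetical.

```python
import re


def element_matches(pattern, value):
    """Compare an element literally, or as a regular expression if written /.../."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.fullmatch(pattern[1:-1], value) is not None
    return pattern == value


two = element_matches("/a{2,3}/", "aa")
four = element_matches("/a{2,3}/", "aaaa")
```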
Nodes may contain disjunctive (alternative) features enclosed between {braces} and separated by a vertical bar.
  • ({A|B}) refers to all nodes having the feature A OR B
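The disjunction can be read as "at least one of the listed features is present". A minimal sketch, with `feature_spec_matches` as a hypothetical helper:

```python
def feature_spec_matches(spec, features):
    """True if the node's features satisfy a plain feature or a {A|B} disjunction."""
    if spec.startswith("{") and spec.endswith("}"):
        alternatives = spec[1:-1].split("|")
        return any(alternative in features for alternative in alternatives)
    return spec in features


has_either = feature_spec_matches("{A|B}", {"B", "C"})
has_neither = feature_spec_matches("{A|B}", {"C"})
```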
Node features may be expressed as simple attributes, or attribute-value pairs.
  • (MCL) is a feature expressed as an attribute: it refers to all nodes having the feature MCL
  • (GEN=MCL) is a feature expressed as an attribute-value pair, which is the same as (GEN,MCL): it refers to all nodes having the features GEN and MCL
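The equivalence between (GEN=MCL) and (GEN,MCL) can be illustrated by flattening every attribute-value pair into two plain features. `expand_features` is a hypothetical helper; note that this flat view is enough for matching but no longer records which value belongs to which attribute.

```python
def expand_features(specs):
    """Flatten features so that ATTRIBUTE=VALUE counts as ATTRIBUTE and VALUE."""
    features = set()
    for spec in specs:
        attribute, separator, value = spec.partition("=")
        features.add(attribute)
        if separator:                # the spec was an attribute-value pair
            features.add(value)
    return features


pair = expand_features(["GEN=MCL"])
plain = expand_features(["GEN", "MCL"])
```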
Attribute-value pairs may be used to create co-reference between different nodes (as in agreement).
  • (%x,GEN)(%y,GEN=%x): the value of the attribute GEN of the node %x is the same as that of the attribute GEN of the node %y (see Indexation)
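Read as a condition, the co-reference above states that two nodes carry the same value for a given attribute. The sketch below assumes each node keeps its attribute-value pairs in a dictionary; `agrees`, the sample nodes and the value FEM are hypothetical.

```python
def agrees(node_x, node_y, attribute):
    """True when both nodes carry the attribute with the same value."""
    value_x = node_x["attributes"].get(attribute)
    value_y = node_y["attributes"].get(attribute)
    return value_x is not None and value_x == value_y


node_x = {"index": "x", "attributes": {"GEN": "MCL"}}
node_y = {"index": "y", "attributes": {"GEN": "MCL"}}
node_z = {"index": "z", "attributes": {"GEN": "FEM"}}  # FEM is an illustrative value

same_gender = agrees(node_x, node_y, "GEN")
different_gender = agrees(node_x, node_z, "GEN")
```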

Strings, headwords and UWs

During the tokenization


  • "Double quotes" are always used to represent strings: "a" will match only the string "a".
  • [Simple square brackets] are always used to represent natural language entries (headwords) in the dictionary: [a] will match the node associated with the entry [a] retrieved from the dictionary, no matter its current realization, which may be affected by other rules (the original [a] may have been replaced, for instance, by "b", but will still be indexed to the entry [a]).
  • [[Double square brackets]] are always used to represent UWs: [[a]] will match the node associated with the UW [[a]].
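The difference between matching by string and matching by headword shows up once a rule has changed the surface form of a node. In the sketch below, nodes are plain dictionaries and `refers_to` is a hypothetical helper.

```python
def refers_to(pattern, node):
    """True when every element given in the pattern equals the node's element."""
    return all(node.get(key) == value for key, value in pattern.items())


node = {"string": "a", "headword": "a", "uw": "a"}
node["string"] = "b"  # a rule rewrites the surface form; headword and UW stay

by_string = refers_to({"string": "a"}, node)      # the string is now "b"
by_headword = refers_to({"headword": "a"}, node)  # still the entry [a]
by_uw = refers_to({"uw": "a"}, node)              # still the UW [[a]]
```

Only the reference by string stops matching, since the headword and the UW keep pointing to the original dictionary entry.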