Dictionary
Revision as of 17:36, 9 February 2010
The UNL-NL dictionaries are bilingual dictionaries linking UWs to natural language (NL) words. They can be unidirectional (UNL-to-NL or NL-to-UNL) or bidirectional (NL-to-UNL-to-NL). UNL-to-NL dictionaries are used for NL-ization, while NL-to-UNL dictionaries are used for UNL-ization. What follows are the current specifications for UNL-NL dictionaries. They are not mandatory in general, but they are required for anyone who wants to use the UNL Centre's or the UNDL Foundation's tools. The features marked with an * are supported only by the UNDL Foundation's tools.
General syntax
In the UNL System, the UNL-NL dictionaries are plain text files with a single entry per line in the following format:
[NLW] {ID} "UW" (ATTR , ... ) < LG , FRE , PRI >; COMMENTS
Where:
- NLW
- The lexical item of the natural language. Its format should be decided by the dictionary builder. It can be:
- a multiword expression: [United States of America]
- a compound: [hot-dog]
- a simple word: [happiness]
- a simple morpheme: [happ]
- a non-motivated linguistic entity: [g]
- a complex structure (see below)*: [[bring] [back]]
- a regular expression (see below)*: [/colou{0,1}r/]
- ID
- The unique identifier (primary-key) of the entry.
- UW
- The Universal Word of UNL. This field can be empty if a word does not need a UW. It can also be a regular expression.
- ATTR
- The list of features of the NLW. It can be:
- a list of simple features: NOU, MCL, SNG
- a list of attribute-value pairs*: pos=NOU, gen=MCL, num=SNG
- a list of inflection rules (see below)*: IFX(PLR:="oo":"ee")
Attributes should be separated by ",".
- LG
- The three-character language code according to ISO 639-3.
- FRE
- The frequency of the NLW in natural texts. Used for natural language analysis (NL-UNL). It can range from 0 (least frequent) to 255 (most frequent).
- PRI
- The priority of the NLW. Used for natural language generation (UNL-NL). It can range from 0 to 255.
- COMMENTS
- Any comment necessary to clarify the mapping between NL and UNL entries. It ends with the line break (return character) that terminates the entry.
The features marked with * are not supported by the UNL Centre's tools.
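The field layout above can be illustrated with a minimal Python sketch that splits one entry line into its fields using the standard `re` module. The names `ENTRY_RE`, `Entry` and `parse_entry` are hypothetical and are not part of any UNL tool. The pattern rests on simplifying assumptions: the ID is treated as optional (some entries on this page omit it), the UW contains no straight double quote, and the feature list is returned as an unparsed string.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for one entry line; an illustration, not an official parser.
ENTRY_RE = re.compile(
    r'''
    ^\s*\[(?P<nlw>.*)\]\s*        # NLW; greedy, so nested brackets stay together
    (?:\{(?P<id>\d+)\}\s*)?       # optional ID between braces
    "(?P<uw>[^"]*)"\s*            # UW between straight quotes (may be empty)
    \((?P<attrs>.*)\)\s*          # feature list, up to the last closing parenthesis
    <\s*(?P<lg>[a-z]{3})\s*,\s*(?P<fre>\d+)\s*,\s*(?P<pri>\d+)\s*>\s*;
    (?P<comment>.*)$              # anything after the semicolon is a comment
    ''',
    re.VERBOSE,
)


class Entry(NamedTuple):
    nlw: str
    id: Optional[int]
    uw: str
    attrs: str
    lang: str
    fre: int
    pri: int
    comment: str


def parse_entry(line: str) -> Entry:
    """Split one dictionary line into fields; raise ValueError if it does not fit."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    fre, pri = int(m["fre"]), int(m["pri"])
    if not (0 <= fre <= 255 and 0 <= pri <= 255):
        raise ValueError("FRE and PRI must be in the range 0..255")
    return Entry(
        nlw=m["nlw"],
        id=int(m["id"]) if m["id"] is not None else None,
        uw=m["uw"],
        attrs=m["attrs"].strip(),
        lang=m["lg"],
        fre=fre,
        pri=pri,
        comment=m["comment"].strip(),
    )


entry = parse_entry('[Peter]{177}"Peter(iof>person)"(NOU)<eng,10,30>;')
print(entry.nlw, entry.id, entry.uw, entry.attrs, entry.lang, entry.fre, entry.pri)
```

A single regular expression is enough here because only the outermost delimiters of each field are located; the nested content of the feature list is left for a later step.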
Formal syntax
<dictionary entry> ::= <NLW><ID><UW><FEATURE LIST>"<"<LANG>","<FRE>","<PRI>">;"
<NLW> ::= "["(<SIMPLE NLW>|<COMPOUND NLW>|<RESERVED NLW>|<REGULAR EXPRESSION>)"]"
<SIMPLE NLW> ::= <text>
<COMPOUND NLW> ::= ("["<text>"]")+
<RESERVED NLW> ::= "RegEx"
<ID> ::= "{"<positive integer>"}"
<UW> ::= """<text>"""|<REGULAR EXPRESSION>
<FEATURE LIST> ::= "("<FEATURE>(","<FEATURE>)*")"
<FEATURE> ::= (<VALUE>|<ATTRIBUTE>"="<VALUE>|<RULE LIST>|<SUBNLWID>"#"<FEATURE LIST>)
<SUBNLWID> ::= [01..99]
<RULE LIST> ::= <RULE>(";"<RULE>)*
<RULE> ::= <ATTRIBUTE>"("<VALUE>":="<a-rule>(";"<VALUE>":="<a-rule>)*")"
<ATTRIBUTE> ::= <text>
<VALUE> ::= <text>("&"<text>)*
<LANG> ::= ISO 639-3 language codes
<FRE> ::= [0..255]
<PRI> ::= [0..255]
<REGULAR EXPRESSION> ::= "/"<PERL COMPATIBLE REGULAR EXPRESSIONS>"/"
Where:
+ = 1 or more times
* = 0 or more times
| = alternative
Horizontal blank spaces are allowed and ignored except inside quoted text (string literals).
Complex structures as NLW*
In order to deal with multiword expressions, the NLW can be represented as a complex structure comprising several sub-NLW entries. The syntax for complex NLWs is:
[[sub-NLW][sub-NLW]...[sub-NLW]] {ID} "UW" (ATTR , ..., 01#(ATTR, ...), 02#(ATTR, ...), ...) < LG , FRE , PRI >; COMMENTS
Where:
[sub-NLW] is a part of the NLW;
01#(ATTR, ...) are the specific features for the first sub-NLW to appear in the NLW;
02#(ATTR, ...) are the specific features for the second sub-NLW to appear in the NLW;
and so on.
Sub-NLWs are numbered in order of appearance: the first sub-NLW in an NLW is always 01, the second is 02, and so on.
The feature list preceded by <number># will apply only to the corresponding sub-NLW.
The features outside the sub-NLW feature lists are shared by all sub-NLWs.
- Example
- [[bring] [back]] {12343} "to bring back(icl>to bring)" (pos=VER, 01#(IFX(ET0:=3>"ought")), 02#(pos=PRE)) <eng, 0, 0>;
- In the entry above, the NLW has been split into two sub-NLWs, [bring] and [back], separated by a blank space. Each sub-NLW has its own features, given in the numbered lists embedded in the feature list. The sub-NLW [bring], the first to appear, has the feature "IFX(ET0:=3>"ought")", while the sub-NLW [back], the second, has the feature "pos=PRE". The feature "pos=VER", which lies outside the specific feature lists, is shared by both of them.
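The distinction between shared and sub-NLW-specific features can be sketched in Python as follows. This is only an illustration: `split_top_level` and `group_features` are hypothetical helper names, and the feature list in the usage line is a made-up one in the style of the entry above. The sketch assumes that commas inside parentheses or inside straight double quotes never separate top-level features.

```python
from typing import Dict, List, Tuple


def split_top_level(text: str, sep: str = ",") -> List[str]:
    """Split text at sep, ignoring separators inside parentheses or quotes."""
    parts: List[str] = []
    depth, in_quote, start = 0, False, 0
    for i, ch in enumerate(text):
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == sep and depth == 0:
                parts.append(text[start:i].strip())
                start = i + 1
    parts.append(text[start:].strip())
    return [p for p in parts if p]


def group_features(attrs: str) -> Tuple[List[str], Dict[int, List[str]]]:
    """Return (shared features, {sub-NLW number: its specific features})."""
    shared: List[str] = []
    specific: Dict[int, List[str]] = {}
    for feature in split_top_level(attrs):
        head, hash_sign, rest = feature.partition("#")
        if hash_sign and head.isdigit() and rest.startswith("(") and rest.endswith(")"):
            # "01#(...)": features that apply only to sub-NLW number 1
            specific[int(head)] = split_top_level(rest[1:-1])
        else:
            shared.append(feature)
    return shared, specific


shared, specific = group_features('pos=VER, 01#(TEN(past:=0>"ed")), 02#(pos=PRE)')
print(shared)    # features shared by all sub-NLWs
print(specific)  # features keyed by sub-NLW number
```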
Inflection rules inside dictionary entries*
In order to deal with exceptions and irregular forms, dictionaries may contain rules, which must be included inside the feature list, as follows:
<RULE> ::= <ATTRIBUTE>"("<VALUE>":="<a-rule>(";"<VALUE>":="<a-rule>)*")"
Where:
<ATTRIBUTE> is the attribute that will be used to call the rule
<VALUE> is the value of the attribute that will trigger the rule
<a-rule> is an affixation rule (described in a-rule)
Examples
- NUM(PLR:="men")
- If the value of the attribute "number" (NUM) is "plural" (PLR) then replace the whole natural language word by "men"
- POS(ORD:="1">"1st","2">"2nd","3">"3rd")
- If the value of the attribute "part of speech" (POS) is "ordinal" (ORD) then:
- if the last character of the string is "1", then replace "1" by "1st"; and
- if the last character of the string is "2", then replace "2" by "2nd"; and
- if the last character of the string is "3", then replace "3" by "3rd".
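The behaviour described in these examples can be sketched in Python. The functions `apply_a_rule` and `apply_first` are hypothetical names used only for illustration; the full a-rule syntax is defined elsewhere and is not covered here. Two of the three rule shapes follow the descriptions above (whole-word replacement and replacement of a trailing string). The third, a number on the left-hand side as in `0>"ed"`, is interpreted under the assumption that the number is the count of trailing characters to drop before the new string is appended.

```python
import re
from typing import Iterable

# "1">"1st" (replace a trailing string) or 0>"ed" (numeric left side, assumed meaning)
_SUFFIX_RE = re.compile(r'^\s*(?:"(?P<old>[^"]*)"|(?P<n>\d+))\s*>\s*"(?P<new>[^"]*)"\s*$')
# "men" (replace the whole word)
_WHOLE_RE = re.compile(r'^\s*"(?P<new>[^"]*)"\s*$')


def apply_a_rule(word: str, rule: str) -> str:
    """Apply one a-rule to word; return word unchanged if the rule does not fire."""
    whole = _WHOLE_RE.match(rule)
    if whole:
        return whole["new"]
    m = _SUFFIX_RE.match(rule)
    if m is None:
        raise ValueError(f"unsupported a-rule: {rule!r}")
    if m["n"] is not None:
        # Assumption: drop the last n characters, then append the new string.
        return word[: len(word) - int(m["n"])] + m["new"]
    if word.endswith(m["old"]):
        return word[: len(word) - len(m["old"])] + m["new"]
    return word


def apply_first(word: str, rules: Iterable[str]) -> str:
    """Try the alternatives in order and return the first result that changes word."""
    for rule in rules:
        result = apply_a_rule(word, rule)
        if result != word:
            return result
    return word


print(apply_a_rule("man", '"men"'))
print(apply_first("22", ['"1">"1st"', '"2">"2nd"', '"3">"3rd"']))
print(apply_a_rule("kill", '0>"ed"'))
```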
Regular expressions inside dictionary entries*
Both the NLW and the UW may be replaced by regular expressions. In both cases, the regular expression must be enclosed in a pair of "/" and should comply with PCRE (Perl Compatible Regular Expressions). They should be represented as follows:
Regular expression in NLW (used in UNL-ization)
[RegEx] "UW" (FEATURE LIST) <LANG,FRE,PRI>;
Regular expression in UW (used in NL-ization)
[NLW] "RegEx" (FEATURE LIST) <LANG,FRE,PRI>;
Examples
- Regular expressions in NLW
- [/colo(u)?r/] "color" (POS=NOU) <eng,0,0>; (NLW = {color, colour})
- [/cit(y|ies)/] "city" (POS=NOU) <eng,0,0>; (NLW = {city, cities})
- [/(\d){4}/] "" (ENT=YEAR) <eng,0,0>; (NLW = any sequence of four digits)
- Regular expressions in UW
- [city] "/city(.)*/" (POS=NOU) <eng,0,0>; (UW = any UW that starts with the string "city")
- [city] "/(.)\(iof\>city\)/" (POS=NOU) <eng,0,0>; (UW = any UW that ends with the string "(iof>city)")
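The NLW examples above can be tried out with a short Python sketch. Python's `re` module is not PCRE, but the constructs used in these three patterns behave the same way in both. The helper names `is_regex_field` and `nlw_matches` are hypothetical, and the sketch assumes that a regular expression in the NLW must cover the whole token; how the actual tools anchor the match is not specified here.

```python
import re


def is_regex_field(field: str) -> bool:
    """True if the field content is written between a pair of "/"."""
    return len(field) >= 2 and field.startswith("/") and field.endswith("/")


def nlw_matches(nlw_field: str, token: str) -> bool:
    """Match a token against an NLW field, either literal or regular expression."""
    if is_regex_field(nlw_field):
        # Assumption: the expression has to match the entire token.
        return re.fullmatch(nlw_field[1:-1], token) is not None
    return nlw_field == token


for field, token in [
    ("/colo(u)?r/", "colour"),
    ("/cit(y|ies)/", "cities"),
    (r"/(\d){4}/", "2010"),
    (r"/(\d){4}/", "20100"),
]:
    print(field, token, nlw_matches(field, token))
```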
Examples of dictionary entries
[China]{24} "China(iof>Asian country)" (POS=NOU, LEX=WRD, NUM=SNG, INF=P0, FRA=F0) <eng,0,0>;
[choose]{106} "to choose(icl>to decide)" (POS=VER, LEX=WRD, INF=P1, FRA=F76, FLX(3PS&ET1&IND:=0>"s"; ET0:="chose"; PTP:="chosen"; GER:="choosing")) <eng,0,0>;
[clear-eyed]{25} "clear-eyed(icl>discerning)" (POS=ADJ, LEX=WRD, INF=P0, FRA=F0) <eng,0,0>;
[Peter]{177}"Peter(iof>person)"(NOU)<eng,10,30>;
[kill]{5987}"kill(icl>do)"(TEN(past:=0>"ed"))<eng,70,80>;
[[bring] [back]]{2345}"bring back"(POS=VER,01#(POS=VER,TEN(ET0:=3>"ought")),02#(POS=PRE))<eng,50,34>;