English dictionary

From UNL Wiki

Revision as of 18:50, 27 July 2012

The English dictionaries are used, along with the English grammars, in the process of UNLization and NLization with IAN and EUGENE, respectively. They follow the UNL Dictionary Specs and use the tags provided in the Tagset. They are provided in two different formats:

  • The ENG-UNL (Analysis) Dictionary, to be used in UNLization (IAN), is presented in the enumerative format, i.e., it brings all word forms ("die","dies","dying","dead") and not only base forms
  • The UNL-ENG (Generation) Dictionary, to be used in NLization (EUGENE), is presented in the generative format, i.e., it brings only base forms ("die") along with the instructions to generate the corresponding inflections


Dictionary Entry Structure

According to the UNL Dictionary Specs, entries are provided in the following format:

[Natural Language Entry]{}"UW"(feature list)<eng,FREQUENCY,PRIORITY>;

Where:

  • [Natural Language Entry] is a word form ("die","dies","dying","dead") in the ENG-UNL (Analysis) Dictionary and a base form ("die") in the UNL-ENG (Generation) Dictionary.
  • "UW" may be an ordinary UW ("book"); a pro-form ("00"), in the case of pronouns; or empty (""), in the case of natural language entries, such as determiners and prepositions, that are not associated with any UW.
  • The feature list is a list of attribute-value pairs as defined in the Tagset.
  • FREQUENCY is used only in the ENG-UNL dictionary: higher frequency words prevail over other candidate words with the same length during the tokenization of the natural language sentence.
  • PRIORITY is used only in the UNL-ENG dictionary: higher priority words prevail over other candidate words during the tokenization of the UNL graph.
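
The entry format above can be read mechanically. The following is a minimal Python sketch, not part of the UNL tools: the names ENTRY_RE, Entry, parse_entry and prefer are hypothetical, the content of {} is simply captured as-is, and inflectional rules (FLX) are not handled. The function prefer applies the FREQUENCY rule to same-length candidates; preferring longer candidates first is an assumption of this sketch.

```python
import re
from typing import NamedTuple

# [headword]{...}"UW"(feature list)<lang,FREQUENCY,PRIORITY>;
ENTRY_RE = re.compile(
    r'\[(?P<headword>.+?)\]'        # natural language entry (may be a /regex/)
    r'\{(?P<entry_id>[^}]*)\}'      # content of {} (empty in the examples on this page)
    r'"(?P<uw>[^"]*)"\s*'           # UW, possibly empty
    r'\((?P<features>.*)\)\s*'      # feature list
    r'<(?P<lang>\w+),(?P<frequency>\d+),(?P<priority>\d+)>;'
)


class Entry(NamedTuple):
    headword: str
    uw: str
    features: dict
    frequency: int
    priority: int


def parse_entry(line):
    """Parse one dictionary line into an Entry (sketch; FLX rules not supported)."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    features = {}
    for item in m["features"].split(","):
        name, sep, value = item.strip().partition("=")
        # Repeated attributes (e.g. ATE=PAS,ATE=PTP) accumulate in a list;
        # bare features such as DIGIT are stored as True.
        features.setdefault(name, []).append(value if sep else True)
    return Entry(m["headword"], m["uw"], features,
                 int(m["frequency"]), int(m["priority"]))


def prefer(candidates):
    """Among same-length candidates, the higher FREQUENCY wins.

    Ranking longer candidates first is an assumption, not stated on this page.
    """
    return max(candidates, key=lambda e: (len(e.headword), e.frequency))


entry = parse_entry('[killed]{}"kill"(LEX=V,POS=VER,TRA=TSTD,ATE=PAS,ATE=PTP)<eng,0,0>;')
print(entry.uw, entry.features["ATE"])
```

Storing every feature as a list keeps multi-valued attributes and single-valued ones uniform, at the cost of a slightly more verbose lookup.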

The features used in the dictionary are the following:

Feature structure of the ENG-UNL Dictionary
Class                             | ENG-UNL                 | UNL-ENG
Common noun and proper names (N)  | LEX,POS,NUM             | LEX,POS,NUM,PAR,FLX
Adjectives (J)                    | LEX,POS                 | LEX,POS
Adverbs (A)                       | LEX,POS,att             | LEX,POS,att
Verbs (V)                         | LEX,POS,TRA,ATE,PER     | LEX,POS,TRA,PAR,FLX
Conjunctions (C)                  | LEX,POS,att,rel         | not represented
Determiners (D)                   | LEX,POS,att             | LEX,POS,att
Prepositions (P)                  | LEX,POS,att,rel         | not represented
Pronouns (R)                      | LEX,POS,CAS,PER,GEN,NUM | LEX,POS,CAS,PER,GEN,NUM
Numerals (U)                      | LEX,POS                 | not represented
The tags follow the structure defined in the Tagset.
att
The attribute "att", along with its corresponding value, is used when the word is associated with an attribute (instead of a UW):
[the]{}"" (LEX=D,POS=ART,att=@def)<eng,255,0>;
In this case, the English word "the" does not correspond to any UW, but to the attribute "@def".
rel
The attribute "rel", along with its corresponding value, is used when the word is associated with a relation (instead of a UW):
[of]{}"" (LEX=P,POS=PRE,rel=mod)<eng,255,0>;
In this case, the English word "of" does not correspond to any UW, but to the relation "mod".
Some entries may have both "att" and "rel"
[under]{}"" (P,rel=plc,att=@under)<eng,255,0>;
In this case, the English word "under" corresponds both to the relation "plc" (place) and the attribute "@under".
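
As an illustration of the three cases above, the sketch below summarizes what an entry contributes: a UW, a relation, an attribute, or a combination. The function name describe and the dictionary-based feature representation are assumptions made for this example only.

```python
def describe(headword, uw, features):
    """Summarize what a dictionary entry maps to (sketch, hypothetical helper)."""
    parts = []
    if uw:
        parts.append(f'UW "{uw}"')
    if "rel" in features:
        parts.append(f'relation "{features["rel"]}"')
    if "att" in features:
        parts.append(f'attribute "{features["att"]}"')
    return f"{headword}: " + " and ".join(parts)


print(describe("the", "", {"LEX": "D", "POS": "ART", "att": "@def"}))
print(describe("of", "", {"LEX": "P", "POS": "PRE", "rel": "mod"}))
print(describe("under", "", {"LEX": "P", "rel": "plc", "att": "@under"}))
```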
Pronouns are associated to the UW 00 with attributes
[I]{}"00.@1" (LEX=R,POS=PPR,CAS=NOM,PER=1PS)<eng,255,0>;
The English word "I" is associated to the UW "00.@1" (the pro-form for first person singular) and not only to "00", which is simply a pro-form.
Pronouns and determiners
It is important to differentiate pronouns from determiners. The former behave as nouns and are associated with the UW 00 plus the corresponding attributes; the latter are associated with the NULL UW. Compare:
This is a book. ("this" is a demonstrative pronoun, to be represented as [this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;)
This book ("this" is a demonstrative determiner, to be represented as [this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;)
This difference is very important, regardless of the conventions adopted in the different grammar traditions, because these two instances of "this" are represented differently in UNL.
Possessive pronouns and possessive determiners, because they are associated with the same UW, may be represented in a single entry:
This book is his. ("his" is a possessive pronoun, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
This is his book. ("his" is a possessive determiner, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
The entry [his] has two values for the attribute LEX and two values for the attribute POS, one for each use of the word. They can be included in the same entry because both uses are associated with the same UW.
LST
The attribute LST (lexical structure) is assigned only in case of subwords (SBW) and contractions (CTT):
[shan't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@conviction;@not)<eng,255,0>; (contraction)
['ve]{}"" (LEX=V,POS=AUX,ATE=INF,LST=SBW)<eng,255,0>; (subword)
Multiple values for the same attributes inside the same entry
An entry may have several different values for the same features if all its instances are associated to the same UW:
[killed]{}"kill"(LEX=V,POS=VER,TRA=TSTD,ATE=PAS,ATE=PTP)<eng,0,0>;
The entry [killed] may have two values for the attribute ATE (absolute tense): PAS (past) and PTP (participle), because in both cases the entry is associated with the same UW "kill".
[be]{}"" (LEX=V,POS=COP,POS=AUX,ATE=INF)<eng,255,0>;
The entry [be] may have two values for the attribute POS (part-of-speech): COP (copula) and AUX (auxiliary), because in both cases the entry is associated with the same UW "".
[this]{}"00.@proximal" (LEX=R,LEX=D,POS=DEP,POS=DEM,NUM=SNG)<eng,255,0>; (not allowed)
The entry [this] may not have two values for the attributes LEX and POS, because it is associated with different UWs as a determiner (LEX=D,POS=DEM) and as a pronoun (LEX=R,POS=DEP). It must therefore be split into two entries:
[this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;
[this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;
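
The rule can be sketched as a merge keyed on the pair (entry, UW): readings that share both are folded into one entry with multiple values, while readings with different UWs stay separate. This is only an illustration under that reading of the rule; merge_readings and the tuple layout are hypothetical.

```python
def merge_readings(readings):
    """Fold readings with the same headword and UW into one multi-valued entry."""
    merged = {}
    for headword, uw, features in readings:
        slot = merged.setdefault((headword, uw), {})
        for name, value in features:
            values = slot.setdefault(name, [])
            if value not in values:  # keep each value once, in first-seen order
                values.append(value)
    return merged


readings = [
    ("killed", "kill", [("LEX", "V"), ("POS", "VER"), ("ATE", "PAS")]),
    ("killed", "kill", [("LEX", "V"), ("POS", "VER"), ("ATE", "PTP")]),
    ("this", "00.@proximal", [("LEX", "R"), ("POS", "DEP"), ("NUM", "SNG")]),
    ("this", "", [("LEX", "D"), ("POS", "DEM"), ("NUM", "SNG"), ("att", "@proximal")]),
]
for key, features in merge_readings(readings).items():
    print(key, features)
```

The two readings of [killed] collapse into one entry with ATE=PAS and ATE=PTP, whereas [this] remains two entries because its UWs differ.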
Cumulative attributes
Cumulative attributes for the same entry must be given inside a single "att" attribute, separated by ";":
[can't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not)<eng,255,0>;
The auxiliary [can't] is not @ability OR @not, but @ability AND @not at the same time (.@ability.@not).
[may]{}"" (LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility)<eng,255,0>;
The auxiliary [may] is either @permission OR @possibility, and not both at the same time.
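
The difference between the two notations can be made explicit in a short sketch: each "att=" value is one alternative (OR), and ";" joins attributes that hold together (AND). The function name att_alternatives is hypothetical, and the feature list is assumed to be a plain comma-separated string.

```python
def att_alternatives(feature_list):
    """Return the alternatives encoded by the 'att' values of a feature list.

    Each inner list is one alternative; its members apply together.
    """
    alternatives = []
    for item in feature_list.split(","):
        name, sep, value = item.strip().partition("=")
        if sep and name == "att":
            alternatives.append(value.split(";"))
    return alternatives


# [can't]: one alternative with two cumulative attributes
print(att_alternatives("LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not"))
# [may]: two alternatives with one attribute each
print(att_alternatives("LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility"))
```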
The attribute FLX is used only in the UNL-ENG dictionary and is assigned only in case of irregular forms
The attribute FLX brings the inflectional rules used to generate the inflections of irregular forms.
[foot]{}"foot"(LEX=N,POS=NOU,PAR=M1,FLX(SNG:=0>"";PLR:=0>"s"/))<eng,0,0>;
The values of the attribute FLX specify how to create the inflections for SNG (singular) and PLR (plural) of this particular irregular word (PAR=M1).




The dictionary is divided into three parts:

  • Corpus-specific words bring open-class words (i.e., nouns, adjectives, verbs and adverbs) appearing in the Corpus500
  • Grammar words bring closed-class words (determiners, prepositions, conjunctions and numbers) of English
  • The Default Dictionary brings punctuation signs and regular expressions to process URLs, dates and other canned structures

ENG-UNL Dictionary

The ENG-UNL Dictionary for Corpus500 may be downloaded from ana_dic_eng.txt. The complete ENG-UNL Dictionary may be exported from the UNLarium: UNLWEB>UNLARIUM>DICTIONARY>ENGLISH>EXPORT.

Numbers in the dictionary

  1. The English dictionary does not contain digits as natural language entries, but only words. The entries [one] and [first] are in the dictionary, but not the entry [1].
  2. All natural language words corresponding to numbers are associated with UWs represented by digits, because numbers are always represented in UNL by digits:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    The English word "zero" is associated to the UW "0".
  3. The English dictionary contains only simple forms (such as [one], [eleven], [twenty], [hundred]). Compound forms (such as [twenty-one]) and complex forms (such as [one hundred]) are not included in the dictionary, but are generated by the grammar.
  4. In order to handle compound and complex forms, the tens (feature DOZEN) are associated with the digit of the tens place rather than with the full number, as in:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    Where "twenty" is associated to "2", instead of "20", because "twenty-one" is not "201", but "21".
  5. In order to handle compound and complex forms, [hundred], [thousand], [million], [billion] and [trillion] are associated to the NULL UW, as in:
    [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
    Where [hundred] is associated to the UW "", instead of "100" or "1", because "two hundred" is not "2100" or "21", but "200".
  6. The English dictionary does not contain words for ordinals, except the irregular ones ([first], [second], [third], [fifth], [eighth], [ninth] and [twelfth]). All the other ordinals are expected to be recognized by the grammar because they follow the general rule CARDINAL + "th" ("seventh" = "seven" + "th"). This is why the entry [th] is included in the dictionary:
    [th]{}"" (LEX=I,POS=SFX,LST=SBW)<eng,255,0>;
  7. In order to cope with spelling changes in the formation of ordinals ("twenty" becomes "twentieth", and not "twentyth"), the tens have been represented as regular expressions:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    The regular expression /(?i)twent(y|ie)/ corresponds to both "twenty" and "twentie", which allows for the recognition of "twentieth" as "twentie"+"th". The flag (?i) means "case insensitive", i.e., the entry will match "twenty", "Twenty", "TWENTY", "twentie", etc.
  8. The irregular ordinals included in the dictionary were associated with the corresponding UW and the attribute @ordinal, as in:
    [first]{}"1.@ordinal" (LEX=U,POS=ORD,DIGIT)<eng,255,0>;
  9. The feature DIGIT was introduced whenever the UW is a digit:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    but [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
  10. Decimals and fractions were not represented inside the dictionary because they are generated either from cardinals or ordinals.
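
The conventions in this list can be tried out with a small sketch. It is not the actual English grammar: the mini-lexicon, the function names lookup and analyze, the UW "11" for [eleven], the zero-padding of empty places and the ".@ordinal" suffix on regular ordinals (by analogy with [first]) are all assumptions, and only numbers below one thousand are covered.

```python
import re

# Hypothetical mini-lexicon: (pattern, UW, features), modelled on the entries above.
LEXICON = [
    (r"one", "1", {"DIGIT"}),
    (r"two", "2", {"DIGIT"}),
    (r"seven", "7", {"DIGIT"}),
    (r"eleven", "11", {"DIGIT"}),            # assumed UW for a simple form
    (r"twent(y|ie)", "2", {"DIGIT", "DOZEN"}),
    (r"hundred", "", set()),                 # NULL UW
]


def lookup(token):
    """Case-insensitive full match of a token against the lexicon patterns."""
    for pattern, uw, features in LEXICON:
        if re.fullmatch(pattern, token, flags=re.IGNORECASE):
            return uw, features
    return None


def analyze(text):
    """Compose the digit UW for a numeral below 1000 (sketch under stated assumptions)."""
    hundreds = tens = units = ""
    ordinal = False
    for token in re.split(r"[\s-]+", text.strip()):
        entry = lookup(token)
        if entry is None and token.lower().endswith("th"):
            # CARDINAL + "th": retry without the suffix ("twentie" + "th")
            entry = lookup(token[:-2])
            if entry is not None:
                ordinal = True
        if entry is None:
            raise ValueError(f"unknown numeral: {token!r}")
        uw, features = entry
        if uw == "":                  # [hundred]: the preceding unit moves to the hundreds place
            hundreds, units = units, ""
        elif "DOZEN" in features:     # "twenty" contributes "2" to the tens place
            tens = uw
        elif len(uw) == 2:            # e.g. "11" fills tens and units
            tens, units = uw[0], uw[1]
        else:
            units = uw
    if hundreds:
        digits = hundreds + (tens or "0") + (units or "0")
    elif tens:
        digits = tens + (units or "0")
    else:
        digits = units
    return digits + (".@ordinal" if ordinal else "")


for phrase in ("twenty-one", "two hundred", "Twentieth", "seventh"):
    print(phrase, "->", analyze(phrase))
```

Because "twenty" carries "2" and [hundred] carries the NULL UW, the places can be filled independently and padded afterwards, which is what keeps "twenty-one" from becoming "201" and "two hundred" from becoming "2100".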

Default dictionary

The default dictionary, which appears at the end of the English dictionary, is actually language-independent and may be used by any language. It contains the punctuation signs and regular expressions to deal with URLs, time expressions, currency expressions, Roman numerals, phone numbers and formulas in general.

Software