English dictionary

From UNL Wiki
Revision as of 18:02, 27 July 2012 by Martins (Talk | contribs)

The English dictionaries are used, along with the English grammars, in the process of UNLization and NLization with IAN and EUGENE, respectively. They follow the UNL Dictionary Specs and use the tags provided in the Tagset. They are provided in two different formats:

  • The ENG-UNL Dictionary, to be used in UNLization (IAN), is presented in the enumerative format, i.e., it lists all word forms ("die", "dies", "dying", "dead") and not only base forms.
  • The UNL-ENG Dictionary, to be used in NLization (EUGENE), is presented in the generative format, i.e., it lists only base forms ("die") along with the instructions to generate the corresponding inflections.

ENG-UNL Dictionary

The ENG-UNL Dictionary for Corpus500 may be downloaded from ana_dic_eng.txt. The complete ENG-UNL Dictionary may be exported from the UNLarium: UNLWEB>UNLARIUM>DICTIONARY>ENGLISH>EXPORT.

Dictionary entry structure

In the ENG-UNL Dictionary, entries are provided in the following format:

[English Word Form]{}"UW"(feature list in attribute-value pair format)<eng,FREQUENCY,PRIORITY>;
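
As an illustration only, the entry format above could be read with a small parser such as the following Python sketch. The names `ENTRY_RE`, `Entry` and `parse_entry` are hypothetical and not part of any UNL tool; the sketch assumes one entry per line and treats the `{}` field as opaque.

```python
import re
from dataclasses import dataclass

# One entry: [word form]{}"UW"(features)<language,frequency,priority>;
ENTRY_RE = re.compile(
    r'^\[(?P<form>.*)\]'            # word form (may itself be a /regex/)
    r'\{[^}]*\}\s*'                 # the {} field, not interpreted here
    r'"(?P<uw>[^"]*)"\s*'           # UW, possibly empty (NULL UW)
    r'\((?P<features>[^)]*)\)\s*'   # comma-separated feature list
    r'<(?P<lang>[^,>]+),(?P<freq>\d+),(?P<prio>\d+)>;'
)


@dataclass
class Entry:
    form: str
    uw: str
    features: dict   # attribute -> list of values; bare tags map to []
    lang: str
    frequency: int
    priority: int


def parse_entry(line: str) -> Entry:
    """Parse a single dictionary line; raise ValueError if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    features = {}
    for item in m["features"].split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        values = features.setdefault(key, [])
        if sep:                      # "ATE=PAS" style; bare tags keep []
            values.append(value)
    return Entry(m["form"], m["uw"], features, m["lang"],
                 int(m["freq"]), int(m["prio"]))
```

Repeated attributes are collected into a list, so an entry carrying the same attribute twice keeps both values.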

The dictionary is divided into three parts:

  • Corpus-specific words: open-class words (i.e., nouns, adjectives, verbs and adverbs) appearing in the Corpus500
  • Grammar words: closed-class words (determiners, prepositions, conjunctions and numbers) of English
  • Default dictionary: punctuation signs and regular expressions to process URLs, dates and other canned structures

The features used in the dictionary are the following:

Feature structure of the ENG-UNL Dictionary
Class                                Attributes
Common nouns and proper names (N)    LEX,POS,NUM
Adjectives (J)                       LEX,POS
Adverbs (A)                          LEX,POS,att*
Verbs (V)                            LEX,POS,TRA,ATE,PER
Conjunctions (C)                     LEX,POS,att*,rel**
Determiners (D)                      LEX,POS,att*,rel**
Prepositions (P)                     LEX,POS,att*,rel**
Pronouns (R)                         LEX,POS,CAS,PER,GEN,NUM
Numerals (U)                         LEX,POS
The tags follow the structure defined in the Tagset.
*att
the attribute "att", along with its corresponding value, is used when the word is associated to an attribute (instead of a UW):
[the]{}"" (LEX=D,POS=ART,att=@def)<eng,255,0>;
In this case, the English word "the" does not correspond to any UW, but to the attribute "@def".
**rel
the attribute "rel", along with its corresponding value, is used when the word is associated to a relation (instead of a UW):
[of]{}"" (LEX=P,POS=PRE,rel=mod)<eng,255,0>;
In this case, the English word "of" does not correspond to any UW, but to the relation "mod".
Some entries may have both "att" and "rel":
[under]{}"" (P,rel=plc,att=@under)<eng,255,0>;
In this case, the English word "under" corresponds to both the relation "plc" (place) and the attribute "@under".
Pronouns are associated to the UW 00 with attributes
[I]{}"00.@1" (LEX=R,POS=PPR,CAS=NOM,PER=1PS)<eng,255,0>;
The English word "I" is associated to the UW "00.@1" (the pro-form for first person singular) and not only to "00", which is simply a pro-form.
Pronouns and determiners
It's important to differentiate pronouns from determiners. The former behave as nouns and are associated to the UW 00 with the corresponding attributes; the latter are associated to the NULL UW. Compare the difference:
This is a book. ("this" is a demonstrative pronoun, to be represented as [this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;)
This book ("this" is a demonstrative determiner, to be represented as [this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;)
This difference is very important, regardless of the conventions adopted in the different grammar traditions, because these two uses of "this" will be represented differently in UNL.
Possessive pronouns and possessive determiners, because they are associated to the same UW, may be represented in a single entry:
This book is his. ("his" is a possessive pronoun, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
This is his book. ("his" is a possessive determiner, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
The entry [his] has two values for the attribute LEX and two values for the attribute POS, one for each use of the word. They can be included in the same entry because both uses are associated to the same UW.
LST
The attribute LST (lexical structure) is assigned only in case of subwords (SBW) and contractions (CTT):
[shan't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@conviction;@not)<eng,255,0>; (contraction)
['ve]{}"" (LEX=V,POS=AUX,ATE=INF,LST=SBW)<eng,255,0>; (subword)
Multiple values for the same attributes inside the same entry
An entry may have several different values for the same features if all its instances are associated to the same UW:
[killed]{}"kill"(LEX=V,POS=VER,TRA=TSTD,ATE=PAS,ATE=PTP)<eng,0,0>;
The entry [killed] may have two values for the attribute ATE (absolute tense): PAS (past) and PTP (participle), because in both cases the entry is associated to the same UW "kill".
[be]{}"" (LEX=V,POS=COP,POS=AUX,ATE=INF)<eng,255,0>;
The entry [be] may have two values for the attribute POS (part-of-speech): COP (copula) and AUX (auxiliary), because in both cases the entry is associated to the same UW "".
[this]{}"00.@proximal" (LEX=R,LEX=D,POS=DEP,POS=DEM,NUM=SNG)<eng,255,0>;
The entry [this] above is not valid: it may not have two values for the attributes LEX and POS, because the word is associated to different UWs as a determiner (LEX=D,POS=DEM) and as a pronoun (LEX=R,POS=DEP). Two separate entries are required instead:
[this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;
[this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;
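The rule illustrated by [his], [killed] and [this] can be sketched in Python as follows. This is only an illustration under the assumption that entries are held as (word form, UW, feature pairs) tuples; `merge_entries` is a hypothetical helper, not a UNL tool.

```python
def merge_entries(entries):
    """Merge entries that share both word form and UW; keep the others apart.

    entries: iterable of (form, uw, [(attribute, value), ...]).
    Returns a dict mapping (form, uw) to the combined feature pairs,
    in first-seen order and without duplicates.
    """
    merged = {}
    for form, uw, features in entries:
        bucket = merged.setdefault((form, uw), [])
        for pair in features:
            if pair not in bucket:
                bucket.append(pair)
    return merged


entries = [
    # Same UW for both uses of "his": one merged entry.
    ("his", "00.@3.@male", [("LEX", "R"), ("POS", "SPR"), ("PER", "3PS")]),
    ("his", "00.@3.@male", [("LEX", "D"), ("POS", "POD"), ("PER", "3PS")]),
    # Different UWs for the two uses of "this": two entries remain.
    ("this", "00.@proximal", [("LEX", "R"), ("POS", "DEP"), ("NUM", "SNG")]),
    ("this", "", [("LEX", "D"), ("POS", "DEM"), ("NUM", "SNG"),
                  ("att", "@proximal")]),
]
merged = merge_entries(entries)
```

Keying on the pair (word form, UW) is what keeps the pronoun and the determiner readings of "this" apart while letting the two readings of "his" share one entry.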
Cumulative attributes
Cumulative attributes for the same entry must be given inside a single attribute "att", separated by ";":
[can't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not)<eng,255,0>;
The auxiliary [can't] is not @ability OR @not, but @ability AND @not at the same time (.@ability.@not).
[may]{}"" (LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility)<eng,255,0>;
The auxiliary [may] is either @permission OR @possibility, and not both at the same time.
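
The difference between the two notations can be made explicit with a short Python sketch. It is an illustration only: `attribute_readings` and `as_unl_suffixes` are hypothetical names, and the input is assumed to be the raw feature list found between the parentheses of an entry.

```python
def attribute_readings(feature_list: str) -> list:
    """Return one list of attributes per alternative reading.

    Each "att=..." item is a separate alternative (OR);
    attributes joined by ";" inside one item hold together (AND).
    """
    readings = []
    for item in feature_list.split(","):
        key, sep, value = item.strip().partition("=")
        if key == "att" and sep:
            readings.append(value.split(";"))
    return readings


def as_unl_suffixes(feature_list: str) -> list:
    """Render each reading as a UNL attribute suffix such as '.@ability.@not'."""
    return ["." + ".".join(r) for r in attribute_readings(feature_list)]


cant = "LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not"
may = "LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility"
```

Here `as_unl_suffixes(cant)` yields a single reading carrying both attributes, whereas `as_unl_suffixes(may)` yields two alternative readings with one attribute each.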

Numbers in the dictionary

  1. The English dictionary does not contain digits as natural language entries, but only words. The entries [one] and [first] are in the dictionary, but not the entry [1].
  2. All natural language words corresponding to numbers are associated to UWs represented by digits, because numbers are always represented in UNL by digits:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    The English word "zero" is associated to the UW "0".
  3. The English dictionary contains only simple forms (such as [one], [eleven], [twenty], [hundred]). Compound forms (such as [twenty-one]) and complex forms (such as [one hundred]) are not included in the dictionary, but are generated by the grammar.
  4. In order to handle compound and complex forms, dozens are associated to units, as in:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    Where "twenty" is associated to "2", instead of "20", because "twenty-one" is not "201", but "21".
  5. In order to handle compound and complex forms, [hundred], [thousand], [million], [billion] and [trillion] are associated to the NULL UW, as in:
    [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
    Where [hundred] is associated to the UW "", instead of "100" or "1", because "two hundred" is not "2100" or "21", but "200".
  6. The English dictionary does not contain words for ordinals, except the irregular ones ([first], [second], [third], [fifth], [eighth], [ninth] and [twelfth]). All the other ordinals are expected to be recognized by the grammar because they follow the general rule CARDINAL + "th" ("seventh" = "seven" + "th"). This is why the entry [th] is included in the dictionary:
    [th]{}"" (LEX=I,POS=SFX,LST=SBW)<eng,255,0>;
  7. In order to cope with spelling changes in the formation of ordinals ("twenty" becomes "twentieth", not "twentyth"), dozens have been represented as regular expressions:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    The regular expression /(?i)twent(y|ie)/ corresponds to both "twenty" and "twentie", which allows for the recognition of "twentieth" as "twentie"+"th". The feature (?i) means "case insensitive", i.e., the entry will match: "twenty", "Twenty", "TWENTY", "twentie", etc.
  8. The irregular ordinals included in the dictionary were associated to the corresponding UW and the attribute @ordinal, as in:
    [first]{}"1.@ordinal" (LEX=U,POS=ORD,DIGIT)<eng,255,0>;
  9. The feature DIGIT was introduced whenever the UW is a digit:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    but [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
  10. Decimals and fractions were not represented inside the dictionary because they are generated either from cardinals or ordinals.
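
The treatment of dozens and of the suffix "th" described in the list above can be illustrated with a small Python sketch. It is not the actual grammar: the toy lexicon, the function names and the padding of a bare dozen with "0" are assumptions made for the example, and multipliers such as [hundred] are left out.

```python
import re

# Toy lexicon mirroring the dictionary entries: (pattern, UW digit, is_dozen).
LEXICON = [
    (re.compile(r"(?i)one"), "1", False),
    (re.compile(r"(?i)two"), "2", False),
    (re.compile(r"(?i)seven"), "7", False),
    (re.compile(r"(?i)twent(y|ie)"), "2", True),
]


def lookup(word):
    """Return (digit, is_dozen) for a simple number word, or None."""
    for pattern, digit, is_dozen in LEXICON:
        if pattern.fullmatch(word):
            return digit, is_dozen
    return None


def split_ordinal(word):
    """Return the cardinal stem if word is CARDINAL + 'th', else None."""
    if word.lower().endswith("th"):
        stem = word[:-2]
        if lookup(stem) is not None:
            return stem
    return None


def compound_value(word):
    """Digits for a simple form or a dozen-unit compound like 'twenty-one'."""
    parts = word.split("-")
    first = lookup(parts[0])
    if first is None:
        raise KeyError(parts[0])
    digit, is_dozen = first
    if len(parts) == 1:
        # Assumption: a bare dozen is padded with a zero ("twenty" -> "20").
        return digit + "0" if is_dozen else digit
    second = lookup(parts[1])
    if len(parts) != 2 or not is_dozen or second is None or second[1]:
        raise ValueError(f"not a dozen-unit compound: {word!r}")
    # The dozen contributes its own digit, the unit fills the last position.
    return digit + second[0]
```

Because "twenty" carries the digit "2" rather than "20", `compound_value("twenty-one")` can simply concatenate the two digits, and `split_ordinal("twentieth")` finds the stem "twentie" through the same regular expression that matches "twenty".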

Default dictionary

The default dictionary, which appears at the end of the English dictionary, is actually language-independent and may be used by any language. It contains the punctuation signs and the regular expressions needed to deal with URLs, time expressions, currency expressions, Roman numerals, phone numbers and formulas in general.

Software