English dictionary

From UNL Wiki
Revision as of 20:08, 27 July 2012 by Martins

The English dictionaries are used, along with the English grammars, in the process of UNLization and NLization with IAN and EUGENE, respectively. They follow the UNL Dictionary Specs and use the tags provided in the Tagset. They are provided in two different formats:

  • The ENG-UNL (Analysis) Dictionary, to be used in UNLization (IAN), is presented in the enumerative format, i.e., it lists all word forms ("die", "dies", "dying", "died") and not only base forms.
  • The UNL-ENG (Generation) Dictionary, to be used in NLization (EUGENE), is presented in the generative format, i.e., it lists only base forms ("die") along with the instructions to generate the corresponding inflections.


Dictionary Entry Structure

According to the UNL Dictionary Specs, entries are provided in the following format:

[Natural Language Entry]{}"UW"(feature list)<eng,FREQUENCY,PRIORITY>;

Where:

  • [Natural Language Entry] is a word form ("die", "dies", "dying", "died") in the ENG-UNL (Analysis) Dictionary and a base form ("die") in the UNL-ENG (Generation) Dictionary.
  • "UW" may be an ordinary UW ("book"); a pro-form ("00"), in the case of pronouns; or empty (""), in the case of natural language entries, such as determiners and prepositions, that are not associated with any UW.
  • The feature list is a list of attribute-value pairs as defined in the Tagset.
  • FREQUENCY is used only in the ENG-UNL dictionary: higher frequency words prevail over other candidates with the same length during the tokenization of the natural language sentence.
  • PRIORITY is used only in the UNL-ENG dictionary: higher priority words prevail over other candidates during the tokenization of the UNL graph.
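
The entry format above can be read mechanically. The following is a minimal sketch in Python, not part of the UNL tools: the names ENTRY_RE, Entry, parse_entry and best_candidate are hypothetical, the two [book] entries and their frequencies are invented for illustration, and preferring longer candidates before comparing FREQUENCY is an assumption (the text above only states that frequency decides between candidates of the same length).

```python
import re
from typing import NamedTuple

# One entry per line: [NLW]{}"UW"(feature list)<lang,FREQUENCY,PRIORITY>;
ENTRY_RE = re.compile(
    r'^\[(?P<nlw>.*?)\]\{\}"(?P<uw>[^"]*)"\s*'
    r'\((?P<features>.*)\)\s*'
    r'<(?P<lang>\w+),(?P<freq>\d+),(?P<prio>\d+)>;$'
)


class Entry(NamedTuple):
    nlw: str        # natural language entry (word form or base form)
    uw: str         # UW, pro-form, or "" for the NULL UW
    features: str   # raw feature list, left unparsed here
    lang: str
    frequency: int
    priority: int


def parse_entry(line: str) -> Entry:
    """Split one dictionary line into its fields."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    return Entry(m["nlw"], m["uw"], m["features"], m["lang"],
                 int(m["freq"]), int(m["prio"]))


def best_candidate(candidates: list[Entry]) -> Entry:
    """Pick a candidate during analysis.

    Assumption: longer entries are preferred; among entries of the
    same length, the higher FREQUENCY wins.
    """
    return max(candidates, key=lambda e: (len(e.nlw), e.frequency))


the = parse_entry('[the]{}"" (LEX=D,POS=ART,att=@def)<eng,255,0>;')
print(the.nlw, repr(the.uw), the.frequency)

# Hypothetical entries, invented to show the frequency tie-break.
noun = parse_entry('[book]{}"book"(LEX=N,POS=NOU,NUM=SNG)<eng,10,0>;')
verb = parse_entry('[book]{}"book"(LEX=V,POS=VER,ATE=INF)<eng,5,0>;')
print(best_candidate([verb, noun]).features)
```

The greedy match on the feature list is deliberate: it lets nested parentheses, as in FLX(...), stay inside the features field.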

The features used in the dictionary are the following:

Feature structure of the ENG-UNL Dictionary

Class                               ENG-UNL               UNL-ENG
Common nouns and proper names (N)   LEX,POS,NUM           LEX,POS,NUM,PAR,FLX
Adjectives (J)                      LEX,POS               LEX,POS
Adverbs (A)                         LEX,POS,att           LEX,POS,att
Verbs (V)                           LEX,POS,TRA,ATE,PER   LEX,POS,TRA,PAR,FLX
Conjunctions (C)                    LEX,POS,att,rel       LEX,POS,att,rel
Determiners (D)                     LEX,POS,att           LEX,POS,att
Prepositions (P)                    LEX,POS,att,rel       LEX,POS,att,rel
Pronouns (R)                        LEX,POS,CAS,PER,NUM   LEX,POS,CAS,PER,NUM
Numerals (U)                        LEX,POS               not represented

The tags are defined in the Tagset:
  • LEX is the lexical category (LEX=N, for nouns; LEX=J, for adjectives; LEX=V, for verbs; LEX=A, for adverbs; etc.)
  • POS is the part-of-speech (POS=NOU, for common nouns; POS=PPN, for proper names; etc.)
  • NUM is the grammatical number (NUM=SNG, for singular; NUM=PLR, for plural; etc.)
  • TRA is the transitivity (TRA=NTST, for intransitive verbs; TRA=TSTD, for direct transitive verbs; etc.)
  • PER is the grammatical person (PER=1PS, for first person singular; PER=2PS, for second person singular; etc.)
  • ATE is the absolute tense (ATE=PAS, for past; ATE=PRS, for present; etc.)
  • CAS is the morphological case (CAS=NOM, for nominative; CAS=OBL, for oblique)
att (attribute)
The attribute "att", along with its corresponding value, is used when the word is associated with an attribute (instead of a UW). For instance:
[the]{}"" (LEX=D,POS=ART,att=@def)<eng,255,0>;
In this case, the English word "the" does not correspond to any UW, but to the attribute "@def".
rel (relation)
The attribute "rel", along with its corresponding value, is used when the word is associated with a relation (instead of a UW). For instance:
[of]{}"" (LEX=P,POS=PRE,rel=mod)<eng,255,0>;
In this case, the English word "of" does not correspond to any UW, but to the relation "mod".
Some entries may have both "att" and "rel". For instance:
[under]{}"" (P,rel=plc,att=@under)<eng,255,0>;
In this case, the English word "under" corresponds both to the relation "plc" (place) and the attribute "@under".
PAR (paradigm)
The attribute "PAR" is used only for inflectional words (nouns and verbs, in English) and indicates the paradigm number. It is used only in generation (UNL-ENG Dictionary). PAR=M0 always indicates invariant words ("glasses", for instance). PAR=M1 always indicates irregular words ("foot", for instance). The other values (M2, M3, M4, ...) are indexed to the inflectional paradigms provided in the UNLarium, which are exported in order to compose the morphological module of the grammar (see English grammar).
FLX (inflectional rules)
The attribute "FLX" is used only in the case of irregular words (i.e., when PAR=M1). It is used only in generation (UNL-ENG Dictionary). It provides the rules to inflect words that do not follow an inflectional paradigm. These rules are specified inside the dictionary entry itself (see UNL Dictionary Specs):
[foot]{}"foot"(LEX=N,POS=NOU,PAR=M1,FLX(SNG:=0>"";PLR:="feet";))<eng,0,0>;
The values of the attribute FLX specify how to create the inflections for SNG (singular) and PLR (plural) for this particular irregular word.
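
As an illustration of how such rules might be applied, here is a minimal Python sketch. It is not the EUGENE implementation. It assumes that a rule of the form 0>"x" appends the suffix "x" to the base form and that a bare quoted string (as in PTP:="been" in the entry for "be") replaces the whole form; other rule types described in the UNL Dictionary Specs are not covered. The names parse_flx and inflect, and the word "book" used in the example, are hypothetical.

```python
import re

SUFFIX_RE = re.compile(r'^0>"(?P<suffix>[^"]*)"$')  # e.g. 0>"s"  (assumed: append suffix)
FULL_RE = re.compile(r'^"(?P<form>[^"]*)"$')        # e.g. "been" (whole-form replacement)


def parse_flx(body: str) -> dict[str, str]:
    """Turn 'SNG:=0>"";PLR:=0>"s";' into {'SNG': '0>""', 'PLR': '0>"s"'}.

    Simplification: splits on ';', so a ';' inside a quoted string
    would not be handled.
    """
    rules = {}
    for part in body.split(";"):
        part = part.strip()
        if not part:
            continue
        tag, rule = part.split(":=", 1)
        rules[tag] = rule
    return rules


def inflect(base: str, rule: str) -> str:
    """Apply one rule to a base form under the assumptions stated above."""
    m = SUFFIX_RE.match(rule)
    if m:
        return base + m["suffix"]
    m = FULL_RE.match(rule)
    if m:
        return m["form"]
    raise ValueError(f"unsupported rule: {rule!r}")


rules = parse_flx('SNG:=0>"";PLR:=0>"s";')
print(inflect("book", rules["SNG"]), inflect("book", rules["PLR"]))
print(inflect("be", '"been"'))
```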
LST (lexical structure)
The attribute LST is assigned only in case of subwords (SBW) and contractions (CTT):
[shan't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@conviction;@not)<eng,255,0>; (contraction)
['ve]{}"" (LEX=V,POS=AUX,ATE=INF,LST=SBW)<eng,255,0>; (subword)
Pronouns are associated with the pro-form UW 00 with the corresponding attributes, whenever applicable:
[I]{}"00.@1" (LEX=R,POS=PPR,CAS=NOM,PER=1PS)<eng,255,0>;
The English word "I" is associated to the UW "00.@1" (the pro-form for first person singular) and not only to "00", which is simply a pro-form.
Pronouns and determiners
It is important to differentiate pronouns from determiners. The former behave as nouns and are associated with the UW 00 with the corresponding attributes; the latter are associated with the NULL UW. Compare:
This is a book. ("this" is a demonstrative pronoun, to be represented as [this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;)
This book ("this" is a demonstrative determiner, to be represented as [this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;)
This difference is very important, regardless of the conventions adopted in the different grammar traditions, because these two instances of "this" are represented differently in UNL.
Possessive pronouns and possessive determiners, because they are associated with the same UW, may be represented in a single entry:
This book is his. ("his" is a possessive pronoun, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
This is his book. ("his" is a possessive determiner, to be represented as [his]{}"00.@3.@male" (LEX=R,LEX=D,POS=SPR,POS=POD,PER=3PS)<eng,255,0>;)
The entry [his] has two LEX attributes and two POS attributes, one for each use of the word. They can be included in the same entry because both uses are associated with the same UW.
Multiple values for the same attributes inside the same entry
An entry may have several different values for the same attribute if all its instances are associated with the same UW:
[killed]{}"kill"(LEX=V,POS=VER,TRA=TSTD,ATE=PAS,ATE=PTP)<eng,0,0>;
The entry [killed] may have two values for the attribute ATE (absolute tense): PAS (past) and PTP (participle), because in both cases the entry is associated with the same UW "kill".
[be]{}"" (LEX=V,POS=COP,POS=AUX,ATE=INF)<eng,255,0>;
The entry [be] may have two values for the attribute POS (part-of-speech): COP (copula) and AUX (auxiliary), because in both cases the entry is associated with the same UW "".
[this]{}"00.@proximal" (LEX=R,LEX=D,POS=DEP,POS=DEM,NUM=SNG)<eng,255,0>;
The entry above is not valid: [this] may not have two values for the attributes LEX and POS, because the determiner (LEX=D,POS=DEM) and the pronoun (LEX=R,POS=DEP) are associated with different UWs. Two separate entries are required:
[this]{}"00.@proximal" (LEX=R,POS=DEP,NUM=SNG)<eng,255,0>;
[this]{}"" (LEX=D,POS=DEM,NUM=SNG,att=@proximal)<eng,255,0>;
Cumulative attributes
Cumulative attributes for the same entry must be given inside a single attribute "att", separated by ";":
[can't]{}"" (LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not)<eng,255,0>;
The auxiliary [can't] is not @ability OR @not, but @ability AND @not at the same time (.@ability.@not).
[may]{}"" (LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility)<eng,255,0>;
The auxiliary [may] is either @permission OR @possibility, and not both at the same time.
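
The difference between cumulative and alternative attributes can be made explicit by reading the feature list into a list of alternatives, each of which is a list of attributes that apply together. This is only a sketch in Python under that reading; the function name att_readings is hypothetical, and the naive split on "," would not cope with feature lists that contain FLX(...) rules.

```python
def att_readings(features: str) -> list[list[str]]:
    """Return one inner list per alternative reading of "att".

    Repeated att= features are alternatives (OR); values joined by ';'
    inside one att= feature apply together (AND).
    """
    readings = []
    for feature in features.split(","):
        name, _, value = feature.strip().partition("=")
        if name == "att":
            readings.append(value.split(";"))
    return readings


# [can't]: one reading carrying both attributes
print(att_readings("LEX=V,LST=CTT,POS=AUX,POS=MOV,att=@ability;@not"))
# [may]: two readings with one attribute each
print(att_readings("LEX=V,POS=AUX,POS=MOV,att=@permission,att=@possibility"))
```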

ENG-UNL Dictionary

The ENG-UNL dictionary is used in natural language analysis (UNLization) and is divided into three parts:

  • Corpus-specific words bring open-class words (i.e., nouns, adjectives, verbs and adverbs) appearing in the Corpus500
  • Grammar words bring closed-class words (determiners, prepositions, conjunctions and numbers) of English
  • The Default Dictionary brings punctuation signs and regular expressions to process URLs, dates and other canned structures

File

The ENG-UNL Dictionary for Corpus500 may be downloaded from ana_dic_eng.txt. The complete ENG-UNL Dictionary may be exported from the UNLarium: UNLWEB>UNLARIUM>DICTIONARY>ENGLISH>EXPORT.

Numbers in the ENG-UNL dictionary

  1. The English dictionary does not contain digits as natural language entries, but only words. The entries [one] and [first] are in the dictionary, but not the entry [1].
  2. All natural language words corresponding to numbers are associated with UWs represented by digits, because numbers are always represented in UNL by digits:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    The English word "zero" is associated to the UW "0".
  3. The English dictionary contains only simple forms (such as [one], [eleven], [twenty], [hundred]). Compound forms (such as [twenty-one]) and complex forms (such as [one hundred]) are not included in the dictionary, but are generated by the grammar.
  4. In order to handle compound and complex forms, the tens ("twenty", "thirty", etc., tagged DOZEN) are associated with their tens digit, as in:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    Where "twenty" is associated to "2", instead of "20", because "twenty-one" is not "201", but "21".
  5. In order to handle compound and complex forms, [hundred], [thousand], [million], [billion] and [trillion] are associated to the NULL UW, as in:
    [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
    Where [hundred] is associated to the UW "", instead of "100" or "1", because "two hundred" is not "2100" or "21", but "200".
  6. The English dictionary does not contain words for ordinals, except the irregular ones ([first], [second], [third], [fifth], [eighth], [ninth] and [twelfth]). All the other ordinals are expected to be recognized by the grammar because they follow the general rule CARDINAL + "th" ("seventh" = "seven" + "th"). This is why the entry [th] is included in the dictionary:
    [th]{}"" (LEX=I,POS=SFX,LST=SBW)<eng,255,0>;
  7. In order to cope with spelling changes in the formation of ordinals ("twenty" becomes "twentieth", and not "twentyth"), the tens (tagged DOZEN) have been represented as regular expressions:
    [/(?i)twent(y|ie)/]{}"2" (LEX=U,POS=CDN,DIGIT,DOZEN)<eng,255,0>;
    The regular expression /(?i)twent(y|ie)/ corresponds to both "twenty" and "twentie", which allows for the recognition of "twentieth" as "twentie"+"th". The feature (?i) means "case insensitive", i.e., the entry will match: "twenty", "Twenty", "TWENTY", "twentie", etc.
  8. The irregular ordinals included in the dictionary are associated with the corresponding UW and the attribute @ordinal, as in:
    [first]{}"1.@ordinal" (LEX=U,POS=ORD,DIGIT)<eng,255,0>;
  9. The feature DIGIT was introduced whenever the UW is a digit:
    [zero]{}"0" (LEX=U,POS=CDN,DIGIT)<eng,255,0>;
    but [hundred]{}"" (LEX=U,POS=CDN)<eng,255,0>;
  10. Decimals and fractions were not represented inside the dictionary because they are generated either from cardinals or ordinals.
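
The conventions in the list above can be illustrated with a toy composition routine. This is a sketch in Python under explicit assumptions, not the actual English grammar: LEXICON, compose and split_ordinal are hypothetical names, the entries for "two", "three" and "thirty" are assumed by analogy with the entries shown above, and only "hundred" is handled among the multipliers.

```python
import re

# Toy lexicon: UW digit plus a role. Tens carry only their tens digit,
# and "hundred" carries the NULL UW, as described above.
LEXICON = {
    "one": ("1", "unit"), "two": ("2", "unit"), "three": ("3", "unit"),
    "twenty": ("2", "dozen"), "thirty": ("3", "dozen"),
    "hundred": ("", "multiplier"),
}


def compose(number_words: str) -> str:
    """Combine simple forms into the digit string used as the UW."""
    total = 0
    for token in re.split(r"[\s-]+", number_words.lower().strip()):
        uw, role = LEXICON[token]
        if role == "unit":
            total += int(uw)
        elif role == "dozen":
            total += int(uw) * 10      # "2" for twenty stands for 20
        else:
            total = (total or 1) * 100  # "hundred" has no digit of its own
    return str(total)


# The same regular expression as in the [/(?i)twent(y|ie)/] entry.
TWENTY = re.compile(r"(?i)twent(y|ie)")


def split_ordinal(word: str):
    """Split a regular ordinal into the matched stem and the subword "th"."""
    if word.lower().endswith("th") and TWENTY.fullmatch(word[:-2]):
        return word[:-2], "th"
    return None


print(compose("twenty-one"), compose("two hundred"))
print(split_ordinal("twentieth"))
```

Keeping the tens as single digits is what makes the concatenation-like reading possible: "twenty-one" yields "21" and "two hundred" yields "200" without any entry for "20" or "100".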

Default dictionary

The default dictionary, which appears at the end of the English dictionary, is actually language-independent and may be used by any language. It contains the punctuation signs and regular expressions to deal with URLs, time expressions, currency expressions, Roman numerals, phone numbers and formulas in general.

UNL-ENG Dictionary

The UNL-ENG dictionary is used in natural language generation (NLization) and is divided into two parts:

  • Corpus-specific words bring open-class words (i.e., nouns, adjectives, verbs and adverbs) appearing in the Corpus500
  • Grammar words bring only the closed-class words of English that are inflectional (auxiliary verbs and some determiners). The other words (prepositions, conjunctions, etc) are generated directly by the grammar.

The UNL-ENG dictionary is considerably smaller than the ENG-UNL dictionary, because it contains only base forms. For instance, instead of four entries associated with the UW "arrive", it contains only one:

ENG-UNL Dictionary
[arrive]{}"arrive"(LEX=V,POS=VER,TRA=TSTI,ATE=INF)<eng,0,0>;
[arrives]{}"arrive"(LEX=V,POS=VER,TRA=TSTI,PER=3PS,ATE=PRS)<eng,0,0>;
[arrived]{}"arrive"(LEX=V,POS=VER,TRA=TSTI,ATE=PAS,ATE=PTP)<eng,0,0>;
[arriving]{}"arrive"(LEX=V,POS=VER,TRA=TSTI,ATE=GER)<eng,0,0>;
UNL-ENG Dictionary
[arrive]{}"arrive"(LEX=V,POS=VER,TRA=TSTI,PAR=M17)<eng,0,0>;

However, dictionary entries inside the UNL-ENG dictionary can be considerably longer than the ones in the ENG-UNL dictionary, because of the inflectional rules. Compare the entries for the verb "be", for instance:

ENG-UNL Dictionary
[be]{}"" (LEX=V,POS=COP,POS=AUX,ATE=INF)<eng,255,0>;
UNL-ENG Dictionary
[be]{}""(LEX=V,POS=COP,POS=AUX,PAR=M1,FLX(1PS&PRS:="am";2PS&PRS:="are";3PS&PRS:="is";1PP&PRS:="are";2PP&PRS:="are";3PP&PRS:="are";1PS&PAS:="was";2PS&PAS:="were";3PS&PAS:="was";1PP&PAS:="were";2PP&PAS:="were";3PP&PAS:="were";PTP:="been";))<eng,0,0>;
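
To show how the FLX rules of "be" might be consulted during generation, the following Python sketch reads the rule list into a table keyed by feature sets and selects the first rule whose tags are all present in the requested features. It is an illustration only: the names BE_FLX, parse_flx_table and select_form are hypothetical, "&" is assumed to mean that all joined tags must hold, and falling back to the base form when no rule applies is an assumption.

```python
# Rule list copied from the UNL-ENG entry for "be" above.
BE_FLX = (
    '1PS&PRS:="am";2PS&PRS:="are";3PS&PRS:="is";'
    '1PP&PRS:="are";2PP&PRS:="are";3PP&PRS:="are";'
    '1PS&PAS:="was";2PS&PAS:="were";3PS&PAS:="was";'
    '1PP&PAS:="were";2PP&PAS:="were";3PP&PAS:="were";'
    'PTP:="been";'
)


def parse_flx_table(body: str) -> list[tuple[frozenset, str]]:
    """Turn 'A&B:="x";...' into [({'A', 'B'}, 'x'), ...] in rule order."""
    table = []
    for part in body.split(";"):
        if not part.strip():
            continue
        tags, form = part.split(":=", 1)
        table.append((frozenset(tags.split("&")), form.strip('"')))
    return table


def select_form(table, features: set, base: str) -> str:
    """Return the form of the first rule whose tags are all in `features`."""
    for tags, form in table:
        if tags <= features:
            return form
    return base  # assumption: no matching rule means the base form is used


table = parse_flx_table(BE_FLX)
print(select_form(table, {"3PS", "PRS"}, "be"))
print(select_form(table, {"PTP"}, "be"))
```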

File

The UNL-ENG Dictionary for Corpus500 may be downloaded from gen_dic_eng.txt. The complete UNL-ENG Dictionary may be exported from the UNLarium: UNLWEB>UNLARIUM>DICTIONARY>ENGLISH>EXPORT.

Software