Universal Words
Revision as of 17:35, 6 September 2012
Universal Words, or simply UWs, are the words of UNL, and correspond to the nodes (to be interlinked by relations and specified by attributes) in a UNL graph. They are labels for relatively stable units of knowledge (the concepts) that can be associated with natural language open lexical categories (noun, verb, adjective and adverb). The set of UWs is relatively open, and includes permanent UWs (those listed in the UNL Dictionary) and temporary UWs. Permanent UWs are defined in the UNL Knowledge Base and exemplified in the UNL Example Base, which are the lexical databases of the UNL.
Types of UWs
Permanent UWs and Temporary UWs
UWs can be permanent or temporary.
- Permanent UWs
- Permanent UWs are included in the UNL Dictionary and correspond to concepts of common use (common nouns, adjectives, adverbs and verbs). They can be simple, compound or complex (see below).
- Temporary UWs
- Temporary UWs are words that:
- Represent new concepts or entities and are still candidates to be included in the UNL Dictionary ("Barack Obama", "Twitter");
- Are too specific to be included in the UNL Dictionary ("Universal Networking Digital Language Foundation", "Léon Werth"); or
- Are not translatable ("3.14159", "H2O", "www.undlfoundation.org");
- Temporary UWs are always represented between "double quotes", and follow the spelling practices of the source language (concerning, for instance, capitalization). For the time being, they are also expected to be transliterated into Roman characters.
Simple UWs, Compound UWs and Complex UWs
Permanent UWs can be simple, compound or complex.
- Simple UWs
- A simple UW is an isolated node in the UNL graph. It is represented either as an integer or as a unique character string split into two parts: a root and a suffix. The root can be a word or a multiword expression; the suffix, which is always introduced by a UNL relation, is used to disambiguate the root, i.e., to assign uniqueness to the UW.
UNL Representation | NL Representation |
---|---|
104379964 | table(icl>furniture), table(icl>mobilier), mesa(icl>mobiliario), Tisch(icl>Möbel), стол(icl>мебель), ... |
- Compound UWs
- A compound UW is an isolated node combined with attributes. It is used when the concept can be fully derived from the combination of an existing simple UW and a UNL attribute, such as the concept conveyed by the English word "bigger", which can be represented simply as the UW corresponding to "big" ("301382086" or "big(aoj>thing)") specified by the degree attribute "@more": 301382086.@more or big(aoj>thing).@more.
- Complex UWs
- A complex UW is a hyper-node, i.e., a sub-graph inside the UNL graph. As graphs, complex UWs follow the structure defined for UNL Sentences. They are used when the concept can be fully derived from the combination of existing UWs, relations and attributes, as in the case of the English multiword expression "pay attention", which could be represented, for instance, as obj(200732224,105702275) or obj(pay(agt>person,obj>thing),attention(icl>faculty)).
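The compound notation described above (a simple UW followed by one or more attributes) can be taken apart mechanically. The Python sketch below is only an illustration: the function name `split_compound` is hypothetical, and it assumes that every attribute is introduced by ".@" and that several attributes may be chained one after another, which the text above does not state explicitly.

```python
from typing import List, Tuple


def split_compound(uw: str) -> Tuple[str, List[str]]:
    """Split a compound UW into its simple UW and its attribute names.

    Assumption: attributes are appended as ".@name" and may be chained.
    A UW without attributes is returned with an empty attribute list.
    """
    base, *attributes = uw.split(".@")
    return base, attributes


# The two notations for "bigger" given above yield the same attribute.
print(split_compound("301382086.@more"))
print(split_compound("big(aoj>thing).@more"))
```

A simple UW such as 104379964 passes through unchanged with an empty attribute list, so the same function can be applied to simple and compound UWs alike.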
Principles
- Sense
- UWs represent sense and not reference. UWs are related to the intension (sense) rather than to the extension (reference) of linguistic expressions. The expressions "morning star" and "evening star", which are said to have the same reference (the planet Venus), must necessarily be represented by different UWs, because they convey different "modes of presentation" of the same object, i.e., have different senses: "the last star to disappear in the morning" and "the first star to appear in the evening", respectively.
- Productivity
- UWs must correspond to and only to contents lexicalized by natural language open lexical categories (nouns, verbs, adjectives and adverbs). Any other semantic content (such as the ones conveyed by articles, prepositions, conjunctions etc.) should be represented as attributes or relations. This criterion is not language-biased: if a given semantic value proves to be conveyed, in any language, by a closed class, it should not be represented as a UW, regardless of its realisation in other languages.
- Compositionality
- Simple UWs must correspond to and only to contents expressed by non-compositional lexical items, i.e., words and multiword expressions that cannot be fully reduced to the combination of existing UWs, attributes and relations.
- Arbitrariness
- Simple UWs are names (and not definitions) for senses. The simple UW does not bring much (or any) information about its sense. It is just a label. Any information concerning the sense is expected to be provided by the three different lexical databases available inside the UNL framework: the UNL Dictionary, the UNL Knowledge Base and the UNL Example Base.
- Universality
- Permanent UWs may have three different levels of universality and are stored in three nested lexical databases:
- The UNL Core Dictionary contains only permanent simple UWs that are (presumably) shared by all languages
- The UNL Abridged Dictionary contains all permanent UWs (simple, compound or complex) that are shared by at least two different language families
- The UNL Unabridged Dictionary contains all permanent UWs (simple, compound or complex) that are lexicalized in at least one language
Formal syntax
The syntax for UWs is defined as follows:
UNL REPRESENTATION
<PERMANENT UW> ::= <integer>
<TEMPORARY UW> ::= """<ASCII character>+"""

NL REPRESENTATION
<TEMPORARY UW> ::= """<UTF-8 character>+"""
<PERMANENT UW> ::= <root>[<suffix>]
<root> ::= <UTF-8 character>+
<suffix> ::= "(" <suffix> [ "," <suffix> ]… ")" | <relation> { ">" , "<" } <root>
<relation> ::= {"agt", "and", "aoj", ...}
where:
+ to be repeated 1 or more times
< > variable
" " terminal symbol
::= ... is defined as ...
| or
[ ] optional element
{ } alternative element
... to be repeated more than 0 times
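As a rough illustration of the grammar above, the following Python sketch classifies a string in UNL representation and splits a permanent UW in NL representation into its root and suffix constraints. The function names and regular expressions are assumptions made for this example and are not part of the UNL specification. Because the list of relations is left open in the grammar, the sketch accepts any lowercase alphabetic label as a relation, and it handles only a single, non-nested pair of parentheses.

```python
import re

# Assumption: a permanent UW in UNL representation is a run of decimal digits.
_PERMANENT = re.compile(r"[0-9]+")
# Assumption: a temporary UW is one or more ASCII characters in double quotes.
_TEMPORARY = re.compile(r'"[\x00-\x7F]+"')
# Root, optionally followed by one parenthesised, non-nested suffix.
_NL_UW = re.compile(r"(?P<root>[^()]+)(?:\((?P<suffix>[^()]*)\))?")
# One constraint: relation label, direction sign, target root.
_CONSTRAINT = re.compile(r"(?P<rel>[a-z]+)(?P<dir>[<>])(?P<target>[^<>,()]+)")


def classify_unl(uw):
    """Return 'permanent', 'temporary' or 'invalid' for a UNL-representation UW."""
    if _PERMANENT.fullmatch(uw):
        return "permanent"
    if _TEMPORARY.fullmatch(uw):
        return "temporary"
    return "invalid"


def split_nl(uw):
    """Split an NL-representation permanent UW into (root, constraints).

    Each constraint is a (relation, direction, target) tuple.
    Raises ValueError if the string does not fit the simplified pattern.
    """
    match = _NL_UW.fullmatch(uw)
    if match is None:
        raise ValueError("not a root with an optional suffix: %r" % uw)
    constraints = []
    if match.group("suffix") is not None:
        for part in match.group("suffix").split(","):
            item = _CONSTRAINT.fullmatch(part.strip())
            if item is None:
                raise ValueError("malformed suffix element: %r" % part)
            constraints.append((item["rel"], item["dir"], item["target"]))
    return match.group("root"), constraints


print(classify_unl("104379964"))
print(classify_unl('"Barack Obama"'))
print(split_nl("pay(agt>person,obj>thing)"))
```

Under these assumptions a root without a suffix, such as writer, is returned with an empty constraint list, while an empty suffix such as table() is rejected.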
Semantics
The basic assumption of the UNL approach is that the information conveyed by natural languages can be formally represented through three different types of semantic units: concepts, concept specifiers and binary relations between concepts. This three-layered representation model is the cornerstone of UNL and the feature that most distinguishes it from other semantic networks, which normally propose only two levels: edges and vertices. Nevertheless, it poses several problems for UNLization, as the distinction between what is supposed to be represented by each unit is not always clear.
The main difficulty concerns what is to be represented as a concept (and therefore as a UW) and what is to be represented as a relation between concepts. How many concepts (UWs) are there, for instance, in the sentence "Charles Dickens was the author of Oliver Twist"? Should "author" be represented as a concept or as a relation between "Charles Dickens" and "Oliver Twist"? Should the verb "to be" be represented as a concept or as a relation between "Charles Dickens" and "author"? Should the preposition "of" be represented as a concept or as a relation between "author" and "Oliver Twist"?
In order to avoid what could be an endless discussion, the UNL assumes that UWs must correspond to and only to concepts referred to by natural language open lexical categories (noun, verb, adjective and adverb). Any other semantic content (such as that conveyed by articles, prepositions, conjunctions etc.) should be represented either as attributes or as relations. This criterion is not language-biased: if a given semantic value proves to be conveyed, in any language, by a closed class, it should not be represented as a UW, regardless of its realisation in other languages.
Categories of UWs
Permanent UWs are classified in four different categories, depending on their semantic values:
It should be stressed that these categories are semantic rather than syntactic or morphological. They are related to the UWs and are not oriented to any particular language. In that sense, adjectival UWs (such as "300217728" = "delighting the senses or exciting intellectual or emotional admiration") tend to be associated with English adjectives ("beautiful"), but they can also be realised as prepositional phrases ("with beauty"), verbal phrases ("possessing beauty"), etc.
Additionally, it should be emphasized that the set of UWs is not derived from any particular language. In that sense, there will be many UWs that do not correspond to a single lexical item and will have to be represented by periphrases. The concept "a state of torment created by the sudden sight of one's own misery", for instance, is lexicalized in Czech ("litost"), but not in English. In principle, the set of UWs, which is the UNL Dictionary, is supposed to be as comprehensive as the set of individual concepts depicted by different cultures, no matter how specific they are. In that sense, UWs are not to be considered semantic primitives, nor should they represent only common concepts. They must include culture-dependent information and every relevant variation among similar concepts. Furthermore, the UNL Dictionary constitutes an open set, subject to permanent growth with new UWs, as UNL is supposed to continually incorporate new cultures and cultural changes.
FAQs
- Are proper names represented as permanent or temporary UWs?
- The difference between permanent and temporary UWs has not been stated formally. Most named entities, for instance, are represented as temporary UWs, because it would not be feasible to include them all in the UNL Dictionary. Nevertheless, some named entities in widespread use (such as "England", "William Shakespeare", "Romeo and Juliet", "Romeo" etc.) have already been included in the UNL Dictionary and are treated as permanent UWs. Our current criterion is Wikipedia: if a proper name has an entry in Wikipedia, then it should be defined as a permanent UW and included in the UNL Unabridged Dictionary.
- How to decide whether a concept must be represented as a simple, a compound or a complex UW?
- The main criterion is the Principle of Compositionality (Frege's Principle), i.e., that the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them. Compound and complex UWs are analytical representations of the meaning structure, and must be used if and only if the meaning of the whole can be derived from the meaning of its parts. Otherwise, you should use simple UWs.
- In principle, all concepts could be represented in any of these three formats. Consider, for instance, the concept "one who writes", which is lexicalized in English by the word "writer". This concept could be represented in UNL as:
- "writer" (Simple UW), i.e., a single lexical unit;
- "writ.@agent" (Compound UW), i.e., as a simple UW and an attribute corresponding to the morphological structure of the word ("writ+er"); or
- "agt(write,person.@topic)" (Complex UW), i.e., as a UNL graph corresponding to the definition ("one who writes")
- These differences do not pose any practical restrictions to the UNL representation. For instance, the English noun phrase "good writer" could be represented in UNL as:
- mod(writer, good) ("one who writes" as a Simple UW)
- mod(writ.@agent, good) ("one who writes" as a Compound UW)
- mod(:01,good)agt:01(write,person.@topic) ("one who writes" as a Complex UW)
- In the same way, these differences do not pose any restrictions on the resources (dictionaries and grammars). For instance, the French dictionary could contain:
- [écrivain]{} "writer" (LEX=N,POS=NOU,GEN=MCL,NUM=SNG)<fra,0,0>; ("one who writes" as a Simple UW)
- [écrivain]{} "writ.@agent" (LEX=N,POS=NOU,GEN=MCL,NUM=SNG)<fra,0,0>; ("one who writes" as a Compound UW)
- [écrivain]{} "agt(write,person.@topic)" (LEX=N,POS=NOU,GEN=MCL,NUM=SNG)<fra,0,0>; ("one who writes" as a Complex UW)
- But these differences do have semantic consequences: a simple UW represents a concept seen as a single unit, whereas compound and complex UWs are strictly compositional, i.e., the meaning of the UW is entirely derived from its components. For instance, the word "writer", in English, is normally associated with an occupation. Although this feature ("as an occupation") could be taken for granted in the simple UW, it should be made explicit in a compound or a complex UW.