Introduction to UNL

From UNL Wiki

Revision as of 17:22, 14 September 2012 by Martins (Talk | contribs)


The Universal Networking Language (UNL) is a knowledge representation language that has been used in several different fields of natural language processing, such as machine translation, multilingual document generation, summarization, information retrieval and extraction, sentiment analysis and semantic reasoning.

History

The UNL Programme started in 1996, as an initiative of the Institute of Advanced Studies of the United Nations University in Tokyo, Japan. In January 2001, the United Nations University set up an autonomous organization, the UNDL Foundation, to be responsible for the development and management of the UNL Programme. The Foundation, a non-profit international organisation, has an independent identity from the United Nations University, although it has special links with the UN. It inherited from the UNU/IAS the mandate of implementing the UNL Programme. Its headquarters are based in Geneva, Switzerland.

The UNL Programme has already crossed important milestones. The overall architecture of the UNL System has been developed with a set of basic software and tools necessary for its functioning. These are being tested and improved. A vast amount of linguistic resources from the various native languages already under development has been accumulated in the last few years. Moreover, the technical infrastructure for expanding these resources is already in place, thus facilitating the participation of many more languages in the UNL system from now on. A growing number of scientific papers and academic dissertations on the UNL are being published every year.

The most visible accomplishment so far is the recognition by the Patent Co-operation Treaty (PCT) of the innovative character and industrial applicability of the UNL, which was obtained in May 2002 through the World Intellectual Property Organisation (WIPO). Acquiring the patent for the UNL is a completely novel achievement within the United Nations.

Commitments

The main goal of the UNL Programme is to construct the UNL, an artificial language that can be used to process information across language barriers. The major commitments of the UNL are the following:

I - The UNL must represent knowledge: The UNL is first and foremost a knowledge representation language. The most important corollary of this first commitment is that UNL is not a meta-language, i.e., it is not intended to describe or represent natural languages; on the contrary, it is used to represent the information conveyed by natural languages. The goal of UNL is to represent "what was meant" and not "what was said". Accordingly, the UNL provides an interpretation rather than a translation of a given utterance. The UNL version of an existing document is not bound to preserve the lexical and the syntactic choices of the original, but must represent, in a non-ambiguous format, one of its possible meanings, preferably the most conventional one.

II - The UNL must be a language for computers: The UNL is an artificial language shaped to represent knowledge in a machine-tractable format. Like other formal systems, it seeks to provide the infrastructure for computers to handle what is meant by natural languages. Unlike other auxiliary languages (such as Esperanto, Interlingua, Volapük, Ido and others), the UNL is not intended to be a human language. We do not expect people to speak UNL or to communicate in UNL. But we do expect computers to process UNL: to generate UNL documents out of natural language documents, and vice-versa, with and without human aid; to retrieve and extract information from UNL documents; and to detect paraphrases, entailments, implicatures, presuppositions, inferences, contradictions and redundancies among a set of propositions represented in UNL.

III - The UNL must be self-sufficient: In the UNL approach, there are two basic movements: UNLization and NLization. UNLization is the process of representing in UNL the information conveyed by natural language; NLization, conversely, is the process of generating a natural language document out of UNL. In order to be fully "understandable" (and manageable) by machines, the UNL, which is the result of the UNLization process, must be self-sufficient, i.e., should be as semantically complete and saturated as possible. The UNL representation must not depend on any implicit knowledge, and should explicitly codify all information. This means that the UNLization process should be completely independent from the NLization process, and vice-versa, i.e., the UNLization should not take into consideration what the target language or format of any future NLization will be; and the NLization should not need any information about the original source language or previous structure of any UNL document.

IV - The UNL must be general-purpose: At first glance, the UNL seems to be a pivot-language to which the source texts are converted before being translated into the target languages. It can, in fact, be used for such a purpose, but its primary objective is to serve as an infrastructure for handling knowledge. In addition to translation, the UNL is expected to be used in several other different tasks, such as text mining, multilingual document generation, summarization, text simplification, information retrieval and extraction, sentiment analysis etc. Indeed, in UNL-based systems there is no need for the source language to be different from the target language: an English text may be represented in UNL in order to be generated, once again, in English, as a summarized, a simplified, a localized or a simply rephrased version of the original.

V - The UNL must be independent from any particular natural language: The UNL is expected to be the language of the United Nations and, therefore, must not be tied to any existing natural language in particular, at the risk of being rejected by the member states of the General Assembly.

Assumptions

1. Languages convey information about the world: The very basic assumption of the UNL approach is that one of the most outstanding uses of natural languages is to convey information, i.e., that natural languages can be used to represent what we know about the world. This aboutness of natural languages, i.e., their representational role, is the main object of the UNL, which is expected, not to do what natural languages do, but to represent what they represent.

2. Information can be represented by semantic networks: The UNL assumes that any information conveyed by natural language can be formally and usefully represented by a semantic network. This idea is not new. Semantic networks have been used in knowledge representation at least since Charles S. Peirce, and as an interlingua for machine translation since the 1950s. In the UNL approach, this semantic network (or UNL graph) is made of three different types of discrete semantic entities: Universal Words, relations and attributes. Universal Words stand for objects (concepts), which figure as nodes or hyper-nodes in the graph. Relations are semantic cases (thematic roles) - such as agent, instrument, place etc. - holding between two objects; they are always binary and directed, and figure as arcs between nodes, defining the structure of the graph. Attributes play a role very similar to that of quantifiers in first-order predicate calculus: they are operators that bind objects ranging over a domain of discourse. The only difference is that they are not limited to quantification, but represent any type of specification, such as tense, aspect, modality etc. This three-layered representation model is the cornerstone of the UNL, and its most distinctive feature over other semantic networks, which normally propose only two levels: edges and vertices. As an example, the English sentence "Dogs bite" could be represented, in UNL, as:

Simplified UNL for 'Dogs bite'^[1]
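
The three layers described above (Universal Words, relations and attributes) can be sketched as a small data structure. This is only an illustration built from note [1]: the class names UW and Relation, the function dogs_bite and the printed notation "agt(bite, dog.@generic)" are assumptions made for this sketch, not the syntax defined by the UNL Specs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UW:
    """A Universal Word (a node) together with the attributes that specify it."""
    headword: str
    attributes: frozenset = frozenset()

    def __str__(self) -> str:
        # Illustrative notation only: attributes are appended to the headword.
        suffix = "".join(f".{a}" for a in sorted(self.attributes))
        return f"{self.headword}{suffix}"


@dataclass(frozen=True)
class Relation:
    """A binary, directed arc between two Universal Words."""
    label: str
    source: UW
    target: UW

    def __str__(self) -> str:
        return f"{self.label}({self.source}, {self.target})"


def dogs_bite() -> list[Relation]:
    """Build the graph described in note [1] for the sentence 'Dogs bite'."""
    bite = UW("bite")
    dog = UW("dog", frozenset({"@generic"}))  # @generic is assigned to "dog"
    return [Relation("agt", bite, dog)]       # "dog" is the agent of "bite"


if __name__ == "__main__":
    for relation in dogs_bite():
        print(relation)  # agt(bite, dog.@generic)
```

Keeping attributes on the node rather than on the arc mirrors the description above: relations define the structure of the graph, while attributes only specify the objects they are attached to.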

3. Universality is a scalar quantity: The UNL assumes that any information conveyed by natural languages is translatable, i.e., that natural languages differ, not in their power to express information, but in the way they do that. This means that there should be a sort of common semantic denominator between languages that ensures their intertranslatability. There are two approaches to this hypothesis: the weak one holds that this denominator varies according to the different language pairs; the strong one states that, above and beyond this variation, there is a common denominator to all languages, a set of semantic universals or primitives that could be derived from the fact that humans share the same underlying biological infrastructure for perceiving and categorizing the world. The UNL assumes that universality is a scalar quantity, i.e., that Universal Words, despite the name, may range from absolutely universal to absolutely local, and even temporary. "Absolutely universal" UWs are shared by all languages. They constitute the deep underlying vocabulary which is specified (by attributes) or combined (through relations) in order to form more complex semantic units, some of which are considerably stable and widespread, to the point of being already lexicalized (i.e., consolidated as a single indivisible unit) in several languages. These complex structures, although potentially customary, cannot be said to be absolutely universal, because they do not exist in all languages. But many of them - such as the UW corresponding to the English verb "to kill" - are quite extensive, and may be considered "relatively universal". Some others - such as the UW corresponding to the concept of "to execute someone by suffocation so as to leave the body intact and suitable for dissection" - are "weakly universal" or "relatively local", in the sense that they are lexicalized in no more than a couple of languages.
Others - such as the UW representing "a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time" - are "absolutely local" (i.e., fully language-dependent), and do not have any cross-language validity as a single concept (although they may obviously be represented in any language). Finally, there are complex semantic structures - such as "women that normally wear red hats and white shoes in big theaters" - that are not lexicalized in any language and are rather temporary. All these entities convey information and are expected to be represented in UNL, but at different levels of semantic granularity: as simple or complex semantic entities. The main implication of this assumption is that the UNL graph is actually a hyper-graph, comprising hyper-nodes and hyper-relations.
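
The hyper-graph idea can be pictured as a node whose content is itself a graph. The following is a minimal sketch under stated assumptions: the names HyperNode and depth, the relation labels "obj" and "mod", and the verb "arrive" are hypothetical choices made for illustration, and only a fragment of the "women that wear red hats" example is modelled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class UW:
    """A simple semantic entity: a single Universal Word."""
    headword: str


@dataclass(frozen=True)
class HyperNode:
    """A complex semantic entity: a node that contains a graph of its own."""
    relations: tuple


Node = Union[UW, HyperNode]


@dataclass(frozen=True)
class Relation:
    """A binary, directed arc; either end may be a simple node or a hyper-node."""
    label: str
    source: Node
    target: Node


def depth(node: Node) -> int:
    """Nesting depth: 0 for a simple UW, 1 + the deepest member for a hyper-node."""
    if isinstance(node, UW):
        return 0
    return 1 + max(
        (max(depth(r.source), depth(r.target)) for r in node.relations),
        default=0,
    )


# A temporary, non-lexicalized concept: "women that wear red hats".
wear, woman, hat, red = UW("wear"), UW("woman"), UW("hat"), UW("red")
inner = HyperNode((
    Relation("agt", wear, woman),  # women are the agent of wearing
    Relation("obj", wear, hat),    # hypothetical label for the thing worn
    Relation("mod", hat, red),     # hypothetical label for the modifier
))

# The complex concept fills a slot in a larger graph as if it were one node.
outer = HyperNode((Relation("agt", UW("arrive"), inner),))
```

Here depth(inner) is 1 and depth(outer) is 2, which is one way to read "different levels of semantic granularity": a simple entity sits at depth 0, and each level of nesting adds one.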

Issues

Figurative Language: As a non-ambiguous formal system, the UNL is always literal, i.e., fully compositional. UNL expressions must derive their semantic value thoroughly from their components, which must be explicitly defined in the UNL Knowledge Base. Accordingly, the UNL does not allow for any figure of speech, such as metaphor and metonymy. Schemes and tropes must be represented, in UNL, by their intended meaning. A sentence such as "John devoured thousands of books", for instance, must be represented, in UNL, as "John read many books eagerly"^[2].
Speech Acts: As a knowledge representation language, the UNL is not expected to perform speech acts (such as promises, requests, orders etc.), but only to describe them in a constative manner. For instance, given a performative utterance such as "Can you pass me the salt?", the role of the UNL is to represent "you pass the salt to me" and to indicate that this was a polite request^[3]. The UNL representation itself will not be a request, nor will it be bound to provoke the same (perlocutionary) effect caused by the original utterance.
Anaphora and ellipsis: As a fully-explicit semantic system, the UNL is not expected to have ellipses or pro-forms, except when the referent is not present in the document (exophora). A sentence such as "The monkey took the banana and ate it" must be represented, in UNL, as "[The monkey]_i took [the banana]_j and [the monkey]_i ate [the banana]_j".
Ambiguity: The UNL is not expected to have any ambiguity, at any level. The sentence "The girls saw the boy with the telescope" must be represented, in UNL, in such a way that there is no ambiguity concerning the meaning of "saw" (past tense of "to see" vs. present tense of "to saw") or the dependency relations of "with the telescope" ("saw with the telescope" vs. "the boy with the telescope").
Redundancy: The UNL is not expected to have any redundancy. Expressions such as "free gift", "round circle" and "murder to death" are expected to be represented, in UNL, as "gift", "circle" and "murder", respectively. Likewise, sentences such as "Peter killed Mary", "Peter murdered Mary", "It's Peter who killed Mary" and "Mary was killed by Peter" are expected to be represented in UNL in the same way^[4].
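
Two of the issues above, anaphora and redundancy, can be illustrated with hand-built graphs. This is a rough sketch, not a UNLization procedure: the helper names make_graph, content and same_content are hypothetical, the relation labels "obj" and "and" are assumed for illustration, and the placement of @passive and @topic on particular nodes is a guess based on note [4].

```python
from __future__ import annotations

# Attributes that note [4] describes as optional markers of how something was said.
OPTIONAL_ATTRIBUTES = frozenset({"@topic", "@passive"})


def make_graph(relations, attributes=None):
    """Bundle (label, source, target) triples with per-node attribute sets."""
    return {
        "relations": frozenset(relations),
        "attributes": {n: frozenset(a) for n, a in (attributes or {}).items()},
    }


def content(graph):
    """Return the relations plus whatever attributes remain once optional ones are dropped."""
    kept = {}
    for node, attrs in graph["attributes"].items():
        remaining = attrs - OPTIONAL_ATTRIBUTES
        if remaining:
            kept[node] = remaining
    return graph["relations"], kept


def same_content(a, b) -> bool:
    """True when two graphs differ only in optional attributes."""
    return content(a) == content(b)


# Redundancy: paraphrases share one representation.
kill = [("agt", "kill", "Peter"), ("obj", "kill", "Mary")]
active = make_graph(kill)  # "Peter killed Mary"
passive = make_graph(kill, {"kill": {"@passive"}, "Mary": {"@topic"}})  # "Mary was killed by Peter"

# Anaphora: "The monkey took the banana and ate it" has no node for "it";
# both verbs point at the same "monkey" and "banana" nodes.
monkey = make_graph([
    ("agt", "take", "monkey"), ("obj", "take", "banana"),
    ("agt", "eat", "monkey"), ("obj", "eat", "banana"),
    ("and", "take", "eat"),  # hypothetical label joining the two events
])
nodes = {n for _, source, target in monkey["relations"] for n in (source, target)}

if __name__ == "__main__":
    print(same_content(active, passive))
    print(sorted(nodes))
```

Comparing graphs only after stripping the optional attributes reflects the point made in note [4]: the difference between the active and the passive sentence may be recorded, but it is not part of "what was meant".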

Structure

The structure of the UNL is defined by the UNL Specs. The UNL Specs specify the structure of a UNL document; the syntax of a UNL graph; the syntax of Universal Words; the set of relations; the set of attributes; and all the information concerning UNL as a formalism.

Notes

[1] In this graph, "dog" and "bite" are Universal Words; "agt" (agent) is a relation; and "@generic" is an attribute assigned to "dog".
[2] The information that this content has been conveyed through figurative language can be indicated by the corresponding attributes (@metaphor, @hyperbole, etc.), but this is rather optional, because the goal of UNL is to represent "what was meant" and not "what was said" or "how it was said".
[3] This can be done by the use of the attributes @polite and @request.
[4] The differences between them can be represented by attributes such as @topic and @passive, but, once again, this is rather optional.
