Introduction to UNL

From UNL Wiki

Revision as of 18:09, 21 August 2012

The Universal Networking Language (UNL) is a knowledge representation language that has been used in several different fields of natural language processing, such as machine translation, multilingual document generation, summarization, information retrieval and extraction, sentiment analysis and semantic reasoning.

History

The UNL Programme started in 1996, as an initiative of the Institute of Advanced Studies of the United Nations University in Tokyo, Japan. In January 2001, the United Nations University set up an autonomous organization, the UNDL Foundation, to be responsible for the development and management of the UNL Programme. The Foundation, a non-profit international organisation, has an independent identity from the United Nations University, although it has special links with the UN. It inherited from the UNU/IAS the mandate of implementing the UNL Programme. Its headquarters are based in Geneva, Switzerland.

The UNL Programme has already crossed important milestones. The overall architecture of the UNL System has been developed with a set of basic software and tools necessary for its functioning. These are being tested and improved. A vast amount of linguistic resources from the various native languages already under development has been accumulated in the last few years. Moreover, the technical infrastructure for expanding these resources is already in place, thus facilitating the participation of many more languages in the UNL system from now on. A growing number of scientific papers and academic dissertations on the UNL are being published every year.

The most visible accomplishment so far is the recognition by the Patent Cooperation Treaty (PCT) of the innovative character and industrial applicability of the UNL, which was obtained in May 2002 through the World Intellectual Property Organization (WIPO). Acquiring the patent for the UNL is a completely novel achievement within the United Nations.

Commitments

The main goal of the UNL Programme is to construct the UNL, an artificial language that can be used to process information across language barriers. The major commitments of the UNL are the following:

I - The UNL must represent knowledge
The UNL is an artificial language designed to represent knowledge. In this sense, the UNL is first and foremost a knowledge representation language. The most important corollary of this first commitment is that the UNL is not intended to describe or represent natural languages. It is used to describe and represent the information conveyed by natural languages. The goal of the UNL is to represent "what is known" and not "what was said" or "how it was said". Accordingly, the UNL is not committed to preserving the lexical and structural choices of the original, but must represent, in a non-ambiguous format, one of its possible meanings, preferably the most conventional one. For instance, given the sentence "The present King of France is bald.", the UNL representation should state that there are entities such as kings and France, that there are properties such as bald, that France may have a king, that kings can be bald, and that the present king of France is bald.
II - The UNL must be language-independent
The linguistic neutrality of the UNL is one of its strongest commitments and must be understood in two different senses: the political and the technical. Politically, the UNL is expected to be the language of the United Nations and, therefore, must not be tied to any existing natural language in particular, at the risk of being rejected by the member states of the General Assembly. Technically, the UNL document must be independent from both the source and the target languages. In the UNL approach, there are two basic movements: UNLization and NLization. UNLization is the process of representing the information conveyed by natural language in UNL; NLization, conversely, is the process of generating a natural language document out of UNL. These processes should be completely independent, i.e., the UNLization should not take into consideration what the target language of any future NLization will be; and the NLization should not need any information about the original source language of the UNLization. This means that the UNL representation must be language-independent and, for that reason, should be as semantically complete and saturated as possible.
III - The UNL must be general-purpose
At first glance, the UNL seems to be an "interlingua", a sort of pivot-language to which the source texts are converted before being translated into the target languages. It can, in fact, be used for such a purpose, but its primary objective is to serve as an infrastructure for handling knowledge rather than individual languages. In addition to translation, the UNL is expected to be used in several other different tasks, such as text mining, multilingual document generation, summarization, text simplification, information retrieval and extraction, sentiment analysis etc. Indeed, in UNL-based systems there is no need for the source language to be different from the target language: an English text may be represented in UNL in order to be generated, once again, in English, as a summarized, a simplified, a localized or a simply rephrased version of the original.
IV - The UNL must be machine-tractable
The UNL is a formal system designed for computers. It is an artificial language shaped to represent knowledge in a machine-tractable format. Like other logical systems, it seeks to provide the linguistic and semiotic infrastructure for computers to handle what is meant by natural languages. Unlike auxiliary languages (such as Esperanto, Interlingua, Volapük, Ido and others), the UNL is not intended to be a human language. We do not expect people to speak UNL or to communicate in UNL. It must also be opaque to end users: just as no one is required to know HTML to browse the Internet or even to create websites, everyone should be able to write and read documents in UNL without any knowledge of UNL.

Structure

The four commitments above have been materialized in a formal system first presented in 1996 and amended several times since then. The most recent specification is available at Specs. The cornerstones of the current implementation are the following:

a) The UNL represents information through semantic networks
The UNL assumes that information can be represented as a graph structure (a semantic network), composed of nodes (concepts) and arcs between nodes (semantic relations between concepts). This idea is not new. Semantic networks have been used in knowledge representation at least since Charles S. Peirce, and as an interlingua for machine translation since 1956.
b) The UNL includes a universal lexicon
Languages are normally defined as the combination of a lexicon (i.e., a set of words) and a grammar (i.e., a set of rules for combining words). As a "universal language", the UNL includes a "universal lexicon" and a "universal grammar". The universal lexicon is derived from the assumption that words from different languages may be co-indexed through common semantic entities (the "Universal Words", or "UWs"), which would stand as linguistic counterparts for the notion of "universal concept". The idea of universality in language, and of a universal dictionary, is as old as language studies. However, except for extremely constrained domains (such as Chemistry), none of these attempts has really succeeded so far, including the UNL, whose set of UWs has been subject to continuous changes, both in form (i.e., in the way of expressing UWs) and in content (the actual list of UWs).
c) The UNL includes a set of semantic binary relations
The "universal grammar" of UNL consists of a set of relations and attributes. The set of relations comprises semantic binary relations that link UWs in order to form the UNL graph. This set of relations has also undergone several changes since 1996, and this is the main difference between the several versions of the UNL specifications. The idea of semantic binary relations has been proposed in numerous linguistic approaches, the most famous ones being the "structural syntax", developed by Lucien Tesnière in the 1930s, and the "semantic case", presented by Charles Fillmore in 1968. Although the lists of semantic relations are always different, they normally share some basic common concepts: "agent", "patient", "instrument", "place", "time" etc.
d) The UNL includes a set of semantic attributes
The "universal grammar" of UNL also contains attributes, which are used to specify the use of UWs and to assign pragmatic information to UNL statements. The current set of attributes represents mostly information on inflectional categories (such as tense, gender, number, aspect, modality, voice etc.) and speech acts.
e) The UNL is a markup language

The UNL is used for annotating a natural language document in a language-independent (semantically-based) way. Just as HTML annotations can be realized differently in the context of different applications, machines, displays etc., UNL expressions can have different realizations in different human languages. This is done through the UNL document structure.

As indicated above, the main characteristics of the UNL are not actually unique, and have long been part of research and development in the fields of Computational Linguistics and Artificial Intelligence. The originality of the UNL lies rather in the "combination" of these well-known, public-domain categories in a single representation, i.e., in the requirement that the UNL semantic network be composed with a specific universal lexicon, a specific set of semantic relations and a specific set of semantic attributes, represented in a given format. This is the keystone of the UNL, and its main distinctive feature in comparison to other semantic networks and formal systems. Given that the concepts deployed in the UNL approach (semantic networks, universal dictionary, relations, attributes) have long been used in the field of natural language processing, the novelty of the UNL does not concern "what" the UNL does, but "how" it does it, i.e., "with which resources".
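
The co-indexing assumption behind the universal lexicon (item b above) can be pictured with a small sketch. It is only an illustration: the two-entry lexicons, the French entries and the function names unlize_word and nlize_word are assumptions made for this example, not part of the UNL Dictionary or of any UNL tool. Note that each function receives only its own language, which mirrors the independence of UNLization and NLization required by commitment II.

```python
# Toy lexicons (hypothetical): each natural-language word points to a UW.
LEXICON = {
    "en": {"sky": "sky(icl>natural world)", "blue": "blue(icl>color)"},
    "fr": {"ciel": "sky(icl>natural world)", "bleu": "blue(icl>color)"},
}


def unlize_word(word: str, source_lang: str) -> str:
    """Map a word to its UW; nothing is known about any target language."""
    return LEXICON[source_lang][word]


def nlize_word(uw: str, target_lang: str) -> str:
    """Map a UW back to a word; nothing is known about the source language."""
    for word, candidate in LEXICON[target_lang].items():
        if candidate == uw:
            return word
    raise KeyError(f"no {target_lang} entry for {uw}")


uw = unlize_word("ciel", "fr")
print(uw)                    # sky(icl>natural world)
print(nlize_word(uw, "en"))  # sky
```

Because the UW is the only thing shared between the two steps, the target language may even be the same as the source language.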

Example

In the UNL approach, information conveyed by natural language is represented as a hypergraph composed of a set of directed binary labelled links (referred to as "relations") between nodes or hypernodes (the "Universal Words", or simply "UWs"), which stand for concepts. UWs can also be annotated with "attributes" representing context information.

For example, the English sentence "The sky was blue?!" can be represented in UNL as follows:

[Figure: Unl.ht1.gif]

In the example above, "sky(icl>natural world)" and "blue(icl>color)", which represent individual concepts, are UWs; "aoj" (= attribute of an object) is a directed binary semantic relation linking the two UWs; and "@def", "@interrogative", "@past", "@exclamation" and "@entry" are attributes modifying UWs.
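
The elements named in this example can be held in a very small data structure. The following is a minimal sketch, not an official UNL implementation: the class names Node and Relation, the direction of the "aoj" arc (from "blue" to "sky"), the assignment of "@def" to "sky" and of the remaining attributes to "blue", and the dotted linear notation printed at the end are all assumptions made for illustration, since the text alone does not settle them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A UW together with the attributes that modify it."""
    uw: str
    attributes: Tuple[str, ...] = ()

    def __str__(self) -> str:
        # Assumed linear notation: attributes appended after a dot.
        return self.uw + "".join("." + a for a in self.attributes)


@dataclass(frozen=True)
class Relation:
    """A directed, labelled, binary link between two nodes."""
    label: str
    source: Node
    target: Node

    def __str__(self) -> str:
        return f"{self.label}({self.source}, {self.target})"


# Assumed attachment of attributes to UWs (see the note above).
blue = Node("blue(icl>color)",
            ("@entry", "@past", "@interrogative", "@exclamation"))
sky = Node("sky(icl>natural world)", ("@def",))

graph = [Relation("aoj", blue, sky)]
for relation in graph:
    print(relation)
```

A whole UNL graph is then simply a list of such relations; larger sentences add more entries to the list rather than changing the structure.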

UWs are supposed to represent universal concepts and are expressed here in English words in order to be readable. They consist of a "headword" (the UW root) and a "constraint list" (the UW suffix between parentheses), the latter being used to disambiguate the general concept conveyed by the former. The set of UWs constitutes the UNL Dictionary. The UWs are defined in the UNL Knowledge Base (UNLKB), and are exemplified in the UNL Example Base (UNLEB).
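
The headword and constraint list can be separated mechanically. The sketch below is a simplified illustration under stated assumptions: the function name parse_uw is hypothetical, constraints are assumed to be separated by commas, and nested parentheses inside a constraint are not handled.

```python
from typing import List, Tuple


def parse_uw(uw: str) -> Tuple[str, List[str]]:
    """Split a UW into (headword, constraint list).

    Assumes the form headword(constraint, constraint, ...), with the
    parenthesised part being optional.
    """
    text = uw.strip()
    headword, paren, rest = text.partition("(")
    headword = headword.strip()
    if not headword:
        raise ValueError(f"missing headword in {uw!r}")
    if not paren:
        # No constraint list at all: the UW is a bare headword.
        return headword, []
    if not rest.endswith(")"):
        raise ValueError(f"unbalanced parentheses in {uw!r}")
    constraints = [c.strip() for c in rest[:-1].split(",") if c.strip()]
    return headword, constraints


print(parse_uw("sky(icl>natural world)"))  # ('sky', ['icl>natural world'])
print(parse_uw("blue(icl>color)"))         # ('blue', ['icl>color'])
```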

Relations are expected to represent semantic links between concepts or sets of concepts in every existing language. They can be ontological (such as "icl", seen above, and "iof"), logical (such as "and" and "or") and thematic (such as "agt" = agent, "ins" = instrument, "tim" = time, "plc" = place, etc.).

Attributes represent information that cannot be conveyed by UWs and relations. Normally, they represent information on tense ("@past", "@future", etc.), reference ("@def", "@indef", etc.), modality ("@can", "@must", etc.), focus ("@topic", "@focus", etc.), and other closed-class categories.
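
Since attributes form closed classes, they can be looked up in a fixed table. The sketch below uses only the attributes and categories listed in the paragraph above; the table is therefore deliberately incomplete, and the names ATTRIBUTE_CATEGORIES and categorize are hypothetical, chosen for this illustration only.

```python
# Only the attributes named in the text; the real inventory is larger.
ATTRIBUTE_CATEGORIES = {
    "tense": {"@past", "@future"},
    "reference": {"@def", "@indef"},
    "modality": {"@can", "@must"},
    "focus": {"@topic", "@focus"},
}


def categorize(attribute: str) -> str:
    """Return the category of an attribute, or "other" if it is not listed."""
    for category, members in ATTRIBUTE_CATEGORIES.items():
        if attribute in members:
            return category
    return "other"


for attribute in ("@def", "@past", "@entry"):
    print(attribute, "->", categorize(attribute))
```

Attributes from the example sentence that are not covered by the four listed categories, such as "@entry", fall into "other" here.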

Software