Lexical Realisation Unit

From UNL Wiki
Revision as of 11:00, 6 January 2010 by Admin (Talk | contribs)

A lexical unit (or simply LU) is any stable and recurring unit of meaning in a given natural language. It can be a morpheme (a root, an affix), a simple word or a multiword expression (compounds, collocations, idioms).

From concepts to lexical units

The UNLarium is first and foremost a generation-driven framework, developed mainly to provide resources for generating natural language texts out of UNL documents. In that sense, a dictionary entry should correspond to the most likely lexical realisation, in a given language, of the definition of a concept. For instance, the definition “the natural satellite of the Earth” is realised, in English, by the word “Moon”; in French, by “lune”; in German, by “Mond”; in Russian, by “луна”; in Spanish, by “luna”; in Chinese, by “月”; and so on. Your first task in the UNLarium is precisely to find the “lexical realisations” of those concept definitions, which will be presented in English.

Lexical realisation is not only about "words"

The expression "lexical realisation" is used here to avoid a common misunderstanding in natural language description. Due to writing conventions, especially in the Western tradition, we tend to reduce the lexicon of a language to a list of “words”, which are normally understood as strings of alphabetic characters separated by blank spaces. Unfortunately, it is not that simple. The vocabulary of a language is made up not only of words, but also of parts of words (roots, stems, affixes, particles) and of multi-word expressions (compounds, collocations, idioms). In English, one of the most frequent lexical realisations of the concept “contrary of” is the prefix “un-”, which is a bound morpheme (i.e., a semantic unit that does not have an independent existence); in the same way, the concept “to die” is frequently realised by the idiom “to kick the bucket”, which is a complex structure that does not figure as a separate entry in most English dictionaries (it is normally listed under the verb “to kick”). So it is important to understand that “lexical realisation”, here, means not only “words” in the common sense, but any LEXICAL UNIT, i.e., any reasonably constant unit of a language, regardless of its length and number of morphemes. For us, the most important criterion, which is admittedly still quite subjective, is the rate of recurrence. If the sequence recurs convincingly often, it is a lexical unit (or simply “LU”); otherwise, it is not.
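The recurrence criterion can be illustrated with a minimal sketch. It assumes a tokenised corpus and an arbitrary minimum count; the function names, the toy corpus and the threshold are all hypothetical, and the text above does not fix any particular cut-off.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def is_recurring(candidate, tokens, min_count=2):
    """Return True if the candidate sequence occurs at least min_count times.

    min_count is an illustrative threshold, not a value given by the UNLarium.
    """
    words = tuple(candidate.lower().split())
    counts = count_ngrams(tokens, len(words))
    return counts[words] >= min_count


# Toy corpus, already lower-cased and split on whitespace (hypothetical data).
corpus = (
    "he did not want to kick the bucket so soon . "
    "nobody expects to kick the bucket on a monday . "
    "she saw a red entity once ."
).split()

print(is_recurring("kick the bucket", corpus))
print(is_recurring("red entity", corpus))
```

In practice the decision remains a judgement call: the count only gives a rough indication of how stable a sequence is in a given corpus.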

Lexicalisation processes

As languages have different lexicalisation processes, a single definition may correspond to several different LUs, which are said to be synonyms. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, may be realised in English by several different LUs: “to die”, “to croak”, “to decease”, “to drop dead”, “to buy the farm”, “to cash in one's chips”, “to give up the ghost”, “to kick the bucket”, “to pass away”, “to perish”, “to snuff it”, “to pop off”, “to expire”, “to conk”, “to exit”, “to choke”, “to go”, “to pass”, etc. In such cases, all realisations should be entered in the UNLarium.

There are cases, however, in which the definition cannot be lexically realised [by a single lexical unit] in the target language. This happens in two situations:

  • When the concept is underspecified, i.e., too broad (or vague) to be realised. The concept of “red entity”, for instance, may overlap with several different English LUs (“blood”, “cherry”, “ruby”, “ketchup”, “Spiderman”, etc.), but these are subordinate terms (or hyponyms), in the sense that they fall under the intended sense and match it only partly. And the expression “red entity” itself is too occasional to be considered already lexicalized (Google returns only 8,040 occurrences for this bigram).
  • When the concept is overspecified, i.e., too narrow (or specific) to be realised. Consider, for instance, the definition “a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time”. This definition does not lead to any LU in English, French or Russian, even though it corresponds to a single word (“ilunga”) in Tshiluba, a language spoken in the Democratic Republic of the Congo. We may obviously express the concept in any language, but we have to do it through a periphrasis (as we have done for English) or through a superordinate term (or hypernym), such as “forgiver”, “excuser”, “pardoner”, which again match the intended sense only partly.

In both cases, there is no realisation to enter, and it is important to indicate, in the UNLarium, that the concept has not been lexicalized yet, which means that it can be expressed in the target language only by means of periphrases and other semantically related units (such as hyponyms or hypernyms).
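The relation between a concept definition and its realisations can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `ConceptEntry`, its fields and its methods are hypothetical and do not describe how the UNLarium actually stores entries.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptEntry:
    """A concept definition with its known realisations per language (hypothetical model)."""
    definition: str
    realisations: dict = field(default_factory=dict)  # language name -> list of LUs

    def add(self, language, lu):
        """Record one more LU (synonym) for the given language, ignoring duplicates."""
        lus = self.realisations.setdefault(language, [])
        if lu not in lus:
            lus.append(lu)

    def is_lexicalised(self, language):
        """A concept counts as lexicalised in a language if at least one LU is recorded."""
        return bool(self.realisations.get(language))


# One definition, several synonymous LUs in English.
die = ConceptEntry(
    "pass from physical life and lose all bodily attributes "
    "and functions necessary to sustain life"
)
for lu in ["to die", "to pass away", "to kick the bucket"]:
    die.add("English", lu)

# An overspecified concept: lexicalised in Tshiluba, but not in English.
ilunga = ConceptEntry(
    "a person who is ready to forgive any transgression a first time and "
    "then to tolerate it for a second time, but never for a third time"
)
ilunga.add("Tshiluba", "ilunga")

print(die.is_lexicalised("English"))
print(ilunga.is_lexicalised("English"))
```

An empty list of realisations for a language is what marks the concept as not yet lexicalized there; hyponyms, hypernyms and periphrases are deliberately not stored as realisations in this sketch.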

How to express an LU

In the UNLarium, the LU is expressed by its canonical (citation) form, i.e., the word or expression as it would normally appear in ordinary dictionaries and glossaries, which is normally the unmarked (generic, basic, default) form, such as the singular, for nouns; the masculine singular, for adjectives; the infinitive, for verbs; and so on. Accordingly, you should use “foot” for both “foot” and “feet”; “run” for “run”, “runs”, “ran”, “running”; “beau” (=beautiful, in French) for “beau” (masculine singular), “beaux” (masculine plural), “belle” (feminine singular), “belles” (feminine plural); etc.
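The mapping from inflected forms to the canonical form can be sketched as a simple lookup. The table below contains only the examples given above; the names `CANONICAL_FORMS`, `FORM_TO_LU` and `canonical_form` are hypothetical, and a realistic table would need to be keyed by language as well.

```python
# Canonical (citation) form -> all forms it stands for, taken from the examples above.
CANONICAL_FORMS = {
    "foot": ["foot", "feet"],
    "run": ["run", "runs", "ran", "running"],
    "beau": ["beau", "beaux", "belle", "belles"],
}

# Inverted index: any inflected form -> its canonical form.
FORM_TO_LU = {
    form: lu
    for lu, forms in CANONICAL_FORMS.items()
    for form in forms
}


def canonical_form(word):
    """Return the citation form for a word, or None if the word is not in the table."""
    return FORM_TO_LU.get(word.lower())


print(canonical_form("feet"))
print(canonical_form("Belles"))
print(canonical_form("moons"))
```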

The role of LUs

LUs are not actually essential to the UNLarium. They just provide a human-readable label or reading for base forms, which are the starting points and the keystones of generation.
