Lexical Realisation Unit

Revision as of 12:01, 13 January 2010

In the UNLarium framework, a Lexical Realisation Unit (or simply LRU) is the abstract unit underlying any discrete, recurring and standardized unit of meaning of a given natural language. It can be a subword (a root, an affix), a simple word or a multiword expression (compounds, collocations, idioms).

LRUs are lexical realisations for concepts

The UNLarium is first and foremost a generation-driven framework, developed mainly to provide resources for generating natural language texts out of UNL graphs. In that sense, UNLarium entries should correspond to the most likely realisations, in a given language, of a given concept. The term “realisation” stands here for a mixture of wording and phrasing, i.e., the manner in which a concept is articulated in a given language. For instance, the concept “the natural satellite of the Earth” is realised, in English, by the word “moon”; in French, by “lune”; in German, by “Mond”; in Russian, by “луна”; in Spanish, by “luna”; in Chinese, by 月; and so on. Your first task in the UNLarium is precisely to find linguistic realisations for concepts, which will always be presented through their corresponding definitions in English.

LRUs, however, are not simply linguistic realisations; they are lexical realisations. This means that LRUs should correspond to units of the vocabulary of a language, i.e., to lexical items. Let’s come back to our previous example. Apart from “moon”, the concept “the natural satellite of the Earth” can be realised, in English, by the very expression “the natural satellite of the Earth”, which is indeed very frequent (2,130,000 occurrences in Google). This expression, however, is a “definition” rather than a “lexical realisation” for the concept, and should therefore not correspond to an LRU.

The differences between definitions and lexical items, or between “defining” and “naming” a concept, are fairly subjective, and are normally ascribed to the compositionality (or analyticity) of the candidate term: if the meaning of the compound can be reduced to the combination of the meanings of its components, it is said to be simply a definition; otherwise, i.e., if there is a sort of semantic surplus, a supplementary (or even complementary) sense added to the simple combination, the term is considered a lexical item. The expression "the natural satellite of the Earth", for instance, does not bring any semantic content other than that conveyed by its components.

Finally, LRUs are underlying units rather than surface forms. This means that LRUs are classes that may comprise several lexical realisations. In synthetic (inflected) languages, such as the Indo-European ones, a single concept may be realised by different lexical realisations in order to express different grammatical categories, such as number, gender, tense and case. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, is realised, in English, by the forms "to die" (infinitive), "die" (present tense except 3rd person singular), "dies" (3rd person singular of the present tense), "dying" (gerund), "died" (past tense and past participle), "will die" (future), etc. Nevertheless, particles (such as "to"), affixes (such as the inflectional suffixes "-s", "-ing", "-ed") and co-verbs (such as the auxiliary "will") convey notions that are not included in the definition of the concept and should be treated as separate LRUs. The concept is actually realised by the forms "die" and "dy-", which correspond to the root shared by these realisations.

This does not mean, however, that LRUs are roots. As realisations of definitions, LRUs may correspond either to single-rooted forms (such as “die”) or to multiple-root compounds (such as “kick the bucket” or “give up the ghost”). Additionally, and depending on the concept to be realised, LRUs may correspond to stems. The concept "experiencing or marked by or causing sadness or sorrow or discontent", for instance, is realised, in English, by "unhappy", which is a combination of the prefix "un-" and the root "happy". Finally, LRUs may even include inflectional affixes, as in "glasses", which is the lexical realisation of "optical instrument consisting of a pair of lenses for correcting defective vision". The form "glasses" contains a root "glass" and an inflectional suffix "-es", which expresses the plural. It is therefore quite difficult to associate the notion of LRU with that of a morpheme, since LRUs may be either mono-morphemic or pluri-morphemic structures.

Likewise, it is not simple to associate the idea of LRU with that of a lexeme, as LRUs may be complex structures comprising several different (and even discontinuous) lexemes, as in "throw <someone> to the lions", which is considered to be an idiom rather than a lexeme.

Types of LRU

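In the UNLarium framework, there are three different types of LRUs, depending on their internal structure:

  • subword (SBW): a bound morpheme;
  • simple word (WRD): a lexeme; and
  • multiword expression (MTW): more than one lexeme.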

Subword (SBW)

Subwords are bound morphemes, i.e., morphemes that do not have independent existence in the language and only appear together with other morphemes to form a lexeme. Affixes (such as "un-", "re-", "-ful", "-ness") and roots that do not occur alone (such as "rupt" in "interrupt", "disrupt", "corrupt", "rupture", etc.) are bound morphemes and, therefore, subwords.

Simple Word (WRD)

Simple Words are the ordinary (indivisible) lexemes in the semantic system of a language. They may consist of:

  • one single free morpheme (such as "happy", "break");
  • one single free morpheme and bound morphemes ("unhappy", "happiness", "happily", "unbreakable", "unbreakableness"); and
  • compounds of bound morphemes (such as "interrupt", "disrupt", "corrupt").

Simple words must not include more than one free morpheme.
Classical compounds ("agriculture", "photograph") and their derivations ("agricultural", "photographically") are to be treated as simple words if they do not include more than one free morpheme.

Multiword Expression (MTW)

Multiword Expressions are lexical structures made up of a sequence of two or more lexemes concatenated ("darkroom", "skinhead") or isolated by hyphens ("blue-green", "Afro-American") or blank spaces ("round table", "part of speech"). Multiword expressions can be continuous ("get over") or discontinuous ("get <something> together"). They can be compounds ("fireman", "hardware"), phrases ("in spite of", "take into account"), idioms ("kick the bucket", "play cat and mouse"), fragments of sentences ("and so on", "whatever the case") or sentences ("Every evil is followed by some good", "No flies enter a mouth that is shut"). In the UNLarium framework, phrasal verbs ("give in", "come across") are treated as multiword expressions.
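
The three types above can be sketched as a small classification routine. This is only an illustration, not part of the UNLarium itself: the names `LruType`, `Morpheme` and `classify` are hypothetical, and the morpheme segmentation and the free/bound flags are assumed to be supplied by hand.

```python
from dataclasses import dataclass
from enum import Enum


class LruType(Enum):
    SBW = "subword"
    WRD = "simple word"
    MTW = "multiword expression"


@dataclass(frozen=True)
class Morpheme:
    form: str
    free: bool  # True if the morpheme can occur alone as a word (assumed, hand-annotated)


def classify(morphemes: list[Morpheme], stands_alone: bool) -> LruType:
    """Classify a candidate LRU by its internal structure (hypothetical helper).

    stands_alone is False for forms that never occur on their own,
    such as affixes and bound roots.
    """
    if not stands_alone:
        return LruType.SBW
    # A simple word must not include more than one free morpheme.
    free_count = sum(1 for m in morphemes if m.free)
    return LruType.WRD if free_count <= 1 else LruType.MTW
```

Under these assumptions, "un-" comes out as a subword, "unhappy" and "interrupt" as simple words, and "darkroom" as a multiword expression.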

How to express an LRU

To ensure readability and to allow reference to all instances of the same LRU, the LRU is represented, in the UNLarium, through a lemma, i.e., a canonical (citation) form, which is the entry form normally given in ordinary dictionaries and glossaries. The lemmatization process should be done as follows: the lemma is the form of the singular, for nouns; of the masculine singular, for adjectives; and the infinitive, for verbs. The lemma should follow the spelling and the capitalization rules of the target language. In English, for instance, only proper names should begin with an upper-case letter, whereas in German all nouns should be written this way.
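
The idea of an LRU as a class of realisations referred to by its lemma can be sketched as follows. The `Lru` class and the `build_index` helper are hypothetical names used only for illustration; the sample forms are taken from the examples on this page. For simplicity, the sketch assumes that each surface form belongs to a single LRU, which does not hold for ambiguous forms.

```python
from dataclasses import dataclass, field


@dataclass
class Lru:
    lemma: str                                            # canonical (citation) form
    realisations: set[str] = field(default_factory=set)  # surface forms of the class


def build_index(lrus: list[Lru]) -> dict[str, str]:
    """Map every surface form to the lemma of its LRU (hypothetical helper)."""
    index: dict[str, str] = {}
    for lru in lrus:
        index[lru.lemma] = lru.lemma
        for form in lru.realisations:
            index[form] = lru.lemma
    return index


LRUS = [
    Lru("lion", {"lion", "lions"}),
    Lru("foot", {"foot", "feet"}),
    Lru("be", {"be", "am", "is", "are", "was", "were", "being", "been"}),
]
INDEX = build_index(LRUS)
```

With this index, the forms "feet" and "were" both lead back to their lemmas ("foot" and "be"), which is what allows all instances of the same LRU to be referenced through a single entry.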

Lexicalisation divergences

As languages have different lexicalisation processes, a single definition may correspond to several different LRUs, which are said to be synonyms. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, may be realised in English by several different LRUs: “die”, “croak”, “decease”, “drop dead”, “buy the farm”, “cash in one's chips”, “give up the ghost”, “kick the bucket”, “pass away”, “perish”, “snuff it”, “pop off”, “expire”, “conk”, “exit”, “choke”, “go”, “pass”, etc. In such cases, all realisations should be entered in the UNLarium.

There are cases, however, in which the definition cannot be lexically realised (by a single lexical unit) in the target language. This happens in two situations:

  • When the concept is underspecified, i.e., too broad (or vague) to be realised. The concept of “red entity”, for instance, may be coextensive with several different English LRUs (“blood”, “cherry”, “ruby”, “ketchup”, “Spiderman”, etc.), but these are rather subordinate terms (or hyponyms), in the sense that they only partly match the intended sense. And the expression “red entity” itself is too compositional and too occasional to be considered already lexicalized (Google returns only 8,040 occurrences for this bigram).
  • When the concept is overspecified, i.e., too narrow (or specific) to be realised. Consider, for instance, the definition “a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time”. This definition does not lead to any LRU in English, French or Russian, even though it corresponds to a single word (“ilunga”) in Tshiluba, a language spoken in the Democratic Republic of the Congo. We may obviously express the concept in any language, but we have to do it through a periphrasis (as we have done for English) or through a superordinate term (or hypernym), such as “forgiver”, “excuser”, “pardoner”, which are again fairly inaccurate.

In both cases, there will be no realisation to be entered, and it is important to indicate, in the UNLarium, that the concept has not been lexicalized yet, which means that it can be expressed in the target language only by means of definitions (periphrases) and other semantically related (and inaccurate) lexical units (such as hyponyms or hypernyms).
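
The two situations described in this section (several synonymous LRUs for one definition, or no LRU at all) can be sketched with a simple record. The `ConceptEntry` class and its `lexicalized` property are hypothetical and do not describe the actual UNLarium data model.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptEntry:
    definition: str                                 # definition of the concept, in English
    lrus: list[str] = field(default_factory=list)  # lemmas of the synonymous LRUs

    @property
    def lexicalized(self) -> bool:
        """True if at least one LRU realises the concept in the target language."""
        return bool(self.lrus)


# One definition, several synonymous LRUs (English).
die = ConceptEntry(
    "pass from physical life and lose all bodily attributes "
    "and functions necessary to sustain life",
    ["die", "kick the bucket", "pass away", "perish"],
)

# An underspecified concept with no LRU: explicitly marked as not lexicalized.
red_entity = ConceptEntry("red entity")
```

Keeping the empty case explicit, instead of filling it with hyponyms or hypernyms, mirrors the recommendation above to indicate that the concept has not been lexicalized yet.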

Examples

| Concept | Lexical Realisations | Lexical Realisation Unit (LRU) |
|---|---|---|
| large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male | lion, lions | lion |
| a female lion | lioness, lionesses | lioness |
| a large and densely populated urban area | city, cities | city |
| the part of the leg of a human being below the ankle joint | foot, feet | foot |
| the largest city in New York State and in the United States | New York, New York City, NY, NYC | New York, New York City, NY, NYC |
| the corporate executive responsible for the operations of the firm | chief executive officer, chief executive officers, chief operating officer, chief operating officers, CEO, CEOs | chief executive officer, chief operating officer, CEO |
| optical instrument consisting of a pair of lenses for correcting defective vision | spectacles, specs, eyeglasses, glasses | spectacles, specs, eyeglasses, glasses |
| pale yellowish wine made from white grapes or red grapes with skins removed before fermentation | white wine, white wines | white wine |
| a person whose occupation is teaching | professor (male singular), professores (male plural), professora (female singular), professoras (female plural) (Portuguese) | professor |
| solid-hoofed herbivorous quadruped domesticated since prehistoric times | cheval (male singular), chevaux (male plural), jument (female singular), juments (female plural) (French) | cheval, jument |
| delighting the senses or exciting intellectual or emotional admiration | beautiful | beautiful |
| delighting the senses or exciting intellectual or emotional admiration | beau (masculine singular), beaux (masculine plural), belle (feminine singular), belles (feminine plural) (French) | beau |
| have the quality of being | to be, be, am, is, are, was, were, being, been | be |
| have a great affection or liking for | aime, aimes, aimons, aimez, aiment, aimerais, ai aimé, aimais, ... (French) | aimer |
| steer a vehicle to the side of the road | to pull over, pull over, pulls over, pulled over, ... | pull over |
| allow or plan for a certain possibility | to take into account, take into account, takes into account, taking into account, ... | take into account |
| on the day preceding today | yesterday | yesterday |
| in a willing manner | gladly | gladly |