Lexical Realisation Unit
Revision as of 14:45, 13 January 2010
In the UNLarium framework, a Lexical Realisation (or simply LR) is any discrete, recurring and standardized unit of meaning of a given natural language. It can be a subword (a root, an affix), a simple word or a multiword expression (compounds, collocations, idioms).
Lexical realisation (LR)
The UNLarium is first and foremost a generation-driven framework, developed mainly to provide resources for generating natural language texts out of UNL graphs. In that sense, UNLarium entries should correspond to the most likely realisations, in a given language, of a given concept. The expression “realisation” stands here for a mixture of wording and phrasing, i.e., the manner in which a concept is articulated in a given language. For instance, the concept “the natural satellite of the Earth” is realised, in English, by the word “moon”; in French, by “lune”; in German, by “Mond”; in Russian, by “луна”; in Spanish, by “luna”; in Chinese, by “月”; and so on. Your first task in the UNLarium is precisely to find linguistic realisations for concepts, which will always be presented by their corresponding definition in English.
LRs, however, are not simply linguistic realisations; they are lexical realisations. This means that an LR should correspond to a unit of the vocabulary of a language, i.e., to a “lexical item”. Let’s come back to our previous example. Apart from “moon”, the concept “the natural satellite of the Earth” can be realised, in English, by the very expression “the natural satellite of the Earth”, which is indeed very frequent (2,130,000 occurrences in Google). This expression, however, is a “definition” rather than a “lexical realisation” for the concept, and should therefore not be treated as an LR.
The differences between definitions and lexical items, or between “defining” and “naming” a concept, are fairly subjective, and are normally ascribed to the compositionality (or analyticity) of the candidate term: if the meaning of the compound can be reduced to the combination of the meanings of its components, it is said to be simply a definition; otherwise, i.e., if there is a sort of semantic surplus, a supplementary (or even complementary) sense added to the simple combination, the term is considered a lexical item. The above-mentioned expression “the natural satellite of the Earth”, for instance, does not bring any semantic content other than that conveyed by its components. This is not the case for “geostationary communications satellite”, which includes the idea of “orbit” even though that idea is not explicitly present in the compound. Accordingly, “geostationary communications satellite” (208,000 occurrences in Google) should be treated as an LR, whereas “the natural satellite of the Earth”, in spite of its higher frequency, should not.
Lexical Realisation Unit (LRU)
In synthetic (inflected) languages, such as the Indo-European ones, a single concept may be realised by different lexical realisations in order to express different grammatical categories, such as number, gender, tense and case. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, is realised, in English, by the forms "to die", "die", "dies", "dying", "died", "dead", "will die", etc. These LRs are said to be different forms of the same Lexical Realisation Unit (or LRU).
Lexical Realisation Units are therefore abstract underlying units shared by different lexical realisations, but they should not be mistaken for lexemes. Indeed, it is not simple to associate the idea of an LRU with that of a lexeme, as LRUs may correspond to any morphological category:
- inflectional affixes (such as "-s", which is one of the possible lexical realisations for the concept "more than one");
- derivational affixes (such as "un", which is one of the possible lexical realisations for the concept "contrary of");
- roots (such as "die", which is one of the possible lexical realisations for the concept “pass from physical life and lose all bodily attributes and functions necessary to sustain life”);
- stems (such as "unhappy", which is one of the possible lexical realisations for the concept "experiencing or marked by or causing sadness or sorrow or discontent"); and
- word forms (such as "glasses", which is one of the possible lexical realisations for "optical instrument consisting of a pair of lenses for correcting defective vision").
Additionally, LRUs may also correspond to complex structures comprising several different (and even discontinuous) lexemes, as in "geostationary communications satellite" or "throw <someone> to the lions".
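The relation between an LRU and the LRs it groups can be pictured as a small data structure. The sketch below is only an illustration: the class name `LexicalRealisationUnit`, its fields and the `covers` method are hypothetical and are not part of the UNLarium itself; the example forms are the ones quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalRealisationUnit:
    """Hypothetical container: one LRU plus the LRs (surface forms) it groups."""

    canonical_form: str
    realisations: set[str] = field(default_factory=set)

    def covers(self, form: str) -> bool:
        # A form belongs to the unit if it is the canonical form itself
        # or one of the listed realisations.
        return form == self.canonical_form or form in self.realisations


# A root-based LRU with its inflected realisations (forms taken from the text).
die = LexicalRealisationUnit(
    canonical_form="die",
    realisations={"to die", "die", "dies", "dying", "died", "dead", "will die"},
)

# An LRU may also be a multi-lexeme structure; no extra forms are listed here.
throw_to_the_lions = LexicalRealisationUnit(canonical_form="throw to the lions")
```

Under this sketch, `die.covers("died")` holds, while a synonym such as "croak" is not covered, since it belongs to a different LRU.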
Lexical Realisation Set (LRS)
As languages have different lexicalisation processes, a single definition may correspond to several different LRs, which are said to be synonyms. The definition “pass from physical life and lose all bodily attributes and functions necessary to sustain life”, for instance, may be realised in English by several different LRs: “die”, “croak”, “decease”, “drop dead”, “buy the farm”, “cash in one's chips”, “give up the ghost”, “kick the bucket”, “pass away”, “perish”, “snuff it”, “pop off”, “expire”, “conk”, “exit”, “choke”, “go”, “pass”, etc. In such cases, all realisations should be entered in the UNLarium inside a single Lexical Realisation Set.
There are cases, however, in which the definition cannot be lexically realised in the target language. This happens in two situations:
- When the concept is underspecified, i.e., too broad (or vague) to be realised. The concept of “red entity”, for instance, may be coextensive with several different English LRs (“blood”, “cherry”, “ruby”, “ketchup”, “Spiderman”, etc.), but these are rather subordinate terms (or hyponyms), in the sense that they only include and partly match the intended sense. As the expression “red entity” itself is too compositional and too occasional to be considered already lexicalized (Google brings only 8,040 occurrences for this bigram), there will be no LR in this case.
- When the concept is overspecified, i.e., too narrow (or specific) to be realised. Consider, for instance, the definition “a person who is ready to forgive any transgression a first time and then to tolerate it for a second time, but never for a third time”. This definition does not lead to any LR in English, French or Russian, even though it corresponds to a single word (“ilunga”) in Tshiluba, a language spoken in the Democratic Republic of the Congo. We may obviously express the concept in any language, but we have to do it through a periphrasis (as we have done for English) or through a superordinate term (or hypernym), such as “forgiver”, “excuser”, “pardoner”, which are again fairly inaccurate.
In both cases, there is no realisation to be entered, and it is important to indicate, in the UNLarium, that the concept has not been lexicalized yet, which means that it can be expressed in the target language only by means of definitions (periphrases) and other semantically related (and inaccurate) LRUs (such as hyponyms or hypernyms). This is done by marking the Lexical Realisation Set as empty.
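The difference between a concept that has an explicitly empty set and a concept that nobody has handled yet can be made concrete. The following is a minimal sketch under stated assumptions: the dictionary `lrs_by_concept`, the function `status` and its three return labels are hypothetical names chosen for illustration, and do not reflect how the UNLarium actually stores its data.

```python
from typing import Dict, FrozenSet

# Hypothetical store: concept definition -> Lexical Realisation Set.
# An empty frozenset is a deliberate statement that the concept is not
# lexicalised; a missing key means the concept has not been handled yet.
lrs_by_concept: Dict[str, FrozenSet[str]] = {
    "pass from physical life and lose all bodily attributes and functions "
    "necessary to sustain life": frozenset(
        {"die", "croak", "decease", "kick the bucket", "pass away", "perish"}
    ),
    "red entity": frozenset(),  # underspecified: explicitly empty set
}


def status(concept: str) -> str:
    """Report whether a concept is lexicalised, not lexicalised, or unhandled."""
    if concept not in lrs_by_concept:
        return "not handled"
    return "lexicalised" if lrs_by_concept[concept] else "not lexicalised"
```

Keeping "empty" and "missing" apart is the point of the sketch: an empty set records a decision, whereas an absent entry records nothing.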
Types of LRU
In the UNLarium framework, there can be three different types of LRUs, depending on their internal structure:
Subwords (SBW) are LRUs that do not have independent existence in the language and only appear together with other morphemes to form a lexeme. They are common in synthetic (inflected) languages, such as the Indo-European ones, where some concepts (such as "contrary of" and "plural") are normally realised by bound morphemes ("un-" and "-s", for instance). Subword LRUs include affixes (such as "un-", "re-", "-ful", "-ness") and roots that do not occur alone (such as "rupt" in "interrupt", "disrupt", "corrupt", "rupture", etc.).
Simple Words (WRD) are the ordinary (indivisible) lexemes in the semantic system of a language. They may consist of:
- one single free morpheme (such as "happy", "break");
- one single free morpheme and bound morphemes ("unhappy", "happiness", "happily", "unbreakable", "unbreakableness"); and
- compounds of bound morphemes (such as "interrupt", "disrupt", "corrupt").
Multiword Expressions (MTW) are lexical structures made up of a sequence of two or more lexemes. They can be concatenated ("darkroom", "skinhead") or isolated by hyphens ("blue-green", "Afro-American") or blank spaces ("round table", "part of speech"). Multiword expressions can be continuous ("get over") or discontinuous ("get <something> together"). They correspond to compounds ("fireman", "hardware"), phrases ("in spite of", "take into account"), idioms ("kick the bucket", "play cat and mouse"), fragments of sentences ("and so on", "whatever the case") or sentences ("Every evil is followed by some good", "No flies enter a mouth that is shut").
Classical compounds ("agriculture", "photograph") and their derivations ("agricultural", "photographically") are to be treated as simple words if they do not include more than one free morpheme. Phrasal verbs ("give in", "come across") are treated as multiword expressions.
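As a rough illustration, the three types can be written as an enumeration, with the examples from this section tagged by hand. The names `LRUType`, `EXAMPLES` and `by_type` are hypothetical; only the codes SBW, WRD and MTW and the example forms come from the text above.

```python
from enum import Enum


class LRUType(Enum):
    SBW = "subword"
    WRD = "simple word"
    MTW = "multiword expression"


# Types are assigned by hand: spelling alone does not decide the type
# ("darkroom" is a multiword expression, "agriculture" a simple word).
EXAMPLES = {
    "un": LRUType.SBW,
    "rupt": LRUType.SBW,
    "happy": LRUType.WRD,
    "unbreakable": LRUType.WRD,
    "interrupt": LRUType.WRD,
    "agriculture": LRUType.WRD,
    "darkroom": LRUType.MTW,
    "blue-green": LRUType.MTW,
    "round table": LRUType.MTW,
    "give in": LRUType.MTW,
}


def by_type(lru_type: LRUType) -> list[str]:
    """Return the example forms tagged with the given type, sorted."""
    return sorted(form for form, t in EXAMPLES.items() if t is lru_type)
```

The table is tagged explicitly rather than computed because the classification depends on morphological analysis (how many free morphemes the form contains), not on whether the written form has spaces or hyphens.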
How to express an LRU
In the UNLarium, LRUs are represented by a canonical form, to be obtained as follows:
- Subwords are to be expressed by their corresponding lemma without any special sign. Allomorphs (such as "es" for plural) will be generated by special rules and should not be represented:
- s (and not "-s")
- un (and not "un-")
- chrono (and not "chrono-")
- Simple words are to be expressed by the lemma of the corresponding lexeme:
- happy
- unhappy
- unhappiness
- table
- foot
- clothes
- love
- be
- Multiword expressions are to be expressed as continuous sequences of required lexemes, which should be represented by their corresponding lemma, if variable, or by their word form, otherwise:
- get over
- take into account (and not "take something into account", or "take <something> into account", or "take ... into account", etc)
- throw to the lions (and not "throw to the lion")
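Part of these conventions is purely mechanical and can be sketched in code. The two helpers below are hypothetical (they are not UNLarium functions) and cover only the notational side: dropping the hyphen from a subword, and dropping slot markers such as `<something>` or `...` from a multiword expression. They do not lemmatise, and they cannot detect an unmarked slot filler such as the "something" in "take something into account".

```python
import re

# Slot markers to drop from multiword expressions: <...> placeholders,
# a literal three-dot ellipsis, or the single ellipsis character.
_SLOT = re.compile(r"<[^<>]*>|\.\.\.|…")


def canonical_subword(form: str) -> str:
    """Drop the leading/trailing hyphen that marks a bound morpheme."""
    return form.strip("-")


def canonical_multiword(form: str) -> str:
    """Remove slot markers and collapse the remaining lexemes into one sequence."""
    without_slots = _SLOT.sub(" ", form)
    return " ".join(without_slots.split())
```

For example, `canonical_subword("un-")` gives `"un"`, and `canonical_multiword("take <something> into account")` gives `"take into account"`.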
Examples
Concept | Lexical Realisations | Lexical Realisation Unit (LRU) |
---|---|---|
large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male | lion, lions | lion |
a female lion | lioness, lionesses | lioness |
a large and densely populated urban area | city, cities | city |
the part of the leg of a human being below the ankle joint | foot, feet | foot |
the largest city in New York State and in the United States | New York, New York City, NY, NYC | New York, New York City, NY, NYC |
the corporate executive responsible for the operations of the firm | chief executive officer, chief executive officers, chief operating officer, chief operating officers, CEO, CEOs | chief executive officer, chief operating officer, CEO |
optical instrument consisting of a pair of lenses for correcting defective vision | spectacles, specs, eyeglasses, glasses | spectacles, specs, eyeglasses, glasses |
pale yellowish wine made from white grapes or red grapes with skins removed before fermentation | white wine, white wines | white wine |
a person whose occupation is teaching | professor (masculine singular), professores (masculine plural), professora (feminine singular), professoras (feminine plural) (Portuguese) | professor |
solid-hoofed herbivorous quadruped domesticated since prehistoric times | cheval (male singular), chevaux (male plural), jument (female singular), juments (female plural) (French) | cheval, jument |
delighting the senses or exciting intellectual or emotional admiration | beautiful | beautiful |
delighting the senses or exciting intellectual or emotional admiration | beau (masculine singular), beaux (masculine plural), belle (feminine singular), belles (feminine plural) (French) | beau |
have the quality of being | to be, be, am, is, are, was, were, being, been | be |
have a great affection or liking for | aime, aimes, aimons, aimez, aiment, aimerais, ai aimé, aimais, ... (French) | aimer |
steer a vehicle to the side of the road | to pull over, pull over, pulls over, pulled over, ... | pull over |
allow or plan for a certain possibility | to take into account, take into account, takes into account, taking into account, ... | take into account |
on the day preceding today | yesterday | yesterday |
in a willing manner | gladly | gladly |