Multiword expression

From UNL Wiki

Multiword expressions (MTW) are lexical structures made up of a sequence of two or more lexemes. Their parts can be concatenated ("darkroom", "skinhead") or separated by hyphens ("blue-green", "African-American") or blank spaces ("round table", "part of speech"). Multiword expressions can be continuous ("get over") or discontinuous ("get <something> together"). They correspond to compounds ("fireman", "hardware"), phrases ("in spite of", "take into account"), idioms ("kick the bucket", "play cat and mouse"), fragments of sentences ("and so on", "whatever the case") or sentences ("Every evil is followed by some good", "No flies enter a mouth that is shut"). Multiword expressions may also include acronyms (such as "UNESCO"), multiple-word contractions (such as "don't") and blends (such as "sitcom") that are still analysable (unlike "radar" and "motel", which are represented as simple words). Classical compounds ("agriculture", "photograph") and their derivations ("agricultural", "photographically") are treated as simple words if they do not include more than one free morpheme. Phrasal verbs ("give in", "come across") are treated as multiword expressions.


Multiword expressions in the UNLarium

Multiword expressions formed by a single inflectional element

Lemma
The lemma of a continuous multiword expression is the multiword expression itself ("part of speech");
The lemma of a discontinuous multiword expression must include the obligatory variables ("behind <person>'s back");
The lemma of a multiword expression that may be either continuous or discontinuous is the multiword expression itself ("take into account", "bring back").
Base form
The base form is the same as the lemma, except in the case of multiword expressions that involve discontinuity or infixation, i.e., where variations cannot be generated by simple prefixation and/or suffixation rules. In these cases, the base form corresponds to the longest segment of the lemma that is common to all the possible variations of the expression. The base form must belong to the same category as the lemma.
For instance:
  • coffee house (continuous multiword expression without infixation: "coffee house">"coffee houses"): BF=lemma="coffee house"
  • give in (continuous multiword expression with infixation: "give in">"gave in"): BF="give" lemma="give in"
  • behind one's back (discontinuous multiword expression without infixation: "behind my back", "behind his back", etc.): BF="behind" lemma="behind <person>'s back"
  • take into account (discontinuous multiword expression with infixation: "take it into account", "took that into account"): BF="take" lemma="take into account"
Composition rules
Composition rules are rules that are applied over the base form to generate the lemma. They are used only when the lemma is different from the base form.
For instance:
  • coffee house: lemma = base form, composition rule = NULL
  • give in: lemma (= give in) ≠ base form (= give), composition rule = VH([in]); (i.e., lemma = base form + "in")
  • take into account: lemma ≠ base form, composition rule = VA("into account"); (i.e., lemma = base form + "into account")
  • behind one's back: lemma ≠ base form, composition rule = PA([back]); (i.e., lemma = base form + "back")
The composition rules are further described in composition
Inflectional paradigm
The inflectional paradigm and the inflectional rules apply over the base form (and not over the lemma).
For instance:
  • coffee house: base form = "coffee house", paradigm = M2 (regular nouns that make the plural in -s);
  • give in: base form = "give", paradigm = M1 (irregular), inflectional rules = (PAS:="gave";PTP:="given";);
  • take into account: base form = "take", paradigm = M1 (irregular), inflectional rules = (PAS:="took";PTP:="taken";);
  • behind <person>'s back: base form = "behind", paradigm = M0 (invariant)
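The interaction between base form, composition and inflection described above can be sketched in Python. This is only an illustration, not the UNLarium implementation: the class MultiwordEntry, its field names and the naive "-s" plural are assumptions made for the example, while the tags PAS, PTP and PLR are taken from the entries shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class MultiwordEntry:
    """Hypothetical container for a multiword expression (illustration only)."""
    base_form: str                                  # the part that inflects, e.g. "give"
    tail: str = ""                                  # material added by the composition rule, e.g. "in"
    irregular: dict = field(default_factory=dict)   # irregular forms, e.g. {"PAS": "gave"}

    def lemma(self) -> str:
        # lemma = base form + composition (or just the base form if there is no rule)
        return f"{self.base_form} {self.tail}".strip()

    def inflect(self, tag: str) -> str:
        # Inflection applies to the base form, not to the lemma;
        # the composition is re-applied afterwards.
        if tag in self.irregular:
            head = self.irregular[tag]
        elif tag == "PLR":
            head = self.base_form + "s"             # simplistic regular plural
        else:
            head = self.base_form
        return f"{head} {self.tail}".strip()


give_in = MultiwordEntry("give", "in", {"PAS": "gave", "PTP": "given"})
take_into_account = MultiwordEntry("take", "into account", {"PAS": "took", "PTP": "taken"})
coffee_house = MultiwordEntry("coffee house")

print(give_in.lemma(), "|", give_in.inflect("PAS"))
print(take_into_account.inflect("PTP"))
print(coffee_house.inflect("PLR"))
```

Because only the base form is inflected, "give in" yields "gave in" rather than an inflected form of the whole string.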
Subcategorization frame
The subcategorization frame refers to the lemma (and not to the base form).
For instance:
  • coffee house: base form = "coffee house", frame = Y0 (avalent);
  • give in: base form = "give", frame = Y38 (Somebody ----s something);
  • take into account: base form = "take", frame = Y38 (Somebody ----s something);
  • behind <person>'s back: base form = "behind", frame = Y1 (irregular), subcategorization rules: PC(NA([back];ANM,GNT));

Multiword expressions formed by several inflectional elements

Multiword expressions may consist of several different inflectional elements. In Latin, for instance, the expression "lingua franca" involves two words: "lingua", a noun, and "franca", an adjective, both inflectional, and the latter must agree with the former in number, gender and case, as follows:

case        singular          plural
nominative  lingua franca     linguae francae
vocative    lingua franca     linguae francae
accusative  linguam francam   linguas francas
genitive    linguae francae   linguarum francarum
dative      linguae francae   linguis francis
ablative    lingua franca     linguis francis

In the UNLarium framework, the following applies to "lingua franca":

  • LEMMA = "lingua franca"
  • BASE FORM = "lingua" (because "lingua franca" involves infixation, and "lingua" is the part of the lemma that belongs to the same category as "lingua franca", which is a noun)
    • Where:
    • NA indicates that "lingua" requires a nominal adjunct, which is formed according to the specifications below (enclosed between parentheses):
    • [francus] indicates that this adjunct is made out of the word that has the lemma "francus"
    • J indicates that [francus] is an adjective
    • M7 indicates that [francus] belongs to the paradigm M7 (adjectives of the type -us,-a,-um)
    • FEM indicates that [francus] must be generated in the feminine form
    • RNUM indicates that [francus] must have the same number as the head of the noun phrase (i.e., "lingua")
    • RCAS indicates that [francus] must have the same case as the head of the noun phrase (i.e., "lingua")
  • INFLECTIONAL PARADIGM = M2 ("lingua" follows the first declension of Latin nouns)
  • SUBCATEGORIZATION FRAME = Y0 (avalent: "lingua franca" does not require any specifier or complement)
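The agreement between "lingua" and "franca" can be illustrated with a short Python sketch. It is not UNLarium code: the function name and the ending table are assumptions made for the example. The table lists the first-declension endings (shared by first-declension nouns and by the feminine of -us/-a/-um adjectives), ignoring vowel length, and the adjective simply receives the case and number of the noun.

```python
# First-declension endings, keyed by (case, number); vowel length is ignored.
FIRST_DECLENSION = {
    ("NOM", "SNG"): "a",  ("NOM", "PLR"): "ae",
    ("VOC", "SNG"): "a",  ("VOC", "PLR"): "ae",
    ("ACC", "SNG"): "am", ("ACC", "PLR"): "as",
    ("GNT", "SNG"): "ae", ("GNT", "PLR"): "arum",
    ("DAT", "SNG"): "ae", ("DAT", "PLR"): "is",
    ("ABL", "SNG"): "a",  ("ABL", "PLR"): "is",
}


def inflect_lingua_franca(case: str, number: str) -> str:
    """Inflect the head noun and give the adjective the same case and number."""
    ending = FIRST_DECLENSION[(case, number)]
    noun = "lingu" + ending        # head of the noun phrase
    adjective = "franc" + ending   # feminine form, agreeing with the head
    return f"{noun} {adjective}"


for case in ("NOM", "VOC", "ACC", "GNT", "DAT", "ABL"):
    print(case, inflect_lingua_franca(case, "SNG"), inflect_lingua_franca(case, "PLR"))
```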

Multiword expressions formed by coordination

Multiword expressions formed by coordination, such as "oranges and apples", "cloak and dagger", "smoke and mirrors", etc., are normally invariant and, in these cases, pose no problem, because lemma = base form and there are no composition rules. There are, however, expressions such as "twist and shout" that may be inflected: "twisted and shouted", "twisting and shouting", etc. In these cases, the whole syntactic structure has to be analyzed in order for the system to handle all possible combinations.

  • LEMMA = twist and shout
  • BASE FORM = twist
    • Where:
    • VA indicates that the base form "twist", which is a verb, requires an adjunct, which is formed as follows:
    • The adjunct is a complementizer phrase whose head is "and" and whose complement is [shout];
    • [shout] belongs to the paradigm M16 (regular verbs)
    • [shout] receives person (RPER), number (RNUM), tense (RTNS) and aspect (RASP) from the head of the verbal phrase
  • INFLECTIONAL PARADIGM = M16 ("twist" is a regular verb)
  • SUBCATEGORIZATION FRAME = Y32 (Somebody ----s)
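The way the second verb receives its inflection from the head can be sketched as follows. This is a simplified Python illustration, not the UNLarium rule format: the function names and the form labels ("base", "3sg", "past", "gerund") are hypothetical, and the suffixation only covers regular verbs such as "twist" and "shout".

```python
REGULAR_SUFFIXES = {"base": "", "3sg": "s", "past": "ed", "gerund": "ing"}


def inflect_regular(verb: str, form: str) -> str:
    """Inflect a regular verb by plain suffixation (no spelling adjustments)."""
    return verb + REGULAR_SUFFIXES[form]


def inflect_coordination(head: str, conjunct: str, form: str, conjunction: str = "and") -> str:
    """Inflect the head and pass the same form on to the coordinated verb."""
    return f"{inflect_regular(head, form)} {conjunction} {inflect_regular(conjunct, form)}"


print(inflect_coordination("twist", "shout", "past"))
print(inflect_coordination("twist", "shout", "gerund"))
```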

Multiword Expressions in Natural Language Analysis

Continuous Multiword Expressions

Continuous Multiword Expressions pose no problem in natural language analysis, as they normally correspond to one single dictionary entry, automatically generated according to the rules described above.
For instance, the Latin dictionary will contain several different entries for the same lemma "lingua franca":

  • [lingua franca]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=NOM,NUM=SNG)<lat,0,0>; (nominative singular)
  • [lingua franca]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=VOC,NUM=SNG)<lat,0,0>; (vocative singular)
  • [linguam francam]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=ACC,NUM=SNG)<lat,0,0>; (accusative singular)
  • [linguae francae]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=GNT,NUM=SNG)<lat,0,0>; (genitive singular)
  • [linguae francae]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=DAT,NUM=SNG)<lat,0,0>; (dative singular)
  • [lingua franca]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=ABL,NUM=SNG)<lat,0,0>; (ablative singular)
  • [linguae francae]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=NOM,NUM=PLR)<lat,0,0>; (nominative plural)
  • [linguae francae]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=VOC,NUM=PLR)<lat,0,0>; (vocative plural)
  • [linguas francas]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=ACC,NUM=PLR)<lat,0,0>; (accusative plural)
  • [linguarum francarum]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=GNT,NUM=PLR)<lat,0,0>; (genitive plural)
  • [linguis francis]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=DAT,NUM=PLR)<lat,0,0>; (dative plural)
  • [linguis francis]{}"lingua franca"(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=ABL,NUM=PLR)<lat,0,0>; (ablative plural)
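To make the layout of these entries explicit, the following Python sketch splits one entry into its surface form, lemma, feature list and language code. The regular expression and the function name are assumptions based only on the entries shown above. Nested feature values such as GOV=VA([back],A) are not handled, and the two numbers after the language code are matched but not interpreted.

```python
import re

# Pattern inferred from the entries above: [form]{...}"lemma"(features)<lang,n,n>;
ENTRY = re.compile(
    r'^\[(?P<form>[^\]]*)\]\{[^}]*\}"(?P<lemma>[^"]*)"'
    r'\((?P<features>.*)\)<(?P<lang>\w+),\d+,\d+>;$'
)


def parse_entry(line: str) -> dict:
    """Parse a flat dictionary entry (features without nested commas)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    features = dict(item.split("=", 1) for item in match.group("features").split(","))
    return {
        "form": match.group("form"),
        "lemma": match.group("lemma"),
        "features": features,
        "lang": match.group("lang"),
    }


entry = parse_entry(
    '[linguam francam]{}"lingua franca"'
    '(LEX=N,POS=NOU,LST=MTW,GEN=FEM,CAS=ACC,NUM=SNG)<lat,0,0>;'
)
print(entry["form"], "->", entry["lemma"], entry["features"]["CAS"])
```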

Discontinuous Multiword Expressions

Discontinuous Multiword Expressions cannot be represented as one single dictionary entry because they may involve non-adjacent segments. In this case, we must use information concerning the subcategorization of the base form.
Consider, for instance, the case of "bring back" in English, which may be separable: "to bring [something] back".
If we represent [bring back] as one single dictionary entry, the system will never match sentences such as:

  • (1) Bring him back, please!

The only solution is then to split the discontinuous multiword expression into as many separable parts as it may have, i.e.:

  • [bring]{}"bring back"(LEX=V,POS=VER,LST=MTW,GOV=VA([back],A),...)<en,0,0>; where GOV=VA([back],A) describes the subcategorization rule to form the entry (i.e., the verb requires the adjunct [back], which is an adverb)
  • [back]{}"back"(LEX=A,POS=ADV)<en,0,0>;

According to this dictionary structure, sentence (1) above will be tokenized as:

  • (1) [bring][ ][him][ ][back]
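The difference between the two dictionary layouts can be reproduced with a small greedy longest-match tokenizer. This Python sketch is only an approximation of the behaviour described above: the function, the toy dictionaries and the entry for "him" are assumptions, and blank spaces are dropped instead of being kept as tokens.

```python
def tokenize(sentence: str, dictionary: dict) -> list:
    """Greedy longest-match tokenization; unmatched words get None as lemma."""
    words = sentence.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):       # try the longest span first
            candidate = " ".join(words[i:j])
            if candidate in dictionary:
                tokens.append((candidate, dictionary[candidate]))
                i = j
                break
        else:                                    # no dictionary entry starts here
            tokens.append((words[i], None))
            i += 1
    return tokens


# One single entry for the whole expression: "bring" and "back" stay unmatched.
single_entry = {"bring back": "bring back", "him": "him"}
# Split entries: the head "bring" carries the lemma "bring back".
split_entries = {"bring": "bring back", "back": "back", "him": "him"}

print(tokenize("Bring him back", single_entry))
print(tokenize("Bring him back", split_entries))
```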

as if "bring" and "back" were completely different entries. However, these entries can be merged later on, during the transformation phase. This can be done through:

  • reordering words in the preprocessing phase:
    • (VA[back],%x)(^[back],%y):=(%y)(%x); (move the verb to the right until it finds its adjunct)
    • [bring][him][back] >> [him][bring][back]
  • parsing:
    • (V,%x)(NP,%y):=(VB(%x;%y),+LEX=V,+XB=VB,+GOV=%x,%new); (create an intermediate projection XB out of a V + NP and assign to it the values of the verbal head)
    • [bring][him][back] >> [VB(bring;him)][back]
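The effect of the reordering rule can be emulated on a plain list of tokens. The Python sketch below is an illustration under simplifying assumptions (tokens are bare strings and the function name is hypothetical). It repeatedly swaps the head with its right neighbour until the adjunct is the next token, which is what repeated application of the rule achieves.

```python
def move_head_to_adjunct(tokens: list, head: str, adjunct: str) -> list:
    """Shift `head` rightwards until it is immediately followed by `adjunct`."""
    tokens = list(tokens)                     # do not modify the caller's list
    try:
        i = tokens.index(head)
        tokens.index(adjunct, i + 1)          # the adjunct must occur after the head
    except ValueError:
        return tokens                         # nothing to reorder
    while tokens[i + 1] != adjunct:
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        i += 1
    return tokens


print(move_head_to_adjunct(["bring", "him", "back"], "bring", "back"))
print(move_head_to_adjunct(["bring", "the", "book", "back"], "bring", "back"))
```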

Once the discontinuity has been eliminated (i.e., all the separable parts of the multiword expression are contiguous), the entry can be merged:

  • by simply deleting the separable part (if the tokenization is already correct, i.e., the head of the multiword expression is mapped to the correct UW):
    • (VA[back],%x)([back],%y):=(%x,-VA[back]); (delete the separable part if they are contiguous)
  • by copying features of the separable part to the head of the multiword expression:
    • (VA[back],%x)([back],%y):=(%x,-VA[back],+FEATURES FROM %y); (copy some features from the separable part to the head of the multiword expression and delete the separable part[1])
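Both merging strategies can be sketched in Python over tokens represented as dictionaries. The representation is an assumption made for the example: the key "requires" stands in for the VA[back] feature, and copy_features is a hypothetical list of feature names to carry over from the separable part.

```python
def merge_contiguous(tokens: list, adjunct_form: str, copy_features: tuple = ()) -> list:
    """Merge a head that requires `adjunct_form` with an immediately following adjunct."""
    merged, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if (token.get("requires") == adjunct_form
                and following is not None
                and following["form"] == adjunct_form):
            # Drop the requirement from the head (the "-VA[back]" part of the rule).
            head = {k: v for k, v in token.items() if k != "requires"}
            # Optionally copy the explicitly listed features from the adjunct.
            for name in copy_features:
                if name in following:
                    head[name] = following[name]
            merged.append(head)
            i += 2                             # the adjunct itself is deleted
        else:
            merged.append(token)
            i += 1
    return merged


tokens = [
    {"form": "him"},
    {"form": "bring", "lemma": "bring back", "requires": "back"},
    {"form": "back", "lemma": "back", "POS": "ADV"},
]
print(merge_contiguous(tokens, "back"))
print(merge_contiguous(tokens, "back", copy_features=("POS",)))
```

If the head and the separable part are not contiguous, the token list is returned unchanged, which is why the discontinuity has to be eliminated first.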
Tokenizing multiword expressions made of regular words

If the head of the multiword expression may stand as an isolated entry, it will be important to disambiguate the lexical choice during tokenization in order to avoid retokenization and backtracking. In the case of "bring back", for instance, "bring" may well work as an independent entry, so that the English dictionary will actually contain:

  • [bring]{}"bring"(LEX=V,POS=VER,LST=WRD,...)<en,0,0>; "bring" as a regular word, i.e., not as a part of a multiword expression
  • [bring]{}"bring back"(LEX=V,POS=VER,LST=MTW,GOV=VA([back],A),...)<en,0,0>; "bring" as a part of a multiword expression
  • [back]{}"back"(LEX=A,POS=ADV)<en,0,0>;

In order to induce the system to choose the correct entry in the dictionary, a disambiguation rule must be created for the tokenization phase:

  • (VA[back])([back])=1; (i.e., prefer tokens with these features if they happen to occur)[2]


  1. The features to be copied must be specified explicitly, i.e., GEN=%y, or NUM=%y, etc.
  2. Note that this rule will only apply if the nodes are contiguous. In order to handle non-contiguous nodes, you will have to specify the intervening nodes explicitly or use regular expressions:
    • explicit: (VA[back])()([back])=1; there can be one single node in-between
    • regex: (VA[back])/(){0,3}/([back])=1; there can be up to three nodes in-between
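The contiguous and non-contiguous variants of the disambiguation rule can be compared with a small Python sketch. It is an assumption-laden illustration rather than the actual rule engine: words are plain strings, and the hypothetical parameter max_gap plays the role of the {0,3} quantifier.

```python
def prefer_multiword(words: list, head: str, adjunct: str, max_gap: int = 0) -> bool:
    """True if `adjunct` follows `head` with at most `max_gap` words in between."""
    for i, word in enumerate(words):
        if word == head:
            # Positions i+1 .. i+1+max_gap are the only places the adjunct may occupy.
            window = words[i + 1 : i + 2 + max_gap]
            if adjunct in window:
                return True
    return False


print(prefer_multiword(["bring", "back"], "bring", "back"))
print(prefer_multiword(["bring", "him", "back"], "bring", "back"))
print(prefer_multiword(["bring", "him", "back"], "bring", "back", max_gap=3))
```

With max_gap=0 the check corresponds to the contiguous rule, so "bring him back" would select the regular entry for "bring" unless a wider window is allowed.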