Default grammar

From UNL Wiki
Revision as of 20:26, 26 October 2012

The Default grammar is expected to be language-independent and is normally loaded after the language-specific grammars, in order to handle phenomena that they do not cover. The default grammar is used only in transformation (t-grammar) and is unidirectional: there is one default grammar for UNLization and a different default grammar for NLization.

Files

NL>UNL Default Grammar (UNLization)

The NL>UNL Default Grammar is divided into 7 sections:

  1. Pre-processing (prepares the input for the processing)
  2. Normalization (standardizes the feature structure)
  3. Parsing (converts the input list structure into a tree structure)
  4. Transformation (converts the surface tree structure into the deep tree structure)
  5. Dearborization (converts the tree structure into a network structure)
  6. Interpretation (converts the syntactic network into a semantic network)
  7. Post-processing (adjusts the final output)

Pre-processing

The pre-processing module prepares the input for processing. It includes rules such as the following:

(TEMP,%x)(BLK,%y)(TEMP,%z):=(%x&%y&%z,-BLK); merges temporary nodes

if two temporary nodes (TEMP) are separated by a blank space (BLK), they become one single node

("asdfgh")(" ")("asdfgh") > ("asdfgh asdfgh")

(PPN,%x)(BLK,%y)(PPN,%z):=(%x&%y&%z,+TEMP,-BLK); merges sequences of proper names

if two proper names (PPN) are separated by a blank space (BLK), they become one single node

("John")(" ")("Smith") > ("John Smith")

(BLK):=; deletes the blank space

deletes all remaining blank spaces

("a")(" ")("b") > ("a")("b")
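The effect of the three pre-processing rules can be approximated outside the grammar formalism. The following Python sketch is only an illustration: the node representation (a text plus a set of features) and the function name `preprocess` are assumptions, and the merge operator `%x&%y&%z` is assumed to concatenate the strings and unite the features of the three nodes.

```python
def preprocess(nodes):
    """Merge TEMP/PPN nodes separated by a blank, then delete blanks.

    `nodes` is a list of (text, features) pairs; `features` is a set of
    strings. This representation is hypothetical, not the engine's own.
    """
    nodes = [(text, set(feats)) for text, feats in nodes]
    changed = True
    while changed:
        changed = False
        for i in range(len(nodes) - 2):
            (a, fa), (b, fb), (c, fc) = nodes[i:i + 3]
            if "BLK" not in fb:
                continue
            if "TEMP" in fa and "TEMP" in fc:
                extra = set()          # first rule: only -BLK
            elif "PPN" in fa and "PPN" in fc:
                extra = {"TEMP"}       # second rule: +TEMP, -BLK
            else:
                continue
            merged = (fa | fb | fc | extra) - {"BLK"}
            nodes[i:i + 3] = [(a + b + c, merged)]
            changed = True
            break
    # third rule: delete every blank that was not absorbed by a merge
    return [node for node in nodes if "BLK" not in node[1]]


print(preprocess([("John", {"PPN"}), (" ", {"BLK"}), ("Smith", {"PPN"})]))
```

Merging is attempted before blank deletion because the first two rules need the blank node as context; once the blanks are gone, they can no longer match.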

Normalization

The normalization section is divided into three modules:

  • Standardization, where isolated features are rewritten in the attribute-value format.

This is used when the feature lists of the entries are not represented in the dictionary in the attribute-value format, or as a cross-check for the feature assignment operations performed by the grammar itself. An example of a standardization rule is:

(CAU,^ASP):=(-CAU,+ASP=CAU);

if a node has the feature "CAU" (= causative) but does not have the attribute "ASP" (aspect), then rewrite CAU as ASP=CAU

  • Propagation, where the features of top categories are copied to their children.

This is used to avoid proliferating rules. For instance, every word having the feature SNGT (singulare tantum) is also SNG (singular). This information is not stated in the dictionary, and must be made explicit in the grammar, in order not to simply duplicate all rules dealing with SNG. This generalization movement is performed by rules such as:

(SNGT,^SNG):=(-NUM,-SNGT,+NUM=SNG,+NUM=SNGT);

if a node has the feature SNGT (singulare tantum) and does not have the feature SNG (singular), then copy the feature SNG to it

  • Other normalization rules, to deal with special cases such as temporary UWs, pronouns and numbers, such as:

(TEMP,^LEX):=(+LEX=N,+POS=PPN); treats all temporary words as proper nouns

temporary UWs, which are absent from the dictionary, do not carry any information other than the feature TEMP. In order to manipulate them inside the grammar, we assign them the feature PPN (proper name), i.e., all temporary words are interpreted as proper names
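The three kinds of normalization rules above can be mimicked with plain set operations. The sketch below is an approximation under stated assumptions: features are stored as strings in a set (bare features such as "CAU", attribute-value pairs such as "ASP=CAU"), a condition on a value is assumed to match both the bare and the paired form, and removing an attribute such as NUM is assumed to remove all of its pairs. The helper names are hypothetical.

```python
def has_value(feats, value):
    """True if `value` occurs bare ("SNG") or inside a pair ("NUM=SNG")."""
    return value in feats or any(f.endswith("=" + value) for f in feats)


def has_attr(feats, attr):
    """True if `attr` occurs bare ("NUM") or heads a pair ("NUM=SNG")."""
    return attr in feats or any(f.startswith(attr + "=") for f in feats)


def standardize_cau(feats):
    # Standardization: causative without an aspect attribute becomes ASP=CAU.
    if has_value(feats, "CAU") and not has_attr(feats, "ASP"):
        return (feats - {"CAU"}) | {"ASP=CAU"}
    return feats


def propagate_sngt(feats):
    # Propagation: singulare tantum words are also marked as singular.
    if has_value(feats, "SNGT") and not has_value(feats, "SNG"):
        kept = {f for f in feats
                if f not in {"NUM", "SNGT"} and not f.startswith("NUM=")}
        return kept | {"NUM=SNG", "NUM=SNGT"}
    return feats


def normalize_temp(feats):
    # Special case: temporary words without a lexical category become PPN.
    if "TEMP" in feats and not has_attr(feats, "LEX"):
        return feats | {"LEX=N", "POS=PPN"}
    return feats


print(sorted(propagate_sngt({"N", "SNGT"})))
```

Each function is idempotent: once the rule has applied, its negative condition (^ASP, ^SNG, ^LEX) no longer holds, so it cannot fire on the same node again.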

Parsing

UNL>NL Default Grammar (NLization)

The UNL>NL Default Grammar is divided into 6 sections:

  • Pre-processing (prepares the input for the processing)
  • Normalization (standardizes the feature structure)
  • Arborization (converts the syntactic network into a syntactic tree)
  • Transformation (converts the deep syntactic structure into the surface syntactic structure)
  • Linearization (converts the syntactic structure into a list structure)
  • Post-processing (adjusts the final output)

Modules

Pre-processing

Structure

The English grammars are unidirectional. There is a grammar for UNLization (the ENG->UNL Analysis Grammar) and another grammar for NLization (the UNL->ENG Generation Grammar). The former takes English sentences as inputs and provides the corresponding UNL graphs as outputs; the latter takes UNL graphs as inputs and provides the corresponding English sentences as outputs.

The English grammars are of two types: the transformation grammar, or simply t-grammar, which is used to manipulate data structures (i.e., to convert a list into a tree, a tree into a network, a network into a tree, a tree into a list); and the disambiguation grammar, or simply d-grammar, which is used to control the behavior of the t-grammar (by prohibiting or inducing some of its possibilities).

The English grammars are divided into two parts: the English Grammar itself, which contains rules that are specific to English, and the Default Grammar, which contains language-independent rules and may be used by any language. The English Grammar applies first (i.e., the rules of the English Grammar have higher priority); the Default Grammar applies when no rule from the English Grammar can be fired.
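The priority between the two parts can be pictured as a simple fallback loop. The following Python sketch is only an illustration of that ordering, not the engine's actual control strategy: a rule is modelled as a hypothetical function that returns a new state when it fires and `None` when it does not, and the toy rules operate on plain strings.

```python
def apply_first(rules, state):
    """Return the result of the first rule that fires, or None."""
    for rule in rules:
        new_state = rule(state)
        if new_state is not None:
            return new_state
    return None


def run(language_rules, default_rules, state, max_steps=1000):
    """Apply language-specific rules first; fall back to default rules."""
    for _ in range(max_steps):
        nxt = apply_first(language_rules, state)
        if nxt is None:
            # no language-specific rule can fire: try the default grammar
            nxt = apply_first(default_rules, state)
        if nxt is None:
            return state  # nothing fires any more
        state = nxt
    raise RuntimeError("rule set did not converge")


# Toy rules on strings (purely illustrative).
language_rules = [lambda s: s.replace("aa", "b", 1) if "aa" in s else None]
default_rules = [lambda s: s.replace("a", "c", 1) if "a" in s else None]

print(run(language_rules, default_rules, "aaa"))  # bc
```

With the opposite priority the default rule would rewrite every "a" first and the language-specific rule would never fire, which is why the ordering matters.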

Files

Requisites

The grammars presented here depend heavily on the structure of the dictionary presented at English dictionary. You have to be acquainted with the formalism described at the UNL Dictionary Specs and the Tagset in order to fully understand how the grammars deal with the dictionary entry structure. You should also understand the process of tokenization done by the machine.

Features

The grammars play with a set of features that come from three different sources:

  • Dictionary features are the features ascribed to the entries in the dictionary. They appear as simple attributes (LEX,GEN,NUM), as simple values (N,MCL,SNG) or as attribute-value pairs (LEX=N,GEN=MCL,NUM=SNG).
  • System-defined features are features automatically assigned by EUGENE and IAN during the processing. They are the following:
    • SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
    • CHEAD = beginning of a scope (system-defined feature assigned automatically by the machine)
    • STAIL = end of the sentence (system-defined feature assigned automatically by the machine)
    • CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
    • TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
  • Grammar features are features created inside the grammar in any of its intermediate states between the input and the output.

All the features are described at the Tagset.

UNLization (ENG-UNL)

The UNLization process is performed in three different steps:

  1. Segmentation of English sentences is done automatically by the machine. It uses some punctuation signs (such as ".","?","!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences are provided one per line, this step does not require any action from the grammar developer.
  2. Tokenization of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice.
  3. Transformation applies after tokenization and is divided into five different steps:
    1. Normalization prepares the input for the transformation rules. In the normalization step, we delete blank spaces, replace some words by symbols (such as "point" by ".", when between numbers), process numbers and temporary words (such as proper nouns) and standardize the feature structure of the nodes (by informing, for instance, that words having the feature "SNGT" (singulare tantum) are also "SNG" (singular); that "N" is a value of the attribute "LEX"; etc.).
    2. Parsing performs the syntactic analysis of the normalized input. The parsing follows some general procedures coming from the X-bar theory and results in a tree structure with binary branching with the following configuration:
          XP
         / \
      spec  XB
           / \
          XB  adjt
         / \
        X   comp
        |
      head
      
      Where X is the category of any of the heads (N,V,J,A,P,D,I,C), XB is any of the intermediate projections (there can be as many intermediate projections as complements (comp) and adjuncts (adjt) in a phrase) and XP is the maximal projection, always linking the topmost intermediate projection to the specifier (spec).
    3. Dearborization rewrites the tree structure as a graph structure, replacing intermediate (XB) and maximal projections (XP) by head-driven binary syntactic relations: XS(head,spec), XC(head,comp) and XA(head,adjt), where X is the category of any of the heads (e.g., VC means complement to the verb).
    4. Interpretation replaces syntactic binary relations by the UNL semantic binary relations (e.g., VC(head,comp) may be rewritten as obj(head,comp)).
    5. Rectification adjusts the output graph to the UNL Standards.
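The dearborization and interpretation steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the grammar itself: the tuple encoding of the tree, the function names and the example sentence are assumptions, and the only interpretation mapping used (VC to obj) is the one mentioned in the text.

```python
def dearborize(tree, relations):
    """Return (category, head word) of `tree`, appending XS/XC/XA relations.

    Hypothetical encoding:
      ("X", cat, word)               a head
      ("XB", child, dependent, role) role is "comp" or "adjt"
      ("XP", spec_or_None, child)    a maximal projection
    """
    kind = tree[0]
    if kind == "X":
        _, cat, word = tree
        return cat, word
    if kind == "XB":
        _, child, dependent, role = tree
        cat, head = dearborize(child, relations)
        _, dep_head = dearborize(dependent, relations)
        suffix = "C" if role == "comp" else "A"
        relations.append((cat + suffix, head, dep_head))
        return cat, head
    if kind == "XP":
        _, spec, child = tree
        cat, head = dearborize(child, relations)
        if spec is not None:
            _, spec_head = dearborize(spec, relations)
            relations.append((cat + "S", head, spec_head))
        return cat, head
    raise ValueError("unknown node type: %r" % (kind,))


def interpret(relations, mapping):
    """Rename syntactic relations that have a semantic counterpart."""
    return [(mapping.get(name, name), head, dep) for name, head, dep in relations]


def np(word):
    # a bare noun phrase, simplified: no intermediate projection
    return ("XP", None, ("X", "N", word))


tree = ("XP", np("John"), ("XB", ("X", "V", "eats"), np("apples"), "comp"))
syntactic = []
dearborize(tree, syntactic)
print(syntactic)                          # [('VC', 'eats', 'apples'), ('VS', 'eats', 'John')]
print(interpret(syntactic, {"VC": "obj"}))
```

Every projection passes its head upward, so each relation links the head of a phrase directly to the head of its specifier, complement or adjunct, and the intermediate XB and XP nodes disappear from the output.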

Tokenization

The tokenization is done with the English Disambiguation Grammar.
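The left-to-right, longest-first principle can be illustrated with a small Python sketch. It is a minimal approximation under several assumptions: the dictionary is a hypothetical mapping from strings to feature sets, strings that match no entry are consumed up to the next blank and receive the feature TEMP, and lexical ambiguity (which the disambiguation rules address) is ignored.

```python
def tokenize(sentence, dictionary):
    """Split `sentence` into (text, features) pairs, longest entry first."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(sentence):
        # try the longest candidate starting at position i, then shorter ones
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if chunk in dictionary:
                tokens.append((chunk, dictionary[chunk]))
                i += size
                break
        else:
            # no entry matches: consume up to the next blank as a TEMP node
            j = i
            while j < len(sentence) and sentence[j] != " ":
                j += 1
            j = max(j, i + 1)  # always advance by at least one character
            tokens.append((sentence[i:j], {"TEMP"}))
            i = j
    return tokens


# Hypothetical dictionary fragment.
dictionary = {
    "New York": {"PPN"},
    "New": {"J"},
    "York": {"PPN"},
    "in": {"P"},
    " ": {"BLK"},
}

print([text for text, _ in tokenize("Xyzzy in New York", dictionary)])
# ['Xyzzy', ' ', 'in', ' ', 'New York']
```

Because the longest candidate is tried first, "New York" is preferred over "New" followed by "York"; choosing among entries of the same length is what the disambiguation rules are for.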

Normalization

The normalization is done with the Normalization Grammar.

Parsing

Dearborization

Rectification

UNL-EN (Generation) Grammar

UNL-EN (Generation) Transformation Grammar

UNL-EN (Generation) Disambiguation Grammar

Software