English grammar

From UNL Wiki
(Difference between revisions)
(Structure)
Line 1: Line 1:
The English grammars here presented are provided as a didactic sample that may help users to build their own grammars. They are used for representing the English sentences into UNL ([[UNLization]]) and for generating English sentences from UNL graphs ([[NLization]]). They follow the syntax defined at the [[UNL Grammar Specs]] and have been used for [[IAN]] and [[EUGENE]]. They follow, in general, the [[X-bar]] approach, with some adaptations.
+
The English grammars follow, in general, the [[X-bar]] approach, with some adaptations. They are used for transforming English sentences into UNL ([[UNLization]]) and for generating English sentences out of UNL graphs ([[NLization]]). They follow the syntax defined at the [[UNL Grammar Specs]] and the tags described at the [[Tagset]].  
  
 
== Structure ==
 
== Structure ==
The English grammars are '''unidirectional''' There is a grammar for UNLization (the ENG->UNL Analysis Grammar) and another grammar for NLization (the UNL->ENG Generation Grammar). The former takes natural language sentences as inputs and provides the corresponding UNL graphs as outputs; the latter takes UNL graphs as inputs and provides the corresponding English sentences as outputs.
+
The English grammars are '''unidirectional'''. There is a grammar for UNLization (the ENG->UNL Analysis Grammar) and another grammar for NLization (the UNL->ENG Generation Grammar). The former takes natural language sentences as inputs and provides the corresponding UNL graphs as outputs; the latter takes UNL graphs as inputs and provides the corresponding English sentences as outputs.
  
The English grammars are of two types: the '''transformation grammar''', or simply [[t-grammar]], which is used to manipulate data structures (i.e., to convert a list into a tree, a tree into network, a network into a tree, a tree into list); and the disambiguation grammar, or simply [[d-grammar]], which is used to control the behavior of the t-grammar (by prohibiting or inducing some of its possibilities).
+
The English grammars are of two types: the '''transformation grammar''', or simply [[t-grammar]], which is used to manipulate data structures (i.e., to convert lists into trees, trees into networks, networks into trees, trees into lists); and the '''disambiguation grammar''', or simply [[d-grammar]], which is used to control the behavior of the t-grammar (by prohibiting or inducing some of its possibilities).
  
The English grammars are divided into two parts: the '''English Grammar''' itself, which contains rules that are specific to English, and the '''Default Grammar''', which contains language-independent rules and may be used by any language. The English Grammar applies first (i.e., the rules of the English Grammar have higher priority); the [[Default Grammar]] applies when no rule from the English Grammar can be fired.
+
The English grammars are divided into two parts: the '''English Grammar''' itself, which contains rules that are specific to English, and the [[Default Grammar]], which contains language-independent rules and may be used by any language. The English Grammar applies first (i.e., the rules of the English Grammar have higher priority); the Default Grammar applies when no rule from the English Grammar can be fired.
 
+
== Files ==
+
*ENG-UNL (Analysis) Grammar (IAN)
+
**[http://www.unlweb.net/resources/grammar/eng_unl_tgrammar.txt ENG->UNL T-Grammar]
+
**[http://www.unlweb.net/resources/grammar/nl_unl_tgrammar.txt NL->UNL Default T-Grammar]
+
**[http://www.unlweb.net/resources/grammar/eng_unl_dgrammar.txt ENG->UNL D-Grammar]
+
**[http://www.unlweb.net/resources/grammar/nl-unl_dgrammar.txt NL->UNL Default D-Grammar]
+
*UNL-ENG (Generation) Grammar (EUGENE)
+
**[http://www.unlweb.net/resources/grammar/unl_eng_tgrammar.txt UNL->ENG T-Grammar]
+
**[http://www.unlweb.net/resources/grammar/unl_nl_tgrammar.txt UNL->NL T-Grammar]
+
**[http://www.unlweb.net/resources/grammar/unl_eng_dgrammar.txt UNL->ENG D-Grammar]
+
**[http://www.unlweb.net/resources/grammar/unl_nl_dgrammar.txt UNL->NL D-Grammar]
+
 
+
== Requisites ==
+
The grammars here presented depend heavily on the structure of the dictionary presented at [[English dictionary]]. You have to be acquainted with the formalism described at the [[UNL Dictionary Specs]] and the [[Tagset]] in order to fully understand how the grammar deal with the dictionary entry structure. You should also understand the process of [[tokenization]] done by the machine.
+
  
 
== Features ==
 
== Features ==
 
The grammars play with a set of features that come from three different sources:
 
The grammars play with a set of features that come from three different sources:
*'''Dictionary features''' are the features ascribed to the entries in the dictionary, and appear either as simple attributes (LEX,GEN,NUM), as simple values (N,MCL,SNG) or attribute-value pairs (LEX=N,GEN=MCL,NUM=SNG).  
+
*'''Dictionary features''' are the features ascribed to the entries in the dictionary, and appear as attribute-value pairs (LEX=N,GEN=MCL,NUM=SNG).  
 
*'''System-defined features''' are features automatically assigned by EUGENE and IAN during the processing. They are the following:
 
*'''System-defined features''' are features automatically assigned by EUGENE and IAN during the processing. They are the following:
 
**SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
 
**SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
Line 32: Line 17:
 
**CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
 
**CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
 
**TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
 
**TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
 +
**SCOPE = scopes entry (system-defined feature assigned to hyper-nodes)
 +
**DIGIT = digits (system-defined feature assigned to digits)
 
*'''Grammar features''' are features created inside the grammar in any of its intermediate states between the input and the output.
 
*'''Grammar features''' are features created inside the grammar in any of its intermediate states between the input and the output.
All the features are described at the [[Tagset]].
+
The dictionary and system-defined features are described at the [[Tagset]].
  
== UNLization (ENG-UNL) ==
+
== UNLization (ENG->UNL) ==
 
The UNLization process is performed in three different steps:
 
The UNLization process is performed in three different steps:
 
<ol>
 
<ol>
 
<li>[[Segmentation]] of English sentences is done automatically by the machine. It uses some punctuation signs (such as ".","?","!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences are provided one per line, this step does not require any action from the grammar developer.</li>
 
<li>[[Segmentation]] of English sentences is done automatically by the machine. It uses some punctuation signs (such as ".","?","!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences are provided one per line, this step does not require any action from the grammar developer.</li>
<li>[[Tokenization]] of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice. </li>
+
<li>[[Tokenization]] of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice. The tokenization is done with the [[English Disambiguation Grammar]].</li>
<li>[[Transformation]] applies after tokenization and is divided in five different steps:</li>
+
<li>[[Transformation]] applies after tokenization and is divided in two different steps:</li>
 
<ol>
 
<ol>
<li>'''Normalization''' prepares the input for the transformation rules. In the normalization step, we delete blank spaces, replace some words by symbols (such as "point" by ".", when between numbers), process numbers and temporary words (such as proper nouns) and standardize the feature structure of the nodes (by informing, for instance, that words having the feature "SNGT" (singulare tantum) are also "SNG" (singular); that "N" is a value of the attribute "LEX"; etc).</li>
+
<li>'''Morphology''', where English features (such as PLR, PAS and [not]) are mapped into attributes (@pl, @past and @not, respectively).</li>
<li>'''Parsing''' performs the syntactic analysis of the normalized input. The parsing follows some general procedures coming from the [[X-bar theory]] and results in a tree structure with binary branching with the following configuration:
+
<li>'''Syntax''', where structures that are specific to English (such as determiners, compounds and coordination) are mapped into UNL.</li>  
<pre>
+
    XP
+
  / \
+
spec  XB
+
    / \
+
    XB  adjt
+
  / \
+
  X  comp
+
  |
+
head
+
</pre>
+
:Where X is the category of any of the heads (N,V,J,A,P,D,I,C), XB is any of the intermediate projections (there can be as many intermediate projections as complements (comp) and adjuncts (adjt) in a phrase) and XP is the maximal projection, always linking the topmost intermediate projection to the specifier (spec).</li>
+
<li>'''Dearborization''' rewrites the tree structure as a graph structure, replacing intermediate (XB) and maximal projections (XP) by head-driven binary syntactic relations: XS(head,spec), XC(head,comp) and XA(head,adjt), where X is the category of any of the heads (e.g.,VC means complement to the verb). </li>
+
<li>'''Interpretation''' replaces syntactic binary relations by the UNL semantic binary relations (e.g., VC(head,comp) may be rewritten as obj(head,comp)).</li>
+
<li>'''Rectification''' adjusts the output graph to the UNL Standards.</li>
+
 
</ol>
 
</ol>
 +
The transformation is done with the [[English Transformation Grammar]].
 
</ol>
 
</ol>
=== Tokenization ===
+
=
The tokenization is done with the [[English Disambiguation Grammar]].
+
 
+
=== Normalization ===
+
The normalization grammar is done with the [[Normalization Grammar]].
+
 
+
=== Parsing ===
+
 
+
=== Dearborization ===
+
 
+
=== Rectification ===
+
 
+
== UNL-EN (Generation) Grammar ==
+
 
+
 
+
=== UNL-EN (Generation) Transformation Grammar ===
+
 
+
 
+
=== UNL-EN (Generation) Disambiguation Grammar ===
+
 
+
== Examples ==
+
*[[English_grammar/Temporary_entries|Temporary entries]]
+
*[[English_grammar/Numbers|Numbers]]
+
*[[English_grammar/Determiners|Determiners]]
+
*[[English_grammar/Prepositions|Prepositions]]
+
*[[English_grammar/Conjunctions|Conjunctions]]
+
*[[English_grammar/NP|Noun phrase structure]]
+
*[[English_grammar/Time|Expressions of time]]
+
*[[English_grammar/Verbs|Verbs]]
+
*[[English_grammar/Pronouns|Pronouns]]
+
*[[English_grammar/Sentences|Sentence structures]]
+

Revision as of 22:47, 29 October 2012

The English grammars follow, in general, the X-bar approach, with some adaptations. They are used for transforming English sentences into UNL (UNLization) and for generating English sentences out of UNL graphs (NLization). They follow the syntax defined at the UNL Grammar Specs and the tags described at the Tagset.

Structure

The English grammars are unidirectional. There is a grammar for UNLization (the ENG->UNL Analysis Grammar) and another grammar for NLization (the UNL->ENG Generation Grammar). The former takes natural language sentences as inputs and provides the corresponding UNL graphs as outputs; the latter takes UNL graphs as inputs and provides the corresponding English sentences as outputs.

The English grammars are of two types: the transformation grammar, or simply t-grammar, which is used to manipulate data structures (i.e., to convert lists into trees, trees into networks, networks into trees, trees into lists); and the disambiguation grammar, or simply d-grammar, which is used to control the behavior of the t-grammar (by prohibiting or inducing some of its possibilities).

The English grammars are divided into two parts: the English Grammar itself, which contains rules that are specific to English, and the Default Grammar, which contains language-independent rules and may be used by any language. The English Grammar applies first (i.e., the rules of the English Grammar have higher priority); the Default Grammar applies when no rule from the English Grammar can be fired.
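
The priority between the two parts can be pictured with a minimal Python sketch. It only illustrates the fallback order described above. The rule functions, their names and the string-based data are hypothetical, and the real rule syntax is the one defined at the UNL Grammar Specs, which is not modeled here.

```python
from typing import Callable, List, Optional

# Hypothetical rule type: a rule returns the rewritten data when it fires,
# or None when it cannot be applied.
Rule = Callable[[str], Optional[str]]


def apply_first(rules: List[Rule], data: str) -> Optional[str]:
    """Return the result of the first rule that fires, or None."""
    for rule in rules:
        result = rule(data)
        if result is not None:
            return result
    return None


def apply_grammars(english_rules: List[Rule],
                   default_rules: List[Rule],
                   data: str) -> Optional[str]:
    """Try the English Grammar first; fall back to the Default Grammar
    only when no English rule can be fired."""
    result = apply_first(english_rules, data)
    if result is None:
        result = apply_first(default_rules, data)
    return result


# Toy rules, invented for illustration only.
def english_rule(data: str) -> Optional[str]:
    return "english:" + data if data.startswith("the ") else None


def default_rule(data: str) -> Optional[str]:
    return "default:" + data
```

With these toy rules, apply_grammars([english_rule], [default_rule], "the cat") is handled by the English rule, while "cat" falls through to the default rule.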

Features

The grammars work with a set of features that come from three different sources:

  • Dictionary features are the features ascribed to the entries in the dictionary, and appear as attribute-value pairs (LEX=N,GEN=MCL,NUM=SNG).
  • System-defined features are features automatically assigned by EUGENE and IAN during the processing. They are the following:
    • SHEAD = beginning of the sentence (system-defined feature assigned automatically by the machine)
    • CHEAD = beginning of a scope (system-defined feature assigned automatically by the machine)
    • STAIL = end of the sentence (system-defined feature assigned automatically by the machine)
    • CTAIL = end of a scope (system-defined feature assigned automatically by the machine)
    • TEMP = temporary entry (system-defined feature assigned to the strings that are not present in the dictionary)
    • SCOPE = scope entry (system-defined feature assigned to hyper-nodes)
    • DIGIT = digits (system-defined feature assigned to digits)
  • Grammar features are features created inside the grammar in any of its intermediate states between the input and the output.

The dictionary and system-defined features are described at the Tagset.
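
As an illustration of the attribute-value notation, the following Python sketch reads a feature list such as LEX=N,GEN=MCL,NUM=SNG into a mapping. The function name parse_features is hypothetical and is not part of IAN or EUGENE; the sketch assumes that pairs are separated by commas and that each pair has exactly the form attribute=value.

```python
from typing import Dict


def parse_features(spec: str) -> Dict[str, str]:
    """Split a comma-separated list of attribute=value pairs into a dict.

    Assumption: every item has the form ATTRIBUTE=VALUE; anything else
    is rejected instead of being silently guessed.
    """
    features: Dict[str, str] = {}
    for pair in spec.split(","):
        attribute, separator, value = pair.strip().partition("=")
        if not separator or not attribute or not value:
            raise ValueError(f"not an attribute-value pair: {pair!r}")
        features[attribute] = value
    return features
```

For the example above, parse_features("LEX=N,GEN=MCL,NUM=SNG") gives a mapping with the keys LEX, GEN and NUM.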

UNLization (ENG->UNL)

The UNLization process is performed in three different steps:

  1. Segmentation of English sentences is done automatically by the machine. It uses some punctuation signs (such as ".","?","!") and special characters (end of line, end of paragraph) as sentence boundaries. As the sentences are provided one per line, this step does not require any action from the grammar developer.
  2. Tokenization of each sentence is done against the dictionary entries, from left to right, following the principle of the longest first. As there are several lexical ambiguities, some disambiguation rules are required to induce the correct lexical choice. The tokenization is done with the English Disambiguation Grammar.
  3. Transformation applies after tokenization and is divided into two steps:
    1. Morphology, where English features (such as PLR, PAS and [not]) are mapped into attributes (@pl, @past and @not, respectively).
    2. Syntax, where structures that are specific to English (such as determiners, compounds and coordination) are mapped into UNL.

    The transformation is done with the English Transformation Grammar.
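
The tokenization and morphology steps above can be sketched in Python as follows. This is a simplified illustration, not the algorithm implemented in IAN: the toy dictionary is invented, the dictionary is reduced to a set of strings, the handling of unmatched characters is an assumption, and no disambiguation rules are applied. The feature-to-attribute table contains only the three examples mentioned in the list.

```python
from typing import Iterable, List, Set, Tuple


def tokenize(sentence: str, dictionary: Set[str]) -> List[Tuple[str, bool]]:
    """Match the sentence against dictionary entries from left to right,
    always preferring the longest entry that starts at the current position.

    Returns (token, known) pairs. Assumption: a character that starts no
    dictionary entry becomes a one-character token flagged as unknown,
    loosely mirroring the TEMP feature.
    """
    tokens: List[Tuple[str, bool]] = []
    position = 0
    while position < len(sentence):
        match = None
        # Try the longest candidate first and shrink it one character at a time.
        for end in range(len(sentence), position, -1):
            candidate = sentence[position:end]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            tokens.append((sentence[position], False))
            position += 1
        else:
            tokens.append((match, True))
            position += len(match)
    return tokens


# Only the examples given in the text; the real mapping is larger.
FEATURE_TO_ATTRIBUTE = {"PLR": "@pl", "PAS": "@past", "[not]": "@not"}


def to_unl_attributes(features: Iterable[str]) -> List[str]:
    """Map the English features that have a known UNL attribute."""
    return [FEATURE_TO_ATTRIBUTE[f] for f in features if f in FEATURE_TO_ATTRIBUTE]


toy_dictionary = {"I", "like", "ice", "cream", "ice cream", " "}
tokens = tokenize("I like ice cream", toy_dictionary)
```

In this toy run, "ice cream" is kept as a single token because the longer entry is tried before "ice". When several entries of the same length compete, the choice is left to the disambiguation rules, which the sketch does not model.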
