English grammar/Temporary entries

From UNL Wiki

Revision as of 07:09, 31 July 2012

Temporary entries are entries that are not expected to be translated, such as URLs, e-mail addresses and phone numbers. They are not included in the dictionary because their UNL representation is the same as the source. For instance, the UNL for "asfsdfdfdsf" is "asfsdfdfdsf". For the time being, no transliteration is required either.

Issues

Temporary entries are expressed between "quotes"
Note that temporary entries are represented, in UNL, between quotes, but these quotes are provided automatically by the system: all the words having the feature TEMP will be displayed, in the final output, between quotes. The feature TEMP is assigned automatically by the system to all the entries that were not found in the dictionary.

Examples

UNLization

asfsdfdfdsf

  • INPUT (ENG): asfsdfdfdsf
  • OUTPUT (UNL): "asfsdfdfdsf"
  • DICTIONARY: no entry is necessary
  • T-GRAMMAR: no rule is necessary
  • D-GRAMMAR: to control hyper-segmentation, as explained below.

The OUTPUT for "asfsdfdfdsf" should be simply "asfsdfdfdsf". No process is actually required. However, the English dictionary contains the entries:

  • [as]{}"" (LEX=C,POS=SCJ,POS=AVR,att=@as,rel=man)<eng,255,0>; and
  • [s]{}"" (LEX=I,POS=SFX,LST=SBW)<eng,255,0>;

If we do not control the tokenization process, "asfsdfdfdsf" will be tokenized as [as][f][s][dfdfd][s][f], which is not correct, since we expect the whole string to be treated as a single temporary entry. To avoid that, we have to inform the machine that words such as [as] may only occur after a blank space or the beginning of the sentence, and before a blank space. Similarly, we have to tell the machine that the suffix [s] must be followed by a blank space. This kind of information is provided in the disambiguation grammar (d-grammar), and it is used to prevent the system from hyper-analyzing temporary entries. In the English d-grammar, this is done by the rules:

(TEMP)(^BLK,^PUT,^STAIL)=0;
(a temporary word must be followed by a blank, a punctuation mark or the end of the sentence)
(^BLK,^PUT,^SHEAD)(TEMP)=0;
(a temporary word must be preceded by a blank, a punctuation mark or the beginning of the sentence)
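
The effect of these two rules can be illustrated with a small Python sketch. It is not the UNL engine: the `tokenize` function, the toy `DICTIONARY` and the `respect_boundaries` flag are assumptions made for illustration, and the boundary check only approximates what the d-grammar rules achieve.

```python
import string

# Toy stand-ins for the [as] and [s] dictionary entries (assumption).
DICTIONARY = {"as", "s"}
# Characters treated as blanks or punctuation marks (assumption).
BOUNDARY = set(string.punctuation) | {" "}


def tokenize(text, respect_boundaries):
    """Split text into dictionary entries, boundary characters and
    leftover (temporary) strings, scanning from left to right."""
    tokens = []
    temp = ""  # characters not covered by any dictionary entry
    i = 0
    while i < len(text):
        match = None
        # Try the longest dictionary entry first.
        for entry in sorted(DICTIONARY, key=len, reverse=True):
            if text.startswith(entry, i):
                end = i + len(entry)
                left_ok = i == 0 or text[i - 1] in BOUNDARY
                right_ok = end == len(text) or text[end] in BOUNDARY
                if not respect_boundaries or (left_ok and right_ok):
                    match = entry
                    break
        if match or text[i] in BOUNDARY:
            if temp:  # close the pending temporary string
                tokens.append(temp)
                temp = ""
            piece = match if match else text[i]
            tokens.append(piece)
            i += len(piece)
        else:
            temp += text[i]
            i += 1
    if temp:
        tokens.append(temp)
    return tokens


print(tokenize("asfsdfdfdsf", False))  # ['as', 'f', 's', 'dfdfd', 's', 'f']
print(tokenize("asfsdfdfdsf", True))   # ['asfsdfdfdsf']
```

Without the boundary check the sketch reproduces the hyper-segmentation described above; with it, the string is kept as a single temporary entry.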

www.undlfoundation.org

  • INPUT (ENG): www.undlfoundation.org
  • OUTPUT (UNL): "www.undlfoundation.org"
  • DICTIONARY: no entry is necessary
  • T-GRAMMAR: no rule is necessary
  • D-GRAMMAR: to control hyper-segmentation, as explained below.

If we introduce the rules above to the disambiguation grammar, we will solve the problems concerning "asfsdfdfdsf", "asfsdfdfdsf asfsdfdfdsf" and "asfsdfdfdsf asfsdfdfdsf asfsdfdfdsf", which will be correctly tokenized. We will not, however, solve the problems concerning "www.undlfoundation.org", because, in this case, there are punctuation marks between temporary words. The result of the tokenization will be [www][.][undlfoundation][.][org], which is not the expected [www.undlfoundation.org]. We then have to tell the machine that there cannot be two temporary words separated by a punctuation mark. Therefore:

(TEMP)(PUT)(TEMP)=0;
(there cannot be two temporary words separated by a punctuation mark)
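
One way to picture the effect of this rule is as a step that joins temporary words separated by punctuation. The Python sketch below is only an illustration under that assumption: in the d-grammar the rule forbids the segmentation rather than merging tokens, and `merge_temporary` and its `max_punct` parameter are hypothetical names.

```python
import string

PUNCT = set(string.punctuation)


def is_punct(token):
    return len(token) == 1 and token in PUNCT


def is_word(token):
    return token != " " and not is_punct(token)


def merge_temporary(tokens, max_punct=1):
    """Join word + 1..max_punct punctuation marks + word into one token."""
    out = []
    for token in tokens:
        out.append(token)
        if not is_word(token):
            continue
        # Walk back over the punctuation run that precedes this word.
        j = len(out) - 2
        while j >= 0 and is_punct(out[j]):
            j -= 1
        run = len(out) - 2 - j  # number of punctuation marks in between
        if j >= 0 and is_word(out[j]) and 1 <= run <= max_punct:
            out[j:] = ["".join(out[j:])]
    return out


print(merge_temporary(["www", ".", "undlfoundation", ".", "org"]))
# ['www.undlfoundation.org']
```

The `max_punct` parameter bounds how many consecutive punctuation marks may sit between two words and still be joined; with the default of 1 only single marks are bridged.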

The problem is that this rule covers only cases such as "www.undlfoundation.org", which will be correctly tokenized as [www.undlfoundation.org]. It will not work for "http://www.undlfoundation.org", because there are three punctuation marks (://) between two temporary words.
There are two solutions in this case:

  1. To extend the rule to:
    (TEMP)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by one punctuation mark)
    (TEMP)(PUT)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by two punctuation marks)
    (TEMP)(PUT)(PUT)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by three punctuation marks)
  2. Or to add regular expressions in the dictionary to deal with URLs and e-mail addresses, such as:
    • [/(?i)(http\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(https\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(ftp\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(www\.[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)([^ ]+@[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,EMAIL)<eng,0,0>;
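
The dictionary patterns of the second solution can be tried out in Python, whose `re` module accepts them unchanged. This is a sketch only: the `PATTERNS` table and the `classify` helper are hypothetical, matching a whole blank-delimited word with `fullmatch` is an assumption about how the dictionary lookup behaves, and the e-mail address is a made-up example.

```python
import re

# Regular expressions copied from the dictionary entries above,
# paired with the feature each entry assigns (URL or EMAIL).
PATTERNS = [
    ("URL", re.compile(r"(?i)(http\:\/\/[^ ]+)")),
    ("URL", re.compile(r"(?i)(https\:\/\/[^ ]+)")),
    ("URL", re.compile(r"(?i)(ftp\:\/\/[^ ]+)")),
    ("URL", re.compile(r"(?i)(www\.[^ ]+)")),
    ("EMAIL", re.compile(r"(?i)([^ ]+@[^ ]+)")),
]


def classify(word):
    """Return the feature of the first pattern matching the whole word,
    or None if no pattern applies."""
    for feature, pattern in PATTERNS:
        if pattern.fullmatch(word):
            return feature
    return None


for word in ["http://www.undlfoundation.org", "someone@example.org", "asfsdfdfdsf"]:
    print(word, "->", classify(word))
```

Because each pattern consumes everything up to the next blank space, the number of punctuation marks inside the address no longer matters.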

In the English grammar, we have adopted the second solution. These entries are available in the default dictionary, at the end of the English dictionary.

NLization

The NLization of temporary UWs does not require any action from the user. The UW "asfsdfdfdsf" will be automatically generated as "asfsdfdfdsf" if this UW is not in the dictionary.

Software