English grammar/Temporary entries

From UNL Wiki
Revision as of 20:47, 31 July 2012 by Martins (Talk | contribs)

Temporary entries are entries that are not expected to be translated, such as URLs, e-mail addresses and phone numbers. They are not included in the dictionary because their UNL representation is identical to the source string. For instance, the UNL for "asfsdfdfdsf" is "asfsdfdfdsf". For the time being, no transliteration is required either.

Issues

Temporary entries are expressed between "quotes"
Note that temporary entries are represented, in UNL, between quotes, but these quotes are provided automatically by the system: all the words having the feature TEMP will be displayed, in the final output, between quotes. The feature TEMP is assigned automatically by the system to all the entries that were not found in the dictionary.

UNLization

asfsdfdfdsf

  • INPUT (ENG): asfsdfdfdsf
  • OUTPUT (UNL): "asfsdfdfdsf"
  • DICTIONARY: no entry is necessary
  • T-GRAMMAR: no rule is necessary
  • D-GRAMMAR: to control hyper-segmentation, as explained below.

The OUTPUT for "asfsdfdfdsf" should be simply "asfsdfdfdsf". No process is actually required. However, the English dictionary contains the entries:

  • [as]{}"" (LEX=C,POS=SCJ,POS=AVR,att=@as,rel=man)<eng,255,0>; and
  • [s]{}"" (LEX=I,POS=SFX,LST=SBW)<eng,255,0>;

If we do not control the tokenization process, "asfsdfdfdsf" will be tokenized as [as][f][s][dfdfd][s][f], which is not correct, since we expect the whole string to be treated as a single temporary entry. To avoid that, we have to inform the machine that words such as [as] may only occur after a blank space or the beginning of the sentence, and before a blank space. Similarly, we have to tell the machine that the suffix [s] must be followed by a blank space. This kind of information is provided in the disambiguation grammar (d-grammar), and it is used to prevent the system from hyper-analyzing temporary entries. In the English d-grammar, this is done by the rules:

(TEMP)(^BLK,^PUT,^STAIL)=0;
(a temporary word must be followed by a blank, a punctuation mark or the end of the sentence)
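(^BLK,^PUT,^SHEAD)(TEMP)=0;
(a temporary word must be preceded by a blank, a punctuation mark or the beginning of the sentence)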

Note that, in order for these rules to apply, the punctuation marks and the blank space must be included in the dictionary with the corresponding features (PUT and BLK, respectively). See the Default Dictionary for an example. SHEAD (beginning of the sentence) and STAIL (end of the sentence) are system-defined features that are assigned automatically by the machine, so you do not need to introduce them in the dictionary.
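The effect of these boundary constraints can be illustrated with a minimal Python sketch. It is not the UNL engine: the `ENTRIES` table, the `tokenize` function and the longest-match strategy are assumptions made only for illustration, and punctuation is ignored. The sketch contains just the two dictionary entries quoted above and treats every unmatched run of characters as a temporary word.

```python
# Hypothetical mini-dictionary (illustration only):
# surface form -> (needs_left_boundary, needs_right_boundary)
ENTRIES = {
    "as": (True, True),   # free word: must stand between boundaries
    "s": (False, True),   # suffix: must be followed by a boundary
}

def is_boundary(text, pos):
    """True if pos lies outside the string or points at a blank space."""
    return pos < 0 or pos >= len(text) or text[pos] == " "

def tokenize(text, constrained):
    """Left-to-right longest-match tokenizer; unmatched runs become temporary words."""
    tokens, temp, i = [], "", 0
    while i < len(text):
        if text[i] == " ":
            # A blank closes the current temporary word and is a token itself.
            if temp:
                tokens.append(temp)
                temp = ""
            tokens.append(" ")
            i += 1
            continue
        hit = None
        for form in sorted(ENTRIES, key=len, reverse=True):
            if not text.startswith(form, i):
                continue
            left, right = ENTRIES[form]
            ok_left = not left or is_boundary(text, i - 1)
            ok_right = not right or is_boundary(text, i + len(form))
            if not constrained or (ok_left and ok_right):
                hit = form
                break
        if hit is None:
            temp += text[i]          # extend the temporary word
            i += 1
        else:
            if temp:
                tokens.append(temp)
                temp = ""
            tokens.append(hit)       # dictionary entry found
            i += len(hit)
    if temp:
        tokens.append(temp)
    return tokens

print(tokenize("asfsdfdfdsf", constrained=False))
print(tokenize("asfsdfdfdsf", constrained=True))
```

Without the constraints the sketch reproduces the hyper-segmentation [as][f][s][dfdfd][s][f]; with them, the whole string survives as a single temporary word.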

asfsdfdfdsf asfsdfdfdsf

  • INPUT (ENG): asfsdfdfdsf asfsdfdfdsf
  • OUTPUT (UNL): "asfsdfdfdsf asfsdfdfdsf"
  • DICTIONARY: no entry is necessary
  • T-GRAMMAR: (TEMP,%x)(BLK,%y)(TEMP,%z):=(%x&%y&%z,-BLK); (merges two temporary nodes with a space in-between)
  • D-GRAMMAR: the same as above, to prevent hyper-segmentation.

After the d-rules above are applied, the tokenization result will be [asfsdfdfdsf][ ][asfsdfdfdsf]. This is still not what we want, because the whole string should be one single token [asfsdfdfdsf asfsdfdfdsf]. We therefore need a rule to merge the two temporary words, which is the following:

(TEMP,%x)(BLK,%y)(TEMP,%z):=(%x&%y&%z,-BLK);

The command & merges the nodes, as described in the UNL Grammar Specs.
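As a rough illustration of what this merge rule does, the following Python sketch rewrites a token list by collapsing every TEMP, BLK, TEMP sequence into one TEMP node. The `merge_temp` function and the (surface, feature) tuples are hypothetical stand-ins for the engine's nodes, and the assumption that the rule is reapplied until it no longer matches is ours, not a statement about the actual UNL implementation.

```python
def merge_temp(tokens):
    """Collapse (TEMP)(BLK)(TEMP) into a single TEMP node, repeatedly.

    tokens is a list of (surface, feature) pairs, e.g. ("asfsdfdfdsf", "TEMP").
    """
    out = list(tokens)
    i = 0
    while i + 2 < len(out):
        a, b, c = out[i], out[i + 1], out[i + 2]
        if a[1] == "TEMP" and b[1] == "BLK" and c[1] == "TEMP":
            # Concatenate the three surfaces; the merged node is TEMP, not BLK.
            out[i:i + 3] = [(a[0] + b[0] + c[0], "TEMP")]
            # Stay at i so the merged node can absorb a following BLK + TEMP.
        else:
            i += 1
    return out

nodes = [("asfsdfdfdsf", "TEMP"), (" ", "BLK"), ("asfsdfdfdsf", "TEMP")]
print(merge_temp(nodes))
```

Because the loop stays on the merged node, a sequence of three temporary words separated by blanks also collapses into one node.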

www.undlfoundation.org

  • INPUT (ENG): www.undlfoundation.org
  • OUTPUT (UNL): "www.undlfoundation.org"
  • DICTIONARY: no entry is necessary
  • T-GRAMMAR: no rule is necessary
  • D-GRAMMAR: to control hyper-segmentation, as explained below.

If we add the rules above to the disambiguation grammar, we solve the problems concerning "asfsdfdfdsf", "asfsdfdfdsf asfsdfdfdsf" and "asfsdfdfdsf asfsdfdfdsf asfsdfdfdsf", which will be correctly tokenized. We do not, however, solve the problem concerning "www.undlfoundation.org", because in this case there are punctuation marks between temporary words. The result of the tokenization will be [www][.][undlfoundation][.][org], whereas we expect [www.undlfoundation.org]. We then have to tell the machine that there cannot be two temporary words separated by a punctuation mark. Therefore:

(TEMP)(PUT)(TEMP)=0;
(there cannot be two temporary words separated by a punctuation mark)

This rule handles "www.undlfoundation.org", which will be correctly tokenized as [www.undlfoundation.org], but it does not work for "http://www.undlfoundation.org", because there are three consecutive punctuation marks (://) between two temporary words.
There are two solutions in this case:

  1. To extend the rule to:
    (TEMP)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by one punctuation mark)
    (TEMP)(PUT)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by two punctuation marks)
    (TEMP)(PUT)(PUT)(PUT)(TEMP)=0;
    (there cannot be two temporary words separated by three punctuation marks)
  2. Or to add regular expressions in the dictionary to deal with URLs and e-mail addresses, such as
    • [/(?i)(http\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(https\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(ftp\:\/\/[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)(www\.[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,URL)<eng,0,0>;
    • [/(?i)([^ ]+@[^ ]+)/]{}""(TEMP,LEX=N,POS=PPN,EMAIL)<eng,0,0>;

In the English grammar, we have adopted the second solution. These entries are available in the default dictionary, at the end of the English dictionary.
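The behaviour of the regular expressions listed above can be checked outside the UNL environment. The sketch below copies the five patterns verbatim and tests them with Python's `re` module; the `classify` function, the use of `re.fullmatch` on a single space-free word, and the sample e-mail address are assumptions made for illustration, since the matching procedure of the UNL engine is not described here.

```python
import re

# Patterns copied from the dictionary entries above, paired with their feature.
PATTERNS = [
    (r"(?i)(http\:\/\/[^ ]+)", "URL"),
    (r"(?i)(https\:\/\/[^ ]+)", "URL"),
    (r"(?i)(ftp\:\/\/[^ ]+)", "URL"),
    (r"(?i)(www\.[^ ]+)", "URL"),
    (r"(?i)([^ ]+@[^ ]+)", "EMAIL"),
]

def classify(word):
    """Return the feature of the first pattern matching the whole word, else None."""
    for pattern, feature in PATTERNS:
        if re.fullmatch(pattern, word):
            return feature
    return None

for sample in ("http://www.undlfoundation.org",
               "www.undlfoundation.org",
               "user@example.org",      # made-up address, for illustration only
               "asfsdfdfdsf"):
    print(sample, "->", classify(sample))
```

Since each pattern consumes everything up to the next blank space, a URL is captured as one entry regardless of how many punctuation marks it contains, which is what the first solution could only approximate rule by rule.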

NLization

The NLization of temporary UWs does not require any action from the user. The UW "asfsdfdfdsf" will be automatically generated as "asfsdfdfdsf" if this UW is not in the dictionary.

Software