Localization

From UNL Wiki
Revision as of 16:34, 29 July 2012 by Martins (Talk | contribs)

Localization is the process of adapting dictionaries and grammars to a specific language, referred to as the "locale". In the UNL framework, localizing existing resources is one strategy for creating the basic dictionaries and grammars required to process simple corpora, such as Corpus500. What follows are instructions for localizing the English dictionary and the English grammar.

Localization of the English Dictionary

In order to localize the English Dictionary, you have to consider the following:

  • Localization affects only the natural language entry (in square brackets), the feature list (in parentheses) and the language code (in angle brackets). Do not localize the "UW" field: although it is written in English, it is not English; it is UNL.
    For instance, given the entry
    [book]{}"book" (LEX=N,POS=NOU,NUM=SNG)<eng,0,0>;
    its localization to French, Spanish, Portuguese, German and Russian should be as follows:
    [livre]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<fra,0,0>;
    [libro]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<esp,0,0>;
    [livro]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=MCL)<por,0,0>;
    [Buch]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=NEU)<deu,0,0>;
    [книга]{}"book" (LEX=N,POS=NOU,NUM=SNG,GEN=FEM,CAS=NOM)<rus,0,0>;
  • The feature list depends on the locale
    English, for instance, has no grammatical gender, so its entries carry no gender information. French does, so gender must be provided. In Russian, case must be added in addition to gender, because Russian has morphological case. Add to the feature list the inflectional categories available in your language, but use only the tags available in the Tagset.
  • Values of attributes depend on the locale
    Note also that the value of an attribute may vary from language to language. In Spanish, the word corresponding to "book" is masculine (GEN=MCL); in German, it is neuter (GEN=NEU); in Russian, it is feminine (GEN=FEM).
  • The localization should be corpus-driven: your dictionary should contain all and only the words appearing in your translated version of the corpus.
    The word "book" may have several different meanings, and may be translated by different words in your locale. You do not have to consider all these possibilities: address only the sense in which the word is actually used in the corpus. Likewise, you do not have to treat forms that do not appear in the corpus. The English Dictionary, for instance, contains both "book" and "books", because both forms appear in the corpus, but only the singular "girl", because the plural "girls" does not. In short, your dictionary should reflect your corpus. You are not creating a generic dictionary but a rather small one, because the main goal here is not the dictionary but the grammars.
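The guidelines above can be sketched in a few lines of Python. This is only an illustrative sketch, not a UNL tool: the names Entry, parse_entry, format_entry, localize and coverage are hypothetical, and the regular expression assumes the simplified entry shape seen in the examples above, with every feature written as ATTRIBUTE=VALUE. The two numeric fields after the language code are copied through unchanged.

```python
import re
from dataclasses import dataclass, replace

# Assumed entry shape, inferred from the examples above:
# [headword]{id}"UW" (ATTR=VALUE,...)<lang,n,n>;
ENTRY_RE = re.compile(
    r'^\[(?P<headword>[^\]]*)\]'
    r'\{(?P<id>[^}]*)\}'
    r'"(?P<uw>[^"]*)"\s*'
    r'\((?P<features>[^)]*)\)'
    r'<(?P<lang>[^,>]*),(?P<f1>[^,>]*),(?P<f2>[^>]*)>;$'
)


@dataclass(frozen=True)
class Entry:
    headword: str   # natural language entry: localized
    entry_id: str
    uw: str         # UNL, not English: never localized
    features: dict  # feature list: extended per locale
    lang: str       # language code: localized
    tail: tuple     # remaining two fields, kept verbatim


def parse_entry(line: str) -> Entry:
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    # Assumes every feature has the form ATTRIBUTE=VALUE.
    features = dict(
        item.strip().split("=", 1)
        for item in m["features"].split(",")
        if item.strip()
    )
    return Entry(m["headword"], m["id"], m["uw"], features,
                 m["lang"], (m["f1"], m["f2"]))


def format_entry(e: Entry) -> str:
    feats = ",".join(f"{k}={v}" for k, v in e.features.items())
    return (f'[{e.headword}]{{{e.entry_id}}}"{e.uw}" '
            f'({feats})<{e.lang},{e.tail[0]},{e.tail[1]}>;')


def localize(e: Entry, headword: str, lang: str, **extra) -> Entry:
    # Change the headword and language code, add locale-specific
    # features, and leave the UW untouched.
    return replace(e, headword=headword, lang=lang,
                   features={**e.features, **extra})


def coverage(headwords, corpus_tokens):
    # Corpus-driven check: returns (missing, extra), i.e. corpus forms
    # absent from the dictionary and dictionary forms absent from the corpus.
    heads, tokens = set(headwords), set(corpus_tokens)
    return sorted(tokens - heads), sorted(heads - tokens)


english = parse_entry('[book]{}"book" (LEX=N,POS=NOU,NUM=SNG)<eng,0,0>;')
print(format_entry(localize(english, "Buch", "deu", GEN="NEU")))
print(format_entry(localize(english, "книга", "rus", GEN="FEM", CAS="NOM")))
```

Under these assumptions, the two printed lines reproduce the German and Russian entries listed above. A dictionary that reflects its corpus is one for which coverage returns two empty lists; how the corpus is split into word forms is left open here.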