BRUNO

From UNL Wiki

Revision as of 15:58, 24 September 2013

The project BRUNO (Basic Resources for UNLizatiOn) is devoted to the creation of NL-UNL (analysis) dictionaries.

Goal

The project BRUNO has two main goals:

  1. To provide several word-to-concept monolingual databases (i.e., encoding or reader's dictionaries). These dictionaries are expected to be used in UNLization, i.e., in generating UNL graphs out of natural language documents, especially through IAN.
  2. To find concepts that are not included in WordNet 3.0 and should be incorporated into the UNL Dictionary.

Repository

BRUNO is language dependent. Every language has its own set of entries to be addressed. The repository is divided into six subprojects according to the frequency of use of the lemmas.

Repository    Number of lemmas[1]
BRUNO-A1      2,000
BRUNO-A2      3,000
BRUNO-B1      5,000
BRUNO-B2      5,000
BRUNO-C1      5,000
BRUNO-C2      5,000
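
The relation between a lemma's frequency rank and its subproject can be sketched as follows. This is a minimal illustration, not part of the project's tooling: the function names are hypothetical, and the ranges beyond BRUNO-B1 are inferred by accumulating the sizes in the table, on the assumption that "and so on" in note 1 means the subprojects simply continue in order.

```python
from bisect import bisect_left

# Subproject sizes copied from the table above.
SUBPROJECT_SIZES = [
    ("BRUNO-A1", 2000),
    ("BRUNO-A2", 3000),
    ("BRUNO-B1", 5000),
    ("BRUNO-B2", 5000),
    ("BRUNO-C1", 5000),
    ("BRUNO-C2", 5000),
]


def upper_bounds(sizes):
    """Return the cumulative upper rank bound of each subproject."""
    bounds = []
    total = 0
    for _, size in sizes:
        total += size
        bounds.append(total)
    return bounds


def subproject_for_rank(rank):
    """Map a 1-based frequency rank to a subproject name.

    Returns None when the rank falls beyond the last subproject.
    """
    if rank < 1:
        raise ValueError("frequency ranks start at 1")
    bounds = upper_bounds(SUBPROJECT_SIZES)
    index = bisect_left(bounds, rank)
    if index == len(bounds):
        return None
    return SUBPROJECT_SIZES[index][0]
```

Under these assumptions, rank 2,000 still belongs to BRUNO-A1, rank 2,001 opens BRUNO-A2, and any rank above 25,000 falls outside the six subprojects.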

Requisites

The project BRUNO is open to all languages complying with the following requisites:

  • MIR-A1 and NADIA-A1 are required for BRUNO-A1;
  • MIR-A2 and NADIA-A2 are required for BRUNO-A2;
  • MIR-B1 and NADIA-B1 are required for BRUNO-B1;
  • MIR-B2 and NADIA-B2 are required for BRUNO-B2;
  • MIR-C1 and NADIA-C1 are required for BRUNO-C1;
  • MIR-C2 and NADIA-C2 are required for BRUNO-C2;
  • In all cases, the language must have a reasonable number of inflectional paradigms and subcategorization frames already registered in the UNLarium.
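
The level-by-level dependency listed above can be expressed compactly. The sketch below is illustrative only: the function names are hypothetical, it assumes that "required" means the MIR and NADIA projects of the same level are already available for the language, and it does not model the last requisite (paradigms and frames in the UNLarium).

```python
# Levels shared by MIR, NADIA and BRUNO, as listed in the requisites.
LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def required_projects(bruno_project):
    """Return the MIR and NADIA projects required for a BRUNO subproject."""
    prefix, _, level = bruno_project.partition("-")
    if prefix != "BRUNO" or level not in LEVELS:
        raise ValueError(f"unknown BRUNO subproject: {bruno_project!r}")
    return (f"MIR-{level}", f"NADIA-{level}")


def can_open(bruno_project, available_projects):
    """Check whether both same-level prerequisites are available."""
    return all(
        project in available_projects
        for project in required_projects(bruno_project)
    )
```

For instance, a language with only MIR-A1 and NADIA-A1 in place would qualify for BRUNO-A1 but not for BRUNO-A2.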

Methodology

  1. List of entries
    Participants are expected to provide a list of the entries according to the following criteria:
    • The list of entries can be extracted from prestigious monolingual dictionaries or from a corpus considered to be representative of the standard written language[2].
    • The list of entries must be ordered according to the frequency of occurrence (the most frequent entries must come first)[3].
    • The list of entries must be lemmatized[4].
    • Entries must be provided in a plain text file (.txt) with UTF-8 encoding, with one entry per line, along with the corresponding value of the lexical category LEX, as follows:
  2. Verification
    The list of entries is verified by a language manager or, if there is no language manager for the target language, by the Language Resources Manager of the UNDL Foundation. If approved, it is uploaded to the UNLarium, and the corresponding BRUNO project is opened.
  3. Dictionary
    Entries become available in the UNLarium to all registered users of a given language, in the case of open projects, or to the approved candidates, in the case of closed projects. Users are expected to provide all the morphological, syntactic and semantic information for each entry.
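
Step 1 above (frequency ordering, one entry per line, UTF-8 plain text) can be sketched as follows. This is a rough illustration under stated assumptions: the exact line layout is not shown in this revision, so the tab separator is an assumption, the LEX values "N" and "V" are placeholders, and the function names are hypothetical.

```python
from collections import Counter


def build_entry_lines(observations, separator="\t"):
    """Turn (lemma, lex) observations into frequency-ordered entry lines.

    The most frequent entries come first; ties are broken alphabetically
    so that the output is deterministic. The separator is an assumption.
    """
    counts = Counter(observations)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{lemma}{separator}{lex}" for (lemma, lex), _ in ranked]


def write_entry_file(lines, path):
    """Write one entry per line to a UTF-8 encoded plain text file."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Because each entry is keyed on the pair (lemma, lex), "book" as a noun and "book" as a verb are kept as two separate entries, in line with note 4. Lemmas that differ only in another feature, such as the masculine and feminine "livre", would need an extra distinguishing field, which this sketch does not attempt to define.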

Notes

  1. The lemmas must be ordered according to the frequency of use. In that sense, BRUNO-A1 deals with the most frequent lemmas from 1 to 2,000. BRUNO-A2 deals with the most frequent lemmas from 2,001 to 5,000. BRUNO-B1 deals with the most frequent lemmas from 5,001 to 10,000. And so on.
  2. This corpus can be either an existing reputable corpus or a new corpus compiled according to the criteria defined at NC.
  3. The frequency of use is not often reported by ordinary dictionaries but may be inferred from the different editions of the same dictionary: basic, intermediate or advanced, for instance.
  4. There should be as many lemmas as there are distinct morphological behaviors (part of speech, gender, number, inflections, etc.). The word "book", in English, should correspond to two lemmas: "book" as a noun, and "book" as a verb. The noun "livre", in French, should correspond to two lemmas: "livre" as a masculine noun (="book"), and "livre" as a feminine noun (="pound"). The verb "haver", in Portuguese, should correspond to two lemmas: "haver" (auxiliary verb inflected in all verb forms) and "haver" (main verb inflected only in the 3rd person, i.e., defective).