BRUNO

Latest revision as of 09:47, 28 May 2014

The project BRUNO (Basic Resources for UNLizatiOn) is devoted to the creation of NL-UNL (analysis) dictionaries.

Goal

The project BRUNO has two main goals:

  1. To provide several word-to-concept monolingual databases (i.e., encoding or reader's dictionaries). These dictionaries are expected to be used in UNLization, i.e., in generating UNL graphs out of natural language documents, especially through IAN.
  2. To find concepts that are not included in WordNet 3.0 and should be incorporated into the UNL Dictionary.

Repository

BRUNO is language dependent: every language has its own set of entries to be addressed. The repository is divided into six subprojects according to the frequency of use of the lemmas.

  • BRUNO-A1 contains the list of the 2,000 most frequent lemmas of the language (including articles, prepositions, conjunctions, auxiliary verbs, etc.);
  • BRUNO-A2 contains the next 3,000 most frequent lemmas of the language;
  • BRUNO-B1 contains the next 5,000 most frequent lemmas of the language;

The remaining subprojects (BRUNO-B2, BRUNO-C1 and BRUNO-C2) continue in the same way:

Repository    # of lemmas
BRUNO-A1      2,000
BRUNO-A2      3,000
BRUNO-B1      5,000
BRUNO-B2      5,000
BRUNO-C1      10,000
BRUNO-C2      10,000
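
Because each subproject takes the "next" block of lemmas, the sizes above imply contiguous frequency-rank bands. The following is a minimal Python sketch of that arithmetic. It assumes the bands for BRUNO-B2 through BRUNO-C2 continue in the same way as the ones spelled out for A1 to B1; the names `TIER_SIZES`, `tier_ranges` and `tier_for_rank` are illustrative and not part of any UNL tool.

```python
from itertools import accumulate

# Subproject sizes as listed in the table above.
TIER_SIZES = [
    ("BRUNO-A1", 2_000),
    ("BRUNO-A2", 3_000),
    ("BRUNO-B1", 5_000),
    ("BRUNO-B2", 5_000),
    ("BRUNO-C1", 10_000),
    ("BRUNO-C2", 10_000),
]


def tier_ranges(tiers=TIER_SIZES):
    """Return {name: (first_rank, last_rank)} with 1-based, inclusive ranks."""
    # Running totals give the last rank of each band (assumed contiguous).
    ends = list(accumulate(size for _, size in tiers))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return {name: (start, end) for (name, _), start, end in zip(tiers, starts, ends)}


def tier_for_rank(rank, tiers=TIER_SIZES):
    """Return the subproject covering a 1-based frequency rank, or None."""
    for name, (first, last) in tier_ranges(tiers).items():
        if first <= rank <= last:
            return name
    return None


if __name__ == "__main__":
    for name, (first, last) in tier_ranges().items():
        print(f"{name}: ranks {first:,} to {last:,}")
```

Under this reading, a lemma ranked 2,001st in a frequency list would fall into BRUNO-A2, and the six subprojects together would cover the 35,000 most frequent lemmas.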

Requisites

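The project BRUNO is open to all languages complying with the following requisites:

  • MIR-A1 and NADIA-A1 are required for BRUNO-A1;
  • MIR-A2 and NADIA-A2 are required for BRUNO-A2;
  • MIR-B1 and NADIA-B1 are required for BRUNO-B1;
  • MIR-B2 and NADIA-B2 are required for BRUNO-B2;
  • MIR-C1 and NADIA-C1 are required for BRUNO-C1;
  • MIR-C2 and NADIA-C2 are required for BRUNO-C2;
  • In all cases, the language must have a reasonable number of inflectional paradigms and subcategorization frames already registered in the UNLarium.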

Preparing the list of entries

  1. List of entries
    Participants are expected to provide a list of the entries according to the following criteria:
    • The list of entries must include the most frequent lemmas of the language, including articles, prepositions, conjunctions, nouns, verbs, adjectives, adverbs, etc.
    • The list of entries can be extracted from prestigious monolingual dictionaries or from a corpus considered to be representative of the standard written language[1].
    • The list of entries must be ordered according to the frequency of occurrence (the most frequent entries must come first)[2].
    • The list of entries must be lemmatized[3].
    • Entries must be provided in a plain text file (.txt) with UTF-8 encoding, with one entry per line, along with the corresponding value of the lexical category LEX, in the following format:
    lemma:LEX[4]
  2. Verification
    The list of entries is verified by a language manager or, if there is no language manager for the target language, by the Language Resources Manager of the UNDL Foundation. If approved, it is uploaded to the UNLarium, and the corresponding BRUNO project is opened.
  3. Dictionary
    Entries become available in the UNLarium to all registered users of a given language, in the case of open projects, or to the approved candidates, in the case of closed projects. Users are expected to provide all the morphological, syntactic and semantic information for each entry.
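
Before a list is submitted for verification, the file format described in step 1 (UTF-8 plain text, one `lemma:LEX` entry per line) can be checked mechanically. The sketch below is only an illustration: the function names are hypothetical, the LEX values in the usage note are placeholders, and it assumes that LEX values never contain a colon. It does not check whether a LEX value is valid, and it does not reject repeated lines, since two lemmas may share a spelling and a lexical category (see note 3).

```python
from pathlib import Path


def parse_entry(line):
    """Split one 'lemma:LEX' line into (lemma, lex); raise ValueError if malformed."""
    # Split on the last colon; this assumes LEX values never contain a colon.
    lemma, sep, lex = line.strip().rpartition(":")
    if not sep or not lemma.strip() or not lex.strip():
        raise ValueError(f"expected 'lemma:LEX', got {line!r}")
    return lemma.strip(), lex.strip()


def check_entries(text):
    """Return (entries, problems) for the contents of an entry list."""
    entries, problems = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        try:
            entries.append(parse_entry(line))
        except ValueError as error:
            problems.append((number, str(error)))
    return entries, problems


def check_file(path):
    """Read a UTF-8 .txt file; a UnicodeDecodeError signals a wrong encoding."""
    return check_entries(Path(path).read_text(encoding="utf-8"))
```

For instance, `check_entries("book:N\nbook:V\nlivre\n")` would accept the first two lines and report line 3, which lacks a lexical category. The order of `entries` follows the file, so the frequency ordering of the list is preserved.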

Instructions

Lexical Category
Whenever the lexical category for a given lemma is provided, check whether it is correct. If it is not, decline the entry and report the problem by clicking the yellow triangle to the right of the main entry. If the lexical category is not provided, select the most likely category. Do not worry about homonyms: provide a single category for a given main entry.
Lemma
Do not change the lemma. If it is not correct (i.e., if it is misspelled or cannot be considered a lexical unit), decline the entry and report the problem by clicking the yellow triangle to the right of the main entry.
Provide as many UWs as necessary for each lemma, but do not include very rare or unusual cases. Check the order: the most likely senses must appear first.
Base Form
The base form needs attention only in the case of multiword expressions (1) whose inflections cannot be formed by simple affixation or (2) which are discontinuous. In these cases, provide the corresponding composition rules.
Inflection
Select AND TEST the inflectional paradigm that generates the inflections of the base form. Any errors here will be propagated to the dictionary, so be careful. Pay attention to the cases below:
  • LOCALIZED IRREGULARITY: if the word is mostly regular and its irregularity is limited to a few specific rules (for instance, nouns with more than one possible plural, or defective verbs that are not used in a given person but follow the general rules in all the others), assign the word to the corresponding paradigm and list its irregularities in the box "inflectional rules";
  • NON-EXISTING PARADIGM: if the word is regular or semi-regular (in the sense that several other words behave in the same way) but cannot be associated with any existing paradigm, press the button REQUEST A NEW PARADIGM and provide the corresponding details;
  • IRREGULAR WORDS: if the word is irregular (i.e., it has unusual and specific morphological behavior), choose the option IRREGULAR and provide the corresponding inflectional rules.
Subcategorization
Subcategorization is required only when the word REQUIRES a complement or a specifier (indirect transitive verbs that select a specific preposition, for instance). In this case, provide the corresponding subcategorization frame. If the subcategorization frame is not available, press the button REQUEST A NEW SUBCATEGORIZATION FRAME and provide the corresponding details.

Notes

  1. This corpus can be either an existing reputable corpus or a new corpus compiled according to the criteria defined at NC.
  2. The frequency of use is seldom given by ordinary dictionaries but may be inferred from the different versions of the same dictionary (basic, intermediate or advanced, for instance).
  3. There should be as many lemmas as there are distinct morphological behaviors (part of speech, gender, number, inflections, etc.). The word "book", in English, should correspond to two lemmas: "book" as a noun, and "book" as a verb. Note that the many different meanings of "book" as a noun do not lead to different lemmas, because all of them have the same morphological behavior, i.e., they are singular and form the plural in -s. On the other hand, the noun "livre", in French, should correspond to two lemmas: "livre" as a masculine noun (= "book"), and "livre" as a feminine noun (= "pound"). This difference derives not from the different meanings but from the different morphological behavior: one is masculine and the other is feminine.
  4. See an example at http://www.unlweb.net/resources/bruno/hu_a1.txt