BRUNO

From UNL Wiki

Revision as of 16:57, 18 September 2012

The project BRUNO (Basic Resources for UNLizatiOn) is devoted to the creation of NL-UNL dictionaries.

Goal

The project BRUNO has two main goals:

  1. To provide several word-to-concept monolingual databases (i.e., encoding or reader's dictionaries). These dictionaries are expected to be used in language-based shallow UNLization, i.e., in generating UNL graphs out of natural language documents, especially through IAN.
  2. To find concepts that are not included in WordNet 3.0 and should be incorporated into the UNL Dictionary.

The repository

BRUNO is language dependent. Every language has its own set of entries to be addressed. The list of entries is provided in one of the following ways:

  • Dictionary-based. The list corresponds to the nominata (i.e., to the list of headwords) of prestigious monolingual dictionaries.
  • Corpus-based. The list is extracted from a corpus considered to be representative of the standard written language[1].

In both cases, the list must be lemmatized. Only the lemmas are used as headwords in the dictionary.
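
A corpus-based list of this kind could be assembled roughly as follows. This is only a minimal sketch: the `lemma_of` mapping is a hypothetical stand-in for a real lemmatizer, and `build_headword_list` is an illustrative name, not part of BRUNO.

```python
from collections import Counter

def build_headword_list(tokens, lemma_of):
    """Return lemmas ordered from most to least frequent.

    tokens   -- iterable of word forms taken from the corpus
    lemma_of -- dict mapping an inflected form to its lemma (hypothetical
                stand-in for a real lemmatizer); unknown forms are kept as-is
    """
    counts = Counter(lemma_of.get(t.lower(), t.lower()) for t in tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lemma for lemma, _ in ordered]

# Toy corpus: inflected forms collapse onto their lemmas before counting.
tokens = ["Dogs", "run", "dog", "ran", "running", "cat"]
lemma_of = {"dogs": "dog", "ran": "run", "running": "run"}
headwords = build_headword_list(tokens, lemma_of)
```

Counting lemmas rather than surface forms is what makes the resulting ranking usable as a list of headwords.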

Structure

BRUNO is divided into six subprojects according to the frequency of use of the lemmas.

Repository    # of lemmas[2]
BRUNO-A1      2,000
BRUNO-A2      3,000
BRUNO-B1      5,000
BRUNO-B2      5,000
BRUNO-C1      5,000
BRUNO-C2      5,000
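
The frequency-rank range covered by each subproject follows from the cumulative sizes in the table. The sketch below derives those ranges; the function names are illustrative assumptions, and only the sizes come from the table.

```python
from itertools import accumulate

# Sizes taken from the table above, in frequency order.
SUBPROJECT_SIZES = [
    ("BRUNO-A1", 2000),
    ("BRUNO-A2", 3000),
    ("BRUNO-B1", 5000),
    ("BRUNO-B2", 5000),
    ("BRUNO-C1", 5000),
    ("BRUNO-C2", 5000),
]

def rank_ranges(sizes):
    """Map each subproject to its (first_rank, last_rank), both inclusive."""
    ends = list(accumulate(n for _, n in sizes))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return {name: (s, e) for (name, _), s, e in zip(sizes, starts, ends)}

def subproject_for_rank(rank, sizes=SUBPROJECT_SIZES):
    """Return the subproject that handles a lemma of the given frequency rank."""
    for name, (start, end) in rank_ranges(sizes).items():
        if start <= rank <= end:
            return name
    return None  # rank falls outside the 25,000 lemmas covered

ranges = rank_ranges(SUBPROJECT_SIZES)
```

Under these sizes the six subprojects together cover the 25,000 most frequent lemmas of a language.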

Methodology

MIR is open to all languages except English, which was the source for all data. As in any NLization project, users are expected to link UWs (i.e., concepts) to natural language entries. This process must take the following into consideration:

  • The UW is a concept (expressed by a definition), and not an English word. English words are provided only as examples. They should not be translated.
  • The UW must be associated only with lexical items, i.e., words recognized as such by the lexicography of a language. These words may be simple, compound or complex (multiword expressions), but they must figure as entries or sub-entries in monolingual dictionaries. If a UW is not lexicalized in a given language, i.e., if it is realized not as a lexical unit but as a sequence of words (a periphrasis) that has not been lexicalized yet, the UW should be reported either as UNDERSPECIFIED (if it is too general) or OVERSPECIFIED (if it is too specific). Users should not forget that one of our goals is to assign a degree of universality to the senses coming from WordNet 3.0, which contains many concepts that are English-specific.
  • The UW must be associated only with lemmas, i.e., with the citation form of words (the form that is used in ordinary dictionaries, such as the infinitive for verbs, the singular for nouns, etc.). Inflected forms must not be entered as natural language entries.
  • The UW must be associated with natural language entries of the same part of speech. A nominal concept must be associated with a noun or a noun phrase; a verbal concept, with a verb or a verbal phrase; an adjectival concept, with an adjective or an adjective phrase; an adverbial concept, with an adverb or an adverbial phrase[3].
  • The same UW can be associated with several natural language entries, provided that they are listed in order of frequency of use. Non-standard lexical units (archaisms, jargon, slang, taboo, etc.) should be avoided and, if included, should be marked (in the field "register"). Spelling variations, if standard, must be entered as separate entries.
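
The constraints above could be checked mechanically before an entry is submitted. The following is a rough sketch under stated assumptions: the `Entry` record, its field names, and the part-of-speech codes are hypothetical and do not reflect the actual BRUNO data format; only the "register" field is named in the guidelines.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    uw: str               # the concept being linked
    uw_pos: str           # part of speech of the concept (hypothetical codes: N, V, J, A)
    headword: str         # natural language entry proposed for the UW
    headword_pos: str     # part of speech of the headword, same codes
    is_lemma: bool        # True if the headword is a citation form
    in_dictionary: bool   # True if it is an entry or sub-entry in a monolingual dictionary
    nonstandard: bool = False  # archaism, jargon, slang, taboo, etc.
    register: str = ""         # must be filled in for non-standard units

def check_entry(e):
    """Return a list of guideline violations; an empty list means the entry passes."""
    problems = []
    if not e.in_dictionary:
        problems.append("not lexicalized: report the UW as UNDERSPECIFIED or OVERSPECIFIED")
    if not e.is_lemma:
        problems.append("headword is not a lemma")
    if e.uw_pos != e.headword_pos:
        problems.append("part of speech differs from that of the UW")
    if e.nonstandard and not e.register:
        problems.append("non-standard unit lacks a register mark")
    return problems

ok = Entry("book(icl>publication)", "N", "livro", "N", True, True)
bad = Entry("book(icl>publication)", "N", "livros", "V", False, True)
```

Here `check_entry(ok)` yields no problems, while `check_entry(bad)` flags both the inflected form and the part-of-speech mismatch. Ordering several entries for one UW by frequency of use is left out of the sketch, since it depends on data that is not described here.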

Notes

  1. This corpus can be either an existing reputable corpus or a new corpus compiled according to the criteria defined at NC.
  2. The lemmas must be ordered according to the frequency of use. In that sense, BRUNO-A1 deals with the most frequent lemmas from 1 to 2,000. BRUNO-A2 deals with the most frequent lemmas from 2,001 to 5,000. BRUNO-B1 deals with the most frequent lemmas from 5,001 to 10,000. And so on.
  3. Users should not forget that prepositional phrases may play adjective or adverbial roles, and can be considered adjective or adverbial phrases. However, prepositional phrases should be used as natural language entries only if "lexicalized", i.e., if they figure as entries or sub-entries in monolingual dictionaries.