BRUNO

From UNL Wiki

Revision as of 18:17, 18 September 2012 by Martins (Talk | contribs)


The project BRUNO (Basic Resources for UNLizatiOn) is devoted to the creation of NL-UNL dictionaries.

Goal

The project BRUNO has two main goals:

To provide several word-to-concept monolingual databases (i.e., encoding or reader's dictionaries). These dictionaries are expected to be used in language-based shallow UNLization, i.e., in generating UNL graphs from natural language documents, especially through IAN.
To find concepts that are not included in WordNet 3.0 and should be incorporated into the UNL Dictionary.

The repository

BRUNO is language dependent. Every language has its own set of entries to be addressed. The list of entries is provided in one of the following ways:

Dictionary-based. The list corresponds to the nominata (i.e., the list of headwords) of prestigious monolingual dictionaries, organized according to frequency of use.^[1]
Corpus-based. The list is extracted from a corpus considered to be representative of the standard written language^[2].

In both cases, the list must be lemmatized. Only the lemmas are used as headwords in the dictionary.
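As a rough illustration of the corpus-based route, the sketch below counts lemma frequencies over a token list and returns the lemmas in descending order of use. The `LEMMA_OF` table and the `ranked_lemmas` function are hypothetical names introduced here for illustration only; an actual project would rely on a full lemmatizer for the language in question and on a representative corpus rather than a single sentence.

```python
from collections import Counter

# Hypothetical form-to-lemma table (an assumption for this sketch);
# a real project would use a proper lemmatizer for the target language.
LEMMA_OF = {"books": "book", "booked": "book", "are": "be", "was": "be"}


def ranked_lemmas(tokens):
    """Return the lemmas of `tokens`, most frequent first.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)
    return sorted(counts, key=lambda lemma: (-counts[lemma], lemma))


tokens = "The books are here and the book was booked".split()
print(ranked_lemmas(tokens))  # ['book', 'be', 'the', 'and', 'here']
```

Note that this toy lookup lemmatizes by surface form only, so it merges the noun and the verb "book"; a list prepared for BRUNO would also need part-of-speech information to keep them apart.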

Structure

BRUNO is divided into six subprojects according to the frequency of use of the lemmas.

Repository	# of lemmas^[3]
BRUNO-A1	2,000
BRUNO-A2	3,000
BRUNO-B1	5,000
BRUNO-B2	5,000
BRUNO-C1	5,000
BRUNO-C2	5,000
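The sizes in the table can be read as consecutive frequency-rank bands (ranks 1 to 2,000 for BRUNO-A1, 2,001 to 5,000 for BRUNO-A2, and so on). The following sketch, which assumes that lemmas are already ranked from 1 (most frequent) upwards, maps a rank to its subproject. The function name `subproject_for_rank` is a hypothetical helper, not part of any UNL tool.

```python
from bisect import bisect_left

# Subproject names and sizes as given in the table.
SUBPROJECTS = [
    ("BRUNO-A1", 2000),
    ("BRUNO-A2", 3000),
    ("BRUNO-B1", 5000),
    ("BRUNO-B2", 5000),
    ("BRUNO-C1", 5000),
    ("BRUNO-C2", 5000),
]

# Cumulative upper rank bound of each band: 2000, 5000, 10000, ...
UPPER_BOUNDS = []
_total = 0
for _name, _size in SUBPROJECTS:
    _total += _size
    UPPER_BOUNDS.append(_total)


def subproject_for_rank(rank):
    """Return the subproject covering a 1-based frequency rank.

    Returns None for ranks beyond the last band (above 25,000).
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    index = bisect_left(UPPER_BOUNDS, rank)
    return SUBPROJECTS[index][0] if index < len(SUBPROJECTS) else None


print(subproject_for_rank(2000))  # BRUNO-A1
print(subproject_for_rank(2001))  # BRUNO-A2
print(subproject_for_rank(5001))  # BRUNO-B1
```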

Methodology

BRUNO is open to all languages except English, which was the source for all data. As in any UNLization project, we expect users to link natural language lemmas to UWs. This process must take the following into consideration:

A UW always represents an open-class category (noun, adjective, adverb or verb). Prepositions, conjunctions, articles, interjections, etc. are not mapped to UWs, but must be included (and treated) in the NL-UNL Dictionary. On the other hand, all nouns, adjectives, adverbs and verbs must be associated with a UW. If the UW does not exist yet, it should be proposed for incorporation into the UNL Dictionary.
There should be as many lemmas as there are distinct morphological behaviors (part of speech, gender, number, inflections, etc.). The word "book", in English, should correspond to two lemmas: "book" as a noun, and "book" as a verb. The noun "livre", in French, should correspond to two lemmas: "livre" as a masculine noun (="book"), and "livre" as a feminine noun (="pound"). The verb "haver", in Portuguese, should correspond to two lemmas: "haver" (auxiliary verb inflected in all verb forms) as a verb belonging to the inflectional paradigm X; and "haver" (main verb inflected only in the 3rd person, i.e., defective) as a verb belonging to the inflectional paradigm Y.
The same lemma may be associated with more than one UW, i.e., lemmas should not be multiplied according to their semantic value (but only according to their morphological behavior). The noun "book", in English, should correspond to one single lemma, despite its several possible meanings, all of which must be associated with the same entry.
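The three rules above can be sketched as a small data model in which an entry is identified by its headword plus its morphological behavior, and each entry carries a list of UWs. Everything in this sketch is an assumption made for illustration: the class and method names are hypothetical, the `features` string stands in for gender or inflectional paradigm, and the UW strings are placeholders rather than actual UNL Dictionary entries.

```python
from dataclasses import dataclass, field

OPEN_CLASSES = {"noun", "adjective", "adverb", "verb"}


@dataclass
class Entry:
    headword: str
    pos: str                 # part of speech
    features: str = ""       # e.g. gender or inflectional paradigm
    uws: list = field(default_factory=list)


class NLUNLDictionary:
    """Toy NL-UNL dictionary: one entry per morphological behavior."""

    def __init__(self):
        self._entries = {}

    def link(self, headword, pos, uw=None, features=""):
        # Open-class lemmas must be linked to a UW; closed-class
        # lemmas are stored in the dictionary without one.
        if pos in OPEN_CLASSES and uw is None:
            raise ValueError("open-class lemmas need a UW")
        if pos not in OPEN_CLASSES and uw is not None:
            raise ValueError("closed-class lemmas are not mapped to UWs")
        key = (headword, pos, features)
        entry = self._entries.setdefault(key, Entry(headword, pos, features))
        # A new meaning extends the existing entry instead of creating a lemma.
        if uw is not None and uw not in entry.uws:
            entry.uws.append(uw)
        return entry

    def lemmas(self, headword):
        return [e for key, e in self._entries.items() if key[0] == headword]


d = NLUNLDictionary()
d.link("book", "noun", "<UW: printed work>")       # placeholder UW labels
d.link("book", "noun", "<UW: set of records>")     # same lemma, second UW
d.link("book", "verb", "<UW: reserve>")            # different POS, new lemma
d.link("livre", "noun", "<UW: printed work>", features="masculine")
d.link("livre", "noun", "<UW: unit of weight>", features="feminine")
d.link("of", "preposition")                        # closed class, no UW

print(len(d.lemmas("book")))   # 2
print(len(d.lemmas("livre")))  # 2
```

Keying entries on morphological behavior rather than on meaning is what keeps the several senses of the noun "book" under a single lemma while still separating the two genders of "livre".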

Notes

[1] The frequency of use is not usually indicated by ordinary dictionaries, but it may be inferred from the different editions of the same dictionary: basic, intermediate or advanced, for instance.
[2] This corpus can be either an existing reputable corpus or a new corpus compiled according to the criteria defined at NC.
[3] The lemmas must be ordered according to frequency of use. In that sense, BRUNO-A1 covers the most frequent lemmas, ranked 1 to 2,000; BRUNO-A2 covers those ranked 2,001 to 5,000; BRUNO-B1 covers those ranked 5,001 to 10,000; and so on.
