CORNELIA

The project CORNELIA ('''COR'''pus for '''N'''atural languag'''E''' to UN'''L''' '''I'''nductive '''A'''nalysis) is devoted to the creation of resources (LSS, SSS, UNL) for inducing grammars for UNLization.

== Goal ==
The project CORNELIA has three main goals:
#To provide Linear Sentence Structures ([[LSS]]) to be used in natural language disambiguation;
#To provide Syntactic Sentence Structures ([[SSS]]) to be used in natural language transformation;
#To provide NL-UNL memories, to be used in [[UNLization]].

== Repository ==
CORNELIA is language dependent. Every language has its own set of entries to be addressed. The list of entries is extracted from the Wikipedia and is organized as follows:

{|border="1" align="center" cellpadding="2"
!Subproject
!# of entries
!Description
|-
|align="center"|CORNELIA-A1
|align="center"|1,000
|align="center"|simple noun phrases
|-
|align="center"|CORNELIA-A2
|align="center"|1,000
|align="center"|simple verbal phrases
|-
|align="center"|CORNELIA-B1
|align="center"|1,000
|align="center"|complex noun phrases
|-
|align="center"|CORNELIA-B2
|align="center"|1,000
|align="center"|complex verbal phrases
|-
|align="center"|CORNELIA-C1
|align="center"|1,000
|align="center"|short sentences (< average sentence length)
|-
|align="center"|CORNELIA-C2
|align="center"|1,000
|align="center"|long sentences (> average sentence length)
|}

*Open-Class Word List (3,000 word forms)
*Corpus NC-A1
**Original corpus: 5-10 original articles from the Wikipedia about culture-specific subjects (minimum of 5,000 words), in separate files, in plain text format with UTF-8 encoding
**List of at least 1,000 noun phrases appearing in the corpus with the following characteristics:
***the length of the NP must be equal or greater than 2 words (one-word NP's must be excluded): <strike>Geneva</strike>
***NP's must not contain foreign words: <strike>the city of Genève</strike> (note that "the city of Geneva" is OK)
***NP's must be continuous (there cannot be any extra-content, e.g., parentheses, inside the NP): <strike>the second most populous city in Switzerland (after Zurich)</strike> (note that the NP will be "the second most populous city in Switzerland")
***NP's must not contain verbs, even when used as nouns, adjectives or adverbs: <strike>French-'''speaking''' part of Switzerland</strike>, <strike>numerous international organizations, '''including''' the headquarters of many of the agencies of the United Nations and the Red Cross</strike> (in the latter case, there will be 2 NP's: "numerous international organizations" and "the headquarters... Red Cross")
***NP's must be original (no change should be made to the original text from the Wikipedia)
***NP's must ignore nesting (only the longest NP must be considered): "the headquarters of many of the agencies of the United Nations and the Red Cross" must be treated as a single NP (the inner NP's, such as "the agencies of the United Nations and the Red Cross" must not be extracted from the longer NP)
***NP's must be unique (repetitions must be ignored)
***NP's must be provided one per line in a plain text file, with UTF-8 encoding.
The completion of the post-workshop tasks is not mandatory but any intermediate-level workshop will only accept candidates having finished all A1 activities described in [[FoR-UNL]].

== FOLLOW-UP ==
The following projects will be open upon the accomplishment of the post-workshop tasks
*BRUNO-A1 (open only for languages where number of subcategorization frames (all languages) > 15 and number of paradigms (inflectional languages) > 15): 2,000 entries (around 4,000 UNLdots)
*NC-A1: 1,000 entries (3,000 UNLdots)

== ADDITIONAL MATERIAL ==
=== Open Class Word List ===
Extracted from the most frequent words in Wikipedia

{|table border=1 cellpadding=5
!Language
!File
|-
|Arabic
|[http://www.unlweb.net/school/geneva2013/ar_words.xls ar_words.xls]
|-
|Armenian
|[http://www.unlweb.net/school/geneva2013/hy_words.xls hy_words.xls]
|-
|Bulgarian
|[http://www.unlweb.net/school/geneva2013/bg_words.xls bg_words.xls]
|-
|Chinese
|[http://www.unlweb.net/school/geneva2013/zh_words.xls zh_words.xls]
|-
|Kannada
|[http://www.unlweb.net/school/geneva2013/kn_words.xls kn_words.xls]
|-
|Khmer
|[http://www.unlweb.net/school/geneva2013/km_words.xls km_words.xls]
|-
|Malay
|[http://www.unlweb.net/school/geneva2013/ms_words.xls ms_words.xls]
|-
|Punjabi
|[http://www.unlweb.net/school/geneva2013/pa_words.xls pa_words.xls]
|-
|Ukrainian
|[http://www.unlweb.net/school/geneva2013/uk_words.xls uk_words.xls]
|}

== Requisites ==
CORNELIA is open to all languages, according to the following requisites:
*CORNELIA-A1 is required for CORNELIA-A2;
*CORNELIA-A2 is required for CORNELIA-B1;
*CORNELIA-B1 is required for CORNELIA-B2;
*CORNELIA-B2 is required for CORNELIA-C1;
*CORNELIA-C1 is required for CORNELIA-C2;

== Instructions ==
#Compilation
#*CORNELIA-A1: SIMPLE NP's
#**Original corpus: 5-10 original articles from the Wikipedia about culture-specific subjects (minimum of 5,000 words), in separate files, in plain text format with UTF-8 encoding
#**List of at least 1,000 noun phrases appearing in the corpus with the following characteristics:
#***the length of the NP must be equal or greater than 2 words (one-word NP's must be excluded): <strike>Geneva</strike>
#***NP's must not contain foreign words: <strike>the city of Genève</strike><ref>Note that "the city of Geneva" is OK.</ref>
#***NP's must be continuous (there cannot be any extra-content, e.g., parentheses, inside the NP): <strike>the second most populous city in Switzerland (after Zurich)</strike><ref>Note that the NP's will be "the second most populous city in Switzerland" and "Zurich".</ref>
#***NP's must not contain verbs, even when used as nouns, adjectives or adverbs: <strike>French-'''speaking''' part of Switzerland</strike>, <strike>numerous international organizations, '''including''' the headquarters of many of the agencies of the United Nations and the Red Cross</strike><ref>In the latter case, there will be 2 NP's: "numerous international organizations" and "the headquarters... Red Cross".</ref>
#***NP's must be original (no change should be made to the original text from the Wikipedia)
#***NP's must ignore nesting (only the longest NP must be considered): "the headquarters of many of the agencies of the United Nations and the Red Cross" must be treated as a single NP<ref>Inner NP's, such as "the agencies of the United Nations and the Red Cross", must not be extracted from the longer NP.</ref>
#***NP's must be unique (repetitions must be ignored)
#***NP's must be provided one per line in a plain text file, with UTF-8 encoding.

== Examples (NP) ==
{|table border=1 cellpadding=5
!width="50%"|original text
!width="50%"|NP
|-
|Geneva is the second most populous city in Switzerland (after Zurich) and is the most populous city of Romandy, the French-speaking part of Switzerland. Situated where the Rhone exits Lake Geneva, it is the capital of the Republic and Canton of Geneva. The municipality (ville de Genève) has a population (as of March 2013) of 194,245, and the canton (République et Canton de Genève, which includes the city) has 472,530 residents. In 2007, the urban area, or agglomération franco-valdo-genevoise (Great Geneva or Grand Genève in French) had 1,240,000 inhabitants in 189 municipalities in both Switzerland and France.
|
<strike>Geneva</strike> (length = 1)<br />
the second most populous city in Switzerland<br />
<strike>Zurich</strike> (length = 1)<br />
the most populous city of Romandy<br />
<strike>the French-speaking part of Switzerland</strike> (verb)<br />
<strike>Switzerland</strike> (length = 1)<br />
the Rhone<br />
Lake Geneva<br />
the capital of the Republic and Canton of Geneva<br />
The municipality<br />
<strike>ville de Genève</strike> (foreign language)<br />
a population<br />
the canton<br />
<strike>République et Canton de Genève</strike> (foreign language)<br />
the city<br />
472,530 residents<br />
the urban area<br />
<strike>agglomération franco-valdo-genevoise</strike> (foreign language)<br />
<strike>Great Geneva or Grand Genève in French</strike> (foreign language)<br />
1,240,000 inhabitants<br />
189 municipalities<br />
both Switzerland and France<br />
|}

== Notes ==
<references />

=== SSS Examples ===
{|table border=1 cellpadding=5
!sentence
!SSS
|-
|book
|NH(book)
|-
|the book
|NS(book;the)
|-
|beautiful book
|NA(book;beautiful)
|-
|book of John
|NA(book;:01)<br/>PC:01(of;John)
|-
|the book of John
|NS(book;the)<br/>NA(book;:01)<br/>PC:01(of;John)
|-
|the beautiful book of John
|NS(book;the)<br/>NA(book;beautiful)<br/>NA(book;:01)<br/>PC:01(of;John)
|-
|the book of Math of John
|NS(book;the)<br/>NA(book;:01)<br/>PC:01(of;Math)<br/>NA(book;:02)<br />PC:02(of;John)
|-
|the book about the construction of Babel
|NS(book;the)<br/>NA(book;:01)<br/>PC:01(about;:02)<br/>NS:02(construction;the)<br/>NA:02(construction;:03)<br/>PC:03(of;Babel)
|}

=== UNL Simplified Examples ===
{|table border=1 cellpadding=5
!sentence
!UNL
|-
|book
|book
|-
|the book
|book.@def
|-
|beautiful book
|mod(book;beautiful)
|-
|book of John
|pos(book;John)
|-
|the book of John
|pos(book.@def;John)
|-
|the beautiful book of John
|mod(book.@def;beautiful)<br />pos(book.@def;John)
|-
|the book of Math of John
|cnt(book.@def;Math)<br />pos(book.@def;John)
|-
|the book about the construction of Babel
|cnt(book.@def;:01)<br />obj(construction.@def;Babel)
|}

Latest revision as of 18:37, 24 September 2013

The project CORNELIA (CORpus for Natural languagE to UNL Inductive Analysis) is devoted to the creation of resources (LSS, SSS, UNL) for inducing grammars for UNLization.

Goal

The project CORNELIA has three main goals:

  1. To provide Linear Sentence Structures (LSS) to be used in natural language disambiguation;
  2. To provide Syntactic Sentence Structures (SSS) to be used in natural language transformation;
  3. To provide NL-UNL memories, to be used in UNLization.

Repository

CORNELIA is language dependent. Every language has its own set of entries to be addressed. The list of entries is extracted from the Wikipedia and is organized as follows:

Subproject    # of entries   Description
CORNELIA-A1   1,000          simple noun phrases
CORNELIA-A2   1,000          simple verbal phrases
CORNELIA-B1   1,000          complex noun phrases
CORNELIA-B2   1,000          complex verbal phrases
CORNELIA-C1   1,000          short sentences (< average sentence length)
CORNELIA-C2   1,000          long sentences (> average sentence length)

Requisites

CORNELIA is open to all languages, according to the following requisites:

  • CORNELIA-A1 is required for CORNELIA-A2;
  • CORNELIA-A2 is required for CORNELIA-B1;
  • CORNELIA-B1 is required for CORNELIA-B2;
  • CORNELIA-B2 is required for CORNELIA-C1;
  • CORNELIA-C1 is required for CORNELIA-C2.
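
The requisites above form a simple chain, which can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `SUBPROJECTS`, `prerequisite` and `can_start` are hypothetical and not part of any CORNELIA tool, and each subproject is assumed to depend only on the one listed immediately before it.

```python
# Subprojects in the order given by the requisites list (A1 comes first).
SUBPROJECTS = [
    "CORNELIA-A1",
    "CORNELIA-A2",
    "CORNELIA-B1",
    "CORNELIA-B2",
    "CORNELIA-C1",
    "CORNELIA-C2",
]


def prerequisite(subproject):
    """Return the subproject required before `subproject`, or None for A1."""
    index = SUBPROJECTS.index(subproject)
    return SUBPROJECTS[index - 1] if index > 0 else None


def can_start(subproject, completed):
    """Check whether the direct prerequisite is among the completed subprojects."""
    required = prerequisite(subproject)
    return required is None or required in completed
```

For example, `can_start("CORNELIA-B1", {"CORNELIA-A1"})` is false because CORNELIA-A2 has not been completed.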

Instructions

  1. Compilation
    • CORNELIA-A1: SIMPLE NP's
      • Original corpus: 5-10 original articles from the Wikipedia about culture-specific subjects (minimum of 5,000 words), in separate files, in plain text format with UTF-8 encoding
      • List of at least 1,000 noun phrases appearing in the corpus with the following characteristics:
        • the length of the NP must be equal to or greater than 2 words (one-word NP's must be excluded); rejected example: "Geneva"
        • NP's must not contain foreign words; rejected example: "the city of Genève"[1]
        • NP's must be continuous (there cannot be any extra content, e.g., parentheses, inside the NP); rejected example: "the second most populous city in Switzerland (after Zurich)"[2]
        • NP's must not contain verbs, even when used as nouns, adjectives or adverbs; rejected examples: "French-speaking part of Switzerland" and "numerous international organizations, including the headquarters of many of the agencies of the United Nations and the Red Cross"[3]
        • NP's must be original (no change should be made to the original text from the Wikipedia)
        • NP's must ignore nesting (only the longest NP must be considered): "the headquarters of many of the agencies of the United Nations and the Red Cross" must be treated as a single NP[4]
        • NP's must be unique (repetitions must be ignored)
        • NP's must be provided one per line in a plain text file, with UTF-8 encoding.
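
The mechanical checks in the list above (minimum length, continuity, uniqueness, one NP per line in UTF-8) can be sketched in Python. This is a minimal sketch, not part of the CORNELIA tooling: the function names `filter_noun_phrases` and `write_noun_phrases` are hypothetical, words are assumed to be separated by whitespace, and the foreign-word, verb and nesting rules are left to manual review because they need linguistic knowledge that this code does not have.

```python
from pathlib import Path


def filter_noun_phrases(candidates):
    """Keep candidates that pass the length, continuity and uniqueness checks."""
    seen = set()
    kept = []
    for candidate in candidates:
        phrase = candidate.strip()
        # Length rule: one-word NPs are excluded (assumes whitespace-separated words).
        if len(phrase.split()) < 2:
            continue
        # Continuity rule: reject NPs that contain parenthesized material.
        if "(" in phrase or ")" in phrase:
            continue
        # Uniqueness rule: exact repetitions are ignored; the text is never altered.
        if phrase in seen:
            continue
        seen.add(phrase)
        kept.append(phrase)
    return kept


def write_noun_phrases(phrases, path):
    """Write one NP per line to a plain text file with UTF-8 encoding."""
    Path(path).write_text("\n".join(phrases) + "\n", encoding="utf-8")
```

Comparison is exact and case-sensitive on purpose, since the NP's must be kept as they appear in the original text.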

Examples (NP)

original text:
Geneva is the second most populous city in Switzerland (after Zurich) and is the most populous city of Romandy, the French-speaking part of Switzerland. Situated where the Rhone exits Lake Geneva, it is the capital of the Republic and Canton of Geneva. The municipality (ville de Genève) has a population (as of March 2013) of 194,245, and the canton (République et Canton de Genève, which includes the city) has 472,530 residents. In 2007, the urban area, or agglomération franco-valdo-genevoise (Great Geneva or Grand Genève in French) had 1,240,000 inhabitants in 189 municipalities in both Switzerland and France.

NP:
Geneva (length = 1)
the second most populous city in Switzerland
Zurich (length = 1)
the most populous city of Romandy
the French-speaking part of Switzerland (verb)
Switzerland (length = 1)
the Rhone
Lake Geneva
the capital of the Republic and Canton of Geneva
The municipality
ville de Genève (foreign language)
a population
the canton
République et Canton de Genève (foreign language)
the city
472,530 residents
the urban area
agglomération franco-valdo-genevoise (foreign language)
Great Geneva or Grand Genève in French (foreign language)
1,240,000 inhabitants
189 municipalities
both Switzerland and France
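
The continuity rule illustrated above, where parenthesized material is never part of the surrounding NP, can be approximated by splitting the text at parentheses before looking for NP's. The sketch below is an assumption-laden illustration: `split_on_parentheses` is a hypothetical helper, nested parentheses are simply flattened, and identifying the NP inside each segment (for example "Zurich" inside "after Zurich") is still left to the annotator.

```python
import re


def split_on_parentheses(text):
    """Split text into continuous segments, separating parenthesized material."""
    segments = re.split(r"[()]", text)
    # Drop empty pieces and surrounding whitespace left over from the split.
    return [segment.strip() for segment in segments if segment.strip()]
```

Applied to "the second most populous city in Switzerland (after Zurich)", this yields the segments "the second most populous city in Switzerland" and "after Zurich".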

Notes

  1. Note that "the city of Geneva" is OK.
  2. Note that the NP's will be "the second most populous city in Switzerland" and "Zurich".
  3. In the latter case, there will be 2 NP's: "numerous international organizations" and "the headquarters... Red Cross".
  4. Inner NP's, such as "the agencies of the United Nations and the Red Cross", must not be extracted from the longer NP.