NL Reference Corpus

**Source: Wikipedia
**Total number of distinct sentences: 801,258
**Annotated Corpus
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A1_S0.rar Corpus NC_ara_A1_S0] (length <= 9: 141,988 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A2_S0.rar Corpus NC_ara_A2_S0] (9 < length <= 13: 150,406 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B1_S0.rar Corpus NC_ara_B1_S0] (13 < length <= 17: 146,178 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B2_S0.rar Corpus NC_ara_B2_S0] (17 < length <= 22: 141,376 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C1_S0.rar Corpus NC_ara_C1_S0] (22 < length <= 32: 165,455 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C2_S0.rar Corpus NC_ara_C2_S0] (length > 32: 165,616 sentences)
**Training Corpus
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A1_S1.rar Corpus NC_ara_A1_S1] (1,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A2_S1.rar Corpus NC_ara_A2_S1] (1,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B1_S1.rar Corpus NC_ara_B1_S1] (1,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B2_S1.rar Corpus NC_ara_B2_S1] (1,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C1_S1.rar Corpus NC_ara_C1_S1] (1,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C2_S1.rar Corpus NC_ara_C2_S1] (1,000 sentences)
**Testing Corpus
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A1_S2.rar Corpus NC_ara_A1_S2] (4,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_A2_S2.rar Corpus NC_ara_A2_S2] (4,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B1_S2.rar Corpus NC_ara_B1_S2] (4,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_B2_S2.rar Corpus NC_ara_B2_S2] (4,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C1_S2.rar Corpus NC_ara_C1_S2] (4,000 sentences)
***[http://www.unlweb.net/resources/corpus/NC/NC_ara_C2_S2.rar Corpus NC_ara_C2_S2] (4,000 sentences)
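
The length ranges listed for the NC_ara files can be read as a lookup from token count to level. The following is a minimal sketch of that lookup; the names UPPER_BOUNDS, LEVELS and level_for_length are hypothetical and do not come from any UNL tool, and the bounds apply only to the NC_ara files listed here.

```python
from bisect import bisect_left

# Upper token-count bound of each level except C2, taken from the
# ranges listed for the NC_ara files (A1: <= 9, A2: <= 13, ...).
UPPER_BOUNDS = [9, 13, 17, 22, 32]
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def level_for_length(n_tokens):
    """Return the level whose length range contains n_tokens."""
    # bisect_left counts the bounds strictly smaller than n_tokens,
    # which is exactly the index of the matching level.
    return LEVELS[bisect_left(UPPER_BOUNDS, n_tokens)]
```

For example, a 9-token sentence falls in A1, a 10-token sentence in A2, and anything longer than 32 tokens in C2.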

Revision as of 18:42, 17 April 2014

The NL Reference Corpus (NC) is the corpus used to prepare and assess grammars for sentence-based UNLization. It is divided into 6 levels, according to the Framework of Reference for UNL (FoR-UNL):

  • NC-A1: NL Reference Corpus A1
  • NC-A2: NL Reference Corpus A2
  • NC-B1: NL Reference Corpus B1
  • NC-B2: NL Reference Corpus B2
  • NC-C1: NL Reference Corpus C1
  • NC-C2: NL Reference Corpus C2

Methodology

As a natural language corpus, the NC varies for each language. It is derived from a base corpus to be compiled and processed according to the following criteria:

  1. The Base Corpus must have at least 5,000,000 tokens (strings isolated by blank space and other word boundary markers). It must be representative of the contemporary standard use of the written language, and should include documents from as many different genres and domains as possible.
  2. The Base Corpus must be segmented into sentences.
  3. The Segmented Corpus must be tokenized (according to the natural language dictionary exported from the UNLarium).
  4. The Tokenized Corpus must be annotated for lexical category, in order to generate the linear sentence structures (LSS).
  5. The Annotated Corpus must be subdivided into 6 different subsets, according to the number of tokens:
    • A1 = length <= 15th percentile (very short sentences)
    • A2 = 15th percentile < length <= 30th percentile (short sentences)
    • B1 = 30th percentile < length <= 45th percentile (short medium-size sentences)
    • B2 = 45th percentile < length <= 60th percentile (long medium-size sentences)
    • C1 = 60th percentile < length <= 80th percentile (long sentences)
    • C2 = length > 80th percentile (very long sentences)
  6. Each subset is used to compile the corresponding NC corpus, which comprises a training corpus and a testing corpus.
    • The training corpus consists of 1 exemplar of each of the 1,000 most frequent LSS, and is used to prepare the grammar. (1,000 sentences in total)
    • The testing corpus consists of 4 exemplars of each LSS included in the training corpus. The exemplars are randomly selected from the Annotated Corpus. (4,000 sentences in total)
  7. The whole NC corpus (i.e., 5 exemplars for each LSS) is used to calculate the F-measure, which combines the precision and the recall of the grammars into a single score.
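
Steps 5 to 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used to build the NC: sentences are assumed to be (tokens, lss) pairs, the percentile is computed by the nearest-rank method, the training exemplar is drawn at random and kept out of the testing set, and the F-measure is taken to be the balanced F1. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
CUTS = [15, 30, 45, 60, 80]  # percentile boundaries from step 5


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted list (assumed definition)."""
    rank = max(1, -(-p * len(sorted_values) // 100))  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def split_by_length(sentences):
    """Step 5: group (tokens, lss) pairs into six levels by token count."""
    lengths = sorted(len(tokens) for tokens, _ in sentences)
    bounds = [percentile(lengths, p) for p in CUTS]
    subsets = defaultdict(list)
    for tokens, lss in sentences:
        # The level index is the number of boundaries the length exceeds.
        index = sum(len(tokens) > bound for bound in bounds)
        subsets[LEVELS[index]].append((tokens, lss))
    return subsets


def compile_nc(subset, top_n=1000, test_per_lss=4, seed=0):
    """Step 6: 1 training and up to 4 testing exemplars per frequent LSS."""
    rng = random.Random(seed)
    by_lss = defaultdict(list)
    for tokens, lss in subset:
        by_lss[lss].append(tokens)
    counts = Counter({lss: len(items) for lss, items in by_lss.items()})
    training, testing = [], []
    for lss, _ in counts.most_common(top_n):
        pool = by_lss[lss][:]
        rng.shuffle(pool)  # random selection of exemplars
        training.append((pool[0], lss))
        # Assumption: testing exemplars never repeat the training one.
        testing.extend((tokens, lss) for tokens in pool[1:1 + test_per_lss])
    return training, testing


def f_measure(precision, recall):
    """Step 7: balanced F-measure (F1), the harmonic mean of P and R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An LSS with fewer than 5 sentences in a subset contributes fewer testing exemplars in this sketch; how the NC handles that case is not stated here.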

Files

Software