NL Reference Corpus

From UNL Wiki
Revision as of 21:27, 18 March 2014

The NL Reference Corpus (NC) is the corpus used to prepare and assess grammars for sentence-based UNLization. It is divided into six levels, according to the Framework of Reference for UNL (FoR-UNL):

  • NC-A1: NL Reference Corpus A1
  • NC-A2: NL Reference Corpus A2
  • NC-B1: NL Reference Corpus B1
  • NC-B2: NL Reference Corpus B2
  • NC-C1: NL Reference Corpus C1
  • NC-C2: NL Reference Corpus C2

Methodology

As a natural language corpus, the NC varies for each language. It is derived from a base corpus to be compiled and processed according to the following criteria:

  1. The Base Corpus must have at least 5,000,000 tokens (strings isolated by blank space and other word boundary markers). It must be representative of the contemporary standard use of the written language, and should include documents from as many different genres and domains as possible.
  2. The Base Corpus must be segmented (in sentences) and tagged for POS.
  3. The segmented corpus is used to calculate the average sentence length (ASL), defined as the median length (in words) of all sentences.
  4. The tagged corpus is used to extract the linear sentence structures (LSS), which are sequences of POS, and to calculate their frequency of occurrence.
  5. The average sentence length (ASL) and the linear sentence structures (LSS) are used to generate the NC templates, as follows:
    • NC-A1 = 1000 most frequent LSS's where length < (ASL/2) (1000 most frequent shortest structures)
    • NC-A2 = 2000 most frequent LSS's where length < (ASL/2) (2000 most frequent shortest structures)
    • NC-B1 = 3000 most frequent LSS's where length < ASL (3000 most frequent short structures)
    • NC-B2 = 4000 most frequent LSS's where length < ASL (4000 most frequent short structures)
    • NC-C1 = 5000 most frequent LSS's (5000 most frequent structures)
    • NC-C2 = 6000 most frequent LSS's (6000 most frequent structures)
  6. The NC templates are used to compile the NC corpora: the training corpus and the testing corpus. The training corpus consists of one exemplar of each LSS and is used to prepare the grammar. The testing corpus consists of four exemplars of each LSS, randomly selected from the Base Corpus. The whole NC corpus (i.e., five exemplars of each LSS) is used to calculate the F-measure, which is the parameter for assessing the precision and the recall of the grammars.
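Steps 3–5 above (and the F-measure of step 6) can be sketched in Python, under the assumption that the tagged Base Corpus is available as a list of sentences, each a list of (word, POS) pairs. The toy corpus, the helper `nc_template`, and all variable names are illustrative, not part of the UNL specification:

```python
from collections import Counter
from statistics import median

# Toy stand-in for the tagged Base Corpus: each sentence is a
# list of (word, POS) pairs.
tagged_corpus = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
     ("the", "DET"), ("cat", "NOUN")],
]

# Step 3: the average sentence length (ASL) is the median length,
# in words, of all sentences.
asl = median(len(sentence) for sentence in tagged_corpus)

# Step 4: a linear sentence structure (LSS) is the sequence of POS
# tags of a sentence; count the frequency of each LSS.
lss_counts = Counter(
    tuple(pos for _, pos in sentence) for sentence in tagged_corpus
)

def nc_template(counts, n, max_length=None):
    """Return the n most frequent LSS's, optionally keeping only
    those strictly shorter than max_length."""
    candidates = [
        lss for lss, _ in counts.most_common()
        if max_length is None or len(lss) < max_length
    ]
    return candidates[:n]

# Step 5: the six NC templates, with the sizes listed above.
templates = {
    "NC-A1": nc_template(lss_counts, 1000, max_length=asl / 2),
    "NC-A2": nc_template(lss_counts, 2000, max_length=asl / 2),
    "NC-B1": nc_template(lss_counts, 3000, max_length=asl),
    "NC-B2": nc_template(lss_counts, 4000, max_length=asl),
    "NC-C1": nc_template(lss_counts, 5000),
    "NC-C2": nc_template(lss_counts, 6000),
}

# Step 6: the grammar prepared on the training corpus is assessed
# with the F-measure, the harmonic mean of precision and recall.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)
```

On the three-sentence toy corpus the ASL is 3, so NC-A1 (length < 1.5) is empty and NC-B1 (length < 3) contains only the two-word structure; with a real 5,000,000-token Base Corpus the same thresholds select the template sizes listed in step 5.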

Files

  • Arabic
    • Total number of distinct sentences: 801,258
    • ASL = 16
    • Corpus
      • Corpus NC_ara_A1: http://www.unlweb.net/resources/corpus/NC/NC_ara_A1.rar
      • Corpus NC_ara_A2: http://www.unlweb.net/resources/corpus/NC/NC_ara_A2.rar
      • Corpus NC_ara_B1: http://www.unlweb.net/resources/corpus/NC/NC_ara_B1.rar
      • Corpus NC_ara_B2: http://www.unlweb.net/resources/corpus/NC/NC_ara_B2.rar
      • Corpus NC_ara_C1: http://www.unlweb.net/resources/corpus/NC/NC_ara_C1.rar
      • Corpus NC_ara_C2: http://www.unlweb.net/resources/corpus/NC/NC_ara_C2.rar

Software