The UNDL Foundation has launched the LACE project (Language Acquisition from Comparable tExts). The main goal of the project is to build language modules from data automatically extracted from comparable corpora. The results are expected to be incorporated into the architecture of UNL-based systems as supplementary resources for natural language disambiguation, in both analysis and generation, and will be used to improve the performance of applications in machine translation, summarization, information retrieval and semantic reasoning.
So far, UNL-based systems have been built on lexical resources compiled largely by hand, mainly because current word sense disambiguation technology has not yet reached the level of maturity that would make human intervention unnecessary. The increasing availability of natural language data in digital format, however, encourages the exploration of strategies for extracting supplementary lexical information from comparable corpora, which could extend the coverage of the current resources and might ultimately provide a less expensive way of populating lexical databases in the UNL framework. The LACE project aims to compile, replicate and extend techniques that have been widely used in statistical natural language processing, and to evaluate their results in UNL-based applications.
As a long-term enterprise, the project has been divided into three subsidiary projects, each devoted to a different type of corpus and therefore involving a different extraction strategy:
- LACE[pc] - To extract data from parallel corpora (proceedings from the United Nations and from the European Parliament) using Moses, a statistical machine translation system;
- LACE[hpc] - To extract data from comparable semi-parallel corpora (Wikipedia) using high-performance computing; and
- LACE[npc] - To extract data from comparable non-parallel corpora (newspapers) using linguistically motivated models of automatic language acquisition.
The LACE[hpc] project has already been proposed to and accepted by the Centre for Advanced Modelling Science (CADMOS), and will be developed by a consortium of two universities (the University of Geneva and École Polytechnique Fédérale de Lausanne), in collaboration with the UNDL Foundation. It aims to design and implement efficient high-performance computing methods for extracting monolingual and multilingual resources from Wikipedia. The other two projects are still being specified and remain open to external contributions.
- PhD position at the University of Geneva (UNIGE) to work on LACE[hpc]
- Presentation at the CADMOS meeting