
1 JRC-Ispra, 17.09.04, Slide 1
Next Steps / Technical Details
Bruno Pouliquen & Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU
Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech

2 Eurovoc indexing – Extend language coverage
Analysis: Danish, Dutch, English, Finnish, French, German, (Greek), Italian, Portuguese, Spanish, Swedish, (Lithuanian), (Bulgarian), (Hungarian)
Czech, Croatian, Latvian, Lithuanian, Polish, Slovak
Soon also: Albanian, Romanian, Russian, Slovene
Display: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish

3 Incentive for collaboration
Mutual benefit
– We can provide tools and results to you (to non-commercial Member State organisations)
– JRC will be able to Eurovoc-index documents for news analysis, etc.
No payments by the JRC are foreseen
How to go ahead? / What to do next?
→ We need Eurovoc-indexed texts in your languages (or translations of Eurovoc-indexed texts!) (Acquis Communautaire)

4 Format to provide training texts to the JRC
Ideally:
Plain text (not MS-Word, RTF, PDF, etc.)
UTF-8 character encoding
With CELEX code
With Eurovoc descriptor code (mentioning Eurovoc version)
XML format, structured
Linguistically pre-processed and structured:
– lemmatised
– annexes / signatures separate
– title separate
– stop word lists
MANY texts:
– 80,000 English texts were enough to train ca. 3500 descriptors (out of 6000)!
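The requirements above could be met by a record along the following lines. This is only a minimal sketch: the element and attribute names (`document`, `celex`, `descriptors`, `eurovoc_version`, `descriptor`, `code`) and all values are hypothetical placeholders, not the XML format actually requested by the JRC.

```python
import xml.etree.ElementTree as ET


def build_record(celex, title, body, descriptor_codes, eurovoc_version):
    """Return one training document as UTF-8 encoded XML bytes.

    All element and attribute names are illustrative assumptions.
    """
    doc = ET.Element("document", {"celex": celex})
    # Title and body are kept apart, as the slide asks for a separate title.
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "text").text = body
    # The Eurovoc version is recorded next to the descriptor codes.
    descriptors = ET.SubElement(
        doc, "descriptors", {"eurovoc_version": eurovoc_version}
    )
    for code in descriptor_codes:
        ET.SubElement(descriptors, "descriptor", {"code": code})
    return ET.tostring(doc, encoding="utf-8")


# Placeholder values, not real CELEX or Eurovoc codes.
record = build_record(
    celex="00000X0000",
    title="Example title",
    body="Example body text.",
    descriptor_codes=["1111", "2222"],
    eurovoc_version="4",
)
```

Keeping the Eurovoc version on the same element as the descriptor codes means every document states which thesaurus version its codes refer to.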

5 Descriptor distribution in Spanish EP/EC texts

6 Descriptor distribution in Spanish EP/EC texts

7 Descriptor distribution in Spanish Congress texts

8 Descriptor distribution in Hungarian texts

9 Procedure
You provide us with
– A big XML file containing the documents
– A stop word list
We will give back to you
– A subset of documents (evaluation set)
   Same format
   Additional information on automatic Eurovoc descriptors assigned
– Some statistics on descriptor usage frequency, etc.
– An online browser interface to see the assignment results
– A validation interface

10 [Workflow diagram: Your corpus is split into a training set (95%) and an evaluation set (5%). Training set → pre-processing → training → descriptor profiles. Evaluation set → pre-processing → assignment → Eurovoc assignment export.]
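The 95% / 5% division of the corpus might be sketched as follows. The slides do not say how the evaluation documents are chosen, so the seeded random shuffle and the function name `split_corpus` are assumptions made for illustration.

```python
import random


def split_corpus(doc_ids, eval_fraction=0.05, seed=0):
    """Split document ids into (training_set, evaluation_set).

    A seeded shuffle is an assumption; it only makes the split repeatable.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    # Hold out about eval_fraction of the documents, at least one.
    n_eval = max(1, round(len(ids) * eval_fraction))
    return ids[n_eval:], ids[:n_eval]


training_set, evaluation_set = split_corpus(range(1000))
```

Only the training set would be used to build descriptor profiles; the held-out documents are the ones returned with automatically assigned descriptors for inspection.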

11 XML format

14 Results of descriptor assignment - interface

15 Results of descriptor assignment - XML
PRESIDENCY OF THE EC COUNCIL EUROPEAN UNION PRESIDENT SOCIAL POLICY PRINCIPLE OF SUBSIDIARITY...

16 Results of descriptor assignment - validation
Numeric feedback?

17 Arranging the collaboration of scientific partners
The JRC will be able to provide the tool and indexing results.
The JRC does not have specific funds to pay for this work.
Possibilities for collaboration between parliament and scientists
– informal collaboration without payment
– formal collaboration (contract, payment)
– apply for a project with national or EU funding (example: Hungary)
– M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), …
– …
We would like to have lemmatisers for the new languages.
If necessary, we can train the system without linguistic pre-processing.

18 Pre-processing of the texts (by scientists?)
Linguistic pre-processing, needed for each language:
– General and corpus-specific list of stop words (several thousand!)
– For highly inflected languages: some lemmatiser or stemmer
– Multi-word term mark-up for disambiguation purposes?
Further text processing
– Some document structuring to separate title, text, footer and annex
– Conversion to XML
– Conversion to UTF-8
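Two of the steps listed above, conversion to UTF-8 and stop word filtering, could look roughly like this. It is a sketch under stated assumptions: the source encoding (ISO-8859-2), the Polish sample sentence, the one-word stop list and the simple regular-expression tokeniser are all illustrative, and a real pipeline would add a lemmatiser or stemmer for highly inflected languages.

```python
import re


def to_utf8(raw_bytes, source_encoding):
    """Re-encode a legacy-encoded document as UTF-8 bytes."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def remove_stop_words(text, stop_words):
    """Lower-case the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


# Illustrative input: a Polish phrase stored in ISO-8859-2.
sentence = "Rozporządzenie Rady w sprawie rybołówstwa"
legacy_bytes = sentence.encode("iso-8859-2")

utf8_bytes = to_utf8(legacy_bytes, "iso-8859-2")
content_words = remove_stop_words(utf8_bytes.decode("utf-8"), stop_words={"w"})
```

In practice the stop word list would hold several thousand entries, as the slide notes, and would be loaded from the file supplied with the corpus.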

19 Dealing with different versions of Eurovoc
The problem has not yet been solved: request for your input
English training material was indexed with versions 3.1 and 4
Challenge: new descriptors need new training material → delay
Re-training required

20 Dealing with different versions of Eurovoc (2)
Case 1: New descriptor → Search old and new documents for related documents for re-training
Case 2: New name for old descriptor → Replace the descriptor name: OLD_NAME → NEW_NAME
Case 3: New place in hierarchy → No problem
Case 4: Disappearing descriptor → Will no longer be assigned

21 Dealing with different versions of Eurovoc (2)
Case 5: Several descriptors are conflated → No problem
   OLD_NAME_1, OLD_NAME_2, OLD_NAME_3 → NEW_NAME
Case 6: A descriptor is split into two or more → Re-training required (see Case 1)
   OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
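The cases that can be handled by relabelling existing training documents (Cases 2, 4, 5 and 6) might be sketched as below. The function `migrate_descriptors`, its lookup tables and the descriptor names are hypothetical. Case 1 (a brand-new descriptor) and Case 3 (a move in the hierarchy) are left out, because neither changes the labels already attached to a document.

```python
def migrate_descriptors(assigned, renamed, merged, split, removed):
    """Map one document's old-version descriptors to the new version.

    Returns (kept, retrain): the descriptors carried over, and the new
    descriptors for which this document is only a re-training candidate.
    """
    kept, retrain = set(), set()
    for name in assigned:
        if name in removed:
            # Case 4: the descriptor disappears and is no longer assigned.
            continue
        if name in split:
            # Case 6: no successor can be chosen automatically.
            retrain.update(split[name])
            continue
        name = renamed.get(name, name)  # Case 2: new name for old descriptor
        name = merged.get(name, name)   # Case 5: conflated descriptors
        kept.add(name)
    return kept, retrain


# Hypothetical change tables between two thesaurus versions.
renamed = {"OLD_NAME": "NEW_NAME"}
merged = {"OLD_NAME_1": "MERGED_NAME", "OLD_NAME_2": "MERGED_NAME"}
split = {"SPLIT_NAME": {"NEW_NAME_1", "NEW_NAME_2", "NEW_NAME_3"}}
removed = {"DROPPED_NAME"}

kept, retrain = migrate_descriptors(
    {"OLD_NAME", "OLD_NAME_1", "OLD_NAME_2", "SPLIT_NAME", "DROPPED_NAME", "UNCHANGED"},
    renamed, merged, split, removed,
)
```

Documents that land in `retrain` would still need a manual or search-based decision about which of the new descriptors applies, which is why the slide groups Case 6 with Case 1.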

22 Dealing with different versions of Eurovoc (3)
Changes between Eurovoc versions should not only be described in free text.
They should be formalised in a machine-readable way (e.g. in XML, in table format, …).
This should be done centrally for the thesaurus (i.e. for all thesaurus languages), rather than separately for each language!
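One possible shape for such a machine-readable change list, together with code that reads it, is sketched below. The XML vocabulary (`changes`, `added`, `removed`, `merged`, `split`, `source`, `target`) and the descriptor ids are invented for illustration; no such format is defined in the slides. Language-independent ids are used so that a single file could serve all thesaurus languages.

```python
import xml.etree.ElementTree as ET

# Hypothetical change list between two thesaurus versions.
CHANGES_XML = """
<changes from_version="3.1" to_version="4">
  <added id="D900"/>
  <removed id="D100"/>
  <merged into="D300">
    <source id="D301"/>
    <source id="D302"/>
  </merged>
  <split from="D400">
    <target id="D401"/>
    <target id="D402"/>
  </split>
</changes>
"""


def parse_changes(xml_text):
    """Read the change list into plain Python lists and dictionaries."""
    root = ET.fromstring(xml_text.strip())
    changes = {"added": [], "removed": [], "merged": {}, "split": {}}
    for element in root:
        if element.tag == "added":
            changes["added"].append(element.get("id"))
        elif element.tag == "removed":
            changes["removed"].append(element.get("id"))
        elif element.tag == "merged":
            # Each conflated descriptor points to its single successor.
            for source in element.findall("source"):
                changes["merged"][source.get("id")] = element.get("into")
        elif element.tag == "split":
            targets = [target.get("id") for target in element.findall("target")]
            changes["split"][element.get("from")] = targets
    return changes


changes = parse_changes(CHANGES_XML)
```

With the changes in this form, an indexing system could decide automatically which descriptors only need relabelling and which need new training material.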

23 Appeal to Eurovoc community / EP / OPOCE
Make Eurovoc available to the wide public in machine-readable form
Formalise the version differences (e.g. XML)
Make Eurovoc-indexed texts available to the scientific community
– Controlled by licences, if necessary
– E.g. via the Evaluations and Language resources Distribution Agency ELDA
   See http://www.elda.fr
   “ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.”
– Wealth of ‘parallel texts’ to train multilingual text analysis applications
   Machine Translation
   Multilingual Named Entity Recognition
   Multilingual classification
   Multi-document summarisation
   …
   Automatic indexing
→ The benefit is yours!


