JRC-Ispra, 16.09.04, Slide 1 Multilingual text analysis applications based on automatic Eurovoc indexing Ralf Steinberger Addressing the Language Barrier.

Presentation transcript:

Slide 1: Multilingual text analysis applications based on automatic Eurovoc indexing
Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September

Slide 2: Applications mentioned so far
- Thesaurus indexing (summarises the main concepts of a document)
  - Fully automatic
  - Interactive
  - Monolingual and cross-lingual
- Document retrieval
  - Monolingual and cross-lingual
Eurovoc indexing can be used for MUCH MORE …

Slide 3: Main goals of the JRC's Language Technology (LT) activity
- Gather potentially user-relevant documents
- Analyse texts in various languages
  - Extract information from texts (Eurovoc)
  - Identify similarity between documents (Eurovoc)
  - Classify documents (Eurovoc)
- Visualise contents
  - of individual documents (Eurovoc)
  - of whole document collections (Eurovoc)

Slide 4: Eurovoc indexing as part of a tool set

Slide 5: (Cross-lingual) document similarity calculation
[Diagram labels: English text; English text ("Resolution on radioactive waste"); Spanish text ("Resolución sobre los residuos radioactivos"); monolingual]
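The slide does not state which similarity measure is used. As a minimal sketch, assuming each document is represented by a weighted profile of Eurovoc descriptors and that cosine similarity is an acceptable stand-in for the actual measure, the cross-lingual comparison could look like this. The descriptor labels and weights are hypothetical; in practice the keys would be language-independent descriptor identifiers, which is what lets an English and a Spanish text be compared directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor-weight profiles (dicts)."""
    shared = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptor profiles (assumed output of an indexing step).
# Keys stand in for language-independent descriptor identifiers.
english_doc = {"radioactive waste": 0.9, "nuclear safety": 0.6,
               "environmental protection": 0.3}
spanish_doc = {"radioactive waste": 0.8, "nuclear safety": 0.5,
               "waste management": 0.4}
unrelated_doc = {"fishing quota": 0.9, "common fisheries policy": 0.7}

sim_translation = cosine_similarity(english_doc, spanish_doc)
sim_unrelated = cosine_similarity(english_doc, unrelated_doc)
```

Because both profiles live in the same descriptor space, the same function serves the monolingual and the cross-lingual case.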

Slide 6: (Multilingual) text classification
- Most current approaches to text classification are monolingual
- Text classification via Eurovoc is multilingual
[Diagram labels: Category 1, Category 2, Category 3; Es, Fr, Es]

Slide 7: (Multilingual) document map
© Cartia's ThemeScape

Slide 8: Translation Spotting
Why?
- To test document similarity calculation
- To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
- To detect cross-lingual document plagiarism

Slide 9: Translation Spotting - Results
Task: find Spanish translations of an English source document in a parallel text collection
Chart legend:
- Simple document similarity (DS)
- DS correcting the monolingual bias (83%)
- DS considering the length of documents
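The translation-spotting task can be sketched as a ranking problem: score every Spanish candidate against the English source and take the best match. The example below covers only the "simple document similarity" variant and assumes cosine similarity over hypothetical descriptor profiles. The slide does not say how the monolingual-bias correction or the length adjustment are computed, so they are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-weight profiles (dicts)."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def spot_translation(source_profile, candidates):
    """Return the id of the candidate most similar to the source document."""
    return max(candidates,
               key=lambda doc_id: cosine(source_profile, candidates[doc_id]))


# Hypothetical profiles: one English source, three Spanish candidates.
english_source = {"radioactive waste": 0.9, "nuclear safety": 0.6}
spanish_candidates = {
    "es_waste": {"radioactive waste": 0.8, "nuclear safety": 0.5},
    "es_fisheries": {"fishing quota": 0.9, "common fisheries policy": 0.7},
    "es_energy": {"nuclear safety": 0.4, "energy policy": 0.9},
}

best_match = spot_translation(english_source, spanish_candidates)
```

Running this over every source document and counting how often the true translation ranks first would give an accuracy figure for the method.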

Slide 10: (Multilingual) clustering of documents
Purpose: to organise unknown document collections
Algorithm:
- Find pairs of texts that are most similar
- Group them in one cluster, repeat the operation until only one cluster remains
[Diagram: similarity levels 90%, 80%, 75%, 40%, 10%]
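The two steps of the algorithm describe bottom-up (agglomerative) clustering. The sketch below is one possible reading, not the JRC implementation: it assumes cosine similarity over hypothetical descriptor profiles and average linkage between clusters, neither of which is specified on the slide. The returned merge list records the similarity at which each cluster was formed.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-weight profiles (dicts)."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_link(c1, c2, profiles):
    """Mean pairwise similarity between members of two clusters (an assumption)."""
    sims = [cosine(profiles[i], profiles[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(profiles):
    """Merge the most similar pair of clusters until only one cluster remains.

    Returns the merges in order as (members_a, members_b, similarity).
    """
    clusters = [(doc_id,) for doc_id in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_link(pair[0], pair[1], profiles))
        merges.append((a, b, average_link(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical profiles in several languages, all in one descriptor space.
docs = {
    "en_waste": {"radioactive waste": 0.9, "nuclear safety": 0.6},
    "es_waste": {"radioactive waste": 0.8, "nuclear safety": 0.5},
    "en_fish": {"fishing quota": 0.9, "fisheries policy": 0.7},
    "fr_fish": {"fishing quota": 0.7, "fisheries policy": 0.8},
}
merges = cluster(docs)
```

This naive version recomputes all pairwise similarities on every pass, which is fine for illustration but would need caching for large collections.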

Slide 11: Building a (multilingual) cluster tree

Slide 12: Application to (multilingual) news analysis
- The EMM system in the JRC's Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4,000 articles in English)
- Cluster related news stories and identify duplicates (news topic identification)
- Identify keywords, people's names, place names, main sentences (information extraction)
- Find related news stories over time (news topic tracking)
- Find related news stories in other languages (cross-lingual topic tracking, mainly via Eurovoc and place names)

Slide 13: Detection of the major news of the day (EMM)

Slide 14: Establish links to related news over time

Slide 15: Establish links to related news in other languages

Slide 16: Subject-specific summarisation (1)
Title: "Resolution on the 10th anniversary of the Chernobyl accident"
Eurovoc descriptors:

Slide 17: Subject-specific summarisation (2)
Eurovoc descriptors:

Slide 18: Further JRC LT applications
- Recognition and translation of:
  - Place names; + visualisation
  - People's names; + retrieval of images and further information
  - Dates
  - Products
- Recognition of text language

Slide 19: Place name recognition / Cross-lingual display

Slide 20: Place name recognition / Visualisation
- 18 references (Boston, American, America, New York)
- 11 references (Vietnam)
- 5 references (Iraq)
- + 1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)

Slide 21: Place name recognition / Disambiguation
- Requires disambiguation
  - 14 Paris, 7 Birminghams
  - cities called And, Annan
  - name variants (exonyms)
- Zoom on Europe

Slide 22: Recognising names, places, … - News navigation
Top-mentioned personalities, En/Fr news, 26 July 2004

Slide 23: Automatic recognition of name variants

Slide 24: Automatic link to online encyclopaedia

Slide 25: News clusters mentioning a person

Slide 26: Persons talked about in same news clusters

Slide 27: Countries talked about in same news clusters

Slide 28: Frequent keywords for these news clusters

Slide 29: Recognising products and product groups. Sample text

Slide 30: Recognising products and product groups. Identified products

Slide 31: Recognising products and product groups. Cross-lingual display of products found

Slide 32

Multilingual Information Extraction
- Language recognition (demo)
- Keywords (monolingual; cross-lingual)
- Geographical place names (intro; new EU languages; demo)
- Products and product groups (slides; demo JRC, demo CIS)
- Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
- Dates (demo recognition)
- Terminology extraction
- Summarisation (standard sentence extraction; subject-specific summarisation)

Cross-lingual navigation and classification
- Document similarity (monolingual; cross-lingual; translation spotting)
- Bottom-up document clustering; topic detection (demo news analysis)
- Classification (multi-monolingual and cross-lingual; pre-classification clustering)
- Relevance-ranking of documents (slides)
- News topic tracking (monolingual historical; cross-lingual; demo news analysis)
- Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)

Visualisation of textual contents
- Individual documents (document profile)
- Whole document collections (document map)
- Geographical information (maps; animated maps, demo)
- Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …

Further tools
- Document Gathering (Lang-Tech crawler; WT's EMM system)
- Document format conversion (PDF, MS-Word, PS, HTML, XML)
- Character set conversion (UTF-8, ISO-Latin, HTML, …)

Projects
- IDoRA for OLAF (slides)
- Cross-lingual Indexing (EUROVOC)
- Breaking News - Detection and Visualisation (BNDV / State-of-the-World)
- SVM for Text Classification
- Modus Operandi
- Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)

JRC Introduction
Multilingual and crosslingual text analysis