Eleni Galiotou, Dept. of Informatics

Presentation transcript:

A Preliminary Investigation into the Automatic EuroVoc Indexing of Greek Documents. Eleni Galiotou, Dept. of Informatics, Technological Educational Institute of Athens. PCI 2014, Athens, Oct. 2-4, 2014

Keyword Identification. Search and retrieval in modern information retrieval systems rely on keywords and key-phrases. Keywords (Lancaster 1998) are obtained either by keyword extraction or by keyword assignment. In keyword assignment, keywords are identified in a controlled vocabulary list (thesaurus), so the descriptors do not necessarily appear explicitly in the text.

Thesauri. There are two kinds: natural language and conceptual. Natural language thesauri: WordNet (English), whose lexical semantic relations are structured around an exhaustive list of synonym sets; EuroWordNet (European languages); BalkaNet (Balkan languages). Conceptual thesauri use descriptors, which are abstract conceptual terms. EuroVoc is a multilingual, multidisciplinary thesaurus that covers the activities of the EU (in particular those of the European Parliament). It is used by the European Parliament, the European Commission’s Publications Office and many other institutions.

EuroVoc. Terms are available in at least 27 languages (translated one-to-one): the 23 official languages of the EU (Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish), plus Basque, Catalan, Russian, Serbian and other non-official translations. Over 6,700 classes are organized hierarchically into eight levels. Relations: Broader Term, Narrower Term and Related Term (BT/NT/RT). Fields covered: law, politics, finance, social issues, transport, environment, geography, science, organizations, etc.

Advantages and Drawbacks. The hierarchical nature of the thesaurus allows query expansion in document retrieval by subject field, without having to use other possible search terms. Multilingual document collections can be searched monolingually, since there is a one-to-one translation for each descriptor. Human indexing, however, is very complex and therefore slow and expensive.
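
The query expansion that the hierarchy makes possible can be sketched roughly as follows. This is a minimal illustration, not the way EuroVoc or any particular retrieval system implements it: the small NARROWER table and its labels are hypothetical stand-ins for real Narrower Term (NT) relations, and expand_query is a name chosen for this sketch.

```python
from collections import deque

# Hypothetical fragment of a thesaurus: descriptor -> narrower terms (NT).
# The labels are illustrative and are not copied from the real EuroVoc.
NARROWER = {
    "transport": ["land transport", "air transport"],
    "land transport": ["rail transport", "road transport"],
    "road transport": ["bus"],
}


def expand_query(descriptor, narrower=NARROWER):
    """Return the descriptor plus every term below it in the hierarchy."""
    expanded = []
    queue = deque([descriptor])
    seen = {descriptor}
    while queue:
        term = queue.popleft()
        expanded.append(term)
        # Follow NT links breadth-first, skipping terms already visited.
        for child in narrower.get(term, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return expanded


print(expand_query("land transport"))
```

A search on one subject field then retrieves documents indexed with any of the narrower descriptors, without the user listing them by hand.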

Automatic Categorization using EuroVoc. An automatic multi-label categorization tool could be used as a support tool for human annotators, helping them to improve speed and consistency. The language-independent output of such software could be used as input to different text mining applications, such as cross-lingual clustering and classification. Multi-label classification improves the indexing process by using more than one label to categorize a single document.

JEX: JRC EuroVoc Indexer. A multi-label categorization tool developed at the European Commission Joint Research Centre (JRC), freely available from http://langtech.jrc.ec.europa.eu/Eurovoc.html. It performs indexing with the EuroVoc descriptors (classes) following a machine-learning approach. Using statistical methods, the system learns from manually indexed documents the associates of each descriptor (the words that are typical of a document belonging to a particular class). When a new document undergoes the indexing procedure and the software finds associates for a EuroVoc class, it assigns the descriptor to the document in question.
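
The idea of learning associates and then assigning descriptors can be sketched in a few lines. This is a deliberately simplified illustration and not the statistical method that JEX actually uses: the ratio criterion, the function names, the thresholds and the toy English training data are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def learn_associates(training_docs, min_ratio=0.8):
    """Learn associates from (tokens, descriptors) pairs indexed by hand.

    Assumed criterion: a word is an associate of a descriptor when at least
    `min_ratio` of the documents containing it carry that descriptor.
    """
    doc_freq = Counter()                # documents containing each word
    class_freq = defaultdict(Counter)   # same count, per descriptor
    for tokens, descriptors in training_docs:
        for word in set(tokens):
            doc_freq[word] += 1
            for descriptor in descriptors:
                class_freq[descriptor][word] += 1
    return {
        descriptor: {w for w, n in counts.items() if n / doc_freq[w] >= min_ratio}
        for descriptor, counts in class_freq.items()
    }


def assign_descriptors(tokens, associates, min_hits=1):
    """Assign every descriptor for which enough associates occur in the text."""
    words = set(tokens)
    return sorted(d for d, assoc in associates.items()
                  if len(words & assoc) >= min_hits)


# Hypothetical, hand-made training set (not real EuroVoc training data).
training = [
    (["bus", "stop", "city"], ["bus"]),
    (["bus", "timetable", "city"], ["bus"]),
    (["energy", "plant", "city"], ["energy"]),
]
associates = learn_associates(training)
print(assign_descriptors(["new", "bus", "stop"], associates))
```

In this toy data the word "city" occurs under both descriptors, so it is typical of neither and is not kept as an associate.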

The Data Set. Geodata.gov.gr aims “to provide a focal point for the aggregation, search, provision and portrayal of open public geospatial information”. It features in the road map to support enforcement of Law 3979/2011 for eGovernment, as a best-practice example of the application of Information & Communication Technologies (ICT) in public administration and as an open data repository for the provision of geospatial information. It is the first attempt at the free distribution of open geospatial data to citizens or enterprises in Greece. It is a challenging case: using EuroVoc descriptors for indexing would allow the creation of a common vocabulary and could constitute a first step towards interlinking Greek geospatial data with similar documents.

The nature of the data. A collection of texts containing descriptions of open geospatial data (http://www.geodata.gov.gr/geodata). The texts vary in length from 22 words (approx. 150 characters) to 350 words (approx. 2,000 characters). Four to five descriptors, some of them in English, are already assigned manually to these documents, so the results of automatic indexing with EuroVoc descriptors could be compared with the initial characterization of the documents in question.

Text registered under the label “Υποδομές και Επικοινωνίες” (“Infrastructures and Communications”). Document title: “Σταθμοί και στάσεις των Αστικών Συγκοινωνιών της Αθήνας” (“Stations and stops of urban transport in Athens”). It describes geospatial data on the stations and stops of buses, trolleybuses, trams, the metro and the suburban railway in Athens.

Towards automatic indexing. Tool: JEX (JRC EuroVoc Indexer) ver. 1.0 for the Greek language (http://langtech.jrc.ec.europa.eu/Eurovoc.html). Data: 15 texts under the labels “Environment”, “Culture”, “Energy” and “Infrastructure and Communications” from the Greek geodata web site (http://geodata.gov.gr). The documents were initially indexed using JEX without changing any of its parameters.

Automatic descriptor assignment

Manual vs. Automatic Indexing. Keywords manually assigned to the document: “Transport networks”, “αστικές συγκοινωνίες” (“urban transport”), “λεωφορεία” (“buses”), “τρόλλευ” (“trolley”), “τραμ” (“tram”). Descriptors automatically assigned to the document (default number): “bus”, “disclosure of information”, “data processing”, “national implementation of community law”, “merge control”, “competition”. Related terms: “bus station”, “electronic document”. JEX field terms: Transport (appears in the title of the document), Communications (related to the document category).

Associates. Descriptor: “bus”. Associates: “έκτακτες” (“non-regular”, nominative/accusative plural), “στάσης” (“stop”, genitive singular), “λεωφορείων” (“bus”, genitive plural), “στάσεις” (“stop”, nominative/accusative plural).

Evaluation (1). Keywords in common between manually assigned and JEX-assigned keywords, by document category: Environment; Infrastructures and Communications 3; Energy 1; Culture.
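
The comparison behind such counts can be sketched as a set intersection. The snippet below is an illustrative assumption about how keywords might be compared (exact match after case folding); it is not the procedure used in the study and does not reproduce the reported figures. It uses the English glosses of the keywords given earlier for the Athens transport document, and the function names are hypothetical.

```python
def normalise(keyword):
    """Case-fold and collapse whitespace so trivial differences do not count."""
    return " ".join(keyword.lower().split())


def common_keywords(manual, automatic):
    """Keywords present in both lists after normalisation (exact match only)."""
    return {normalise(k) for k in manual} & {normalise(k) for k in automatic}


# English glosses of the keywords listed for the Athens transport document.
manual = ["Transport networks", "urban transport", "buses", "trolley", "tram"]
automatic = [
    "bus",
    "disclosure of information",
    "data processing",
    "national implementation of community law",
    "merge control",
    "competition",
]

shared = common_keywords(manual, automatic)
print(len(shared), sorted(shared))
```

Under exact matching, “buses” and “bus” count as different keywords, which is one reason a stricter or looser matching rule can change the totals considerably.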

Evaluation (2). The keywords produced by the automatic indexing process did not fully match the keywords already assigned manually to the documents. We cannot draw safe conclusions because of the small size of our corpus. The results are somewhat expected, since the JEX software was used as initially trained, without taking the particular data into account. JEX should be trained with geodata descriptors in order to assign keywords more accurately to documents containing geospatial information.

Evaluation (3). The question remains open regarding the assignment of general terms that could characterize most documents in the collection. The software correctly assigned such terms to certain documents, but the initial annotation had not taken these terms into account. The existence of different inflected word-forms in the text implies that automatic indexing of texts written in a highly inflected language such as Greek would be greatly facilitated by tools such as lemmatizers in a linguistic preprocessing phase.
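
The effect of such a preprocessing step can be illustrated with a toy lookup table. This is only a sketch: a real Greek lemmatizer relies on morphological analysis, whereas the hypothetical LEMMAS dictionary below simply hard-codes the inflected forms that appear among the associates of the descriptor “bus”.

```python
# Hypothetical lookup table standing in for a real Greek lemmatizer.
LEMMAS = {
    "στάσης": "στάση",         # "stop", genitive singular
    "στάσεις": "στάση",        # "stop", nominative/accusative plural
    "λεωφορείων": "λεωφορείο",  # "bus", genitive plural
    "έκτακτες": "έκτακτος",     # "non-regular", nominative/accusative plural
}


def lemmatize(tokens, lemmas=LEMMAS):
    """Map each token to its lemma; unknown tokens are only lower-cased."""
    return [lemmas.get(token.lower(), token.lower()) for token in tokens]


tokens = ["στάσεις", "στάσης", "λεωφορείων", "τραμ"]
print(lemmatize(tokens))
```

After this step the two forms of “stop” collapse into a single entry, so their evidence for a descriptor is pooled instead of being split across word-forms.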

Conclusions. This was a first attempt to automatically assign EuroVoc descriptors to Greek open government data. The automatic indexing task was performed using the JEX multi-label categorization software on a small corpus of geospatial data descriptions available from the geodata.gov.gr web site. General terms were more or less correctly assigned to the documents, but there were practically no words in common between the two sets of keywords.

Future Work. Repeat the experiment after training the software with the appropriate keywords. Examine different sets of stop-words that may result in better performance of the software. Develop linguistic pre-processing tools, such as lemmatizers, that take into account linguistic knowledge about a highly inflected language such as Greek.

Thank you!