Eleni Galiotou, Dept. of Informatics

1 A Preliminary Investigation into the Automatic EuroVoc Indexing of Greek Documents
Eleni Galiotou, Dept. of Informatics, Technological Educational Institute of Athens
PCI 2014, Athens, Oct. 2-4, 2014

2 Keyword Identification
Search and retrieval in modern information retrieval systems: keywords, key-phrases
Keywords (Lancaster 1998):
- Keyword extraction
- Keyword assignment: identified in a controlled vocabulary list (thesaurus); descriptors do not necessarily appear explicitly in the text

3 Thesauri
Natural Language
- WordNet (English): lexical semantic relations structured around an exhaustive list of synonym sets
- EuroWordNet (European languages)
- BalkaNet (Balkan languages)
Conceptual
- Descriptors: abstract conceptual terms
- EuroVoc: multilingual, multidisciplinary thesaurus
  - covers the activities of the EU (in particular those of the European Parliament)
  - used by the European Parliament, the European Commission’s Publications Office and many other institutions

4 EuroVoc
Terms in at least 27 languages (translated one-to-one):
- 23 official languages of the EU (Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish)
- Basque, Catalan, Russian, Serbian and other non-official translations
Over 6,700 classes organized hierarchically into eight levels
Relations: Broader Term, Narrower Term and Related Term (BT/NT/RT)
Fields covered: law, politics, finance, social issues, transport, environment, geography, science, organizations, etc.

5 Advantages - Drawbacks
Advantages:
- The hierarchical nature allows query expansion in document retrieval by subject field, without having to use other possible search terms
- Multilingual document collections can be searched monolingually, since there is a one-to-one translation for each descriptor
Drawback:
- Human indexing is very complex and therefore slow and expensive
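
The query-expansion advantage can be illustrated with a minimal Python sketch. The toy hierarchy below is hypothetical: its labels are illustrative and are not actual EuroVoc entries, and `NARROWER` and `expand_query` are names assumed for this example only.

```python
# Hypothetical toy thesaurus: each descriptor maps to its narrower terms (NT).
# The labels are illustrative only and are not actual EuroVoc entries.
NARROWER = {
    "transport": ["land transport", "air transport"],
    "land transport": ["bus", "rail"],
    "air transport": [],
    "bus": [],
    "rail": [],
}


def expand_query(descriptor, narrower=NARROWER):
    """Return the descriptor followed by all of its narrower terms, depth-first."""
    expanded = [descriptor]
    for child in narrower.get(descriptor, []):
        expanded.extend(expand_query(child, narrower))
    return expanded
```

A search for the broad subject "transport" would then also retrieve documents indexed only with a narrower descriptor such as "bus", without the user having to list those terms.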

6 Automatic Categorization using EuroVoc
- Automatic multi-label categorization tool: could be used as a support tool for human annotators, helping them to improve speed and consistency
- Language-independent output of such software: could be used as input to different text mining applications, such as cross-lingual clustering and classification
- Multi-label classification: improves the indexing process by using more than one label to categorize a single document

7 JEX – JRC EuroVoc Indexer
- Multi-label categorization tool developed at the European Commission Joint Research Centre (JRC)
- Freely available from
- Performs indexing with the EuroVoc descriptors (classes) following a machine-learning approach:
  - Using statistical methods, the system learns from manually indexed documents the associates of each descriptor (the words that are typical of a document belonging to a particular class)
  - When a new document undergoes the indexing procedure and the software finds associates for a EuroVoc class, it assigns the descriptor to the document in question
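
The associate-based idea can be sketched in a few lines of Python. This is a simplified illustration and not JEX's actual algorithm: the scoring formula, the `top_n` and `min_hits` parameters, the function names and the `TRAINING` data are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (tokens, manually assigned descriptors).
TRAINING = [
    (["bus", "stop", "route", "data"], ["bus"]),
    (["bus", "stop", "timetable", "data"], ["bus"]),
    (["river", "water", "quality", "data"], ["water"]),
]


def learn_associates(training_docs, top_n=2):
    """Pick, for each descriptor, the top_n words most typical of its documents."""
    overall = Counter()
    per_class = defaultdict(Counter)
    for tokens, descriptors in training_docs:
        overall.update(tokens)
        for descriptor in descriptors:
            per_class[descriptor].update(tokens)
    associates = {}
    for descriptor, counts in per_class.items():
        # Stand-in score: in-class count squared divided by corpus count, which
        # favours words frequent in the class and rare elsewhere. Ties are
        # broken alphabetically so the result is deterministic.
        ranked = sorted(counts, key=lambda w: (-(counts[w] ** 2) / overall[w], w))
        associates[descriptor] = set(ranked[:top_n])
    return associates


def assign_descriptors(tokens, associates, min_hits=2):
    """Assign every descriptor with at least min_hits associates in the document."""
    present = set(tokens)
    return sorted(d for d, words in associates.items() if len(words & present) >= min_hits)
```

Because a document can contain associates of several descriptors at once, the same procedure naturally yields more than one label per document.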

8 The Data set
Geodata.gov.gr:
- aims “to provide a focal point for the aggregation, search, provision and portrayal of open public geospatial information”
- is in the road map to support enforcement of Law 3979/2011 for eGovernment, as a best-practice example for the application of Information & Communication Technologies (ICT) in public administration and as an open data repository for the provision of geospatial information
- is the first attempt at the free distribution of open geospatial data to citizens or enterprises in Greece
A challenging case for the use of EuroVoc descriptors for indexing:
- would allow the creation of a common vocabulary
- could constitute a first step towards the interlinking of Greek geospatial data with similar documents

9 The nature of the data
- Collection of texts containing open geospatial data descriptions
- Texts of various lengths, ranging from 22 words (approx characters) to 350 words (approx characters)
- Four to five descriptors are already assigned to these documents manually, some of them in English
- So the results of the automatic indexing with EuroVoc descriptors could be compared to the initial characterization of the documents in question

10 Text registered under the label “Υποδομές και Επικοινωνίες” (“Infrastructures and Communications”)
Document title: “Σταθμοί και στάσεις των Αστικών Συγκοινωνιών της Αθήνας” (“Stations and stops of urban transport in Athens”)
Describes geospatial data on the stations and stops of buses, trolleybuses, tram, metro and suburban railway in Athens.

11 Towards an automatic indexing
- JEX (JRC EuroVoc Indexer) ver. 1.0 for the Greek language
- 15 texts under the labels “Environment”, “Culture”, “Energy”, “Infrastructures and Communications” from the Greek geodata web site
- Initial indexing of documents using JEX without changing any of its parameters

12 Automatic descriptor assignment

13 Manual vs. Automatic Indexing
Keywords manually assigned to the document: “Transport networks”, “αστικές συγκοινωνίες” (“urban transport”), “λεωφορεία” (“buses”), “τρόλλευ” (“trolley”), “τραμ” (“tram”)
Descriptors automatically assigned to the document (default number): “bus”, “disclosure of information”, “data processing”, “national implementation of community law”, “merge control”, “competition”
Related terms: “bus station”, “electronic document”
JEX field terms: Transport (appears in the title of the document), Communications (related to the document category)

14 Associates
Descriptor “bus”. Associates:
- “έκτακτες” (“non-regular”, nominative/accusative plural)
- “στάσης” (“stop”, genitive singular)
- “λεωφορείων” (“bus”, genitive plural)
- “στάσεις” (“stop”, nominative/accusative plural)

15 Evaluation (1)
Document category                     Keywords in common (manually assigned vs. JEX-assigned)
Environment
Infrastructures and Communications    3
Energy                                1
Culture
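
A minimal sketch of one way to count keywords in common, assuming exact matching on the English glosses after lower-casing. The transcript does not say how the authors matched keywords, so `keywords_in_common` and the matching rule are assumptions; the two lists are the ones given for the Athens urban transport document.

```python
# English glosses of the manually assigned keywords for the example document.
MANUAL = ["Transport networks", "urban transport", "buses", "trolley", "tram"]

# Descriptors reported as automatically assigned to the same document.
AUTOMATIC = [
    "bus",
    "disclosure of information",
    "data processing",
    "national implementation of community law",
    "merge control",
    "competition",
]


def normalize(keyword):
    """Lower-case a keyword and strip surrounding whitespace."""
    return keyword.strip().lower()


def keywords_in_common(manual, automatic):
    """Count keywords that appear in both lists after normalization."""
    return len({normalize(k) for k in manual} & {normalize(k) for k in automatic})
```

Under this strict rule the example document scores 0, because “buses” and “bus” are different strings; a looser rule, for instance matching on lemmas, would count them as one match.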

16 Evaluation (2)
- Keywords produced by the automatic indexing process did not fully match the keywords that were already manually assigned to the documents
- We cannot draw safe conclusions due to the small size of our corpus
- Results are somewhat expected, since the JEX software was used as it was initially trained, without taking into account the particular data
- JEX should be trained with geodata descriptors in order to meet the requirements of a more accurate keyword assignment to documents containing geospatial information

17 Evaluation (3)
- The question remains open regarding the assignment of general terms that could characterize most documents in the collection: the software correctly assigned such terms to certain documents, but the initial annotation had not taken these terms into account
- The existence of different inflected word-forms in the text implies that automatic indexing of texts written in a highly inflected language such as Greek would be greatly facilitated by the use of tools such as lemmatizers in a linguistic preprocessing phase
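
The effect of a lemmatizing preprocessing step can be sketched as follows. The lookup table is hand-written for illustration and covers only the four associates listed for the descriptor “bus”; it is not a real lemmatizer, and `LEMMAS` and `lemmatize` are names assumed for this example.

```python
# Hand-written lookup for illustration only; a real lemmatizer would cover the
# whole vocabulary. Keys are the inflected associates listed for "bus".
LEMMAS = {
    "στάσης": "στάση",         # "stop", genitive singular
    "στάσεις": "στάση",        # "stop", nominative/accusative plural
    "λεωφορείων": "λεωφορείο",  # "bus", genitive plural
    "έκτακτες": "έκτακτος",     # "non-regular", nominative/accusative plural
}


def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each known inflected form with its lemma; keep unknown tokens."""
    return [lemmas.get(token, token) for token in tokens]
```

After this step the two forms of “stop” collapse into a single entry, so the four associates reduce to three distinct lemmas.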

18 Conclusions
- A first attempt to automatically assign EuroVoc descriptors to Greek open government data
- The task of automatic indexing was performed using the JEX multi-label categorization software on a small corpus of geospatial data descriptions available from the geodata.gov.gr web site
- General terms were more or less correctly assigned to the documents, but there were practically no words in common between the two sets of keywords

19 Future Work
- Repeat our experimentation, training the software with the appropriate keywords
- Examine different sets of stop-words that may result in better performance of the software
- Develop linguistic pre-processing tools such as lemmatizers, which will take into account linguistic knowledge of a highly inflected language such as Greek

20 Thank you!

