Maintaining Information Integration Ontologies Georgios Paliouras, Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios.

1 Maintaining Information Integration Ontologies
Georgios Paliouras
Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios Vouros
Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/skel
DCAG, Ulm, December 6, 2003

2 Structure of the talk
- Information integration in CROSSMARC
- Semi-automated ontology enrichment
- Clustering “synonyms”
- Conclusions

3 CROSSMARC Objectives
Develop technology for Information Integration that can:
- crawl the Web for interesting Web pages,
- extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
- process Web pages written in several languages,
- be customized semi-automatically to new domains and languages,
- deliver integrated information according to personalized profiles.

4 CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; the only label preserved in the transcript is “Ontology”]

5 CROSSMARC Ontology
- Meta-conceptual layer: embodies domain-independent semantics
- Conceptual layer: contains relevant concepts of each domain
- Instance layer: contains relevant individuals of each domain
- Lexical layer: language-dependent realizations of domain information

6 CROSSMARC Ontology
[Figure: fragment of the ontology with its lexicons]
Ontology: … Laptops, Processor, Processor Name, Intel Pentium 3 …
Lexicon: Intel Pentium III, Pentium III, P3, PIII
Greek Lexicon: Όνομα Επεξεργαστή (“Processor Name”)
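The layered structure on this slide can be pictured with a small data model. The following is only a minimal sketch, not the CROSSMARC format: the class names, the `find_instance` helper and the language codes `"en"` and `"el"` are assumptions made for illustration, and the meta-conceptual layer is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    # Instance layer: one individual of the domain (hypothetical structure).
    name: str
    # Lexical layer: language code -> known surface forms of this instance.
    lexicon: dict[str, set[str]] = field(default_factory=dict)


@dataclass
class Concept:
    # Conceptual layer: a concept of the domain (hypothetical structure).
    name: str
    # Lexical layer: language code -> label of the concept in that language.
    labels: dict[str, str] = field(default_factory=dict)
    instances: list[Instance] = field(default_factory=list)


def find_instance(concept, surface, lang="en"):
    """Return the instance that lists `surface` as a surface form, or None."""
    wanted = surface.casefold()
    for inst in concept.instances:
        forms = {inst.name} | inst.lexicon.get(lang, set())
        if wanted in {f.casefold() for f in forms}:
            return inst
    return None


# The example from the slide, encoded with the assumed structure.
processor_name = Concept(
    name="Processor Name",
    labels={"en": "Processor Name", "el": "Όνομα Επεξεργαστή"},
    instances=[
        Instance(
            "Intel Pentium 3",
            {"en": {"Intel Pentium III", "Pentium III", "P3", "PIII"}},
        )
    ],
)
```

With this model, `find_instance(processor_name, "piii")` resolves to the `Intel Pentium 3` instance, while a surface form that is not in the lexicon, such as `"Pentium 2"`, returns `None`.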

8 Ontology Enrichment
An ontology captures knowledge in a static way: it is a snapshot of the knowledge that governs a certain domain of interest, taken from a particular point of view in a specific time period.
[Diagram: evolving nature of the ontology. Ontology Enrichment is part of Ontology Maintenance. Labels: Conceptualization (T-box), Instances (A-box)]

9 Ontology Enrichment
Highly evolving domain (e.g. laptop descriptions):
- New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
- New surface appearances of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
We concentrate on instances (knowledge of the domain of interest).
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

10 Ontology Enrichment
[Diagram: the enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; Information extraction (machine learning); Additional annotations; Validation (Domain Expert); Ontology Enrichment / Population]
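Read as a loop, the cycle in this diagram could look like the sketch below. It is an illustration under strong assumptions, not the system's method: the machine-learning information extraction step is replaced by a toy rule that remembers the word preceding each known instance, the domain expert is replaced by a `validate` callback, and all function names are hypothetical.

```python
def annotate(tokens, known):
    """Annotate the corpus: positions of tokens that are known instances."""
    return [i for i, tok in enumerate(tokens) if tok in known]


def learn_contexts(tokens, positions):
    """Toy stand-in for training: remember the word before each annotation."""
    return {tokens[i - 1] for i in positions if i > 0}


def extract(tokens, contexts, known):
    """Toy stand-in for extraction: unknown tokens that follow a learned word."""
    return {
        tokens[i]
        for i in range(1, len(tokens))
        if tokens[i - 1] in contexts and tokens[i] not in known
    }


def enrich(tokens, known, validate, max_iters=3):
    """Repeat annotate, learn, extract, validate until nothing new is accepted."""
    known = set(known)
    added_per_iter = []
    for _ in range(max_iters):
        contexts = learn_contexts(tokens, annotate(tokens, known))
        # `validate` plays the role of the domain expert (assumption).
        accepted = {c for c in extract(tokens, contexts, known) if validate(c)}
        added_per_iter.append(len(accepted))
        if not accepted:
            break
        known |= accepted
    return known, added_per_iter


corpus = "cpu P3 ram 256MB cpu P4 processor P4 processor Athlon".split()
instances, added = enrich(corpus, {"P3"}, validate=lambda candidate: True)
```

In this toy run the first pass learns the context `cpu` and finds `P4`, the second pass learns `processor` from the newly annotated `P4` and finds `Athlon`, and the third pass finds nothing and stops, which is the behaviour the per-iteration columns of a results table would record.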

11 Results: Annotation phase only
[Results figure not preserved in the transcript]

12 Results: Full enrichment cycle

25% of the initial ontology
                    Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name      3                   15                 3        4        3
Cdrom Speed         2                   8                  3        3        -
Screen Resolution   2                   7                  0        -        -
RAM                 2                   8                  5        0        -
Processor Speed     4                   12                 7        0        -
HDD                 2                   8                  5        0        -

50% of the initial ontology
                    Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name      6                   15                 3        4        2
Cdrom Speed         5                   8                  3        -        -
Screen Resolution   3                   7                  2        1        1
RAM                 4                   8                  3        0        -
Processor Speed     6                   12                 6        -        -
HDD                 4                   8                  3        0        -

75% of the initial ontology
                    Initial Instances   Target Instances   Iter-0   Iter-1   Iter-2
Processor Name      8                   15                 4        3        -
Cdrom Speed         6                   8                  2        -
Screen Resolution   5                   7                  2        -
RAM                 6                   8                  2        -
Processor Speed     9                   12                 2        0
HDD                 6                   8                  2        -
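One way to read these figures is as coverage of the target set. The sketch below assumes that each Iter column counts the instances newly added in that iteration, and that the rows with the smallest initial counts belong to the 25% configuration; both are interpretations of the slide, and the `coverage` helper is hypothetical.

```python
# Rows read as the 25% configuration: (initial, target, new instances per iteration).
rows_25 = {
    "Processor Name": (3, 15, [3, 4, 3]),
    "Cdrom Speed": (2, 8, [3, 3]),
    "Screen Resolution": (2, 7, [0]),
    "RAM": (2, 8, [5, 0]),
    "Processor Speed": (4, 12, [7, 0]),
    "HDD": (2, 8, [5, 0]),
}


def coverage(initial, target, added):
    """Fraction of the target instances known after all iterations."""
    return (initial + sum(added)) / target


for feature, (initial, target, added) in rows_25.items():
    print(f"{feature}: {coverage(initial, target, added):.0%}")
```

Under that reading, Cdrom Speed reaches its full target of 8 instances from only 2 initial ones, while Screen Resolution stalls at 2 of 7 because the first iteration adds nothing.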

14 Enrichment with synonyms
- The number of instances for validation increases with the size of the corpus and the ontology.
- So far, only enrichment with instances that participate in the ‘instance of’ relationship has been supported.
- There is a need to support the enrichment of the ‘synonymy’ relationship (in different languages and domains).
We approach this problem using … ONTOLOGY LEARNING

15 Enrichment with synonyms
Discover automatically the different surface appearances of an instance (the CROSSMARC synonymy relationship).
Issues to be handled:
- Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
- Orthographical: ‘Intel p3’ - ‘intell p3’
- Lexicographical: ‘Hewlett Packard’ - ‘HP’
- Combination: ‘Intell Pentium 3’ - ‘P III’

16 Compression-based Clustering
- COCL (COmpression-based CLustering) is a model-based algorithm that discovers typographic similarities between strings (sequences of letters) over an alphabet (ASCII characters), using a new score function, CCDiff.
- CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added to it.
- Huffman trees are used as models of the clusters.
- COCL iteratively computes the CCDiff of each new string against each cluster, implementing a hill-climbing search.
- The new string is added to the closest cluster, or a new cluster is created when its CCDiff exceeds a threshold.
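To make the CCDiff definition concrete, here is a minimal sketch of one plausible reading. It assumes the code length of a cluster is the total number of bits needed to encode all characters of its strings with a Huffman code built from the cluster's own character frequencies, and that a one-symbol alphabet costs one bit per character. The slide does not spell out these details, so the function names and these choices are assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    if len(freqs) == 1:
        # Assumption: a single-symbol alphabet still costs one bit per symbol.
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so the heap never compares the lists
    heap = [(f, next(tie), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), syms1 + syms2))
    return lengths


def code_length(strings):
    """Bits needed to encode all strings with the cluster's own Huffman code."""
    freqs = Counter("".join(strings))
    if not freqs:
        return 0
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[sym] * lengths[sym] for sym in freqs)


def ccdiff(cluster, candidate):
    """Increase in the cluster's code length when `candidate` is added."""
    return code_length(cluster + [candidate]) - code_length(cluster)
```

A candidate that reuses the cluster's characters is cheap to add, while one that introduces new characters is expensive: `ccdiff(["abab"], "ab")` is 2, whereas `ccdiff(["abab"], "cd")` is 8.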

17 Compression-based Clustering
Given CLUSTERS and candidate INSTANCES
while INSTANCES is not empty do
    for each instance in INSTANCES
        compute CCDiff for every cluster in CLUSTERS
    end for each
    select instance from INSTANCES that maximizes the difference between its two smallest CCDiff's
    if min(CCDiff) of instance > threshold
        create new cluster
        assign instance to new cluster
        remove instance from INSTANCES
        calculate code model for the new cluster
        add new cluster to CLUSTERS
    else
        assign instance to cluster of min(CCDiff)
        remove instance from INSTANCES
        recalculate code model for the cluster
    end if
end while
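The loop on this slide can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: the scoring function is passed in as a parameter, the handling of fewer than two clusters (margin 0, and a new cluster when none exist) is an assumption, and `toy_score` is a deliberately simple stand-in that counts unseen characters instead of measuring code length.

```python
def cocl(clusters, instances, score, threshold):
    """Assign each instance to its closest cluster, or start a new cluster."""
    clusters = [list(c) for c in clusters]
    pending = list(instances)
    while pending:
        best = None  # (margin, instance, index of closest cluster, lowest score)
        for inst in pending:
            scores = sorted((score(c, inst), i) for i, c in enumerate(clusters))
            if not scores:
                # Assumption: with no clusters yet, the instance seeds one.
                margin, closest, low = 0, None, float("inf")
            else:
                low, closest = scores[0]
                # Gap between the two smallest scores (0 if only one cluster).
                margin = scores[1][0] - low if len(scores) > 1 else 0
            if best is None or margin > best[0]:
                best = (margin, inst, closest, low)
        _, inst, closest, low = best
        pending.remove(inst)
        if closest is None or low > threshold:
            clusters.append([inst])  # create a new cluster for this instance
        else:
            clusters[closest].append(inst)
        # No cached model here: `score` rebuilds it from the cluster members.
    return clusters


def toy_score(cluster, candidate):
    """Stand-in scorer (NOT CCDiff): distinct characters new to the cluster."""
    return len(set(candidate) - set("".join(cluster)))


result = cocl(
    [["intel p3"], ["windows 98"]],
    ["intel piii", "windows 98 se", "amd"],
    score=toy_score,
    threshold=1,
)
```

Selecting the instance with the largest gap between its two best clusters means the least ambiguous assignments are made first, so later, harder decisions see better-populated clusters. In the toy run, `windows 98 se` and `intel piii` join their obvious clusters and `amd` exceeds the threshold everywhere, so it founds a third cluster.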

18 Results - Evaluation

Dataset characteristics
Cluster’s Name     Cluster’s Type       Instances
Amd                Processor Name       19
Intel              Processor Name       8
Hewlett-Packard    Manufacturer Name    3
Fujitsu-Siemens    Manufacturer Name    5
Windows 98         Operating System     10
Windows 2000       Operating System     3

Concept Generation Scenario
We incrementally hide one cluster at a time and measure the ability of the algorithm to discover the hidden clusters.
Recall: 100%   Precision: 75%

Instance Matching Scenario
Instances kept (%)   Correct   Accuracy (%)
90                   3         100
80                   11        100
70                   15        100
60                   19        100
50                   23        95.6
40                   29        96.5
30                   34        94.1

20 Conclusions
- CROSSMARC is a complete multi-lingual information integration system.
- Ontology Maintenance is crucial in evolving domains.
- Ontology Enrichment helps the adaptation of the system to new domains, saving time and effort.
- Machine-learning based information extraction can assist the discovery of new instances.
- Compression-based clustering discovers string similarities that support the enrichment with different surface appearances of an instance (“synonyms”).


