Maintaining Information Integration Ontologies
Georgios Paliouras, Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sigletos, Georgios Vouros
Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
DCAG, Ulm, December 6, 2003

Structure of the talk
- Information integration in CROSSMARC
- Semi-automated ontology enrichment
- Clustering “synonyms”
- Conclusions

CROSSMARC Objectives
Develop technology for Information Integration that can:
- crawl the Web for interesting Web pages,
- extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
- process Web pages written in several languages,
- be customized semi-automatically to new domains and languages,
- deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; surviving label: Ontology]

CROSSMARC Ontology
- Meta-conceptual layer: embodies domain-independent semantics.
- Conceptual layer: contains relevant concepts of each domain.
- Instance layer: contains relevant individuals of each domain.
- Lexical layer: language-dependent realizations of domain information.

CROSSMARC Ontology
[Figure: example fragment of the ontology and its lexicons]
Ontology: … Laptops, Processor, Processor Name, Intel Pentium 3 …
Lexicon: Intel Pentium III, Pentium III, P3, PIII
Greek Lexicon: Όνομα Επεξεργαστή
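The layered ontology and the laptop example on these slides can be pictured as a small data structure. The following is only a minimal sketch: the class names, the language codes "en" and "el", and the helper `find_instance` are hypothetical, and the meta-conceptual layer is not modelled. It assumes that the four English surface forms belong to the instance 'Intel Pentium 3' and that the Greek entry (which translates to "Processor Name") belongs to the concept of that name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    """Instance layer: an individual of the domain."""
    name: str
    # Lexical layer: language code -> surface forms (codes are assumed here).
    lexicon: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Concept:
    """Conceptual layer: a domain concept with sub-concepts and instances."""
    name: str
    lexicon: Dict[str, List[str]] = field(default_factory=dict)
    children: List["Concept"] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)


def find_instance(concept: Concept, surface: str, lang: str) -> Optional[str]:
    """Return the canonical name of the instance whose lexicon lists `surface`."""
    for inst in concept.instances:
        if surface == inst.name or surface in inst.lexicon.get(lang, []):
            return inst.name
    for child in concept.children:
        hit = find_instance(child, surface, lang)
        if hit is not None:
            return hit
    return None


# The fragment shown on the slide, under the assumptions stated above.
pentium3 = Instance("Intel Pentium 3",
                    {"en": ["Intel Pentium III", "Pentium III", "P3", "PIII"]})
processor_name = Concept("Processor Name",
                         lexicon={"el": ["Όνομα Επεξεργαστή"]},
                         instances=[pentium3])
laptops = Concept("Laptops",
                  children=[Concept("Processor", children=[processor_name])])
```

With this layout, a lookup such as `find_instance(laptops, "PIII", "en")` resolves a known surface form to its instance, while an unseen string such as 'Pentium 2' returns `None`, which is the case that ontology enrichment has to handle.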

Ontology Enrichment
An ontology captures knowledge in a static way: it is a snapshot of knowledge, taken from a particular point of view, that governs a certain domain of interest in a specific time period.
[Diagram labels: Evolving nature of ontology; Ontology Maintenance; Ontology Enrichment (part of Ontology Maintenance); Conceptualization (T-box); Instances (A-box)]

Ontology Enrichment
Highly evolving domain (e.g. laptop descriptions):
- New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not exist in the ontology.
- New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
We concentrate on instances (knowledge of the domain of interest).
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

Ontology Enrichment
[Figure: the enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; Information extraction; machine learning; Additional annotations; Validation; Domain Expert; Ontology Enrichment / Population]
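One way to read the labels of this figure is as a loop: annotate the corpus with the instances already in the ontology, learn an extractor from those annotations, let a domain expert validate the newly extracted candidates, add the accepted ones to the ontology, and repeat. The sketch below follows that reading only; the learner is a deliberately trivial stand-in (it memorises the word preceding each known instance) and not the machine-learning method used in CROSSMARC. All function names, the toy corpus, and the `validate` callback are hypothetical.

```python
def learn_cues(corpus, known):
    """Toy 'training' step: collect the words that precede known instances."""
    cues = set()
    for sentence in corpus:
        tokens = sentence.split()
        for prev, tok in zip(tokens, tokens[1:]):
            if tok in known:
                cues.add(prev)
    return cues


def extract_candidates(corpus, cues, known):
    """Toy 'extraction' step: propose unknown words that follow a learned cue."""
    found = set()
    for sentence in corpus:
        tokens = sentence.split()
        for prev, tok in zip(tokens, tokens[1:]):
            if prev in cues and tok not in known:
                found.add(tok)
    return found


def enrich(corpus, seed, validate, max_iters=10):
    """Iterate annotate -> learn -> extract -> validate -> enrich.

    `validate` stands in for the domain expert: it returns True for
    candidates that should enter the ontology.
    """
    known = set(seed)
    history = []  # instances accepted in each iteration
    for _ in range(max_iters):
        cues = learn_cues(corpus, known)
        candidates = extract_candidates(corpus, cues, known)
        accepted = {c for c in candidates if validate(c)}
        if not accepted:
            break  # nothing new was found, so the cycle stops
        known |= accepted
        history.append(accepted)
    return known, history


# Hypothetical toy corpus; the expert is simulated by a gold set.
corpus = ["processor PIII 1GHz", "processor P4 2GHz", "cpu P4 fast", "cpu Celeron"]
gold = {"P4", "Celeron"}
known, history = enrich(corpus, {"PIII"}, lambda c: c in gold)
```

In this toy run the first iteration finds 'P4' through the cue "processor", and only after 'P4' has been added does the cue "cpu" become available to find 'Celeron'. That dependence of later iterations on earlier enrichment is the point of running the cycle more than once.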

Results: Annotation phase only
[Results figure not preserved in the transcript]

Results: Full enrichment cycle
[Three tables, labelled in order "% of the initial ontology" (percentage missing), "50% of the initial ontology" and "75% of the initial ontology". Columns in each table: Initial Instances, Target Instances, Iter-0, Iter-1, Iter-2. Cell values are run together or missing in the transcript and are reproduced as found.]
First table: processorName; cdromSpeed; screenResolution 37211; Ram; Processor Speed; HDD
Second table: processorName; cdromSpeed 682-; screenResolution 572-; RAM 682-; Processor Speed 91220; HDD 682-
Third table: Processor Name; Cdrom Speed 2833-; Screen Resolution 270--; RAM 2850-; Processor Speed; HDD

Enrichment with synonyms
- The number of instances for validation increases with the size of the corpus and the ontology.
- So far, only enrichment with instances that participate in the ‘instance of’ relationship has been supported.
- There is a need for supporting the enrichment of the ‘synonymy’ relationship (in different languages and domains).
We approach this problem using … ONTOLOGY LEARNING

Enrichment with synonyms
Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
Issues to be handled:
- Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
- Orthographical: ‘Intel p3’ - ‘intell p3’
- Lexicographical: ‘Hewlett Packard’ - ‘HP’
- Combination: ‘Intell Pentium 3’ - ‘P III’

Compression-based Clustering
- COCL (COmpression-based CLustering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, here letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
- CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added.
- Huffman trees are used as models of the clusters.
- COCL iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search.
- The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
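The CCDiff score described above can be illustrated with a short sketch. It assumes that the code length of a cluster is the number of bits needed to encode all characters of its instances with a Huffman code built from the cluster's own character frequencies, and that the cost of storing the tree itself is ignored; the slide does not say whether the original score includes it. The function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Map each symbol to the length (in bits) of its Huffman codeword."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence (a convention).
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def code_length(strings):
    """Bits needed to encode all characters of `strings` with one Huffman code."""
    freqs = Counter("".join(strings))
    if not freqs:
        return 0
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


def ccdiff(cluster, candidate):
    """Growth in the cluster's code length caused by adding `candidate`."""
    return code_length(cluster + [candidate]) - code_length(cluster)
```

For example, for the cluster `["aaaa"]` the candidate `"aa"` adds 2 bits, while `"bc"` adds 4 bits because it introduces two new symbols and lengthens the code. A string that resembles the cluster's existing instances therefore gets a smaller CCDiff.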

Compression-based Clustering
Given CLUSTERS and candidate INSTANCES
while INSTANCES do
    for each instance in INSTANCES
        compute CCDiff for every cluster in CLUSTERS
    end for each
    select instance from INSTANCES that maximizes the difference between its two smallest CCDiff’s
    if min(CCDiff) of instance > threshold
        create new cluster
        assign instance to new cluster
        remove instance from INSTANCES
        calculate code model for the new cluster
        add new cluster to CLUSTERS
    else
        assign instance to cluster of min(CCDiff)
        remove instance from INSTANCES
        recalculate code model for the cluster
    end if
end while
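The pseudocode on this slide can be turned into a small runnable sketch. It is not the authors' implementation: the function names are hypothetical, the code length counts only the Huffman-encoded characters (not the tree), and the code model is simply recomputed from the cluster's strings on every call instead of being cached. When fewer than two clusters exist, the margin between the two smallest CCDiffs is taken to be 0, which is an assumption the slide does not cover.

```python
import heapq
from collections import Counter


def code_length(strings):
    """Total Huffman-coded size in bits of all characters in `strings`.

    The cost of a Huffman code equals the sum of the weights of all
    merged nodes, so the tree itself never has to be built.
    """
    heap = list(Counter("".join(strings)).values())
    if len(heap) <= 1:
        return sum(heap)  # empty cluster: 0; single symbol: 1 bit per character
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def ccdiff(cluster, candidate):
    """Growth in the cluster's code length caused by adding `candidate`."""
    return code_length(cluster + [candidate]) - code_length(cluster)


def cocl(clusters, instances, threshold):
    """Assign each instance to its closest cluster or open a new cluster."""
    clusters = [list(c) for c in clusters]  # do not mutate the caller's lists
    pending = list(instances)
    while pending:
        best = None  # (margin, position in pending, ranked CCDiffs)
        for pos, inst in enumerate(pending):
            ranked = sorted((ccdiff(c, inst), i) for i, c in enumerate(clusters))
            # Margin between the two smallest CCDiffs (assumed 0 otherwise).
            margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0
            if best is None or margin > best[0]:
                best = (margin, pos, ranked)
        _, pos, ranked = best
        inst = pending.pop(pos)
        if not ranked or ranked[0][0] > threshold:
            clusters.append([inst])  # too far from every cluster
        else:
            clusters[ranked[0][1]].append(inst)  # join the closest cluster
    return clusters


result = cocl([["abab", "aabb"], ["cdcd", "ccdd"]], ["ab", "cd", "xy"], threshold=5)
```

In this toy run 'ab' and 'cd' each cost 2 bits in their matching cluster and 10 bits or more in the other, so they are assigned first, while 'xy' exceeds the threshold for both clusters and ends up in a cluster of its own.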

Results - Evaluation
Scenarios: Concept Generation Scenario; Instance Matching Scenario
We hide incrementally one cluster at a time and measure the ability of the algorithm to discover the hidden clusters.
Recall: 100%   Precision: 75%
[Results table with columns Instances kept (%), Correct, Accuracy (%); cell values not preserved in the transcript]
Dataset characteristics:
Cluster’s Name     Cluster’s Type       Instances
Amd                Processor Name       19
Intel              Processor Name       8
Hewlett-Packard    Manufacturer Name    3
Fujitsu-Siemens    Manufacturer Name    5
Windows 98         Operating System     10
Windows 2000       Operating System     3

Conclusions
- CROSSMARC is a complete multi-lingual information integration system.
- Ontology Maintenance is crucial in evolving domains.
- Ontology Enrichment helps the adaptation of the system to new domains, saving time and effort.
- Machine-learning based information extraction can assist the discovery of new instances.
- Compression-based clustering discovers string similarities that support the enrichment with different surface appearances of an instance (“synonyms”).
