Funded by: European Commission – 6th Framework Project Reference: IST-2004-026460 WP 2: Learning Web-service Domain Ontologies Miha Grčar Jožef Stefan.

Presentation transcript:

WP 2: Learning Web-service Domain Ontologies
Miha Grčar, Jožef Stefan Institute
Funded by: European Commission – 6th Framework
Project Reference: IST-2004-026460

TAO WP
Outline of the Presentation
• The goal of WP 2
• Introduction to application mining
• Creating a document network
• Transforming a document network into feature vectors
• LATINO: Link-analysis and text-mining toolbox
• OntoGen: a system for semi-automatic data-driven ontology construction
• WP 2 and the Dassault case study
• Conclusions and future work

Learning Web-service Ontologies
• The goal is to facilitate the acquisition of domain ontologies from legacy applications by:
  1. Identifying data sources that contain knowledge to be transitioned into an ontology
  2. Employing data mining techniques to aid the domain expert in building the ontology

Application Mining
• Case 1: Regular Web service
• Case 2: C++/Java source code
• Case 3: Database
• Case 4: …
[Figure: each case → case-specific “adapters” → intermediate data representation → ontology]
The OL (ontology learning) part works for all cases

Application Mining
Intermediate data representation:
• Structured data = networks (link analysis)
• Unstructured data = textual documents (text mining)
Document network = a set of interlinked documents; each link has a type and a weight
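The document-network definition above can be sketched as a small data structure. This is only an illustration, not LATINO's actual representation: the class name `DocumentNetwork`, the method names, the link-type string and the choice of directed links are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A set of interlinked documents; each link has a type and a weight. */
final class DocumentNetwork {

    /** A typed, weighted link; treating it as directed is an assumption of this sketch. */
    static final class Link {
        final String source;
        final String target;
        final String type;    // e.g. "fieldType" (hypothetical type name)
        final double weight;

        Link(String source, String target, String type, double weight) {
            this.source = source;
            this.target = target;
            this.type = type;
            this.weight = weight;
        }
    }

    // Document name -> textual content (the unstructured part).
    private final Map<String, String> documents = new LinkedHashMap<>();
    // All links between documents (the structured part).
    private final List<Link> links = new ArrayList<>();

    void addDocument(String name, String text) {
        documents.put(name, text);
    }

    void addLink(String source, String target, String type, double weight) {
        if (!documents.containsKey(source) || !documents.containsKey(target)) {
            throw new IllegalArgumentException("Both endpoints must be documents in the network");
        }
        links.add(new Link(source, target, type, weight));
    }

    /** Returns the outgoing links of a document; a null type means "all types". */
    List<Link> outgoing(String source, String type) {
        List<Link> result = new ArrayList<>();
        for (Link link : links) {
            if (link.source.equals(source) && (type == null || link.type.equals(type))) {
                result.add(link);
            }
        }
        return result;
    }

    int documentCount() {
        return documents.size();
    }
}
```

Keeping the text and the links in one object mirrors the idea that text mining works on the document contents while link analysis works on the typed, weighted edges.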

GATE Case Study
• Software library for natural language processing (NLP)
• ~600 Java classes
• Language resources = data
• Processing resources = algorithms
• Graphical user interfaces = GUI
• Developed at the University of Sheffield
• Freely available at

Data Sources
• Structured
  - Code samples
  - Web service usage logs
  - Source code
  - Reference manual (function declarations)
  - WSDL …
• Unstructured
  - Web pages
  - User’s manual
  - Tutorials, lectures, forums, newsgroups, etc.
  - Reference manual (textual descriptions)
  - Source code comments …

A Typical Java Class

/** The format of Documents. Subclasses of DocumentFormat know about
 * particular MIME types and how to unpack the information in any
 * markup or formatting they contain into GATE annotations. Each MIME
 * type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
 * RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
 * with a static index residing here when they are constructed. Static
 * getDocumentFormat methods can then be used to get the appropriate
 * format class for a particular document.
 */
public abstract class DocumentFormat
    extends AbstractLanguageResource implements LanguageResource {

  /** The MIME type of this format. */
  private MimeType mimeType = null;

  /**
   * Find a DocumentFormat implementation that deals with a particular
   * MIME type, given that type.
   * @param aGateDocument this document will receive as a feature
   *        the associated Mime Type. The name of the feature is
   *        MimeType and its value is in the format type/subtype
   * @param mimeType the mime type that is given as input
   */
  static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                 MimeType mimeType) {
  } // getDocumentFormat(aGateDocument, MimeType)
} // class DocumentFormat

Annotations on the slide:
• Class comment, class name, super-class (base class), implemented interface
• A field: field comment, field type, field name
• A method: method comment, return type, method name
• Comment references (class names mentioned inside comments)

Creating a Document Network
[Figure: DocumentFormat.class becomes the document DocumentFormat]

Creating a Document Network
[Figure: DocumentFormat.class becomes the document DocumentFormat, which is linked to AbstractLanguageResource, LanguageResource, MimeType, Document, XmlDocumentFormat, RtfDocumentFormat and MpegDocumentFormat; the figure also carries the label 2]
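One way to obtain the structural links for a class is Java reflection. The sketch below is not how the TAO tools did it, and the class name `ClassLinkExtractor` and the link-type labels are invented for illustration. It only covers links visible in compiled code (super-class, interfaces, field types, return and parameter types); comment references such as XmlDocumentFormat would need the source text and are not handled here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class ClassLinkExtractor {

    /** Returns the links of one class as "linkType:TargetSimpleName" strings. */
    static List<String> extractLinks(Class<?> cls) {
        List<String> links = new ArrayList<>();

        // Super-class (null for interfaces, primitives and java.lang.Object).
        Class<?> superclass = cls.getSuperclass();
        if (superclass != null) {
            links.add("superclass:" + superclass.getSimpleName());
        }

        // Directly implemented interfaces.
        for (Class<?> iface : cls.getInterfaces()) {
            links.add("interface:" + iface.getSimpleName());
        }

        // Types of the fields declared in this class (compiler-generated ones skipped).
        for (Field field : cls.getDeclaredFields()) {
            if (!field.isSynthetic()) {
                links.add("fieldType:" + field.getType().getSimpleName());
            }
        }

        // Return and parameter types of the methods declared in this class.
        for (Method method : cls.getDeclaredMethods()) {
            if (method.isSynthetic()) {
                continue;
            }
            links.add("returnType:" + method.getReturnType().getSimpleName());
            for (Class<?> param : method.getParameterTypes()) {
                links.add("parameterType:" + param.getSimpleName());
            }
        }
        return links;
    }
}
```

A repeated entry in the returned list (for example, a type used both as a field type and as a parameter type) could be counted to derive a link weight, but that weighting scheme is an assumption rather than something the slides specify.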

GATE Comment Reference Network
See next slide

GATE Comment Reference Network

Transforming Networks into Feature Vectors

Combining Feature Vectors
[Figure: for each document (e.g. DocumentFormat), a structure feature vector and a content feature vector are combined into one combined feature vector]
Content feature vector: stop-words, stemming, n-grams, TF-IDF
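The TF-IDF step named on this slide can be illustrated with a minimal sketch. It assumes the common weighting tf × ln(N / df) and a simple tokenizer that splits on anything other than letters and digits; stop-word removal, stemming and n-grams are left out, and the slides do not say which TF-IDF variant LATINO uses. The class name `TfIdf` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdf {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Computes one sparse TF-IDF vector (term -> weight) per input document. */
    static List<Map<String, Double>> vectorize(List<String> documents) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> documentFrequency = new HashMap<>();

        // First pass: raw term counts per document and document frequency per term.
        for (String doc : documents) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : tokenize(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
            termCounts.add(counts);
            for (String term : counts.keySet()) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        // Second pass: weight = term count * ln(N / document frequency).
        int n = documents.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : termCounts) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double idf = Math.log((double) n / documentFrequency.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }
}
```

With this weighting, a term that occurs in every document gets weight 0, so boilerplate vocabulary contributes nothing to the content feature vector.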

LATINO & OntoGen Demo
• LATINO: Link analysis and text mining toolbox
  - Software being developed in the course of TAO WP 2
  - Data preprocessing, machine learning, and data visualization capabilities
• OntoGen
  - A system for data-driven semi-automatic ontology construction
  - SEKT technology (
  - Freely available at

LATINO & OntoGen Demo
GATE source code → LATINO → Feature vectors → OntoGen → Ontology

OntoGen Demo

Dassault Case Study: Inclusion Dependencies
• Inclusion dependencies (IDs) express subset relationships between database tables and are thus important indicators of redundancy
• Discovery of IDs is important in the context of information integration
• Dassault Case Study
  - Problem: Dassault databases contain IDs that should be taken into account when transitioning databases to ontologies
  - LATINO/OntoGen can help detect IDs
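For reference, an exact inclusion dependency between two columns simply means that the value set of one column is contained in the value set of the other. The brute-force check below shows that baseline notion. It is not the similarity-based approach of the case study, and the class name and the `SUBSET_OF` output format are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class InclusionDependencies {

    /**
     * Returns "A SUBSET_OF B" for every ordered pair of distinct columns
     * whose value sets satisfy values(A) ⊆ values(B).
     */
    static List<String> find(Map<String, Set<String>> columns) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> a : columns.entrySet()) {
            if (a.getValue().isEmpty()) {
                continue; // an empty column is trivially included everywhere, so skip it
            }
            for (Map.Entry<String, Set<String>> b : columns.entrySet()) {
                boolean distinct = !a.getKey().equals(b.getKey());
                if (distinct && b.getValue().containsAll(a.getValue())) {
                    result.add(a.getKey() + " SUBSET_OF " + b.getKey());
                }
            }
        }
        return result;
    }
}
```

This exact test compares every pair of columns and breaks on a single dirty value, which is one reason approximate, similarity-based candidates can be a useful complement.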

Dassault Case Study: Inclusion Dependencies
• Dataset
  - The content of database tables in XML format
  - Ignore non-textual and empty table columns
• LATINO setting
  - Instances: columns (i.e. fields) in tables
  - Documents: concatenated values
  - Relations between instances:
    - Cosine similarity between documents
    - Similarity between sets of values
      - Jaccard, |A ∩ B| / |A ∪ B|
      - Alt., |A ∩ B| / min{|A|, |B|}
    - Edit distance (normalized) between column names
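The relations listed on this slide correspond to standard measures, sketched below for illustration. Two details are assumptions because the slide does not specify them: the edit distance is taken to be Levenshtein distance divided by the length of the longer name, and empty inputs are given a similarity of 0. The class name `ColumnSimilarity` is hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ColumnSimilarity {

    /** Jaccard coefficient: |A ∩ B| / |A ∪ B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        int inter = intersectionSize(a, b);
        return (double) inter / (a.size() + b.size() - inter);
    }

    /** Alternative from the slide: |A ∩ B| / min{|A|, |B|}; equals 1 when the smaller set is a subset. */
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        return (double) intersectionSize(a, b) / Math.min(a.size(), b.size());
    }

    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Cosine similarity between two sparse vectors (term -> weight). */
    static double cosine(Map<String, Double> x, Map<String, Double> y) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            Double other = y.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        double norms = norm(x) * norm(y);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    /** Levenshtein distance divided by the longer length: 0 = identical, 1 = nothing in common. */
    static double normalizedEditDistance(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) return 0.0;
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return (double) prev[t.length()] / maxLen;
    }
}
```

The min-based variant is the one most directly tied to inclusion dependencies: it reaches 1 exactly when the smaller value set is fully contained in the larger one, whereas Jaccard also penalizes the extra values in the larger set.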

Dassault Case Study: Inclusion Dependencies

Dassault Case Study: Inclusion Dependencies
Candidates according to bag-of-words cosine similarity (-- marks a score that is missing from the transcript):

1.00  AC_Periodicity.PER_Aircraft : moop.moop_aircraft
1.00  AC_Periodicity.PER_Aircraft : mopa.mopa_kav
1.00  AC_Periodicity.PER_Aircraft : movi.movi_kav
1.00  AC_Periodicity.PER_Aircraft : AC_Zonal.Zonal_ac
1.00  AC_Tools.ATO_nato_vendor_code : task_miscellaneous.MIS_nato_vendor_code
--    AC_Tools.ATO_nato_vendor_code : task_ingredients_consumable.ING_nato_vendor_code
0.99  task_ingredients_consumable.ING_nato_vendor_code : task_tools.TOO_Nato_vendor_code
0.99  task_ingredients_consumable.ING_nato_vendor_code : task_miscellaneous.MIS_nato_vendor_code
0.99  Task_Id.TID_task_owner : task_ingredients_consumable.ING_nato_vendor_code
0.98  AC_Zonal.Zonal_ac : LRU_SRU_Description.LS_Aircraft
--    task_periodicity.PER_periodicity_usage_parameter2 : task_usage_parameter.USP_Libelle
0.50  task_periodicity.PER_threshold_usage_parameter : task_usage_parameter.USP_Code
0.49  Task_Id.TID_usage parameter : task_periodicity.PER_threshold_usage_parameter
--    mope.mope_kpe : task_periodicity.PER_threshold_tol_usage_param
0.48  mope.mope_kpe : task_periodicity.PER_periodicity_usage_param
--    AC_Vendor_Code.Vendor Code : task_LRU_Ata_Code.LRU_Fab
0.21  AC_Zonal.Zonal_title : moid.moid_lida
0.21  task_compagny_owner.COO_Cage_Code : task_LRU_Ata_Code.LRU_Fab
0.21  Task_Id.TID_Task_Writer : task_miscellaneous.MIS_part_number
0.20  Task_Id.TID_Scheduled-task : task_spare.SPR_part_number

Callout on the slide: Task_Id.TID_task_owner, task_ingredients_consumable.ING_nato_vendor_code

Conclusions and Future Work
• Plans for LATINO
  - (Recognized?) open-source architecture for text mining and link analysis
  - Build a user community, put up a Web site, training, promotion …
  - Applications!
    - … in case studies
    - … in other EU projects
    - … outside the context of EU projects
    - … competing in data mining contests
• Future work
  - Implementation of a visualization tool similar to DocumentAtlas (required for setting the weights and exploring the semantic space)
  - Evaluation!
    - Can we solve the problems introduced by the case studies better with the LATINO methodology than with a standard text-mining approach?
  - Continue the development of LATINO