Differential Analysis on Deep Web Data Sources. Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal. December 14, 2010.

Presentation transcript:

Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal December 14, 2010

Outline
Introduction
Problem Definition
Differential Analysis and Approaches
Experiment Results
Conclusion

Introduction
Deep web
  – Query forms vs. backend databases
  – Similar information from multiple data sources
  – How do the sources differ?
  – Application: guiding users' search process
Higher-level knowledge summary
  – Patterns of values with respect to the same entity

Problem Definition
Goal
  – Difference between multiple data sources in the same domain
      Patterns of values of the same entity
  – Different values for the same data entity
      For example: prices of commodities
  – How different is the data, and under what conditions?
  – Differential rules
      Capturing the difference of values

Differential Analysis and Approaches
Summarizing the difference between two data sources
Data queried from the deep web
  – A relational table
Attributes
  – Assumption: the data sources have the same attributes
  – Identical attributes
      Same values for the same data object
  – Differential attributes
      Different values for the same data object
  – Quantitative attributes
      Differences in values of quantitative attributes

Differential Analysis and Approaches - Useful Identifiers
Two data sources, A and B
  – Identical attributes
  – Differential attributes: the attribute as it appears in each data source
  – Combining the relational tables of A and B
  – Differential rule, where profile X is the left-hand side of the rule
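The combination of the two relational tables can be illustrated with a minimal sketch. It assumes that rows from A and B are matched on their identical attributes and that each differential attribute is kept once per source; the function name `combine_sources`, the `_A`/`_B` suffixes, and the sample hotel rows are hypothetical and not taken from the slides.

```python
from typing import Dict, List

Row = Dict[str, object]


def combine_sources(rows_a: List[Row], rows_b: List[Row],
                    identical: List[str], differential: List[str]) -> List[Row]:
    """Join rows of A and B on the identical attributes (assumed join key)
    and keep both values of every differential attribute."""
    # Index source B by the tuple of identical-attribute values.
    index_b = {tuple(r[k] for k in identical): r for r in rows_b}
    combined = []
    for ra in rows_a:
        rb = index_b.get(tuple(ra[k] for k in identical))
        if rb is None:
            continue  # the object is missing from B, so it cannot be compared
        row = {k: ra[k] for k in identical}
        for attr in differential:
            row[f"{attr}_A"] = ra[attr]  # value reported by source A
            row[f"{attr}_B"] = rb[attr]  # value reported by source B
        combined.append(row)
    return combined


# Made-up rows, for illustration only.
site_a = [{"hotel_id": 1, "city": "Paris", "star": 4, "price": 180.0},
          {"hotel_id": 2, "city": "Rome", "star": 3, "price": 95.0}]
site_b = [{"hotel_id": 1, "city": "Paris", "star": 4, "price": 165.0},
          {"hotel_id": 3, "city": "Oslo", "star": 5, "price": 240.0}]
combined = combine_sources(site_a, site_b,
                           identical=["hotel_id", "city", "star"],
                           differential=["price"])
```

Only objects present in both sources survive the join, which is what makes a paired comparison of the differential attribute possible.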

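Differential Analysis and Approaches - Differential Rule Mining
Frequent itemset mining
  – Apriori algorithm
  – A concept hierarchy
Identifying patterns for target attributes
  – For each frequent itemset X
      Decide whether the target attribute differs between the two sources
  – Paired Z-test: difference between two random variables
      Hypothesis test of a null hypothesis against an alternative hypothesis
      If the test statistic exceeds a threshold, the sign of the difference (> 0 or not) is checked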
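A paired Z-test of this kind might look like the sketch below. The exact statistic and thresholds used in the talk are not preserved in the transcript, so this follows the textbook form: the mean of the paired differences divided by its standard error, compared against a two-sided normal critical value. The function name, the `alpha` default, and the direction labels are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Optional, Sequence, Tuple


def paired_z_test(values_a: Sequence[float], values_b: Sequence[float],
                  alpha: float = 0.05) -> Tuple[float, Optional[str]]:
    """Return (z, direction); direction is None when no significant
    difference is found at level alpha (two-sided)."""
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    d_bar = mean(diffs)
    s = stdev(diffs)  # sample standard deviation of the paired differences
    if s == 0:
        # All differences are identical: either no difference at all,
        # or a constant shift whose z statistic is unbounded.
        if d_bar == 0:
            return 0.0, None
        return math.copysign(math.inf, d_bar), "A > B" if d_bar > 0 else "A < B"
    z = d_bar / (s / math.sqrt(len(diffs)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    if abs(z) <= z_crit:
        return z, None  # cannot reject the null hypothesis of equal means
    return z, "A > B" if z > 0 else "A < B"


# Made-up prices of the same eight objects in two sources.
z, direction = paired_z_test([120, 135, 150, 110, 160, 145, 130, 155],
                             [100, 118, 131, 95, 140, 128, 111, 137])
```

The normal approximation is usually considered reasonable only for a fairly large number of pairs; for small samples a paired t-test is the more common choice.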

Differential Analysis and Approaches - Pruning Rules
Pruning rules
  – A large number of rules are generated
  – Essential rules predict unessential rules
  – Identifying essential rules
      Direction of rules

Differential Analysis and Approaches - Ancestors of Rules
Rules R1 and R2 are complementary ancestors of rule R
  – R1: Y -> d, R2: Z -> d
  – R: X -> d
Rule R is predicted by its complementary ancestors R1 and R2

Differential Analysis and Approaches - Profile Representation
Identifying essential rules
  – Rules are processed level by level
  – For a rule R in level k, all the rules from level 1 to k-1 are visited
  – The computation cost is high
Profile representation
  – Uniquely describes the items contained in the profile X of a rule R
  – The representation defined for a profile would be extremely large when profile X is large
  – Thus, the definition is modified

Differential Analysis and Approaches - Process of Pruning
A hash table is used to store differential rules
Each level corresponds to a hash table
For each rule R in the k-th level
  – The ancestor rules from level 1 to k/2 are visited
  – Complementary rules are identified by profile representation
  – R is an unessential rule if it is predicted by a pair of complementary ancestor rules
  – Process the next rule
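The pruning loop can be sketched as follows, under several assumptions that the transcript does not confirm: complementary ancestors Y and Z are taken to be disjoint profiles whose union is X, a rule counts as predicted when both ancestors share its direction d, and a `frozenset` of items stands in for the numeric profile representation as the hash key. All names and the sample rules are hypothetical.

```python
from typing import Dict, FrozenSet

Profile = FrozenSet[str]
# levels[k] is the hash table of level k: profile of size k -> direction d.
Levels = Dict[int, Dict[Profile, str]]


def is_unessential(profile: Profile, direction: str, levels: Levels) -> bool:
    """True if some pair of complementary ancestors predicts the rule."""
    k = len(profile)
    # For any split X = Y + Z, the smaller part has at most k/2 items,
    # so visiting levels 1..k/2 is enough to find every pair.
    for size in range(1, k // 2 + 1):
        larger_level = levels.get(k - size, {})
        for y, d_y in levels.get(size, {}).items():
            if d_y != direction or not y < profile:
                continue  # not an ancestor with the same direction
            z = profile - y  # the complement of Y within X
            if larger_level.get(z) == direction:  # one hash lookup
                return True
    return False


def prune(levels: Levels) -> Levels:
    """Keep only the essential rules of every level."""
    return {k: {p: d for p, d in rules.items()
                if not is_unessential(p, d, levels)}
            for k, rules in sorted(levels.items())}


# Made-up rules, for illustration only.
rules: Levels = {
    1: {frozenset({"city=Paris"}): "A>B",
        frozenset({"star=4"}): "A>B",
        frozenset({"star=5"}): "A<B"},
    2: {frozenset({"city=Paris", "star=4"}): "A>B",
        frozenset({"city=Paris", "star=5"}): "A>B"},
}
essential = prune(rules)
```

In this sample the rule on {city=Paris, star=4} is dropped because both of its single-item ancestors already point in the same direction, while the rule on {city=Paris, star=5} is kept because its ancestors disagree.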

Experiment Results
Data set: four of the most popular travel sites
120 randomly selected cities all over the world
Attributes
  – Hotel ID, City, Star, Customer Rating, Cleanness Rating, Price, Service Rating
Concept hierarchy for the attribute City

Experiment Results - Effectiveness

Experiment Results - Pruning Effectiveness

Experiment Results - Efficiency

Experiment Results - Utility of the Mining Approach

Conclusion
A method to extract a high-level summary of the differences among multiple data sources
Differential rule mining: a new data mining problem
A statistical test for discovering differential rules
A method to prune unessential rules
A hash table is used to speed up the process
Experiments on four travel-related deep web data sources show good results

Questions?