Stratified Sampling for Data Mining on the Deep Web


Stratified Sampling for Data Mining on the Deep Web
Tantan Liu, Fan Wang, Gagan Agrawal
{liut, wangfa, agrawal}@cse.ohio-state.edu
Dec. 16, 2010

Outline
- Introduction
- Background Knowledge: Association Rule Mining, Differential Rule Mining
- Basic Formulation
- Main Technical Approach: A Greedy Stratification Method
- Experiment Results
- Conclusion

Introduction
- Deep web: a query interface in front of a backend database; input attributes vs. output attributes
- Data mining on the deep web: obtaining a high-level summary of the data
- Challenges:
  - The backend database cannot be accessed directly, so we must sample
  - Deep-web querying is time-consuming, so the sampling method must be efficient

Background Knowledge: Association Rule Mining
- Aim: find co-occurrence patterns among items
- Frequent itemset: an itemset whose support (fraction of transactions containing it) is larger than a threshold
- Rule X => Y: X ∪ Y is a frequent itemset, and the confidence support(X ∪ Y) / support(X) is larger than a threshold
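To make the support and confidence definitions concrete, here is a minimal Python sketch; the transactions and item names are made up for illustration:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X => Y) = support(X ∪ Y) / support(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
# {bread, milk} appears in 2 of 4 transactions -> support 0.5
print(support({"bread", "milk"}, transactions))
# bread => milk: support({bread, milk}) / support({bread}) = 0.5 / 0.75
print(confidence({"bread"}, {"milk"}, transactions))
```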

Background Knowledge: Differential Rule Mining
- Aim: find differences between two deep web data sources (e.g., the price of the same hotels on two web sites)
- Identical attributes vs. differential attributes: attributes whose values agree vs. differ across the sources
- Rule: X => diff(t), where X is a frequent itemset composed of identical attributes, t is the differential (target) attribute, and D1, D2 are the two data sources
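A small sketch of the quantity such a rule compares: the mean of the target attribute in each source, restricted to records matching the identical-attribute itemset X. The hotel records and field names below are hypothetical:

```python
def mean_diff(X, target, d1, d2):
    """Mean difference of `target` between two sources over records matching X.

    X: dict of identical-attribute values; d1, d2: lists of record dicts.
    """
    def mean_given_x(records):
        vals = [r[target] for r in records
                if all(r.get(k) == v for k, v in X.items())]
        return sum(vals) / len(vals)
    return mean_given_x(d1) - mean_given_x(d2)

# Hypothetical hotel listings from two sites
site1 = [{"city": "Columbus", "stars": 3, "price": 90},
         {"city": "Columbus", "stars": 3, "price": 110}]
site2 = [{"city": "Columbus", "stars": 3, "price": 80},
         {"city": "Columbus", "stars": 3, "price": 100}]
# mean price 100 on site1 vs. 90 on site2 -> difference 10.0
print(mean_diff({"city": "Columbus", "stars": 3}, "price", site1, site2))
```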

Basic Formulation: Problem Formulation
- A two-step sampling procedure:
  - A pilot sample is randomly drawn from the deep web, and interesting rules are identified
  - An additional sample is drawn to verify the identified rules (both association rules and differential rules)
- Verification requires sampling more data records satisfying X
  - If X contains only input attributes: easy, query on those attributes directly
  - If X contains output attributes: random sampling is not efficient, so how should we sample?

Basic Formulation: Problem Formulation in Detail
- We consider rules with a single output attribute A on the left-hand side
- Association rule: estimate the support or confidence of the rule
- Differential rule: estimate the mean of the target attribute given A = a
- Goal of sampling: high estimation accuracy at low sampling cost

Basic Formulation: Stratified Sampling
- Sample separately from strata that are heterogeneous across strata and homogeneous within each stratum
- Estimating the mean of y: ybar = sum over strata h of (N_h / N) * ybar_h, where N_h is the size of stratum h and ybar_h its sampled mean
- Association rule mining: y indicates whether the itemset is contained in a transaction (y = 1 if contained, 0 otherwise)
- Differential rule mining: y is the value of the target attribute
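The stratified estimator above can be sketched in a few lines of Python; the strata here are plain lists of values, and the sample sizes are chosen by the caller:

```python
import random

def stratified_mean(strata, n_per_stratum, seed=0):
    """Estimate the population mean as sum_h (N_h / N) * ybar_h.

    strata: list of lists, each the full population of one stratum (size N_h)
    n_per_stratum: how many records to sample from each stratum
    """
    rng = random.Random(seed)
    N = sum(len(s) for s in strata)
    est = 0.0
    for s, n in zip(strata, n_per_stratum):
        sample = rng.sample(s, n)           # simple random sample within stratum
        est += (len(s) / N) * (sum(sample) / n)
    return est

# Sampling every record recovers the exact population mean:
# (4/6)*1 + (2/6)*10 = 4.0
print(stratified_mean([[1, 1, 1, 1], [10, 10]], [4, 2]))
```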

Background: Neyman Allocation
- Sample allocation: determining the sample size for each stratum, given a fixed total sample size
- Neyman allocation assigns n_h proportional to N_h * S_h (stratum size times stratum standard deviation), minimizing the variance of the stratified estimate
- Problem when applied to the deep web: the probability of A = a in each stratum is not considered, so the sampling cost can be large
- Sampling cost: the number of queries submitted to the deep web
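Neyman allocation itself is a one-line proportionality rule; a minimal sketch with made-up stratum sizes and standard deviations:

```python
def neyman_allocation(n, sizes, stds):
    """Neyman allocation: n_h proportional to N_h * S_h for total sample size n."""
    weights = [N * S for N, S in zip(sizes, stds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

# A large low-variance stratum gets fewer samples than a smaller
# high-variance one: weights 1000:2000:1000 -> [25, 50, 25]
print(neyman_allocation(100, sizes=[1000, 500, 500], stds=[1.0, 4.0, 2.0]))
```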

Sampling Cost and Integrated Cost
- Aim: obtain data records with A = a
- Sampling cost on the deep web: the number of records needed with A = a, divided by the probability of finding such a record
- Integrated cost: combines the sampling cost and the estimation variance through two adjustable weights
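The slide does not give a closed form for the integrated cost, so the sketch below is only an assumed additive combination: a weighted sum of estimation variance and expected query cost, where obtaining n_needed matching records is expected to take n_needed / p_match queries. The weights alpha and beta are the two adjustable weights mentioned above; their names and default values are hypothetical:

```python
def integrated_cost(variance, n_needed, p_match, alpha=0.7, beta=0.3):
    """Assumed form: alpha * variance + beta * expected number of queries.

    p_match: probability that a randomly drawn record satisfies A = a,
    so n_needed / p_match queries are expected to yield n_needed matches.
    """
    expected_queries = n_needed / p_match
    return alpha * variance + beta * expected_queries

# variance 2.0, 50 matching records needed, 25% match rate:
# 0.7 * 2.0 + 0.3 * (50 / 0.25) = 1.4 + 60.0 = 61.4
print(integrated_cost(2.0, 50, 0.25))
```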

Main Technical Approach: Stratification Process
- Stratification is performed by building a tree over the query space, in a top-down manner
- At each node, the best split creates the child nodes: the input attribute yielding the smallest integrated cost is chosen
- The splitting process stops when the integrated cost at each leaf node is small
- The leaf nodes are the final strata for sampling
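The greedy top-down construction can be sketched as follows. This is illustrative only: in place of the paper's integrated cost, the split criterion here is the size-weighted within-group variance of the target, which plays the same role of measuring how homogeneous a candidate partition is:

```python
def greedy_stratify(records, input_attrs, target, max_leaf_var=1.0):
    """Greedy top-down stratification sketch (not the paper's exact criterion):
    split on the input attribute whose partition has the lowest size-weighted
    within-group variance of the target; recurse until each leaf is homogeneous
    enough. The leaves become the final strata."""
    def var(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)

    def split_cost(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) * var(g) for g in groups.values()) / len(records)

    # Stop splitting when no attributes remain or the node is homogeneous
    if not input_attrs or var([r[target] for r in records]) <= max_leaf_var:
        return [records]  # leaf: one final stratum
    best = min(input_attrs, key=split_cost)
    groups = {}
    for r in records:
        groups.setdefault(r[best], []).append(r)
    rest = [a for a in input_attrs if a != best]
    strata = []
    for group in groups.values():
        strata.extend(greedy_stratify(group, rest, target, max_leaf_var))
    return strata

# An attribute that perfectly determines the target yields homogeneous strata
records = [{"a": 0, "y": 1.0}, {"a": 0, "y": 1.0},
           {"a": 1, "y": 10.0}, {"a": 1, "y": 10.0}]
print(len(greedy_stratify(records, ["a"], "y", max_leaf_var=0.1)))
```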

Experiment Results
- Data set: the income of US households from the 2008 US Census; 40,000 data records; 7 categorical and 2 numerical attributes
- Two metrics: variance of estimation and sampling cost

Experiment Results: Settings
- Five sampling procedures, using four different weights balancing variance against sampling cost:
  - Full_Var, Var7, Var5, Var3: progressively smaller weight on estimation variance
  - Rand: simple random sampling

Experiment Results: Variance of Estimation (Association Rule Mining)
- The variance of estimation increases as the weight on variance is decreased
- Random sampling has a higher estimation variance than the stratified variants

Experiment Results: Sampling Cost (Association Rule Mining)
- The sampling cost decreases as the weight on variance is decreased
- Random sampling has a higher sampling cost than the stratified variants

Conclusion
- Stratified sampling for data mining on the deep web, considering both estimation accuracy and sampling cost
- A tree model captures the relation between input attributes and output attributes
- A greedy stratification method maximally reduces an integrated cost metric
- Experiments show higher sampling accuracy and lower sampling cost than simple random sampling, and that sampling cost can be further reduced by trading off a fraction of estimation error

Questions & Comments?