Stratified K-means Clustering Over a Deep Web Data Source
Tantan Liu, Gagan Agrawal
Dept. of Computer Science & Engineering, Ohio State University
Aug. 14, 2012

Outline
Introduction
 – Deep Web
 – Clustering on the deep web
Stratified K-means Clustering
 – Stratification
 – Sample Allocation
Conclusion

Deep Web
Data sources hidden behind online query interfaces, not directly crawlable
 – Online query interface vs. database
 – Database accessible only through the online interface
 – Input attributes vs. output attributes
An example of the Deep Web

Data Mining over the Deep Web
Goal: a high-level summary of the data
 – Scenario 1: a user wants to relocate to a county and needs a summary of the county's residences
 – Attributes of interest: age, price, square footage
 – The county property assessor's web site only allows simple queries

Challenges
Databases cannot be accessed directly
 – Sampling method for deep web mining
Obtaining data is time consuming
 – Efficient sampling method
 – High accuracy with low sampling cost

An Example of a Deep Web Data Source for Real Estate

k-means Clustering over a Deep Web Data Source
Goal: estimate the k centers of the underlying clusters from a sample, so that the estimated centers are close to the k true centers of the whole population.

Overview of the Method
 – Stratification
 – Sample Allocation

Stratification on the Deep Web
Partitioning the entire population into strata
 – Stratification is performed on the query space of the input attributes
 – Goal: homogeneous query subspaces
 – Each query subspace has a radius, a measure of how spread out its records are
 – Rule: choose the input attribute that most decreases the radius of a node
 – For each candidate input attribute, compute the resulting decrease of radius (see the sketch below)
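The transcript does not reproduce the slide's formulas for the radius and its decrease. The sketch below is only an illustration of the splitting rule, assuming the radius of a query subspace is the mean distance of its pilot-sample records (output attributes) to their centroid, and that the radius after a split is the size-weighted average over the resulting child subspaces; the paper's exact definitions may differ.

```python
import numpy as np

def radius(outputs):
    """Mean distance of output-attribute vectors to their centroid."""
    center = outputs.mean(axis=0)
    return np.linalg.norm(outputs - center, axis=1).mean()

def best_split(inputs, outputs):
    """Pick the input attribute whose value-based split most decreases the radius.

    inputs:  dict mapping input-attribute name -> 1-D array of its values
    outputs: 2-D array of output-attribute vectors (same number of rows)
    """
    base = radius(outputs)
    best = None
    for attr, values in inputs.items():
        uniq = np.unique(values)
        if len(uniq) < 2:
            continue
        # radius after the split: size-weighted radius of the child subspaces
        child = sum((values == v).mean() * radius(outputs[values == v]) for v in uniq)
        decrease = base - child
        if best is None or decrease > best[1]:
            best = (attr, decrease)
    return best  # (attribute, decrease of radius), or None if no split is possible
```

Applied recursively to each node, this yields the strata over the input-attribute query space.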

Partitioning the Space of Output Attributes

Sample Allocation Methods
We first create c*k partitions, i.e., c*k subspaces of the output space
 – Draw a pilot sample
 – Run k-means with c*k clusters on the pilot sample to generate the c*k partitions (see the sketch below)
Representative sampling
 – Goal: good estimates of the statistics of the c*k subspaces
   Centers
   Proportions
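A minimal sketch of forming the c*k output-space partitions from a pilot sample; scikit-learn is assumed to be available, and the values of c, k and the toy pilot data are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

c, k = 3, 4                               # illustrative values
pilot = np.random.rand(500, 3)            # toy pilot sample: e.g. age, price, square footage
km = KMeans(n_clusters=c * k, n_init=10, random_state=0).fit(pilot)
partition_labels = km.labels_             # which of the c*k subspaces each pilot record falls in
partition_centers = km.cluster_centers_   # initial estimates of the subspace centers
```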

Representative Sampling: Centers
Center of a subspace
 – The mean vector of all data points belonging to the subspace
Let the sample be S = {DR_1, DR_2, …, DR_n}
 – For the i-th subspace, the estimated center is the mean of the sampled records assigned to it (a sketch follows below)
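A minimal sketch of the center estimate: the center of the i-th subspace is the mean vector of the sampled records assigned to it. Variable names are illustrative.

```python
import numpy as np

def subspace_centers(sample, labels, n_subspaces):
    """Mean vector of the sampled records assigned to each of the c*k subspaces.
    (A subspace with no sampled records would need special handling.)"""
    return np.array([sample[labels == i].mean(axis=0) for i in range(n_subspaces)])
```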

Distance Function
Measures how far the c*k estimated centers are from the true centers, using Euclidean distance
 – Approximated by an integrated variance, computed from the pilot sample (a sketch follows below)
 – n_j: number of samples drawn from the j-th stratum
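The slide's formula for the integrated variance is not reproduced in this transcript. The sketch below assumes the standard stratified form, the sum over strata of sigma_j^2 / n_j, with the per-stratum variances estimated from the pilot sample; the paper's exact expression may include additional weights.

```python
import numpy as np

def integrated_variance(pilot_by_stratum, n_alloc):
    """pilot_by_stratum: list of 2-D arrays, the pilot records of each stratum
       n_alloc:          list of sample sizes n_j allocated to each stratum"""
    total = 0.0
    for records, n_j in zip(pilot_by_stratum, n_alloc):
        sigma2 = records.var(axis=0).sum()   # total variance across output attributes
        total += sigma2 / max(n_j, 1)
    return total
```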

Optimized Sample Allocation
Goal: minimize the distance function subject to a fixed total sampling budget
 – Solved using Lagrange multipliers
 – Strata with large variance receive more samples: their data are spread over a wide area, so more records are needed to represent the population (a sketch follows below)
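A sketch of the closed-form allocation obtained by minimizing the integrated variance of the previous sketch (sum of sigma_j^2 / n_j) subject to a fixed total budget via a Lagrange multiplier: n_j proportional to sigma_j. This mirrors the slide's statement that high-variance strata get more samples, but the paper's exact objective may lead to a slightly different formula.

```python
import numpy as np

def optimized_allocation(sigmas, total_budget):
    """sigmas: per-stratum standard deviations (estimated from the pilot sample)
       total_budget: total number of samples to allocate"""
    sigmas = np.asarray(sigmas, dtype=float)
    shares = sigmas / sigmas.sum()            # n_j proportional to sigma_j
    # rounding may leave the total slightly off the budget; at least 1 per stratum
    return np.maximum(1, np.round(shares * total_budget).astype(int))
```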

Active Learning Based Sampling Method
In machine learning
 – Passive learning: data are chosen randomly
 – Active learning: particular data are selected to help build a better model, useful when obtaining data is costly and/or time-consuming
For stratum i, we can compute the estimated decrease of the distance function if it is sampled next
Iterative sampling process (see the sketch below)
 – At each iteration, the stratum with the largest estimated decrease of the distance function is selected for sampling
 – The integrated variance is then updated
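A sketch of the iterative, active-learning style loop: at each iteration the stratum whose next batch of samples is estimated to decrease the integrated variance the most is queried, and the running estimates are updated. The batch size, the stopping budget, and the hypothetical draw_from_stratum() query routine are illustrative, not part of the original slides.

```python
import numpy as np

def active_sampling(sigmas2, n_alloc, budget, batch=10):
    """sigmas2: pilot estimates of per-stratum variances
       n_alloc: samples drawn from each stratum so far (assumed >= 1 each)"""
    sigmas2 = np.asarray(sigmas2, dtype=float)
    n_alloc = np.asarray(n_alloc, dtype=float)
    while n_alloc.sum() < budget:
        # estimated decrease of the distance function if `batch` more samples
        # are drawn from stratum j
        decrease = sigmas2 / n_alloc - sigmas2 / (n_alloc + batch)
        j = int(np.argmax(decrease))
        # new_records = draw_from_stratum(j, batch)   # hypothetical deep-web query
        n_alloc[j] += batch
        # sigmas2[j] would be re-estimated here from the enlarged sample
    return n_alloc.astype(int)
```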

Representative Sampling: Proportions
Proportion of a subspace
 – Fraction of data records belonging to the subspace
 – Depends on the proportion of the subspace within each stratum, e.g., within the j-th stratum (a sketch of the combined estimate follows below)
Risk function
 – Distance between the estimated fractions and their true values
Iterative sampling process
 – At each iteration, the stratum with the largest estimated decrease of the risk function is chosen for sampling
 – The parameter estimates are then updated
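A minimal sketch of the stratified proportion estimate: the overall fraction of each subspace is combined from the per-stratum fractions, weighted by the relative stratum sizes N_j / N. Variable names are illustrative.

```python
import numpy as np

def subspace_proportions(labels_by_stratum, stratum_sizes, n_subspaces):
    """labels_by_stratum: list of integer label arrays (subspace of each sampled record)
       stratum_sizes:     population size N_j of each stratum"""
    weights = np.asarray(stratum_sizes, dtype=float)
    weights /= weights.sum()
    p = np.zeros(n_subspaces)
    for w, labels in zip(weights, labels_by_stratum):
        counts = np.bincount(labels, minlength=n_subspaces)
        p += w * counts / max(len(labels), 1)   # per-stratum fraction, weighted by N_j / N
    return p
```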

Stratified K-means Clustering
Weight for data records in the i-th stratum
 – w_i = N_i / n_i, where N_i is the population size of the stratum and n_i its sample size
Similar to standard k-means clustering
 – The center of the i-th cluster is the weighted mean of the records assigned to it (see the sketch below)
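A sketch of the final stratified k-means step: each sampled record from stratum i carries weight N_i / n_i, and cluster centers become weighted means. Here the weighting is passed to scikit-learn's KMeans via sample_weight as a stand-in; the original method may implement its own weighted Lloyd iterations.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_kmeans(samples_by_stratum, stratum_sizes, k):
    """samples_by_stratum: list of 2-D arrays, the sampled records of each stratum
       stratum_sizes:      population size N_i of each stratum"""
    X, w = [], []
    for records, N_i in zip(samples_by_stratum, stratum_sizes):
        n_i = len(records)
        X.append(records)
        w.append(np.full(n_i, N_i / n_i))     # weight N_i / n_i for this stratum's records
    X, w = np.vstack(X), np.concatenate(w)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X, sample_weight=w)
    return km.cluster_centers_
```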

Experimental Results
Data set
 – Yahoo! data set: data on used cars, 8,000 data records
Evaluation metric: average distance (AvgDist)

Representative Sampling on the Yahoo! Data Set
Benefit of stratification
 – Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
Benefit of representative sampling
 – Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5%
Center-based sampling methods perform better
The optimized sampling method performs better in the long run

Conclusion
Clustering over a deep web data source is challenging
We propose a stratified k-means clustering method for the deep web
Representative sampling
 – Centers
 – Proportions
The experimental results show the effectiveness of our approach