A Bayesian Method for Guessing the Extreme Values in a Data Set
Mingxi Wu, Chris Jermaine
University of Florida, September 2007

Outline
 Problem definition
 Example applications
 A Bayesian approach
 Experiment and conclusion

Problem definition
Given a rank k and a finite data set D (each data point in D maps to a real value).
Problem: Can we use a random sample, drawn without replacement from D, to predict the kth largest/smallest value in the entire data set?

Problem definition (cont.)
Running Example: k=2; D = {1, 2, 3, 4, …, 79, …, 98, 10^5, 10^6}; S = {1, 3, 79}.
Can we use S to predict 10^5, the 2nd largest value in D?
 Very hard when D is the result set of applying an arbitrary function (such as a selection predicate) to each record in a database


Example Applications
 Data mining: outlier detection, top-k pattern mining
 Database management: min/max online aggregation, top-k query processing, query optimization, distance joins
 Any research with the keywords top/kth/max/min/rank/extreme…


A Natural Estimator
Running Example: k=2; D = {1, 2, 3, 4, …, 79, …, 98, 10^5, 10^6}; S = {1, 3, 79}.
Can we use S to predict 10^5, the 2nd largest value in D?
 The best "guess" with just S: our estimator is the (k')th largest value in the sample S
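A minimal sketch of this estimator in Python, assuming for illustration that k' is obtained by scaling the rank k down to the sample size; the paper's exact choice of k' may differ.

```python
import math
import random

def natural_estimator(sample, k, population_size):
    """Return the (k')th largest value in the sample, where k' scales the
    target rank k down to the sample size (an illustrative assumption)."""
    k_prime = max(1, math.ceil(k * len(sample) / population_size))
    return sorted(sample, reverse=True)[k_prime - 1]

# Running example: |D| = 100, and we want the 2nd largest value (k = 2).
D = list(range(1, 99)) + [10**5, 10**6]
S = random.sample(D, 3)            # e.g. {1, 3, 79}
print(natural_estimator(S, k=2, population_size=len(D)))
```

With a 3-element sample from 100 values and k=2, this gives k'=1, matching the running example where the estimator is 79.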

A Natural Estimator (cont.)
 In general, the (k')th largest value in S may be much smaller (or even larger) than the kth largest value in D
 The relationship between the (k')th largest in S and the kth largest in D varies from data set to data set
 Key strategy: characterize the distribution of the ratio kth/(k')th

A Natural Estimator (cont.)
 Given a distribution on the ratio kth/(k')th, deriving bounds on the kth largest value in D becomes easy
For example, if there is a 95% chance the ratio is between l and h, then there is a 95% chance the kth largest value in D is between l × (k')th and h × (k')th
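To make the bounding step concrete, here is a small sketch: given the estimator value y (the (k')th largest in the sample) and an interval [l, h] that covers the ratio with the desired probability, the bound on the kth largest value in D is simply [l·y, h·y].

```python
def bound_from_ratio(estimator_value, ratio_low, ratio_high):
    """Turn a confidence interval on the ratio kth/(k')th into a
    confidence interval on the kth largest value in D."""
    return ratio_low * estimator_value, ratio_high * estimator_value

# If the ratio lies in [1.0, 4.5] with 95% probability and the sample's
# (k')th largest value is 79, then with 95% probability the kth largest
# value in D lies in [79.0, 355.5].
print(bound_from_ratio(79, 1.0, 4.5))
```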

Characterize the Ratio kth/(k')th
Running Example: k=2; D = {1, 2, 3, 4, …, 79, …, 98, 10^5, 10^6}; k'=1; S = {1, 3, 79}.
 Imagine D is the query result set obtained by applying an arbitrary function f() to a database
 It is impossible to predict the ratio 10^5/79 exactly
 Without any knowledge about f(), we cannot bias 79 toward larger values
 But with some domain (prior) knowledge and a sample of f(), we may reasonably guess the behavior of f()

Characterize the Ratio kth/(k')th (cont.)
 The workload in the DB gives abundant domain knowledge
   each query corresponds to a data set in the same domain
   some queries produce f() values that are typically small, with a few outliers that boost the kth largest value
   some queries produce f() values that are tightly distributed around their mean, so the kth and (k')th values are very close
 When a new query is asked, we guess which type of "typical" query it resembles from a few f() samples
 A model is needed to classify the queries
Key Question: What aspect of the queries should we model in order to describe the different distributions of kth/(k')th?

Importance of the Histogram Shape of a Query Result Set
Setup: 4 data sets with different histogram shapes; |D| = 10,000, k = 1
Experiment: repeat sampling 500 times with sample size 100 and k' = 1, and compute the average ratio kth/(k')th
Observed average ratios for the four shapes: 3.07, 1.06, 4.45, 1.01
 The histogram shape of the query result set affects the ratio kth/(k')th
 Scale does not matter
Conclusion: We need to model the histogram shape!
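This shape dependence is easy to reproduce in a small simulation (an illustration with assumed Gamma shapes, not the paper's four data sets): heavily skewed data gives a large average ratio, concentrated data gives a ratio near 1, and rescaling the data leaves the ratio unchanged.

```python
import numpy as np

def avg_ratio(data, k=1, k_prime=1, sample_size=100, trials=500, rng=None):
    """Average ratio of the kth largest value in the data set to the
    (k')th largest value in a without-replacement sample."""
    rng = rng or np.random.default_rng(0)
    kth = np.sort(data)[-k]
    ratios = []
    for _ in range(trials):
        s = rng.choice(data, size=sample_size, replace=False)
        ratios.append(kth / np.sort(s)[-k_prime])
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
skewed = rng.gamma(shape=0.5, scale=1.0, size=10_000)    # heavy right skew
flat   = rng.gamma(shape=50.0, scale=1.0, size=10_000)   # tightly concentrated
print(avg_ratio(skewed), avg_ratio(flat))   # skewed ratio >> flat ratio
print(avg_ratio(skewed * 1000.0))           # rescaling leaves the ratio unchanged
```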

First Step to Define the Model
 We want a model that captures the histogram shape of a query result set
 Before deriving the math, it is helpful to think about how the model will work: once we know how it works, we can find an appropriate probability density function to describe it

Our Generative Model
 Assume there is a set of histogram shapes; each shape has a weight, and the weights sum to 1
 To generate a data set:
1. A biased die (the weights) is rolled to determine which shape pattern the new data set will be generated from
2. An arbitrary scale is selected to define the magnitude of the items in the data set
3. The shape and the scale are used to instantiate a distribution, and we repeatedly sample from this distribution to produce the data set
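A minimal sketch of this generative process, assuming Gamma-shaped components; the shape grid, weights, and uniform scale prior below are illustrative placeholders, not the learned model.

```python
import numpy as np

def generate_data_set(shapes, weights, n, rng):
    """Generate one synthetic data set: a biased die roll picks the shape
    pattern, an arbitrary scale sets the magnitude, then n values are drawn."""
    alpha = rng.choice(shapes, p=weights)               # 1. biased die picks the shape
    scale = rng.uniform(1.0, 1000.0)                    # 2. arbitrary scale (placeholder prior)
    return rng.gamma(shape=alpha, scale=scale, size=n)  # 3. sample the data set

rng = np.random.default_rng(1)
shapes  = [0.5, 2.0, 10.0]      # hypothetical shape patterns
weights = [0.2, 0.5, 0.3]       # prior weights, summing to 1
D = generate_data_set(shapes, weights, n=10_000, rng=rng)
```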

Our Generative Model (cont.)
 The initial set of weights is our prior distribution, indicating our belief about how likely a new query result set's histogram shape is to match each shape pattern

Bayesian Approach
 Problem: the prior weights alone do not give enough information
   one histogram shape may indicate that our estimator (k')th is close to the extreme value kth
   another histogram shape may indicate that our estimator (k')th is far from the extreme value kth
Question: How do we determine which histogram shape we are experiencing once we have sampled S?

Bayesian Approach (cont.)
 We can rely on a principled Bayesian approach that combines the informative prior with the sample taken from the new data set
 After a sample of the new data set is taken, we use it to update the prior weights; the updated weights are our posterior distribution
   incorporate the new evidence to determine the current histogram shape
   place more weight on the most probable shape
   use the updated weights for bounding the extreme value

Overview of the Bayesian Framework
1. Build the prior model: a set of weighted histogram shapes
2. Update the prior model with the sample: adjust the prior weight of each shape to better describe the new data set's histogram shape
3. Use the updated model to derive a confidence bound on the ratio kth/(k')th

Bayesian Framework Step 1: Prior Shape Model
 A modified mixture of Gamma(x | α, β) distributions
 Treat the scale parameter β as a random variable and integrate it out; in the resulting pdf:
   each mixture component probability density function (pdf) p is indexed by a shape parameter α
   the pdf's input requires only M (the product of the values), S (their sum), and N (the cardinality)

Bayesian Framework Step 1: Prior Shape Model (cont.)
 Using this component pdf, the probability density function of our model becomes a weighted mixture of the component pdfs
 We use the EM algorithm to learn this model from the historical workload
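As a hedged sketch of this learning step, the snippet below runs a simplified EM that re-estimates only the mixture weights over a fixed grid of candidate shapes; the component likelihood (the Gamma pdf with β integrated out) is abstracted as a user-supplied function, and the paper's full procedure also updates the shape parameters.

```python
import numpy as np

def em_weights(datasets, shapes, component_loglik, iters=50):
    """Simplified EM over a historical workload: fix a grid of candidate
    shapes and learn only the mixture weights.
    component_loglik(dataset, shape) must return log p(dataset | shape)."""
    w = np.full(len(shapes), 1.0 / len(shapes))
    L = np.array([[component_loglik(d, a) for a in shapes] for d in datasets])
    for _ in range(iters):
        # E-step: responsibility of each shape for each historical data set
        logr = np.log(w) + L
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        w = r.mean(axis=0)
    return w
```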

Bayesian Framework Step 1: Why a Gamma Distribution?
 A Gamma(α, β) distribution can produce shapes with arbitrary right-leaning skew

Bayesian Framework Step 2: Update Prior Weights
 Aggregate the sample into the form (M, S, N)
 Apply Bayes' rule to update the weights
 The result is the posterior weight for each shape pattern
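A hedged sketch of this update: by Bayes' rule, each shape's posterior weight is proportional to its prior weight times the likelihood of the observed sample under that shape. The component likelihood, which needs only the sample's product, sum, and cardinality, is again abstracted as a function here.

```python
import numpy as np

def posterior_weights(prior_weights, shapes, sample, component_loglik):
    """Apply Bayes' rule: posterior_i is proportional to
    prior_i * p(sample | shape_i), then normalize."""
    logw = np.log(prior_weights) + np.array(
        [component_loglik(sample, a) for a in shapes])
    logw -= logw.max()          # subtract the max for numerical stability
    w = np.exp(logw)
    return w / w.sum()
```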

Bayesian Framework Step 3: Confidence Bound the Extreme Value
 Recall that each histogram shape characterizes a ratio distribution of kth/(k')th

Bayesian Framework Step 3: Confidence Bound the Extreme Value (cont.)
 Each shape has a corresponding ratio distribution
 Step 2 gives the posterior weight for each shape pattern
 Our model can now be viewed as someone choosing a shape with a die roll biased by the posterior weights
 We do not know which shape the die roll has chosen
 So the resulting ratio distribution is a mixture of each shape pattern's ratio distribution
 The probability of the ith ratio distribution in the mixture is the ith posterior weight

Bayesian Framework Step 3: Confidence Bound the Extreme Value (cont.)
 Now we have:
1. the ratio distribution, in mixture form
2. our estimator, the (k')th largest value in the sample
 We can now confidence bound the kth largest value
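A sketch of this bounding step, assuming we already hold Monte Carlo ratio samples for each shape (from the paper's TKD sampling, or the naive procedure sketched after the next slide): draw ratios from the posterior-weighted mixture, take its quantiles, and scale by the estimator.

```python
import numpy as np

def confidence_bound(estimator_value, ratio_samples_per_shape,
                     posterior_weights, confidence=0.95, rng=None):
    """Mix per-shape ratio samples by posterior weight, take quantiles of
    the mixture, and scale by the (k')th largest value in the sample."""
    rng = rng or np.random.default_rng(0)
    mixed = []
    for _ in range(10_000):
        i = rng.choice(len(posterior_weights), p=posterior_weights)
        mixed.append(rng.choice(ratio_samples_per_shape[i]))
    lo, hi = np.quantile(mixed, [(1 - confidence) / 2, (1 + confidence) / 2])
    return lo * estimator_value, hi * estimator_value
```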

Bayesian Framework Step 3: Confidence Bound the Extreme Value (cont.)
 The remaining problem: for a given shape, how do we efficiently obtain its ratio distribution?
 We devised a method called TKD sampling (details in the paper) that accomplishes this in O(num × k') time
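For contrast, a naive Monte Carlo way to obtain a single shape's ratio distribution (not the paper's TKD sampling): repeatedly generate a full data set from that shape and record the ratio of its kth largest value to the (k')th largest value of a sample. This costs roughly O(num × |D|) per shape, which is the inefficiency TKD sampling is designed to avoid.

```python
import numpy as np

def ratio_samples_for_shape(alpha, n_data, n_sample, k, k_prime,
                            num=1_000, rng=None):
    """Naive per-shape ratio distribution: O(num * n_data), unlike TKD."""
    rng = rng or np.random.default_rng(0)
    ratios = np.empty(num)
    for i in range(num):
        d = rng.gamma(shape=alpha, scale=1.0, size=n_data)  # scale is irrelevant to the ratio
        kth = np.sort(d)[-k]
        s = rng.choice(d, size=n_sample, replace=False)
        ratios[i] = kth / np.sort(s)[-k_prime]
    return ratios
```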


Experiment Setup
 7 real multi-dimensional data sets
 A query is created by:
1. A tuple t is randomly selected
2. A selectivity s is randomly picked from 5% to 20%
3. The s × (DB size) nearest neighbors of t are chosen as the query result set
4. The scoring function is a randomly-weighted sum of three arbitrary attributes
 500 training queries & 500 testing queries
 Confidence level is set to 95%

Experiment Results
Coverage rates of the 95% confidence bound with a 10% sample, for k = 1, 5, 10, and 20.
Data sets: Letter, CAHouse, El Nino, Cover Type, KDDCup, Person, Household.

Experiment Results (cont.)
 Effect of increasing the sample size, for k=1 and k=10

Conclusion
 Defined the problem of estimating the kth extreme value in a data set
 Proposed a Bayesian approach
 Verified the approach experimentally on real data

Questions?