Clustering V. Outline: Validating clustering results; Randomization tests.


Clustering V

Outline Validating clustering results Randomization tests

Cluster Validity
All clustering algorithms, provided with a set of points, output a clustering. How to evaluate the "goodness" of the resulting clusters? Tricky, because "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
– To decide whether there is noise in the data

Clusters found in Random Data
[Figure: clusterings produced by K-means, DBSCAN, and complete link on uniformly random points]

Use the objective function F
Dataset X, objective function F, algorithms A1, A2, …, Ak.
Question: which algorithm is the best for this objective function?
Compute R1 = A1(X), R2 = A2(X), …, Rk = Ak(X) and compare F(R1), F(R2), …, F(Rk).
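As a minimal sketch of this comparison, assume F is the sum-of-squared-errors (SSE) objective and take two hand-made candidate clusterings standing in for the algorithm outputs R1 = A1(X) and R2 = A2(X); both names and data are hypothetical:

```python
def sse(clusters):
    """Objective F: total squared distance of each 2-D point to its cluster
    centroid. Smaller is better for this objective."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

# Hypothetical results R1 = A1(X), R2 = A2(X) on the same dataset X
R1 = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
R2 = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]
best = min([R1, R2], key=sse)  # the result that minimizes F wins
```

Here R1 groups nearby points together, so F(R1) < F(R2) and the comparison selects R1.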

Evaluating clusters
Function H computes the cohesiveness of a cluster (e.g., smaller values mean larger cohesiveness). Examples of cohesiveness? The goodness of a cluster c is H(c); cluster c is better than c' if H(c) < H(c').
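One possible instantiation of H, sketched here under the assumption that points are 2-D tuples, is the average pairwise distance inside the cluster:

```python
import math

def cohesion(cluster):
    """One possible H: average pairwise Euclidean distance within the
    cluster. Smaller values mean a more cohesive cluster."""
    pts = list(cluster)
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

tight = [(0, 0), (0, 1), (1, 0)]
spread = [(0, 0), (4, 0), (0, 4)]
# H(tight) < H(spread): the tight cluster is better under this H
```

Other choices with the same "smaller is better" convention include the maximum pairwise distance (diameter) or the average distance to the centroid.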

Evaluating clusterings using cluster cohesiveness?
For a clustering C consisting of k clusters c1, …, ck: H(C) = Φi H(ci). What is Φ?
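Φ is an aggregation operator over the per-cluster scores; natural candidates are the sum, the maximum (worst cluster), or a size-weighted average. A tiny sketch, using cluster size as a stand-in for a real H:

```python
def H_clustering(clusters, H, phi=sum):
    """H(C) = phi of the per-cluster scores H(c_i); phi is the aggregation
    operator (sum, max, ...)."""
    return phi(H(c) for c in clusters)

C = [[1, 2, 3], [4, 5]]           # toy clustering with a toy H = len
total = H_clustering(C, len)       # phi = sum
worst = H_clustering(C, len, phi=max)  # phi = max
```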

Cluster separation?
Function S measures the separation between two clusters ci, cj. Ideas for S(ci, cj)? How can we measure the goodness of a clustering C = {c1, …, ck} using the separation function S?
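One possible S, sketched for 2-D point tuples, is the single-link distance between the two clusters:

```python
import math

def separation(ci, cj):
    """One possible S: single-link distance, i.e. the minimum distance
    between a point of c_i and a point of c_j. Larger is better."""
    return min(math.dist(p, q) for p in ci for q in cj)

# The goodness of a whole clustering C could then be, e.g., the minimum
# separation over all pairs of clusters, possibly traded off against cohesion.
```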

Silhouette Coefficient
Combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings. For an individual point i:
– a = average distance of i to the points in the same cluster
– b = min (average distance of i to points in another cluster)
– silhouette coefficient of i: s = 1 – a/b if a < b
– Typically between 0 and 1; the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
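A sketch of the per-point computation, assuming points are 2-D tuples; it uses the general form s = (b − a)/max(a, b), which equals 1 − a/b when a < b:

```python
import math

def silhouette(i, own, others):
    """Silhouette of point i: a = average distance to the other points of its
    own cluster; b = minimum over the other clusters of the average distance
    to that cluster; s = (b - a) / max(a, b)."""
    a = sum(math.dist(i, p) for p in own if p != i) / (len(own) - 1)
    b = min(sum(math.dist(i, p) for p in c) / len(c) for c in others)
    return (b - a) / max(a, b)

# Point close to its own cluster, far from the other one -> s close to 1
s = silhouette((0, 0), own=[(0, 0), (0, 1)], others=[[(5, 0), (5, 1)]])
```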

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” Algorithms for Clustering Data, Jain and Dubes Final Comment on Cluster Validity

Assessing the significance of clustering (and other data mining) results
Dataset X and algorithm A; a beautiful result A(X). But what does it mean? How to determine whether the result is really interesting or just due to chance?

Examples
Pattern discovery: frequent itemsets or association rules. From data X we can find a collection of nice patterns. The significance of individual patterns is sometimes straightforward to test. What about the whole collection of patterns? Is it surprising to see such a collection?

Examples
In clustering or mixture modeling we always get a result. How to test whether the whole idea of components/clusters in the data is good? Do clusters really exist in the data?

Classical methods – Hypothesis testing
Example: two datasets of real numbers X and Y (|X| = |Y| = n).
Question: are the means of X and Y (resp. E(X), E(Y)) significantly different?
Test statistic: t = (E(X) – E(Y))/s, where s is an estimate of the standard error of the difference in means. Under certain assumptions, the test statistic follows the t distribution with 2n – 2 degrees of freedom.
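A sketch of the pooled two-sample t statistic for equal sample sizes (a standard construction, not specific to these slides):

```python
import math

def two_sample_t(x, y):
    """Pooled two-sample t statistic; with |x| = |y| = n it has 2n - 2
    degrees of freedom (under normality and equal variances)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (n - 1)  # sample variance of y
    s = math.sqrt((vx + vy) / n)  # estimated std. error of E(X) - E(Y)
    return (mx - my) / s, 2 * n - 2

t, df = two_sample_t([1, 2, 3, 4], [5, 6, 7, 8])
```

The resulting t would then be compared against the t distribution with df degrees of freedom to obtain a p-value.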

Classical methods – Hypothesis testing
The result can be something like: "the difference in the means is significant at the level of 0.01". That is, if we take two samples of size n, such a difference would occur by chance only in about 1 out of 100 trials. Problems:
– What if we are testing many hypotheses (multiple hypothesis testing)?
– What if there is no closed form available?

Classical methods: testing independence
Are columns X and Y independent? Independence: Pr(X,Y) = Pr(X)*Pr(Y).
Pr(X=1) = 8/11, Pr(X=0) = 3/11, Pr(Y=1) = 8/11, Pr(Y=0) = 3/11.
Actual joint probabilities: Pr(X=1,Y=1) = 6/11, Pr(X=1,Y=0) = 2/11, Pr(X=0,Y=1) = 2/11, Pr(X=0,Y=0) = 1/11.
Expected joint probabilities (under independence): Pr(X=1,Y=1) = 64/121, Pr(X=1,Y=0) = 24/121, Pr(X=0,Y=1) = 24/121, Pr(X=0,Y=0) = 9/121.

Testing independence using χ2
Are columns X and Y independent?

          Y=1   Y=0   ∑row
X=1        6     2     8
X=0        2     1     3
∑column    8     3    11

So what?
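The χ2 statistic for this table sums (observed − expected)²/expected over the four cells, with expected counts derived from the margins. A sketch, applied to the counts above:

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]:
    sum over cells of (observed - expected)^2 / expected, where
    expected = row_total * column_total / n."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

stat = chi_square_2x2([[6, 2], [2, 1]])  # the table above; roughly 0.076
```

A value this small (compared against the χ2 distribution with 1 degree of freedom) gives no evidence against independence, which is the point of the "So what?" question.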

Classical methods – Hypothesis testing
The result can be something like: "the dependence between X and Y is significant at the level of 0.01". That is, if we take two independent columns X and Y with the observed Pr(X=1) and Pr(Y=1) and n rows, such a degree of association would occur by chance only in about 1 out of 100 trials.

Problems with classical methods
– What if we are testing many hypotheses (multiple hypothesis testing)?
– What if there is no closed form available?

Randomization methods Goal: assessing the significance of results – Could the result have occurred by chance? Methodology: create datasets that somehow reflect the characteristics of the true data

Randomization methods
Create randomized versions X1, X2, …, Xk of the data X. Run algorithm A on these, producing results A(X1), A(X2), …, A(Xk). Check if the result A(X) on the real data is somehow different from these. Empirical p-value: the fraction of randomized datasets for which the result A(Xi) is (say) larger than the result A(X) on the real data. If the empirical p-value is small, then there is something interesting in the data.
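The scheme above can be sketched generically; the statistic, the null model, and the toy data below are all hypothetical placeholders:

```python
import random

def empirical_p_value(x, statistic, randomize, k=1000, seed=0):
    """Fraction of randomized datasets whose statistic is at least as large
    as the statistic on the real data (one-sided, 'large is interesting').
    The add-one correction keeps the p-value strictly positive."""
    rng = random.Random(seed)
    observed = statistic(x)
    hits = sum(statistic(randomize(x, rng)) >= observed for _ in range(k))
    return (hits + 1) / (k + 1)

# Toy example: is the mean of the data surprisingly large under a uniform null?
data = [0.9] * 10
mean = lambda xs: sum(xs) / len(xs)
uniform_null = lambda xs, rng: [rng.random() for _ in xs]
p = empirical_p_value(data, mean, uniform_null)
```

Here the randomized means cluster around 0.5, so the observed mean of 0.9 yields a small empirical p-value.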

Randomization for testing independence
Let Px = Pr(X=1) and Py = Pr(Y=1). Generate random instances of columns (Xi, Yi) with parameters Px and Py [independence assumption]. p-value: compute in how many random instances the χ2 statistic is greater/smaller than its value in the input data.
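A self-contained sketch of this procedure for two binary columns (the column data below is made up for illustration):

```python
import random

def chi2_columns(x, y):
    """Chi-square statistic for two binary columns x, y, using the
    marginal frequencies Px, Py to form expected cell counts."""
    n = len(x)
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(x, y):
        counts[(a, b)] += 1
    px, py = sum(x) / n, sum(y) / n
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            e = n * (px if a else 1 - px) * (py if b else 1 - py)
            if e > 0:
                stat += (counts[(a, b)] - e) ** 2 / e
    return stat

def independence_p_value(x, y, k=1000, seed=0):
    """Generate k independent column pairs with parameters Px, Py and count
    how often their chi-square value reaches the observed one."""
    rng = random.Random(seed)
    n, px, py = len(x), sum(x) / len(x), sum(y) / len(y)
    observed = chi2_columns(x, y)
    hits = 0
    for _ in range(k):
        xi = [int(rng.random() < px) for _ in range(n)]
        yi = [int(rng.random() < py) for _ in range(n)]
        hits += chi2_columns(xi, yi) >= observed
    return (hits + 1) / (k + 1)
```

For perfectly correlated columns the observed χ2 is never reached by the independent random instances, so the empirical p-value is tiny.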

Randomization methods for other tasks
Instantiation of randomization for clustering? Instantiation of randomization for frequent-itemset mining?

Columnwise randomization: no global view of the data

X and Y are not more surprisingly correlated given that they both have 1s in dense rows and 0s in sparse rows

Questions
What is a good way of randomizing the data? Can the sample X1, X2, …, Xk be computed efficiently? Can the values A(X1), A(X2), …, A(Xk) be computed efficiently?

What is a good way of randomizing the data?
How are the datasets Xi generated? What is the underlying "null model"/"null hypothesis"?

Swap randomization
0-1 data: n rows, m columns, presence/absence. Randomize the dataset by generating random datasets with the same row and column margins as the original data.
Reference: A. Gionis, H. Mannila, T. Mielikainen and P. Tsaparas: Assessing data-mining results via swap randomization (TKDD 2006).

Basic idea
Maintains the degree structure of the data. Such datasets can be generated by swaps.
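A sketch of the swap operation: repeatedly pick two rows and two columns, and if the induced 2x2 submatrix is [[1,0],[0,1]] or [[0,1],[1,0]], flip it. Each swap leaves every row and column margin unchanged. (A practical implementation would track the 1-entries for efficiency; this is the naive version.)

```python
import random

def swap_randomize(matrix, n_attempts=10000, seed=0):
    """Swap randomization of a 0-1 matrix: flip 2x2 submatrices of the form
    [[1,0],[0,1]] <-> [[0,1],[1,0]], preserving all row and column margins."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    for _ in range(n_attempts):
        r1, r2 = rng.sample(range(len(m)), 2)
        c1, c2 = rng.sample(range(len(m[0])), 2)
        # Swappable iff the diagonal entries agree and differ from the
        # anti-diagonal entries
        if m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1] != m[r1][c1]:
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m
```

After enough successful swaps the result is an (approximately uniform) sample from the datasets with the same margins as the input.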

Fixed margins Null hypothesis: the row and the column margins of the data are fixed If the marginal information is known, then what else can you say about the data? What other structure is there in the data?

Example
[Figure: a dataset with significant co-occurrence of X and Y vs. one with no significant co-occurrence of X and Y]

Swap randomization and clustering