1 CLUSTER VALIDITY: Clustering tendency

Facts
• Most clustering algorithms impose a clustering structure on the data set X at hand.
• However, X may not possess a clustering structure.
• Before we apply any clustering algorithm to X, it must first be verified that X possesses a clustering structure. This task is known as assessing the clustering tendency of X.
• Clustering tendency is heavily based on hypothesis testing. Specifically, the randomness (null) hypothesis (H0) is tested against the regularity hypothesis (H1) and the clustering hypothesis (H2).
  - Randomness hypothesis (H0): "The vectors of X are randomly distributed, according to the uniform distribution, in the sampling window of X (the compact convex support set of the underlying distribution of the vectors of X)."
  - Regularity hypothesis (H1): "The vectors of X are regularly spaced in the sampling window (that is, they are not too close to each other)."
  - Clustering hypothesis (H2): "The vectors of X form clusters."
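To make the three hypotheses concrete, the short Python sketch below generates toy data sets under each of them. The unit-hypercube sampling window, the grid spacing and the Gaussian blobs are assumptions chosen purely for illustration; they are not part of the original slides.

```python
import numpy as np

def sample_hypotheses(n=300, d=2, seed=0):
    """Toy data generators for the three hypotheses (illustrative only).

    H0 (randomness): points uniform in the unit hypercube (used here as the sampling window).
    H1 (regularity): a slightly jittered regular grid, so points are never too close together.
    H2 (clustering): a mixture of a few tight Gaussian blobs.
    """
    rng = np.random.default_rng(seed)

    # H0: uniform distribution in the sampling window.
    X_h0 = rng.uniform(0.0, 1.0, size=(n, d))

    # H1: regular grid with small jitter (roughly n points).
    side = max(2, int(round(n ** (1.0 / d))))
    axes = [np.linspace(0.05, 0.95, side)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    X_h1 = grid + rng.normal(0.0, 0.01, size=grid.shape)

    # H2: three compact Gaussian clusters.
    centers = rng.uniform(0.2, 0.8, size=(3, d))
    X_h2 = np.vstack([c + rng.normal(0.0, 0.03, size=(n // 3, d)) for c in centers])
    return X_h0, X_h1, X_h2
```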

2
p(q|H0), p(q|H1) and p(q|H2) are estimated via Monte Carlo simulations.
Some tests for spatial randomness, applicable when the dimensionality of the input space is greater than or equal to 2, are:
• Tests based on structural graphs
  - A test that utilizes the idea of the minimum spanning tree (MST)
• Tests based on nearest neighbor distances
  - The Hopkins test (sketched below)
  - The Cox-Lewis test
• A method based on sparse decomposition.
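A minimal sketch of the Hopkins test follows. It uses one common formulation of the statistic (nearest-neighbour distances raised to the power of the dimensionality) and approximates the sampling window by the bounding box of X; both are simplifying assumptions of this example. With this formulation, values near 0.5 are consistent with the randomness hypothesis, values close to 1 with clustering, and values close to 0 with regularity.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=None, seed=None):
    """One common formulation of the Hopkins statistic for clustering tendency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                 # sample size, typically much smaller than n

    # Sampling window approximated by the bounding box of X.
    lo, hi = X.min(axis=0), X.max(axis=0)
    tree = cKDTree(X)

    # u_j: nearest-neighbour distances from m uniformly drawn points to X.
    U = rng.uniform(lo, hi, size=(m, d))
    u, _ = tree.query(U, k=1)

    # w_j: nearest-neighbour distances from m randomly chosen data points
    # to the rest of X (k=2 so the point itself is skipped).
    idx = rng.choice(n, size=m, replace=False)
    w, _ = tree.query(X[idx], k=2)
    w = w[:, 1]

    return np.sum(u**d) / (np.sum(u**d) + np.sum(w**d))
```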

3
• Important notes:
  - Clustering algorithms should be applied to X only if both the randomness and the regularity hypotheses are rejected. Otherwise, methods other than clustering must be used to describe the structure of X.
  - Most studies in clustering tendency focus on the detection of compact clusters.
• The basic steps of the clustering tendency philosophy are (see the sketch below):
  - Definition of a test statistic q suitable for the detection of clustering tendency.
  - Estimation of the pdf of q under the null hypothesis H0, p(q|H0).
  - Estimation of p(q|H1) and p(q|H2); these are needed for measuring the power of q (the probability of making a correct decision when H0 is rejected) against the regularity and the clustering hypotheses.
  - Evaluation of q for the data set X at hand, and examination of whether it lies in the critical interval of p(q|H0) that corresponds to a predetermined significance level ρ.
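The sketch below illustrates the last two steps: p(q|H0) is estimated by Monte Carlo simulation over uniformly distributed reference sets, and the observed q is compared against the critical region. It assumes the bounding box of X as the sampling window and a one-sided test with a statistic for which large values indicate clustering tendency; both are assumptions made for this example.

```python
import numpy as np

def monte_carlo_tendency_test(X, statistic, n_ref=100, rho=0.05, seed=None):
    """Monte Carlo test of the randomness hypothesis H0 at significance level rho."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)       # crude estimate of the sampling window

    # Estimate p(q|H0) by evaluating q on uniformly distributed reference sets.
    q_ref = np.array([statistic(rng.uniform(lo, hi, size=(n, d)))
                      for _ in range(n_ref)])

    q_obs = statistic(X)
    # One-sided test: reject H0 if q_obs exceeds the (1 - rho) quantile of p(q|H0).
    threshold = np.quantile(q_ref, 1.0 - rho)
    return q_obs, threshold, bool(q_obs > threshold)
```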

4
• Cluster validity
  - In the sequel it is assumed that the clustering tendency procedure has indicated the existence of a clustering structure in X.
  - Applying a clustering algorithm to X with inappropriate values of the involved parameters may produce poor results. Hence the need for further evaluation of clustering results is apparent.
  - Cluster validity: the task of quantitatively evaluating the results of a clustering algorithm.
  - A clustering structure C resulting from an algorithm may be either a hierarchy of clusterings or a single clustering.

5
• Cluster validity may be approached from three possible directions:
  - C is evaluated in terms of an independently drawn structure, imposed on X a priori. The criteria used in this case are called external criteria.
  - C is evaluated in terms of quantities that involve the vectors of X themselves (e.g., the proximity matrix). The criteria used in this case are called internal criteria.
  - C is evaluated by comparing it with other clustering structures, resulting from applying the same clustering algorithm with different parameter values, or other clustering algorithms, to X. Criteria of this kind are called relative criteria.

6
• Cluster validity for the cases of external and internal criteria
  Hypothesis testing is employed:
  - The null hypothesis H0, which is a statement of randomness concerning the structure of X, is defined.
  - A reference data population is generated under the randomness hypothesis.
  - An appropriate statistic q, whose values are indicative of the structure of a data set, is defined.
  - The value of q that results from our data set X is compared against the values of q obtained for the elements of the reference (random) population.
  Ways of generating reference populations under the null hypothesis (each one used in different situations):
  - Random position hypothesis.
  - Random graph hypothesis.
  - Random label hypothesis (an example of its use is sketched below).
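Under the random label hypothesis, the reference population can be produced simply by permuting the labels of the imposed partition P while keeping C fixed. The sketch below is one such Monte Carlo test; the one-sided empirical p-value and the number of permutations are assumptions of this example, and `statistic` can be any external index (e.g., the Rand statistic).

```python
import numpy as np

def random_label_test(labels_c, labels_p, statistic, n_ref=200, seed=None):
    """Monte Carlo test of an external index under the random label hypothesis.

    labels_c: cluster labels produced by the clustering C.
    labels_p: labels of the independently drawn partition P.
    statistic: callable taking (labels_c, labels_p) and returning a scalar index.
    """
    rng = np.random.default_rng(seed)
    q_obs = statistic(labels_c, labels_p)

    # Reference population: recompute the index after random permutations of P's labels.
    q_ref = np.array([statistic(labels_c, rng.permutation(labels_p))
                      for _ in range(n_ref)])

    # Fraction of reference values at least as large as the observed one (one-sided).
    p_value = float(np.mean(q_ref >= q_obs))
    return q_obs, p_value
```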

7
• Statistics suitable for external criteria
  For the comparison of C with an independently drawn partition P of X (the first three can be computed from pair counts, as sketched below):
  - Rand statistic
  - Jaccard statistic
  - Fowlkes-Mallows index
  - Hubert's Γ statistic
  - Normalized Γ statistic
  For assessing the agreement between P and the proximity matrix P:
  - Γ statistic
• Statistics suitable for internal criteria
  Validation of a hierarchy of clusterings:
  - Cophenetic correlation coefficient (CPCC)
  - γ statistic
  - Kendall's τ statistic
  Validation of individual clusterings:
  - Γ statistic
  - Normalized Γ statistic
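The Rand, Jaccard and Fowlkes-Mallows indices are all functions of the same four pair counts. The brute-force O(N^2) sketch below spells out the definitions rather than aiming for efficiency; the function name and interface are choices made for this example.

```python
import numpy as np
from itertools import combinations

def pair_counting_indices(labels_c, labels_p):
    """Rand, Jaccard and Fowlkes-Mallows indices from pair counts.

    a: pairs placed in the same cluster by both C and P
    b: same cluster in C, different clusters in P
    c: different clusters in C, same cluster in P
    d: different clusters in both
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1

    rand = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c)
    fowlkes_mallows = a / np.sqrt((a + b) * (a + c))
    return rand, jaccard, fowlkes_mallows
```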

8
• Cluster validity for the case of relative criteria
  Let A denote the set of parameters of a clustering algorithm.
  Statement of the problem: "Among the clusterings produced by a specific clustering algorithm, for different values of the parameters in A, choose the one that best fits the data set X."
  We consider two cases:
  (a) A does not contain the number of clusters m. The estimation of the best set of parameter values is carried out as follows (a sketch follows this slide):
  - Run the algorithm for a wide range of values of its parameters.
  - Plot the number of clusters m versus the parameters in A.
  - Choose the widest range of parameter values for which m remains constant.
  - Adopt the clustering that corresponds to the values of the parameters in A that lie in the middle of this range.
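As an illustration of case (a), the sketch below sweeps the eps parameter of DBSCAN (a density-based algorithm whose parameter set A does not include m) and records the resulting number of clusters; the plateau in the plot of m versus eps is then read off. DBSCAN and scikit-learn are choices made for this example, not prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def sweep_dbscan_eps(X, eps_values, min_samples=5):
    """Number of clusters m obtained for each eps value (case (a) above)."""
    m_values = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Count clusters, ignoring the noise label -1.
        m = len(set(labels)) - (1 if -1 in labels else 0)
        m_values.append(m)
    # Plot m_values versus eps_values and pick the middle of the widest flat range.
    return np.array(m_values)
```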

9
(b) A does contain the number of clusters m. The estimation of the best set of parameter values is carried out as follows (a sketch follows this slide):
  - Select a suitable performance index q (the best clustering is identified in terms of q).
  - For m = m_min to m_max:
    o Run the algorithm r times using different sets of values for the other parameters in A, and compute q each time.
    o Choose the clustering that corresponds to the best q for this m.
    End for
  - Plot the best value of q for each m versus m. The presence of a significant knee indicates the number of clusters underlying X; adopt the clustering that corresponds to that knee. The absence of such a knee indicates that X possesses no clustering structure.
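A sketch of case (b) follows, using k-means as the clustering algorithm, different random initializations as the "other parameters", and the Davies-Bouldin index (lower is better) as the performance index q; all three choices are assumptions of this example. The returned values are plotted against m to look for a knee.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def sweep_number_of_clusters(X, m_min=2, m_max=10, r=5):
    """Best index value for each candidate m over r runs (case (b) above)."""
    best_q = {}
    for m in range(m_min, m_max + 1):
        scores = []
        for seed in range(r):
            labels = KMeans(n_clusters=m, n_init=10,
                            random_state=seed).fit_predict(X)
            scores.append(davies_bouldin_score(X, labels))
        best_q[m] = min(scores)      # Davies-Bouldin: smaller means better
    # Plot best_q versus m; a significant knee suggests the number of clusters.
    return best_q
```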

10
• Statistics suitable for relative criteria
  Hard clustering
  - Modified Hubert Γ statistic
  - Dunn and Dunn-like indices
  - Davies-Bouldin (DB) and DB-like indices
  Fuzzy clustering
  - Indices for clusters with point representatives (the first two are sketched below)
    o Partition coefficient (PC)
    o Partition entropy coefficient (PE)
    o Xie-Beni (XB) index
    o Fukuyama-Sugeno index
    o Total fuzzy hypervolume
    o Average partition density
    o Partition density
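Of the fuzzy indices listed above, the partition coefficient and the partition entropy coefficient are the simplest, since they depend only on the N x m membership matrix U produced by the fuzzy clustering algorithm. A minimal sketch:

```python
import numpy as np

def partition_coefficient(U):
    """Partition coefficient PC = (1/N) * sum_ij u_ij^2.

    U is the N x m fuzzy membership matrix (each row sums to 1).
    PC close to 1 indicates a nearly hard partition; PC close to 1/m a very fuzzy one.
    """
    return np.sum(U**2) / U.shape[0]

def partition_entropy(U, eps=1e-12):
    """Partition entropy PE = -(1/N) * sum_ij u_ij * log(u_ij).

    PE close to 0 indicates a nearly hard partition; larger values indicate fuzzier ones.
    """
    return -np.sum(U * np.log(U + eps)) / U.shape[0]
```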

11
• Statistics suitable for relative criteria (cont.)
  Fuzzy clustering (cont.)
  - Indices for shell-shaped clusters
    o Fuzzy shell density
    o Average partition shell density
    o Shell partition density
    o Total fuzzy average shell thickness