Estimating the Number of Data Clusters via the Gap Statistic Paper by: Robert Tibshirani, Guenther Walther and Trevor Hastie J.R. Statist. Soc. B (2001),

Estimating the Number of Data Clusters via the Gap Statistic
Paper by: Robert Tibshirani, Guenther Walther and Trevor Hastie, J.R. Statist. Soc. B (2001), 63, pp
BIOSTAT M278, Winter 2004
Presented by Andy M. Yip, February 19, 2004

Part I: General Discussion on Number of Clusters

Cluster Analysis
Goal: partition the observations {x_i} so that
– C(i) = C(j) if x_i and x_j are “similar”
– C(i) ≠ C(j) if x_i and x_j are “dissimilar”
A natural question: how many clusters?
– Input parameter to some clustering algorithms
– Validate the number of clusters suggested by a clustering algorithm
– Conform with domain knowledge?

What’s a Cluster?
– No rigorous definition
– Subjective
– Scale/resolution dependent (e.g. hierarchy)
A reasonable answer seems to be: application dependent (domain knowledge required)

What do we want? An index that tells us: Consistency/Uniformity
– more likely to be 2 than 3
– more likely to be 36 than 11
– more likely to be 2 than 36? (Depends: what if each circle represents 1000 objects?)

What do we want? An index that tells us: Separability
– increasing confidence that the number of clusters is 2 as the groups move further apart


Do we want? An index that is
– independent of cluster “volume”?
– independent of cluster size?
– independent of cluster shape?
– sensitive to outliers?
– etc.
Domain knowledge!

Part II: The Gap Statistic

Within-Cluster Sum of Squares (figure labels: x_i, x_j)

Measure of compactness of clusters
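The formula for the compactness measure is not in the transcript. The sketch below assumes the definition used in the Tibshirani, Walther and Hastie paper: D_r is the sum of squared Euclidean distances over all ordered pairs of points in cluster r, and W_k = Σ_r D_r / (2 n_r). The function names are illustrative, not from the slides. The second function computes the same quantity as the pooled sum of squared distances to the cluster means, which is cheaper.

```python
import numpy as np

def w_k_pairwise(X, labels):
    """W_k = sum_r D_r / (2 n_r), with D_r summed over all ordered pairs in cluster r."""
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        diffs = pts[:, None, :] - pts[None, :, :]   # all pairwise differences
        d_r = np.sum(diffs ** 2)                    # sum of squared distances
        total += d_r / (2 * len(pts))
    return total

def w_k_centroid(X, labels):
    """Same quantity, computed as squared distances to each cluster mean."""
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return total
```

The two forms agree because the sum of squared pairwise distances within a cluster equals 2 n_r times the sum of squared distances to the cluster mean.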

Using W_k to determine # clusters
Idea of L-Curve Method: use the k corresponding to the “elbow” (the most significant increase in goodness-of-fit)

Gap Statistic
Problem w/ using the L-Curve method:
– no reference clustering to compare
– the differences W_k − W_{k−1} are not normalized for comparison
Gap Statistic:
– normalize the curve log W_k vs. k
– null hypothesis: reference distribution
– Gap(k) := E*(log W_k) − log W_k
– Find the k that maximizes Gap(k) (within some tolerance)

Choosing the Reference Distribution
A single component is modelled by a log-concave distribution (strong unimodality, by Ibragimov’s theorem)
– f(x) = e^{ψ(x)}, where ψ(x) is concave
Counting # modes in a unimodal distribution doesn’t work: impossible to set C.I. for # modes ⇒ need strong unimodality

Choosing the Reference Distribution
Insights from the k-means algorithm:
– Note that Gap(1) = 0
– Find X* (log-concave) that corresponds to no cluster structure (k = 1)
– Solution in 1-D: the uniform distribution

However, in higher dimensions no log-concave distribution solves this problem
The authors suggest mimicking the 1-D case and using a uniform distribution as the reference in higher dimensions
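The equations for this step are missing from the transcript. As a hedged reconstruction following the notation of the Tibshirani, Walther and Hastie paper (not the slide itself), the population version of the gap for k-means and the 1-D result can be written as follows. Here S^p denotes the single-component (log-concave) distributions on R^p and U is the uniform distribution on [0, 1].

```latex
% Population gap; the k = 1 terms are subtracted so that g(1) = 0
g(k) = \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)}
     - \log\frac{\mathrm{MSE}_{X}(k)}{\mathrm{MSE}_{X}(1)},
\qquad
\mathrm{MSE}_X(k) = E\Bigl(\min_{\mu \in A_k} \lVert X - \mu \rVert^2\Bigr)

% Least favourable reference: we want g(k) \le 0 for all X \in S^p and all k \ge 1.
% In one dimension (p = 1) the infimum is attained by the uniform distribution:
\inf_{X \in S^1} \frac{\mathrm{MSE}_X(k)}{\mathrm{MSE}_X(1)}
  = \frac{\mathrm{MSE}_U(k)}{\mathrm{MSE}_U(1)},
\qquad U \sim \mathrm{Unif}[0,1]
```

Here A_k is the set of k centres that minimizes the expected squared distance.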

Two Types of Uniform Distributions
1. Align with feature axes (data-geometry independent)
Figure labels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations
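A minimal sketch of the first reference distribution, assuming the observations are held in an n × p NumPy array: each feature is drawn uniformly over that feature's observed range. The function name is illustrative.

```python
import numpy as np

def sample_uniform_box(X, rng):
    """Draw n reference points uniformly from the axis-aligned bounding box of X."""
    lo = X.min(axis=0)   # per-feature minimum
    hi = X.max(axis=0)   # per-feature maximum
    return rng.uniform(lo, hi, size=X.shape)
```

Because the box ignores the orientation of the data, strongly correlated features leave large empty corners inside it.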

Two Types of Uniform Distributions
2. Align with principal axes (data-geometry dependent)
Figure labels: Observations; Bounding Box (aligned with principal axes); Monte Carlo Simulations
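A sketch of the second reference distribution, assuming the usual recipe: centre the data, rotate onto the principal axes with an SVD, sample uniformly over the ranges in the rotated coordinates, then rotate back. The function name is illustrative.

```python
import numpy as np

def sample_uniform_pca_box(X, rng):
    """Draw n reference points uniformly from a box aligned with the principal axes of X."""
    mean = X.mean(axis=0)
    Xc = X - mean                                       # centre the columns
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are the principal axes
    Z = Xc @ Vt.T                                       # data in principal-axis coordinates
    Z_ref = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Z_ref @ Vt + mean                            # rotate back and restore the mean
```

For elongated or correlated data this box hugs the point cloud more closely than the axis-aligned one.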

Computation of the Gap Statistic
1. For b = 1 to B: compute a Monte Carlo sample X_1b, X_2b, …, X_nb from the reference distribution (n is # obs.)
2. For k = 1 to K:
– Cluster the observations into k groups and compute log W_k
– For b = 1 to B: cluster the b-th M.C. sample into k groups and compute log W_kb
– Compute Gap(k) = (1/B) Σ_b log W_kb − log W_k
– Compute sd(k), the s.d. of {log W_kb}, b = 1, …, B
– Set the total s.e. s_k = sd(k) · sqrt(1 + 1/B)
3. Find the smallest k such that Gap(k) ≥ Gap(k+1) − s_{k+1}
Error-tolerant normalized elbow!
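The procedure can be sketched end to end in NumPy. Several choices here are assumptions rather than anything the slides prescribe: the clustering is a plain Lloyd's k-means with random restarts, the reference is the axis-aligned uniform box, and the standard error and selection rule follow the paper (s_k = sd(k)·sqrt(1 + 1/B), smallest k with Gap(k) ≥ Gap(k+1) − s_{k+1}). All function names are illustrative.

```python
import numpy as np

def within_ss(X, labels):
    """Pooled within-cluster sum of squares around the cluster means (W_k)."""
    return sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
               for r in np.unique(labels))

def best_wk(X, k, rng, n_init=5, n_iter=50):
    """Smallest W_k over several random restarts of Lloyd's k-means."""
    best = np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)          # assign each point to nearest centre
            new_centers = np.array([
                X[labels == r].mean(axis=0) if np.any(labels == r) else centers[r]
                for r in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        best = min(best, within_ss(X, labels))
    return best

def gap_statistic(X, k_max, B=10, seed=0):
    """Return (k_hat, gap, s) for k = 1..k_max using a uniform-box reference."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = range(1, k_max + 1)
    log_w = np.array([np.log(best_wk(X, k, rng)) for k in ks])
    log_w_ref = np.empty((B, k_max))
    for b in range(B):
        X_ref = rng.uniform(lo, hi, size=X.shape)   # b-th Monte Carlo sample
        for j, k in enumerate(ks):
            log_w_ref[b, j] = np.log(best_wk(X_ref, k, rng))
    gap = log_w_ref.mean(axis=0) - log_w            # Gap(k)
    s = log_w_ref.std(axis=0) * np.sqrt(1 + 1 / B)  # total s.e. s_k
    for j in range(k_max - 1):
        if gap[j] >= gap[j + 1] - s[j + 1]:         # smallest k passing the rule
            return j + 1, gap, s
    return k_max, gap, s
```

Calling `gap_statistic(X, k_max=4)` on two tight, well-separated blobs should favour k = 2. Results depend on the seed, on B, and on how well the k-means restarts converge, so this is a demonstration rather than a tuned implementation.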

2-Cluster Example

No-Cluster Example (tech. report version)

No-Cluster Example (journal version)

Example on DNA Microarray Data
6834 genes, 64 human tumours

The Gap curve rises at k = 2 and k = 6

Other Approaches
– Calinski and Harabasz ’74
– Krzanowski and Lai ’85
– Hartigan ’75
– Kaufman and Rousseeuw ’90 (silhouette)

Simulations (50x)
a. 1 cluster: 200 points in 10-D, uniformly distributed
b. 3 clusters: each with 25 or 50 points in 2-D, normally distributed, w/ centers (0,0), (0,5) and (5,-3)
c. 4 clusters: each with 25 or 50 points in 3-D, normally distributed, w/ centers randomly chosen from N(0,5I) (simulations w/ clusters having min distance less than 1.0 were discarded)
d. 4 clusters: each w/ 25 or 50 points in 10-D, normally distributed, w/ centers randomly chosen from N(0,1.9I) (simulations w/ clusters having min distance less than 1.0 were discarded)
e. 2 clusters: each cluster contains 100 points in 3-D, elongated shape, well-separated

Overlapping Classes
– 50 observations from each of two bivariate normal populations with means (0,0) and (Δ,0), and covariance I
– Δ takes 10 values in [0, 5]
– 10 simulations for each Δ

Conclusions
– Gap outperforms existing indices by normalizing against the 1-cluster null hypothesis
– Gap is simple to use
– No study on data sets having hierarchical structures is given
– Choice of reference distribution in high-D cases?
– Clustering algorithm dependent?