Chapter 1 Introduction to Clustering. Section 1.1 Introduction.


Presentation transcript:

Chapter 1 Introduction to Clustering

Section 1.1 Introduction

3 Objectives
– Introduce clustering and unsupervised learning.
– Explain the various forms of cluster analysis.
– Outline several key distance metrics used as estimates of experimental unit similarity.

4 Course Overview
– Variable Selection: VARCLUS
– Plot Data: PRINCOMP, MDS, CANDISC
– Preprocessing: ACECLUS
– ‘Fuzzy’ Clustering: FACTOR
– Discrete Clustering
  – Hierarchical Clustering: CLUSTER
  – Optimization Clustering
    – Parametric Clustering: FASTCLUS
    – Non-Parametric Clustering: MODECLUS

5 Definition
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.” B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
[Figure: Cluster Solution. Labels: Given Class, Derived Class; Sensible, Interpretable; Uninterpretable]

6 Unsupervised Learning
Learning without a priori knowledge about the classification of samples; learning without a teacher. Kohonen (1995), “Self-Organizing Maps”

Section 1.2 Types of Clustering

8 Objectives
Distinguish between the two major classes of clustering methods:
– hierarchical clustering
– optimization (partitive) clustering.

9 Hierarchical Clustering
[Figure labels: Agglomerative, Divisive, Iteration]

10 Propagation of Errors
[Figure labels: Iteration, (error)]

11 Optimization (Partitive) Clustering
[Figure: “seeds” (marked X) and observations, shown in an Initial State and a Final State; each seed moves from its old location to a new location]

12 Heuristic Search
1. Find an initial partition of the n objects into g groups.
2. Calculate the change in the error function produced by moving each observation from its own cluster to another group.
3. Make the change resulting in the greatest improvement in the error function.
4. Repeat steps 2 and 3 until no move results in improvement.
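The four steps can be sketched in Python as follows. This is a minimal illustration, not the procedure used by any SAS routine. The slide does not name the error function, so the sketch assumes the within-cluster sum of squared deviations from each group's centroid; the function names and the rule that a move may not empty a group are also assumptions made for the example.

```python
def within_cluster_sse(points, labels, g):
    """Assumed error function: squared distance of each point to its group centroid."""
    total = 0.0
    for k in range(g):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            continue
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum((v - m) ** 2 for p in members for v, m in zip(p, centroid))
    return total


def heuristic_search(points, initial_labels, g):
    labels = list(initial_labels)                 # step 1: initial partition
    error = within_cluster_sse(points, labels, g)
    while True:
        best_gain, best_move = 1e-12, None        # ignore gains at rounding level
        for i, old in enumerate(labels):          # step 2: try every single move
            if labels.count(old) == 1:
                continue                          # assumption: never empty a group
            for new in range(g):
                if new == old:
                    continue
                labels[i] = new
                gain = error - within_cluster_sse(points, labels, g)
                labels[i] = old
                if gain > best_gain:
                    best_gain, best_move = gain, (i, new)
        if best_move is None:                     # step 4: stop when nothing improves
            return labels, error
        i, new = best_move                        # step 3: make the best move
        labels[i] = new
        error = within_cluster_sse(points, labels, g)


# Hypothetical data: two tight pairs, deliberately mixed in the initial partition.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_labels, final_error = heuristic_search(points, [0, 1, 0, 1], g=2)
print(final_labels, final_error)
```

Because only one observation moves per pass, the result depends on the initial partition, and the search can stop at a local optimum.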

Section 1.3 Similarity Metrics

14 Objectives
– Define similarity and what constitutes a good measure of similarity.
– Describe a variety of similarity metrics.

15 What Is Similarity?
Although the concept of similarity is fundamental to our thinking, it is often difficult to quantify precisely. Which is more similar to a duck: a crow or a penguin? The metric that you choose to operationalize similarity (for example, Euclidean distance or Pearson correlation) often affects the clusters you recover.

16 What Makes a Good Similarity Metric?
The following principles have been identified as a foundation of any good similarity metric:
1. symmetry: d(x,y) = d(y,x)
2. non-identical distinguishability: if d(x,y) ≠ 0 then x ≠ y
3. identical non-distinguishability: if d(x,y) = 0 then x = y
Some popular similarity metrics (for example, correlation) fail to meet one or more of these criteria.
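The three principles can be checked numerically on a small set of points. The sketch below is illustrative only: the function names, the sample points, and the choice of 1 − r as the correlation-based dissimilarity are assumptions for the example. Principle 2 is tested through its contrapositive, d(x,x) = 0.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_dissimilarity(x, y):
    """Assumed dissimilarity: 1 minus the Pearson correlation of x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def check_principles(d, points, tol=1e-9):
    pairs = list(product(points, repeat=2))
    return {
        # 1. d(x,y) = d(y,x)
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y in pairs),
        # 2. d(x,y) != 0 implies x != y, checked as d(x,x) = 0
        "non_identical_distinguishability": all(abs(d(x, x)) <= tol for x in points),
        # 3. d(x,y) = 0 implies x = y
        "identical_non_distinguishability": all(
            x == y for x, y in pairs if abs(d(x, y)) <= tol
        ),
    }


# Hypothetical points; the second is exactly twice the first.
points = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 1.0, 2.0)]
euclidean_result = check_principles(euclidean, points)
correlation_result = check_principles(correlation_dissimilarity, points)
print(euclidean_result)
print(correlation_result)
```

With these points the correlation-based measure gives a dissimilarity of zero for two different observations, so it fails the third principle, while Euclidean distance satisfies all three.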

17 Euclidean Distance Similarity Metric
Pythagorean Theorem: The square of the hypotenuse is equal to the sum of the squares of the other two sides.
[Figure: right triangle from (0, 0) to (x1, x2), with sides x1 and x2]

18 City Block Distance Similarity Metric
City block (Manhattan) distance is the distance between two points measured along axes at right angles.
[Figure: path between (w1, w2) and (x1, x2) along the axes]
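As a small worked example, both distances can be computed for a pair of points (w1, w2) and (x1, x2). The coordinates below are hypothetical and the function names are chosen for this sketch.

```python
import math


def euclidean_distance(w, x):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, x)))


def city_block_distance(w, x):
    """Distance measured along the axes: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(w, x))


w = (1.0, 2.0)   # hypothetical (w1, w2)
x = (4.0, 6.0)   # hypothetical (x1, x2)
print(euclidean_distance(w, x))   # 5.0
print(city_block_distance(w, x))  # 7.0
```

The city block distance is never smaller than the Euclidean distance between the same two points, because it follows the two sides of the right triangle rather than the hypotenuse.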

19 Correlation Similarity Metrics
[Figure: three panels. Similar: Tom and Marie. Dissimilar: Jerry and Marie. No Similarity: Tom and Jerry.]

20 The Problem with Correlation
[Table: columns Variable, Observation 1, Observation 2; rows for five variables x, followed by Mean (333) and Std. Dev.]
The correlation between observations 1 and 2 is a perfect 1.0, but are the observations really similar?

21 Density Estimate Based Similarity Metrics
Clusters can be seen as areas of increased observation density. Similarity is a function of the distance between the identified density bubbles (hyper-spheres).
[Figure: similarity shown as the distance between Density Estimate 1 (Cluster 1) and Density Estimate 2 (Cluster 2)]

22 Hamming Distance Similarity Metric
Gene expression levels under 17 conditions (low=0, high=1)
[Table: expression levels of Gene A and Gene B under conditions 1 … 17]
D_H = 5
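The Hamming distance counts the positions at which two binary profiles differ. The sketch below uses made-up profiles for 17 conditions, chosen so that they differ in five places; they are not the values from the slide, which did not survive in the transcript.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(1 for u, v in zip(a, b) if u != v)


# Hypothetical expression profiles (low=0, high=1) under 17 conditions.
gene_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
gene_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

d_h = hamming_distance(gene_a, gene_b)
print(d_h)  # 5
```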

23 The DISTANCE Procedure
General form of the DISTANCE procedure:

    PROC DISTANCE METHOD=method;
       COPY variables;
       VAR level (variables);
    RUN;

Both the PROC DISTANCE statement and the VAR statement are required.

24 Generating Distances
This demonstration illustrates the impact on cluster formation of two distance metrics generated by the DISTANCE procedure.
ch1s3d1
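To see why the choice of metric matters, the following Python sketch builds a pairwise distance matrix under two metrics and compares nearest neighbors. It is not the ch1s3d1 demonstration and does not reproduce PROC DISTANCE; the function names, the method labels, and the three observations are all assumptions for illustration.

```python
import math


def distance_matrix(points, method):
    """Pairwise distances between observations under the named (assumed) method."""
    if method == "euclid":
        dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    elif method == "cityblock":
        dist = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
    else:
        raise ValueError(f"unknown method: {method}")
    return [[dist(p, q) for q in points] for p in points]


def nearest_neighbor(matrix, i):
    """Index of the observation closest to observation i, excluding i itself."""
    others = [j for j in range(len(matrix)) if j != i]
    return min(others, key=lambda j: matrix[i][j])


names = ["A", "B", "C"]
points = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]   # hypothetical observations

nearest_euclid = names[nearest_neighbor(distance_matrix(points, "euclid"), 0)]
nearest_cityblock = names[nearest_neighbor(distance_matrix(points, "cityblock"), 0)]
print(nearest_euclid, nearest_cityblock)
```

Here observation A is closest to B under Euclidean distance but closest to C under city block distance, so a clustering method that merges nearest neighbors first would start differently under each metric.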

Section 1.4 Classification Performance

26 Objectives
– Use classification matrices to determine the quality of a proposed cluster solution.
– Use the chi-square and Cramer’s V statistics to assess the relative strength of the derived association.

27 Quality of the Cluster Solution
[Figure: three panels labeled Perfect Solution, Typical Solution, No Solution]

28 Probability of Cluster Assignment
The probability that a cluster number represents a given class is given by the cluster’s proportion of the row total.
[Figure: classification matrix shown as Frequency and as Probability]
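The conversion from frequencies to probabilities can be sketched as a row-wise division. The layout assumed here (rows are the given classes, columns are cluster numbers) and the counts are hypothetical.

```python
def row_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    result = []
    for row in table:
        total = sum(row)
        result.append([cell / total for cell in row])
    return result


# Hypothetical classification matrix: rows = given classes, columns = clusters.
frequency = [
    [45, 5],    # class 1: 45 observations in cluster 1, 5 in cluster 2
    [10, 40],   # class 2: 10 observations in cluster 1, 40 in cluster 2
]

probability = row_proportions(frequency)
for row in probability:
    print(row)
```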

29 The Chi-Square Statistic
The chi-square statistic (and associated probability)
– determine whether an association exists
– depend on sample size
– do not measure the strength of the association.

30 Measuring Strength of an Association
[Figure: Cramer’s V statistic on a scale from 0 (WEAK) to 1 (STRONG)]
Note: Cramer’s V ranges from -1 to 1 for 2×2 tables.
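The contrast between the two statistics can be illustrated with a short Python sketch. It uses the Pearson chi-square and the common unsigned form of Cramer's V, V = sqrt(chi-square / (n × min(rows − 1, columns − 1))), which is an assumption of this example and always lies between 0 and 1; it does not produce the signed value mentioned for 2×2 tables. The tables are hypothetical.

```python
import math


def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Unsigned Cramer's V (assumed form), scaled to remove the sample size."""
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi_square(table) / (n * k))


small = [[10, 20], [20, 10]]    # hypothetical classification matrix
large = [[20, 40], [40, 20]]    # same proportions, twice the sample size
perfect = [[30, 0], [0, 30]]    # every class falls in exactly one cluster

for name, table in [("small", small), ("large", large), ("perfect", perfect)]:
    print(name, chi_square(table), cramers_v(table))
```

Doubling every count doubles the chi-square statistic but leaves Cramer's V unchanged, which is why V is the more useful measure of how strongly the clusters line up with the classes.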