CLUSTERING PROXIMITY MEASURES


By Çağrı Sarıgöz
Submitted to Assoc. Prof. Turgay İbrikçi, EE 639

Classification
Classifying has been one of the crucial thought activities of humankind. It makes it easy to perceive the outside world and to act accordingly. Aristotle's classification of living things is one of the most famous classification works, dating back to ancient times.

Cluster Analysis
Cluster analysis brings mathematical methodology to the solution of classification problems. It deals with the classification or grouping of data into a set of categories or clusters. Data objects in the same cluster should be similar, and objects in different clusters should be dissimilar, in some context. Determining this context is generally a subjective matter.

Approaching the Data Objects
Feature Types:
- Continuous
- Discrete
- Binary
Measurement Levels:
- Qualitative: Nominal, Ordinal
- Quantitative: Interval, Ratio

Feature Types
A continuous feature can take a value from an uncountably infinite range, e.g. the exact weight of a person. A discrete feature, in contrast, has a finite or countably infinite range of values, e.g. the number of heartbeats of a person, in bpm. A binary feature is a special case of a discrete feature in which there are only two values the feature can take, e.g. the presence or absence of tattoos on a person's skin.

Measurement Levels: Qualitative
Features at the nominal level have no mathematical meaning; they are generally levels, states, or names, e.g. the color of a car or the condition of the weather. Features at the ordinal level are still just names, but with a certain order; the differences between the values remain meaningless in a mathematical sense, e.g. degrees of headache: none, slight, moderate, severe, unbearable.

Measurement Levels: Quantitative
At the interval level, the difference between feature values has a meaning, but there is no true zero in the range, i.e. the ratio between two values has no meaning. Example: IQ score. A person with an IQ of 140 is not necessarily twice as intelligent as a person with an IQ of 70. Features at the ratio level have all the properties of the other levels, plus a true zero, so that the ratio between two values has a mathematical meaning. Example: the number of cars in a parking lot.

Definition of Proximity Measures: Dissimilarity (Distance)
A dissimilarity or distance function D on a data set X is defined to satisfy these conditions:
- Symmetry: D(xi, xj) = D(xj, xi)
- Positivity: D(xi, xj) ≥ 0 for all xi and xj
It is called a dissimilarity metric if these conditions also hold:
- Triangle inequality: D(xi, xj) ≤ D(xi, xk) + D(xk, xj) for all xi, xj, and xk
- Reflexivity: D(xi, xj) = 0 iff xi = xj
It is called a semimetric if the triangle inequality does not hold. If the following condition also holds, it is called an ultrametric:
- D(xi, xj) ≤ max(D(xi, xk), D(xj, xk)) for all xi, xj, and xk
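The metric conditions above can be checked numerically on a finite set of points. Here is a minimal sketch in Python; the function names are illustrative, not from the slides, and the check uses the Euclidean distance (introduced on a later slide) as the test case:

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_dissimilarity_metric(D, points, tol=1e-12):
    """Check symmetry, positivity, reflexivity, and the triangle
    inequality for a distance function D over a finite point set."""
    for x, y in itertools.product(points, repeat=2):
        if D(x, y) < -tol:                      # positivity
            return False
        if abs(D(x, y) - D(y, x)) > tol:        # symmetry
            return False
        if (x == y) != (D(x, y) <= tol):        # reflexivity
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if D(x, y) > D(x, z) + D(z, y) + tol:   # triangle inequality
            return False
    return True

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
print(is_dissimilarity_metric(euclidean, pts))  # → True
```

A check over a finite sample can only refute, never prove, the metric properties, but it is a useful sanity test for a custom distance function.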

Definition of Proximity Measures: Similarity
A similarity function S is defined to satisfy the following conditions:
- Symmetry: S(xi, xj) = S(xj, xi)
- Positivity: 0 ≤ S(xi, xj) ≤ 1 for all xi and xj
It is called a similarity metric if the following additional conditions also hold:
- S(xi, xj) S(xj, xk) ≤ [S(xi, xj) + S(xj, xk)] S(xi, xk) for all xi, xj, and xk
- S(xi, xj) = 1 iff xi = xj

Proximity Measures for Continuous Variables
The Euclidean distance (also known as the L2 norm) between two d-dimensional data objects xi and xj is
D(xi, xj) = ( Σ_l (xil − xjl)² )^(1/2), for l = 1, …, d.
The Euclidean distance is a metric, tending to form hyperspherical clusters. Clusters formed with the Euclidean distance are invariant to translations and rotations in the feature space. Without normalization, however, features with large values and variances tend to dominate the other features. A commonly used remedy is data standardization, which gives each feature zero mean and unit variance:
xil = (xil* − ml) / sl,
where xil* is the raw data, and the sample mean ml and sample standard deviation sl are defined as
ml = (1/n) Σ_i xil*  and  sl = ( (1/n) Σ_i (xil* − ml)² )^(1/2), respectively.
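The effect of standardization can be seen on a toy data set in which one feature has a much larger scale than the other. A minimal NumPy sketch (the data values are made up for illustration):

```python
import numpy as np

# Toy data: rows are objects, columns are features (height in m,
# income in currency units). The second feature's scale would
# dominate an unnormalized Euclidean distance.
X = np.array([[1.70, 65000.0],
              [1.80, 72000.0],
              [1.65, 68000.0]])

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

def euclidean(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

print(euclidean(X[0], X[1]))  # dominated by the large-scale feature
print(euclidean(Z[0], Z[1]))  # both features now contribute comparably
```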

Proximity Measures for Continuous Variables
Another normalization approach scales each feature by its range: xil = (xil* − min_i xil*) / (max_i xil* − min_i xil*).
The Euclidean distance can be generalized as a special case of a family of metrics, called the Minkowski distance or Lp norm:
D(xi, xj) = ( Σ_l |xil − xjl|^p )^(1/p).
- p = 2: the Euclidean distance.
- p = 1: the city-block (Manhattan) distance, or L1 norm: D(xi, xj) = Σ_l |xil − xjl|.
- p → ∞: the sup distance, or L∞ norm: D(xi, xj) = max_l |xil − xjl|.
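The three special cases of the Minkowski distance can be covered by one small function; a sketch (the function name is illustrative):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (Lp) distance; p=1 gives city-block, p=2 Euclidean,
    p=inf the sup (L-infinity) distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())          # sup distance: largest coordinate gap
    return float((diff ** p).sum() ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))        # → 7.0  (city block)
print(minkowski(a, b, 2))        # → 5.0  (Euclidean)
print(minkowski(a, b, np.inf))   # → 4.0  (sup distance)
```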

Proximity Measures for Continuous Variables
The squared Mahalanobis distance is also a metric:
D(xi, xj) = (xi − xj)ᵀ S⁻¹ (xi − xj),
where S is the within-class covariance matrix, defined as S = E[(x − μ)(x − μ)ᵀ], μ is the mean vector, and E[·] calculates the expected value of a random variable. The Mahalanobis distance tends to form hyperellipsoidal clusters, which are invariant to any nonsingular linear transformation. Computing the inverse of S may cause some computational burden for large-scale data. When the features are uncorrelated and standardized, S reduces to the identity matrix, making the Mahalanobis distance equal to the Euclidean distance.
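A minimal NumPy sketch of the squared Mahalanobis distance (the covariance values below are made up for illustration):

```python
import numpy as np

def mahalanobis_sq(xi, xj, S):
    """Squared Mahalanobis distance between xi and xj for covariance S."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(d @ np.linalg.inv(S) @ d)

# With S = I, the squared Mahalanobis distance equals the squared
# Euclidean distance, as noted above.
print(mahalanobis_sq([0.0, 0.0], [3.0, 4.0], np.eye(2)))  # → 25.0

# A high-variance first feature shrinks that feature's contribution:
S = np.array([[4.0, 0.0],
              [0.0, 1.0]])
print(mahalanobis_sq([0.0, 0.0], [3.0, 4.0], S))  # → 3²/4 + 4²/1 = 18.25
```

For many pairwise distances over the same S, it is cheaper to factor S once (e.g. with a Cholesky decomposition) than to invert it for every pair.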

Proximity Measures for Continuous Variables
The point symmetry distance is based on the assumption that the cluster structure is symmetric:
D(xi, xr) = min over j ≠ i of ||(xi − xr) + (xj − xr)|| / ( ||xi − xr|| + ||xj − xr|| ),
where xr is a reference point (e.g. the centroid of the cluster) and ||·|| is the Euclidean norm. It measures the distance between an object xi and the reference point xr, given the other N − 1 objects, and is minimized when a symmetric pattern exists.
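A direct sketch of this definition in Python (the function name is illustrative; for a data set where one object is the exact mirror image of another through the reference point, the distance is zero):

```python
import numpy as np

def point_symmetry_distance(i, X, xr):
    """Point symmetry distance of object i w.r.t. reference point xr,
    minimized over the other N-1 objects. It is small when some xj
    mirrors xi through xr."""
    X = np.asarray(X, dtype=float)
    xr = np.asarray(xr, dtype=float)
    best = np.inf
    for j in range(len(X)):
        if j == i:
            continue
        num = np.linalg.norm((X[i] - xr) + (X[j] - xr))
        den = np.linalg.norm(X[i] - xr) + np.linalg.norm(X[j] - xr)
        if den > 0:
            best = min(best, num / den)
    return best

# x1 is the exact mirror image of x0 through the centroid (0, 0),
# so the point symmetry distance of x0 is 0.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]]
print(point_symmetry_distance(0, X, xr=[0.0, 0.0]))  # → 0.0
```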

Proximity Measures for Continuous Variables
A distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient, defined as
rij = Σ_l (xil − x̄i)(xjl − x̄j) / [ ( Σ_l (xil − x̄i)² )^(1/2) ( Σ_l (xjl − x̄j)² )^(1/2) ],
where x̄i = (1/d) Σ_l xil. The correlation coefficient lies in the range [−1, 1], with −1 and 1 indicating the strongest negative and positive correlation, respectively. We can therefore define the distance measure as
D(xi, xj) = (1 − rij) / 2,
which lies in the range [0, 1]. The features should be measured on the same scale; otherwise the mean and variance used in the Pearson correlation coefficient have no meaning.
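A minimal sketch of this correlation-based distance (the function name is illustrative): perfectly correlated profiles get distance 0, perfectly anti-correlated ones get distance 1.

```python
import math

def pearson_distance(x, y):
    """Distance D = (1 - r) / 2 derived from the Pearson correlation r."""
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
          math.sqrt(sum((b - my) ** 2 for b in y))
    r = num / den
    return (1.0 - r) / 2.0

print(pearson_distance([1, 2, 3], [2, 4, 6]))  # → 0.0 (r = 1)
print(pearson_distance([1, 2, 3], [6, 4, 2]))  # → 1.0 (r = -1)
```

Note that [1, 2, 3] and [2, 4, 6] have distance 0 despite very different magnitudes: the measure captures the shape of the profile, not its scale.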

Proximity Measures for Continuous Variables
Cosine similarity is an example of a similarity measure that can be used to compare a pair of data objects with continuous variables:
S(xi, xj) = xiᵀ xj / ( ||xi|| ||xj|| ),
which can be converted into a distance measure simply by using D(xi, xj) = 1 − S(xi, xj). Like the Pearson correlation coefficient, cosine similarity is unable to provide information on the magnitude of the differences.
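A minimal sketch of the cosine similarity and its derived distance (the function name is illustrative):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Parallel vectors of very different magnitude still have similarity 1:
# like the Pearson coefficient, the measure ignores magnitude differences.
print(cosine_similarity([1.0, 2.0], [10.0, 20.0]))      # ≈ 1.0
print(1.0 - cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # distance ≈ 1.0
```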

Examples and Applications of the Proximity Measures for Continuous Variables

Proximity Measures for Discrete Variables: Binary Variables
Invariant similarity measures for symmetric binary variables regard 1-1 matches and 0-0 matches as equally important; the unmatched pairs are weighted according to their contribution to the similarity. Counting, for a pair of objects, the 1-1 matches n11, the 0-0 matches n00, and the unmatched pairs n10 and n01, the simple matching coefficient is
S(xi, xj) = (n11 + n00) / (n11 + n10 + n01 + n00).
Its corresponding dissimilarity measure, D(xi, xj) = 1 − S(xi, xj), is known as the Hamming distance.

Proximity Measures for Discrete Variables: Binary Variables
Non-invariant similarity measures for asymmetric binary variables focus on 1-1 matches while ignoring the 0-0 matches, which are considered uninformative. Again, the unmatched pairs are weighted depending on their importance.
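Both families can be sketched from the four match counts. As the standard example of a non-invariant measure, the sketch below uses the Jaccard coefficient (not named on the slides), which drops n00 from both numerator and denominator:

```python
def binary_counts(x, y):
    """Counts of 1-1, 0-0, and unmatched pairs for two binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return n11, n00, n10, n01

def simple_matching(x, y):
    """Invariant measure: 1-1 and 0-0 matches count equally."""
    n11, n00, n10, n01 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n00 + n10 + n01)

def jaccard(x, y):
    """Non-invariant measure: 0-0 matches are ignored as uninformative."""
    n11, n00, n10, n01 = binary_counts(x, y)
    return n11 / (n11 + n10 + n01)

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
print(simple_matching(x, y))      # → 4/6 ≈ 0.667
print(1 - simple_matching(x, y))  # Hamming-style dissimilarity, 2/6
print(jaccard(x, y))              # → 2/4 = 0.5
```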

Proximity Measures for Discrete Variables with More than Two Values
One simple and direct approach is to map each variable into new binary features. This is simple, but it may introduce too many binary variables. A more effective and commonly used method is based on a matching criterion. For a pair of d-dimensional objects xi and xj, the similarity using the simple matching criterion is given as
S(xi, xj) = (1/d) Σ_l Sijl,
where Sijl = 1 if xil = xjl, and 0 otherwise.
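The matching criterion reduces to one line of code; a sketch with made-up categorical data:

```python
def simple_matching_categorical(x, y):
    """Fraction of features on which two categorical objects agree."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

# Made-up objects described by color, weather, and vehicle type.
a = ["red", "sunny", "suv"]
b = ["red", "cloudy", "suv"]
print(simple_matching_categorical(a, b))  # → 2/3 ≈ 0.667
```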

Proximity Measures for Discrete Variables with More than Two Values
Categorical features may display a certain order; these are known as ordinal features. In this case, the codes from 1 to Ml, where Ml is the highest level, are no longer meaningless in similarity measures: the closer two levels are, the more similar the two objects are in that feature. Objects with this type of feature can therefore be compared using the continuous dissimilarity measures. Since the number of possible levels varies across features, the original ranks ril* for the ith object in the lth feature are usually converted into new ranks ril in the range [0, 1]:
ril = (ril* − 1) / (Ml − 1).
The city-block or Euclidean distance can then be used.
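A sketch of the rank conversion, using the headache-severity levels from the earlier slide as the ordinal feature:

```python
# Headache severity as an ordinal feature with M = 5 levels, coded 1..5.
LEVELS = ["none", "slight", "moderate", "severe", "unbearable"]

def normalized_rank(level, levels=LEVELS):
    """Map an ordinal level to [0, 1]: r = (r* - 1) / (M - 1)."""
    r_star = levels.index(level) + 1  # original rank, 1..M
    return (r_star - 1) / (len(levels) - 1)

# After normalization, ordinary continuous distances apply.
print(normalized_rank("none"))      # → 0.0
print(normalized_rank("moderate"))  # → 0.5
print(abs(normalized_rank("severe") - normalized_rank("slight")))  # city-block: 0.5
```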

Proximity Measures for Mixed Variables
The similarity measure for a pair of d-dimensional mixed data objects xi and xj can be defined as
S(xi, xj) = Σ_l δijl Sijl / Σ_l δijl,
where Sijl is the similarity for the lth feature between the two objects, and δijl is a 0-1 coefficient that is 0 when the measurement of the lth feature is missing for either object, and 1 otherwise. Correspondingly, the dissimilarity measure can be obtained simply as D(xi, xj) = 1 − S(xi, xj). The component similarity for discrete variables is Sijl = 1 if xil = xjl, and 0 otherwise. For continuous variables,
Sijl = 1 − |xil − xjl| / Rl,
where Rl is the range of the lth variable over all objects: Rl = max_i xil − min_i xil.
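A minimal sketch of this mixed-variable (Gower-style) similarity; the feature tags, ranges, and data are made up for illustration, and None marks a missing measurement:

```python
def mixed_similarity(x, y, kinds, ranges):
    """Similarity over mixed features: "cat" features use exact match,
    "num" features use 1 - |diff| / range; missing values (None) set
    delta_ijl = 0 and drop the feature from both sums."""
    num = den = 0.0
    for xl, yl, kind, Rl in zip(x, y, kinds, ranges):
        if xl is None or yl is None:      # delta_ijl = 0: skip feature
            continue
        if kind == "cat":
            s = 1.0 if xl == yl else 0.0  # discrete component
        else:
            s = 1.0 - abs(xl - yl) / Rl   # continuous component
        num += s
        den += 1.0
    return num / den

kinds = ["cat", "num", "num"]
ranges = [None, 10.0, 100.0]  # R_l for the continuous features
a = ["red", 2.0, None]        # third measurement missing
b = ["red", 7.0, 40.0]
print(mixed_similarity(a, b, kinds, ranges))      # → (1 + 0.5) / 2 = 0.75
print(1 - mixed_similarity(a, b, kinds, ranges))  # dissimilarity: 0.25
```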

Questions?