Types of Data: How to Calculate Distance?
Dr. Ryan Benton
January 29, 2009

Book Information
- Data Mining: Concepts and Techniques, Chapter 7, Section 2, "Types of Data in Cluster Analysis."
- Advances in Instance-Based Learning Algorithms, dissertation by D. Randall Wilson, August; Chapters 4 and 5.
- Prototype Styles of Generalization, thesis by D. Randall Wilson, August; Chapter 3.

Data
- Each instance (point, record, example) is composed of one or more features.
- Each feature has a data type.
- Each data type has a range of values.

Data Types
- Interval-Scaled: Real, Integer, Complex
- Ratio-Scaled
- Binary: Symmetric, Asymmetric

Data Types (continued)
- Categorical
- Ordinal: Discrete, Continuous
- Others: Vectors, Shape, etc.

Comparing Instances
How does one compare instances? The need arises in:
- Clustering
- Classification: instance-based classifiers, artificial neural networks, support vector machines
The answer: distance functions (measures).

Distance Measure Properties
- d(i,j) >= 0 (non-negativity)
- d(i,i) = 0 (identity)
- d(i,j) = d(j,i) (symmetry)
- d(i,j) <= d(i,k) + d(k,j) (triangle inequality)
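As an aside (not from the original slides), these axioms can be spot-checked numerically. Below is a minimal Python sketch; the function names are my own, and the check only samples a finite set of points.

```python
import math

def euclidean(x, y):
    # Straight-line (L2) distance between two equal-length tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_metric_axioms(d, points, tol=1e-12):
    """Spot-check the four metric properties on a finite set of points."""
    for i in points:
        assert d(i, i) <= tol                        # d(i,i) = 0
        for j in points:
            assert d(i, j) >= -tol                   # non-negativity
            assert abs(d(i, j) - d(j, i)) <= tol     # symmetry
            for k in points:
                assert d(i, j) <= d(i, k) + d(k, j) + tol  # triangle inequality

check_metric_axioms(euclidean, [(0, 0), (1, 2), (3, 5), (-2, 4)])
print("All four properties hold on the sample points.")
```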

Interval-Scaled Variables
There are many different distance measures: Euclidean, Manhattan (city block), and Minkowski. For the purpose of discussion, assume all features in a data point are interval-scaled.

Euclidean
Also called the L2 norm; it measures the straight-line distance between two points:

d(i,j) = sqrt( sum_{z=1..n} (x_iz - x_jz)^2 )

where i, j are two different instances, n is the number of interval-scaled features, and x_iz is the value of the z-th feature of i.

Manhattan
Also called the L1 norm; it measures distance along axis-parallel (city-block) paths rather than a straight line:

d(i,j) = sum_{z=1..n} |x_iz - x_jz|

where i, j are two different instances, n is the number of interval-scaled features, and x_iz is the value of the z-th feature of i.

Minkowski
Euclidean and Manhattan are special cases of the Minkowski distance:

d(i,j) = ( sum_{z=1..n} |x_iz - x_jz|^p )^(1/p)

where p is a positive integer; p = 1 gives Manhattan and p = 2 gives Euclidean. Also called the L_p norm function.

Minkowski (weighted)
Not all features are equal: some are irrelevant, and some should be highly influential. A weighted form accounts for this:

d(i,j) = ( sum_{z=1..n} w_z * |x_iz - x_jz|^p )^(1/p)

where w_z is the 'weight' of feature z and w_z >= 0.

Example
x1 = (1,2), x2 = (3,5)
Euclidean: sqrt(2^2 + 3^2) = sqrt(13), approx. 3.61
Manhattan: 2 + 3 = 5
Minkowski (p=3): (2^3 + 3^3)^(1/3) = 35^(1/3), approx. 3.27
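For concreteness, here is a minimal Python sketch (my own, not from the slides) of the Minkowski family, including the weighted form from the previous slide; it reproduces the numbers in the example above.

```python
def minkowski(x, y, p=2, weights=None):
    """L_p distance between two sequences.
    p=1 gives Manhattan, p=2 gives Euclidean.
    weights, if given, must satisfy w_z >= 0."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * abs(a - b) ** p
               for w, a, b in zip(weights, x, y)) ** (1.0 / p)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, p=2))  # Euclidean: sqrt(13) = 3.605...
print(minkowski(x1, x2, p=1))  # Manhattan: 5.0
print(minkowski(x1, x2, p=3))  # Minkowski, p=3: 35 ** (1/3) = 3.271...
```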

Other Distance Measures
Canberra, Chebyshev, Quadratic, Mahalanobis, Correlation, Chi-Squared, Kendall's Rank Correlation, and so forth.

Problem
Feature value ranges may distort results. Example:
- Feature 1: [0, 2]
- Feature 2: [-2, 2]
Changes in feature 2 have a greater impact on the distance functions, simply because its range is wider.

Scaling
Scale each feature to a common range, such as [0, 1] or [-1, 1].
Possible issue: say the feature range is [0, 2] and 99% of the data is >= 1.5. Outliers then have a large impact on distance, while normal values have almost none.
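A small illustration (mine, not the slides') of min-max scaling and the outlier issue just described: one low outlier claims most of the scaled range, compressing the normal values.

```python
def min_max_scale(values):
    # Rescale a feature column to [0, 1] using its observed min and max.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# 99% of the data sits in [1.5, 2.0]; a single outlier sits at 0.
feature = [1.5, 1.6, 1.7, 1.8, 2.0, 0.0]
print(min_max_scale(feature))
# [0.75, 0.8, 0.85, 0.9, 1.0, 0.0]: the normal values are squeezed into
# [0.75, 1.0] while the outlier alone spans the bottom of the range.
```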

Normalize
Modify each feature so that its mean m_f = 0 and standard deviation sigma_f = 1:

y_if = (x_if - m_f) / sigma_f, with m_f = (1/N) * sum_{i=1..N} x_if

where y_if is the new feature value and N is the number of data points.

Z-Score

z_if = (x_if - m_f) / s_f, with s_f = (1/N) * sum_{i=1..N} |x_if - m_f|

where z_if is the z-score and s_f is the mean absolute deviation. Because deviations are not squared, s_f is more robust to outliers than the standard deviation.
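A sketch (assumptions mine) showing both standardizations side by side: dividing by the standard deviation versus dividing by the mean absolute deviation s_f.

```python
def standardize(values, robust=False):
    """Shift a feature column to mean 0, then divide by a spread measure.
    robust=False: divide by the standard deviation.
    robust=True:  divide by the mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n
    if robust:
        s = sum(abs(v - m) for v in values) / n              # mean absolute deviation
    else:
        s = (sum((v - m) ** 2 for v in values) / n) ** 0.5   # standard deviation
    return [(v - m) / s for v in values]

feature = [1.0, 2.0, 2.0, 3.0, 20.0]  # one large outlier
print(standardize(feature))               # z-scores via standard deviation
print(standardize(feature, robust=True))  # deviations are not squared, so
                                          # the outlier inflates s_f less
```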

Symmetric Binary
Assume, for now, that all features are symmetric binary. How to compare? We can use the Euclidean, Manhattan, or Minkowski functions, or the symmetric binary similarity measure below.

Symmetric Binary
q, r, s, and t count the four agreement/disagreement cases between objects i and j (p = q + r + s + t is the total number of features):

                 Object j
                   1      0     sum
Object i    1      q      r     q + r
            0      s      t     s + t
          sum    q + s  r + t     p

d(i,j) = (r + s) / (q + r + s + t)

Symmetric Binary Properties
- Range is [0, 1].
- 0 indicates a perfect match.
- 1 indicates no matches.

Asymmetric Binary
Assume, for now, that all features are asymmetric binary. As with symmetric binary features, we can use the Euclidean, Manhattan, or Minkowski functions. Alternately, we can use the asymmetric binary similarity measure below.

Asymmetric Binary
Using the same q, r, s, and t counts from the contingency table above:

d(i,j) = (r + s) / (q + r + s)

Asymmetric Binary Properties
- Range is [0, 1].
- 0 indicates a perfect match.
- 1 indicates no matches.
- Note: because a 0-0 match is considered unimportant for asymmetric features, t is not factored in.

Examples

Name   Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack     Y      N      P       N       N       N
Mary     Y      N      P       N       P       N

Map Y and P to 1, and N to 0. Then q = 2, r = 0, s = 1, t = 3.
Symmetric: d(Jack, Mary) = (0 + 1) / 6 = 0.167
Asymmetric: d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
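The same computation as a Python sketch (my own): build the q, r, s, t counts from 0/1 vectors and apply either formula. It reproduces the Jack/Mary numbers above.

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Contingency-count dissimilarity for two binary (0/1) vectors.
    symmetric=True:  (r + s) / (q + r + s + t)
    symmetric=False: (r + s) / (q + r + s), ignoring 0-0 matches."""
    q = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    r = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
    s = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
    t = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    denom = (q + r + s + t) if symmetric else (q + r + s)
    return (r + s) / denom

jack = [1, 0, 1, 0, 0, 0]  # Y N P N N N, with Y/P -> 1 and N -> 0
mary = [1, 0, 1, 0, 1, 0]  # Y N P N P N
print(binary_dissimilarity(jack, mary))                   # 1/6 = 0.166...
print(binary_dissimilarity(jack, mary, symmetric=False))  # 1/3 = 0.333...
```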

Categorical

d(i,j) = (p - m) / p

where p is the number of variables and m is the number of matches.

Example (using only Test-1, the categorical feature, so p = 1)

Student  Test-1 (categorical)  Test-2 (ordinal)  Test-3 (ratio)
1        Code-A                Excellent         445
2        Code-B                Fair              22
3        Code-C                Good              164
4        Code-A                Excellent         1,210

d(2,1) = (1 - 0) / 1 = 1
d(1,4) = (1 - 1) / 1 = 0
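The simple matching computation in a few lines of Python (my sketch), applied to the Test-1 column only:

```python
def categorical_distance(x, y):
    # d(i,j) = (p - m) / p, with p variables and m exact matches.
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

test1 = {1: ["Code-A"], 2: ["Code-B"], 3: ["Code-C"], 4: ["Code-A"]}
print(categorical_distance(test1[2], test1[1]))  # (1 - 0) / 1 = 1.0
print(categorical_distance(test1[1], test1[4]))  # (1 - 1) / 1 = 0.0
```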

Categorical Weighting
Weights can be added to:
- increase the effect of m, or
- increase the importance of variables with more states.
This can be done for binary features as well. By convention, the sum of the weights should equal 1.

Categorical: Other Measures
Value Difference Metric (VDM)
- For classification problems (not clustering).
- Estimates conditional probabilities for each feature value for each class.
- Distance is based on the difference in conditional probabilities.
- Includes a weighting scheme.
Modified Value Difference Metric
- Handles weight estimation differently.

Value Difference Metric (VDM)

d(i,j) = sum_{f=1..n} sum_{g=1..C} |P(g | x_if) - P(g | x_jf)|^q

where P(g | x_if) is the conditional probability of class g occurring given value x_if for feature f, C is the number of classes, n is the number of features, and q is either 1 or 2. Note: for simplification, weights are not included.
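A minimal sketch of the unweighted VDM (helper names and the toy data are my own). The conditional probabilities are estimated here from a labeled training set by simple counting, which is one common way to do it:

```python
from collections import Counter, defaultdict

def fit_value_class_probs(X, labels):
    """Estimate P(class g | value v) for every feature by counting."""
    n_features = len(X[0])
    counts = [defaultdict(Counter) for _ in range(n_features)]
    for row, g in zip(X, labels):
        for f, v in enumerate(row):
            counts[f][v][g] += 1
    classes = sorted(set(labels))
    probs = [{v: [c[g] / sum(c.values()) for g in classes]
              for v, c in counts[f].items()}
             for f in range(n_features)]
    return probs, classes

def vdm(probs, x, y, q=2):
    """Unweighted Value Difference Metric between instances x and y."""
    return sum(sum(abs(p1 - p2) ** q
                   for p1, p2 in zip(probs[f][a], probs[f][b]))
               for f, (a, b) in enumerate(zip(x, y)))

# Toy training set: one categorical feature, two classes.
X = [["red"], ["red"], ["blue"], ["blue"], ["red"]]
labels = ["pos", "pos", "neg", "neg", "neg"]
probs, _ = fit_value_class_probs(X, labels)
print(vdm(probs, ["red"], ["blue"]))  # (2/3)^2 + (2/3)^2 = 0.888...
```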

Ordinal
Assume all features are ordinal. Feature f has M_f ordered states, representing the ranking 1, 2, ..., M_f.
For each instance i and each feature f, replace the value x_if by its corresponding rank r_if. To calculate d(i,j), use the interval-scaled distance functions on the ranks.

Ordinal
As with interval-scaled features, there is a catch: different ordinal features may have different numbers of states, which gives them different implicit weights. Hence, scaling is necessary.

Example
Mapping to ranks: Fair = 1, Good = 2, Excellent = 3.
Normalized values, using z_if = (r_if - 1) / (M_f - 1): Fair = 0.0, Good = 0.5, Excellent = 1.0.

Student  Test-1 (categorical)  Test-2 (ordinal)  Test-3 (ratio)
1        Code-A                Excellent         445
2        Code-B                Fair              22
3        Code-C                Good              164
4        Code-A                Excellent         1,210
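The rank-and-scale step as a Python sketch (my own); it reproduces the normalized Test-2 values above.

```python
def normalize_ordinal(values, ordered_states):
    """Map ordinal values to ranks 1..M_f, then scale to [0, 1]
    via z = (r - 1) / (M_f - 1)."""
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}
    M = len(ordered_states)
    return [(rank[v] - 1) / (M - 1) for v in values]

test2 = ["Excellent", "Fair", "Good", "Excellent"]
print(normalize_ordinal(test2, ["Fair", "Good", "Excellent"]))
# [1.0, 0.0, 0.5, 1.0]
```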

Example
Euclidean distance on the normalized Test-2 values from the student table above (with a single feature, this reduces to the absolute difference):
d(1,2) = |1.0 - 0.0| = 1.0, d(1,3) = 0.5, d(2,3) = 0.5, d(1,4) = 0.0

Ordinal: Other Measures
- Hamming Distance
- Absolute Difference
- Normalized Absolute Difference
- Normalized Hamming Distance

Ratio-Scaled Can’t treat directly as Interval-Scaled The scale for Ratio-Scaled would lead to distortion of results. Apply a logarithmic transformation first. y if = log(x if ) Other type of transformation. Treat result as continuous Ordinal Data.

Example

Student  Test-1 (categorical)  Test-2 (ordinal)  Test-3 (ratio)  Test-3 (logarithmic)
1        Code-A                Excellent         445             2.65
2        Code-B                Fair              22              1.34
3        Code-C                Good              164             2.21
4        Code-A                Excellent         1,210           3.08

Euclidean distance on the logarithmic values, e.g.: d(1,2) = |2.65 - 1.34| = 1.31
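The transformation step in Python (a sketch; base-10 logarithms are assumed, since they reproduce the table's values):

```python
import math

test3 = [445, 22, 164, 1210]
logs = [math.log10(v) for v in test3]
print([round(v, 2) for v in logs])  # [2.65, 1.34, 2.21, 3.08]
# The transformed column can now be fed to any interval-scaled
# distance, e.g. the single-feature Euclidean distance d(1,2):
print(round(abs(logs[0] - logs[1]), 2))  # 1.31
```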

Mixed Types
The approaches above assume that all features are of the same type. This is rarely the case, so we need a distance function that handles all types.

Mixed Distance

d(i,j) = sum_{f=1..p} delta_ij^(f) * d_ij^(f) / sum_{f=1..p} delta_ij^(f)

where the indicator delta_ij^(f) for feature f is 0 if either x_if or x_jf is missing, or if x_if = x_jf = 0 and f is asymmetric binary; otherwise it is 1.

Mixed Distance
If feature f is interval-scaled, its contribution is

d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf)

where h runs over the non-missing values of feature f. This ensures the distance returned is in the range [0, 1].

Mixed Distance
If feature f is binary or categorical: d_ij^(f) = 0 if x_if = x_jf, else d_ij^(f) = 1.
If feature f is ordinal: compute the ranks, apply the ordinal scaling, then use the interval-scaled distance measure.

Mixed Distance
If feature f is ratio-scaled: apply a logarithmic (or similar) transform and then the interval-scaled distance, or treat it as ordinal data.

Mixed Distance
The distance contribution of each feature lies in [0, 1], so the final distance also lies in [0.0, 1.0].

Example

Student  Test-1 (categorical)  Test-2 (ordinal)  Test-3 (ratio)  Test-3 (logarithmic)
1        Code-A                Excellent         445             2.65
2        Code-B                Fair              22              1.34
3        Code-C                Good              164             2.21
4        Code-A                Excellent         1,210           3.08
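Putting it all together, a sketch (helper names and structure are my own) of the mixed distance over the full student table: exact match for the categorical feature, normalized ranks for the ordinal feature, and a range-normalized, log-transformed ratio feature. No values are missing and no feature is asymmetric binary, so every delta is 1 and the result is a plain average of the three per-feature distances.

```python
import math

# Student table: (Test-1 categorical, Test-2 ordinal, Test-3 ratio).
students = {
    1: ("Code-A", "Excellent", 445),
    2: ("Code-B", "Fair", 22),
    3: ("Code-C", "Good", 164),
    4: ("Code-A", "Excellent", 1210),
}
ORDINAL = {"Fair": 0.0, "Good": 0.5, "Excellent": 1.0}
LOGS = {k: math.log10(v[2]) for k, v in students.items()}
LOG_RANGE = max(LOGS.values()) - min(LOGS.values())

def mixed_distance(i, j):
    a, b = students[i], students[j]
    d_cat = 0.0 if a[0] == b[0] else 1.0           # binary/categorical rule
    d_ord = abs(ORDINAL[a[1]] - ORDINAL[b[1]])     # scaled-rank rule
    d_rat = abs(LOGS[i] - LOGS[j]) / LOG_RANGE     # log transform + range-normalize
    return (d_cat + d_ord + d_rat) / 3             # all deltas are 1 here

print(round(mixed_distance(1, 2), 2))  # 0.92
print(round(mixed_distance(1, 4), 2))  # 0.08
```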

Mixed Distance Problems
- It doesn't permit the use of more advanced distance functions for the interval-scaled features.
- Binary and categorical values have more potential impact than other types of features, since each mismatch contributes a full distance of 1.

Other Mixed Distance Measures
- Minkowski
- Heterogeneous Overlap-Euclidean Metric
- Heterogeneous Value Difference Metric
- Interpolated Value Difference Metric
- Windowed Value Difference Metric
- K* (violates some of the conditions for a distance measure)
This is not a complete list.

Questions?