Clustering different types of data


1 Clustering different types of data
Pasi Fränti

2 Data types
Numeric
Binary
Categorical
Text
Time series

3 Part I: Numeric data

4 Distance measures

Type | Possible operations | Example variable | Example values
Nominal | == | Major subject | Computer science, Mathematics, Physics
Ordinal | ==, <, > | Degree | Bachelor, Master, Licentiate, Doctor
Interval | ==, <, >, - | Temperature | 10 °C, 20 °C, 10 °F
Ratio | ==, <, >, -, / | Weight | 0 kg, 10 kg, 20 kg

5 Definition of distance metric
A distance function d is a metric if the following conditions are met for all data points x, y, z:
All distances are non-negative: d(x, y) ≥ 0
Distance from a point to itself is zero: d(x, x) = 0
All distances are symmetric: d(x, y) = d(y, x)
Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)

6 Common distance metrics
For p-dimensional vectors x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}) and x_j = (x_{j1}, x_{j2}, \ldots, x_{jp}), the Minkowski distance of order q is
d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}
Euclidean distance: q = 2. Manhattan distance: q = 1.

7 Distance metrics example
2D example: x_1 = (2, 8), x_2 = (6, 3).
Euclidean distance: \sqrt{(6-2)^2 + (3-8)^2} = \sqrt{41} \approx 6.40
Manhattan distance: |6-2| + |3-8| = 4 + 5 = 9
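A minimal Python sketch of the Minkowski family applied to these points (the function name is mine, not from the slides):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x1, x2 = (2, 8), (6, 3)
print(minkowski(x1, x2, 2))  # Euclidean: sqrt(4^2 + 5^2) = sqrt(41) ≈ 6.40
print(minkowski(x1, x2, 1))  # Manhattan: 4 + 5 = 9.0
```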

8 Chebyshev distance
In the limit q \to \infty, the Minkowski distance equals the maximum difference over the attributes:
d_{ij} = \max_{k} |x_{ik} - x_{jk}|
Useful if the worst case must be avoided. Example: for x_1 = (2, 8) and x_2 = (6, 3), d_{12} = \max(4, 5) = 5.
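A corresponding sketch of the limit case (again, my own naming):

```python
def chebyshev(x, y):
    """Minkowski distance in the limit q -> infinity: the largest attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev((2, 8), (6, 3)))  # max(4, 5) = 5
```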

9 Hierarchical clustering: cost functions
Three cost functions are in common use: single linkage, complete linkage and average linkage.

10 Single Link
The smallest distance between vectors in clusters S_i and S_j:
d(S_i, S_j) = \min_{x \in S_i,\, y \in S_j} d(x, y)

11 Complete Link
The largest distance between vectors in clusters S_i and S_j:
d(S_i, S_j) = \max_{x \in S_i,\, y \in S_j} d(x, y)

12 Average Link
The average distance between vectors in clusters S_i and S_j:
d(S_i, S_j) = \frac{1}{|S_i|\,|S_j|} \sum_{x \in S_i} \sum_{y \in S_j} d(x, y)
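The three cost functions as short Python helpers (a sketch; d is any point-level distance such as the Euclidean distance above, and the function names are mine):

```python
def single_link(A, B, d):
    """Smallest distance between any pair of vectors from clusters A and B."""
    return min(d(x, y) for x in A for y in B)

def complete_link(A, B, d):
    """Largest distance between any pair of vectors from clusters A and B."""
    return max(d(x, y) for x in A for y in B)

def average_link(A, B, d):
    """Average distance over all pairs of vectors from clusters A and B."""
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))
```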

13 Cost function example [Theodoridis, Koutroumbas, 2006]
[Figure: a data set of seven points x1-x7 with pairwise distances between 1 and 1.5, and the dendrograms produced by single link and complete link.]

14 Part II: Binary data

15 Hamming Distance (binary and categorical data)
The number of attribute values in which two vectors differ. For example, the distance between (toned) and (roses) is 3. On the 3-bit binary cube, 100 -> 011 has distance 3 (red path) and 010 -> 111 has distance 2 (blue path).
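A one-line Python version reproducing the slide's examples (function name mine):

```python
def hamming(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(hamming("toned", "roses"))  # 3
print(hamming("100", "011"))      # 3 (red path on the cube)
print(hamming("010", "111"))      # 2 (blue path on the cube)
```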

16 Hard thresholding of centroid
Each value of the centroid (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) is rounded to the nearest binary value, giving (0, 1, 1, 0, 0, 0).

17 Hard and soft centroids
[Figure: comparison on the Bridge data set (binary version).]

18 Distance and distortion
General distance function (here the L_q distance between a vector x and a centroid c): d(x, c) = \sum_{k} |x_k - c_k|^{q}
Distortion function: D = \sum_{j} \sum_{x \in S_j} d(x, c_j), the total distance of all vectors to the centroids of their clusters.

19 Distortion for binary data
Cost of a single attribute k in cluster j (with the L_q distance above): q_{jk} \, c_{jk}^{\,q} + r_{jk} (1 - c_{jk})^{q}, where q_{jk} is the number of zeroes, r_{jk} the number of ones, and c_{jk} the current centroid value for attribute k of cluster j.

20 Optimal centroid position
The optimal centroid position depends on the metric. Given the exponent q of the distance, the optimal position is
c_{jk} = \frac{r_{jk}^{1/(q-1)}}{q_{jk}^{1/(q-1)} + r_{jk}^{1/(q-1)}}
For q = 2 this is the mean r_{jk} / (q_{jk} + r_{jk}) (a soft centroid); as q \to 1 it approaches the majority value (the hard centroid).
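A sketch in Python, assuming the L_q attribute cost above (the slides' own formulas did not survive extraction, so the cost form is my reconstruction); setting the derivative to zero gives the closed-form optimum:

```python
def attribute_cost(zeros, ones, c, q=2.0):
    """Assumed L_q cost of one binary attribute: zeros * c^q + ones * (1 - c)^q."""
    return zeros * c ** q + ones * (1 - c) ** q

def optimal_centroid(zeros, ones, q=2.0):
    """Minimizer of the cost above: (c / (1 - c))^(q - 1) = ones / zeros."""
    if q == 1.0:                      # L_1 degenerates to a majority vote (hard centroid)
        return 1.0 if ones > zeros else 0.0
    e = 1.0 / (q - 1.0)
    return ones ** e / (zeros ** e + ones ** e)

print(optimal_centroid(4, 6))          # q = 2: the mean of the bits, 0.6 (soft)
print(optimal_centroid(4, 6, q=1.0))   # q -> 1: the majority value, 1.0 (hard)
```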

21 Example of centroid location

22 Centroid location

23 Categorical clustering
Three attributes:

Movie | Director | Actor | Genre
t1 (Godfather II) | Coppola | De Niro | Crime
t2 (Good Fellas) | Scorsese | De Niro | Crime
t3 (Vertigo) | Hitchcock | Stewart | Thriller
t4 (N by NW) | Hitchcock | Grant | Thriller
t5 (Bishop's Wife) | Koster | Grant | Comedy
t6 (Harvey) | Koster | Stewart | Comedy

24 Categorical clustering
Sample 2-d data: color and shape. [Figure: three alternative clusterings, Model A, Model B and Model C.]

25 Hamming Distance (binary and categorical data)
The number of attribute values in which two vectors differ. For example, the distance between (toned) and (roses) is 3. On the 3-bit binary cube, 100 -> 011 has distance 3 (red path) and 010 -> 111 has distance 2 (blue path).

26 K-means variants
Histogram-based and related methods:
k-modes
k-medoids
k-distributions
k-histograms
k-populations
k-representatives

27 Entropy-based cost functions
Category utility: measures how much a clustering improves the predictability of attribute values.
Entropy of the data set: H(X) = -\sum_{x} p(x) \log p(x)
Cost: the size-weighted sum of the entropies of the clusters relative to the data.
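A sketch of the entropy computations for a single categorical attribute (function names are mine; the genre values come from the movie table on slide 23):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of one categorical attribute."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def weighted_cluster_entropy(clusters):
    """Size-weighted sum of cluster entropies (the entropy-based cost)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * entropy(c) for c in clusters)

genres = ["Crime", "Crime", "Thriller", "Thriller", "Comedy", "Comedy"]
print(entropy(genres))                                 # log2(3) ≈ 1.585
print(weighted_cluster_entropy([genres[:2], genres[2:]]))  # 2/6 * 0 + 4/6 * 1.0 ≈ 0.667
```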

28 Iterative algorithms

29 K-modes clustering: distance function
The distance between a vector and the mode is the number of mismatching attributes. Example: vector (A, F, I) against mode (A, D, G): A matches, F ≠ D (+1), I ≠ G (+1), so the distance is 2.

30 K-modes clustering: prototype of cluster
The prototype is the mode: the most frequent value of each attribute. The vectors (A, D, G), (B, D, H) and (A, F, I) give the mode (A, D, G); the third attribute is a tie among G, H and I.
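A sketch of the mode computation (names mine; ties, as in the third attribute here, are broken by first occurrence):

```python
from collections import Counter

def mode_vector(vectors):
    """k-modes prototype: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*vectors))

cluster = [("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]
print(mode_vector(cluster))  # ('A', 'D', 'G')
```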

31 K-medoids clustering: prototype of cluster
The medoid is the vector with the minimal total distance to the others. For the vectors (A, C, E), (B, C, F) and (B, D, G), the total Hamming distances are 2+3=5, 2+2=4 and 3+2=5, so the medoid is (B, C, F).
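A sketch of medoid selection reproducing the totals above (function names mine; the distance is the Hamming distance of slide 15):

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def medoid(vectors, d=hamming):
    """Vector with the minimal total distance to all others."""
    return min(vectors, key=lambda v: sum(d(v, u) for u in vectors))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
print(medoid(cluster))  # ('B', 'C', 'F'), total distance 2 + 2 = 4
```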

32 K-medoids Example

33 K-medoids Calculation

34 K-histograms
The prototype stores, for each attribute, the frequency of every category within the cluster; e.g. the second attribute of the cluster on slide 30 gives D 2/3, F 1/3.
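A sketch of the k-histograms prototype, reproducing the D 2/3, F 1/3 frequencies (names mine):

```python
from collections import Counter

def histogram_prototype(vectors):
    """Per-attribute relative frequencies of the category values."""
    n = len(vectors)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*vectors)]

cluster = [("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]
print(histogram_prototype(cluster)[1])  # {'D': 2/3, 'F': 1/3}
```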

35 K-distributions: cost function with ε addition

36 Example of cluster allocation: change of entropy

37 Problem of non-convergence

38 Results with Census dataset

39 Literature
Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), March 2007.
ACE: K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM 2005), Berkeley, USA, 2005.
ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: a robust clustering algorithm for categorical attributes", Information Systems, 25 (5), 2000.
K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2 (3), 1998.
K-distributions: Z. Cai, D. Wang and L. Jiang, "K-distributions: a new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), Qingdao, China, 2007.
K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-histograms: an efficient clustering algorithm for categorical dataset", CoRR.

40 Part IV: Text data

41 Applications of text clustering
Query relaxation Spell-checking Automatic categorization Document clustering

42 Query relaxation
Current solution: matching suffixes from the database.
Alternative solution: derive relaxations from semantic clustering.

43 Spell-checking
The word kahvila (café): one correct and two incorrect spellings.

44 Automatic categorization
Category by clustering

45 Document clustering
Motivation: group related documents based on their content; there is no predefined training set (taxonomy), so a taxonomy is generated at runtime.
Clustering process:
Data preprocessing: tokenize, remove stop words, stem, feature extraction and lexical analysis
Define the cost function
Perform clustering

46 Text clustering
String similarity is the basis for clustering text data; a measure is required to calculate the similarity between two strings.

47 String similarity
Semantic: car and auto; automobile and auto.
Syntactic: отель and готель ("hotel" in Russian and Ukrainian); sauna and sana.

48 Semantic similarity
Lexical database: WordNet. English words are grouped into sets of synonyms (synsets) and related via generalization. [Figure: WordNet hierarchy from object through artifact and instrumentality to conveyance/transport, vehicle and wheeled vehicle, and on to car/auto, truck and bike/bicycle; a parallel branch leads through article, ware and table ware to cutlery/eating utensil and fork.]

49 Similarity using WordNet [Wu and Palmer, 2004]
Input: word 1: wolf, word 2: hunting dog. Output: similarity value = 0.89.

50 Hierarchical clustering by WordNet

51 Syntactic similarity
Operates on the words and their characters. Can be divided into three components:
Character-level similarity measures
Matching techniques
Token similarity

52 Syntactic similarity workflow

53 Character-level measures
Treat strings as sequences of characters and determine the similarity in one of three ways:
Exact match
Transformation
Longest common substring
[Figure: example location names such as "The Point", "Tigne Point", "Tigne Point mall", "The Avenue", "Golden house Chinese restaurant" and "The Palace".]

54 Exact match
Binary result: 1 if the strings are identical, 0 otherwise.
"Machine Learning" vs. "Machine Learning": 1 (match)
"Machine Learning" vs. "Machine Learned": 0 (mismatch)

55 Transformation
Edit distance: the number of single edit operations (insertion, deletion, substitution) needed to transform one string into another.
Hamming: allows only substitutions; the strings must be of equal length.
Jaro/Winkler: based on the number of matching and transposed characters (a/u, u/a).

56 Levenshtein edit distance: example
Input: string 1: kitten, string 2: sitting. Output: 3.
substitute k with s: sitten
substitute e with i: sittin
insert g: sitting
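A standard dynamic-programming implementation (a sketch, not the slides' code; names mine):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions from s to t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```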

57 Longest common substring
Finds the longest contiguous sequence of characters that occurs in both strings.
Example 1: ABABC vs ABCBA: LCS = 3 (ABC), ED = 2.
Example 2: AXAXA vs AAAAA: LCS = 1 (A), ED = 2.
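A dynamic-programming sketch returning the length (names mine):

```python
def lcs_length(s, t):
    """Length of the longest contiguous substring common to s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, 1):
            curr.append(prev[j - 1] + 1 if cs == ct else 0)
        best = max(best, max(curr))
        prev = curr
    return best

print(lcs_length("ABABC", "ABCBA"))  # 3 ("ABC")
print(lcs_length("AXAXA", "AAAAA"))  # 1 ("A")
```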

58 String segmentation
Q-grams: divide the string into substrings of length q (e.g. 2-grams).
Tokenization: breaks a string into words and symbols, called tokens, using whitespace, line breaks and punctuation characters, e.g. "The club at the Ivy".
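Both segmentations in a few lines of Python (a sketch; the q-gram example word is my own, since the slide's figure did not survive extraction):

```python
import re

def q_grams(s, q=2):
    """All substrings of length q, e.g. 'bingo' -> ['bi', 'in', 'ng', 'go']."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def tokenize(s):
    """Split a string into word tokens at whitespace and punctuation."""
    return re.findall(r"\w+", s)

print(q_grams("bingo"))                 # ['bi', 'in', 'ng', 'go']
print(tokenize("The club at the Ivy"))  # ['The', 'club', 'at', 'the', 'Ivy']
```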

59 Matching techniques

60 Matching techniques

61 Token similarity
Two alternatives to compare tokens:
Exact matching: 1 if the tokens match, 0 otherwise.
Approximate matching: compute the similarity between tokens using a character-level measure.

62 Approximate matching: example [Monge and Elkan, 1996]
Input: string 1: gray color, string 2: the grey colour. Output: similarity value 0.85.
Pairwise similarities using edit distance (Smith-Waterman-Gotoh):

token | the | grey | colour | maximum
gray | 0.20 | 0.90 | 0.30 | 0.90
color | | | 0.80 | 0.80

The overall similarity is the mean of the row maxima: (0.90 + 0.80) / 2 = 0.85.
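The Monge-Elkan combination step as a sketch: average, over the tokens of the first string, of the best character-level similarity in the second. The zero entries for "color" are my assumption, as those cells did not survive extraction:

```python
def monge_elkan(tokens1, tokens2, sim):
    """Average of each token's best match in the other string."""
    return sum(max(sim(a, b) for b in tokens2) for a in tokens1) / len(tokens1)

# Pairwise similarities from the slide (missing cells assumed 0.0)
table = {("gray", "the"): 0.20, ("gray", "grey"): 0.90, ("gray", "colour"): 0.30,
         ("color", "colour"): 0.80}
sim = lambda a, b: table.get((a, b), 0.0)
print(monge_elkan(["gray", "color"], ["the", "grey", "colour"], sim))  # ≈ 0.85
```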

63 Similarities for sample data

Compared strings | Edit distance | Q-gram q=2 | Q-gram q=3 | Q-gram q=4 | Cosine
Pizza Express Café vs Pizza Express | 72% | 79% | 74% | 70% | 82%
Lounasravintola Pinja Ky – ravintoloita vs Lounasravintola Pinja | 54% | 68% | 67% | 65% | 63%
Kioski Piirakkapaja vs Kioski Marttakahvio | 47% | 45% | 33% | 32% | 50%
Kauppa Kulta Keidas vs Kauppa Kulta Nalle | 67% | 60% | | |
Ravintola Beer Stop Pub vs Baari, Beer Stop R-kylä | 39% | 42% | 36% | 31% |
Ravintola Foxie s Bar vs Foxie Karsikko | 25% | 15% | 12% | 24% |
Play baari vs Ravintola Bar Play – Ravintoloita | 21% | 17% | 8% | |

The last two pairs refer to different establishments.

64 Part V: Time series

65 Clustering of time-series

66 Dynamic Time Warping Align two time-series by minimizing distance of the aligned observations Solve by dynamic programming!
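A dynamic-programming sketch of DTW (function name mine; absolute difference used as the local cost):

```python
def dtw(s, t):
    """Dynamic time warping distance with |a - b| as the local cost."""
    INF = float("inf")
    D = [[INF] * (len(t) + 1) for _ in range(len(s) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = abs(s[i - 1] - t[j - 1]) + min(
                D[i - 1][j],      # observation of s repeats
                D[i][j - 1],      # observation of t repeats
                D[i - 1][j - 1])  # one-to-one alignment
    return D[-1][-1]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: warping absorbs the repeated 1
```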

67 Example of DTW

68 Prototype of a cluster
The sequence c that minimizes E(S_j, c) is called the Steiner sequence. A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median): the time series in the cluster that minimizes E(S_j, c).

69 Calculating the prototype
Can be solved by dynamic programming, but the complexity is exponential in the number of time series in the cluster.

70 Averaging heuristic
Calculate the medoid sequence.
Calculate the warping paths from the medoid to all other time series in the cluster.
The new prototype is the average sequence over the warping paths.

71 Local search heuristics

72 Example of the three methods
E(S) = 159, E(S) = 138 and E(S) = 118. Local search (LS) provides the best fit in terms of the Steiner cost function, but it cannot modify the sequence length during the iterations. In data sets with varying lengths it might provide a better fit, yet insensitive prototypes.

73 Experiments

74 Part VI: Other clustering problems

75 Clustering of GPS trajectories

76 Density clusters
[Figure: density clusters on a map: walking street, swim hall, market place, science park, homes of users, shop.]

77 Image segmentation
Objects of different colors.

78 Literature
S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2nd edition, 2006.
P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14 (9), 677-681, 1999.
I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, October 2004.
M. Pucher, "Performance evaluation of WordNet-based semantic relatedness measures for word prediction in conversational speech", FTW, 2004.
A.E. Monge and C. Elkan, "The field matching problem: algorithms and applications", Int. Conf. on Knowledge Discovery and Data Mining, 1996.

