Different types of data, e.g.
Continuous data: height
Categorical data:
  ordered (ordinal): growth rate (very slow, slow, medium, fast, very fast)
  not ordered (nominal): fruit colour (yellow, green, purple, red, orange)
Binary data: fruit / no fruit
Similarity matrix
We define a similarity between units, analogous to the correlation between continuous variables. (A dissimilarity or distance matrix can be used instead.) The similarity between two units can be constructed as the average of their similarities on each variable; a weighted average can also be used. This provides a way of combining different types of variables.
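The averaging idea can be sketched in a few lines of Python. This is only an illustration in the spirit of Gower's general coefficient, not the method used in the slides. The per-variable rules are assumptions: continuous variables are scored as 1 minus the absolute difference divided by the variable's range, and categorical and binary variables are scored 1 for a match and 0 otherwise. Ordered categories are treated as unordered for simplicity. The function names and the two example units are hypothetical.

```python
def variable_similarity(x, y, kind, value_range=None):
    """Similarity of two units on one variable, between 0 and 1 (assumed rules)."""
    if kind == "continuous":
        if not value_range:
            raise ValueError("continuous variables need a non-zero range")
        # Assumption: scale the absolute difference by the variable's range.
        return 1.0 - abs(x - y) / value_range
    # Categorical and binary variables: simple matching.
    return 1.0 if x == y else 0.0


def combined_similarity(u, v, kinds, ranges, weights=None):
    """Weighted average of the per-variable similarities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(kinds)
    total = sum(
        w * variable_similarity(x, y, k, r)
        for x, y, k, r, w in zip(u, v, kinds, ranges, weights)
    )
    return total / sum(weights)


# Hypothetical units: (height, growth rate, fruit colour, fruit present)
kinds = ["continuous", "categorical", "categorical", "binary"]
ranges = [100.0, None, None, None]  # assumed height range across all units
unit_1 = (150.0, "slow", "red", 1)
unit_2 = (170.0, "slow", "green", 1)

s_equal = combined_similarity(unit_1, unit_2, kinds, ranges)
s_weighted = combined_similarity(unit_1, unit_2, kinds, ranges, weights=[2, 1, 1, 1])
print(s_equal, s_weighted)
```

Computing this value for every pair of units fills the similarity matrix; 1 minus the similarity gives a matching dissimilarity.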
Distance metrics
Relevant for continuous variables:
  Euclidean
  city block or Manhattan
(also many other variations)
[Diagrams of points A and B, one per metric]
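As a small illustration of the two metrics, the sketch below computes both distances between two points. The coordinates chosen for A and B are hypothetical and are not taken from the slide diagrams.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City block distance: sum of the absolute differences on each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical points A and B measured on two continuous variables
A = (0.0, 0.0)
B = (3.0, 4.0)

d_euclidean = euclidean(A, B)
d_manhattan = manhattan(A, B)
print(d_euclidean, d_manhattan)  # 5.0 7.0
```

The city block distance is never smaller than the Euclidean distance, because it follows the axes rather than the diagonal.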
Similarity coefficients for binary data
  simple matching: count a match if both units are 0 or both units are 1
  Jaccard: count a match only if both units are 1
(and many other variants)
Simple matching can be extended to categorical data.
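The two coefficients can be sketched as follows. The example vectors are hypothetical, and returning 0.0 from the Jaccard function when neither unit has any 1s is an assumed convention, since the coefficient is undefined in that case.

```python
def simple_matching(u, v):
    """Proportion of variables on which the two units agree (0-0 or 1-1)."""
    matches = sum(1 for x, y in zip(u, v) if x == y)
    return matches / len(u)


def jaccard(u, v):
    """Joint presences (1-1) divided by variables where at least one unit is 1."""
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    # Assumed convention: no presences in either unit gives similarity 0.
    return both / either if either else 0.0


# Hypothetical presence/absence records for two units on five variables
unit_1 = [1, 1, 0, 0, 1]
unit_2 = [1, 0, 0, 0, 1]

sm = simple_matching(unit_1, unit_2)  # 4 agreements out of 5 variables
jc = jaccard(unit_1, unit_2)          # 2 joint presences out of 3
print(sm, jc)
```

Because `simple_matching` only tests equality, the same function works unchanged on categorical values such as fruit colours.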
Clustering methods
hierarchical
  divisive: put everything together and split (monothetic / polythetic)
  agglomerative: keep everything separate and join the most similar points (classical cluster analysis)
non-hierarchical
  k-means clustering
Agglomerative hierarchical
Single linkage or nearest neighbour: finds the minimum spanning tree (the shortest tree that connects all points); prone to chaining.
Complete linkage or furthest neighbour: gives compact clusters of approximately equal size (makes compact groups even when none exist).
Average linkage methods: between single and complete linkage.
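The three linkage rules differ only in how the distance between two clusters is defined, which a naive sketch can make concrete. The code below is an illustration, not an efficient or standard implementation: it recomputes every cluster distance at each step, uses Euclidean distance, and the function name `agglomerate` and the example points are hypothetical.

```python
import math


def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where clusters are tuples of point indices.
    """
    def cluster_distance(c1, c2):
        d = [math.dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":    # nearest neighbour
            return min(d)
        if linkage == "complete":  # furthest neighbour
            return max(d)
        if linkage == "average":   # mean over all between-cluster pairs
            return sum(d) / len(d)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [(i,) for i in range(len(points))]  # start with every point separate
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, ia, ib = min(
            (cluster_distance(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = tuple(sorted(clusters[ia] + clusters[ib]))
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return merges


# Hypothetical points on a line, chosen so that no two merge distances tie
points = [(0.0,), (1.0,), (3.0,), (4.5,), (9.0,)]

heights = {
    name: [d for _, _, d in agglomerate(points, name)]
    for name in ("single", "complete", "average")
}
print(heights)
```

For these points the single linkage merge distances add up to the total length of the minimum spanning tree (here 9.0, the full span of the line), and the last two average linkage merge distances fall between the corresponding single and complete linkage values.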