Tree Clustering & COBWEB. Remember: k-Means Clustering.


Tree Clustering & COBWEB

Remember: k-Means Clustering

k-Means Example (K=2) Pick seeds → Assign points to clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged! [figure: points and centroids at each iteration]
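The loop on this slide can be sketched in a few lines of Python; the data points and seed choice below are made up for illustration:

```python
# Minimal k-means sketch (K=2): pick seeds -> assign points ->
# recompute centroids -> repeat until the centroids stop moving.

def kmeans(points, k=2, max_iter=100):
    # Pick the first k points as seeds (a common simple initialization).
    centroids = [points[i] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids, clusters = kmeans(points)
```

On this toy data the loop separates the three points near the origin from the two near (8.5, 8.75) within a few iterations.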

EM-Clustering

Tree clustering Linkage rules Conceptual Clustering COBWEB Category utility

Tree Clustering Tree clustering algorithms allow us to reveal the internal similarities of a given pattern set and to structure these similarities hierarchically Applied to a small set of typical patterns For n patterns these algorithms generate a sequence of 1 to n clusters

The sequence of 1 to n clusters has the form of a binary tree (two branches for each tree node) The tree can be structured bottom up by a merging algorithm starting with the individual patterns, or top down by a splitting algorithm starting with a cluster composed of all patterns

Merging Algorithm given n patterns x_i consider initially k=n singleton clusters C_i = {x_i}; /*every cluster has only one element*/ while k > 1 do { determine the two nearest clusters C_i and C_j using an appropriate similarity rule; merge C_i and C_j: C_ij = {C_i, C_j}, thereby obtaining a solution with k-1 clusters; k = k-1; }
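A runnable sketch of this merging algorithm, here paired with Euclidean distance and the complete-linkage rule described on the following slides; the one-dimensional patterns are illustrative:

```python
# Agglomerative (merging) clustering: start with n singleton clusters
# and repeatedly merge the two nearest clusters until one remains.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def complete_linkage(ci, cj):
    # Dissimilarity of two clusters = distance of their furthest pair.
    return max(dist(p, q) for p in ci for q in cj)

def merge_clustering(patterns):
    # Start with n singleton clusters C_i = {x_i}.
    clusters = [[p] for p in patterns]
    history = []
    while len(clusters) > 1:
        # Determine the two nearest clusters under the linkage rule.
        pairs = [(complete_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        history.append((clusters[i], clusters[j], d))
        # Merge C_i and C_j, giving a solution with k-1 clusters.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(history[-1][0] + history[-1][1])
    return history

history = merge_clustering([(0.0,), (1.0,), (5.0,), (6.0,)])
```

The returned history is the sequence of merges, i.e. the binary tree built bottom up; note the merge dissimilarity grows as the process evolves (1.0, 1.0, then 6.0 here).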

The determination of the nearest clusters depends on: the similarity measure the rule used to assess the similarity of the clusters

Example Similarity between two clusters is assessed by measuring the similarity of the furthest pair of patterns (one from each cluster) This is the so-called complete linkage rule

As the merging process evolves, the similarity of the merged clusters decreases

A schedule graph may be of help for selecting the best solution Solutions with very small clusters or even singletons are rather suspicious

Linkage rules Complete linkage (FN furthest neighbor) Evaluates the dissimilarity between two clusters as the greatest distance between any two patterns, one from each cluster This rule performs well when the clusters are compact and of equal size Inadequate for filamentary clusters

Single linkage (NN nearest neighbor) Evaluates the dissimilarity between two clusters as the dissimilarity of the nearest patterns, one from each cluster Produces a chaining effect and works well with filamentary shapes

[figure: (a) globular data, (b) filamentary data]

Average linkage between groups Also known as UPGMA (unweighted pair-group method using arithmetic averages) This rule assesses the distance between two clusters as the average of the distances between all pairs of patterns, one from each cluster

Impact of cluster distance measures “Single-Link” (inter-cluster distance = distance between closest pair of points) “Complete-Link” (inter-cluster distance = distance between farthest pair of points)
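All three linkage rules reduce the same set of cross-cluster pairwise distances, just differently: min (single), max (complete), or mean (average/UPGMA). A minimal sketch, with illustrative clusters:

```python
# The three linkage rules differ only in how they aggregate the
# pairwise distances between members of two clusters.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(ci, cj):    # NN: nearest cross-cluster pair
    return min(dist(p, q) for p in ci for q in cj)

def complete_link(ci, cj):  # FN: furthest cross-cluster pair
    return max(dist(p, q) for p in ci for q in cj)

def average_link(ci, cj):   # UPGMA: mean over all cross-cluster pairs
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
# single: 3.0, complete: 6.0, average: (4+6+3+5)/4 = 4.5
```

Because single linkage looks only at the closest pair, chains of nearby points can bridge clusters (the chaining effect), which is why it suits filamentary data and complete linkage suits compact globular data.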

Conceptual Clustering - COBWEB Conceptual Clustering Begins with a collection of unclassified objects and some means of measuring the similarity of objects Numeric taxonomy: representation of objects as a collection of features, each of which may have some numerical value Objects are treated by a distance function as vectors of n features

bird is defined by the following features: flies, sings, lays eggs, nests in trees, eats insects. bat is defined by the following features: flies, gives milk, eats insects

Humans distinguish degrees of category membership We generally think of a robin as a better example of a bird than a chicken An oak is a more typical example of a tree than a palm

Family resemblance theory (Wittgenstein 1953) Categories are defined by a complex system of similarities between members A category may not have properties shared by all its members Games: Not all games require two or more players - solitaire (patience) Not all games are fun for the players - football Not all games involve competition - jumping rope Yet the game category is well defined

Logic, feature vectors or decision trees do not account for these effects COBWEB (Fisher 1987) addresses these issues Models base-level categorization and degrees of category membership Represents categories probabilistically, instead of defining category membership as a set of values that must be present Builds up a hierarchy (tree)

COBWEB represents the probability with which each feature value is present in an object p(f_i = v_ij | c_k) is the conditional probability with which feature f_i will have the value v_ij, given that an object is in category c_k

Example COBWEB forms a taxonomy (tree, hierarchy) of categories Example: Categorization of four single-cell animals

Each animal is defined by a number of features Number of tails, color, and number of nuclei Category C3: members have a 1.0 probability of having 2 tails, a 0.5 probability of having light color, and a 1.0 probability of having 2 nuclei
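Estimating these conditional probabilities is just counting within a category: the fraction of members with each feature value. A sketch, with hypothetical members standing in for a category like C3:

```python
# Estimate p(f_i = v_ij | c_k) from the members of one category:
# count how often each (feature, value) pair occurs, divide by size.

def feature_probs(members):
    counts = {}
    for animal in members:
        for feature, value in animal.items():
            counts[(feature, value)] = counts.get((feature, value), 0) + 1
    return {fv: n / len(members) for fv, n in counts.items()}

# Hypothetical category C3 with two members (values are illustrative):
c3 = [
    {"tails": 2, "color": "light", "nuclei": 2},
    {"tails": 2, "color": "dark",  "nuclei": 2},
]
p = feature_probs(c3)
# p[("tails", 2)] == 1.0, p[("color", "light")] == 0.5,
# p[("nuclei", 2)] == 1.0 -- matching the slide's C3 description.
```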

When given a new example, COBWEB considers the overall quality of either placing the example in an existing category or modifying the hierarchy The criterion COBWEB uses for evaluating the quality of the classification is called category utility

Category utility Was developed in research of human categorization (Gluck and Corter 1985) Category utility attempts to maximize both the probability that two objects in the same category have values in common and the probability that objects in different categories will have different property values

Category utility CU = Σ_k Σ_i Σ_j p(f_i = v_ij) p(f_i = v_ij | c_k) p(c_k | f_i = v_ij) This sum is taken across all categories c_k, all features f_i and all feature values v_ij

p(f_i = v_ij | c_k) is called predictability; it is the probability that an object has the value v_ij for feature f_i given that the object belongs to category c_k The higher this probability, the more likely two objects in a category share the same feature values p(c_k | f_i = v_ij) is called predictiveness; it is the probability with which an object belongs to the category c_k given it has the value v_ij for feature f_i The greater this probability, the less likely objects not in the category will have those values p(f_i = v_ij) serves as a weight; frequent feature values exert a stronger influence

By combining these values, high category utility measures indicate a high likelihood that objects in the same category will share properties, while decreasing the likelihood of objects in different categories having properties in common
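Assuming the triple-sum form of category utility (weight × predictability × predictiveness, summed over all categories, features, and values), a small sketch comparing two partitions of the same objects:

```python
# Category utility as sum_k sum_i sum_j of
#   p(f=v) * p(f=v|c_k) * p(c_k|f=v)
# Objects are dicts of feature -> value; the data is illustrative.

from collections import Counter

def category_utility(categories):
    objects = [o for cat in categories for o in cat]
    n = len(objects)
    overall = Counter((f, v) for o in objects for f, v in o.items())
    cu = 0.0
    for cat in categories:
        within = Counter((f, v) for o in cat for f, v in o.items())
        for fv, count in within.items():
            p_fv = overall[fv] / n               # weight p(f=v)
            p_fv_given_c = count / len(cat)      # predictability p(f=v|c)
            p_c_given_fv = count / overall[fv]   # predictiveness p(c|f=v)
            cu += p_fv * p_fv_given_c * p_c_given_fv
    return cu

good = [[{"tails": 2}, {"tails": 2}], [{"tails": 1}, {"tails": 1}]]
mixed = [[{"tails": 2}, {"tails": 1}], [{"tails": 2}, {"tails": 1}]]
```

A partition whose categories separate the feature values (good) scores higher than one that mixes them (mixed), which is exactly the behavior the measure is designed to reward.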

COBWEB performs a hill-climbing search of the space of possible taxonomies (trees) using category utility to evaluate and select possible categorizations

Initializes the taxonomy to a single category whose features are those of the first example For each example, the algorithm begins with the root category and moves through the tree At each level it uses category utility to evaluate the result of: 1. Placing the example in the best existing category 2. Adding a new category containing the example 3. Merging two existing categories and adding the example to the merged category 4. Splitting an existing category and placing the example in the best resulting category

COBWEB is efficient in producing trees with a reasonable number of classes Because it allows probabilistic membership, its categories are flexible and robust

Tree clustering Linkage rules Conceptual Clustering COBWEB Category utility

Assessment Cluster validation