
1 © Copyright 2012, Natasha Balac 1 Clustering SDSC Summer Institute 2012 Natasha Balac, Ph.D.

2 © Copyright 2012, Natasha Balac CLUSTERING Basic idea: group similar things together. Unsupervised learning: useful when no other information is available. K-means: partitioning instances into k disjoint clusters using a measure of similarity.

3 © Copyright 2012, Natasha Balac Clustering Partition unlabeled examples into disjoint subsets (clusters), such that: examples within a cluster are very similar; examples in different clusters are very different. Discover new categories in an unsupervised manner (no sample category labels provided).

4 © Copyright 2012, Natasha Balac CLUSTERING (figure: example data points grouped into clusters, with the cluster centers marked X)

5 © Copyright 2012, Natasha Balac Clustering Techniques  K-means clustering  Hierarchical clustering  Conceptual clustering  Probability-based clustering  Bayesian clustering

6 © Copyright 2012, Natasha Balac Common uses of Clustering Often used as an exploratory data analysis tool. In one dimension, a good way to quantize real-valued variables into k non-uniform buckets. Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization). Also used for choosing color palettes on old-fashioned graphical display devices, and for color image segmentation.

7 © Copyright 2012, Natasha Balac Clustering Unsupervised: no target value to be predicted. Different ways clustering results can be produced/represented/learned: exclusive vs. overlapping; deterministic vs. probabilistic; hierarchical vs. flat; incremental vs. batch learning.

8 © Copyright 2012, Natasha Balac Hierarchical Clustering Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. Recursive application of a standard clustering algorithm can produce a hierarchical clustering. (Example dendrogram: animal splits into vertebrate and invertebrate; vertebrate into fish, reptile, amphibian, mammal; invertebrate into worm, insect, crustacean.)
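
The slide describes building a dendrogram from unlabeled examples; a minimal illustration of the same idea (not from the lecture, and using SciPy's bottom-up agglomerative clustering rather than recursive application of k-means) might look like this:

```python
# Sketch only: build a dendrogram with SciPy's agglomerative hierarchical
# clustering (a bottom-up alternative to recursively applying a clusterer).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),   # one synthetic group of examples
               rng.normal(5, 1, (20, 2))])  # a second synthetic group

Z = linkage(X, method="average")            # average-link cluster tree
tree = dendrogram(Z, no_plot=True)          # tree structure; plot if desired
```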

9 © Copyright 2012, Natasha Balac The k-means algorithm: iterative distance-based clustering. Clusters the data into k groups, where k is specified in advance. 1. Cluster centers are chosen at random. 2. Instances are assigned to clusters based on their distance to the cluster centers. 3. Centroids of the clusters are computed ("means"). 4. Repeat from step 2 until convergence.
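
A minimal from-scratch sketch of these four steps (illustrative only, not the lecture's code), written with NumPy:

```python
# Minimal k-means sketch in NumPy (illustration, not the lecture's code).
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    # Step 1: choose k cluster centers at random from the instances.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each instance to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the centroid ("mean") of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: repeat steps 2-3 until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```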

10 © Copyright 2012 Natasha Balac K-Means Clustering Pros & Cons Simple and reasonably effective. The final cluster centers do not represent a global minimum but only a local one; results can vary significantly based on the initial choice of seeds, and completely different final clusters can arise from differences in the initial randomly chosen cluster centers. The algorithm can easily fail to find a reasonable clustering.

11 © Copyright 2012, Natasha Balac Getting trapped in a local minimum Example: four instances at the vertices of a two-dimensional rectangle Local minimum: two cluster centers at the midpoints of the rectangle’s long sides Simple way to increase chance of finding a global optimum: restart with different random seeds
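
The restart idea can be layered on top of the kmeans() sketch above: run it from several random seeds and keep the clustering with the smallest within-cluster scatter. This is an illustration, not code from the slides:

```python
# Restart-with-different-seeds sketch, reusing kmeans() from the sketch above:
# keep the run with the smallest within-cluster sum of squared distances.
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        centers, labels = kmeans(X, k, rng=seed)
        sse = ((X - centers[labels]) ** 2).sum()   # within-cluster scatter
        if sse < best_sse:
            best_sse, best = sse, (centers, labels)
    return best
```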

12 © Copyright 2012, Natasha Balac K-Means Algorithm Let d be the distance measure between instances. Select k random instances {s_1, s_2, ..., s_k} as seeds. Until clustering converges or another stopping criterion is met: for each instance x_i, assign x_i to the cluster c_j such that d(x_i, s_j) is minimal; then update the seeds to the centroid of each cluster, i.e. for each cluster c_j, s_j = μ(c_j), the mean of the instances in c_j.

13 © Copyright 2006, Natasha Balac K Means Example (K=2) (figure: pick seeds; assign instances to clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged)

14 © Copyright 2012, Natasha Balac Seed Choice Results can vary based on random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusters. Select good seeds using a heuristic or the results of another method.
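
As a library-level illustration (this assumes scikit-learn, which the slides do not mention), KMeans exposes both remedies discussed so far: a seeding heuristic (init="k-means++") and multiple random restarts (n_init):

```python
# Library-level sketch (assumes scikit-learn is installed; not from the slides).
# KMeans provides a seeding heuristic ("k-means++") and multiple restarts (n_init).
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)          # X: any numeric instance matrix
centroids = km.cluster_centers_
```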

15 © Copyright 2012, Natasha Balac 15 Clustering Incremental Clustering Probability-based Clustering EM Clustering Bayesian Clustering

16 © Copyright 2012, Natasha Balac 16 Incremental clustering Works incrementally, instance by instance, forming a hierarchy of clusters. COBWEB handles nominal attributes; CLASSIT handles numeric attributes. Instances are added one at a time and the tree is updated appropriately at each step: finding the right leaf for an instance, restructuring the tree. How and where to update is decided based on the category utility value.

17 © Copyright 2012, Natasha Balac 17 Iris Data Please refer to IrisDataSetClusters.pdf

18 © Copyright 2012, Natasha Balac 18 Clustering with cutoff

19 © Copyright 2012 Natasha Balac 19 Category utility Category utility is a kind of quadratic loss function defined on conditional probabilities: C_1, ..., C_k are the k clusters; a_i is the i-th attribute, and it takes on values v_i1, v_i2, ...
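
The formula itself appears only as an image in this transcript; the standard category utility expression that the listed symbols refer to (as in Witten and Frank's treatment, which these slides appear to follow) is

CU(C_1, \dots, C_k) = \frac{1}{k} \sum_{l=1}^{k} \Pr[C_l] \left( \sum_i \sum_j \Pr[a_i = v_{ij} \mid C_l]^2 - \sum_i \sum_j \Pr[a_i = v_{ij}]^2 \right).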

20 © Copyright 2012, Natasha Balac 20 Probability-based clustering Problems with K-means & Hierarchical methods: Division by k Order of examples Merging/splitting operations might not be sufficient to reverse the effects of bad initial ordering Is result at least local minimum of category utility? Solution: Find the most likely clusters given the data Instance has certain probability of belonging to a particular cluster

21 © Copyright 2012, Natasha Balac 21 Soft Clustering So far clustering methods assumed that each instance has a “hard” assignment to exactly one cluster No uncertainty about class membership or an instance belonging to more than one cluster Soft clustering gives probabilities that an instance belongs to each of a set of clusters Each instance is assigned a probability distribution across a set of discovered clusters probabilities of all categories must sum to 1

22 © Copyright 2012, Natasha Balac 22 Finite mixtures Probabilistic clustering algorithms model the data using a mixture of distributions. Each cluster is represented by one distribution, which governs the probabilities of attribute values in the corresponding cluster. They are called finite mixtures because only a finite number of clusters is represented. Usually the individual distributions are normal, and they are combined using cluster weights.
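
For concreteness (a restatement, not slide content): with two normal components A and B and cluster weights p_A + p_B = 1, the mixture density is

\Pr[x] = p_A\, f(x; \mu_A, \sigma_A) + p_B\, f(x; \mu_B, \sigma_B), \quad \text{where } f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).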

23 © Copyright 2006, Natasha Balac 23 A two-class mixture model

24 © Copyright 2012, Natasha Balac 24 Using the mixture model The probability of an instance x belonging to cluster A is:
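
The formula is an image in the original transcript; in the mixture notation above it is the usual Bayes-rule expression

\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{\Pr[x]}.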

25 © Copyright 2012, Natasha Balac 25 Learning the clusters Assume we know that there are k clusters. To learn the clusters we need to determine their parameters, i.e. their means and standard deviations. Start with an initial guess for the five parameters (in the two-cluster case: the two means, the two standard deviations, and the cluster weight p_A), use them to calculate cluster probabilities for each instance, use these probabilities to re-estimate the parameters, and repeat. We actually have a performance criterion: the likelihood of the training data given the clusters.

26 © Copyright 2012, Natasha Balac 26 Expectation Maximization (EM) Probabilistic method for soft clustering Iterative method for learning probabilistic categorization model from unsupervised data Direct method that assumes k clusters: {c 1, c 2,… c k } Soft version of k-means Assumes a probabilistic model of categories that allows computing P(c i | E) for each category, c i, for a given example, E

27 © Copyright 2012, Natasha Balac 27 EM Algorithm Initially assume random assignment of examples to categories Learn an initial probabilistic model by estimating model parameters from this randomly labeled data

28 © Copyright 2012, Natasha Balac 28 The EM algorithm EM algorithm: expectation-maximization algorithm Generalization of k-means to probabilistic setting Similar iterative procedure: 1. Calculate cluster probability for each instance (expectation step) 2. Estimate distribution parameters based on the cluster probabilities (maximization step) Cluster probabilities are stored as instance weights
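
A compact sketch of this two-step loop for a two-component, one-dimensional Gaussian mixture (illustration only; the parameter names mirror the slides' means, standard deviations, and cluster weights):

```python
# EM sketch for a two-component 1-D Gaussian mixture (illustration only).
import numpy as np

def em_two_gaussians(x, n_iter=100):
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=2)                 # initial guesses for the means
    sigma = np.array([x.std(), x.std()])       # initial standard deviations
    p = np.array([0.5, 0.5])                   # cluster weights
    for _ in range(n_iter):
        # Expectation step: cluster probability for each instance.
        dens = np.stack([p[j] * np.exp(-(x - mu[j])**2 / (2 * sigma[j]**2))
                         / (np.sqrt(2 * np.pi) * sigma[j]) for j in range(2)])
        w = dens / dens.sum(axis=0)            # instance weights, sum to 1
        # Maximization step: re-estimate parameters from the weights.
        for j in range(2):
            mu[j] = np.average(x, weights=w[j])
            sigma[j] = np.sqrt(np.average((x - mu[j])**2, weights=w[j]))
            p[j] = w[j].mean()
    return mu, sigma, p
```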

29 © Copyright 2012, Natasha Balac 29 Summary Labeled clusters can be interpreted by using supervised learning: train a tree or learn rules. Clustering can be used to fill in missing attribute values. All methods have a basic assumption of independence between the attributes. Some methods allow the user to specify in advance that two or more attributes are dependent and should be modeled with a joint probability.

30 Thank you! Questions? natashab@sdsc.edu Coming up: Use cases and hands on activity © Copyright 2006, Natasha Balac 30

31 © Copyright 2012, Natasha Balac 31 Regression Trees & Clustering SDSC Summer Institute 2012 Natasha Balac, Ph.D.

32 © Copyright 2012, Natasha Balac REGRESSION TREE INDUCTION Why Regression tree? Ability to: Predict continuous variable Model conditional effects Model uncertainty

33 © Copyright 2012, Natasha Balac Regression Trees Continuous goal variables. Induction by means of an efficient recursive partitioning algorithm. Uses linear regression to select internal nodes (Quinlan, 1992).

34 © Copyright 2012, Natasha Balac Regression trees Differences from decision trees: splitting minimizes intra-subset variation; pruning uses a numeric error measure; a leaf node predicts the average class value of the training instances reaching that node. Can approximate piecewise constant functions. Easy to interpret and understand the structure. Special kind: Model Trees.

35 © Copyright 2012, Natasha Balac Model trees RT with linear regression functions at each leaf Linear regression (LR) applied to instances that reach a node after full regression tree has been built Only a subset of the attributes is used for LR Fast Overhead for LR is minimal as only a small subset of attributes is used in tree

36 © Copyright 2012, Natasha Balac Building the tree Splitting criterion: standard deviation reduction (T is the portion of the data reaching the node; T1, T2, ... are the sets that result from splitting the node according to the chosen attribute). Termination criteria: the standard deviation becomes smaller than a certain fraction of the SD for the full training set (5%), or too few instances remain (< 4).
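
The reduction formula itself is an image in the transcript; in the standard M5 formulation it is

SDR = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i).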

37 © Copyright 2012, Natasha Balac Nominal attributes Nominal attributes are converted into binary attributes (that can be treated as numeric ones). Nominal values are sorted using the average class value. If there are k values, k-1 binary attributes are generated. It can be proven that the best split on one of the new attributes is the best binary split on the original. M5' does the conversion only once.

38 © Copyright 2012 Natasha Balac Pruning Model Trees Based on the estimated absolute error of the LR models. A heuristic estimate is used for the smoothing calculation: p' is the prediction passed up to the next higher node; p is the prediction passed up from the node below; q is the value predicted by the model at this node; n is the number of training instances in the node; k is a smoothing constant. Pruned by greedily removing terms to minimize the estimated error. Heavy pruning is allowed: a single LR model can replace a whole subtree. Pruning proceeds bottom up: the error for the LR model at an internal node is compared to the error for the subtree.
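
The smoothing calculation referred to here is shown as an image in the transcript; in the usual M5' description it is

p' = \frac{n p + k q}{n + k},

with the symbols as defined on the slide.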

39 © Copyright 2012, Natasha Balac Building the tree Splitting criterion: standard deviation reduction. T is the portion of the training data reaching the node; T1, T2, ... are the sets that result from splitting the node on the chosen attribute. Treating the SD of the class values in T as a measure of the error at the node, calculate the expected reduction in error from testing each attribute at the node. Termination criteria: the standard deviation becomes smaller than a certain fraction of the SD for the full training set (5%), or too few instances remain (< 4).

40 © Copyright 2012, Natasha Balac Pseudo-code for M5’ Four methods: Main method: MakeModelTree() Method for splitting: split() Method for pruning: prune() Method that computes error: subtreeError() We’ll briefly look at each method in turn Linear regression method is assumed to perform attribute subset selection based on error

41 © Copyright 2012 Natasha Balac MakeModelTree:
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)

42 © Copyright 2012, Natasha Balac split(node):
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)

43 © Copyright 2012, Natasha Balac prune(node):
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF

44 © Copyright 2012, Natasha Balac subtreeError(node):
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances) * subtreeError(l)
            + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)

45 © Copyright 2012, Natasha Balac MULTI-VARIATE REGRESSION TREES All the characteristics of a regression tree. Capable of predicting two or more outcomes, for example: activity and toxicity, or monetary gain and time (Balac and Gaines, ICML 2001).

46 © Copyright 2012, Natasha Balac MULTI-VARIATE REGRESSION TREE INDUCTION (figure: example multivariate regression tree splitting on Var 1 through Var 4, with leaves predicting both outcomes, e.g. Activity = 7.05, Toxicity = 0.173 and Activity = 7.39, Toxicity = 2.89)

47 © Copyright 2012, Natasha Balac 47 Incremental clustering Works incrementally, instance by instance, forming a hierarchy of clusters. COBWEB handles nominal attributes; CLASSIT handles numeric attributes. Instances are added one at a time and the tree is updated appropriately at each step: finding the right leaf for an instance, restructuring the tree. How and where to update is decided based on the category utility value.

48 © Copyright 2006, Natasha Balac 48 Clustering: Weather data

49 © Copyright 2006, Natasha Balac 49 Clustering Step 1 Step 2 Step 3

50 © Copyright 2006, Natasha Balac 50 Step 4 Step 5

51 © Copyright 2012, Natasha Balac 51 Merging Consider all pairs of nodes for merging and evaluate the category utility of each; this is computationally expensive. When scanning nodes for a suitable host, both the best matching node and the runner-up are noted. The best will form the host for the new instance unless merging the host and runner-up produces a better CU.

52 © Copyright 2006, Natasha Balac 52 Final Hierarchy Please refer to Final Hierarchy Weather data set.pdf

53 © Copyright 2012, Natasha Balac 53 Iris Data Please refer to IrisDataSetClusters.pdf

54 © Copyright 2012 Natasha Balac 54 Category utility Category utility is a kind of quadratic loss function defined on conditional probabilities: C_1, ..., C_k are the k clusters; a_i is the i-th attribute, and it takes on values v_i1, v_i2, ...

55 © Copyright 2012, Natasha Balac 55 Category Utility Extended to Numeric attributes, assuming a normal distribution. When the standard deviation of attribute a_i is zero, it produces an infinite value in the category utility formula. Acuity parameter: a pre-specified minimum variance on each attribute (only one instance in a node produces zero variance).
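
The extended formula is an image in the transcript; the usual form replaces the sum over discrete values by the integral of the squared normal density, \int f(a_i)^2\, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_{il}}, giving (this reconstruction is an assumption based on the standard treatment)

CU = \frac{1}{k} \sum_{l} \Pr[C_l]\, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right).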

56 © Copyright 2012, Natasha Balac 56 Iris Data Please refer to IrisDataSetClusters.pdf

57 © Copyright 2012, Natasha Balac 57 Clustering with cutoff Please refer to Clusteringwithcutoff.pdf

58 © Copyright 2012, Natasha Balac 58 Finite mixtures Probabilistic clustering algorithms model the data using a mixture of distributions. Each cluster is represented by one distribution, which governs the probabilities of attribute values in the corresponding cluster. They are called finite mixtures because only a finite number of clusters is represented. Usually the individual distributions are normal, and they are combined using cluster weights.

59 © Copyright 2006, Natasha Balac 59 A two-class mixture model

60 © Copyright 2012, Natasha Balac 60 Using the mixture model The probability of an instance x belonging to cluster A is:

61 © Copyright 2012, Natasha Balac 61 Learning the clusters Assume we know that there are k clusters. To learn the clusters we need to determine their parameters, i.e. their means and standard deviations. Start with an initial guess for the five parameters (in the two-cluster case: the two means, the two standard deviations, and the cluster weight p_A), use them to calculate cluster probabilities for each instance, use these probabilities to re-estimate the parameters, and repeat. We actually have a performance criterion: the likelihood of the training data given the clusters.

62 © Copyright 2012, Natasha Balac 62 Expectation Maximization (EM) Probabilistic method for soft clustering Iterative method for learning probabilistic categorization model from unsupervised data Direct method that assumes k clusters: {c 1, c 2,… c k } Soft version of k-means Assumes a probabilistic model of categories that allows computing P(c i | E) for each category, c i, for a given example, E

