CSC 463 Fall 2010 Dr. Adam P. Anthony Class #27


Machine Learning III: Chapter 18.9

Today’s class
Support vector machines
Clustering (unsupervised learning)

Support Vector Machines
These SVM slides were borrowed from Andrew Moore’s PowerPoint slides on SVMs. Andrew’s PowerPoint repository is here: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Methods For Classification
Decision Trees
–Model-based data structure, works best with discrete data
–For a new instance, choose label C based on rules laid out by tree
Probabilistic Classifiers
–Model-based as well, works with any type of data
–For a new instance, choose label C that maximizes $P([f_1 \dots f_n, C] \mid \text{Data})$
K-Nearest Neighbor
–Instance-based
–For new instance, choose label based on the majority vote of k nearest points in Data
Boundary-Based Classifiers (NEW!)
–Model-based, only works with continuous data
–Establish a numerical function that acts as a fence between positive, negative examples

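Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: x → f → y_est; one marker denotes +1, the other denotes -1]
Classifier: $f(\mathbf{x}, \mathbf{w}, b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
Line: $x_2 = m x_1 + b$, OR: $w_1 x_1 - w_2 x_2 + b' = 0$, where $m = w_1 / w_2$ and $b = b' / w_2$
Given values for $x_1, x_2$:
If the formula above is > 0, the point is on one side of the line
If the formula is < 0, the point is on the other side (which side counts as “above” depends on the sign of $w_2$)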
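The classifier rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the example line are assumptions, not part of the slides, and a score of exactly 0 is arbitrarily sent to -1.

```python
def classify(x, w, b):
    """Return +1 or -1 according to the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # A score of exactly 0 lies on the line itself; this sketch assigns it to -1.
    return 1 if score > 0 else -1


# Hypothetical line x2 = 2*x1 + 1, rewritten as 2*x1 - 1*x2 + 1 = 0,
# so w = (2, -1) and b = 1.
w, b = (2.0, -1.0), 1.0
print(classify((0.0, 0.0), w, b))  # score = 1, positive side
print(classify((0.0, 3.0), w, b))  # score = -2, negative side
```

Note that with this choice of w the positive side is the region below the line, which is why the side labels depend on the sign of the weights.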

Linear Classifiers
$f(\mathbf{x}, \mathbf{w}, b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
[Figure: several candidate separating lines drawn over the same data; markers denote +1 and -1]
How would you classify this data?
Any of these would be fine... but which is best?

Classifier Margin
$f(\mathbf{x}, \mathbf{w}, b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
$f(\mathbf{x}, \mathbf{w}, b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.

Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy since the model is immune to removal of any non-support-vector datapoints.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.

Specifying a line and margin
How do we represent this mathematically? …in m input dimensions?
[Figure: plus-plane, classifier boundary, and minus-plane, separating the “Predict Class = +1” zone from the “Predict Class = -1” zone]

Specifying a line and margin
Plus-plane = $\{ \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = +1 \}$
Minus-plane = $\{ \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = -1 \}$
Classifier boundary = $\{ \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = 0 \}$
Classify as
+1 if $\mathbf{w} \cdot \mathbf{x} + b \ge 1$
-1 if $\mathbf{w} \cdot \mathbf{x} + b \le -1$
Universe explodes if $-1 < \mathbf{w} \cdot \mathbf{x} + b < 1$

Computing the margin width
Plus-plane = $\{ \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = +1 \}$
Minus-plane = $\{ \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = -1 \}$
M = Margin Width. How do we compute M in terms of w and b?
Claim: The vector $\mathbf{w}$ is perpendicular to the plus-plane. Why?
–Let $\mathbf{u}$ and $\mathbf{v}$ be two vectors on the plus-plane. What is $\mathbf{w} \cdot (\mathbf{u} - \mathbf{v})$? Since $\mathbf{w} \cdot \mathbf{u} + b = 1$ and $\mathbf{w} \cdot \mathbf{v} + b = 1$, it is 0.
–And so of course the vector $\mathbf{w}$ is also perpendicular to the minus-plane
Let $\mathbf{x}^{-}$ be any point on the minus-plane (any location in $\mathbb{R}^m$: not necessarily a datapoint)
Let $\mathbf{x}^{+}$ be the closest plus-plane point to $\mathbf{x}^{-}$ (any location in $\mathbb{R}^m$: not necessarily a datapoint)
Claim: $\mathbf{x}^{+} = \mathbf{x}^{-} + \lambda \mathbf{w}$ for some value of $\lambda$. Why?
–The line from $\mathbf{x}^{-}$ to $\mathbf{x}^{+}$ is perpendicular to the planes. So to get from $\mathbf{x}^{-}$ to $\mathbf{x}^{+}$, travel some distance in direction $\mathbf{w}$.

Computing the margin width
What we know:
$\mathbf{w} \cdot \mathbf{x}^{+} + b = +1$
$\mathbf{w} \cdot \mathbf{x}^{-} + b = -1$
$\mathbf{x}^{+} = \mathbf{x}^{-} + \lambda \mathbf{w}$
$|\mathbf{x}^{+} - \mathbf{x}^{-}| = M$
It’s now easy to get M in terms of w and b:
$\mathbf{w} \cdot (\mathbf{x}^{-} + \lambda \mathbf{w}) + b = 1 \;\Rightarrow\; \mathbf{w} \cdot \mathbf{x}^{-} + b + \lambda\, \mathbf{w} \cdot \mathbf{w} = 1 \;\Rightarrow\; -1 + \lambda\, \mathbf{w} \cdot \mathbf{w} = 1 \;\Rightarrow\; \lambda = \dfrac{2}{\mathbf{w} \cdot \mathbf{w}}$
$M = |\mathbf{x}^{+} - \mathbf{x}^{-}| = |\lambda \mathbf{w}| = \lambda |\mathbf{w}| = \lambda \sqrt{\mathbf{w} \cdot \mathbf{w}} = \dfrac{2 \sqrt{\mathbf{w} \cdot \mathbf{w}}}{\mathbf{w} \cdot \mathbf{w}} = \dfrac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}$
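As a numerical sanity check on the margin-width result M = 2 / sqrt(w . w), the sketch below builds a point on the minus-plane, steps by lambda * w to reach the plus-plane, and measures the distance. The weights, the offset, and all variable names are hypothetical choices for illustration.

```python
import math


def margin_width(w):
    """Margin width M = 2 / sqrt(w . w)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = (3.0, 4.0), 0.0                 # hypothetical weights and offset
w_dot_w = sum(wi * wi for wi in w)     # 25.0

# One point on the minus-plane: x_minus = -(1 + b) * w / (w . w)
# satisfies w . x_minus + b = -1.
x_minus = tuple(-(1.0 + b) * wi / w_dot_w for wi in w)

# Step along w by lambda = 2 / (w . w) to land on the plus-plane.
lam = 2.0 / w_dot_w
x_plus = tuple(xm + lam * wi for xm, wi in zip(x_minus, w))

plus_value = sum(wi * xi for wi, xi in zip(w, x_plus)) + b
print(margin_width(w), math.dist(x_plus, x_minus), plus_value)
```

The first two printed values should agree, and the third should be 1 up to floating-point error.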

Learning the Maximum Margin Classifier
M = Margin Width = $\dfrac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}$
Given a guess of w and b we can
–Compute whether all data points are in the correct half-planes
–Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton’s Method?
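The two checks named on this slide (are all points in the correct half-planes, and how wide is the margin) can be sketched as follows. The function name `evaluate_guess`, the toy data, and the candidate guesses are assumptions for illustration; the search over w and b itself is not shown.

```python
import math


def evaluate_guess(points, labels, w, b):
    """Return (all_correct, margin_width) for a candidate (w, b).

    A point with label y is on the correct side of the margin
    when y * (w . x + b) >= 1.
    """
    all_correct = all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )
    margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
    return all_correct, margin


# Hypothetical toy data: two positive and two negative points.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

for w, b in [((0.5, 0.5), -1.0), ((1.0, 1.0), -2.0), ((1.0, 0.0), -2.5)]:
    print(w, b, evaluate_guess(points, labels, w, b))
```

Among the guesses that classify every point correctly, a search procedure would keep the one with the largest margin.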

Learning SVMs
Trick #1: Just find the points that would be closest to the optimal separating plane (the “support vectors”) and work directly from those instances.
Trick #2: Represent as a quadratic optimization problem, and use quadratic programming techniques.
Trick #3 (the “kernel trick”):
–Instead of just using the features, represent the data using a high-dimensional feature space constructed from a set of basis functions (polynomial and Gaussian combinations of the base features are the most common).
–Then find a separating plane / SVM in that high-dimensional space
–Voila: A nonlinear classifier!
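To see why Trick #3 yields a nonlinear classifier, here is a minimal sketch under stated assumptions: a hypothetical 1-D data set that no single threshold can separate, a hand-picked basis expansion phi(x) = (x, x^2), and a hand-picked separating plane in the expanded space. Nothing here is learned; it only illustrates the idea of separating in a higher-dimensional feature space.

```python
def phi(x):
    """Hypothetical basis expansion: map a 1-D input to (x, x^2)."""
    return (x, x * x)


def classify(z, w, b):
    """Linear classifier sign(w . z + b) applied in the expanded space."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1


# The outer points are positive and the inner points negative, so no
# single threshold on x separates the classes.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

# In (x, x^2) space the plane x^2 = 2.5 separates them: w = (0, 1), b = -2.5.
w, b = (0.0, 1.0), -2.5
predictions = [classify(phi(x), w, b) for x in xs]
print(predictions, predictions == ys)
```

The boundary is a straight line in the expanded space but corresponds to the two thresholds x = ±sqrt(2.5) in the original space.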

Common SVM basis functions
$z_k$ = ( polynomial terms of $x_k$ of degree 1 to q )
$z_k$ = ( radial basis functions of $x_k$ )
$z_k$ = ( sigmoid functions of $x_k$ )
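The three families of basis functions can be sketched as explicit feature maps. The exact forms below are assumptions chosen for illustration (a Gaussian bump around each of a set of hypothetical centers, and a tanh of a scaled dot product for the sigmoid case); the slides do not fix these details.

```python
import itertools
import math


def polynomial_terms(x, q):
    """All monomials of the components of x with degree 1 to q."""
    terms = []
    for degree in range(1, q + 1):
        for combo in itertools.combinations_with_replacement(range(len(x)), degree):
            terms.append(math.prod(x[i] for i in combo))
    return terms


def rbf_terms(x, centers, width):
    """One Gaussian bump per center (centers and width are assumed parameters)."""
    return [
        math.exp(-math.dist(x, c) ** 2 / (2.0 * width ** 2))
        for c in centers
    ]


def sigmoid_terms(x, centers, kappa, delta):
    """tanh(kappa * x . c - delta) per center (one common sigmoid form, assumed here)."""
    return [
        math.tanh(kappa * sum(xi * ci for xi, ci in zip(x, c)) - delta)
        for c in centers
    ]


x = (2.0, 3.0)
print(polynomial_terms(x, 2))
print(rbf_terms(x, [(2.0, 3.0), (0.0, 0.0)], width=1.0))
print(sigmoid_terms(x, [(1.0, 0.0)], kappa=0.5, delta=1.0))
```

Each function returns the vector z_k for a single input x_k; a linear SVM would then be trained on those vectors instead of the raw features.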

SVM Performance
Anecdotally they work very very well indeed.
Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark
Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
There is a lot of excitement and religious fervor about SVMs as of 2001.

Unsupervised Learning: Clustering

Unsupervised Learning
Learn without a “supervisor” who labels instances
–Clustering
–Scientific discovery
–Pattern discovery
–Associative learning
Clustering:
–Given a set of instances without labels, partition them such that each instance is:
  similar to other instances in its partition (intra-cluster similarity)
  dissimilar from instances in other partitions (inter-cluster dissimilarity)

Clustering Techniques
Partitional clustering
–k-means clustering
Agglomerative clustering
–Single-link clustering
–Complete-link clustering
–Average-link clustering
Spectral clustering

Formal Data Clustering
Data clustering is:
–Dividing a set of data objects into groups such that there is a clear pattern (e.g. similarity to each other) for why objects are in the same cluster
A clustering algorithm requires:
–A data set D
–A clustering description C
–A clustering objective Obj(C)
–An optimization method Opt(D) ~ C
Obj measures the goodness of the best clustering C that Opt(D) can find

What does D look like?
Supervised learning (KNN, C4.5, SVM, etc.)
Training Set
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
Test Set
5.1,3.5,1.4,0.2,??
4.9,3.0,1.4,0.2,??
4.7,3.2,1.3,0.2,??
4.6,3.1,1.5,0.2,??
5.0,3.6,1.4,0.2,??
5.4,3.9,1.7,0.4,??
4.6,3.4,1.4,0.3,??
5.0,3.4,1.5,0.2,??
4.4,2.9,1.4,0.2,??
4.9,3.1,1.5,0.1,??
5.4,3.7,1.5,0.2,??
4.8,3.4,1.6,0.2,??

What does D look like?
Un-supervised learning (Clustering!)
Training Set
6.3,2.5,5.0,1.9,??
6.5,3.0,5.2,2.0,??
6.2,3.4,5.4,2.3,??
5.9,3.0,5.1,1.8,??
5.7,3.0,4.2,1.2,??
5.7,2.9,4.2,1.3,??
6.2,2.9,4.3,1.3,??
5.1,2.5,3.0,1.1,??
5.1,3.4,1.5,0.2,??
5.0,3.5,1.3,0.3,??
4.5,2.3,1.3,0.3,??
4.4,3.2,1.3,0.2,??
Test Set
5.1,3.5,1.4,0.2,??
4.9,3.0,1.4,0.2,??
4.7,3.2,1.3,0.2,??
4.6,3.1,1.5,0.2,??
5.0,3.6,1.4,0.2,??
5.4,3.9,1.7,0.4,??
4.6,3.4,1.4,0.3,??
5.0,3.4,1.5,0.2,??
4.4,2.9,1.4,0.2,??
4.9,3.1,1.5,0.1,??
5.4,3.7,1.5,0.2,??
4.8,3.4,1.6,0.2,??

What does C look like?
After clustering, the output looks like a ‘labeled’ data set for a supervised learning algorithm:
6.3,2.5,5.0,1.9,1
6.5,3.0,5.2,2.0,1
6.2,3.4,5.4,2.3,1
5.9,3.0,5.1,1.8,1
5.7,3.0,4.2,1.2,2
5.7,2.9,4.2,1.3,2
6.2,2.9,4.3,1.3,2
5.1,2.5,3.0,1.1,2
5.1,3.4,1.5,0.2,3
5.0,3.5,1.3,0.3,3
4.5,2.3,1.3,0.3,3
4.4,3.2,1.3,0.2,3
Clustering Vector: 1 1 1 1 2 2 2 2 3 3 3 3
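In code, D and C can be held as two parallel lists: one row of features per object and one cluster label per object. The sketch below uses the twelve rows and the clustering vector from this slide; the variable names are illustrative only.

```python
from collections import defaultdict

# Data set D: the twelve unlabeled rows from the slide.
D = [
    (6.3, 2.5, 5.0, 1.9), (6.5, 3.0, 5.2, 2.0), (6.2, 3.4, 5.4, 2.3),
    (5.9, 3.0, 5.1, 1.8), (5.7, 3.0, 4.2, 1.2), (5.7, 2.9, 4.2, 1.3),
    (6.2, 2.9, 4.3, 1.3), (5.1, 2.5, 3.0, 1.1), (5.1, 3.4, 1.5, 0.2),
    (5.0, 3.5, 1.3, 0.3), (4.5, 2.3, 1.3, 0.3), (4.4, 3.2, 1.3, 0.2),
]

# Clustering vector C: C[i] is the cluster label of D[i].
C = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# Group the rows by cluster label.
clusters = defaultdict(list)
for row, label in zip(D, C):
    clusters[label].append(row)

for label in sorted(clusters):
    print(label, len(clusters[label]), clusters[label][0])
```

Unlike class labels, the numbers 1, 2, 3 carry no meaning of their own; any relabeling of the clusters describes the same partition.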

Big Questions About Clustering
How do we even begin clustering?
How do we know we’ve found anything?
How do we know if what we found is even useful?
–How to evaluate the results?
What do we apply this to?
–What’s the truth, versus the hope, of reality?

K-Means Clustering
D = numeric d-dimensional data
C = partitioning of data points into k clusters
Obj(C) = Root Mean Squared Error (RMSE)
–Square root of the mean squared distance between each object and its cluster’s mean value
Optimization Method
1. Select k random objects as the initial means
2. While the current clustering is different from the previous:
   1. Move each object to the cluster with the closest mean
   2. Re-compute the cluster means
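The optimization method above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name, the `seed` parameter, and the choice to keep the old mean when a cluster becomes empty are assumptions. The sample data are the twelve rows from the earlier "What does D look like?" slide.

```python
import math
import random


def kmeans(points, k, seed=0):
    """Return (assignment, means, rmse) following the two-step loop."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # 1. k random objects as initial means
    assignment = None
    while True:
        # 2.1 Move each object to the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:   # clustering unchanged: stop
            break
        assignment = new_assignment
        # 2.2 Re-compute the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # assumption: keep old mean if empty
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    squared = sum(math.dist(p, means[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, means, math.sqrt(squared / len(points))


D = [
    (6.3, 2.5, 5.0, 1.9), (6.5, 3.0, 5.2, 2.0), (6.2, 3.4, 5.4, 2.3),
    (5.9, 3.0, 5.1, 1.8), (5.7, 3.0, 4.2, 1.2), (5.7, 2.9, 4.2, 1.3),
    (6.2, 2.9, 4.3, 1.3), (5.1, 2.5, 3.0, 1.1), (5.1, 3.4, 1.5, 0.2),
    (5.0, 3.5, 1.3, 0.3), (4.5, 2.3, 1.3, 0.3), (4.4, 3.2, 1.3, 0.2),
]
assignment, means, rmse = kmeans(D, k=3)
print(assignment, round(rmse, 3))
```

The cluster indices in `assignment` run from 0 to k-1 and depend on the random initial means, so different seeds can give different labelings and different RMSE values.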

K-Means Demo

K-Means Comments
K-means has some randomness in its initialization, which means:
–Two different executions on the same data, same number of clusters will likely have different results
–Two different executions may have very different run-times due to the convergence test
In practice, run multiple times and take result with the best RMSE
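The "run multiple times and keep the best RMSE" advice can be sketched as a thin wrapper around a single k-means run. Everything named here (`kmeans_once`, `best_of_restarts`, the restart count, the toy 1-D data) is a hypothetical choice for illustration.

```python
import math
import random


def kmeans_once(points, k, rng):
    """One k-means run from random initial means; returns (assignment, rmse)."""
    means = rng.sample(points, k)
    assignment = None
    while True:
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    squared = sum(math.dist(p, means[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, math.sqrt(squared / len(points))


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times; return the best run and every run's RMSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    best = min(runs, key=lambda run: run[1])
    return best, [rmse for _, rmse in runs]


# Hypothetical 1-D data with three obvious pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
(best_assignment, best_rmse), all_rmse = best_of_restarts(points, k=3)
print(best_assignment, best_rmse, sorted(all_rmse))
```

Printing all the RMSE values makes the run-to-run variation visible; the wrapper simply keeps the smallest one.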

___-Link Clustering
1. Initialize each object in its own cluster
2. Compute the cluster distance matrix M by the selected criterion (below)
3. While there are more than k clusters:
   1. Join the clusters with the shortest distance
   2. Update M by the selected criterion
Criterion for ___-link clustering
–Single-link: use the distance of the closest objects between two clusters
–Complete-link: use the distance of the most distant objects between the two clusters
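A compact sketch of the procedure above, covering the single-link and complete-link criteria. To stay short it recomputes cluster distances on every pass instead of maintaining the matrix M, which is simpler but slower; the function and parameter names are assumptions.

```python
import math


def link_clustering(points, k, criterion="single"):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    # 1. Each object starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Single-link uses the closest pair, complete-link the most distant pair.
    pick = min if criterion == "single" else max

    def cluster_distance(a, b):
        return pick(math.dist(points[i], points[j]) for i in a for j in b)

    # 3. Merge the two closest clusters until only k remain.
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(link_clustering(points, 2, "single"))
print(link_clustering(points, 2, "complete"))
```

On well-separated data like this both criteria agree; they tend to differ on chain-like data, where single-link follows the chain and complete-link prefers compact groups.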

___-Link Demo
How can we measure the distance between these clusters?
[Figure: two clusters with the single-link distance and the complete-link distance marked]
What is best for:
–Spherical data (above)?
–Chain-like data?

___-Link Comments
The ___-link algorithms are not random in any way, which means:
–You’ll get the same results whenever you use the same data and same number of clusters
Choosing between these algorithms and K-means (or any other clustering algorithm) requires lots of research and careful analysis

My Research: Relational Data Clustering

Relational Data Clustering is:
The task of organizing objects into logical groups, or clusters, taking into account the relational links between objects

Relational Data
Formally:
–A set of object domains
–Sets of instances from those domains
–Sets of relational tuples, or links between instances
In Practice:
–“Relational data” refers only to data that necessitates the use of links
–Information not encoded using a relation is referred to as an attribute
Spaces:
–Attribute space = Ignore relations
–Relation space = Ignore attributes
People (Name, Gender): (Sally, F), (Fred, M), (Joe, M)
Friends: Sally, Fred, Joe
[Figure: graph with nodes {Sally,F}, {Joe,M}, {Fred,M}]

What does D Look Like Now?
Nodes + Edges (pointers!!!):
Adjacency Matrix:
[Figure labels: Implementation Representation, Conceptual Representation]
Aggregation Methods:
–AverageAgeOfNeighbors, DominantGenderOfNeighbors, AvgSalaryOfNeighbors
–Leads to a non-relational space
–Clustered using methods previously discussed
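A small sketch of the adjacency-matrix representation and one aggregation method. The names come from the earlier People table, but the friendship links and the ages are hypothetical values invented for this example.

```python
names = ["Sally", "Fred", "Joe"]
age = {"Sally": 30, "Fred": 40, "Joe": 22}          # hypothetical attribute
friends = [("Sally", "Fred"), ("Sally", "Joe")]     # hypothetical links

# Adjacency matrix: adjacency[i][j] == 1 when names[i] and names[j] are linked.
index = {name: i for i, name in enumerate(names)}
adjacency = [[0] * len(names) for _ in names]
for a, b in friends:
    adjacency[index[a]][index[b]] = 1
    adjacency[index[b]][index[a]] = 1               # friendship is symmetric


def average_age_of_neighbors(name):
    """Aggregate a relational neighborhood into one ordinary attribute."""
    neighbors = [names[j] for j, linked in enumerate(adjacency[index[name]]) if linked]
    if not neighbors:
        return None
    return sum(age[n] for n in neighbors) / len(neighbors)


for row in adjacency:
    print(row)
print({name: average_age_of_neighbors(name) for name in names})
```

Once every object carries aggregated values like this, the data are an ordinary attribute table and can be fed to k-means or a link-based method.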

Block Models
A block model is a partitioning of the links in a relation
–Reorder the rows and columns of an adjacency matrix by cluster label, place boundaries between clusters
Block $b_{ij}$: Set of edges from cluster i to cluster j (also referred to as a block position for a single link)
If some blocks are dense, and the rest are sparse, we can generate a summary graph
Block modeling is useful for both visualization and numerical analysis
[Figure: adjacency matrix blocked by clusters 1, 2, 3, and a summary graph over clusters 1, 3, 2 with edge weights 0.9, 0.5, 0.1, 0.3, 0.8]
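Counting the links that fall in each block is a short double loop. The sketch below assumes a directed relation stored as a 0/1 adjacency matrix and cluster labels numbered from 0; the function name and the toy graph are hypothetical.

```python
def block_model(adjacency, labels, k):
    """Return a k x k matrix whose cell [i][j] counts links from cluster i to cluster j."""
    blocks = [[0] * k for _ in range(k)]
    for source, row in enumerate(adjacency):
        for target, linked in enumerate(row):
            if linked:
                blocks[labels[source]][labels[target]] += 1
    return blocks


# Hypothetical directed relation over four objects.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
labels = [0, 0, 1, 1]      # objects 0 and 1 in cluster 0, objects 2 and 3 in cluster 1

print(block_model(adjacency, labels, 2))
```

Here the diagonal cells hold most of the links, which is the pattern a community-style clustering looks for.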

Two Relational Clustering Algorithms
Community Detection
–Maximizes connectivity within clusters and minimizes connectivity between clusters
–Intuitive concept that links identify classes
–Equivalent to maximizing density only on the diagonal blocks
–Faster than more general relational clustering approaches
Stochastic Block Modeling
–Maximizes the likelihood that two objects in the same cluster have the same linkage pattern
  Linkage may be within, or between clusters
–Subsumes community detection
–Equivalent to maximizing density in any block, rather than just the diagonal
–Generalizes relational clustering

My Work: Block Modularity
General block-model-based clustering approach
Models relations only
Motivated by poor scalability of stochastic block modeling
–Would be useful to have a block modeling approach that scales as well as community detection algorithms
Contributions:
–A clearly defined measure of general relational structure (block modularity)
–An iterative clustering algorithm that is much faster than prior works

Relational Structure
What is “structure”?
–High level: non-randomness
–Relational structure: non-random connectivity pattern
A relation is structured if its observed connectivity pattern is clearly distinguished from that of a random relation

Approach Overview
Assume that there exists a “model” random relation:
–Any clustering of this relation will have a similar block model
In contrast, for any non-random relation:
–There should exist at least one clustering that distinguishes this relation from the random block model
[Figure: block models under a random clustering and under a structure-identifying clustering]
Structure-Based Clustering Requires:
1. Means of comparing relational structures
2. Definition of a “model” random relation
3. Method for finding the most structure-identifying clustering

Comparing Structure: Block Modularity
Given an input relation, a model random relation*, and a structure-identifying clustering, we compute block modularity:
1. Find the block model for each relation:
   Input Relation:
   60 0 0
   3 39 18
   0 14 46
   Model Random Relation: [block counts shown in figure]
2. Compute the absolute difference of the number of links in each block:
   40 20 20
   13 11 2
   20 6 26
3. Compute the sum of all the cells in the difference matrix: 158
4. (Optional) Normalize value by twice the number of links: 158 / (2 × 180) = 0.4389
*Required: the model random relation should have the same number of links as the input relation
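The four steps can be sketched directly. The input block counts below are read from the slide's figure text (they total 180 links, consistent with 158 / 360 = 0.4389). The model block counts are an assumption: they are back-solved so that the result reproduces the slide's 158 with the same 180 links, and are not taken from the slide itself.

```python
def block_modularity(input_blocks, model_blocks, normalize=False):
    """Sum of absolute per-block differences, optionally divided by 2 * links."""
    difference = [
        [abs(a - b) for a, b in zip(input_row, model_row)]
        for input_row, model_row in zip(input_blocks, model_blocks)
    ]
    total = sum(sum(row) for row in difference)
    if normalize:
        links = sum(sum(row) for row in input_blocks)
        return total / (2 * links)
    return total


# Block model of the input relation, as read from the slide (180 links).
input_blocks = [[60, 0, 0], [3, 39, 18], [0, 14, 46]]
# ASSUMPTION: model random block counts chosen to match the slide's totals.
model_blocks = [[20, 20, 20], [16, 28, 16], [20, 20, 20]]

print(block_modularity(input_blocks, model_blocks))
print(round(block_modularity(input_blocks, model_blocks, normalize=True), 4))
```

A relation whose block model matches the random model exactly scores 0, and larger values indicate a clustering that exposes more non-random structure.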

Finding a Structure-Identifying Clustering (Or, Clustering With Block Modularity)
Referred to as BMOD for brevity

Experimental Evaluation
Work-in-Progress
Past Evaluation: Comparing with small, manageable data sets to evaluate increase in speed
New Ideas:
–Non-Block-Modeling algorithm is a current popular approach
  Is BMOD faster than it? If not, how much slower?
–SCALING UP
  Demonstrated speed on “small” data sets
  –~3000 nodes, 4000 edges
  How would we do on, say, Facebook?
  –500 M nodes; given avg. 100 friends per node, about 25 B edges (500 M × 100 / 2, counting each friendship once)
  –Challenges:
    Can’t download Facebook or any data source that is comparable
    How to generate a ‘realistic’ artificial data set that has similar features as FB?
  –Anyone want to help???

Block Modularity Clustering Results

Methodology
Goals: assess speed, accuracy of block modularity vs. leading stochastic method
–Degree-Corrected Stochastic Block Model (DCBM) (Karrer & Newman, 2011)
Accuracy: Normalized Mutual Information
Data: Generated using DCBM (next slide)
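Normalized mutual information compares a found clustering with the planted one without caring how the clusters are numbered. The sketch below uses one common normalization, dividing by the average of the two entropies; the slides do not say which variant was used, so treat that choice and the function name as assumptions.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A; B) / (H(A) + H(B)), using natural logarithms."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        return 1.0      # both clusterings put everything in a single cluster
    return 2.0 * mutual / (entropy_a + entropy_b)


planted = [1, 1, 2, 2]
print(normalized_mutual_information(planted, [5, 5, 7, 7]))   # same partition, renamed
print(normalized_mutual_information(planted, [1, 2, 1, 2]))   # unrelated partition
```

A score near 1 means the two clusterings describe the same partition, and a score near 0 means knowing one tells you nothing about the other.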

Data Generation
Given a degree distribution, and parameters for DCBM, provide a block-model configuration matrix:
Mix perfect model with a random graph model:

Results

Stress Test: Mock Facebook
Sampled degree distribution from subset of 100K Facebook users with 8M edges (Gjoka et al., 2010)
Planted an artificial cluster structure
–Repeated bridges for 1000 total clusters

Future Work
1000’s of clusters: getting nowhere fast?
–Post-analysis and applications
–Information Propagation
Map/Reduce Implementation

Conclusion
Fast and effective when compared to stochastic block modeling
Iterative, and requires some basic counting mechanisms
–Much simpler and less error-prone than implementing a stochastic algorithm
–Fewer mathematical prerequisites make the algorithm accessible to more programmers
A measure of structure, not just an identifier, and its value can be used for other applications
