Multi-label classification

Different label structures:
- Exclusive / Multi-class: exactly one label applies ("Is it a banana? Is it an apple? Is it an orange? Is it a pineapple?")
- Nested / Hierarchical: labels refine one another ("Is it edible? Is it a fruit? Is it a banana?")
- General / Structured: labels can be arbitrarily correlated ("Is it a banana? Is it yellow? Is it sweet? Is it round?")
Nearest Neighbor, Decision Trees

From the classification lecture:
- NN and k-NN were already phrased in a multi-class framework.
- For decision trees, the purity of a leaf depends on the proportion of each class (we want one class to be clearly dominant).
Generative models

As in the binary case:
- Learn p(y) and p(x|y)
- Use Bayes' rule:  p(y|x) = p(y) p(x|y) / p(x)
- Classify as  argmax_y p(y|x) = argmax_y p(y) p(x|y)
Generative models

Advantages:
- Fast to train: only the data from class k is needed to learn the k-th model (a reduction by a factor of k compared with other methods)
- Works well with little data, provided the model is reasonable

Drawbacks:
- Depends on the quality of the model
- Doesn't model p(y|x) directly
- With a lot of datapoints, doesn't perform as well as discriminative methods
Naïve Bayes

Assumption: given the class, the features are independent:
p(x|y) = prod_j p(x_j | y)

[graphical model: Y with children X1, X2, X3, X4, X5; e.g. bag-of-words models]

If the features are discrete, estimate each p(x_j = v | y = k) by (smoothed) class-conditional counts.
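A minimal sketch of Naïve Bayes with discrete features, assuming Laplace smoothing; the helper names below are illustrative, not from the lecture:

```python
# Sketch: Naive Bayes with discrete features in {0, ..., n_values-1}.
import numpy as np

def train_nb(X, y, n_classes, n_values, alpha=1.0):
    """Estimate p(y) and p(x_j = v | y = k) by Laplace-smoothed counts."""
    n, d = X.shape
    prior = np.zeros(n_classes)
    cond = np.zeros((n_classes, d, n_values))
    for k in range(n_classes):
        Xk = X[y == k]                      # only class-k data is needed
        prior[k] = (len(Xk) + alpha) / (n + n_classes * alpha)
        for j in range(d):
            counts = np.bincount(Xk[:, j], minlength=n_values)
            cond[k, j] = (counts + alpha) / (len(Xk) + n_values * alpha)
    return prior, cond

def predict_nb(x, prior, cond):
    """Classify as argmax_k log p(y=k) + sum_j log p(x_j | y=k)."""
    scores = np.log(prior) + sum(np.log(cond[:, j, x[j]]) for j in range(len(x)))
    return int(np.argmax(scores))
```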
Linear classification

- Each class has a parameter vector (w_k, b_k).
- x is assigned to class k iff  k = argmax_{k'} (w_{k'}^T x + b_{k'})
- Note that we can break the symmetry and choose (w_1, b_1) = (0, 0).
- For simplicity set b_k = 0 (add a constant dimension to x and include b_k in w_k).
- So the learning goal, given separable data: choose w_k s.t.  w_{y_i}^T x_i > w_k^T x_i  for all k != y_i, for every datapoint (x_i, y_i).
Geometry of linear classification

[figure: decision regions compared for the perceptron, K-class logistic regression, and the K-class SVM]
Multiclass Perceptron

Online: for each datapoint (x, y)
- Predict: y_hat = argmax_k w_k^T x
- Update (if y_hat != y):  w_y <- w_y + x ;  w_{y_hat} <- w_{y_hat} - x

Advantages:
- Extremely simple updates (no gradient to calculate)
- No need to have all the data in memory (after a while, most points stay correctly classified)

Drawbacks:
- If the data is not separable, the step size must be decreased slowly…
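A minimal sketch of these updates (illustrative names; assumes a constant feature has been appended to x so the biases live inside W):

```python
# Sketch: multiclass perceptron with one weight vector per class.
import numpy as np

def perceptron(X, y, n_classes, n_epochs=10):
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for _ in range(n_epochs):
        for i in range(n):
            pred = np.argmax(W @ X[i])   # predict: argmax_k  w_k . x
            if pred != y[i]:             # update only on mistakes
                W[y[i]] += X[i]          # pull the true class toward x
                W[pred] -= X[i]          # push the predicted class away
    return W
```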
Polychotomous logistic regression

Distribution in exponential form (softmax):
p(y = k | x) = exp(w_k^T x) / sum_{k'} exp(w_{k'}^T x)

- Online: a stochastic gradient step for each datapoint
- Batch: all descent methods apply
- Especially in large dimension, use regularization; it can be interpreted as a small label-flip probability, e.g. target (0, 0, 1) -> (.1, .1, .8)

Advantages:
- Smooth objective function
- Gives probability estimates

Drawbacks:
- Non-sparse solutions
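A minimal sketch of one online (stochastic gradient) step with an L2 penalty; the learning rate and regularization strength are assumptions:

```python
# Sketch: one SGD step for softmax (multinomial logistic) regression.
import numpy as np

def softmax(scores):
    scores = scores - scores.max()        # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

def sgd_step(W, x, y, lr=0.1, lam=1e-4):
    """W: (n_classes, d); x: (d,); y: true class index."""
    p = softmax(W @ x)                    # p_k = p(y = k | x)
    p[y] -= 1.0                           # grad of -log p(y|x) wrt the scores
    W -= lr * (np.outer(p, x) + lam * W)  # descent step with L2 penalty
    return W
```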
Multi-class SVM

Intuitive formulation (without regularization, for the separable case): find w_k such that
w_{y_i}^T x_i >= w_k^T x_i + 1   for all k != y_i

- Primal problem: a QP
- Solved in the dual formulation, also a QP

Main advantages:
- Sparse solutions (though sparsity is not systematic)
- Speed with SMO (heuristic use of sparsity)

Drawbacks:
- Need to recalculate or store x_i^T x_j
- Outputs are not probabilities
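For reference, a sketch of the regularized, non-separable primal (the standard Crammer-Singer formulation; the slack variables xi_i are an addition, not on the original slide):

```latex
% Multi-class SVM primal (Crammer-Singer form), non-separable case
\min_{w,\,\xi}\;\; \frac{1}{2}\sum_{k}\|w_k\|^2 + C\sum_{i}\xi_i
\quad \text{s.t.} \quad
w_{y_i}^\top x_i \;\ge\; w_k^\top x_i + 1 - \xi_i
\;\;\; \forall i,\; \forall k\neq y_i,
\qquad \xi_i \ge 0 .
```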
Real world classification problems

- Digit recognition: 10 classes
- Phoneme recognition: ~50 classes [Waibel, Hanzawa, Hinton, Shikano, Lang 1989]
- Object recognition: ~100 classes
- Automated protein classification: hundreds of classes

The number of classes is sometimes big, and the multi-class algorithm can be heavy.
Combining binary classifiers

One-vs-all:
- For each class, build a classifier for that class vs. the rest
- Often very imbalanced classifiers (use asymmetric regularization)

All-vs-all:
- For each pair of classes, build a classifier for that pair
- A priori a large number of classifiers to build, but…
- the pairwise classifiers are much faster to train,
- and the classification problems are balanced (easier to find the best regularization)…
- … so that in many cases it is clearly faster than one-vs-all
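A minimal sketch of the one-vs-all scheme, assuming hypothetical helpers: any binary learner `fit_binary` and a real-valued confidence `score`:

```python
# Sketch: one-vs-all on top of an arbitrary binary classifier.
import numpy as np

def train_ova(X, y, n_classes, fit_binary):
    """fit_binary(X, labels) -> model, with labels in {+1, -1}."""
    return [fit_binary(X, np.where(y == k, 1, -1)) for k in range(n_classes)]

def predict_ova(x, models, score):
    """score(model, x) -> real-valued confidence for the positive class."""
    return int(np.argmax([score(m, x) for m in models]))
```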
Confusion matrix

[figures: confusion matrices for the classification of 20 newsgroups and for the BLAST classification of proteins into 850 superfamilies; rows = actual classes, columns = predicted classes]

- Visualize which classes are more difficult to learn
- Can also be used to compare two different classifiers
- Cluster classes and go hierarchical [Godbole, '02]
Calibration

How to measure the confidence in a class prediction? Crucial for:
- Comparison between different classifiers
- Ranking the predictions for an ROC / precision-recall curve
- Several application domains where a measure of confidence for each individual answer is very important (e.g. tumor detection)

We want a universal confidence score across different methods. Some methods have an implicit notion of confidence (e.g. for the SVM, the distance to the class boundary relative to the size of the margin); others, like logistic regression, have an explicit one.
Calibration

Definition: the decision function f of a classifier is said to be calibrated or well-calibrated if
P( correct | f(x) = p ) = p

Informally, f(x) is a good estimate of the probability of correctly classifying a new datapoint whose output value is f(x).

Intuitively, if the "raw" output of a classifier is g, you can calibrate it by estimating the probability of x being well classified given that g(x) = y, for all possible values y.
Calibration

Example: logistic regression, or more generally calculating a Bayes posterior, should yield a reasonably well-calibrated decision function.
Combining OVA calibrated classifiers

- Run the one-vs-all classifiers and calibrate each output into a probability p_1, p_2, …, p_4, plus p_other for "none of the above"
- Renormalize so the scores sum to 1, giving a consistent distribution (p_1, p_2, …, p_4, p_other)
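A minimal sketch of the combination step, assuming each calibrated score estimates p(y = k | x) and p_other covers "none of the above":

```python
# Sketch: renormalize calibrated OVA probabilities into one distribution.
import numpy as np

def combine_ova(calibrated_scores, p_other=0.0):
    p = np.append(np.asarray(calibrated_scores, dtype=float), p_other)
    return p / p.sum()          # consistent (p1, ..., pK, p_other)

print(combine_ova([0.6, 0.3, 0.2, 0.1], p_other=0.05))
```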
Other methods for calibration

Simple calibration:
- Logistic regression
- Intraclass density estimation + Naïve Bayes
- Isotonic regression

More sophisticated calibrations:
- Calibration for A-vs-A by Hastie and Tibshirani
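A minimal sketch of two of the simple calibrations above on held-out scores, using scikit-learn; the toy scores and labels are made up for illustration:

```python
# Sketch: logistic (Platt-style) and isotonic calibration of raw scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

scores = np.array([-2.0, -0.5, 0.1, 0.8, 1.5, 2.2])   # raw classifier outputs
correct = np.array([0, 0, 0, 1, 1, 1])                # was each prediction right?

# Logistic calibration: fit p(correct | score).
platt = LogisticRegression().fit(scores.reshape(-1, 1), correct)
print(platt.predict_proba([[0.5]])[:, 1])

# Isotonic regression: a monotone, piecewise-constant calibration map.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, correct)
print(iso.predict([0.5]))
```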
Local classification

Classify each position using local information only: "brace" is read as "b r a r e". Ignores correlations!

[thanks to Ben Taskar for slide!]
Local classification

[figure: pixel-level scene labeling into building / tree / shrub / ground]

Let's look at the output: the majority of the pixels are roughly correct (in fact, around 70% accuracy). However, there are some serious problems.

[thanks to Ben Taskar for slide!]
Structured classification

"b r a c e" is now read as "brace": use local information, but exploit correlations.

[thanks to Ben Taskar for slide!]
Structured classification

[figure: the same scene labeled with building / tree / shrub / ground, now exploiting correlations between neighboring pixels]

[thanks to Ben Taskar for slide!]
Structured classification

Direct approaches:
- Generative approach: Markov Random Fields (Bayesian modeling with graphical models)
- Linear classification:
  - Structured perceptron
  - Conditional Random Fields (the counterpart of logistic regression)
  - Large-margin structured classification
Structured classification

Simple example, an HMM for Optical Character Recognition:
- "Label sequence" y = (y_1, …, y_T), e.g. b r a c e
- "Observation sequence" x = (x_1, …, x_T), the letter images
- p(x, y) = p(y_1) prod_t p(y_t | y_{t-1}) prod_t p(x_t | y_t)
Structured model

Main idea: define a scoring function which decomposes as a sum of feature scores k on "parts" p:
score(x, y) = sum_p sum_k w_k f_k(x, y_p)

Label examples by looking for the max score:
y_hat = argmax_{y in Y(x)} score(x, y),  where Y(x) is the space of feasible outputs

Parts = nodes, edges, etc.
Decoding and learning

Three important operations on a general structured (e.g. graphical) model:
- Decoding: find the best label sequence
- Inference: compute probabilities of labels
- Learning: find the model + parameters w so that decoding works

HMM example ("b r a c e"):
- Decoding: Viterbi algorithm
- Inference: forward-backward algorithm
- Learning: e.g. transition and emission counts (the case of learning a generative model from fully labeled training data)
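A minimal sketch of Viterbi decoding in log space (illustrative names; the log initial, transition, and emission tables are assumed inputs):

```python
# Sketch: Viterbi decoding for an HMM, done in log space for stability.
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """obs: observation indices; returns the best state sequence."""
    T, K = len(obs), len(log_init)
    delta = np.zeros((T, K))              # best log score ending in state k
    back = np.zeros((T, K), dtype=int)    # backpointers
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_trans     # (from_state, to_state)
        back[t] = np.argmax(cand, axis=0)
        delta[t] = cand[back[t], np.arange(K)] + log_emit[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):         # follow the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```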
Decoding and learning

- Decoding: an algorithm on the graph (e.g. max-product)
- Inference: an algorithm on the graph (e.g. sum-product / belief propagation, junction tree, sampling)
- Learning: inference + optimization

Use dynamic programming to take advantage of the structure. (This is the focus of the graphical models class.)

Two essential concepts are needed:
- cliques: variables that directly depend on one another
- features (of the cliques): functions of the cliques
Cliques and features

- In undirected graphs: cliques = groups of completely interconnected variables
- In directed graphs: cliques = a variable + its parents

[figure: the "b r a c e" chain shown twice, with its cliques highlighted]
Exponential form

Once the graph is defined, the model can be written in exponential form, with parameter vector w and feature vector f(x, y):
p(y | x) proportional to exp( w^T f(x, y) )

Comparing two labellings with the likelihood ratio:
p(y | x) / p(y' | x) = exp( w^T ( f(x, y) - f(x, y') ) )
Our favorite (discriminative) algorithms

The devil is in the details…
(Averaged) Perceptron

For each datapoint (x, y):
- Predict: y_hat = argmax_{y'} w^T f(x, y')
- Update: w <- w + f(x, y) - f(x, y_hat)

Averaged perceptron: return the average of all the weight vectors seen during training rather than the last one.
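A minimal sketch of the averaged version, assuming hypothetical helpers: `feats(x, y)` returns the joint feature vector and `decode(w, x)` returns argmax_y w . feats(x, y) (e.g. via Viterbi):

```python
# Sketch: averaged structured perceptron.
import numpy as np

def averaged_perceptron(data, dim, feats, decode, n_epochs=5):
    w = np.zeros(dim)
    w_sum = np.zeros(dim)                 # running sum for the average
    for _ in range(n_epochs):
        for x, y in data:
            y_hat = decode(w, x)          # predict with current weights
            if y_hat != y:
                w += feats(x, y) - feats(x, y_hat)
            w_sum += w                    # accumulate after every datapoint
    return w_sum / (n_epochs * len(data))
```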
CRF vs. M3net

CRF:
- Conditioned on all the observations
- Z is difficult to compute with complicated graphs
- Introduction by Hannah M. Wallach; see also "MEMM & CRF" by Mayssam Sayyadian and Rob McCann: anhai.cs.uiuc.edu/courses/498ad-fall04/local/my-slides/crf-students.pdf

M3net:
- No Z…
- The margin penalty can "factorize" according to the problem structure
- Introduction by Simon Lacoste-Julien
Object segmentation results

Data: Stanford Quad, collected by the Segbot (laser range finder; M. Montemerlo, S. Thrun)
- Trained on a 30,000-point scene
- Tested on 3,000,000-point scenes
- Evaluated on a 180,000-point scene
- Classes: building, tree, shrub, ground

Model                                Error
Local learning, local prediction     32%
+ smoothing                          27%
Structured method                     7%

[Taskar+al 04, Anguelov+Taskar+al 05] [thanks to Ben Taskar for slide!]