
1 Recap: General Naïve Bayes
- A general naïve Bayes model:
  - Y: label to be predicted
  - F1, …, Fn: features of each instance
- [Figure: Bayes net with Y as the parent of F1, F2, …, Fn]
- This slide deck courtesy of Dan Klein at UC Berkeley

2 Example Naïve Bayes Models
- Bag-of-words for text
  - One feature for every word position in the document
  - All features share the same conditional distributions
  - Maximum likelihood estimates: word frequencies, by label
  - [Figure: Bayes net with Y as the parent of W1, W2, …, Wn]
- Pixels for images
  - One feature for every pixel, indicating whether it is on (black)
  - Each pixel has a different conditional distribution
  - Maximum likelihood estimates: how often a pixel is on, by label
  - [Figure: Bayes net with Y as the parent of F0,0, F0,1, …, Fn,n]
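A minimal sketch (not from the slides) of how a bag-of-words naïve Bayes classifier scores a document, assuming probability tables prior[y] = P(y) and cond[y][w] = P(w|y) have already been estimated; log-probabilities avoid underflow, and the tiny floor for unseen words stands in for the smoothing discussed below.

import math

def nb_log_posterior(words, prior, cond):
    # prior[y] = P(y); cond[y][w] = P(W=w | Y=y); both assumed pre-estimated.
    scores = {}
    for y in prior:
        total = math.log(prior[y])
        for w in words:
            # Tiny floor for words never seen with label y (stand-in for smoothing).
            total += math.log(cond[y].get(w, 1e-9))
        scores[y] = total
    return scores

# Hypothetical toy tables, just to show the call:
prior = {"spam": 0.4, "ham": 0.6}
cond = {"spam": {"free": 0.05, "meeting": 0.001},
        "ham":  {"free": 0.002, "meeting": 0.03}}
scores = nb_log_posterior(["free", "free", "meeting"], prior, cond)
print(max(scores, key=scores.get))  # "spam" for this toy example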

3 Naïve Bayes Training
- Data: labeled instances, e.g. emails marked as spam/ham by a person
  - Divide into training, held-out, and test sets
  - Features are known for every training, held-out, and test instance
- Estimation: count feature values in the training set and normalize to get maximum likelihood estimates of the probabilities
- Smoothing (aka regularization): adjust estimates to account for unseen data
- [Figure: data split into Training Set, Held-Out Set, Test Set]
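A rough illustration of "count and normalize" (a sketch, not the slides' code; the (label, word list) data format is an assumption):

from collections import Counter, defaultdict

def mle_estimates(labeled_docs):
    # labeled_docs: list of (label, list_of_words) pairs from the training set.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for y, words in labeled_docs:
        label_counts[y] += 1
        word_counts[y].update(words)
    n = sum(label_counts.values())
    prior = {y: c / n for y, c in label_counts.items()}           # P(Y)
    cond = {y: {w: c / sum(counts.values()) for w, c in counts.items()}
            for y, counts in word_counts.items()}                  # P(W|Y), unsmoothed
    return prior, cond

train = [("spam", ["free", "money", "free"]), ("ham", ["meeting", "tomorrow"])]
prior, cond = mle_estimates(train)
print(prior["spam"], cond["spam"]["free"])   # 0.5 and 2/3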

4 Estimation: Smoothing
- Problems with maximum likelihood estimates:
  - If I flip a coin once, and it's heads, what's the estimate for P(heads)?
  - What if I flip 10 times with 8 heads?
  - What if I flip 10M times with 8M heads?
- Basic idea:
  - We have some prior expectation about parameters (here, the probability of heads)
  - Given little evidence, we should skew towards our prior
  - Given a lot of evidence, we should listen to the data

5 Estimation: Smoothing
- Relative frequencies are the maximum likelihood estimates: P_ML(x) = count(x) / N
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

6 Recap: Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times
  - P_LAP,k(x) = (count(x) + k) / (N + k|X|)
  - What's Laplace with k = 0?
  - k is the strength of the prior
- Laplace for conditionals: smooth each condition separately
  - P_LAP,k(x|y) = (count(x, y) + k) / (count(y) + k|X|)
  - Can be derived by dividing
- [Example: coin flips H H T]
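A sketch of the Laplace estimate as described above (pretend every outcome in the domain was seen k extra times); the helper names and data layout are my own:

def laplace_estimate(counts, domain, k=1.0):
    # P_LAP,k(x) = (count(x) + k) / (N + k * |domain|); k = 0 recovers the MLE.
    n = sum(counts.get(x, 0) for x in domain)
    denom = n + k * len(domain)
    return {x: (counts.get(x, 0) + k) / denom for x in domain}

def laplace_conditional(counts_by_label, domain, k=1.0):
    # Laplace for conditionals: smooth each condition (each label) separately.
    return {y: laplace_estimate(c, domain, k) for y, c in counts_by_label.items()}

# One coin flip that came up heads:
print(laplace_estimate({"H": 1}, domain=["H", "T"], k=1))  # {'H': 2/3, 'T': 1/3}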

7 Estimation: Linear Interpolation
- In practice, Laplace often performs poorly for P(X|Y):
  - When |X| is very large
  - When |Y| is very large
- Another option: linear interpolation
  - Also get P(X) from the data
  - Make sure the estimate of P(X|Y) isn't too different from P(X)
  - P_LIN(x|y) = α · P_ML(x|y) + (1 − α) · P_ML(x)
  - What if α is 0? 1?

8 Better: Linear Interpolation
- Linear interpolation for conditional likelihoods
- Idea: the conditional probability of a feature x given a label y should be close to the marginal probability of x
- Example: a rare word like “interpolation” should be similarly rare in both ham and spam (a priori)
- Procedure: collect relative frequency estimates of both the conditional and the marginal, then average
- Effect: features have odds ratios closer to 1
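A sketch of the averaging step, assuming the conditional and marginal relative-frequency tables have already been collected (the weight name alpha follows the previous slide's convention):

def interpolated_conditional(cond_ml, marginal_ml, alpha):
    # P_LIN(x|y) = alpha * P_ML(x|y) + (1 - alpha) * P_ML(x)
    # alpha = 1 keeps the raw conditional; alpha = 0 ignores the label entirely.
    return {y: {x: alpha * p_xy.get(x, 0.0) + (1 - alpha) * p_x
                for x, p_x in marginal_ml.items()}
            for y, p_xy in cond_ml.items()}

marginal = {"interpolation": 0.001, "free": 0.01}
cond = {"spam": {"free": 0.05}, "ham": {"interpolation": 0.002, "free": 0.001}}
print(interpolated_conditional(cond, marginal, alpha=0.5)["spam"]["interpolation"])
# 0.0005: the rare word stays similarly rare under both labels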

9 Real NB: Smoothing
- Odds ratios without smoothing (all infinite):
  south-west : inf
  nation : inf
  morally : inf
  nicely : inf
  extent : inf
  ...
  screens : inf
  minute : inf
  guaranteed : inf
  $205.00 : inf
  delivery : inf
  ...

10 Real NB: Smoothing
- Odds ratios after smoothing:
  helvetica : 11.4
  seems : 10.8
  group : 10.2
  ago : 8.4
  areas : 8.3
  ...
  verdana : 28.8
  Credit : 28.4
  ORDER : 27.2
  : 26.9
  money : 26.5
  ...
- Do these make more sense?

11 Tuning on Held-Out Data
- Now we've got two kinds of unknowns
  - Parameters: P(Fi|Y) and P(Y)
  - Hyperparameters, like the amount of smoothing to do: k, α
- Where to learn which unknowns
  - Learn parameters from the training set
  - Can't tune hyperparameters on training data (why?)
  - For each possible value of the hyperparameters, train on the training set and evaluate on the held-out set
  - Choose the best value and do a final test on the test data
- [Figure: accuracy on training and held-out data vs. the proportion of P_ML(x) in P(x|y)]
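A generic sketch of the held-out tuning loop; the training and evaluation callables are placeholders for whatever model is being tuned:

def tune_hyperparameter(train_fn, eval_fn, train_data, held_out, candidates):
    # train_fn(train_data, value) -> model; eval_fn(model, held_out) -> accuracy.
    # The test set is not touched until one final value has been chosen.
    best_value, best_acc = None, float("-inf")
    for value in candidates:
        model = train_fn(train_data, value)
        acc = eval_fn(model, held_out)
        if acc > best_acc:
            best_value, best_acc = value, acc
    return best_value

# e.g. tune_hyperparameter(train_nb, accuracy_on, train, held_out, [0.001, 0.01, 0.1, 1, 10]),
# where train_nb and accuracy_on are hypothetical wrappers around the sketches above.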

12 Baselines
- First task when classifying: get a baseline
  - Baselines are very simple “straw man” procedures
  - Help determine how hard the task is
  - Help know what a “good” accuracy is
- Weak baseline: most frequent label classifier
  - Gives all test instances whatever label was most common in the training set
  - E.g. for spam filtering, might label everything as spam
  - Accuracy might be very high if the problem is skewed
- When conducting real research, we usually use previous work as a (strong) baseline

13 Confidences from a Classifier
- The confidence of a classifier:
  - Posterior probability of the most likely label
  - Represents how sure the classifier is of the classification
  - Any probabilistic model will have confidences
  - No guarantee the confidence is correct
- Calibration
  - Strong calibration: confidence predicts accuracy rate
  - Weak calibration: higher confidences mean higher accuracy
  - What's the value of calibration?
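One way to probe calibration (a sketch; the binning scheme is my own choice): bucket predictions by confidence and compare each bucket's average confidence with its observed accuracy. Under weak calibration accuracy should rise with confidence; under strong calibration the two columns should roughly match.

def calibration_table(confidences, correct, n_bins=10):
    # confidences: posterior of the predicted label per instance (0..1);
    # correct: parallel list of booleans (prediction == true label).
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    rows = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            rows.append((mean_conf, accuracy, len(bucket)))
    return rows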

14 Precision vs. Recall
- Let's say we want to classify web pages as homepages or not
  - In a test set of 1K pages, there are 3 homepages
  - Our classifier says they are all non-homepages
  - 99.7% accuracy!
  - Need new measures for rare positive events
- Precision: fraction of guessed positives which were actually positive
- Recall: fraction of actual positives which were guessed as positive
- Say we guess 5 homepages, of which 2 were actually homepages
  - Precision: 2 correct / 5 guessed = 0.4
  - Recall: 2 correct / 3 true ≈ 0.67
- Which is more important in customer support email automation?
- Which is more important in airport face recognition?
- [Figure: diagram of guessed positives vs. actual positives]
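A sketch that reproduces the slide's homepage numbers (the labels and the toy data are made up to match the counts):

def precision_recall(predicted, actual, positive="+"):
    tp = sum(p == positive and t == positive for p, t in zip(predicted, actual))
    guessed = sum(p == positive for p in predicted)
    true_pos = sum(t == positive for t in actual)
    precision = tp / guessed if guessed else 0.0
    recall = tp / true_pos if true_pos else 0.0
    return precision, recall

# 3 actual homepages, 5 guessed, 2 of the guesses correct:
actual    = ["+", "+", "+", "-", "-", "-", "-", "-", "-", "-"]
predicted = ["+", "+", "-", "+", "+", "+", "-", "-", "-", "-"]
print(precision_recall(predicted, actual))  # (0.4, 0.666...)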

15 Precision vs. Recall
- Precision/recall tradeoff
  - Often, you can trade off precision and recall
  - Only works well with weakly calibrated classifiers
- To summarize the tradeoff:
  - Break-even point: precision value when p = r
  - F-measure: harmonic mean of p and r: F = 2pr / (p + r)
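For reference, the harmonic mean mentioned above worked out on the running example (a small sketch):

def f_measure(p, r):
    # Harmonic mean of precision and recall (F1).
    return 0.0 if p == 0 and r == 0 else 2 * p * r / (p + r)

print(f_measure(0.4, 2 / 3))  # 0.5 for the homepage example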

16 Naïve Bayes Summary
- Bayes rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption takes all features to be independent given the class label
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Confidences are useful when the classifier is calibrated

17 Example Errors
- “Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the...”
- “... To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click...”

18 What to Do About Errors
- Problem: there's still spam in your inbox
- Need more features – words aren't enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g. all features are word occurrences)

19 Features
- A feature is a function that signals a property of the input
  - Naïve Bayes: features are random variables, and each value has conditional probabilities given the label
  - Most classifiers: features are real-valued functions
  - Common special cases:
    - Indicator features take values 0 and 1 (or -1 and 1)
    - Count features return non-negative integers
- Features are anything you can think of for which you can write code to evaluate on an input
  - Many are cheap, but some are expensive to compute
  - Can even be the output of another classifier or model
  - Domain knowledge goes here!

20 Feature Extractors
- Features: anything you can compute about the input
- A feature extractor maps inputs to feature vectors
- Many classifiers take feature vectors as inputs
- Feature vectors are usually very sparse, so use sparse encodings (i.e. only represent the non-zero keys)
- Example input: “Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
- Example feature vector:
  W=dear : 1
  W=sir : 1
  W=this : 2
  ...
  W=wish : 0
  ...
  MISSPELLED : 2
  YOUR_NAME : 1
  ALL_CAPS : 0
  NUM_URLS : 0
  ...
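A sketch of a feature extractor producing a sparse dictionary like the one above; the ALL_CAPS and NUM_URLS heuristics are illustrative guesses, not the slide's actual definitions:

from collections import Counter

def extract_features(email_text):
    tokens = email_text.split()
    features = Counter()
    for tok in tokens:
        features["W=" + tok.lower().strip(".,!?")] += 1       # word-count features
    features["ALL_CAPS"] = sum(t.isupper() and len(t) > 1 for t in tokens)
    features["NUM_URLS"] = sum(t.startswith("http") for t in tokens)
    # Sparse encoding: only keep the non-zero keys.
    return {k: v for k, v in features.items() if v}

print(extract_features("Dear Sir. First, I must solicit your confidence in this transaction"))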

21 Generative vs. Discriminative
- Generative classifiers:
  - E.g. naïve Bayes
  - A causal model with evidence variables
  - Query the model for causes given evidence
- Discriminative classifiers:
  - No causal model, no Bayes rule, often no probabilities at all!
  - Try to predict the label Y directly from X
  - Robust, accurate with varied features
  - Loosely: mistake driven rather than model driven

22 Nearest-Neighbor Classification
- Nearest neighbor for digits:
  - Take a new image
  - Compare to all training images
  - Assign based on the closest example
- Encoding: image is a vector of pixel intensities
- What's the similarity function?
  - Dot product of two image vectors?
  - Usually normalize vectors so ||x|| = 1
  - min = 0 (when?), max = 1 (when?)
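A sketch of nearest-neighbor classification with the normalized dot product as the similarity (the flat list-of-intensities format is an assumption):

import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0   # avoid dividing by zero
    return [x / norm for x in v]

def nearest_neighbor(query, training_examples):
    # training_examples: list of (label, intensity_vector) pairs.
    # With non-negative intensities the similarity is 0 when the images share
    # no "on" pixels and 1 when they are identical up to overall brightness.
    q = normalize(query)
    best_label, best_sim = None, float("-inf")
    for label, vec in training_examples:
        sim = sum(a * b for a, b in zip(q, normalize(vec)))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

print(nearest_neighbor([1, 0, 1], [("one", [1, 0, 0]), ("seven", [1, 0, 2])]))
# "seven": its direction is closer to the query's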

23 [Image-only slide]

24 Clustering
- Clustering systems:
  - Unsupervised learning
  - Detect patterns in unlabeled data
    - E.g. group emails or search results
    - E.g. find categories of customers
    - E.g. detect anomalous program executions
  - Useful when you don't know what you're looking for
  - Requires data, but no labels
  - Caveat: often get gibberish

25 Clustering
- Basic idea: group together similar instances
- Example: 2D point patterns
- What could “similar” mean?
  - One option: small (squared) Euclidean distance

26 K-Means
- An iterative clustering algorithm
  - Pick K random points as cluster centers (means)
  - Alternate:
    - Assign data instances to the closest mean
    - Assign each mean to the average of its assigned points
  - Stop when no points' assignments change
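A compact sketch of the algorithm above for 2D points (initialization and tie-breaking details are simplified):

import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100):
    means = random.sample(points, k)            # pick K random points as initial means
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Phase I: assign each point to the closest mean.
        new_assignments = [min(range(k), key=lambda j: squared_dist(p, means[j]))
                           for p in points]
        if new_assignments == assignments:       # no assignment changed: done
            break
        assignments = new_assignments
        # Phase II: move each mean to the average of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:                          # keep the old mean if its cluster emptied
                means[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return means, assignments

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(points, k=2))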

27 K-Means Example
- [Figure: k-means example on 2D points]

28 Example: K-Means
- [Web demo] http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/

29 K-Means as Optimization
- Consider the total distance to the means:
  φ(a, c) = Σ_i dist(x_i, c_{a(i)})   (x_i: points, a(i): assignments, c_j: means)
- Each iteration reduces φ
- Two stages each iteration:
  - Update assignments: fix means c, change assignments a
  - Update means: fix assignments a, change means c
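The objective written out as code (a sketch consistent with the k-means code above; "dist" is taken to be squared Euclidean distance here):

def phi(points, means, assignments):
    # Total (squared) distance of each point to its assigned mean; both
    # k-means phases can only decrease this quantity, so the algorithm converges.
    return sum(sum((a - b) ** 2 for a, b in zip(p, means[j]))
               for p, j in zip(points, assignments))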

30 Phase I: Update Assignments
- For each point, re-assign it to the closest mean:
  a(i) = argmin_j dist(x_i, c_j)
- Can only decrease the total distance φ!

31 Phase II: Update Means
- Move each mean to the average of its assigned points:
  c_j = (1 / |{i : a(i) = j}|) · Σ_{i : a(i) = j} x_i
- Also can only decrease the total distance… (Why?)
- Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean

32 Initialization
- K-means is non-deterministic
  - Requires initial means
  - It does matter what you pick!
  - What can go wrong?
- Various schemes for preventing this kind of thing: variance-based split / merge, initialization heuristics

33 K-Means Getting Stuck
- A local optimum [figure]: why doesn't this work out like the earlier example, with the purple taking over half the blue?

34 K-Means Questions
- Will K-means converge?
  - To a global optimum?
- Will it always find the true patterns in the data?
  - If the patterns are very very clear?
- Will it find something interesting?
- Do people ever use it?
- How many clusters to pick?

35 Agglomerative Clustering
- Agglomerative clustering:
  - First merge very similar instances
  - Incrementally build larger clusters out of smaller clusters
- Algorithm:
  - Maintain a set of clusters
  - Initially, each instance is in its own cluster
  - Repeat:
    - Pick the two closest clusters
    - Merge them into a new cluster
    - Stop when there's only one cluster left
- Produces not one clustering, but a family of clusterings represented by a dendrogram
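A sketch of the merge loop with single-link distance (unoptimized; the flat record of merges is one simple way to represent the dendrogram, and the linkage options are discussed on the next slide):

def single_link(c1, c2):
    # Distance between the closest pair of points, one from each cluster.
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for p in c1 for q in c2)

def agglomerative(points, linkage=single_link):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the two closest clusters under the chosen linkage...
        i, j = min(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        # ...and merge them into a new cluster.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]))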

36 Agglomerative Clustering
- How should we define “closest” for clusters with multiple elements?
- Many options:
  - Closest pair (single-link clustering)
  - Farthest pair (complete-link clustering)
  - Average of all pairs
  - Ward's method (minimum variance, like k-means)
- Different choices create different clustering behaviors

37 Clustering Application
- [Figure: example application screenshot]
- Top-level categories: supervised classification
- Story groupings: unsupervised clustering

