
Recap: General Naïve Bayes
A general naïve Bayes model:
- Y: the label to be predicted
- F_1, …, F_n: features of each instance
[Diagram: Bayes net with Y as the parent of F_1, F_2, …, F_n]
This slide deck courtesy of Dan Klein at UC Berkeley.

Example Naïve Bayes Models
Bag-of-words for text:
- One feature for every word position in the document
- All features share the same conditional distributions
- Maximum likelihood estimates: word frequencies, by label
[Diagram: Y as the parent of W_1, W_2, …, W_n]
Pixels for images:
- One feature for every pixel, indicating whether it is on (black)
- Each pixel has a different conditional distribution
- Maximum likelihood estimates: how often a pixel is on, by label
[Diagram: Y as the parent of F_0,0, F_0,1, …, F_n,n]

Naïve Bayes Training
- Data: labeled instances, e.g. emails marked as spam/ham by a person
- Divide the data into training, held-out, and test sets
- Features are known for every training, held-out, and test instance
- Estimation: count feature values in the training set and normalize to get maximum likelihood estimates of the probabilities (see the sketch below)
- Smoothing (aka regularization): adjust estimates to account for unseen data
[Diagram: data divided into Training Set, Held-Out Set, Test Set]
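To make the estimation step concrete, here is a minimal Python sketch, not from the slides, that counts feature values and normalizes them into maximum likelihood estimates for a bag-of-words model; the function name and the toy data at the bottom are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate P(Y) and P(W|Y) by counting and normalizing (maximum likelihood)."""
    label_counts = Counter(labels)                       # counts for P(Y)
    word_counts = defaultdict(Counter)                   # per-label counts for P(W|Y)
    for words, y in zip(docs, labels):
        word_counts[y].update(words)

    n = len(labels)
    p_y = {y: c / n for y, c in label_counts.items()}
    p_w_given_y = {
        y: {w: c / sum(counts.values()) for w, c in counts.items()}
        for y, counts in word_counts.items()
    }
    return p_y, p_w_given_y

# Toy usage: docs is a list of token lists, labels the matching spam/ham tags.
p_y, p_w = train_naive_bayes([["win", "cash", "now"], ["hi", "there"]], ["spam", "ham"])
```

These unsmoothed estimates assign zero probability to any word never seen with a label, which is exactly the problem the next slides address.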

Estimation: Smoothing
Problems with maximum likelihood estimates:
- If I flip a coin once, and it’s heads, what’s the estimate for P(heads)?
- What if I flip 10 times with 8 heads?
- What if I flip 10M times with 8M heads?
Basic idea:
- We have some prior expectation about parameters (here, the probability of heads)
- Given little evidence, we should skew towards our prior
- Given a lot of evidence, we should listen to the data

Estimation: Smoothing
- Relative frequencies are the maximum likelihood estimates: P_ML(x) = count(x) / N
- In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

Recap: Laplace Smoothing
- Laplace’s estimate (extended): pretend you saw every outcome k extra times (written out below)
- What’s Laplace with k = 0?
- k is the strength of the prior
- Laplace for conditionals: smooth each condition
- Can be derived by dividing
[Example on slide: coin-flip run H H T]
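The Laplace formulas themselves were images in the original deck; written out in the standard add-k form that matches "pretend you saw every outcome k extra times" (with c(·) the observed counts, N the number of samples, and |X| the number of possible outcomes):

```latex
P_{\mathrm{LAP},k}(x) = \frac{c(x) + k}{N + k\,|X|}
\qquad\qquad
P_{\mathrm{LAP},k}(x \mid y) = \frac{c(x,y) + k}{c(y) + k\,|X|}
```

With k = 0 this is the maximum likelihood estimate; for the coin-flip run H H T, k = 1 gives P(heads) = (2 + 1) / (3 + 2) = 3/5 instead of 2/3.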

Estimation: Linear Interpolation
In practice, Laplace often performs poorly for P(X|Y):
- When |X| is very large
- When |Y| is very large
Another option: linear interpolation
- Also get P(X) from the data
- Make sure the estimate of P(X|Y) isn’t too different from P(X)
- What if the interpolation weight α is 0? 1?

Better: Linear Interpolation
- Linear interpolation for conditional likelihoods
- Idea: the conditional probability of a feature x given a label y should be close to the marginal probability of x
- Example: a rare word like “interpolation” should be similarly rare in both ham and spam (a priori)
- Procedure: collect relative frequency estimates of both conditional and marginal, then average (formula below)
- Effect: features have odds ratios closer to 1
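The interpolation formula was an image in the deck; the standard form of the procedure described above, with α the mixing weight (one of the hyperparameters tuned on held-out data later):

```latex
P_{\mathrm{LIN}}(x \mid y) = \alpha\,\hat{P}_{\mathrm{ML}}(x \mid y) + (1 - \alpha)\,\hat{P}_{\mathrm{ML}}(x)
```

With α = 1 this is the plain conditional relative frequency; with α = 0 every feature has the same distribution under every label, so the features carry no information about the label.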

Real NB: Smoothing
Odds ratios without smoothing (top words for each class):
- south-west: inf, nation: inf, morally: inf, nicely: inf, extent: inf, ...
- screens: inf, minute: inf, guaranteed: inf, $: inf, delivery: inf, ...

Real NB: Smoothing
Odds ratios after smoothing (top words for each class):
- helvetica: 11.4, seems: 10.8, group: 10.2, ago: 8.4, areas: ...
- verdana: 28.8, Credit: 28.4, ORDER: 27.2, ...: 26.9, money: ...
Do these make more sense?

Tuning on Held-Out Data
Now we’ve got two kinds of unknowns:
- Parameters: P(F_i|Y) and P(Y)
- Hyperparameters, like the amount of smoothing to do: k, α
Where to learn which unknowns:
- Learn parameters from the training set
- Can’t tune hyperparameters on training data (why?)
- For each possible value of the hyperparameters, train and test on the held-out data (see the sketch below)
- Choose the best value and do a final test on the test data
[Plot: accuracy on held-out data vs. the proportion of P_ML(x) in P(x|y)]
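A minimal sketch of that tuning loop, under the assumption that the caller supplies a training function and an accuracy function; train_fn and eval_fn are placeholders for your own code, not a fixed API.

```python
def tune_hyperparameter(train_fn, eval_fn, train_data, heldout_data, candidate_values):
    """Return the hyperparameter value (e.g. smoothing strength k) that scores best on held-out data."""
    best_value, best_acc = None, float("-inf")
    for value in candidate_values:
        model = train_fn(train_data, value)      # parameters are always estimated from training data only
        acc = eval_fn(model, heldout_data)       # hyperparameters are compared on the held-out set
        if acc > best_acc:
            best_value, best_acc = value, acc
    return best_value                            # report final accuracy once, on the untouched test set
```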

Baselines
- First task when classifying: get a baseline
- Baselines are very simple “straw man” procedures
- Help determine how hard the task is
- Help know what a “good” accuracy is
Weak baseline: most frequent label classifier
- Gives all test instances whatever label was most common in the training set
- E.g. for spam filtering, might label everything as spam
- Accuracy might be very high if the problem is skewed
When conducting real research, we usually use previous work as a (strong) baseline

Confidences from a Classifier
The confidence of a classifier:
- Posterior of the most likely label
- Represents how sure the classifier is of the classification
- Any probabilistic model will have confidences
- No guarantee confidence is correct
Calibration:
- Strong calibration: confidence predicts accuracy rate
- Weak calibration: higher confidences mean higher accuracy
- What’s the value of calibration?

Precision vs. Recall
- Let’s say we want to classify web pages as homepages or not
- In a test set of 1K pages, there are 3 homepages
- Our classifier says they are all non-homepages: 99.7% accuracy!
- Need new measures for rare positive events
- Precision: fraction of guessed positives which were actually positive
- Recall: fraction of actual positives which were guessed as positive
- Say we guess 5 homepages, of which 2 were actually homepages
  - Precision: 2 correct / 5 guessed = 0.4
  - Recall: 2 correct / 3 true = 0.67
- Which is more important in customer support automation?
- Which is more important in airport face recognition?
[Diagram: overlap of guessed + and actual + instances]
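A small self-contained Python sketch of these two measures, with the slide's homepage example reproduced as toy data (the function name is invented):

```python
def precision_recall(predicted, actual):
    """Precision and recall for a binary task; True marks the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)       # guessed + and actually +
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)   # guessed + but actually -
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)   # actual + that we missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 5 pages guessed as homepages, 2 of them truly homepages, 3 true homepages overall.
predicted = [True] * 5 + [False] * 5
actual = [True, True, False, False, False, True, False, False, False, False]
print(precision_recall(predicted, actual))  # prints (0.4, 0.666...)
```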

Precision vs. Recall
Precision/recall tradeoff:
- Often, you can trade off precision and recall
- Only works well with weakly calibrated classifiers
To summarize the tradeoff:
- Break-even point: precision value when p = r
- F-measure: harmonic mean of p and r (formula below)
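The F-measure formula did not survive extraction; the harmonic mean of precision p and recall r is:

```latex
F_1 = \frac{2\,p\,r}{p + r}
```

For the earlier homepage example (p = 0.4, r = 2/3), F_1 = 0.5.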

Naïve Bayes Summary
- Bayes rule lets us do diagnostic queries with causal probabilities
- The naïve Bayes assumption takes all features to be independent given the class label
- We can build classifiers out of a naïve Bayes model using training data
- Smoothing estimates is important in real systems
- Confidences are useful when the classifier is calibrated

Example Errors
"Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the"
"To receive your $30 Amazon.com promotional certificate, click through to and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future emails announcing new store launches, please click..."

What to Do About Errors
- Problem: there’s still spam in your inbox
- Need more features: words aren’t enough!
  - Have you emailed the sender before?
  - Have 1K other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g. all features are word occurrences)

Features
- A feature is a function that signals a property of the input
- Naïve Bayes: features are random variables, and each value has conditional probabilities given the label
- Most classifiers: features are real-valued functions
- Common special cases:
  - Indicator features take values 0 and 1 (or -1 and 1)
  - Count features return non-negative integers
- Features are anything you can think of for which you can write code to evaluate on an input
  - Many are cheap, but some are expensive to compute
  - Can even be the output of another classifier or model
  - Domain knowledge goes here!

Feature Extractors
- Features: anything you can compute about the input
- A feature extractor maps inputs to feature vectors (see the sketch below)
- Many classifiers take feature vectors as inputs
- Feature vectors are usually very sparse, so use sparse encodings (i.e. only represent non-zero keys)
Example input (spam email): "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
Example feature vector: W=dear: 1, W=sir: 1, W=this: 2, ..., W=wish: 0, ..., MISSPELLED: 2, YOUR_NAME: 1, ALL_CAPS: 0, NUM_URLS: ...
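A minimal Python sketch of such an extractor, assuming plain-text input; it covers word-count, indicator, and URL-count features like those in the example, but omits MISSPELLED and YOUR_NAME, which would need a dictionary and the recipient's name.

```python
import re
from collections import Counter

def extract_features(email_text):
    """Map an email to a sparse feature dictionary (only non-zero keys are stored)."""
    words = email_text.lower().split()
    features = Counter(f"W={w}" for w in words)                       # count feature per word
    features["ALL_CAPS"] = int(email_text.isupper())                  # indicator feature
    features["NUM_URLS"] = len(re.findall(r"https?://\S+", email_text))
    return {k: v for k, v in features.items() if v}                   # sparse encoding: drop zeros

print(extract_features("Dear Sir. First, I must solicit your confidence in this transaction ..."))
```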

Generative vs. Discriminative
Generative classifiers:
- E.g. naïve Bayes
- A causal model with evidence variables
- Query model for causes given evidence
Discriminative classifiers:
- No causal model, no Bayes rule, often no probabilities at all!
- Try to predict the label Y directly from X
- Robust, accurate with varied features
- Loosely: mistake driven rather than model driven

Nearest-Neighbor Classification
Nearest neighbor for digits:
- Take a new image
- Compare to all training images
- Assign based on closest example
Encoding: image is a vector of intensities
What’s the similarity function?
- Dot product of two image vectors?
- Usually normalize vectors so ||x|| = 1
- min = 0 (when?), max = 1 (when?)
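A minimal sketch of this classifier in Python, treating each image as a flat list of pixel intensities and using the normalized dot product as the similarity (the names are invented for the example):

```python
import math

def normalize(v):
    """Scale a vector of pixel intensities to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def nearest_neighbor(query, train_images, train_labels):
    """Label a new image with the label of the most similar training image."""
    q = normalize(query)
    best_label, best_sim = None, -1.0
    for image, label in zip(train_images, train_labels):
        sim = sum(a * b for a, b in zip(q, normalize(image)))   # dot product of unit vectors
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

With non-negative intensities the similarity is 0 when the images share no lit pixels and 1 when they are identical up to scale, which matches the min/max question on the slide.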

Clustering
Clustering systems:
- Unsupervised learning
- Detect patterns in unlabeled data
  - E.g. group emails or search results
  - E.g. find categories of customers
  - E.g. detect anomalous program executions
- Useful when you don’t know what you’re looking for
- Requires data, but no labels
- Often get gibberish

Clustering
- Basic idea: group together similar instances
- Example: 2D point patterns
- What could “similar” mean?
- One option: small (squared) Euclidean distance

K-Means
An iterative clustering algorithm (see the sketch below):
- Pick K random points as cluster centers (means)
- Alternate:
  - Assign data instances to closest mean
  - Assign each mean to the average of its assigned points
- Stop when no points’ assignments change
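A minimal, self-contained Python sketch of that loop (Lloyd's algorithm), assuming each point is a tuple of numbers; it is written for clarity, not efficiency, and is not the course's reference code.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100):
    """Cluster points by alternating assignment and mean-update steps."""
    means = list(random.sample(points, k))               # pick K random points as initial means
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Phase I: assign each point to its closest mean.
        new_assignments = [min(range(k), key=lambda j: squared_distance(p, means[j]))
                           for p in points]
        if new_assignments == assignments:               # stop when no assignments change
            break
        assignments = new_assignments
        # Phase II: move each mean to the average of its assigned points.
        for j in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == j]
            if cluster:                                   # leave empty clusters where they are
                means[j] = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return means, assignments

# Toy usage on 2D points:
means, assignments = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

Each phase can only lower the total squared distance discussed on the following slides, which is why the loop terminates.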

K-Means Example [figure]

Example: K-Means
[Web demo] database/demo/kmcluster/

K-Means as Optimization
- Consider the total distance to the means (written out below)
- Each iteration reduces phi
- Two stages each iteration:
  - Update assignments: fix means c, change assignments a
  - Update means: fix assignments a, change means c
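The objective on this slide was an image; written out, with x_i the points, a_i their assignments, and c_k the means (the three labels left stranded in the transcript):

```latex
\phi\bigl(\{a_i\},\{c_k\}\bigr) \;=\; \sum_i \bigl\| x_i - c_{a_i} \bigr\|^2
```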

Phase I: Update Assignments
- For each point, re-assign it to the closest mean
- Can only decrease the total distance phi!

Phase II: Update Means
- Move each mean to the average of its assigned points (written out below)
- Also can only decrease total distance… (Why?)
- Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
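Written out (the formulas were images in the deck), the update and the fact it relies on are:

```latex
c_k \;=\; \frac{1}{|\{ i : a_i = k \}|} \sum_{i : a_i = k} x_i
\qquad\qquad
\arg\min_{y} \sum_{i=1}^{n} \| y - x_i \|^2 \;=\; \frac{1}{n} \sum_{i=1}^{n} x_i
```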

Initialization
- K-means is non-deterministic
- Requires initial means
- It does matter what you pick!
- What can go wrong?
- Various schemes for preventing this kind of thing: variance-based split/merge, initialization heuristics

K-Means Getting Stuck
A local optimum: why doesn’t this work out like the earlier example, with the purple taking over half the blue?

K-Means Questions
- Will K-means converge? To a global optimum?
- Will it always find the true patterns in the data? If the patterns are very very clear?
- Will it find something interesting?
- Do people ever use it?
- How many clusters to pick?

Agglomerative Clustering
Agglomerative clustering:
- First merge very similar instances
- Incrementally build larger clusters out of smaller clusters
Algorithm (see the sketch below):
- Maintain a set of clusters
- Initially, each instance in its own cluster
- Repeat:
  - Pick the two closest clusters
  - Merge them into a new cluster
  - Stop when there’s only one cluster left
Produces not one clustering, but a family of clusterings represented by a dendrogram
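A minimal Python sketch of this loop using the closest-pair (single-link) definition of "closest" from the next slide; it is cubic-time as written and meant only to illustrate the merge history a dendrogram would display (the names are invented).

```python
def single_link_agglomerative(points):
    """Repeatedly merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]                     # initially, each instance in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None                                      # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))     # record the merge (this is the dendrogram)
        clusters[i] = clusters[i] + clusters[j]          # merge cluster j into cluster i ...
        del clusters[j]                                  # ... and remove the old cluster
    return merges

# Toy usage on 2D points:
history = single_link_agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)])
```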

Agglomerative Clustering
How should we define “closest” for clusters with multiple elements? Many options:
- Closest pair (single-link clustering)
- Farthest pair (complete-link clustering)
- Average of all pairs
- Ward’s method (min variance, like k-means)
Different choices create different clustering behaviors

Clustering Application
- Top-level categories: supervised classification
- Story groupings: unsupervised clustering