1 Naive Bayes model Comp221 tutorial 4 (assignment 1) TA: Zhang Kai

2 Outline
– Bayes probability model
– Naive Bayes classifier
– Text classification
– Digit classification
– Assignment specifications

3 Naive Bayes classifier
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions; more specifically, it assumes an independent feature model.

4 Graphical illustration
– a class node C at the root; we want P(C|F1,…,Fn)
– evidence nodes Fi, the observed features, as leaves
– conditional independence between all evidence nodes
[Diagram: class node C at the root with feature leaves F1, F2, …, Fn]

5 Naive Bayes probability model
The classifier is a conditional model P(C | F1, …, Fn).
Following Bayes' rule strictly, we have
P(C | F1, …, Fn) = P(C) P(F1, …, Fn | C) / P(F1, …, Fn).
Simplify this through conditional independence, P(Fi | C, Fj) = P(Fi | C) for j ≠ i, so that
P(F1, …, Fn | C) = P(F1 | C) P(F2 | C) … P(Fn | C).
So the conditional distribution over the class C is
P(C | F1, …, Fn) = (1/Z) P(C) P(F1 | C) P(F2 | C) … P(Fn | C),
where Z = P(F1, …, Fn) is constant given the features.

6 Naive Bayes classifier
The naive Bayes classifier combines the naive Bayes probability model with a decision rule, such as the maximum a posteriori (MAP) rule: pick the class c that maximizes P(C = c) P(F1 | C = c) … P(Fn | C = c). If there are k classes and if a model for P(Fi | C) can be expressed by r parameters, then the naive Bayes model has (k − 1) + n r k parameters: k − 1 for the class prior, and r parameters for each of the n features under each of the k classes.
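A minimal Matlab sketch of the MAP rule (the function name and calling convention are illustrative assumptions, not part of the tutorial):

    % prior: k-by-1 vector, prior(c) = P(C = c)
    % condp: k-by-n matrix, condp(c, i) = P(Fi = fi | C = c) for the observed fi
    function c = map_classify(prior, condp)
        % The posterior is proportional to the prior times the product of the
        % class-conditional likelihoods; the normalizer Z drops out of the argmax.
        posterior = prior .* prod(condp, 2);
        [~, c] = max(posterior);
    end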

7 Text Classification
Task: classify text documents into one of the pre-defined classes, such as sports, recreation, politics, war, economy, etc.
Given:
– K groups of training texts
– each group with a label, containing a number of text documents

8 Procedures
Computing a priori class probabilities (see the sketch below):
– count the number of text documents in each directory/class, ni
– count the total number of training text documents, n
– prior probability P(Ci) = ni / n
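A minimal sketch of this step in Matlab (the counts below are made-up placeholders):

    n_per_class = [120; 95; 140];   % e.g., documents counted in each of 3 classes
    n = sum(n_per_class);           % total number of training documents
    prior = n_per_class / n;        % prior(i) = P(Ci) = ni / n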

9 Computing class conditional word likelihoods (a sketch follows)
– suppose we have chosen m key words, denoted w1, w2, …, wm
– count the number of times, cji, that word wj occurs in text class Ci
– count the number of words, ni, in class Ci
– the class conditional probability is P(wj | Ci) = cji / ni
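In Matlab this reduces to a few array operations (the variable names and the random counts are assumptions for illustration):

    m = 5; k = 3;                   % e.g., 5 key words, 3 classes
    counts = randi([0 20], m, k);   % stand-in for cji: occurrences of wj in Ci
    nwords = sum(counts, 1);        % ni: total key-word count in class Ci
    condp = counts ./ nwords;       % condp(j, i) = P(wj | Ci) = cji / ni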

10 Classifying a new message d
– compute the features of d, i.e., the number of times each word wj occurs in d
– P(Ci | d) = P(Ci | w1, w2, …, wm) ∝ P(Ci) P(w1 | Ci) P(w2 | Ci) … P(wm | Ci)
– assign d to the class Ci that has the maximum posterior probability (see the sketch below)
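A sketch of the whole decision, reusing prior and condp from the sketches above (doc_counts is an assumed name for the feature vector of d; the log form anticipates slide 13):

    doc_counts = [2; 0; 1; 0; 3];   % times w1..wm occur in d (made-up values)
    % log P(Ci | d) up to an additive constant; zero entries in condp give
    % -Inf here, which is exactly what slide 13's smoothing fixes:
    logpost = log(prior') + doc_counts' * log(condp);
    [~, i] = max(logpost);          % assign d to the class with max posterior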

11 Points to note
Preprocessing (a sketch follows):
– eliminating punctuation
– eliminating numerals
– converting all characters to lowercase
– eliminating all words with fewer than 4 letters
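A minimal preprocessing sketch along those lines (the function name is an assumption):

    function words = preprocess(text)
        text = lower(text);                           % lowercase everything
        text = regexprep(text, '[^a-z\s]', ' ');      % drop punctuation and numerals
        words = strsplit(strtrim(text));              % split on whitespace
        words = words(cellfun(@length, words) >= 4);  % drop words under 4 letters
    end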

12 You need to build a large vocabulary and separately count how often each word is encountered. The vocabulary can be built using a hash table (see the sketch below).
How to choose the key words wi?
– for each class, pick out the k words that occur most frequently
– or, over all the training data, pick out the k words that appear most frequently
– take the union of all these words as the key words/features
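One way to build such a hash table in Matlab is containers.Map; this sketch, including picking the top-k words, is an illustration, not required code:

    vocab = containers.Map('KeyType', 'char', 'ValueType', 'double');
    tokens = {'ball', 'game', 'ball', 'team', 'game', 'ball'};  % from preprocess()
    for t = 1:numel(tokens)
        w = tokens{t};
        if isKey(vocab, w)
            vocab(w) = vocab(w) + 1;   % word seen before: bump its count
        else
            vocab(w) = 1;              % first encounter: insert with count 1
        end
    end
    [~, order] = sort(cell2mat(vocab.values), 'descend');
    allwords = vocab.keys;             % keys and values come back in matching order
    k = 2;
    keywords = allwords(order(1:k));   % the k most frequent words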

13 Zero probabilities must be avoided (why?)
– a zero occurs when a word has been encountered in one class but not in the others
– in that case the class conditional probability is zero, and a single unseen word wipes out the entire product of likelihoods
– to prevent this, re-estimate the conditional probability as P(wj | Ci) = ε / ni, with ε a small, tunable number
Convert all probabilities to log-probabilities (log-likelihoods) to avoid exceeding the dynamic range of the computer representation of real numbers. A sketch of both fixes follows.
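Continuing from the counts/condp arrays above (epsval is an assumed name and value):

    epsval = 0.1;                            % small, tunable
    zeromask = (condp == 0);                 % words unseen in some class
    epsmat = repmat(epsval ./ nwords, m, 1); % eps / ni, one column per class
    condp(zeromask) = epsmat(zeromask);      % re-estimate only the zero entries
    logp = log(condp);                       % from now on, add log-probabilities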

14 Digit Classification (assignment 1)
The USPS data set contains normalized handwritten digits, scanned by the U.S. Postal Service.
– 16 x 16 grayscale images
– 7291 training and 2007 test observations
– format: each line consists of the digit id (0-9) followed by the 256 grayscale values
– the test set is notoriously "difficult"
Download it from here

15 USPS digits

16 Setting
Classes: 0-9
Features: each pixel is used as a feature, so there are 16 by 16, i.e., 256 features.
– rather than raw pixel gray values, we can use more informative features, such as (detected) corners, crosses, slopes, the gravity center, etc.
– think about how to quantize the real-valued features (one possibility is sketched below)
Task: classify new digits into one of the classes
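One simple quantization is to binarize each pixel by thresholding; this sketch assumes a threshold of 0 (USPS grayscale values lie roughly in [-1, 1]) and that digit and label are as returned by read_usps on slide 18:

    X = reshape(digit > 0, 256, [])';        % n-by-256 binary feature matrix
    pOn = zeros(10, 256);                    % pOn(c+1, j) = P(pixel j on | class c)
    for c = 0:9
        Xc = X(label == c, :);
        % add-one pseudo-counts keep the estimates away from 0 and 1:
        pOn(c + 1, :) = (sum(Xc, 1) + 1) / (size(Xc, 1) + 2);
    end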

17 Specifications (preliminary; assignment 1 will come out on Friday)
You can use either Matlab or C++ for programming.
– if you use C++, you should create the class and its members/functions as required
– if you use Matlab, you should write the functions as required
– input and output formats will also be fixed in the assignment

18 Files
Matlab file to read the USPS data:
– >[n, digit, label] = read_usps(path, file);
– path = 'c:\...'; file = 'usps_train.txt';
– n: number of digits/images obtained
– digit: a 16 by 16 by n matrix
– label: the label of each image
You may want to use it to read the USPS data.

19 Matlab file to output a series of files
– >output(str, i1, i2);
– str: the common string part
– i1 and i2 are the starting and ending integers
You may want to use it to write the digits into separate files with whatever naming scheme you like.

