Presentation on theme: "Text Processing and Naïve Bayes" - Presentation transcript:

1 Text Processing and Naïve Bayes
ICCM

2 Using Naïve Bayes
A classification algorithm.
Naïve Bayes is popular due to its simplicity of implementation and overall effectiveness.
Based on (of course) Bayes' theorem.
"Naïve" because it assumes there is no dependency between words.
Well suited for:
Sifting out spam from email
Predictive analysis
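In symbols (the standard Naïve Bayes formulation, not spelled out on the slide), the independence assumption lets the classifier score each class c for a document with words w1 … wn as:

P(c | w1, …, wn) ∝ P(c) × P(w1|c) × P(w2|c) × … × P(wn|c)

The class with the higher score is the prediction, which is exactly the procedure worked through on the next slides.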

3 Naïve Bayes Example

4 Naïve Bayes Learning Phase
Outlook      Play=Yes  Play=No
Sunny          2/9       3/5
Overcast       4/9       0/5
Rain           3/9       2/5

Temperature  Play=Yes  Play=No
Hot            2/9       2/5
Mild           4/9       2/5
Cool           3/9       1/5

Humidity     Play=Yes  Play=No
High           3/9       4/5
Normal         6/9       1/5

Wind         Play=Yes  Play=No
Strong         3/9       3/5
Weak           6/9       2/5

P(Play=Yes) = 9/14
P(Play=No) = 5/14
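As a rough illustration, the learning-phase tables above can be computed with a few lines of Python. The slide shows only the resulting frequency tables, so the 14-row dataset below is a reconstruction (the standard "play tennis" example, which reproduces every count in the tables):

```python
from collections import Counter, defaultdict

# Reconstructed 14-row "play tennis" dataset behind the tables on this slide.
# Each row: (Outlook, Temperature, Humidity, Wind, Play)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

# Learning phase: class priors and per-feature conditional frequency tables.
label_counts = Counter(row[-1] for row in data)
priors = {label: count / len(data) for label, count in label_counts.items()}

cond = defaultdict(Counter)  # (feature, label) -> Counter of feature values
for row in data:
    label = row[-1]
    for feature, value in zip(features, row[:-1]):
        cond[(feature, label)][value] += 1

def p(feature, value, label):
    """P(feature=value | Play=label), e.g. P(Outlook=Sunny | Play=Yes) = 2/9."""
    return cond[(feature, label)][value] / label_counts[label]

print(priors)                        # {'No': 0.357..., 'Yes': 0.642...}
print(p("Outlook", "Sunny", "Yes"))  # 0.222... = 2/9
```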

5 Naïve Bayes Test Phase
Given a new instance, predict its label:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Look up the tables built in the learning phase.
Calculation:
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=Yes) = 3/9
P(Play=Yes) = 9/14
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=No) = 3/5
P(Play=No) = 5/14
P(Yes|x') ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) = (2/9)(3/9)(3/9)(3/9)(9/14) ≈ 0.0053
P(No|x') ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No) = (3/5)(1/5)(4/5)(3/5)(5/14) ≈ 0.0206
Given the fact that P(Yes|x') < P(No|x'), we label x' as "No".
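A minimal sketch of the same test-phase arithmetic, with the probabilities looked up from the learning-phase tables hard-coded:

```python
from fractions import Fraction as F

# Score each class for x' = (Outlook=Sunny, Temperature=Cool,
# Humidity=High, Wind=Strong) using the table values above.
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))  # ~0.0053
print(float(score_no))   # ~0.0206
print("Yes" if score_yes > score_no else "No")  # "No"
```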

6 Our working example
Comparison of emails from colleagues vs. commercial/sales emails.

7 General Steps
Have samples of emails from colleagues, and another set of emails that are known to be about other things (spam).
Data cleansing:
Clean out small words and articles.
Consistently use either upper or lower case.
Break out extraneous punctuation (judgment call).
Within each category, count how many times each word is used among all the emails.
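A minimal Python sketch of the cleansing and counting steps. The stopword list, the "small word" cutoff, and the sample email bodies are all illustrative assumptions, not part of the original presentation:

```python
import re
from collections import Counter

# Illustrative stopword list; articles and other small filler words.
STOPWORDS = {"the", "and", "for", "you", "with", "this", "that"}

def tokenize(text):
    """Lowercase, strip punctuation, drop small words and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]

def word_counts(emails):
    """Count how many times each word is used across all emails in a category."""
    counts = Counter()
    for body in emails:
        counts.update(tokenize(body))
    return counts

# Made-up sample emails for each category.
colleague_counts = word_counts([
    "Meeting moved to noon tomorrow",
    "Can you review the draft report?",
])
spam_counts = word_counts([
    "WINNER!! Claim your free prize now",
    "Limited time offer: buy now and save",
])
print(colleague_counts.most_common(5))
print(spam_counts.most_common(5))
```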

8 Testing the model
Reuse known emails.
For each email, parse out each token just as before.
Within each individual email, count the tokens.
KNIME's Naive Bayes (NB) module will:
For each token found, look up the probability value recorded for that token during the learning phase. Do this separately for each category.
Combine the per-token values into one value for each category for the email.
Compare the two values: the category with the higher value is the most likely source of the email.
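The comparison described above can be sketched by hand as follows. The word counts, priors, and example tokens are made up, and this is not the actual KNIME implementation, just the same per-category scoring done directly:

```python
from collections import Counter

# Made-up per-category word counts from a tiny training set.
# Tokens absent from a category zero out the product here; slide 9's
# additive smoothing and log trick address that.
colleague_counts = Counter({"meeting": 5, "report": 4, "review": 3, "lunch": 2})
spam_counts = Counter({"free": 6, "offer": 5, "prize": 3, "meeting": 1})
prior = {"colleague": 0.5, "spam": 0.5}  # assume equal class priors

def score(tokens, counts, prior_p):
    total = sum(counts.values())
    p = prior_p
    for token in tokens:
        p *= counts[token] / total  # P(token | category), no smoothing yet
    return p

tokens = ["free", "meeting", "offer"]
scores = {
    "colleague": score(tokens, colleague_counts, prior["colleague"]),
    "spam": score(tokens, spam_counts, prior["spam"]),
}
print(max(scores, key=scores.get))  # category with the higher score wins
```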

9 Data issues
Additive smoothing: when you eventually evaluate a new message, it may contain a token that was not in the training set, which would give that token a probability of zero. To avoid this, consistently add a small increment to every count.
Dealing with floating-point underflow: because the product of many small probabilities can become an extremely small decimal value, an option is to take the natural logarithm of each probability and sum the logarithms instead.
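A minimal sketch of both fixes, with made-up counts and an assumed vocabulary size:

```python
import math
from collections import Counter

# Made-up word counts for one category.
counts = Counter({"meeting": 5, "report": 4, "review": 3})
vocabulary_size = 1000   # assumed size of the combined training vocabulary
alpha = 1                # add-one (Laplace) smoothing increment

def log_prob(token):
    """log P(token | category) with additive smoothing; never log(0)."""
    return math.log((counts[token] + alpha) /
                    (sum(counts.values()) + alpha * vocabulary_size))

def log_score(tokens, log_prior):
    # Summing logs is monotonic, so comparing log scores picks the same
    # category as comparing the raw probability products would.
    return log_prior + sum(log_prob(t) for t in tokens)

# "jackpot" never appeared in training, yet the score stays finite.
print(log_score(["meeting", "jackpot", "review"], math.log(0.5)))
```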


