1 Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York, Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York

2 Suppose we have a single search word "egg." All documents in our corpus (e.g. web pages) are organized into the following categories: dishwasher, poultry and pregnancy. What is the likelihood that the keyword is intended to "key into" each of the 3 categories of documents? That is, which category would be the best prediction for a search engine to make? This can easily be based on word frequencies in a bag-of-words approach. Say, for example:

    In documents classified as...    the word "egg" appears
    dishwasher related               379 times
    poultry related                  1,617 times
    pregnancy related                824 times
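To make the bag-of-words counting concrete, here is a minimal sketch in Python. The three-category toy corpus is a hypothetical placeholder; only the idea of counting how often "egg" appears in each category's documents comes from the slide.

```python
# Hypothetical toy corpus: category -> list of documents (a real corpus
# would contain many web pages per category).
corpus = {
    "dishwasher": ["remove egg residue from the dishwasher filter"],
    "poultry":    ["the hen laid an egg", "candle the egg to check the embryo"],
    "pregnancy":  ["the egg is released during ovulation"],
}

# Bag-of-words count of the keyword "egg" per category.
egg_counts = {
    category: sum(doc.lower().split().count("egg") for doc in docs)
    for category, docs in corpus.items()
}
print(egg_counts)  # the slide's real corpus gives 379 / 1,617 / 824
```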

3 Clearly, without any other words, poultry would be the best prediction. Let's formalize this a bit. We want

    argmax p( c | "egg" )
    c

that is, the category c that maximizes the probability of c given "egg." Take an example: what's p(poultry | "egg")? Take all occurrences of "egg" from all documents in our collection (in our example that would be 2,820 = 379 + 1,617 + 824) and partition them into their categories.

[Diagram: the 2,820 occurrences of "egg" partitioned into 379 occurrences in dishwasher-category documents, 1,617 in poultry-category documents, and 824 in pregnancy-category documents.]
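Using the counts from the slide, the posterior p(c | "egg") for each category, and the winning category, can be computed directly; a small sketch:

```python
counts = {"dishwasher": 379, "poultry": 1617, "pregnancy": 824}
total = sum(counts.values())                      # 2,820 occurrences of "egg"

# p(c | "egg") = count of "egg" in category c / total count of "egg"
posterior = {c: n / total for c, n in counts.items()}
print(posterior["poultry"])                       # 1617 / 2820 ≈ 0.573
print(max(posterior, key=posterior.get))          # 'poultry'
```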

4 p(dishwasher | "egg") = 379/2,820
p(poultry | "egg") = 1,617/2,820
p(pregnancy | "egg") = 824/2,820

In fact, since the denominator is the same, when calculating the argmax we can just drop it and simply find the maximum number of occurrences of "egg" in each category. That is, we want:

    argmax count( "egg" in category c )
    c

Unfortunately, calculating this directly is quite expensive: we have to go through EVERY document in every category. So instead we apply Bayes' rule:

                 p(A | B) p(B)
    p(B | A) = -----------------
                     p(A)

or, in our example:

                           p(word | category) p(category)
    p(category | word) = ---------------------------------
                                     p(word)

But since we are finding the maximum probability, we can drop the denominator:

    argmax p( c | word )   =   argmax p( word | c ) p( c )
    c ∈ categories             c ∈ categories
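The point about dropping the denominator can be checked numerically: whether we rank categories by raw counts or by the normalized posteriors, the argmax is the same. A quick sketch using the counts from the previous slides:

```python
counts = {"dishwasher": 379, "poultry": 1617, "pregnancy": 824}
total = sum(counts.values())

best_by_count     = max(counts, key=lambda c: counts[c])          # unnormalized
best_by_posterior = max(counts, key=lambda c: counts[c] / total)  # normalized

# Dividing every score by the same p(word) never changes which category wins.
assert best_by_count == best_by_posterior == "poultry"
```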

5 It is easy to extend this from a single word to multiple words, and we get the basic version of the NAIVE BAYES algorithm:

    (1)   argmax   p( words | c ) p( c )
          c ∈ categories

p( c ) is simply the probability of category c being chosen independent of any words, estimated for example by the formula:

    total number of words in all documents categorized as c
    --------------------------------------------------------
        total number of words in the entire corpus

(BTW, why is (1) easier to compute than

    argmax   p( c | words ) ?
    c ∈ categories

Because in order to compute the second equation we would need to compute 2^n entries, where n = number of words, to obtain a joint probability distribution. See next slide. The more computery students: see me after class or email me if interested in further discussion.)
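A sketch of estimating the prior p(c) by the word-count formula above. The tiny labeled corpus is hypothetical; only the formula (words in category c divided by words in the whole corpus) comes from the slide.

```python
# Hypothetical labeled corpus: (category, document text) pairs.
labeled_docs = [
    ("poultry",    "the hen laid a brown egg in the nest"),
    ("poultry",    "egg production per hen"),
    ("dishwasher", "egg stains left on the plates"),
    ("pregnancy",  "the egg is fertilized and implants"),
]

# Total number of words per category.
words_per_category = {}
for category, text in labeled_docs:
    words_per_category[category] = words_per_category.get(category, 0) + len(text.split())

total_words = sum(words_per_category.values())
prior = {c: n / total_words for c, n in words_per_category.items()}
print(prior)  # poultry gets the largest prior here because it has the most words
```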

6 [Diagram: the occurrences of "egg" partitioned into the Dishwashers (379), Poultry (1,617) and Pregnancy (824) categories.]

    p(Poultry | "egg") = 1,617 / (379 + 1,617 + 824)

7 In any event, the best predictor of a category, given a bunch of words, is given by (1) above. A final equation: if words contains the words w1, w2, ..., wn, then (1) can be rewritten, assuming the words are independently likely to occur (pretty naive, but it works fairly well in practice), as:

    (2)   argmax   [ p(w1 | c) * p(w2 | c) * ... * p(wn | c) ] * p( c )
          c ∈ categories

And this is the way the implementation works:

A) A corpus of documents is fed to the learner; the words are counted up and stored so that the probabilities in (2) can be effectively calculated.
B) An unseen document is given to the learner; (2) is calculated where w1, w2, ..., wn are the words in the document, and the category that maximizes (2) is returned by the learner.
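A minimal sketch of steps A and B in Python. The training data, the whitespace tokenization, and the add-one smoothing are assumptions added to keep the example runnable; the slide only gives the general recipe of counting words per category and then maximizing (2).

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Step A: count words per category so the probabilities in (2) can be computed."""
    word_counts = defaultdict(Counter)   # word_counts[c][w] = count of word w in category c
    words_in_category = Counter()        # total number of words seen in category c
    vocabulary = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        words_in_category[category] += len(words)
        vocabulary.update(words)
    total_words = sum(words_in_category.values())
    prior = {c: n / total_words for c, n in words_in_category.items()}  # p(c)
    return word_counts, words_in_category, vocabulary, prior

def classify(text, word_counts, words_in_category, vocabulary, prior):
    """Step B: return the category maximizing p(w1|c) * ... * p(wn|c) * p(c)."""
    words = text.lower().split()
    scores = {}
    for c in prior:
        score = prior[c]
        for w in words:
            # p(w|c) with add-one smoothing (an assumption) so unseen words
            # do not zero out the whole product.
            score *= (word_counts[c][w] + 1) / (words_in_category[c] + len(vocabulary))
        scores[c] = score
    return max(scores, key=scores.get)

model = train([
    ("poultry",    "the hen laid an egg in the nest"),
    ("dishwasher", "egg stains left on the dishwasher racks"),
    ("pregnancy",  "the egg is fertilized during ovulation"),
])
print(classify("hen and egg", *model))   # -> 'poultry'
```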

8 The vector space model.

             speech   language   processing
    Doc 1       6        0           1
    Doc 2       0        5           1
    Doc 3       1        2           1

Shorter notation: Doc 1 = <6, 0, 1>, Doc 2 = <0, 5, 1>, Doc 3 = <1, 2, 1>.

[Diagram: the three documents plotted as vectors along the speech, language and processing axes.]

But we need to normalize. For example, <12, 0, 2> (Doc 1 with every count doubled) should be considered very similar to Doc 1 = <6, 0, 1>. Easy to do.
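One common way to do the normalization is cosine similarity: compare documents by the angle between their vectors, so document length drops out. The slide only says the vectors need to be normalized, so cosine as the specific method is an assumption; a sketch using the vectors above:

```python
import math

doc1 = [6, 0, 1]   # (speech, language, processing) counts for Doc 1
doc2 = [0, 5, 1]
doc3 = [1, 2, 1]

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(doc1, [12, 0, 2]))  # 1.0 -- same proportions, twice the length
print(cosine(doc1, doc2))        # ≈ 0.03 -- very different word proportions
```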

