1 Text Categorization CSC 575 Intelligent Information Retrieval

2 Classification Task
• Given:
  - A description of an instance, x ∈ X, where X is the instance language or instance/feature space.
    Typically, x is a row in a table, with the instance/feature space described in terms of features or attributes.
  - A fixed set of class or category labels: C = {c1, c2, …, cn}
• The classification task is to determine:
  - The class/category of x: c(x) ∈ C, where c(x) is a function whose domain is X and whose range is C.

3 Learning for Classification
• A training example is an instance x ∈ X paired with its correct class label c(x), written ⟨x, c(x)⟩, for an unknown classification function c.
• Given: a set of training examples, D.
• Find a hypothesized classification function, h(x), such that h(x) = c(x) for all training instances (i.e., for all ⟨x, c(x)⟩ in D). This is called consistency.

4 Example of Classification Learning
• Instance language:
  - size ∈ {small, medium, large}
  - color ∈ {red, blue, green}
  - shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:

  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative

• Hypotheses? circle → positive? red → positive?

5 General Learning Issues (All Predictive Modeling Tasks)
• Many hypotheses can be consistent with the training data.
• Bias: any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy (% of instances classified correctly), measured on independent test data.
• Efficiency issues:
  - Training time (efficiency of the training algorithm)
  - Testing time (efficiency of subsequent classification)
• Generalization:
  - Hypotheses must generalize to correctly classify instances not in the training data.
  - Simply memorizing the training examples is a consistent hypothesis that does not generalize.
  - Occam's razor: finding a simple hypothesis helps ensure generalization.
    The simplest models tend to be the best models (the KISS principle).

6 Text Categorization
• Assigning documents to a fixed set of categories.
• Applications:
  - Web pages: recommending, Yahoo-like classification
  - Newsgroup messages: recommending, spam filtering
  - News articles: personalized newspaper
  - Email messages: routing, folderizing, spam filtering

7 Learning for Text Categorization
• Manual development of text categorization functions is difficult.
• Learning algorithms:
  - Bayesian (naïve)
  - Neural network
  - Relevance feedback (Rocchio)
  - Rule based (Ripper)
  - Nearest neighbor (case based)
  - Support vector machines (SVM)

8 Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector, based on cosine similarity.

9 Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c1, c2, …, cn}
For i from 1 to n:
    let pi = ⟨0, 0, …, 0⟩                  (initialize the prototype vectors)
For each training example ⟨x, c(x)⟩ ∈ D:
    Let d be the normalized TF/IDF term vector for document x
    Let i = j such that cj = c(x)
    Let pi = pi + d                        (sum all the document vectors in ci to get pi)

10 Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = -2                                 (initialize the maximum cosSim)
For i from 1 to n:                         (compute similarity to each prototype vector)
    Let s = cosSim(d, pi)
    if s > m:
        let m = s
        let r = ci                         (update the most similar class prototype)
Return class r
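
The training and test procedures on the last two slides can be condensed into a short Python sketch. This is an illustrative implementation, not code from the slides: it assumes documents are already represented as sparse TF/IDF vectors (dicts mapping term to weight), and the names `rocchio_train`, `rocchio_classify`, and `cosine_sim` are made up for the example.

```python
import math
from collections import defaultdict

def cosine_sim(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_train(training_examples):
    """training_examples: list of (tfidf_vector, class_label) pairs.
    Returns one prototype vector per class: the sum of its documents' vectors."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vector, label in training_examples:
        for term, weight in vector.items():
            prototypes[label][term] += weight
    return prototypes

def rocchio_classify(test_vector, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cosine_sim(test_vector, prototypes[c]))
```

For example, `rocchio_train([({"cheap": 0.8, "offer": 0.9}, "spam"), ({"meeting": 0.7}, "ham")])` builds one prototype per class, and `rocchio_classify({"cheap": 0.5}, prototypes)` then returns "spam".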

11 Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to the class prototypes.

12 Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the training examples in D.
• Testing an instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  - Case-based
  - Memory-based
  - Lazy learning

13 K Nearest-Neighbor
• Using only the closest example to determine the categorization is subject to errors due to:
  - A single atypical example.
  - Noise (i.e., error) in the category label of a single training example.
• A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

14 Similarity Metrics
• The nearest-neighbor method depends on a similarity (or distance) metric.
• The simplest metric for a continuous m-dimensional instance space is Euclidean distance.
• The simplest metric for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of TF-IDF weighted vectors is typically most effective.
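
A minimal Python sketch of the three metrics named above, assuming instances are given as plain lists of feature values; the function names are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Distance for continuous m-dimensional instances (lists of numbers)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming_distance(x, y):
    """Number of positions at which two binary feature vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```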

15 K Nearest Neighbor for Text
Training:
    For each training example ⟨x, c(x)⟩ ∈ D:
        Compute the corresponding TF-IDF vector, dx, for document x

Test instance y:
    Compute the TF-IDF vector d for document y
    For each ⟨x, c(x)⟩ ∈ D:
        Let sx = cosSim(d, dx)
    Sort the examples x in D by decreasing value of sx
    Let N be the first k examples in D      (the most similar neighbors)
    Return the majority class of the examples in N
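
The same procedure as a Python sketch, again assuming documents are sparse TF-IDF dicts; `knn_classify` and its parameters are names chosen for the example, not part of the slides.

```python
import math
from collections import Counter

def cosine_sim(u, v):
    """Cosine similarity between sparse TF-IDF vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(test_vec, training_examples, k=3):
    """training_examples: list of (tfidf_vector, class_label) pairs.
    Returns the majority class among the k most similar training documents."""
    ranked = sorted(training_examples,
                    key=lambda ex: cosine_sim(test_vec, ex[0]),
                    reverse=True)
    top_k_labels = [label for _, label in ranked[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]
```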

16 Nearest Neighbor with Inverted Index
• Determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents.
• Use standard VSR (vector-space retrieval) inverted-index methods to find the k nearest neighbors.
• Testing time: O(B·|Vt|), where Vt is the set of terms in the test document and B is the average number of training documents in which a test-document word appears.
• Therefore, overall classification time is O(Lt + B·|Vt|), where Lt is the length of the test document.
• Typically B << |D|.
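
A sketch of the idea, under the assumption that training documents are stored as sparse TF-IDF dicts: only documents that share at least one term with the test document are ever touched, which is where the O(B·|Vt|) cost comes from. The scores below are accumulated dot products; if all vectors are length-normalized they equal cosine similarities. Function names are illustrative.

```python
import heapq
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """doc_vectors: dict of doc_id -> sparse TF-IDF vector (term -> weight).
    Returns an inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def k_nearest_neighbors(test_vec, index, k=3):
    """Accumulate dot-product scores over the postings lists of the test
    document's terms and return the k highest-scoring (doc_id, score) pairs."""
    scores = defaultdict(float)
    for term, q_weight in test_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

The returned doc_ids would then be mapped to their class labels and majority-voted, exactly as in the kNN sketch above.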

17 Bayesian Methods
• Learning and classification methods based on probability theory.
• Bayes' theorem plays a critical role in probabilistic learning and classification.
• Uses the prior probability of each category given no information about an item.
• Categorization produces a posterior probability distribution over the possible categories, given a description of an item.

18 Axioms of Probability Theory
• All probabilities lie between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0:
    P(true) = 1        P(false) = 0
• The probability of a disjunction is:
    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

19 Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by:
    P(A | B) = P(A ∧ B) / P(B)

20 Independence
• A and B are independent iff:
    P(A | B) = P(A)        P(B | A) = P(B)
  (these two constraints are logically equivalent)
• Therefore, if A and B are independent:
    P(A ∧ B) = P(A) · P(B)
• Bayes' rule:
    P(A | B) = P(B | A) · P(A) / P(B)

21 Bayesian Categorization
• Let the set of categories be {c1, c2, …, cn}.
• Let E be a description of an instance.
• Determine the category of E by computing, for each ci:
    P(ci | E) = P(ci) · P(E | ci) / P(E)
• P(E) can be determined since the categories are complete and disjoint:
    P(E) = Σi P(ci) · P(E | ci)

22 Bayesian Categorization (cont.)
• Need to know:
  - Priors: P(ci)
  - Conditionals: P(E | ci)
• The P(ci) are easily estimated from data:
  - If ni of the examples in D are in ci, then P(ci) = ni / |D|.
• Assume an instance is a conjunction of binary features:
    E = e1 ∧ e2 ∧ … ∧ em
• There are too many possible instances (exponential in m) to estimate all P(E | ci) directly.

23 Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category ci (conditionally independent), then:
    P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Πj P(ej | ci)
• Therefore, we only need to know P(ej | ci) for each feature and category.

24 Naïve Bayes Example
• C = {allergy, cold, well}
• e1 = sneeze; e2 = cough; e3 = fever
• E = {sneeze, cough, ¬fever}

  Prob             Well   Cold   Allergy
  P(ci)            0.9    0.05   0.05
  P(sneeze | ci)   0.1    0.9    0.9
  P(cough | ci)    0.1    0.8    0.7
  P(fever | ci)    0.01   0.7    0.4

25 Naïve Bayes Example (cont.)
E = {sneeze, cough, ¬fever}

  Probability      Well   Cold   Allergy
  P(ci)            0.9    0.05   0.05
  P(sneeze | ci)   0.1    0.9    0.9
  P(cough | ci)    0.1    0.8    0.7
  P(fever | ci)    0.01   0.7    0.4

P(well | E)    = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
P(cold | E)    = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)

P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | E) ≈ 0.24    P(cold | E) ≈ 0.26    P(allergy | E) ≈ 0.50
Most probable category: allergy
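
The posterior computation on this slide can be checked with a few lines of Python. This uses the table values directly (the slide rounds the intermediate products, so its final figures differ slightly); the variable names are chosen for the example.

```python
# Values from the table above; E = {sneeze, cough, not fever}.
priors   = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1, "cold": 0.9, "allergy": 0.9}
p_cough  = {"well": 0.1, "cold": 0.8, "allergy": 0.7}
p_fever  = {"well": 0.01, "cold": 0.7, "allergy": 0.4}

unnormalized = {c: priors[c] * p_sneeze[c] * p_cough[c] * (1 - p_fever[c])
                for c in priors}
p_e = sum(unnormalized.values())                  # P(E) ~= 0.039
posteriors = {c: v / p_e for c, v in unnormalized.items()}
print(posteriors)   # well ~= 0.23, cold ~= 0.28, allergy ~= 0.49 -> allergy wins
```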

26 Estimating Probabilities
• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains ni examples in class ci, and nij of these ni examples contain feature/attribute ej, then:
    P(ej | ci) = nij / ni
• If the feature is continuous-valued, P(ej | ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
    g(x, μ, σ) = (1 / (σ·√(2π))) · e^(−(x − μ)² / (2σ²))
  with P(ej | ci) = g(ej, μci, σci).
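
A minimal sketch of the Gaussian density for the continuous case, assuming the class's mean and standard deviation have already been estimated from the training data; the function name is illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma,
    used as the estimate of P(ej | ci) for a continuous feature value x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
```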

27 Smoothing
• Estimating probabilities from small training sets is error-prone:
  - If, due only to chance, a rare feature ek is always false in the training data, then ∀ci: P(ek | ci) = 0.
  - If ek then occurs in a test example E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.
• To account for estimation from small samples, probability estimates are adjusted, or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m:
    P(ej | ci) = (nij + m·p) / (ni + m)
• For binary features, p is simply assumed to be 0.5.
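
A one-function sketch of the m-estimate described above (names and default values are chosen for the example).

```python
def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    """Smoothed estimate of P(ej | ci) = (n_ij + m*p) / (n_i + m), where p is the
    prior probability assumed for the feature and m is the virtual sample size.
    With p = 0.5 (binary feature), a feature never seen in a class (n_ij = 0)
    gets a small nonzero probability instead of 0."""
    return (n_ij + m * p) / (n_i + m)

# e.g. m_estimate(0, 10) == 0.5 / 11 ~= 0.045 rather than 0
```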

28 Naïve Bayes Classification for Text
• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm}, based on the probabilities P(wj | ci).
• Smooth the probability estimates with Laplace m-estimates, assuming a uniform prior distribution over all words (p = 1/|V| and m = |V|).
• This is equivalent to a virtual sample in which each word is seen in each category exactly once.

29 Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C:
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V:
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)

30 Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
    argmax (ci ∈ C)  P(ci) · Π(i = 1 … n) P(ai | ci)
where ai is the word occurring in the ith position in X
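
A compact Python sketch of the training and test procedures on these two slides, working in log space to avoid underflow. The document representation (lists of tokens) and the function names are assumptions made for the example, not part of the slides; words outside the training vocabulary are simply skipped at test time.

```python
import math
from collections import Counter, defaultdict

def train_text_naive_bayes(docs):
    """docs: list of (token_list, category) pairs.
    Returns log P(ci) and Laplace-smoothed log P(wj | ci), plus the vocabulary."""
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_counts = Counter(cat for _, cat in docs)
    word_counts = defaultdict(Counter)            # category -> word -> count
    for tokens, cat in docs:
        word_counts[cat].update(tokens)

    log_priors = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_cond = {}
    for c in doc_counts:
        n_i = sum(word_counts[c].values())        # total word occurrences in Ti
        log_cond[c] = {w: math.log((word_counts[c][w] + 1) / (n_i + len(vocab)))
                       for w in vocab}
    return log_priors, log_cond, vocab

def classify(tokens, log_priors, log_cond, vocab):
    """Return argmax over categories of log P(ci) + sum of log P(ai | ci)."""
    def score(c):
        return log_priors[c] + sum(log_cond[c][w] for w in tokens if w in vocab)
    return max(log_priors, key=score)
```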

31 Text Naïve Bayes - Example
Training data: 10 emails over the binary terms t1-t5, 6 labeled spam = "yes" and 4 labeled spam = "no" (table omitted; the term counts it yields appear in the computation on the next slide).

P(yes) = 0.6    P(no) = 0.4

New email x containing t1, t4, t5:  x = ⟨1, 0, 0, 1, 1⟩
Should it be classified as spam = "yes" or spam = "no"?
We need to find P(yes | x) and P(no | x) …

32 Text Naïve Bayes - Example (cont.)
New email x containing t1, t4, t5:  x = ⟨1, 0, 0, 1, 1⟩
P(yes) = 0.6    P(no) = 0.4

P(yes | x) = [4/6 · (1 − 4/6) · (1 − 3/6) · 3/6 · 2/6] · P(yes) / P(x)
           = [0.67 · 0.33 · 0.5 · 0.5 · 0.33] · 0.6 / P(x) = 0.011 / P(x)
P(no | x)  = [1/4 · (1 − 2/4) · (1 − 2/4) · 3/4 · 2/4] · P(no) / P(x)
           = [0.25 · 0.5 · 0.5 · 0.75 · 0.5] · 0.4 / P(x) = 0.0094 / P(x)

To get actual probabilities we normalize; note that P(yes | x) + P(no | x) must equal 1:
    0.011 / P(x) + 0.0094 / P(x) = 1  ⇒  P(x) = 0.011 + 0.0094 = 0.0204
So:
    P(yes | x) = 0.011 / 0.0204 ≈ 0.54
    P(no | x)  = 0.0094 / 0.0204 ≈ 0.46
The email is therefore classified as spam = "yes".
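
The arithmetic above can be reproduced in a few lines of Python. The per-class term probabilities below are read off the bracketed factors in the computation (an assumption about the omitted training table), and the variable names are chosen for the example.

```python
# P(t1|c) ... P(t5|c) for each class, as implied by the bracketed factors above.
p_yes = [4/6, 4/6, 3/6, 3/6, 2/6]
p_no  = [1/4, 2/4, 2/4, 3/4, 2/4]
x = [1, 0, 0, 1, 1]                         # t1, t4, t5 present; t2, t3 absent

def joint(x, cond_probs, prior):
    """P(c) times the product over terms of P(tj|c) or (1 - P(tj|c))."""
    result = prior
    for present, p in zip(x, cond_probs):
        result *= p if present else (1.0 - p)
    return result

num_yes = joint(x, p_yes, 0.6)              # ~0.011
num_no  = joint(x, p_no, 0.4)               # ~0.0094
p_x = num_yes + num_no
print(round(num_yes / p_x, 2), round(num_no / p_x, 2))   # 0.54 0.46 -> spam = "yes"
```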

