General characterisation
Training set: documents labelled with one or more classes, encoded using some representation model.
Typical data representation model: every document is represented as a vector of real-valued measurements plus a class.
The vector may represent word counts.
The class is given, since this is supervised learning.
Data sources: Reuters collection
http://www.daviddlewis.com/resources/testcollections/reuters21578/
Large collection of Reuters newswire texts, categorised by topic.
Topics include: earn(ings), grain, wheat, acq(uisitions), …
Reuters dataset

Text 1
17-MAR-1987 11:07:22.82
earn
AMRE INC <AMRE> 3RD QTR JAN 31 NET
DALLAS, MArch 17 -
Shr five cts vs one ct
Net 196,986 vs 37,966
Revs 15.5 mln vs 8,900,000
Nine mths
Shr 52 cts vs 22 cts
Net two mln vs 874,000
Revs 53.7 mln vs 28.6 mln
Reuter

Text 2
17-MAR-1987 11:26:47.36
acq
DEVELOPMENT CORP OF AMERICA <DCA> MERGED
HOLLYWOOD, Fla., March 17 - Development Corp of America said its merger with Lennar Corp <LEN> was completed and its stock no longer existed. Development Corp of America, whose board approved the acquisition last November for 90 mln dlrs, said the merger was effective today and its stock now represents the right to receive 15 dlrs a share. The American Stock Exchange said it would provide further details later.
Reuter
Representing documents: vector representations
Suppose we select k = 20 keywords that are diagnostic of the earnings category. This can be done using chi-square, topic signatures, etc.
Each document j is represented as a vector containing a term weight for each of the k terms:
$s_{ij} = 10 \times \frac{1 + \log(tf_{ij})}{1 + \log(l_j)}$
where $tf_{ij}$ is the number of times term i occurs in document j and $l_j$ is the length of document j.
Why use a log weighting scheme?
A formula like 1 + log(tf) dampens the actual frequency.
Example: let d be a document of 89 words (logarithms taken to base 10).
profit occurs 6 times: tf(profit) = 6; 10 × (1 + log(tf(profit))) / (1 + log(89)) ≈ 6
cts (“cents”) occurs 3 times: tf(cts) = 3; 10 × (1 + log(tf(cts))) / (1 + log(89)) ≈ 5
We avoid overestimating the importance of profit relative to cts: profit is more important than cts, but not twice as important.
Log weighting schemes are common in information retrieval.
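The worked example can be checked with a few lines of Python. This is a minimal sketch, not the lecture's own code: the function name `term_weight` is made up for illustration, base-10 logarithms are assumed because they reproduce the slide's values of 6 and 5, and giving an absent term a weight of 0 is an added assumption (log 0 is undefined).

```python
import math


def term_weight(tf: int, doc_length: int) -> int:
    """Dampened, length-normalised weight of one term in one document."""
    if tf == 0:
        # Assumption: a term that does not occur gets weight 0.
        return 0
    # Assumption: base-10 logs, which reproduce the slide's 6 and 5.
    score = 10 * (1 + math.log10(tf)) / (1 + math.log10(doc_length))
    return round(score)


# The slide's example: a document of 89 words.
weights = {term: term_weight(tf, 89) for term, tf in {"profit": 6, "cts": 3}.items()}
print(weights)  # {'profit': 6, 'cts': 5}
```

Doubling the raw count from 3 to 6 moves the weight only from 5 to 6, which is the dampening effect the slide describes.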
Form of a decision tree
Example: the probability of belonging to category “earnings” given that s(cts) < 2 is 0.116.
[Figure: decision tree]
node 1: 7681 items, p(c|n1) = 0.3, split: cts, value: 2
  cts < 2: node 2: 5977 items, p(c|n2) = 0.116, split: net, value: 1
    net < 1: node 3
    net ≥ 1: node 4
  cts ≥ 2: node 5: 1704 items, p(c|n5) = 0.9, split: vs, value: 2
    vs < 2: node 6
    vs ≥ 2: node 7
Form of a decision tree
Equivalent to a formula in disjunctive normal form:
(cts < 2 ∧ net < 1 ∧ …) ∨ (cts ≥ 2 ∧ net ≥ 1 ∧ …)
A complete path is a conjunction.
[Figure: the same decision tree, with node 1 split on cts at value 2, node 2 split on net at value 1, and node 5 split on vs at value 2]
How to grow a decision tree
Typical procedure:
1. grow a very large tree
2. prune it
Pruning avoids overfitting the training data. For example, a tree can contain several branches that are based on accidental properties of the training set, e.g. only 1 document in category earnings contains both “dlrs” and “pct”.
Growing the tree
Splitting criterion: identifies the feature a, and the value of a, on which a node is split.
Stopping criterion: determines when to stop splitting, e.g. stop splitting when all elements at a node have an identical representation (equal vectors for all keywords).
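The example stopping criterion is easy to state in code. The sketch below is only an illustration: the function name `should_stop` and the toy vectors are hypothetical, and treating an empty node as "stop" is an added assumption.

```python
from typing import Sequence


def should_stop(vectors: Sequence[Sequence[int]]) -> bool:
    """True when every document at the node has the same keyword vector."""
    # Assumption: an empty node also counts as "nothing left to split".
    return len({tuple(v) for v in vectors}) <= 1


# Hypothetical keyword vectors for the documents at two nodes.
print(should_stop([[5, 0, 3], [5, 0, 3]]))  # identical vectors
print(should_stop([[5, 0, 3], [4, 0, 3]]))  # vectors differ, keep splitting
```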
Growing the tree: Splitting criterion
Information gain: do we reduce uncertainty if we split node n into two when attribute a has value y?
Let t be the distribution of n. This is equivalent to comparing the entropy of t with the entropy of t given a, i.e. the entropy of t with the entropy of its child nodes if we split:
$G(a, y) = H(t) - H(t \mid a) = H(t) - \left(p_l H(t_l) + p_r H(t_r)\right)$
The second term is the sum of the entropies of the child nodes, weighted by the proportion p of items from n in each child (l and r).
Information gain example
at node 1: P(c|n1) = 0.3, H = 0.611
at node 2: P(c|n2) = 0.116, H = 0.359
at node 5: P(c|n5) = 0.9, H = 0.22
weighted sum of 2 & 5 = 0.328
gain = 0.611 - 0.328 = 0.283
[Figure: the same decision tree, with node 1 split on cts at value 2 into node 2 (5977 items) and node 5 (1704 items)]
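The gain computation can be reproduced roughly as follows. This is a sketch under stated assumptions rather than the lecture's procedure: entropy is taken in nats (natural log), since that gives H = 0.611 for p = 0.3, and node 5 uses p = 0.943 instead of the rounded 0.9 shown on the slide, because an entropy of 0.22 corresponds to a probability near 0.94. The function names are illustrative.

```python
import math
from typing import List, Tuple


def entropy(p: float) -> float:
    """Binary entropy (in nats) of a node with positive-class probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def information_gain(parent_p: float, children: List[Tuple[int, float]]) -> float:
    """children holds (item_count, positive_class_probability) per child node."""
    total = sum(count for count, _ in children)
    weighted = sum(count / total * entropy(p) for count, p in children)
    return entropy(parent_p) - weighted


# Node 1 split on cts at value 2 into node 2 and node 5.
# Assumption: 0.943 for node 5; the slide shows the rounded value 0.9.
gain = information_gain(0.3, [(5977, 0.116), (1704, 0.943)])
print(round(gain, 3))
```

With the rounded probability 0.9 for node 5 the gain comes out noticeably lower, so the choice of input precision matters when comparing candidate splits.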
Leaf nodes
Suppose n3 has:
1500 “earnings” docs
other docs in other categories
Where do we classify a new doc d?
e.g. use MLE with add-one smoothing
[Figure: decision tree with leaf probabilities]
node 1: 7681 items, p(c|n1) = 0.3, split: cts, value: 2
  cts < 2: node 2: 5977 items, p(c|n2) = 0.116, split: net, value: 1
    net < 1: node 3: 5436 items, p(c|n3) = 0.050
    net ≥ 1: node 4: 541 items, p(c|n4) = 0.649
  cts ≥ 2: node 5: 1704 items, p(c|n5) = 0.9, split: vs, value: 2
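Add-one smoothing at a leaf might look like the sketch below. The 1500 “earnings” documents come from the slide; the 500 documents in other categories are a purely hypothetical count chosen for illustration, since the slide leaves that number open. The function name and the two-class setting are also assumptions.

```python
def smoothed_class_probability(class_count: int, total_count: int,
                               num_classes: int = 2) -> float:
    """Add-one (Laplace) estimate of P(class | leaf)."""
    return (class_count + 1) / (total_count + num_classes)


# 1500 "earnings" docs is the slide's figure; 500 other docs is hypothetical.
p_earnings = smoothed_class_probability(class_count=1500, total_count=1500 + 500)
print(round(p_earnings, 3))

# A new document reaching this leaf is labelled "earnings" if the estimate
# exceeds that of the alternative class.
print(p_earnings > 0.5)
```

The smoothing matters most for small leaves: a leaf with a single document of one class gets an estimate of 2/3 rather than 1.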
Pruning the tree
Pruning proceeds by removing leaf nodes one by one, until the tree is empty. At each step, remove the leaf node expected to be least helpful.
This needs a pruning criterion, i.e. a measure of “confidence” indicating what evidence we have that the node is useful.
Each pruning step gives us a new tree (the old tree minus one node), for a total of n trees if the original tree had n nodes.
Which of these trees do we select as our final classifier?
Pruning the tree: held-out data
To select the best tree, we can use held-out data. At each pruning step, try the resulting tree against the held-out data and check its success rate.
Since holding out data reduces the training set, it is better to perform cross-validation.
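Selecting among the pruned trees by held-out success rate can be sketched as below. Everything here is illustrative: the two "trees" are hypothetical stand-in functions over (cts, net) vectors, the held-out set is toy data, and a real setup would evaluate every tree produced by the pruning sequence (or use cross-validation, as noted above).

```python
from typing import Callable, Sequence, Tuple

Document = Tuple[int, int]            # hypothetical (cts, net) weights
Classifier = Callable[[Document], bool]


def accuracy(tree: Classifier, held_out: Sequence[Tuple[Document, bool]]) -> float:
    """Fraction of held-out documents the tree labels correctly."""
    correct = sum(tree(doc) == label for doc, label in held_out)
    return correct / len(held_out)


def select_best_tree(candidates: Sequence[Classifier],
                     held_out: Sequence[Tuple[Document, bool]]) -> Classifier:
    """Return the candidate with the highest held-out accuracy."""
    return max(candidates, key=lambda tree: accuracy(tree, held_out))


# Hypothetical stand-ins for an unpruned tree and a more heavily pruned one.
def full_tree(doc: Document) -> bool:
    return doc[0] >= 2 or doc[1] >= 1


def pruned_tree(doc: Document) -> bool:
    return doc[0] >= 2


held_out = [((3, 0), True), ((0, 1), False), ((0, 0), False), ((2, 2), True)]
best = select_best_tree([full_tree, pruned_tree], held_out)
print(best.__name__)  # pruned_tree
```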
When are decision trees useful?
Some disadvantages:
A decision tree is a complex classification device, with many parameters.
It splits the training data into very small chunks.
Small sets will display regularities that don’t generalise (overfitting).
Main advantage: very easy to understand!
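A reminder from lecture 9
The MaxEnt distribution is a log-linear model: the probability of a category c and document d is computed as a weighted multiplication of feature values (normalised by a constant Z):
$p(d, c) = \frac{1}{Z} \prod_{j=1}^{K} \alpha_j^{f_j(d, c)}$
Each feature imposes a constraint on the model: its expected value under the model p must equal its expected value under the empirical distribution $\tilde{p}$ of the training data:
$E_{p} f_j = E_{\tilde{p}} f_j$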
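A reminder from lecture 9
The MaxEnt principle dictates that we find the simplest (maximum entropy) model p* satisfying the constraints:
$p^* = \arg\max_{p \in P} H(p)$
where P is the set of possible distributions with $E_{p} f_j = E_{\tilde{p}} f_j$ for every feature $f_j$ ($\tilde{p}$ is the empirical distribution of the training data).
p* is unique and has the form given earlier.
Weights for features can be found using Generalised Iterative Scaling.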
Application to text categorisation
Example: we’ve identified 20 keywords which are diagnostic of the “earnings” category in Reuters.
Each keyword is a feature.
“Earnings” features (from M&S '99)

f_j (word)   Weight α_j   log α_j
cts          12.303        2.51
profit        9.701        2.272
net           6.155        1.817
loss          4.032        1.394
dlrs          0.678       -0.388
pct           0.590       -0.528
is            0.418       -0.871

Rows at the top are very salient/diagnostic features (higher weights); rows at the bottom are less important features.
Classifying with the maxent model
Recall that:
$p(d, c) = \frac{1}{Z} \prod_{j=1}^{K} \alpha_j^{f_j(d, c)}$
As a decision criterion we can use:
Classify a new document d as “earnings” if P(“earnings”|d) > P(¬“earnings”|d).
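The decision criterion can be illustrated with the feature weights listed for the “earnings” category. This is a simplified sketch and not the full model: it assumes binary features that fire only when a keyword occurs in the document and the class is “earnings”, so the unnormalised score of the other class is 1, and it ignores any additional correction feature a trained model may carry. The function names are made up for the example.

```python
import math
from typing import Set

# Weights alpha_j for the seven keywords shown in the "earnings" feature table.
ALPHA = {"cts": 12.303, "profit": 9.701, "net": 6.155, "loss": 4.032,
         "dlrs": 0.678, "pct": 0.590, "is": 0.418}


def p_earnings(words_in_doc: Set[str]) -> float:
    """P("earnings" | d) under the simplifying assumptions stated above."""
    # Sum log-weights of the keywords present, i.e. the log of the product.
    log_score = sum(math.log(a) for w, a in ALPHA.items() if w in words_in_doc)
    score = math.exp(log_score)
    # Assumption: the other class has unnormalised score 1 (no feature fires).
    return score / (score + 1.0)


def is_earnings(words_in_doc: Set[str]) -> bool:
    """Classify as "earnings" if it is the more probable of the two classes."""
    return p_earnings(words_in_doc) > 0.5


print(is_earnings({"cts", "profit"}))  # high-weight keywords present
print(is_earnings({"pct", "is"}))      # only low-weight keywords present
```

Under these assumptions the rule reduces to checking whether the summed log α values of the keywords present are positive, which is why weights below 1 count as evidence against the category.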
Rationale
Simple nearest neighbour (1NN):
Given: a new document d
Find: the document in the training set that is most similar to d
Classify d with the same category
Generalisation (kNN): compare d to its k nearest neighbours.
The crucial thing is the similarity measure.
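Example: 1NN + cosine similarity
Given: document d
Goal: categorise d based on training set T
Define:
$\mathrm{sim}_{max}(d) = \max_{d' \in T} \mathrm{sim}(d, d')$
where sim is the cosine similarity, $\mathrm{sim}(d, d') = \frac{d \cdot d'}{|d|\,|d'|}$
Find the subset T’ of T s.t.:
$T' = \{d' \in T \mid \mathrm{sim}(d, d') = \mathrm{sim}_{max}(d)\}$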
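A 1NN classifier with cosine similarity can be sketched in a few lines. The training vectors and labels below are hypothetical toy data, the function names are illustrative, and returning 0 for a zero-length vector is an added assumption to avoid division by zero.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: an all-zero vector is treated as similar to nothing.
    return dot / norm if norm else 0.0


def nearest_neighbour_label(d: Sequence[float],
                            training: Sequence[Tuple[Sequence[float], str]]) -> str:
    """Label of the training document most similar to d (ties: first found)."""
    _, label = max(training, key=lambda item: cosine_similarity(d, item[0]))
    return label


# Hypothetical training set T of (vector, category) pairs.
training = [((5, 6, 0), "earnings"), ((0, 1, 4), "other")]
print(nearest_neighbour_label((4, 5, 1), training))  # earnings
```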
Generalising to k>1 neighbours Choose the k nearest neighbours and weight them by similarity. Repeat method for each neighbour. Decide on a classification based on the majority class for these neighbours.
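One way to read "weight them by similarity" is a similarity-weighted vote over the k nearest neighbours, sketched below. This is one possible interpretation rather than the lecture's definitive method; the function name and the (similarity, label) pairs are hypothetical, and the similarities are assumed to have been computed already.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def weighted_knn_vote(scored: Sequence[Tuple[float, str]], k: int) -> str:
    """scored holds (similarity to d, label) for every training document."""
    # Keep the k most similar training documents.
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Each neighbour votes for its label with a weight equal to its similarity.
    votes: Dict[str, float] = defaultdict(float)
    for similarity, label in top_k:
        votes[label] += similarity
    return max(votes, key=votes.get)


# Hypothetical similarities between a new document d and four training docs.
scored = [(0.95, "earnings"), (0.60, "other"), (0.55, "other"), (0.10, "earnings")]
print(weighted_knn_vote(scored, k=1))  # earnings
print(weighted_knn_vote(scored, k=3))  # other
```

With k = 1 this reduces to the simple nearest-neighbour rule; with k = 3 the two moderately similar "other" documents outweigh the single closest "earnings" document.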