Supervised learning for text


1 Supervised learning for text

2 Organizing knowledge
Systematic knowledge structures: ontologies such as the Dewey decimal system, the Library of Congress catalog, the AMS Mathematics Subject Classification, and the US Patent subject classification; Web catalogs such as Yahoo! and Dmoz. Problem: manual maintenance.

3 Topic tagging
Finding similar documents and guiding queries. Naïve approach: syntactic similarity between documents. Better approach: topic tagging.

4 Topic tagging
Advantages: increases the vocabulary of classes; enables hierarchical visualization and browsing aids. Applications: email/bookmark organization, news tracking, tracking authors of anonymous texts (e.g., via the Flesch-Kincaid index), classifying the purpose of hyperlinks.

5 Supervised learning
Learning to assign objects to classes given examples; the learner is called a classifier. [Figure: a typical supervised text learning scenario.]

6 How text differs
M.L. classification techniques were developed for structured data. Text has many features and much noise, no fixed number of columns, no categorical attribute values, data scarcity, a larger number of class labels, and hierarchical relationships between classes that are less systematic than in structured data.

7 Techniques
Nearest-neighbor classifier: a lazy learner that remembers all training instances; the decision on a test document is based on the distribution of labels over the training documents most similar to it; assigns large weights to rare terms. Feature selection removes terms in the training documents that are statistically uncorrelated with the class labels. Bayesian classifier: fit a generative term distribution Pr(d|c) to each class c of documents {d}; at testing time, the distribution most likely to have generated a test document is used to label it.

8 Other classifiers
Maximum entropy classifier: estimate a direct distribution Pr(c|d) from term space to the probability of the various classes. Support vector machines: represent classes by numbers and construct a direct function from term space to the class variable. Rule induction: induce rules for classification over diverse features, e.g., information from ordinary terms, the structure of the HTML tag tree in which terms are embedded, link neighbors, and citations.

9 Other issues
Tokenization, e.g., replacing monetary amounts by a special token. Evaluating a text classifier: accuracy; training speed and scalability; simplicity, speed, and scalability under document modifications; ease of diagnosis, interpretation of results, and incorporating human judgment and feedback (subjective).

10 Benchmarks for accuracy
Reuters: 10,700 labeled documents, about 10% with multiple class labels. OHSUMED: abstracts from medical journals. 20NG: 18,800 labeled USENET postings, 20 leaf classes under 5 root-level classes. WebKB: 8,300 documents in 7 academic categories. Industry: 10,000 home pages of companies from 105 industry sectors, with shallow hierarchies of sector names.

11 Measures of accuracy
Assumptions: either each document is associated with exactly one class, or each document is associated with a subset of classes. Confusion matrix M (for more than two classes): M[i,j] is the number of test documents belonging to class i that were assigned to class j. For a perfect classifier, only the diagonal elements M[i,i] would be nonzero.

12 Evaluating classifier accuracy
Two-way ensemble: to avoid searching over the power set of class labels in the subset scenario, build, for each class, a two-way classifier between the class (e.g., "Sports") and its complement "Not sports" (all remaining documents). Recall and precision are then computed from a contingency matrix per (d,c) pair.

13 Evaluating classifier accuracy (contd.)
Micro-averaged contingency matrix: micro-averaged precision and recall give equal importance to each document. Macro-averaged precision and recall give equal importance to each class.
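The two averaging conventions are easy to state in code. Below is a minimal Python sketch, assuming per-class contingency cells (true positives, false positives, false negatives) have already been accumulated from the per-(d,c) tables above; the dictionary layout is hypothetical, not from the slides.

```python
from typing import Dict

def micro_macro(counts: Dict[str, Dict[str, int]]):
    """counts[c] holds the contingency cells 'tp', 'fp', 'fn' for class c."""
    # Micro-average: pool the cells over all classes, then compute P/R once.
    TP = sum(c["tp"] for c in counts.values())
    FP = sum(c["fp"] for c in counts.values())
    FN = sum(c["fn"] for c in counts.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    # Macro-average: compute P/R per class, then take the unweighted mean.
    ps, rs = [], []
    for c in counts.values():
        ps.append(c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0)
        rs.append(c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0)
    macro_p, macro_r = sum(ps) / len(ps), sum(rs) / len(rs)
    return micro_p, micro_r, macro_p, macro_r
```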

14 Evaluating classifier accuracy (contd.)
Precision-recall tradeoff: in a plot of precision vs. recall, a better classifier's curve lies closer to the upper right. The harmonic mean F1 = 2PR/(P + R) discounts classifiers that sacrifice one measure for the other.

15 Nearest Neighbor classifiers
Intuition: similar documents are expected to be assigned the same class label. Use the vector-space model with cosine similarity. Training: index each document and remember its class label. Testing: fetch the k documents most similar to the given document; the majority class wins. Alternative: weighted counts, i.e., counts of classes weighted by the corresponding similarity measure. Alternative: a per-class offset b_c, tuned by testing the classifier on a portion of the training data held out for this purpose.
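A minimal sketch of this scheme, assuming documents arrive as sparse term-frequency dictionaries and using similarity-weighted voting (the class and method names are illustrative, not the book's code):

```python
import math
from collections import defaultdict, Counter

class KNNTextClassifier:
    """k-NN over an inverted index with cosine similarity."""
    def __init__(self, k=5):
        self.k = k
        self.index = defaultdict(list)   # term -> [(doc_id, normalized weight)]
        self.labels = {}                 # doc_id -> class label

    def train(self, docs):
        """docs: iterable of (doc_id, {term: tf}, label)."""
        for doc_id, tf, label in docs:
            norm = math.sqrt(sum(w * w for w in tf.values())) or 1.0
            for t, w in tf.items():
                self.index[t].append((doc_id, w / norm))
            self.labels[doc_id] = label

    def classify(self, tf):
        norm = math.sqrt(sum(w * w for w in tf.values())) or 1.0
        scores = defaultdict(float)
        for t, w in tf.items():          # one index lookup per distinct query term
            for doc_id, dw in self.index.get(t, ()):
                scores[doc_id] += (w / norm) * dw
        top = sorted(scores.items(), key=lambda x: -x[1])[:self.k]
        votes = Counter()
        for doc_id, sim in top:
            votes[self.labels[doc_id]] += sim   # similarity-weighted vote
        return votes.most_common(1)[0][0] if votes else None
```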

16 Nearest neighbor classification
[Figure: nearest-neighbor classification.]

17 Nearest neighbor: pros
Easy availability and reuse of the inverted index. Collection updates are trivial. Accuracy comparable to the best known classifiers.

18 Nearest neighbor: cons
Iceberg category queries: classifying a test document d_q involves as many inverted-index lookups as there are distinct terms in d_q, scoring the (possibly large number of) candidate documents that overlap with d_q in at least one word, sorting by overall similarity, and picking the best k documents. Space overhead and redundancy: data is stored at the level of individual documents, with no distillation.

19 Workarounds
To reduce space requirements and speed up classification: find clusters in the data, store only a few statistical parameters per cluster, and compare the test document with documents in only the most promising clusters. But again this requires ad hoc choices for the number and size of clusters and their parameters, and k is corpus-sensitive.

20 TF-IDF
TF-IDF is computed over the whole corpus, so interclass correlations and per-class term frequencies are unaccounted for. Terms that occur relatively frequently in some classes compared to others should have higher importance; overall rarity in the corpus is not as important.

21 Feature selection
Data sparsity: the joint term distribution could be estimated reliably only if the training set were much larger than it is; estimating it would require far more documents than the vocabulary size, but for Reuters only about 10,700 labeled documents are available. Over-fitting problem: a joint distribution may fit the training instances but not fit unforeseen test data well.

22 Marginals rather than joint
Estimate the marginal distribution of each term in each class rather than the joint. Even these empirical distributions may not reflect the actual distributions if data is sparse; therefore feature selection. Purposes: improve accuracy by avoiding over-fitting; maintain accuracy while discarding as many features as possible to save a great deal of space for storing statistics. Methods may be heuristic, guided by linguistic and domain knowledge, or statistical.

23 Feature selection
Perfect (goal-directed) feature selection: consider all possible subsets of features; for each subset, train and test a classifier; retain the subset that results in the highest accuracy. This is computationally infeasible. Simple heuristics: drop stop words like "a", "an", "the"; use empirically chosen thresholds (task- and corpus-sensitive) to discard "too frequent" and "too rare" terms. For larger and more complex data sets, especially topic hierarchies, frequent terms can be confused with stop words. Two statistical approaches: greedy inclusion (bottom-up) vs. truncation (top-down).

24 Greedy inclusion algorithm
Most commonly used in text. Algorithm: compute, for each term, a measure of discrimination among the classes; arrange the terms in decreasing order of this measure; retain a number of the best terms (features) for use by the classifier. "Greedy" because the measure of discrimination of a term is computed independently of other terms. Over-inclusion has only mild effects on accuracy.

25 Measure of discrimination
The choice of measure depends on the document model, the desired speed of training, and the ease of updates to documents and class assignments. Observation: the feature sets that must be included for acceptable accuracy tend to have large overlap across measures.

26 The χ² test
Similar to the likelihood ratio test. Build a 2 × 2 contingency matrix per class-term pair. Under the independence hypothesis, χ² aggregates the deviations of observed values from expected values, χ² = Σ (O - E)²/E over the cells. The larger the value of χ², the lower our belief that the independence assumption is upheld by the observed data.

27 The χ² test (contd.)
Feature selection process: sort terms in decreasing order of their χ² values, train several classifiers with varying numbers of features, and stop at the point of maximum accuracy.

28 Mutual information
Useful when the multinomial document model is used. Let X and Y be discrete random variables taking values x and y. The mutual information (MI) between them is MI(X,Y) = Σ_{x,y} Pr(x,y) log( Pr(x,y) / (Pr(x) Pr(y)) ). It measures the extent of dependence between the random variables, i.e., the extent to which the joint deviates from the product of the marginals, weighted by the distribution mass at (x,y).

29 Mutual information (contd.)
Advantages: to the extent MI(X,Y) is large, X and Y are dependent; deviations from independence at rare values of (x,y) are played down. Interpretations: the reduction in the entropy of one variable given the other, MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X); or the KL distance between the no-independence hypothesis and the independence hypothesis, where the KL distance gives the average number of bits wasted by encoding events from the 'correct' distribution using a code based on a not-quite-right distribution.

30 Feature selection with MI
Fix a term t and let I_t be an event associated with that term; for the binary model, I_t = 0 or 1. Pr(I_t) is the empirical fraction of training documents in which the event occurred; Pr(I_t, c) is the empirical fraction of training documents that are in class c and in which the event occurred; Pr(c) is the fraction of training documents belonging to class c. The score of term t is MI(I_t, C) computed from these empirical estimates. Problem: document lengths are not normalized.
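A minimal sketch of the score, assuming the empirical joint and marginal probabilities described above have already been tabulated (the dictionary keys are illustrative):

```python
import math

def mi_term_score(p_tc, p_t, p_c):
    """Mutual information between a binary term event I_t and the class variable C.

    p_tc[(e, c)]: empirical joint probability of event value e in {0, 1} and class c;
    p_t[e] and p_c[c] are the corresponding marginals.
    """
    mi = 0.0
    for (e, c), p in p_tc.items():
        if p > 0:
            mi += p * math.log(p / (p_t[e] * p_c[c]))
    return mi
```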

31 Fisher's discrimination index
Useful when documents are scaled to constant length, so term occurrences are regarded as fractional real numbers. Two-class case: let X and Y be the sets of length-normalized document vectors corresponding to the two classes, let μ_X and μ_Y be the centroids of each class, and let S_X and S_Y be the corresponding covariance matrices.

32 Fisher's discrimination index (contd.)
Goal: find a projection of the data sets X and Y onto a line such that the two projected centroids are far apart compared to the spread of the point sets projected onto the same line. That is, find a column vector α that maximizes the ratio of the square of the difference between the projected means to the average projected variance, J(α) = (αᵀ(μ_X - μ_Y))² / (αᵀ S α) with S = (S_X + S_Y)/2. This gives α ∝ S⁻¹(μ_X - μ_Y).

33 Fisher's discrimination index (contd.)
If X and Y (for both the training and test data) are generated from multivariate Gaussian distributions (with equal covariance), then α = S⁻¹(μ_X - μ_Y) induces the optimal (minimum-error) classifier by suitable thresholding on αᵀq for a test point q. Problems: inverting S would be unacceptably slow for tens of thousands of dimensions, and the linear transformation would destroy the existing sparsity.

34 Solution
Recall that the goal was to eliminate terms from consideration, not to arrive at linear projections involving multiple terms. So regard each term t as providing a candidate direction α_t parallel to the corresponding axis in the vector-space model, and compute Fisher's index of t along that axis alone.

35 FI: solution (contd.)
For the two-class case the per-term index reduces to FI(t) = (μ_X(t) - μ_Y(t))² / (σ_X²(t) + σ_Y²(t)), the squared difference between the class means of term t divided by the sum of its within-class variances. This can be generalized to a set {c} of more than two classes. Feature selection: terms are sorted in decreasing order of FI(t) and the best ones are chosen as features.

36 Validation
How do we decide a cut-off rank? Validation approach: hold out a portion of the training documents, use the rest to do the term ranking, and use the held-out set as a test set; various cut-off ranks can be tested against the same held-out set. Alternatives: leave-one-out cross-validation, or repeatedly partitioning the data into two parts, with an aggregate accuracy computed over all trials. A wrapper searches, in decreasing order of discriminative power, for the number of features that yields the highest accuracy.

37 Validation (contd.)
Simple search heuristic: keep adding one feature at every step until the classifier's accuracy ceases to improve. [Figure: a general illustration of wrapping for feature selection.]

38 Validation (contd.)
For a naive Bayes-like classifier, evaluation over many choices of feature sets can be done at once. For maximum entropy or support vector machines, validation essentially involves training a classifier from scratch for each choice of the cut-off rank, and is therefore inefficient.

39 Validation: observations
A Bayesian classifier cannot over-fit much. [Figure: effect of feature selection on Bayesian classifiers.]

40 Truncation algorithms
Start from the complete set of terms T and keep selecting terms to drop until you end up with a feature subset F. Question: when should you stop truncating? Two objectives: minimize the size of the selected feature set F, and keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T).

41 Truncation Algorithms: Example
Kullback-Leibler (KL) divergence measures the similarity (distance) between two distributions. Markov blanket: let X be a feature in T and let M be a subset of T not containing X. If the presence of M renders the presence of X unnecessary as a feature, M is a Markov blanket for X; technically, M is a Markov blanket for X if X is conditionally independent of the remaining features and the class variable given M. Eliminating a variable because it has a Markov blanket contained in the other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F).

42 Finding Markov Blankets
Exact Markov blankets are usually absent in practice, so we look for approximate ones. To cut down computational complexity, restrict the search to Markov blankets M with at most k features and, for a given feature X, restrict the candidate members of M to those features most strongly correlated with X (using tests similar to the χ² or MI tests). Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy.

43 Feature Truncation algorithm
while the truncated Pr(C|F) is reasonably close to the original Pr(C|T) do
  for each remaining feature X do
    Identify a candidate Markov blanket M: for some tuned constant k, find the set M of k
    variables in F \ {X} that are most strongly correlated with X.
    Estimate how good a blanket M is (e.g., how little Pr(C | M, X) differs from Pr(C | M)).
  end for
  Eliminate the feature having the best surviving Markov blanket.
end while

44 General observations on feature selection
The issue of document length should be addressed properly. The choice of association measure does not make a dramatic difference. Greedy inclusion algorithms scale nearly linearly with the number of features, whereas the Markov blanket technique is considerably more expensive. Advantage of the Markov blanket algorithm over greedy inclusion: a greedy algorithm may include features with high individual correlations even though one subsumes the other, and features that are individually uncorrelated could be jointly more correlated with the class, although this rarely happens. The binary include/exclude view of feature selection is not the only view one may subscribe to; an alternative is to combine features into fewer, simpler ones, e.g., by projecting the document vectors to a lower-dimensional space.

45 Bayesian learner
A very practical text classifier. Assumptions: a document can belong to exactly one of a set of classes or topics; each class c has an associated prior probability Pr(c); and there is a class-conditional document distribution Pr(d|c) for each class. The posterior probability is obtained using Bayes rule: Pr(c|d) = Pr(c) Pr(d|c) / Σ_{c'} Pr(c') Pr(d|c'). The parameter set Θ consists of the parameters of all the Pr(d|c) distributions.

46 Parameter Estimation for Bayesian Learner
The estimate of Θ is based on two sources of information: prior knowledge of the parameter set before seeing any training documents, and the terms in the training documents D. Bayes-optimal classifier: take the expectation of each parameter over Pr(Θ|D); computationally infeasible. Maximum likelihood estimate: replace the sum above with the value of the summand Pr(c|d,Θ) at Θ = arg max Pr(D|Θ); works poorly.

47 Naïve Bayes Classifier
Assumes independence between terms, so the joint term distribution is the product of the marginals. Widely used owing to the simplicity and speed of training, applying, and updating. Two kinds of widely used marginals for text: the binary model and the multinomial model.

48 Naïve Bayes models
Binary model: each parameter θ_{c,t} indicates the probability that a document in class c will mention term t at least once, so Pr(d|c) is a product over terms present and absent in d. Multinomial model: each class has an associated die with |W| faces; each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die; if term t occurs n(d,t) times in document d and document length is a random variable L, then Pr(d|c) is a multinomial over the term counts.

49 Analysis of Naïve Bayes Models
We multiply together a large number of small probabilities, which results in extremely tiny probabilities as answers. Solution: store all numbers as logarithms. The class that comes out on top typically wins by a huge margin; scores can be sanitized using the likelihood ratio, whose logarithm is also called the logit function.

50 Parameter smoothing
What if a test document contains a term t that never occurred in any training document in class c? Then the MLE estimate of θ_{c,t} will be zero, and so will Pr(d|c), even if many other terms clearly hint at a high likelihood of class c generating the document. Bayesian estimation addresses estimating a probability from insufficient data. If you toss a coin n times and it always comes up heads, what is the probability that the (n+1)th toss will also come up heads? Posit a prior distribution on the head probability θ, e.g., the uniform distribution; combining it with the observed data yields a posterior distribution over θ.

51 Laplace smoothing
Based on Bayesian estimation. Choose a loss function (penalty) for picking a smoothed value as against the 'true' value, e.g., the squared error. For this choice of loss, the best choice of the smoothed parameter is simply the expectation of the posterior distribution having observed the data; for the coin with k heads in n tosses under a uniform prior, this is (k+1)/(n+2), Laplace's law of succession.

52 Laplace Smoothing (contd.)
Heuristic alternative: Lidstone's law of succession, which adds a fraction λ instead of 1. In the derivation for the multinomial model there are |W| possible events, where W is the vocabulary, giving θ_{c,t} = (n(c,t) + λ) / (Σ_τ n(c,τ) + λ|W|); λ = 1 recovers Laplace smoothing.
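Tying slides 48 through 52 together, here is a minimal multinomial naive Bayes sketch with Lidstone smoothing and log-space scoring; the input format ({term: count}, label) is assumed for illustration.

```python
import math
from collections import defaultdict

class MultinomialNB:
    """Multinomial naive Bayes with Lidstone smoothing (lam = 1.0 gives Laplace)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def train(self, docs):
        """docs: iterable of ({term: count}, label)."""
        self.class_docs = defaultdict(int)                      # class -> #docs
        self.term_counts = defaultdict(lambda: defaultdict(int))  # class -> term -> count
        self.class_tokens = defaultdict(int)                    # class -> total tokens
        self.vocab = set()
        n_docs = 0
        for tf, c in docs:
            n_docs += 1
            self.class_docs[c] += 1
            for t, n in tf.items():
                self.term_counts[c][t] += n
                self.class_tokens[c] += n
                self.vocab.add(t)
        self.log_prior = {c: math.log(m / n_docs) for c, m in self.class_docs.items()}

    def classify(self, tf):
        W = len(self.vocab)
        best, best_score = None, -math.inf
        for c in self.class_docs:
            # Work in log space to avoid underflow from many tiny probabilities.
            score = self.log_prior[c]
            denom = self.class_tokens[c] + self.lam * W
            for t, n in tf.items():
                theta = (self.term_counts[c].get(t, 0) + self.lam) / denom
                score += n * math.log(theta)
            if score > best_score:
                best, best_score = c, score
        return best
```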

53 Performance analysis
The multinomial naive Bayes classifier generally outperforms the binary variant. k-NN may outperform naive Bayes, but naive Bayes is faster and more compact. Decision boundaries mark the regions of potential confusion between classes.

54 NB: Decision boundaries
A Bayesian classifier partitions the multidimensional term space into regions; within each region the probability of one class is higher than the others, and on the boundaries the probabilities of two or more classes are exactly equal. NB is a linear classifier: it decides between c = +1 and c = -1 by thresholding the value of α·d + b for a suitable weight vector α, where the offset b derives from the class priors.
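The linearity is easy to see from the standard log-odds derivation for the multinomial model (a reconstruction, assuming the document-length distribution is shared across classes so that it cancels):

```latex
\log\frac{\Pr(c=+1\mid d)}{\Pr(c=-1\mid d)}
  \;=\; \underbrace{\log\frac{\Pr(+1)}{\Pr(-1)}}_{b}
  \;+\; \sum_{t} n(d,t)\,
        \underbrace{\log\frac{\theta_{+1,t}}{\theta_{-1,t}}}_{\alpha_t},
```

so thresholding α·d + b at zero reproduces the naive Bayes decision.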

55 Pitfalls
Strong bias: naive Bayes fixes the policy that α_t (the t-th component of the linear discriminant) depends only on the statistics of term t in the corpus. Therefore it cannot pick from the entire set of possible linear discriminants.

56 Bayesian networks
Attempt to capture statistical dependencies between the terms themselves, giving better approximations to the joint distribution over terms: the probability of a term occurring depends on observations about other terms as well as on the class variable. A directed acyclic graph: all random variables (classes and terms) are nodes; dependency (parent-child) edges are drawn from c to t for each term t, and additional parent-child edges between terms represent dependencies among them.

57 Bayesian networks (contd.)
[Figure] Under the naive Bayes assumption, the only edges are from the class variable to individual terms. Toward better approximations to the joint distribution over terms, the probability of a term occurring may now depend on observations about other terms as well as the class variable.

58 Bayesian Belief Network (BBN)
A DAG. The parents Pa(X) of a node X are the nodes connected to X by directed edges; fixing the values of the parent variables completely determines the conditional distribution of X. Conditional probability tables (CPTs): for discrete variables, the distribution data for X can be stored in the obvious way as a table, each row showing a set of values of the parents, a value of X, and a conditional probability. Unlike naive Bayes, Pr(d|c) is not a simple product over all terms; it is a product of the per-node conditionals given their parents.

59 BBN: difficulties
Getting a good network structure takes at least quadratic time (enumeration of all pairs of features). The approach has been exploited only for the binary model; for the multinomial model the CPT sizes are prohibitive.

60 Exploiting hierarchy among topics
Structured data may have an ordering between class labels, e.g., high-, medium-, or low-cancer-risk patients in a data warehouse. For text, class labels are related by a taxonomy: a large and complex class hierarchy. The simplest form of taxonomy is a tree structure, widely used in directory browsing and often the output of clustering algorithms. Inheritance: if class c0 is the parent of class c1, any training document that belongs to c1 also belongs to c0.

61 Topic Hierarchies : Feature selection
The discriminating ability of a term is sensitive to the node (or class) in the hierarchy, so the measure of discrimination of a term is evaluated with respect to internal nodes of the hierarchy only. For example, 'can' may be a noisy word at the root node of Yahoo!, yet help classify documents under the subtree /Science/Environment/Recycling.

62 Topic Hierarchies: Enhanced parameter estimation
Uniform priors are not good. Idea: if a parameter estimate is shaky at a node with few training documents, perhaps we can impose a strong prior from a well-trained parent to repair the estimates. Shrinkage seeks to improve estimates at descendants using data from ancestors.

63 Shrinkage
Assume the multinomial model and introduce a dummy class c0 as the parent of the root c1, where all terms are equally likely. For a specific path c0, c1, ..., cn, the 'shrunk' estimate is a convex linear interpolation of the MLE parameters at the ancestor nodes up through c0. The mixing weights are estimated with a simple form of the EM algorithm: they are determined empirically by iteratively maximizing the probability of a held-out portion Hn of the training set for node cn.

64 Shrinkage: Observation
Shrinkage improves accuracy beyond hierarchical naive Bayes; the improvement is largest when data is sparse. It is capable of utilizing many more features than naive Bayes.

65 Topic search in Hierarchy
By definition, all documents are relevant to the root 'topic', so Pr(root|d) = 1. Given a test document d, find one or more of the most likely leaf nodes in the hierarchy; a document cannot belong to more than one path.

66 Topic search in Hierarchy: Greedy Search strategy
Search starts at the root and decisions are made greedily: at each internal node, pick the highest-probability child class and continue. Drawback: early errors have a compounding effect.

67 Topic search in Hierarchy: Best-first search strategy
To find the m most probable leaf classes, find the weighted shortest path from the root to a leaf, where edge (c0, ci) is assigned the (non-negative) edge weight -log Pr(ci|c0, d). To make best-first search behave differently from greedy search, rescale/smooth the probabilities.
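A minimal sketch of the best-first search, assuming hypothetical hooks `children(c)` for the taxonomy and `cond_prob(child, parent)` for the per-node classifier's Pr(child | parent, d) on the current document:

```python
import heapq
import math

def best_first_leaves(root, cond_prob, children, m=1):
    """Return the m most probable leaf classes for a test document.

    Path cost is the sum of -log conditional probabilities, so the weighted
    shortest root-to-leaf path is the most probable one.
    """
    heap = [(0.0, 0, root)]   # (cost so far, tie-breaker, class)
    tie = 0
    leaves = []
    while heap and len(leaves) < m:
        cost, _, c = heapq.heappop(heap)
        kids = children(c)
        if not kids:                                  # reached a leaf
            leaves.append((c, math.exp(-cost)))       # recover the path probability
            continue
        for child in kids:
            p = cond_prob(child, c)
            if p > 0:
                tie += 1
                heapq.heappush(heap, (cost - math.log(p), tie, child))
    return leaves
```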

68 Best-first search: results
[Table] Using best-first search on a hierarchy can improve both accuracy and speed. Because the hierarchy has four internal nodes, the second column shows the number of features for each. These were tuned so that the total number of features for both flat and best-first classification is roughly the same (so that the model complexity is comparable). Because each document belonged to exactly one leaf node, recall equals precision in this case and is called 'accuracy'.

69 The semantics of hierarchical classification
Asymmetry: a training document can be associated with any node, but a test document must be routed to a leaf. Routing test documents to internal nodes may be appropriate when none of the children matches the document, when many children match it, or when the chance of making a mistake while pushing the test document down one more level is too high. This remains a research issue.

70 Maximum entropy learners: Motivation
A Bayesian learner first models Pr(d|c) at training time and applies Bayes rule at test time. Two problems: d is represented in a high-dimensional term space, so Pr(d|c) cannot be estimated accurately from a training set of limited size; and there is no systematic way of adding synthetic features, since such additions may result in highly correlated features (high subsumption).

71 Maximum entropy learners
Assume that each document has only one class label. Indicator functions f_j(c,d) flag the j-th condition relating class c to document d. The constraints require the model's expectation of each indicator f_j to match its empirical expectation, approximating Pr(d,c) and Pr(d) with their empirical estimates from the training data.

72 Principle of Maximum Entropy
The constraints do not determine Pr(c|d) uniquely. Principle of maximum entropy: prefer the simplest model that explains the observed data, i.e., choose the Pr(c|d) that maximizes the entropy of Pr(c|d). (In the event of an empty training set we should consider all classes equally likely.) Constrained optimization: maximize the entropy of the model distribution Pr(c|d) while obeying the constraints for all j; optimize by the method of Lagrange multipliers.

73 Maximum Entropy solution
Fitting the distribution to the data involves two steps: identify a set of indicator functions derived from the data, then iteratively arrive at values for the parameters that satisfy the constraints while maximizing the entropy of the distribution being modeled. An equivalent optimization problem is to maximize the likelihood of the training data under the resulting parametric model.
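The parametric form referred to here is, in the standard maximum entropy treatment (a reconstruction of the slide's elided formula), the log-linear family

```latex
\Pr(c\mid d) \;=\; \frac{1}{Z(d)}\,\exp\Bigl(\sum_{j}\lambda_j f_j(c,d)\Bigr),
\qquad
Z(d) \;=\; \sum_{c'} \exp\Bigl(\sum_{j}\lambda_j f_j(c',d)\Bigr),
```

where the λ_j are the Lagrange multipliers attached to the constraints.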

74 Text Classification using Maximum Entropy Model
Example: pick an indicator for each (class, term) combination. For the binary document model, f_{c',t}(c,d) is 1 when c = c' and t occurs in d (and 0 otherwise); for the multinomial document model, the indicator instead reports the relative frequency of t in d when c = c'. What we gain with maximum entropy over naive Bayes is that it does not suffer from the independence assumptions: e.g., if the terms t1 = machine and t2 = learning are often found together in class c, their parameters λ_{c,t1} and λ_{c,t2} would be suitably discounted.

75 Performance of Maximum Entropy Classifier
Outperforms naive Bayes in accuracy, but not consistently. [Table of figures omitted.]

76 Discriminative classification
Naive Bayes and maximum entropy classifiers "induce" linear decision boundaries between classes in the feature space. Discriminative classifiers directly map the feature space to class labels, with class labels encoded as numbers, e.g., +1 and -1 for a two-class problem. Two examples: linear least-square regression and support vector machines.

77 Linear least-square regression
There is no inherent reason for going through the modeling step, as in Bayesian or maximum entropy classifiers, to get a linear discriminant. Linear regression problem: look for some vector α (and offset b) such that α·d_i + b directly predicts the label c_i of document d_i, minimizing the squared error between the observed and predicted class variable; this can be done with the Widrow-Hoff (WH) update rule, with the final α scaled to norm 1. Two equivalent interpretations: the classifier is a hyperplane, or documents are projected onto a direction. Performance is comparable to naive Bayes and maximum entropy.
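A minimal sketch of the Widrow-Hoff (least-mean-squares) update on sparse documents; the learning rate, epoch count, and data layout are illustrative assumptions rather than tuned values from the book.

```python
def widrow_hoff(docs, labels, eta=0.1, epochs=5):
    """Least-squares linear discriminant trained with the Widrow-Hoff update.

    docs: list of sparse documents {term: weight}; labels: +1 / -1 per document.
    """
    alpha = {}
    for _ in range(epochs):
        for d, c in zip(docs, labels):
            pred = sum(alpha.get(t, 0.0) * w for t, w in d.items())
            err = c - pred
            for t, w in d.items():      # alpha <- alpha + 2*eta*(c - alpha.d)*d
                alpha[t] = alpha.get(t, 0.0) + 2 * eta * err * w
    norm = sum(v * v for v in alpha.values()) ** 0.5 or 1.0
    return {t: v / norm for t, v in alpha.items()}   # scale to norm 1
```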

78 Support vector machines
Assumption: the training and test populations are drawn from the same distribution. Hypothesis: a hyperplane that is close to many training data points has a greater chance of misclassifying test instances, whereas a hyperplane passing through a "no-man's land" has lower chances of misclassification. Make a decision by thresholding α·d + b; seek an α that maximizes the distance of every training point from the hyperplane.

79 Support vector machines
The optimal separator is orthogonal to the shortest line connecting the convex hulls of the two classes and intersects this shortest line halfway. Margin: the distance of any training point from the optimized hyperplane; with the standard scaling c_i(α·d_i + b) ≥ 1, it is at least 1/||α||.

80 SVM optimization
[Figure: illustration of the SVM optimization problem.]

81 SVMs: non-separable classes
Classes in the training data are not always separable. Introduce slack ("fudge") variables ξ_i ≥ 0 that let some points violate the margin at a penalty in the objective; in practice the equivalent dual problem is solved (see the sketch below).

82 SVMs: complexity
A quadratic optimization problem. Working set: refine a few multipliers at a time, holding the others fixed, with on-demand computation of inner products. For n training documents the naive cost grows rapidly with n, but recent SVM packages achieve near-linear time by clever selection of working sets.

83 SVM performance
Among the most accurate classifiers for text, with better accuracy than naive Bayes and decision-tree classifiers. An interesting revelation: linear SVMs suffice, because standard text classification tasks have classes that are almost separable using a hyperplane in feature space. Research issue: non-linear SVMs.

84 SVM training time
[Figure: SVM training time as the training set size is increased, with and without sufficient memory to hold the training set. In the latter case the memory is set to about a quarter of that needed by the training set.]

85 LSVM vs. earlier classifiers on Reuters
[Table: comparison of LSVM with previous classifiers on the Reuters data set (data taken from Dumais). The naive Bayes classifier used binary features, so its accuracy can be improved.]

86 Accuracy across three classifiers
[Table: comparison of accuracy across three classifiers (naive Bayes, maximum entropy, linear SVM) on three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and university Web pages from WebKB.]

87 Comparison between several classifiers using the Reuters collection.

88 Hypertext classification
Techniques to address hypertextual features. Document Object Model (DOM): a well-formed HTML document is a properly nested hierarchy of regions represented as a tree-structured DOM tree; internal nodes are elements, some leaf nodes are segments of text, and other nodes are hyperlinks to other Web pages, which in turn have DOM trees of their own.

89 Representing hypertext for supervised learning
Paying special attention to tags can help with learning; keyword-based search engines already assign heuristic weights to terms that occur in specific HTML tags. Example on the next slide.

90 Prefixing with tags
To distinguish between two occurrences of the word "surfing", prefix each term by the sequence of tags that must be followed from the DOM root to reach the term. A repeated term in different sections should reinforce belief in a class label; using a maximum entropy classifier we can accumulate evidence from different features by maintaining both forms of a term: plain text and prefixed text (all path prefixes).

91 Experiments
10,705 patents from the US Patent Office: 70% error with a plain-text classifier, 24% error with path-tagged terms, 17% error with path prefixes. 1,700 resumes (with a naive Bayes classifier): 53% error with flattened HTML, 40% error with prefix-tagged terms.

92 Limitations
Prefix representations are ad hoc and inflexible. Generalizability: how do we incorporate additional features, e.g., features derived from hyperlinks? Relations provide a uniform way to codify hypertextual features (example on the following slides).

93 Rule Induction for relational learning
Inductive classifiers discover rules from a collection of relations, providing an example solution to the above. Goal: discover a set of predicate rules. Consider a two-class setting with positive examples D+ and negative examples D-. A test instance is labeled positive if some rule evaluates to true on it, and negative otherwise.

94 Rule induction with First Order Inductive Logic (FOIL)
A well-known rule learner. Start with an empty rule set and repeat until no positive instances are left: learn a new (disjunctive) rule, adding conjunctive literals to it until no negative example is covered by it, each time picking a literal that rapidly increases the ratio of surviving positive to negative bindings; then remove the positive examples covered by any rule generated thus far.

95 Literals explored
Q(X1, ..., Xk), where Q is a relation and the Xi are variables, at least one of which must already be bound; and not(L), where L is a literal of the above form.

96 Analysis
FOIL can learn class labels for individual pages as well as relationships between labels: member(homePage, department), teaches(homePage, coursePage), advises(homePage, homePage), writes(homePage, paper). Hybrid approaches use a statistical classifier for the more complex search for literals, with inductive learning comparing the estimated probabilities of various classes. Rules can also label relations recursively, e.g., relating a page's label to the labels of neighboring pages: classified(A, facultyPage) :- links-to(A, B), classified(B, studentPage), links-to(A, C), classified(C, coursePage), links-to(A, D), classified(D, publicationsPage).

