1 ICS 278: Data Mining
Lecture 15: Text Classification
Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine

2 Roadmap for Lectures
Lecture 15 (today): text classification
Lectures 16, 17, 18, 19:
– Unsupervised learning from text: clustering and topic modeling
– Recommender systems
– Credit scoring applications
– Pattern-finding algorithms
Lecture 20:
– Thursday June 8th (2 weeks from Thursday)
– 5-minute project summary from each student
– More details on format to come later

3 Text Classification
Text classification has many applications:
– Spam email detection
– Automated tagging of streams of news articles, e.g., Google News
– Automated creation of Web-page taxonomies
Data representation:
– "Bag of words" most commonly used: either counts or binary
– Can also use "phrases" for commonly occurring combinations of words
Classification methods:
– Naïve Bayes widely used (e.g., for spam email); fast and reasonably accurate
– Support vector machines (SVMs): typically the most accurate method in research studies, but more computationally complex
– Logistic regression (regularized): not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
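For illustration only (not part of the original slides), here is a minimal bag-of-words classification sketch in Python, assuming scikit-learn is available; the three toy documents and labels are made up:

    # Minimal bag-of-words text classification sketch:
    # CountVectorizer builds the document-term count matrix and
    # MultinomialNB is the count-based naive Bayes variant discussed later.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs   = ["stocks fell as interest rates rose",   # toy "finance" doc
              "cheap meds buy now limited offer",     # toy "spam" doc
              "the team won the championship game"]   # toy "sports" doc
    labels = ["finance", "spam", "sports"]

    vectorizer = CountVectorizer()            # bag of words: token counts
    X = vectorizer.fit_transform(docs)        # sparse document-term matrix
    clf = MultinomialNB().fit(X, labels)

    print(clf.predict(vectorizer.transform(["interest rates and stocks"])))
    # -> ['finance']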

4 Further Reading on Text Classification
Web-related text mining in general:
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See chapter 5 for a discussion of text classification.
General references on text and language modeling:
– C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
– D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
SVMs for text classification:
– T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.

5 Common Data Sets Used for Evaluation
Reuters:
– 10,700 labeled documents
– 10% of the documents have multiple class labels
Yahoo! Science hierarchy:
– 95 disjoint classes with 13,598 pages
20 Newsgroups data:
– 18,800 labeled USENET postings
– 20 leaf classes, 5 root-level classes
WebKB:
– 8,300 documents in 7 categories such as "faculty", "course", "student"
Industry:
– 6,449 home pages of companies partitioned into 71 classes

6 Trimming the Vocabulary
Stop-word removal:
– Remove "non-content" words: very frequent "stop words" such as "the", "and", ...
– Remove very rare words, e.g., terms that occur only a few times in 100k documents
– Can remove 30% or more of the original unique words
Stemming:
– Reduce all variants of a word to a single term
– E.g., {draw, drawing, drawings} -> "draw"
– The Porter stemming algorithm (1980) relies on a pre-constructed suffix list with associated rules, e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE: BINARIZATION => BINARIZE
This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem!
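As an illustrative sketch (not from the lecture), the following Python fragment combines stop-word removal with Porter stemming, assuming NLTK is installed; the tiny stop-word list is invented for the example:

    # Vocabulary trimming sketch: drop stop words, then stem the survivors.
    from nltk.stem import PorterStemmer

    STOPWORDS = {"the", "and", "of", "a", "to", "in", "is"}   # toy list only
    stemmer = PorterStemmer()

    def trim(tokens):
        """Drop stop words, then map each surviving token to its Porter stem."""
        return [stemmer.stem(t) for t in tokens if t.lower() not in STOPWORDS]

    print(trim("the drawings and the drawing".split()))
    # -> ['draw', 'draw']  (Porter reduces 'drawing'/'drawings' to 'draw')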

7 Feature Selection
Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms:
– See classification results later in these slides
Greedy search:
– Start from the empty set or the full set and add/delete one term at a time
– Heuristics for adding/deleting: information gain (mutual information of term with class), chi-square, other ideas
Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance (a small sketch follows below).
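A possible sketch of this kind of feature selection, assuming scikit-learn: rank terms by the chi-square statistic and keep only the top k before fitting a naïve Bayes classifier. The pipeline below is illustrative; X_text and y stand for raw documents and labels, and k must not exceed the vocabulary size:

    # Keep only the k most discriminative terms (by chi-square) before training.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def build_model(X_text, y, k=1000):
        """X_text: list of raw documents, y: class labels, k <= vocabulary size."""
        model = make_pipeline(
            CountVectorizer(),            # bag-of-words counts
            SelectKBest(chi2, k=k),       # keep the k highest-scoring terms
            MultinomialNB(),
        )
        return model.fit(X_text, y)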

8 Example of the Role of Feature Selection (from Chakrabarti, Chapter 5): 9,600 documents from the US Patent database, 20,000 raw features (terms)

9 Classifying Term Vectors
Typically multiple different words may be helpful in classifying a particular class, e.g.:
– Class = "finance"
– Words = "stocks", "return", "interest", "rate", etc.
– Thus, classifiers that combine multiple features often do well, e.g., naïve Bayes, logistic regression, SVMs, etc.
Linear classifiers often perform well in high dimensions:
– In many cases there are fewer documents in the training data than dimensions, i.e., n < p, so the training data are linearly separable
– So again, naïve Bayes, logistic regression, and linear SVMs are all useful
– The question becomes: which linear discriminant to select?

10 Classification Issues
Typically many features, p ~ O(10^4) terms.
Consider n sample points in p dimensions:
– Binary labels => 2^n possible labelings (or dichotomies)
– A labeling is linearly separable if we can separate the labels with a hyperplane
– Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable:
f(n, p) = 1 for n <= p + 1
f(n, p) = (2 / 2^n) * sum_{i=0}^{p} C(n-1, i) for n > p + 1
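A small Python sketch of the counting formula above (using math.comb; the example inputs are arbitrary):

    # Fraction of the 2^n labelings of n points in general position in p
    # dimensions that are linearly separable.
    from math import comb

    def frac_separable(n, p):
        if n <= p + 1:
            return 1.0
        return 2.0 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

    print(frac_separable(100, 10000))   # n << p: 1.0, always separable
    print(frac_separable(100, 10))      # n >> p: close to 0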

11 If n < p+1, then points will be linearly separable (for large p)

12 Types of Classifiers
Generative/probabilistic:
– Model p(x | c) for each class, then estimate p(c | x)
– E.g., naïve Bayes model
Conditional probability/regression:
– Model p(c | x) directly
– E.g., logistic regression
Discriminative:
– Look for decision boundaries in the input space x directly, with no probabilities
– E.g., perceptron, linear discriminants, SVMs, etc.

13 Probabilistic "Generative" Classifiers
Model p(x | c_k) for each class and perform classification via Bayes' rule:
c = arg max_k { p(c_k | x) } = arg max_k { p(x | c_k) p(c_k) }
How to model p(x | c_k)?
– p(x | c_k) = probability of a "bag of words" x given a class c_k
– Two commonly used approaches (for text):
Naïve Bayes: treat each term x_j as being conditionally independent, given c_k
Multinomial: model a document with N words as N tosses of a p-sided die
– Other models are possible but less common, e.g., model word order by using a Markov chain for p(x | c_k)

14 Naïve Bayes Classifier for Text
Naïve Bayes classifier = conditional independence model:
– Assumes conditional independence given the class:
p(x | c_k) = Π_j p(x_j | c_k)
– Note that we model each term x_j as a discrete random variable
– Binary terms (Bernoulli):
p(x | c_k) = Π_{j: x_j = 1} p(x_j = 1 | c_k) × Π_{j: x_j = 0} p(x_j = 0 | c_k)
– Non-binary terms (counts):
p(x | c_k) = Π_j p(x_j = n_j | c_k), where n_j is the count of term j in the document
Can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p(x_j = n_j | c_k) distributions.
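A minimal NumPy sketch of the Bernoulli (binary-term) naïve Bayes computations above, with simple add-one smoothing; the array layout and function names are invented for the example:

    # Bernoulli naive Bayes: class score = log prior + sum of per-term log probs.
    import numpy as np

    def train_bernoulli_nb(X, y):
        """X: (n_docs, n_terms) binary array, y: integer class ids 0..K-1."""
        classes = np.unique(y)
        log_prior = np.log(np.bincount(y) / len(y))
        # p(x_j = 1 | c_k) with Laplace (add-one) smoothing
        theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                          for c in classes])
        return log_prior, theta

    def predict_bernoulli_nb(x, log_prior, theta):
        """x: binary term vector (numpy array); returns the most probable class id."""
        log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
        return int(np.argmax(log_prior + log_lik))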

15 Multinomial Classifier for Text
Multinomial classification model:
– Assume the words in a document are generated by a p-sided die (multinomial model):
p(x | c_k) = p(N_x | c_k) × (N_x! / Π_j n_j!) × Π_j p(term j | c_k)^{n_j}
where N_x = number of terms (total count) in document x, and n_j = number of times term j occurs in the document
– p(N_x | c_k) = probability a document has length N_x, e.g., a Poisson model; can be dropped if length is thought not to be class dependent
– Here we have a single p-sided random variable for each class, and the per-word probabilities p(term j | c_k) sum to 1 over the vocabulary (i.e., a multinomial model)
– In practice the product is typically only evaluated for terms with counts n_j = 1, 2, 3, ...
– But "zero counts" could also be modeled if desired; this would be equivalent to a naïve Bayes model with a geometric distribution on counts
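A matching NumPy sketch of the multinomial scoring rule, dropping the document-length and multinomial-coefficient factors since they do not depend on the class (under the length-independence assumption); function names and array layout are illustrative:

    # Multinomial classifier: score = log p(c_k) + sum_j n_j * log p(term j | c_k).
    import numpy as np

    def train_multinomial(X_counts, y):
        """X_counts: (n_docs, n_terms) term-count matrix, y: class ids 0..K-1."""
        classes = np.unique(y)
        log_prior = np.log(np.bincount(y) / len(y))
        # p(term j | c_k) with add-one smoothing; each row sums to 1
        counts = np.array([X_counts[y == c].sum(axis=0) + 1.0 for c in classes])
        log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))
        return log_prior, log_theta

    def predict_multinomial(x_counts, log_prior, log_theta):
        """x_counts: term-count vector for one document."""
        return int(np.argmax(log_prior + log_theta @ x_counts))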

16 Highest Probability Terms in Multinomial Distributions

17 Comparing Naïve Bayes and Multinomial Models
McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments (however, this may be more a result of using counts vs. binary features).
Note on names used in the literature:
– "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary version of the naïve Bayes model
– The multinomial model is also referred to as the "unigram" model
– The multinomial model is also sometimes (confusingly) referred to as naïve Bayes

18 WebKB Data Set
Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin).
Crawl and classify a new site (CMU).
Results:

19 Comparing Bernoulli and Multinomial on WebKB Data

20 Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)

21 Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)

22 Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford): results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics

23 Comments on Generative Models for Text
(Comments applicable to both naïve Bayes and multinomial classifiers)
Simple and fast => popular in practice:
– e.g., linear in p, n, M for both training and prediction
– Training = "smoothed" frequency counts
– e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
Numerical issues:
– Typically work with log p(c_k | x), etc., to avoid numerical underflow
– Useful trick for sparse data: when computing Σ_j log p(x_j | c_k), it can be much faster to precompute Σ_j log p(x_j = 0 | c_k) once per class, and then, for only the terms that actually occur, add the correction log p(x_j = 1 | c_k) − log p(x_j = 0 | c_k)
Note: both models are "wrong", but for classification they are often sufficient.
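A small NumPy sketch of this sparsity trick for the Bernoulli model (array names are illustrative):

    # Precompute the "all terms absent" log-likelihood once per class, then
    # adjust only for the terms that actually occur in a document.
    import numpy as np

    def precompute(theta):                       # theta[k, j] = p(x_j = 1 | c_k)
        base = np.log(1.0 - theta).sum(axis=1)   # all-absent log-likelihood per class
        delta = np.log(theta) - np.log(1.0 - theta)
        return base, delta

    def log_likelihoods(present_term_ids, base, delta):
        """present_term_ids: indices of the terms occurring in the document."""
        return base + delta[:, present_term_ids].sum(axis=1)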

24 Beyond Independence
Naïve Bayes and multinomial models both assume conditional independence of words given the class. Alternative approaches try to account for higher-order dependencies.
Bayesian networks:
– p(x | c) = Π_j p(x_j | parents(x_j), c)
– Equivalent to a directed graph where edges represent direct dependencies
– Various algorithms search for a good network structure
– Useful for improving the quality of the distribution model; however, this does not always translate into better classification
Maximum entropy models:
– p(x | c) = (1/Z) Π_S f_S(x_S | c), a product of functions over subsets S of the terms
– Equivalent to an undirected graphical model
– Estimation is equivalent to a maximum entropy assumption
– Feature selection is crucial (which f terms to include); can provide highly accurate classification
– However, tends to be computationally complex to fit (estimating Z is difficult)


26 Linear Classifiers
Linear classifier (two-class case): predict class 1 if w^T x + w_0 > 0
– w is a p-dimensional vector of weights (learned from the data)
– w_0 is a threshold (also learned from the data)
Equation of the linear hyperplane (decision boundary): w^T x + w_0 = 0
Distance of a point x from the hyperplane = (w^T x + w_0) / ||w||
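A tiny NumPy sketch of the decision rule and the distance formula above (names are illustrative):

    # Two-class linear decision rule and signed distance from the boundary.
    import numpy as np

    def linear_classify(x, w, w0):
        """Return +1 if w^T x + w0 > 0, else -1."""
        return 1 if w @ x + w0 > 0 else -1

    def signed_distance(x, w, w0):
        """Signed distance of x from the hyperplane w^T x + w0 = 0."""
        return (w @ x + w0) / np.linalg.norm(w)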

27 Geometry of Linear Classifiers
Decision boundary: w^T x + w_0 = 0; the vector w gives the direction normal to the boundary.
Distance of x from the boundary is (1/||w||)(w^T x + w_0).

28 Optimal Hyperplane, Support Vectors, and Margin
Circles = support vectors = points on the convex hull that are closest to the hyperplane.
M = margin = distance of the support vectors from the hyperplane.
The goal is to find the weight vector that maximizes M.
Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s).

29 Optimal Separating Hyperplane
Solution to a constrained optimization problem:
maximize the margin M subject to y_i (w^T x_i + w_0) >= M for all i
(Here y_i ∈ {-1, +1} is the binary class label for example i.)
W.l.o.g., let ||w|| = 1/M; the problem then becomes: minimize ||w||^2 subject to y_i (w^T x_i + w_0) >= 1 for all i.
– There is a unique solution for a linearly separable data set.
– Margin M of the classifier = the distance between the separating hyperplane and the closest training samples.
– Optimal separating hyperplane <=> maximum margin.

30 Sketch of the Optimization Problem
– Define the Lagrangian as a function of the w vector and the Lagrange multipliers α_i; the solution must satisfy the standard optimality (KKT) conditions of this Lagrangian.
– Points with α_i > 0 are called support vectors, and their distance from the hyperplane = M.
– This results in a quadratic programming optimization problem.
Good news: it is a convex function of the unknowns with a unique optimum, and there is a variety of well-known algorithms for finding this optimum.
Bad news: quadratic programming in general scales as O(n^3); in practice it takes O(n^a), where a ~ 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166).

31 Timing results on text classification (from Chakrabarti, Chapter 5, 2002)

32 Support Vector Machines
– If α_i > 0, then the distance of x_i from the separating hyperplane is M.
– Support vectors = points with associated α_i > 0.
– The decision function f(x) is computed from the support vectors only => prediction can be fast.
– Non-linearly-separable case: can generalize to allow "slack" constraints.
– Non-linear SVMs: replace the original x vector with non-linear functions of x; the "kernel trick" allows solving the high-dimensional problem without working directly in the high-dimensional space.
– Computational speedups can reduce training time to near-linear, e.g., Platt's SMO algorithm, Joachims' SVMLight.
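An illustrative scikit-learn sketch (not the software used in the course): fit a linear SVM on toy data, then check that the decision function really is a sparse sum over the support vectors:

    # f(x) = sum_i alpha_i * y_i * <x_i, x> + b, summed over support vectors only.
    import numpy as np
    from sklearn import svm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy (roughly separable) labels

    clf = svm.SVC(kernel="linear", C=10.0).fit(X, y)
    print("number of support vectors:", len(clf.support_))

    x_new = rng.normal(size=5)
    # dual_coef_ holds alpha_i * y_i for the support vectors only
    f_manual = clf.dual_coef_ @ clf.support_vectors_ @ x_new + clf.intercept_
    print(np.allclose(f_manual, clf.decision_function([x_new])))   # True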

33 Classic Reuters Data Set
– 21,578 documents, labeled manually
– 9,603 training and 3,299 test articles (the "ModApte" split)
– 118 categories; an article can be in more than one category, so we learn 118 binary category distinctions
Example "interest rate" article:
2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
Common categories (#train, #test):
Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

34 Dumais et al. 1998: Reuters - Accuracy

35 Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998), using the Reuters data set

36 Comparison of accuracy across three classifiers: naive Bayes, maximum entropy, and linear SVM, using three data sets: 20 Newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.

37 Comparing Text Classifiers
Naïve Bayes models (Bernoulli or multinomial):
– Low time complexity (a single linear pass through the data)
– Generally good, but not always the best
– Widely used for spam email filtering
Linear SVMs:
– Often produce the best results in research studies
– But computationally complex to train
– Not as widely used in practice as naïve Bayes
Others:
– Logistic regression, decision trees: less widely used, but can be useful

38 Learning with Labeled and Unlabeled Documents
In practice, obtaining labels for documents is time-consuming, expensive, and error prone.
– Typical application: a small number of labeled docs and a very large number of unlabeled docs
Idea:
– Build a probabilistic model on the labeled docs
– Classify the unlabeled docs, getting p(class | doc) for each class and doc; this is equivalent to the E-step in the EM algorithm
– Now relearn the probabilistic model using the new "soft labels"; this is equivalent to the M-step in the EM algorithm
– Continue to iterate until convergence (e.g., until the class probabilities no longer change significantly)
This EM approach shows that unlabeled data can improve classification performance compared to using labeled data alone (a sketch of the loop follows below).
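One possible sketch of this EM loop in Python, assuming scikit-learn and SciPy; soft labels are implemented by replicating each unlabeled document once per class with a sample weight equal to p(class | doc). This is an illustration of the idea, not the exact procedure of Nigam et al.:

    # EM with a multinomial naive Bayes model on labeled + unlabeled documents.
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB

    def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
        """X_lab, X_unlab: sparse term-count matrices (e.g., from CountVectorizer)."""
        classes = np.unique(y_lab)
        clf = MultinomialNB().fit(X_lab, y_lab)          # start from labeled docs
        for _ in range(n_iter):
            probs = clf.predict_proba(X_unlab)           # E-step: soft labels
            X_all, y_all, w_all = [X_lab], [y_lab], [np.ones(X_lab.shape[0])]
            for k, c in enumerate(classes):
                X_all.append(X_unlab)
                y_all.append(np.full(X_unlab.shape[0], c))
                w_all.append(probs[:, k])                # weight = p(class | doc)
            clf = MultinomialNB().fit(vstack(X_all),     # M-step: refit with
                                      np.concatenate(y_all),
                                      sample_weight=np.concatenate(w_all))
        return clf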

39 Learning with Labeled and Unlabeled Data. Graph from "Semi-supervised text classification using EM", Nigam, McCallum, and Mitchell, 2006.

40 Other Issues in Text Classification
Real-time constraints:
– Being able to update classifiers as new data arrive
– Being able to make predictions very quickly in real time
Document length:
– Varying document length can be a problem for some classifiers
– The multinomial model tends to handle this better than the Bernoulli model, for example
Multiple labels and multiple classes:
– Text documents can have more than one label
– Basic SVMs, for example, only handle two-class (binary) problems directly, so multi-label problems are usually handled with one binary classifier per category (see the sketch after this slide)
Feature selection:
– Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results
Linked documents:
– Can view Web documents as nodes in a directed graph
– Classification can then leverage the link structure; heuristic: class labels of linked pages are more likely to be the same
– The optimal solution is to classify all documents jointly rather than individually, but the resulting "global classification" problem is typically computationally complex
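For the multi-label point above, an illustrative scikit-learn sketch of the usual workaround (one binary classifier per category); the documents and label sets are made up:

    # Multi-label text classification via one binary linear SVM per category.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs   = ["wheat and corn exports rose", "bank raises interest rate"]
    labels = [["grain", "wheat", "corn"], ["interest", "money-fx"]]   # toy labels

    Y = MultiLabelBinarizer().fit_transform(labels)     # one 0/1 column per category
    model = make_pipeline(CountVectorizer(),
                          OneVsRestClassifier(LinearSVC()))
    model.fit(docs, Y)

    print(model.predict(["corn harvest rose"]))   # binary indicator per category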

41 Further Reading on Text Classification
Web-related text mining in general:
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See chapter 5 for a discussion of text classification.
General references on text and language modeling:
– C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
– D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
SVMs for text classification:
– T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.

