A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito.


1 A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito

2 Introduction: Text Categorization
- Many digital texts are available: e-mail, online news, blogs, ...
- The need for automatic text categorization, without human labor, is increasing
  - Saves both time and cost

3 Introduction: Text Categorization
- Applications
  - Spam filtering
  - Topic categorization

4 Introduction: Machine Learning
- Build categorization rules automatically from features of the text
- Types of machine learning (ML)
  - Supervised learning -> labeling
  - Unsupervised learning -> clustering

5 Introduction: Flow of ML
1. Prepare labeled training texts and extract features
2. Learn a classifier
3. Categorize new texts

6 Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

7 Number of labels
- Binary: true or false (e.g. spam or not); other label settings can be reduced to this case
- Multi-label: many labels, but each text receives exactly one
- Overlapping: one text may receive several labels

8 Types of labels
- Topic categorization
  - The basic task; compares individual words
- Author categorization
- Sentiment categorization
  - E.g. reviews of products; needs more linguistic information

9 Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

10 Feature of Text
- How do we express the features of a text? "Bag of Words"
  - Ignores word order and structure
  - E.g. "I like this car." vs. "I don't like this car.": the two bags differ in only one term, so "Bag of Words" does not capture the negation well
- Notation: d = document (text), t = term (word)
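
The bag-of-words idea above can be made concrete with a short sketch (plain Python, illustrative only; the whitespace tokenization is deliberately naive):

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, ignoring word order and structure."""
    return Counter(text.lower().split())

d1 = bag_of_words("I like this car")
d2 = bag_of_words("I don't like this car")
# The two bags share every term except "don't"; the structural
# difference (negation) is nearly invisible to this representation.
print(sorted(set(d2) - set(d1)))
```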

11 Preprocessing
- Remove stop words: "the", "a", "for", ...
- Stemming: relational -> relate, truly -> true
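
A minimal preprocessing sketch, assuming a toy stop-word list and a few hand-picked suffix rules that stand in for a real stemmer such as Porter's:

```python
STOP_WORDS = {"the", "a", "for", "an", "of", "to"}   # tiny illustrative list
SUFFIX_RULES = [("ational", "ate"), ("uly", "ue"), ("ly", "")]  # toy rules

def stem(word):
    """Crude suffix stripping (a stand-in for a real stemmer)."""
    for suffix, repl in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

def preprocess(text):
    """Lowercase, drop stop words, stem the rest."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("the relational model truly works"))
# -> ['relate', 'model', 'true', 'works']
```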

12 Term Weighting
- Term frequency (tf): number of occurrences of a term in a document
  - Terms frequent in a document seem important for categorization
- tf-idf: terms appearing in many documents are down-weighted, since they are not useful for discriminating categories
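
A minimal tf-idf computation, using the common tf * log(N/df) weighting (the transcript does not fix an exact variant, so this is one standard choice):

```python
import math

def tf_idf(docs):
    """Weight each term by tf * log(N / df): high when frequent in this
    document but rare across the collection."""
    n = len(docs)
    df = {}                              # document frequency of each term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for t in doc:
            w[t] = w.get(t, 0) + 1       # raw term frequency
        for t in w:
            w[t] *= math.log(n / df[t])  # idf down-weights common terms
        weights.append(w)
    return weights

docs = [["car", "car", "fast"], ["car", "slow"], ["news", "today"]]
w = tf_idf(docs)
# "car" appears in 2 of 3 documents, so its idf is small; "fast"
# appears in only one, so it outscores "car" despite tf = 1.
```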

13 Sentiment Weighting
- For sentiment classification, weight each word as positive or negative
- Requires constructing a sentiment dictionary
- WordNet [04 Kamps et al.]
  - Synonym database; use a word's graph distance from "good" and from "bad"
  - E.g. d(good, happy) = 2, d(bad, happy) = 4
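
The distance-from-"good"/"bad" idea can be sketched as breadth-first search over a synonym graph. The graph below is a made-up toy chosen to reproduce the slide's example distances; Kamps et al. used actual WordNet synonymy links:

```python
from collections import deque

# Hypothetical toy synonym graph (undirected adjacency lists).
GRAPH = {
    "good":     ["fine"],
    "fine":     ["good", "happy", "mediocre"],
    "happy":    ["fine"],
    "mediocre": ["fine", "poor"],
    "poor":     ["mediocre", "bad"],
    "bad":      ["poor"],
}

def distance(graph, src, dst):
    """Shortest-path length between two words (BFS over synonym links)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        word, d = queue.popleft()
        if word == dst:
            return d
        for nxt in graph.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def sentiment(graph, word):
    """Positive when the word sits closer to 'good' than to 'bad'."""
    return distance(graph, word, "bad") - distance(graph, word, "good")

# Matches the slide's example: d(good, happy) = 2, d(bad, happy) = 4,
# so "happy" is weighted positive.
```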

14 Dimension Reduction
- The feature matrix has size (#terms) x (#documents), where #terms is roughly the size of the dictionary
  - High computational cost
  - Risk of overfitting: best on training data != best on real data
- Choose effective features to improve both accuracy and computational cost

15 Dimension Reduction
- df-threshold: terms appearing in very few documents (e.g. only one) are not important
- Score each term against each category: if t and cj are independent, the score is zero
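
The transcript does not preserve the score formula itself, but the chi-square statistic is one standard choice with exactly the stated property: it is zero when term t and category cj are independent. A sketch from a 2x2 contingency table:

```python
def chi_square(a, b, c, d):
    """chi^2 score of term t against category cj, from a 2x2 table:
       a = docs in cj containing t      b = docs outside cj containing t
       c = docs in cj without t         d = docs outside cj without t
    Zero exactly when t and cj are independent."""
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return num / den if den else 0.0

# Independent: term appears in half of each class -> score 0.
print(chi_square(5, 5, 5, 5))   # 0.0
# Dependent: term appears almost only inside cj -> large score.
print(chi_square(9, 1, 1, 9))
```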

16 Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

17 Learning Algorithm
- Many (almost all?) ML algorithms have been used for text categorization
  - Simple approaches: Naive Bayes, k-Nearest Neighbor
  - High-performance approaches: Boosting, Support Vector Machines
  - Hierarchical learning

18 Naive Bayes
- Bayes' rule: P(c|d) is proportional to P(c) P(d|c)
- P(d|c) is hard to estimate directly
- Assumption: each term occurs independently, so P(d|c) = product over terms ti in d of P(ti|c)
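
A minimal multinomial Naive Bayes sketch with Laplace smoothing (the smoothing is an added practical detail, not from the slide), working in log space to avoid underflow:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class counts,
    per-class term counts, and the vocabulary."""
    prior, term_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        prior[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return prior, term_counts, vocab

def classify_nb(doc, prior, term_counts, vocab):
    """argmax_c log P(c) + sum_t log P(t|c), assuming independent terms."""
    n = sum(prior.values())
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(term_counts[c].values())
        score = math.log(prior[c] / n)
        for t in doc:
            # Laplace smoothing so unseen terms do not zero the product.
            p = (term_counts[c][t] + 1) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

train = [(["cheap", "pills", "buy"], "spam"),
         (["meeting", "schedule", "today"], "ham"),
         (["buy", "cheap", "now"], "spam"),
         (["project", "meeting", "notes"], "ham")]
model = train_nb(train)
print(classify_nb(["cheap", "pills"], *model))   # -> spam
```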

19 k-Nearest Neighbor
- Define a "distance" between two texts
  - E.g. Sim(d1, d2) = d1 . d2 / (|d1| |d2|) = cos(theta)
- Find the k most similar training texts and categorize by majority vote
- As the training data grows, memory and search costs grow with it
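
The cosine-similarity kNN described above, as a sketch; note that every query scans the whole training set, which is the cost issue the slide raises:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1 . d2 / (|d1| |d2|), on term-frequency dicts."""
    dot = sum(d1[t] * d2.get(t, 0) for t in d1)
    norm = (math.sqrt(sum(v * v for v in d1.values()))
            * math.sqrt(sum(v * v for v in d2.values())))
    return dot / norm if norm else 0.0

def knn(query, training, k=3):
    """Majority vote among the k most similar training documents.
    Each query scans all training data, so memory and search cost
    grow with the training set."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(Counter("buy cheap pills".split()), "spam"),
         (Counter("cheap pills now".split()), "spam"),
         (Counter("meeting agenda today".split()), "ham"),
         (Counter("meeting notes".split()), "ham")]
print(knn(Counter("cheap pills today".split()), train, k=3))   # -> spam
```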

20 Boosting
- BoosTexter [00 Schapire et al.]
- AdaBoost: build many "weak learners" with different parameters
  - The k-th weak learner looks at the performance of learners 1..k-1 and tries to classify correctly the training data on which they scored worst
- BoosTexter uses decision stumps as its weak learners
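
A compact AdaBoost sketch with decision stumps on one-dimensional data (BoosTexter itself handles text features and multiple labels; this only illustrates the reweighting idea):

```python
import math

def best_stump(xs, ys, w):
    """Pick the threshold/polarity decision stump with the lowest
    weighted error on the current example weights w."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            pred = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    """Each round reweights the training data so the next weak learner
    focuses on the examples the previous ones got wrong."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, w)
        err = max(err, 1e-10)                 # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Raise weights of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, -1, -1, -1, 1, 1]   # no single stump can fit this pattern
model = adaboost(xs, ys, rounds=5)
print([predict(model, x) for x in xs])
```

No single stump separates the +/-/+ pattern, but the weighted vote of five stumps does, which is the point of the slide's "weak learner" construction.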

21 Simple example of Boosting
[Figure: three rounds of boosting on a small set of + and - points; each round's weak learner corrects examples the previous ones misclassified]

22 Support Vector Machine
- Text categorization with SVM [98 Joachims]
- Maximize the margin between the two classes
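
The "maximize margin" objective can be written out explicitly; this is the standard hard-margin primal form, not copied from the slide (whose formula survives only as an image):

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The margin between the two supporting hyperplanes is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.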

23 Text Categorization with SVM
- SVM works well for text categorization
  - Robust in high dimensions; robust against overfitting
- Most text categorization problems are linearly separable
  - All of OHSUMED (MEDLINE collection)
  - Most of Reuters-21578 (news collection)

24 Comparison of these methods
- [02 Sebastiani], on two versions of Reuters-21578 differing in the number of categories

  Method       Ver.1 (90)   Ver.2 (10)
  k-NN         .860         .823
  Naive Bayes  .795         .815
  Boosting     .878         -
  SVM          .870         .920

25 Hierarchical Learning
- TreeBoost [06 Esuli et al.]
  - A boosting algorithm for hierarchical labels
  - Training data: the label hierarchy plus texts with labels
  - Applies AdaBoost recursively down the hierarchy
  - A better classifier than "flat" AdaBoost: accuracy up 2-3%; both training and categorization time down
- Hierarchical SVM [04 Cai et al.]

26 TreeBoost
[Figure: a label tree with root; children L1-L4; L1 has children L11 and L12; L4 has children L41-L43; L42 has children L421 and L422]

27 Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

28 Conclusion
- Overview of text categorization with machine learning
  - Features of text
  - Learning algorithms
- Future work
  - Natural language processing with machine learning, especially for Japanese
  - Computational cost

