Opinion Mining and Topic Categorization with Novel Term Weighting

Roman Sergienko, Ph.D. student, Ulm University, Germany
Tatiana Gasanova, Ph.D. student, Ulm University, Germany
Shaknaz Akhmedova, Ph.D. student, Siberian State Aerospace University, Krasnoyarsk, Russia
Contents

- Motivation
- Databases
- Text preprocessing methods
- The novel term weighting method
- Feature selection
- Classification algorithms
- Results of numerical experiments
- Conclusions
Motivation

The goal of this work is to evaluate the competitiveness of the novel term weighting method in comparison with standard techniques for opinion mining and topic categorization. The criteria are:
1) Macro F-measure on the test set
2) Computational time
Databases: DEFT'07 and DEFT'08

DEFT'07:
  Corpus    Train size  Test size  Vocabulary  Classes
  Books     2074        1386       52507       0: negative, 1: neutral, 2: positive
  Games     2537        1694       63144       0: negative, 1: neutral, 2: positive
  Debates   17299       11533      59615       0: against, 1: for

DEFT'08:
  Corpus    Train size  Test size  Vocabulary  Classes
  T1        15223       10596      202979      0: Sport, 1: Economy, 2: Art, 3: Television
  T2        23550       15693      262400      0: France, 1: International, 2: Literature, 3: Science, 4: Society
The existing text preprocessing methods

- Binary preprocessing
- TF-IDF (Salton and Buckley, 1988)
- Confident Weights, ConfWeight (Soucy and Mineau, 2005)
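For reference, the TF-IDF baseline listed above can be sketched as follows. This is a minimal sketch assuming whitespace-tokenised documents and the plain tf * log(N/df) variant; Salton and Buckley describe several alternative tf and idf formulations.

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf for a list of whitespace-tokenised documents:
    tf = raw term count in the document, idf = log(N / df)."""
    N = len(docs)
    tokenised = [doc.split() for doc in docs]
    df = Counter()                      # df[w]: number of documents containing w
    for tokens in tokenised:
        df.update(set(tokens))
    return [
        {w: c * math.log(N / df[w]) for w, c in Counter(tokens).items()}
        for tokens in tokenised
    ]

vectors = tf_idf(["a b a", "b c"])
# "b" appears in every document, so its idf (and weight) is zero
```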
The novel term weighting method

L – the number of classes;
n_i – the number of instances of the i-th class;
N_ji – the number of occurrences of the j-th word in all instances of the i-th class;
T_ji = N_ji / n_i – the relative frequency of the j-th word in the i-th class;
R_j = max_i T_ji – the weight assigned to the j-th word;
S_j = argmax_i T_ji – the class assigned to the j-th word.
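The definitions above can be sketched directly in code. This is a minimal illustration assuming whitespace-tokenised documents and integer class labels; the function name and toy data are not from the slides.

```python
from collections import Counter, defaultdict

def novel_term_weights(docs, labels):
    """For each word j, compute T_ji = N_ji / n_i per class i,
    then return (R_j, S_j) = (max_i T_ji, argmax_i T_ji)."""
    classes = sorted(set(labels))
    n = Counter(labels)                    # n_i: instances per class
    N = defaultdict(Counter)               # N[j][i]: occurrences of word j in class i
    for doc, y in zip(docs, labels):
        for word in doc.split():
            N[word][y] += 1
    weights = {}
    for word, counts in N.items():
        T = {i: counts[i] / n[i] for i in classes}
        S_j = max(T, key=T.get)            # class with the maximum relative frequency
        weights[word] = (T[S_j], S_j)      # (R_j, S_j)
    return weights

docs = ["good great good", "bad awful", "good bad"]
labels = [2, 0, 1]
w = novel_term_weights(docs, labels)
# "good" occurs twice in the single class-2 document: T_good,2 = 2/1 = 2.0
```

Note that, unlike ConfWeight, this requires only one pass over the training counts, which is why its cost is comparable to TF-IDF.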
Feature selection

1) Calculate the relative frequency of each word in each class.
2) For each word, choose the class with the maximum relative frequency.
3) For each utterance to be classified, calculate the sum of the weights of the words belonging to each class.
4) The number of attributes equals the number of classes.
The best values of F-measure

  Problem   F-measure  Best known value  Term weighting method  Classification algorithm
  Books     0.619      0.603             Novel TW               SVM
  Games     0.720      0.784             ConfWeight             k-NN
  Debates   0.714      0.720             ConfWeight             SVM
  T1        0.856      0.894             Novel TW               SVM
  T2        0.851      0.880             Novel TW               SVM
Comparison of ConfWeight and the novel term weighting

  Problem   ConfWeight  Novel TW  Difference
  Books     0.588       0.619     +0.031
  Games     0.720       0.712     -0.008
  Debates   0.714       0.700     -0.014
  T1        0.855       0.856     +0.001
  T2        0.820       0.851     +0.031
Conclusions

The novel term weighting method yields classification quality similar to or better than the ConfWeight method, while requiring only as much computation time as TF-IDF.