1 Baseline Evaluation: An Empirical Study of the Performance of Machine Learning Algorithms in Short Snippet Sentiment Analysis
Farag Saad, firstname.lastname@example.org
i-KNOW 2014, Graz, Austria, 18-09-2014
2 Outline
- Introduction
- Sentiment Analysis
- Sentiment Classification
  - Feature extraction
  - Feature weighting
  - Classification algorithms (Binarized Multinomial Naïve Bayes)
- Training data
- Evaluation
  - Classification performance comparison between various classifiers
  - Is feature selection useful for classification?
  - Does up-weighting adjectives improve the classifiers' performance?
- Conclusion
3 Introduction
- The emergence of Web 2.0 has made the Internet more user-interactive: many users generate content daily.
- This content carries rich opinions, which are important. However, user-generated content:
  - lacks organization
  - has an improper structure
  - often consists of long reviews
- Therefore, automatically mining user-generated content is a very difficult but very important task.
4 Sentiment Analysis
- Sentiment analysis, also known as opinion mining, attempts to identify the opinion or sentiment that a person may hold towards an object (Bing Liu, 2010).
- Our task is, first, to determine whether a piece of text is objective or subjective, e.g.:
  - "Yesterday I bought a Nikon camera" is an objective text.
  - "The video capability is truly amazing" is a subjective text.
- Second, to detect the text's polarity: positive or negative sentiment.
5 Sentiment Analysis
- A piece of text can also fall between positive and negative, e.g., "With the exception of burst shooting, this camera's performance is excellent."
- Sentiment may be expressed explicitly or implicitly, e.g., "poor picture quality" expresses explicit sentiment, while "The laptop battery lasted for 3 hours" expresses implicit sentiment.
- Sentiment is domain dependent, e.g., "gangster kills a guy in a fight" bears a negative sentiment, while "fight illness with healthy food" bears a positive sentiment.
6 Sentiment Classification
Sentiment classification consists of three main steps:
- Feature extraction
  - Unigram features
  - Data preprocessing: remove stop words but keep the rest
- Feature reduction (select only the most useful features)
- Feature weighting
  - Term frequency (TF)
  - Term frequency & inverse document frequency (TF-IDF)
  - Term presence (TP)
  - Part of speech (only adjectives are selected)
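The TF, TP and TF-IDF weighting schemes listed above can be illustrated with a minimal sketch. It assumes lower-case whitespace tokenization, a tiny hypothetical stop-word list `STOP_WORDS`, and the common idf = log(N/df) variant; the slides do not state which tokenizer, stop list or TF-IDF variant was actually used, and the part-of-speech (adjective) weighting is not covered here.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "this", "i"}


def unigrams(sentence):
    """Lower-case whitespace tokenization with stop-word removal (an assumption)."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


def weight(docs, scheme="tf"):
    """Return one {term: weight} dict per document for 'tf', 'tp' or 'tfidf'."""
    tokenized = [unigrams(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        if scheme == "tf":
            vec = dict(tf)                                  # raw term counts
        elif scheme == "tp":
            vec = {term: 1 for term in tf}                  # presence only
        elif scheme == "tfidf":
            vec = {term: c * math.log(n_docs / df[term]) for term, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors


docs = ["the video is truly truly amazing", "poor picture quality"]
print(weight(docs, "tf")[0])  # {'video': 1, 'truly': 2, 'amazing': 1}
print(weight(docs, "tp")[0])  # {'video': 1, 'truly': 1, 'amazing': 1}
```

Term presence simply clips every count to 1, which is the same idea the binarized Multinomial NB applies inside its probability estimates.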
7 Sentiment Classification Algorithms
- Naïve Bayes (NB) and its variations, Support Vector Machine (SVM), and J48
- We focus on describing the best-performing classifier, the Binarized Multinomial NB (BMNB).
8 The Binarized Multinomial NB
Given a set of unlabeled sentences, where $t_i$ denotes the $i$-th test sentence and $w_t$ denotes a word within it, and given a set $D$ of manually annotated training sentences with their sentiment polarities, where $d_i$ denotes the $i$-th labeled training sentence and $c \in \{+, -\}$ refers to its polarity:

$$P(c_j \mid t_i) = \frac{P(c_j)\,P(t_i \mid c_j)}{P(t_i)} \qquad (1)$$

$$P(t_i \mid c_j) = \Big(\sum_{t=1}^{n} x_t\Big)!\,\prod_{t=1}^{|V|} \frac{P(w_t \mid c_j)^{x_t}}{x_t!} \qquad (2)$$

where $x_t$ is the number of times $w_t$ occurs in $t_i$ and $|V|$ is the vocabulary size.
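Equations (1) and (2) can be evaluated directly on a toy example. This is a minimal sketch assuming hypothetical word probabilities `P(w_t | c_j)` for two classes; it computes $P(t_i)$ by summing the joint probability over the classes. Words with $x_t = 0$ contribute $P^0/0! = 1$ to the product, so the loop only visits words present in the sentence.

```python
import math
from collections import Counter


def multinomial_likelihood(sentence_words, word_probs):
    """Eq. (2): P(t_i | c_j) under a multinomial model.

    sentence_words: tokens of t_i; word_probs: {w_t: P(w_t | c_j)}.
    """
    counts = Counter(sentence_words)                     # x_t for each word w_t
    likelihood = math.factorial(sum(counts.values()))    # (sum_t x_t)!
    for word, x_t in counts.items():
        likelihood *= word_probs[word] ** x_t / math.factorial(x_t)
    return likelihood


def posterior(sentence_words, priors, word_probs_by_class):
    """Eq. (1): P(c_j | t_i), with P(t_i) obtained by summing over the classes."""
    joint = {c: priors[c] * multinomial_likelihood(sentence_words, word_probs_by_class[c])
             for c in priors}
    evidence = sum(joint.values())                       # P(t_i)
    return {c: p / evidence for c, p in joint.items()}


# Hypothetical priors and word probabilities for illustration.
priors = {"+": 0.5, "-": 0.5}
probs = {"+": {"amazing": 0.6, "poor": 0.1, "camera": 0.3},
         "-": {"amazing": 0.1, "poor": 0.6, "camera": 0.3}}
post = posterior(["amazing", "camera"], priors, probs)
print({c: round(p, 3) for c, p in post.items()})  # {'+': 0.857, '-': 0.143}
```

Because the factorial term does not depend on the class, it cancels in the posterior; classifiers therefore usually compare only $P(c_j)\prod_t P(w_t \mid c_j)^{x_t}$, typically in log space.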
9 The Binarized Multinomial NB
The probability $P(w_t \mid c_j)$ is computed from the set of training documents $D$ as follows:
1. Eliminate duplicates in each document in $D$: for each word $w_t$ in a document $d_j$, only one instance is kept.
2. Concatenate the documents of class $c_j$ resulting from step 1 into a single document $d_k$.
3. Count the number of occurrences of each $w_t$ in $d_k$.
4. Apply Laplace smoothing ($\mu = 1$) to avoid zero estimates:

$$P(w_t \mid c_j) = \frac{f(w_t, c_j) + \mu}{f(c_j) + |V|} \qquad (3)$$

where $f(w_t, c_j)$ is the count of $w_t$ in $d_k$ and $f(c_j)$ is the total number of word occurrences in $d_k$.
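The four training steps and Eq. (3) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the author's code: sentences are already tokenized into word lists, test sentences are binarized as well, words unseen in training are skipped at prediction time, and scores are compared in log space. The denominator uses `mu * |V|`, which equals Eq. (3) for Laplace smoothing (`mu = 1`).

```python
import math
from collections import Counter, defaultdict


def train_bmnb(docs, labels, mu=1.0):
    """Binarized multinomial NB following steps 1-4 and Eq. (3)."""
    class_counts = defaultdict(Counter)     # f(w_t, c_j): counts in the per-class document d_k
    vocab = set()
    for words, label in zip(docs, labels):
        unique = set(words)                 # step 1: keep one instance of each word
        class_counts[label].update(unique)  # steps 2-3: concatenate per class and count
        vocab |= unique
    priors = {c: math.log(labels.count(c) / len(docs)) for c in class_counts}
    log_probs = {}
    for c, counts in class_counts.items():
        total = sum(counts.values())        # f(c_j)
        # Step 4, Eq. (3) with mu = 1: (f(w_t, c_j) + mu) / (f(c_j) + |V|)
        log_probs[c] = {w: math.log((counts[w] + mu) / (total + mu * len(vocab)))
                        for w in vocab}
    return priors, log_probs


def classify(words, priors, log_probs):
    """Return the class maximizing log P(c_j) + sum of log P(w_t | c_j)."""
    unique = set(words)                     # assumption: binarize test sentences too
    scores = {c: priors[c] + sum(log_probs[c][w] for w in unique if w in log_probs[c])
              for c in priors}
    return max(scores, key=scores.get)


# Hypothetical toy training data.
docs = [["great", "great", "camera"], ["poor", "battery"],
        ["great", "battery"], ["poor", "poor", "quality"]]
labels = ["+", "-", "+", "-"]
priors, log_probs = train_bmnb(docs, labels)
print(classify(["great", "quality"], priors, log_probs))  # +
```

The repeated "great" in the first training sentence counts only once, which is exactly what distinguishes the binarized variant from the standard multinomial estimate.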
10 The Training/Test Data
- Data collection (Blitzer et al., 2007)
- The dataset covers 12 domains and is binary annotated (+/-).
- It consists of reviews of different product types, such as books, apparel, health and personal care, magazines, etc.
- The selected products contain an equal number of sentences from positive and negative reviews: each product contains 1,000 positive and 1,000 negative sentences.
- The total number of annotated sentences across all products is 24,000.
J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. ACL 2007.
11 Evaluation
- Previous studies are inconsistent regarding the relative performance of the algorithms.
- We designed a binary classification task in which each sentence is a test instance and the target class attribute is either positive or negative.
- We train the selected classifier models to predict the class of unlabeled test instances.
12 Evaluation
Three main experiments:
- A comparison of the classifiers' performance.
- The effect of feature selection methods (mainly Information Gain).
- Does up-weighting adjectives lead to a classification improvement?
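Information Gain, the feature selection method named above, can be sketched for term-presence features. This minimal version assumes each sentence is a set of words and measures how much knowing whether a term is present reduces the class entropy; the deck does not say how IG was computed or how many features were kept, so `select_top_k` and its cut-off `k` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values()) if n else 0.0


def information_gain(term, docs, labels):
    """IG(term) = H(C) - [P(present) H(C | present) + P(absent) H(C | absent)]."""
    with_term = [y for words, y in zip(docs, labels) if term in words]
    without_term = [y for words, y in zip(docs, labels) if term not in words]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    vocab = {w for words in docs for w in words}
    return sorted(vocab, key=lambda w: (-information_gain(w, docs, labels), w))[:k]


# Hypothetical toy data: each sentence is a set of unigrams.
docs = [{"great", "camera"}, {"poor", "battery"}, {"great", "battery"}, {"poor", "quality"}]
labels = ["+", "-", "+", "-"]
print(select_top_k(docs, labels, 2))  # ['great', 'poor']
```

A term such as "battery", which occurs equally often in both classes, gets an IG of zero and is the first to be discarded.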
13 Evaluation / Cross-Validation
[Figure: 10-fold cross-validation, Exp. 1 to Exp. 10 over folds 1 to 10, training vs. test]
For each data set:
- Break up the data into 10 folds.
- For each fold:
  - Select the current fold as the test set.
  - Train the classifier on the rest (9 folds).
- Compute the average classification performance over the 10 runs.
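The cross-validation procedure above can be sketched as a generic loop that reports the mean F1-measure over the folds. Several details are assumptions not given in the slides: folds are formed by an unshuffled interleaved split, F1 is computed for the positive class only, and `toy_train`/`toy_predict` are a hypothetical stand-in learner used only to make the example runnable.

```python
def f1_score(gold, predicted, positive="+"):
    """F1 of the positive class (the class/averaging used in the slides is an assumption)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validate(instances, labels, train, predict, k=10):
    """k-fold CV: each fold is the test set once; return the mean F1 over the k runs."""
    # Interleaved (unshuffled) split into k folds, an assumption for illustration.
    folds = [list(range(i, len(instances), k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(instances)) if i not in held_out]
        model = train([instances[i] for i in train_idx], [labels[i] for i in train_idx])
        predicted = [predict(model, instances[i]) for i in test_idx]
        scores.append(f1_score([labels[i] for i in test_idx], predicted))
    return sum(scores) / k


# Hypothetical toy learner: words seen only in positive training sentences.
def toy_train(instances, labels):
    pos = {w for x, y in zip(instances, labels) if y == "+" for w in x}
    neg = {w for x, y in zip(instances, labels) if y == "-" for w in x}
    return pos - neg


def toy_predict(model, words):
    return "+" if any(w in model for w in words) else "-"


data = [["good", f"p{i}"] for i in range(10)] + [["bad", f"n{i}"] for i in range(10)]
gold = ["+"] * 10 + ["-"] * 10
print(cross_validate(data, gold, toy_train, toy_predict))  # 1.0
```

Any of the compared classifiers could be plugged in through the `train`/`predict` callables; in practice a stratified, shuffled split is a common choice to keep each fold balanced.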
14 Evaluation
Corpus: NB (TF, TP, TF-IDF) | MNB/BMNB (TF, TP, TF-IDF) | SVM (TF, TP, TF-IDF) | J48 (TF, TP, TF-IDF)
Apparel: 0.747 0.781 0.685 | 0.828 0.827 0.811 | 0.699 0.762 0.815 | 0.736 0.737 0.734
Baby: 0.706 0.756 0.700 | 0.797 0.803 0.792 | 0.655 0.503 0.790 | 0.693 0.701 0.686
Books (incomplete): 0.658 0.697 0.767 0.769 0.763 0.588 0.555 0.768 0.636 0.638 0.656
Camera & Photo (incomplete): 0.698 0.728 0.798 0.665 0.820 0.735 0.727
DVD: 0.682 0.726 0.661 | 0.773 0.782 0.780 | 0.597 0.547 0.786 | 0.672 0.676 0.673
Electronics (incomplete): 0.708 0.680 0.777 0.779 0.614 0.678 0.670 0.671
Health & personal care (incomplete): 0.707 0.72 0.810 0.800 0.640 0.642 0.808 0.695
Kitchen & housewares (incomplete): 0.681 0.739 0.691 0.794 0.690 0.793 0.724 0.717
Magazines (incomplete): 0.722 0.821 0.822 0.823 0.621 0.562 0.826 0.778
Software (incomplete): 0.630 0.709 0.731 0.806 0.583 0.719 0.713 0.712
Sport & outdoors (incomplete): 0.758 0.725 0.801 0.789 0.743
Video (incomplete): 0.703 0.745 0.688 0.765 0.647 0.714 0.729
Table 1: The polarity classification results using F1-measure for different classifiers applied to the 12 test domains (the best-performing method for each domain is in bold and underlined).
15 Evaluation
Corpus: NB (TF, TP, TF-IDF) | MNB/BMNB (TF, TP, TF-IDF) | SVM (TF, TP, TF-IDF) | J48 (TF, TP, TF-IDF)
Apparel (incomplete): 0.780 0.789 0.746 0.848 0.852 0.839 0.785 0.792 0.826 0.741 0.743
Baby: 0.760 0.771 0.735 | 0.819 0.818 0.808 | 0.769 0.764 0.812 | 0.716 0.718 0.705
Books: 0.703 0.709 0.701 | 0.766 0.777 0.763 | 0.693 0.695 0.774 | 0.667 0.664 0.674
Camera & Photo (incomplete): 0.744 0.742 0.817 0.828 0.825 0.748 0.739 0.719
DVD (incomplete): 0.751 0.715 0.783 0.805 0.702 0.725 0.691 0.685 0.704
Electronics (incomplete): 0.720 0.707 0.813 0.806 0.738 0.683
Health & personal care (incomplete): 0.757 0.827 0.821 0.824 0.807 0.69
Kitchen & housewares (incomplete): 0.736 0.759 0.833 0.829 0.728 0.734
Magazines (incomplete): 0.761 0.843 0.849 0.858 0.755 0.753
Software (incomplete): 0.670 0.724 0.747 0.815 0.669 0.714 0.713
Sport & outdoors (incomplete): 0.778 0.800 0.820 0.781 0.809 0.723 0.706
Video (incomplete): 0.749 0.776 0.801 0.721 0.727 0.730
Table 2: The impact of the feature selection method Information Gain (IG) on the classifiers' performance using F1-measure (the best-performing method for each domain is in bold and underlined).
16 Evaluation
Weighting method: Improvement
TF: 6.10%
TP: 5.62%
TF&IDF: 2.63%
Overall average: 4.7%
17 Conclusion
- We conducted a series of comparative experiments to compare the performance of various machine learning classifiers on the sentiment analysis task.
- We studied the impact of feature selection methods on classification performance.
- The best classification results were obtained using the BMNB classifier.
- Using feature selection led to a significant increase in all classifiers' performance across the different feature weighting methods.
18 Conclusion
- Based on the experiments carried out in this paper, our findings suggest that the BMNB classifier performs best in short snippet sentiment analysis.
- We further support the recent finding of Wang and Manning (2012*) that, in short snippet sentiment analysis, MNB performs better than other classifiers, in particular better than the SVM classifier.
* S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics - Volume 2, ACL '12, pages 90-94, 2012.