Learning from Multi-topic Web Documents for Contextual Advertisement (KDD 2008)

Presentation transcript:

Learning from Multi-topic Web Documents for Contextual Advertisement KDD 2008

Outline (2015/10/22)
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection & Opinion Mining
4. Experiments
5. Conclusion

Introduction (1/4)
Contextual advertisement:
- A popular advertising paradigm in which web page owners allow ad platforms to place ads on their pages that match the content of their sites.
Problems:
- A huge variety of content can appear on a single web page (e.g., news sites, blogs).
- Advertisers do not want their ads shown on pages with sensitive content such as violence or pornography.
- They may also not wish to advertise on pages that contain negative opinions about their products.

Introduction (2/4)
Objective:
- Not only to tell whether a document contains some targeted content, but also to label the parts of the document where that content is present.
Sub-document classification:
- Classifiers are trained on entire pages using page-level labels and tested on individual blocks.

Introduction (3/4)
Challenges:
- Pages can contain unwanted parts, e.g., navigation panes, text advertisements, etc.
- Pages may also contain information on multiple topics.
- Collecting large amounts of broad-coverage single-topic training data, and pre-cleaning and hand-labeling the blocks, is difficult and expensive.

Introduction (4/4)
- This paper uses a multiple-instance learning (MIL) technique, MILBoost, to improve the performance of traditional methods (Naive Bayes and decision trees).
- Sub-document classifiers are trained using only page-level labels.
- The problems of sensitive content detection and opinion/sentiment classification for advertising can be cast as 2-class and multi-class classification, respectively.
- In sentiment detection, a Naive-Bayes-based MILBoost detector performs as well as the best block detector trained with block-level labels.


Sensitive Content Detection (1/3)
- Sensitive content categories: e.g., crime, war, disasters, terrorism, pornography, etc.
- The various sensitive categories are grouped into one class labeled "sensitive".
- As long as a web page contains such content blocks, it is marked as sensitive and ad display is turned off.
- The available training web pages are labeled at the page level: the labels only tell whether a page contains sensitive content somewhere in it or not.

Sensitive Content Detection (2/3)

Sensitive Content Detection (3/3)
- If the entire page is used, traditional classification methods run the risk of learning everything on the page as "sensitive".
- To avoid this problem, a classifier that can accurately identify the parts of the page containing the targeted content is needed.
- Better still is a classifier that integrates the two tasks of locating and learning: the Multiple Instance Learning framework.

Multiple Instance Learning Boosting (1/8)
- Multiple Instance Learning (MIL) is a variation of supervised learning in which the labels of the training data are incomplete.
- In traditional methods, the label of each individual training instance is known; in MIL, labels are known only for groups of instances.
- Bag: a web page. Instance: a block of text.

Multiple Instance Learning Boosting (2/8)
2-class classification (sensitive or non-sensitive):
- A bag is labeled positive if at least one instance in it is positive.
- A bag is labeled negative if all the instances in it are negative.
- The goal of the MIL algorithm is to produce a content detector at the sub-document (block) level without having block labels in the training data.
- This saves a significant amount of money and effort by avoiding labeling work at the block level.
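The bag-labeling rule above can be sketched in a few lines of Python (a hypothetical illustration, not code from the paper):

```python
def bag_label(instance_labels):
    """MIL bag label: 1 (positive) if at least one instance is positive,
    0 (negative) only if every instance is negative."""
    return 1 if any(y == 1 for y in instance_labels) else 0

# A page with a single sensitive block is sensitive as a whole:
print(bag_label([0, 0, 1]))  # -> 1
print(bag_label([0, 0, 0]))  # -> 0
```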

Multiple Instance Learning Boosting (3/8)
Why MILBoost:
- The state-of-the-art traditional algorithms use boosting.
- A framework was needed to accurately measure the added effectiveness of the MIL approach.
- MILBoost has been successfully applied to a similar problem: training a face detector to find multiple faces in pictures when only picture-level labels are available.

Multiple Instance Learning Boosting (4/8)
[Diagram: bags labeled Positive or Negative at the bag level; the labels of the individual instances inside each bag are unknown (?); an initial classifier is trained from the bag labels.]

Multiple Instance Learning Boosting (5/8)
For each instance $x_{ij}$ (instance $j$ of bag $i$), the probability that the instance is positive is given by

$p_{ij} = \frac{1}{1 + e^{-y_{ij}}}$

where $y_{ij} = \sum_t \lambda_t c_t(x_{ij})$ is the weighted sum of the outputs of the classifiers in the ensemble after $t$ steps, and $c_t(x_{ij})$ is the output score of the instance generated by the $t$-th classifier of the ensemble.

Multiple Instance Learning Boosting (6/8)
The probability that the bag is positive is a "noisy OR":

$p_i = 1 - \prod_j (1 - p_{ij})$

Under this model the likelihood assigned to a set of training bags is

$L(C) = \prod_i p_i^{t_i} (1 - p_i)^{1 - t_i}$

where $t_i \in \{0, 1\}$ is the label of bag $i$.
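The noisy-OR bag probability and the bag-level likelihood above can be sketched as follows (a minimal illustration with hypothetical function names, not code from the paper):

```python
import math

def instance_prob(y_ij):
    # p_ij = 1 / (1 + exp(-y_ij)): probability that one block is positive
    return 1.0 / (1.0 + math.exp(-y_ij))

def bag_prob(instance_scores):
    # noisy-OR: p_i = 1 - prod_j (1 - p_ij)
    prod = 1.0
    for y in instance_scores:
        prod *= 1.0 - instance_prob(y)
    return 1.0 - prod

def log_likelihood(bags, labels):
    # log L(C) = sum_i [ t_i * log p_i + (1 - t_i) * log(1 - p_i) ]
    total = 0.0
    for scores, t in zip(bags, labels):
        p = bag_prob(scores)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total
```

Note how a single high-scoring block dominates the bag: `bag_prob([-3.0, -3.0, 4.0])` is already above 0.98, mirroring the rule that one sensitive block makes the whole page sensitive.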

Multiple Instance Learning Boosting (7/8)
Following the AnyBoost approach, the weight on an instance is given by

$w_{ij} = \frac{\partial \log L}{\partial y_{ij}} = \frac{t_i - p_i}{p_i}\, p_{ij}$

Each round of boosting is a search for a classifier $c$ which maximizes

$\sum_{i,j} c(x_{ij})\, w_{ij}$

where $c(x_{ij})$ is the score assigned to instance $j$ of bag $i$ by the weak classifier (for a binary classifier, $c(x_{ij}) \in \{-1, +1\}$). The parameter $\lambda$ is determined using a line search to maximize $\log L$.
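The instance-weight formula above can be sketched for a single bag as follows (a hedged illustration; the function name and interface are my own, not from the paper):

```python
import math

def milboost_weights(instance_scores, t_i):
    """AnyBoost weights w_ij = (t_i - p_i)/p_i * p_ij for one bag with
    label t_i in {0, 1} (the derivative of log L w.r.t. the score y_ij)."""
    p_inst = [1.0 / (1.0 + math.exp(-y)) for y in instance_scores]
    prod = 1.0
    for p in p_inst:
        prod *= 1.0 - p
    p_bag = 1.0 - prod                    # noisy-OR bag probability
    return [(t_i - p_bag) / p_bag * p for p in p_inst]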

Multiple Instance Learning Boosting (8/8)
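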



Sentiment Classification and Detection & Opinion Mining
Sentiment/opinion mining from review pages or blogs:
- A page may contain one or more topics.
- It is common to label reviews as "positive" or "negative".
- Reviews are often not as polar or one-sided as the label indicates.
- Blog review sites and discussion forums usually feature many people expressing varied opinions about the same product.
- These "mixed" opinions may act as noise during the training of traditional classification methods.

Multi-target MILBoost Algorithm (1/6)
- To apply MILBoost to the multi-topic detection task, it needs to be extended to a multi-class scenario.
- The "positive" and "negative" opinions are treated as the target classes and the "neutral" class as the null class.
- A bag is labeled as belonging to class k if it contains at least one instance of class k.
- A bag can be multi-labeled, since it may contain instances from more than two different target classes.
- To deal with multi-labels: create duplicates of a bag with multiple labels, assigning a different label to each duplicate.

Multi-target MILBoost Algorithm (2/6)
Suppose we have target classes $1 \dots K$ and class 0 is the null class. For each instance $x_{ij}$ of bag $i$, the probability that it belongs to class $k$ ($k \in \{1, 2, \dots, K\}$) is given by a softmax function,

$p_{ij}^{k} = \frac{e^{y_{ij}^{k}}}{\sum_{k'=0}^{K} e^{y_{ij}^{k'}}}$

where $y_{ij}^{k} = \sum_t \lambda_t c_t^{k}(x_{ij})$ is the weighted sum of the outputs of the classifiers in the ensemble after $t$ steps, and $c_t^{k}(x_{ij})$ is the output score for class $k$ from the instance, generated by the $t$-th classifier of the ensemble.
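The softmax over class scores can be sketched as follows (a minimal, self-contained illustration; the function name is my own):

```python
import math

def softmax_probs(class_scores):
    """p^k = exp(y^k) / sum_{k'=0}^{K} exp(y^{k'}) for one instance;
    index 0 is the null (neutral) class."""
    m = max(class_scores)                 # subtract the max for numerical stability
    exps = [math.exp(y - m) for y in class_scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With equal scores every class gets probability 1/(K+1); a single large score (e.g. `[0, 5, 0]`) concentrates almost all mass on that class.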

Multi-target MILBoost Algorithm (3/6)
- The probability that a page has label k is the probability that at least one of its content blocks has label k.
- Assuming that the blocks are independent of each other, the probability that a bag belongs to a target class $k$ ($k > 0$) is

$P_i^{k} = 1 - \prod_j (1 - p_{ij}^{k})$

- The probability that a page is neutral (belongs to the null class 0) is the probability that all the blocks in the page are neutral (the "noisy OR" model):

$P_i^{0} = \prod_j p_{ij}^{0}$

Multi-target MILBoost Algorithm (4/6)
The log likelihood of all the training data is

$\log L = \sum_i \log P_i^{k_i}$

where $k_i$ is the label of (duplicated) bag $i$. The weight on each instance for the next round of training is

$w_{ij}^{k} = \frac{\partial \log L}{\partial y_{ij}^{k}}$

and for the null class the weight is obtained in the same way from $P_i^{0}$.

Multi-target MILBoost Algorithm (5/6)
Combining weak classifiers:
- Once the (t+1)-th classifier is trained, its weight is obtained by a line search that maximizes the log-likelihood function.
Choice of classifier:
- In the experiments, Naive Bayes and decision trees are used.

Multi-target MILBoost Algorithm (6/6)
Testing:
- A new page is divided into blocks, and the block-level probabilities are computed using the classifier.
- The page-level probabilities are obtained by combining the block-level probabilities using noisy-OR.
- The block- and page-level labels are decided by thresholds on the probabilities.
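The test-time combination step can be sketched as follows (a hypothetical illustration; function and variable names are my own, not from the paper):

```python
def page_probs(block_probs):
    """Combine block-level class probabilities into page-level ones via
    noisy-OR. block_probs holds one [p^0, p^1, ..., p^K] vector per block,
    with class 0 the null/neutral class."""
    num_classes = len(block_probs[0])
    page = []
    p0 = 1.0                         # page is neutral iff every block is neutral
    for p in block_probs:
        p0 *= p[0]
    page.append(p0)
    for k in range(1, num_classes):  # class k iff at least one block is class k
        prod = 1.0
        for p in block_probs:
            prod *= 1.0 - p[k]
        page.append(1.0 - prod)
    return page

blocks = [[0.9, 0.05, 0.05],   # a mostly neutral block
          [0.2, 0.7, 0.1]]     # a block that looks "positive"
print(page_probs(blocks))
```

The per-class page probabilities do not sum to one (a page can carry more than one target class), which is why labels are decided by per-class thresholds rather than an argmax.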


Sensitive Content Detection (1/5)
- The data set contains 2,000 web pages labeled at the page level by human annotators.
- The label for each web page is binary: sensitive or non-sensitive.
- No labeling is done at the text-block level, so the evaluation has to be done at the web-page level.
- Two popular base classifiers were used to build the MILBoost ensemble: decision trees and Naive Bayes.
- Both the MILBoost and the non-MILBoost versions were run through 30 boosting iterations, ending up with an ensemble of 30 classifiers.
- Area Under the ROC Curve (AUC) was used to evaluate the effectiveness of the various detectors.

Sensitive Content Detection (2/5)
- Boosting significantly outperforms both base classifiers, and MILBoost further improves this performance by another 8.2%.
- The MILBoost version achieved almost the same performance as the boosted page-classifier.

Sensitive Content Detection (3/5)
Although the AUC is about the same, the MILBoosted system is almost consistently better than the boosted page-classifier in the early part of the curve, where the operating point usually lies. This "early lift" brings a practical advantage to the MILBoosted system.

Sensitive Content Detection (4/5)
Naive Bayes vs. decision trees:
- Naive Bayes performed much better than decision trees in this task.
- The decision-tree ensemble uses only about 700 keywords, while NB in principle uses the whole vocabulary of about 20,000 words.
- The bigger feature set enables NB to generalize better at the testing stage.

Sensitive Content Detection (5/5)
A Sensitive Content Detection Demo

Sentence Level Sentiment Detection (1/2)
- The subjectivity dataset from the Cornell movie review data repository is used; "objective" and "subjective" sentences are labeled.
- These sentences were extracted from 3,000 reviews, which are labeled at the review level as well.
- A review is a "page" and a sentence is a "block".
- The MILBoost detector is trained on the review data using only page-level labels, and then evaluated at the sentence level against sentence-level labels.
- Traditional page-level classifiers using boosted NB and decision trees are built as benchmark algorithms for comparison.
- A page-level classifier using support vector machines (SVM) is also trained for comparison.

Sentence Level Sentiment Detection (2/2)
Two settings are compared: training the classifiers on reviews with page-level labels versus training with sentence-level labels.
- MILBoost gives the highest performance of all algorithms using page-level labels, and is comparable with the best sentence detector trained with sentence-level labels.
- MILBoost improves performance by about 10% over boosted decision trees.
- The SVM did not do as well as the NB classifiers for sentence classification, whether trained with page-level or sentence-level labels.

Multi-class Sentiment Detection (1/3)
- Sentiment detection is naturally a three-class problem with "positive", "negative" and "neutral" as class labels.
- The "positive" and "negative" classes are the target classes and the "neutral" class is the null class in the MILBoost setup.
- In these tasks, a multi-class MIL system is built based only on Naive Bayes.
- The evaluation can only be done at the page level.

Multi-class Sentiment Detection (2/3)
- The MILBoost-based system improves upon the boosted Naive Bayes classifier.
- The performance using SVM is comparable to the MILBoost system.

Multi-class Sentiment Detection (3/3)

When does MIL improve on traditional methods? An Analysis Experiment (1/3)
- The paper hypothesized earlier that multiple-instance learning should improve on traditional techniques when the amount of mixed content is high.
- The experiments were run on a car review dataset containing 113,000 user reviews from MSN Autos.
- The objective of the learning task is to identify negative opinions in review texts.
- These experiments aim to show that as the amount of mixed content increases, the MIL-based approach helps traditional techniques improve.

When does MIL improve on traditional methods? An Analysis Experiment (2/3)
- The data set has an overall review rating score from 0 to 10.
- Assume that if the rating score is 6 or below, there are some negative opinions in the review text.
- The negative reviews are further split into two subsets: one with rating scores from 0 to 3 ("data 0-3") and the other with ratings from 4 to 6 ("data 4-6").
- Presumably, the percentage of negative sentences in "data 0-3" is much higher than in "data 4-6".
- If the hypothesis holds, MIL-based techniques should give a bigger boost on "data 4-6".

When does MIL improve on traditional methods? An Analysis Experiment (3/3)
- On "data 0-3", the MILBoost-based system did not improve much over the regular boosted system; on "data 4-6" it achieved a statistically significant improvement over the traditional classifiers.
- With good-quality training data, MILBoost does not give much advantage over traditional methods; however, if the training data has a high ratio of mixed content, MILBoost provides significant advantages.
- Note that there are three times as many pages in "data 4-6" as in "data 0-3", and the overall class distribution is highly biased towards positive, with a positive-to-negative ratio of 5:1.


Conclusion
- This paper explored sub-document classification for contextual advertisement applications, where the desired content appears only in a small part of a multi-topic web document.
- Sub-document classifiers are trained when only page-level labels are available.
- The MILBoost system improves the performance of traditional classifiers in such tasks, especially when the percentage of mixed content is high.
- These systems provide good-quality block-level labels for free, leading to significant savings in time and cost on human labeling at the block level.

END

AdaBoost

Multi-labelled Document