
© 2006 Nielsen BuzzMetrics, A VNU business affiliate Deriving Marketing Intelligence from Online Discussion Natalie Glance and Matthew Hurst CMU Information Retrieval Seminar, April 19, 2006

Overview
 Motivation
 Content Segment: The Blogosphere
 Structural Aspects
 Topical Aspects
 Deriving market intelligence
 Conclusion

Motivation
 Social media, e.g. mobile phone discussion: "The celly 31 is awesome, but the screen is a bit too dim."
 Product / score: Celly Phony ZA, 8.0
 Feature / score: Screen 2.0; Signal 9.0

The Blogosphere


Profile Analysis Hurst, “24 Hours in the Blogosphere”, 2006 AAAI Spring Symposium on Computational Approaches to Analysing Weblogs.

Hypotheses
 Different hosts attract users with differing propensity to disclose profile information (?)
 Blogspot users are more disposed to disclose information (?)
 Different interface implementations perform differently at extracting/encouraging information from users (?)

Per Capita: Spaces
 variance in average age
 variance in profiles with age
 variance in per capita bloggers

Per Capita: Blogspot

The graphical structure of the blogosphere

Graphical Structure of the Blogosphere
 Citations between blogs indicate some form of relationship, generally topical.
 A link is certainly evidence of awareness; consequently, reciprocal links are evidence of mutual awareness.
 Mutual awareness suggests some commonality, perhaps common interests.
 The graph of reciprocal links can be considered a social network.
 Areciprocal links suggest topical relationships, but not social ones.


Graph Layout
 Hierarchical force layout
 The graph has two types of links: reciprocal and areciprocal
 Create a set of partitions P, where each partition is a connected component of the reciprocal graph
 Create a graph whose nodes are the members of P and whose edges are formed from areciprocal links between (nodes within) members of P
 Lay out the partition graph
 Lay out each partition
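A minimal sketch of the partitioning step described above, assuming directed links are given as (source, target) pairs. The function names and the union-find implementation are illustrative, not from the original system.

```python
# Partitioning for the hierarchical layout: reciprocal links define
# connected components ("partitions"); remaining areciprocal links
# between different partitions form the coarser partition graph.

def reciprocal_partitions(edges):
    """Connected components of the reciprocal subgraph of `edges`.

    A link (a, b) is reciprocal when (b, a) is also present.
    """
    recip = {(a, b) for (a, b) in edges if (b, a) in edges}
    nodes = {n for e in edges for n in e}
    parent = {n: n for n in nodes}

    def find(n):                       # union-find root lookup with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in recip:                 # merge endpoints of reciprocal links
        parent[find(a)] = find(b)

    parts = {}
    for n in nodes:
        parts.setdefault(find(n), set()).add(n)
    return list(parts.values())

def partition_graph(edges, parts):
    """Edges between partitions induced by the areciprocal links."""
    where = {n: i for i, p in enumerate(parts) for n in p}
    return {(where[a], where[b]) for (a, b) in edges
            if (b, a) not in edges and where[a] != where[b]}
```

Laying out the partition graph first, then each partition internally, then matches the two-level procedure on the slide.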

Japanese-language blogosphere map (r = 2, p = 25); labeled clusters: cooking, knitting

Blogosphere map (r = 2, p = 1); labeled blogs: boingboing, michellemalkin, engadget, instapundit, powerline, scoble, crooksandliars, kbcafe/rss, gizmodo

English-language blogosphere map (r = 3, p = 100); labeled clusters: technology, social/politics. The English blogosphere is political.

The Political Blogosphere
 L. Adamic and N. Glance, "The Political Blogosphere and the 2004 U.S. Election: Divided They Blog", 2nd Annual Workshop on the Weblogging Ecosystem, Chiba, Japan, 2005.

Political Blogs & Readership
 Pew Internet & American Life Project Report, January 2005, reports:
 63 million U.S. citizens use the Internet to stay informed about politics (mid-2004)
 9% of Internet users read political blogs preceding the 2004 U.S. Presidential Election
 2004 Presidential Campaign firsts:
 Candidate blogs, e.g. Dean's blogforamerica.com
 Successful grassroots campaign conducted via websites & blogs
 Bloggers credentialed as journalists & invited to nominating conventions

Research Goals & Questions
 Are we witnessing a cyberbalkanization of the Internet?
 Linking behavior of blogs may make it easier to read only like-minded bloggers
 On the other hand, bloggers systematically react to and comment on each other's posts, both in agreement and disagreement (Balkin 2004)
 Goal: study the linking behavior & discussion topics of political bloggers
 Measure the degree of interaction between liberal and conservative bloggers
 Find any differences in the structure of the two communities: is there a significant difference in "cohesiveness" between the two?

The Greater Political Blogosphere
 Citation graph of the greater political blogosphere
 Front page of each blog crawled in February 2005
 Directed link from blog A to blog B if A links to B
 Method is biased toward blogroll/sidebar links (as opposed to links in posts)
 Results
 91% of links point to blogs of the same persuasion (liberal vs. conservative)
 Conservative blogs show a greater tendency to link
 82% of conservative blogs are linked to at least once; 84% link to at least one other blog
 67% of liberal blogs are linked to at least once; 74% link to at least one other blog
 Average # of links per blog is similar: 13.6 for liberal; 15.1 for conservative
 A higher proportion of liberal blogs are not linked to at all

Citations between blogs extracted from posts (Aug 29 – Nov 15, 2004):
A) All citations between A-list blogs in the 2 months preceding the 2004 election
B) Citations between A-list blogs with at least 5 citations in both directions
C) Edges further limited to those exceeding 25 combined citations
Only 15% of the citations bridge communities.

Are political blogs echo chambers?
 Performed pairwise comparison of URL citations and phrase usage from blog posts
 Link-based similarity measure
 Cosine similarity: cos(A,B) = (v_A · v_B) / (||v_A|| ||v_B||), where v_A is a binary vector whose entries are 1 or 0 depending on whether blog A cites a particular URL
 Average similarity: cos(L,R) = 0.03; cos(R,R) = 0.083; cos(L,L) =
 Phrase-based similarity measure
 Extracted a set of phrases informative with respect to a background model
 Entries in v_A are TF*IDF weights for each phrase = (# of phrase mentions by blog) * log[(# blogs) / (# blogs citing the phrase)]
 Average similarity: cos(L,R) = 0.10; cos(R,R) = 0.54; cos(L,L) = 0.57
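A minimal sketch of the link-based measure above: each blog is a binary vector over the URLs it cites, so the cosine reduces to shared citations over the geometric mean of citation counts. The blog and URL names here are made up for illustration.

```python
import math

def cosine(cites_a, cites_b):
    """cos(A, B) = |A ∩ B| / sqrt(|A| * |B|) for binary citation sets."""
    if not cites_a or not cites_b:
        return 0.0
    shared = len(cites_a & cites_b)          # 1·1 dot-product terms
    return shared / math.sqrt(len(cites_a) * len(cites_b))
```

Identical citation sets give 1.0 and disjoint sets give 0.0, so the reported cos(R,R) > cos(L,R) gap directly measures how rarely the two communities cite the same URLs.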

Influence on mainstream media
Notable examples of blogs breaking a story:
1. Swiftvets.com anti-Kerry video
 Bloggers linked to this in late July, keeping accusations alive
 Kerry responded in late August, bringing mainstream media coverage
2. CBS memos alleging preferential treatment of Pres. Bush during the Vietnam War
 Powerline broke the story on Sep. 9, launching a flurry of discussion
 Dan Rather apologized later in the month
3. "Was Bush Wired?"
 Salon.com asked the question first on Oct. 8, echoed by Wonkette & PoliticalWire.com
 Mainstream media followed up the next day

Deriving Market Intelligence
N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton and T. Tomokiyo. "Deriving Marketing Intelligence from Online Discussion." Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005).

Automating Market Research
 Brand managers want to know:
 Do consumers prefer my brand to another?
 Which features of my product are most valued?
 What should we change or improve?
 Alert me when a rumor starts to spread!

Comparative mentions chart for Halo 2 (query: 'halo 2')

Case Study: PDAs
 Collect online discussion in the target domain (order of 10K to 10M posts)
 Classify discussion into domain-specific topics (brand, feature, price)
 Perform base analyses over combinations of topics: buzz, sentiment/polarity, influencer identification

Dell Axim, 11.5% buzz, 3.4 polarity

Interactive analysis
 Top-down approach: drill down from aggregate findings to the drivers of those findings
 Global view of data used to determine focus
 Model parent and child slice
 Use data-driven methods to identify what distinguishes one data set from the other

SD card

Social network analysis for discussion about the Dell Axim

Drilling down to sentence level
 Discussion centers on poor quality of sound hardware & IR ports:
 "It is very sad that the Axim's audio AND Irda output are so sub-par, because otherwise it is a great Pocket PC."
 "Long story made short: the Axim has a considerably inferior audio output than any other Pocket PC we have ever tested."
 "When we tested it we found that there was a problem with the audio output of the Axim."
 "The Dell Axim has a lousy IR transmitter AND a lousy headphone jack."
 Note: these examples are automatically extracted.

Technology
 Data Collection:
 Document acquisition and analysis
 Classification (relevance/topic)
 Topical Analysis:
 Topic classification using a hierarchy of topic classifiers operating at the sentence level
 Phrase mining and association
 Intentional Analysis:
 Interpreting sentiment/polarity
 Community analysis
 Aggregate metrics

Topical Analysis
 Hierarchy of topics with specific 'dimensions':
 Brand dimension: Pocket PC (Dell Axim; Toshiba e740; Palm Zire, Tungsten)
 Feature dimension: Components (Battery)

Topical Analysis
 Each topic is a classifier, e.g. a boolean expression with sentence- and/or message-scoped sub-expressions.
 The measured precision of a classifier allows raw counts to be projected to estimated true counts.
 Intersection of typed dimensions allows a basic approach to association (e.g. find sentences discussing the battery of the Dell Axim).
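An illustrative sketch of the topic-as-classifier idea: a topic is a boolean test over sentence text, and a measured precision scales the raw match count into a projected count. The expression representation (keyword conjunction/disjunction) and the precision value are inventions for this example, not the system's actual classifier language.

```python
def matches(sentence, all_of=(), any_of=()):
    """Boolean topic test: every `all_of` term must appear, and at
    least one `any_of` term if any are given."""
    s = sentence.lower()
    return all(t in s for t in all_of) and (not any_of or any(t in s for t in any_of))

def projected_count(sentences, precision, **expr):
    """Scale the raw match count by the classifier's measured precision."""
    raw = sum(matches(s, **expr) for s in sentences)
    return raw * precision
```

Intersecting dimensions is then just conjoining the brand and feature tests, e.g. `all_of=("axim", "battery")` to find battery discussion about the Dell Axim.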

Polarity: What is it?
 Opinion, evaluation, or emotional state with respect to some topic:
 "It is excellent."
 "I love it."
 Desirable or undesirable condition:
 "It is broken" (objective, but negative).
 We use a lexical/syntactic approach.
 Cf. related work treating this as a boolean document classification task using supervised classifiers.

Polarity Identification
 Example sentence: "This car is really great"
 POS: DT NN VB RR JJ
 Chunking: BNP BVP BADJP
 Lexical orientation + interpretation (parsing): Positive
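A toy lexical sketch in the spirit of the pipeline above: token-level orientation from a tiny hand-made lexicon, with simple negation flipping. The real system layers POS tagging, chunking, and parsing on top; the lexicon and negation rule here are assumptions for illustration only.

```python
# Hypothetical miniature orientation lexicon (+1 positive, -1 negative).
ORIENTATION = {"great": 1, "excellent": 1, "awesome": 1,
               "lousy": -1, "broken": -1, "inferior": -1}
NEGATORS = {"not", "never"}

def sentence_polarity(sentence):
    """Sum lexicon orientations, flipping the sign of the word that
    immediately follows a negator."""
    score, flip = 0, 1
    for tok in sentence.lower().split():
        if tok in NEGATORS:
            flip = -1
        elif tok in ORIENTATION:
            score += flip * ORIENTATION[tok]
            flip = 1          # toy assumption: negation scope ends here
    return score
```

Sentences scoring above zero count as positive statements, below zero as negative; the challenges slide that follows shows exactly where this word-level approach breaks down.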

Polarity Challenges
 Methodological: "She told me she didn't like it."
 Syntactic: "His cell phone works in some buildings, but in others it doesn't."
 Valence:
 "I told you I didn't like it"
 "I heard you didn't like it"
 "I didn't tell you I liked it"
 "I didn't hear you liked it"
 Many verbs (tell, hear, say, …) require semantic/functional information for polarity interpretation.
 Association

Polarity Examples

Polarity Metric
 A function of counts of polar statements on a topic: f(size, f_topic, f_topic+pos, f_topic+neg)
 Use empirical priors to smooth observed counts (helps with low counts)
 Use the precision/recall of the system to project true counts and provide error bars (requires labeled data)
 Example: a +/- ratio metric maps the ratio to a 0-10 score
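A sketch of such a +/- ratio metric: positive and negative counts on a topic are smoothed with prior pseudo-counts and the resulting positive ratio is mapped onto a 0-10 scale. The prior value and strength are arbitrary assumptions here, standing in for the empirical priors the slide mentions.

```python
def polarity_score(pos, neg, prior_pos=0.5, prior_strength=2.0):
    """Map the smoothed positive ratio of a topic to a 0-10 score.

    Smoothing adds `prior_strength` pseudo-observations at ratio
    `prior_pos`, which keeps low-count topics near the neutral score.
    """
    ratio = (pos + prior_strength * prior_pos) / (pos + neg + prior_strength)
    return 10.0 * ratio
```

With no observations the score sits at the neutral 5.0, and a single stray polar sentence cannot drag a topic to an extreme score, which is the point of smoothing low counts.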


Predicting Movie Sales from Blogger Sentiment G. Mishne and N. Glance, “Predicting Movie Sales from Blogger Sentiment,” 2006 AAAI Spring Symposium on Computational Approaches to Analysing Weblogs.

Blogger Sentiment and Impact on Sales
 What we know:
 There is a correlation between references to a product in the blogspace and its financial figures
 Tong 2001: movie buzz in Usenet is correlated with sales
 Gruhl et al. 2005: spikes in Amazon book sales follow spikes in blog buzz
 What we want to find out:
 Does taking into account the polarity of the references yield a better correlation?
 Product of choice: movies
 Methodology: compare the correlation of references to sales with the correlation of polar references to sales

Experiment
 49 movies
 Budget > $1M
 Released between Feb. and Aug.
 Sales data from IMDB
 "Income per screen" = opening weekend sales / number of screens
 Blog post collection
 References to the movies in a 2-month window
 Used IMDB link + simple heuristics
 Measure:
 Pearson's r between income per screen and {references in blogs, positive/polar references in blogs}
 Applied to various context lengths around the reference
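The measure named above, Pearson's r, computed directly from its definition. The sales and reference counts in the test are fabricated toy numbers, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard
    deviations (computed here via unnormalized sums, which cancel)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied per movie list, `xs` would be income per screen and `ys` either raw reference counts or positive-reference counts, allowing the two correlations to be compared as the methodology describes.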

Results: income per screen vs. positive references
 For 80% of the movies, r > 0.75 for pre-release positive sentiment
 12% improvement compared with the correlation of movie sales with simple buzz count (0.542 vs )

Conclusion
 The intersection of social media and data/text mining algorithms presents a viable business opportunity, set to replace traditional forms of market research, social trend analysis, etc.
 Key elements include topic detection and sentiment mining.
 The success of the blogosphere has driven interest in a distinct form of online content which has a long history but is becoming more and more visible.
 The blogosphere itself is a fascinating demonstration of social content and interaction and will support many applications of traditional and novel analysis.

 Internships: openings available for this summer
 Data set: weblog data for July 2005
 3rd Annual Workshop on the Weblogging Ecosystem
 1st International Conference on Weblogs and Social Media, March 2007 (under construction)
 Company info
 Company website
 Blog search

Phrase Finding
 Goal: find key phrases which discriminate between a foreground corpus and a background corpus
 First step: KeyBigramFinder
 Identifies phrases that score high in informativeness and phraseness
 Informativeness: a measure of the ability to discriminate foreground from background
 Phraseness: a measure of the collocation of consecutive words

Phrase Finding Pipeline
 Seeded by KeyBigramFinder
 Sample pipeline:
 APrioriPhraseExpander: expands top N bigrams into longer phrases, adapting the APRIORI algorithm to text and features of text
 ConstituentFinder: uses contextual evidence to identify noun phrases
 Final list sorted either by frequency or by informativeness score
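The two bigram scores described above can be sketched as follows: phraseness as pointwise mutual information of the two words in the foreground corpus, and informativeness as the log ratio of foreground to background phrase probability. This follows the spirit of the slide's definitions only loosely; the probability inputs and the additive combination are assumptions for illustration.

```python
import math

def phraseness(p_bigram, p_w1, p_w2):
    """PMI of the two words: high when they co-occur far more often
    than their independent frequencies would predict."""
    return math.log(p_bigram / (p_w1 * p_w2))

def informativeness(p_fg, p_bg):
    """Log ratio of foreground to background phrase probability:
    high when the phrase is characteristic of the foreground corpus."""
    return math.log(p_fg / p_bg)

def key_bigram_score(p_bigram_fg, p_w1, p_w2, p_bigram_bg):
    """Combined score used to rank candidate key bigrams (additive
    combination is an assumption of this sketch)."""
    return (phraseness(p_bigram_fg, p_w1, p_w2)
            + informativeness(p_bigram_fg, p_bigram_bg))
```

Ranking bigrams by this combined score would yield the seed list that APrioriPhraseExpander then grows into longer phrases.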