Mining Marketing Meaning from Online Chatter: Strategic Brand Analysis of Big Data Using Latent Dirichlet Allocation
Seshadri Tirunillai, University of Houston
Gerard J. Tellis, University of Southern California
Presented by: Porter Jenkins, Smeal College of Business
Agenda
Introduction: Motivation, Goals, Contributions
Methodology: Background of LDA, Text Preprocessing, Dimension Extraction using LDA, Dimension Labeling, Validation
Empirical Results
Introduction: Motivation
Research shows that product quality is an important factor in brand loyalty, market share, firm value, etc.
Surveys are the typical measure of quality, but they rely on limited samples and are infrequent.
Introduction: Motivation
Alternative approach? User-generated content (UGC):
Cheap (often free)
Frequent, even contemporaneous
Passionate, honest responses
Introduction: Motivation
How can we specify a model-based approach to extract meaning from thousands (or millions) of online text documents?
Even with a model for text, how can we evaluate the quality of products? From a practical perspective, there are many ways to define quality (it is multidimensional).
Given a set of dimensions, we would want to understand their:
Valence (sentiment)
Labels
Validity
Importance
Dynamics
Heterogeneity
Introduction: Goals
Extract latent dimensions of quality from online text data
Evaluate the latent dimensions on the criteria previously mentioned
Utilize latent dimensions for strategic analysis: brand positioning, segmentation, etc.
Introduction: Contributions
A unified framework to:
Simultaneously extract the latent dimensions of quality and the valence (sentiment) of those dimensions
Analyze the dynamics of quality at frequent intervals
Show the importance of latent dimensions using time-varying intensity in conversations
Identify the optimal number of latent dimensions
Estimate heterogeneity among customers along the latent dimensions
Make generalizable inferences over multiple markets
Methodology
Methodology: Background of LDA
How can we extract meaning from text? Non-trivial. Challenges:
A single word can have different meanings in different contexts
Words can have different meanings across markets, e.g., a 'small' television vs. 'small' computer memory
We can't simply use a universal mapping of positive/negative sentiment
People use different vocabularies, which results in a very sparse corpus matrix
Average corpus matrix: 201 x 2571 (documents x words), with most elements 0
Standard dimension-reduction techniques can't be applied directly
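The sparsity claim is easy to see on a toy corpus (the reviews and words below are invented for illustration; the paper's matrices are of course far larger):

```python
# Toy illustration: build a document-term count matrix for a few short
# reviews and measure its sparsity (fraction of zero entries).
reviews = [
    "great battery life and great screen",
    "battery died fast terrible battery",
    "screen is sharp camera is decent",
    "terrible customer service slow shipping",
]
docs = [r.split() for r in reviews]
vocab = sorted({w for d in docs for w in d})
# Rows = documents, columns = vocabulary terms.
X = [[d.count(w) for w in vocab] for d in docs]
zeros = sum(row.count(0) for row in X)
sparsity = zeros / (len(docs) * len(vocab))
print(f"matrix shape: {len(docs)} x {len(vocab)}, sparsity: {sparsity:.2f}")
```

Even with only four reviews, roughly 70% of the entries are zero; real review corpora with thousands of distinct words are sparser still.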
Methodology: Background of LDA
Enter Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003), a model/algorithm from the machine learning literature:
Unsupervised learning: we don't have to know the dimensions in advance
Doesn't require elaborate mapping dictionaries
Extracts the meaning of words within their context of use
Extends well to big data and sparse matrices
Measures valence and dimensions jointly
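LDA's generative assumptions can be sketched in a few lines (a toy simulation of the data-generating story, not the paper's estimation procedure; the topic names and word probabilities below are invented):

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_doc(n_words, topic_word, alpha, rng):
    """The LDA generative story: draw a topic mixture for the document,
    then for each word position draw a topic, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(n_words):
        k = rng.choices(range(len(theta)), weights=theta)[0]
        words, probs = zip(*topic_word[k].items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc

rng = random.Random(0)
# Two hypothetical quality "dimensions" with their word distributions.
topics = [
    {"battery": 0.5, "charge": 0.3, "life": 0.2},    # power dimension
    {"screen": 0.5, "display": 0.3, "bright": 0.2},  # display dimension
]
doc = generate_doc(8, topics, alpha=[0.5, 0.5], rng=rng)
print(doc)  # 8 words drawn from a document-specific mix of the two topics
```

Inference runs this story in reverse: given only the documents, recover the topics and each document's mixture.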
Methodology: Example of LDA (three worked-example slides; figures adapted from Carl Rasmussen)
Methodology: LDA as a Graphical Model (figure adapted from Carl Rasmussen)
Methodology: LDA as a Graphical Model
Key contribution: the addition of valence (ν) to the model.
The per-word topic assignment (z) depends on the document's topics AND the valence of the word.
The valence of a word (ν) depends on the valence of the topic (π) in the document.
φ is the multinomial distribution of the topics (dimensions), with the associated valence, over the vocabulary of words in the reviews.
Methodology: Steps in the Paper
Text preprocessing
Dimension extraction using LDA
Dimension labeling
Cross-validation with other data sources
Methodology: Text Preprocessing
Treat each review as a separate document
Eliminate non-English characters, URLs, punctuation, etc.
Replace pronouns with the corresponding noun (anaphora resolution)
Separate strings into individual sentences, then tokenize the sentences
Normalize negative words to 'not'
Stem words, e.g., 'swimming' -> 'swim', using Porter's algorithm
Remove stop words: 'the', 'and', 'is', etc.
Remove all words that appear in fewer than 2% of documents
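A simplified version of this pipeline can be sketched with the standard library alone (the stop-word list and negation set are tiny illustrative stand-ins, and the suffix stripper is a crude placeholder for Porter's algorithm, not an implementation of it):

```python
import re

STOPWORDS = {"the", "and", "is", "a", "of", "it"}   # tiny illustrative list
NEGATIONS = {"don't", "doesn't", "isn't", "won't", "can't", "never"}

def preprocess(review: str) -> list[str]:
    """Lowercase, keep alphabetic tokens (apostrophes retained so negation
    contractions survive), map negations to 'not', drop stop words, and
    apply a naive suffix stemmer (a stand-in for Porter's algorithm)."""
    tokens = re.findall(r"[a-z']+", review.lower())
    out = []
    for t in tokens:
        if t in NEGATIONS:
            out.append("not")
            continue
        if t in STOPWORDS:
            continue
        for suffix in ("ming", "ing", "ed", "s"):   # naive stemming rules
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

print(preprocess("The battery doesn't last and charging is slow"))
# → ['battery', 'not', 'last', 'charg', 'slow']
```

A production pipeline would swap in a real stemmer and a full stop-word list, but the shape of the transformation is the same.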
Methodology: Dimension Extraction (LDA)
The joint density of the observed words, dimensions (topics), and valence, given the hyperparameters, is specified in terms of:
Word-level parameters
Document-level parameters
Latent dimension and valence parameters
Methodology: Dimension Extraction (LDA)
The likelihood of w, the words in a given document, is equation (3). Summing across all valence levels and topics gives equation (4).
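The slide cites equations (3) and (4) without reproducing them. For orientation, a standard LDA marginal likelihood, extended with the valence variable described on the surrounding slides, would take roughly this form (a hedged reconstruction, not the paper's exact notation; θ is the document's topic mixture, π its valence distribution, and φ the topic-valence word distributions):

```latex
p(\mathbf{w} \mid \alpha, \gamma, \varphi)
  = \iint p(\theta \mid \alpha)\, p(\pi \mid \gamma)
    \left[ \prod_{n=1}^{N} \sum_{z_n} \sum_{\nu_n}
    p(z_n \mid \theta)\, p(\nu_n \mid \pi, z_n)\,
    p(w_n \mid z_n, \nu_n, \varphi) \right] d\theta \, d\pi
```

The double sum over z_n and ν_n inside the product is the "sum across all valence levels and topics" step the slide refers to.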
Methodology: Dimension Extraction (LDA)
The posterior distribution is analytically intractable, so we use standard MCMC techniques (a Gibbs sampler) to approximate it.
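For concreteness, here is a compact collapsed Gibbs sampler for plain LDA (without the paper's valence extension; the toy documents are invented):

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA. docs: list of token lists;
    K: number of topics. Returns document-topic and topic-word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]            # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                             # topic totals
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z = k | all other assignments)
                weights = [(ndk[d][k2] + alpha) *
                           (nkw[k2][w] + beta) / (nk[k2] + V * beta)
                           for k2 in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

docs = [["battery", "battery", "charge"], ["screen", "display", "screen"],
        ["battery", "charge", "charge"], ["display", "screen", "display"]]
ndk, nkw = gibbs_lda(docs, K=2)
print(ndk)
```

The paper's sampler additionally resamples each word's valence ν alongside its topic z, but the count-and-resample structure is the same.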
Methodology: Dimension Extraction (LDA)
A few items to tune in the algorithm:
Seed words to identify valence: specify a small dictionary of unambiguous seed words, e.g., 'good', 'great', 'bad', 'horrible'; then iteratively estimate the valence probability of newly encountered words from their co-occurrence with the seed words
Selecting the optimal number of dimensions (topics): for each market, begin with two dimensions and increase the number until the log-likelihood reaches its maximum
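One simple way to propagate valence from seed words (an illustrative sketch, not the paper's iterative estimator) is to score each word by how often it co-occurs in a sentence with positive versus negative seeds:

```python
from collections import Counter

POS_SEEDS = {"good", "great"}
NEG_SEEDS = {"bad", "horrible"}

def valence_scores(sentences, smoothing=1.0):
    """Score each non-seed word's probability of being positive from its
    sentence-level co-occurrence with positive vs. negative seed words
    (Laplace-smoothed)."""
    pos, neg = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        has_pos, has_neg = words & POS_SEEDS, words & NEG_SEEDS
        for w in words - POS_SEEDS - NEG_SEEDS:
            if has_pos: pos[w] += 1
            if has_neg: neg[w] += 1
    return {w: (pos[w] + smoothing) / (pos[w] + neg[w] + 2 * smoothing)
            for w in set(pos) | set(neg)}

sents = [["great", "battery"], ["horrible", "lag"], ["good", "battery"]]
scores = valence_scores(sents)
print(scores["battery"] > 0.5, scores["lag"] < 0.5)  # → True True
```

The paper embeds this idea inside the sampler, updating valence probabilities jointly with the topic assignments rather than in a single pass.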
Methodology: Dimension Labeling
Goal: choose labels so that each dimension reflects its topic of discussion, using an unsupervised approach.
Assign a score to each word given a dimension (topic) using Mutual Information (MI): the amount of information gained about the dimension as a result of the presence of the word.
(E.g., do a dimension's words point to genetics, biology, neurology, or technology?)
Methodology: Dimension Labeling
We can use entropy to compute MI:
Equation (6): the entropy of the probability that dimension k generated a randomly chosen document, where η represents the event that the document discussed the kth dimension
Equation (7): the entropy conditional on the words observed in the document
Equation (8): how much a given word, w*, reduces uncertainty in the entropy of dimension k; it is positive if w* reduces uncertainty
Select the word(s) with the highest MI as the label(s)
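The entropy-reduction calculation can be sketched directly (a toy version consistent with the slide's description of equations (6)-(8); the documents and dimension flags below are invented):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(docs, topic_flags, word):
    """MI between the event 'document discusses dimension k' (topic_flags)
    and the presence of `word`: H(eta) minus the entropy conditional on
    observing whether the word is present."""
    n = len(docs)
    p_k = sum(topic_flags) / n
    h = entropy([p_k, 1 - p_k])                 # unconditional entropy
    h_cond = 0.0
    for present in (True, False):
        idx = [i for i, d in enumerate(docs) if (word in d) == present]
        if not idx:
            continue
        pk = sum(topic_flags[i] for i in idx) / len(idx)
        h_cond += (len(idx) / n) * entropy([pk, 1 - pk])  # conditional
    return h - h_cond                           # information gained

docs = [{"battery", "charge"}, {"battery", "life"}, {"screen", "bright"},
        {"screen", "color"}]
flags = [1, 1, 0, 0]        # documents discussing the "power" dimension
print(mutual_information(docs, flags, "battery"))  # → 1.0 (perfect predictor)
```

A word that perfectly separates documents discussing the dimension from the rest, like "battery" here, achieves the maximum MI and is the natural label candidate.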
Methodology: Validation
How can we know if the automated algorithm is any good?
Compare to quality ratings from humans: have humans read each review and construct a set of dimensions, with associated valence, from the words present; compute the kappa statistic to measure agreement. The authors found 'strong' to 'quite strong' agreement.
Compare to Consumer Reports: use the Jaccard coefficient to measure the degree of overlap in dimensions; it was fairly high for all markets.
Conclusion: MI works well. Alternatively, you could have humans read a few random reviews in each dimension to get a deeper sense of its meaning.
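Both agreement measures are short calculations (the rater vectors and dimension sets below are invented for illustration):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters: observed agreement corrected
    for the agreement expected by chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_yes = (sum(a) / n) * (sum(b) / n)
    p_no = (1 - sum(a) / n) * (1 - sum(b) / n)
    pe = p_yes + p_no                                   # chance agreement
    return (po - pe) / (1 - pe)

def jaccard(s, t):
    """Overlap of two dimension sets: |intersection| / |union|."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

human = [1, 1, 0, 1, 0, 0, 1, 0]    # human-coded presence of a dimension
model = [1, 1, 0, 1, 0, 1, 1, 0]    # model-coded presence
print(round(cohens_kappa(human, model), 2))             # → 0.75
print(jaccard({"performance", "design", "service"},
              {"performance", "design", "price"}))      # → 0.5
```

Kappa near 1 means agreement well beyond chance; the Jaccard coefficient simply measures how much of the two dimension sets overlap.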
Results
Results: Dimensions of Quality (Mobile Phones)
What does the LDA dimension extraction give us?
Results: Dimensions of Quality (All Markets)
Some dimensions of quality are common across industries, e.g., performance, customer service, visual appeal.
Others are unique:
Toys: safety
Footwear: comfort
Data storage: portability
Results: Heterogeneity of Dimensions
Rather than using the estimated parameters directly, assess heterogeneity with the Herfindahl Index: the concentration of reviews that mention a given dimension within a brand, relative to all other dimensions extracted for that brand.
H can be interpreted as the average concentration of dimensions within a brand.
Results: Heterogeneity of Dimensions
The Herfindahl Index is an inverse measure of heterogeneity: a high H for a brand means low heterogeneity of dimensions; a low H means high heterogeneity.
Key insight: in vertical markets, consumers agree more on the definition of quality.
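The index itself is a one-liner over dimension shares (the mention counts below are hypothetical):

```python
def herfindahl(mention_counts):
    """Herfindahl index over the dimensions mentioned for one brand:
    the sum of squared shares. H near 1 means talk is concentrated on a
    few dimensions (low heterogeneity); H near 1/n means it is spread
    evenly (high heterogeneity)."""
    total = sum(mention_counts)
    return sum((c / total) ** 2 for c in mention_counts)

# Hypothetical mention counts across four quality dimensions for two brands.
concentrated = [90, 5, 3, 2]     # most reviews cite a single dimension
dispersed = [25, 25, 25, 25]     # reviews spread evenly across dimensions
print(herfindahl(concentrated) > herfindahl(dispersed))   # → True
print(herfindahl(dispersed))                              # → 0.25
```

With four dimensions, H ranges from 0.25 (maximal heterogeneity) up to 1.0 (everyone discusses the same dimension).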
Results: Heterogeneity of Dimensions Over Time
The Herfindahl Index does not take into account the time-varying nature of heterogeneity in dimensions, so calculate the percentage of instability in the Herfindahl Index over time.
Let ρ be the correlation between the percentage share of consumers citing a dimension within a given brand, and all other dimensions, between two periods; let σ be the standard deviation of the shares of the dimensions, and n the total number of dimensions at time t.
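The slide names ρ, σ, and n without spelling out the full instability formula; the ρ ingredient, the period-to-period correlation of dimension shares, can be sketched as follows (the counts are hypothetical):

```python
import math

def shares(counts):
    """Convert raw mention counts into percentage shares."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(x, y):
    """rho: Pearson correlation between dimension shares in two adjacent
    periods. This is one ingredient of the slide's instability measure;
    the full formula also involves sigma and n, which the slide does not
    spell out."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical mention counts for four dimensions in two adjacent periods.
t1, t2 = [40, 30, 20, 10], [38, 31, 21, 10]
rho = pearson(shares(t1), shares(t2))
print(rho > 0.9)   # stable brand: shares barely move between periods → True
```

A ρ near 1 across periods signals a stable definition of quality; low or unstable ρ signals that the dimensions consumers care about are shifting.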
Results: Heterogeneity of Dimensions Over Time
Key point: vertically differentiated markets are more stable; their indicators of quality do not change. Horizontal markets are less stable.
Results: Brand Mapping
Calculate the distance between two brands on a given dimension as the dissimilarity between the vectors of words underlying that dimension. The space is non-Euclidean, so use the Hellinger distance.
Measure the distance for all combinations of brands within a given market to construct a similarity matrix, then use multidimensional scaling to map it to 2-D space, using the top two most important dimensions (by frequency of occurrence).
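The pairwise distance step can be sketched directly (the word distributions below are hypothetical P(word | dimension, brand) vectors over a four-word vocabulary):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two word distributions: the brand-pair
    dissimilarity computed before multidimensional scaling. Bounded in
    [0, 1]; 0 means identical distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Hypothetical word distributions for one dimension across three brands.
brand_a = [0.40, 0.30, 0.20, 0.10]
brand_b = [0.35, 0.30, 0.20, 0.15]   # talks about the dimension like A
brand_c = [0.05, 0.10, 0.25, 0.60]   # talks about it very differently
print(hellinger(brand_a, brand_b) < hellinger(brand_a, brand_c))  # → True
print(hellinger(brand_a, brand_a))                                # → 0.0
```

Applying this to every brand pair yields the distance matrix that multidimensional scaling then projects onto the 2-D perceptual map.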
Results: Static Brand Mapping
Results: Within-Brand Segmentation
Segment size is represented by the size of the circles; brand perception is represented by location on the map.
Results: Dynamic Brand Mapping
Results: Dynamics of Dimensions
How is the perception of 'Ease of Use' changing over time, e.g., for the iPhone and the BlackBerry Storm?
This can be used in conjunction with exogenous shocks, such as the BlackBerry Storm and the iPhone; volume spikes at these points as well (panel B).
Conclusions
User-generated content is a rich form of data; it can be used to extract important latent dimensions of product quality.
Dimensions differ across brands in a given market, and they also differ across markets.
The valence associated with the dimensions varies across markets.
The dimensions are corroborated by human raters and Consumer Reports.
These dimensions can be used to construct brand perceptual maps, which can be static, segmented, or dynamic.
In vertically differentiated markets (e.g., mobile phones, computers), dimensions are more objective, heterogeneity is low, and stability is high over time; the opposite is generally true of horizontally differentiated markets.