Word/Doc2Vec for Sentiment Analysis

1 Word/Doc2Vec for Sentiment Analysis
Michael Czerny DC Natural Language Processing 4/8/2015

2 Who am I? MSc Cosmology and Particle Physics
Data Scientist at L-3 Data Tactics Interested in applying forefront research in NLP and ML to industry problems @m0_z

3 Outline: What is sentiment analysis?
Previous (i.e. pre-W/D2V) approaches to SA; Word2Vec explained; how it can be used for SA; example/app(?); Doc2Vec explained; how it can be used for SA; conclusions

4 What is sentiment analysis?

5 What is sentiment analysis?
In a nutshell: extracting attitudes toward something from human language. SA aims to map qualitative data to a quantitative output (or outputs) => Positive (?) => Negative (?) (Or something else entirely?)

6 What is sentiment analysis?
Easy (relatively) for humans1, hard for machines! How do we convert human language to a machine-readable form? 1 mashable.com/2010/04/19/sentiment-analysis/

7 Previous approaches to SA

8 Previous approaches to SA
Keyword lookup: Assign sentiment score to words (“hate”: -1, “love”: +1) Aggregate scores of all words in text Overall + / - determines sentiment
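As a rough illustration (not from the talk), the keyword-lookup approach can be sketched in a few lines of Python; the tiny lexicon below is a made-up placeholder, not a real sentiment dictionary:

# A made-up toy lexicon, not a real sentiment dictionary.
SCORES = {"love": 1, "great": 1, "good": 1, "hate": -1, "bad": -1, "awful": -1}

def keyword_sentiment(text):
    total = sum(SCORES.get(word, 0) for word in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("I love this movie"))  # positive
print(keyword_sentiment("Not good at all"))    # also "positive" -- negation is missed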

9 Previous approaches to SA
Drawbacks: Need to label words Can’t implicitly capture negation (“Not good” = 0 ??) Ignores word context

10 Previous approaches to SA
Slap a classifier on it! One-hot encode text so that each word is a column (“bag of words”) “John likes to watch movies. Mary likes movies too.” => [ ] “John also likes to watch football games.” => [ ] Use these vectors as input features to some classifier with labeled data
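A hedged sketch of this bag-of-words pipeline using scikit-learn; the sentiment labels attached to the two example sentences are invented purely to show the interface:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["John likes to watch movies. Mary likes movies too.",
         "John also likes to watch football games."]
labels = [1, 0]  # invented labels, just to show the interface

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # each word in the vocabulary becomes a column of counts
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # the two count vectors

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["Mary likes football"])))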

11 Previous approaches to SA
Drawbacks: Feature space grows linearly with vocab size Ignores word context Input features contain no information on words themselves (“bad” is just as similar to “good” as “great” is)

12 Previous approaches to SA
Sentiment Treebank (actually came after W2V) Fine-grained sentiment labels for 215,154 phrases of 11,855 sentences Train a recursive neural network "bottom-up" by using child-phrase vectors to predict parent-phrase sentiment Does very well (85% test-set accuracy) on Treebank sentences2 Good at finding negation 2 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, R. Socher et al., 2013

13 Previous approaches to SA
Drawbacks: Probably does not generalize well to all tasks (phrase score for “#YOLO swag” = ??) Good in theory, hard in practice (good luck implementing it!)

14 What can: Give continuous vector representations of words? Capture context? Require minimal feature creation?

15 Answer:

16 Or… Word2Vec!3 Maps words to continuous vector representations (i.e. points in an N-dimensional space) Learns vectors from training data (generalizable!) Minimal feature creation! 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013.

17 Two methods: Skip-gram and CBOW
Word2Vec How does it work? Two methods: Skip-gram and CBOW 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013.

18 CBOW Randomly initialize input/output weight matrices of sizes VxN and NxV, where V: vocab size, N: vector size (parameter) Predict the target word (one-hot encoded) from the average of the input context word vectors using a single-layer NN Update weight matrices using SGD, backprop., and cross-entropy loss over the corpus Hidden layer size corresponds to word vector dim. 4 word2vec Parameter Learning Explained, X. Rong, 2014

19 Skip-gram Method very similar, except now we predict a window of context words given a single word vector Boils down to maximizing dot-product similarity of context words and target word5 Skip-gram typically outperforms CBOW on semantic and syntactic accuracy (Mikolov et al.) 4 word2vec Parameter Learning Explained, X. Rong, 2014 5 word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method, Y. Goldberg & O. Levy, 2014
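In gensim's Word2Vec implementation the two architectures are selected with the sg flag; a minimal sketch (toy corpus, gensim 4.x parameter names — older releases use size instead of vector_size):

from gensim.models import Word2Vec

sentences = [["i", "love", "this", "movie"],
             ["i", "hate", "this", "movie"]]  # in practice: a large tokenized corpus

# sg=0 -> CBOW (predict the target word from averaged context vectors)
# sg=1 -> skip-gram (predict the context words from the target word)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["movie"].shape)  # (100,) -- the N-dimensional word vector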

20 What does Word2Vec give us?
Vectors! More importantly, stuff like: vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
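Assuming a model trained on a large corpus — for example Google's published GoogleNews-vectors-negative300.bin file; the path below is a placeholder — the analogy can be reproduced with gensim:

from gensim.models import KeyedVectors

# Path is a placeholder for the downloaded Google News vectors.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns something like [('queen', 0.71...)]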

21 Simple vector operations give us interesting relationships:
3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013.

22 Word2Vec for Sentiment Analysis

23 Word2Vec for SA Learned W2V features => sentiment classifier
Bonus: Word2Vec has implementations in python (gensim), Java, C++, and Spark MLlib

24 Example: Tweets Methodology:
Collect tweets using emoticons (e.g. :) and :( ) as fuzzy labels for positive and negative sentiment (can quickly & easily collect many this way!) Preprocess tweets Split into train and test sets Train word2vec on the train set Average word vectors for each tweet as input to a classifier Validate model All using python!
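A sketch of the averaging and classification steps of this methodology (toy tweets and labels as placeholders; the SGD loss name differs across scikit-learn versions, "log" vs "log_loss"):

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import SGDClassifier

# Placeholders for the emoticon-labeled, preprocessed tweets from the steps above.
tokenized_tweets = [["great", "day", "so", "happy"], ["worst", "service", "ever"]]
labels = [1, 0]

w2v = Word2Vec(tokenized_tweets, vector_size=100, min_count=1)

def tweet_vector(tokens, model):
    # Average the vectors of all in-vocabulary words in a tweet.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([tweet_vector(t, w2v) for t in tokenized_tweets])
clf = SGDClassifier(loss="log_loss").fit(X, labels)  # logistic regression trained with SGD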

25 Example: Tweets Tutorial at: Word2Vec trained on ~400,000 tweets gives us 73% classification accuracy Gensim word2vec implementation Sklearn logistic-regression (SGD) classifier Improves to 77% using an ANN classifier [slide shows an ROC curve]

26 Example: Tweets Negative tweets: Positive tweets:

27 Example: Tweets Extend with neutral class (“#news” is our fuzzy label)
~83% test accuracy with ANN classifier Seems to do impossibly well for neutral…

28 Example: Tweets Neutral tweets:

29 Example: Tweets Why does averaging word vectors over a tweet work?

30 Example: Tweets Words in 2D space

31 Example: Tweets Words in 2D space

32 Example: Tweets Words in 2D space

33 Example: Convolutional Nets
Window of word vecs => convolve => classify 6 Convolutional Neural Networks for Sentence Classification, Y. Kim, 2014.
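A simplified Keras sketch of this idea (single filter size; vocabulary, sequence length, and filter settings are illustrative — Kim's paper uses multiple filter widths and pretrained word vectors):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Input(shape=(50,)),                                      # a padded sequence of 50 word indices
    Embedding(input_dim=20000, output_dim=300),              # word vectors (could be W2V-initialized)
    Conv1D(filters=100, kernel_size=5, activation="relu"),   # convolve over windows of 5 word vectors
    GlobalMaxPooling1D(),                                    # max-over-time pooling
    Dense(1, activation="sigmoid"),                          # positive / negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])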

34 Convolutional Neural Networks for Sentence Classification
6 Convolutional Neural Networks for Sentence Classification, Y. Kim, 2014.

35 But Google released 3 million word vecs trained on 100 billion words!
Drawbacks: Quality depends on input data, number of samples, and size of vectors (possibly long computation time!) But Google released 3 million word vecs trained on 100 billion words! Averaging vecs does not work well (in my experience) on large text (> tweet level) W2V cannot provide fixed-length feature vectors for variable-length text (pretty much everything!) 3 Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, 2013.

36 Doc2Vec

37 Doc2Vec7 Generalizes W2V to whole documents (phrases, sentences, etc.)
Provides fixed-length vectors Distributed Memory (DM) and Distributed Bag of Words (DBOW) 7 Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014

38 Distributed Memory (DM)
Assign and randomly initialize a paragraph vector for each doc Predict the next word using context words + paragraph vec Slide the context window across the doc but keep the paragraph vec fixed (hence distributed memory) Updating done via SGD and backprop. 7 Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014
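A minimal DM training sketch with gensim's Doc2Vec (toy two-document corpus; dm=1 selects the Distributed Memory architecture, parameter names follow gensim 4.x):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag; its paragraph vector is learned under that tag.
docs = [TaggedDocument(words=["great", "movie", "loved", "it"], tags=["doc_0"]),
        TaggedDocument(words=["terrible", "plot", "awful", "acting"], tags=["doc_1"])]

dm_model = Doc2Vec(docs, dm=1, vector_size=100, window=5, min_count=1, epochs=20)
print(dm_model.dv["doc_0"].shape)  # (100,) -- the learned paragraph vector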

39 Distributed Bag of Words (DBOW)
ONLY use the paragraph vec (no word vecs!) Take a window of words in the paragraph and randomly sample which one to predict using the paragraph vec (ignores word ordering) Simpler, more memory efficient DM typically outperforms DBOW (but DM + DBOW is even better!) 7 Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014
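The same gensim constructor with dm=0 gives DBOW; concatenating the DM and DBOW paragraph vectors, as the paper recommends, is then a simple numpy operation (again a toy, self-contained sketch):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["great", "movie", "loved", "it"], ["doc_0"]),
        TaggedDocument(["terrible", "plot", "awful", "acting"], ["doc_1"])]

dm = Doc2Vec(docs, dm=1, vector_size=100, min_count=1, epochs=20)    # Distributed Memory
dbow = Doc2Vec(docs, dm=0, vector_size=100, min_count=1, epochs=20)  # Distributed Bag of Words

# Joint DM+DBOW feature vector for a downstream sentiment classifier
combined = np.concatenate([dm.dv["doc_0"], dbow.dv["doc_0"]])
print(combined.shape)  # (200,)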

40 How does it perform? Outperforms the Sentiment Treebank recursive NN (and everything else) on its own dataset on both coarse and fine-grained sentiment classification Paragraph vec + 7 words to predict the 8th word Concatenates 400-dim. DBOW and DM vecs as input Predicts test-set paragraph vecs from frozen train-set word vecs 7 Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014

41 Outperforms everything on Stanford IMDB movie review data set
How does it perform? Outperforms everything on the Stanford IMDB movie review data set Paragraph vec + 9 words to predict the 10th word Concatenates 400-dim. DBOW and DM vecs as input Predicts test-set paragraph vecs from frozen train-set word vecs 7 Distributed Representations of Sentences and Documents, Q. V. Le & T. Mikolov, 2014

42 Doc2Vec on Wikipedia8 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014

43 Doc2Vec on Wikipedia LDA vs. Doc2Vec for nearest neighbors to “Machine learning” (bold = unrelated to ML) 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014

44 Doc2Vec on Wikipedia 8Document Embedding with Paragraph Vectors, A. Dai et al., 2014

45 Using Doc2Vec Gensim has an implementation already!
Let’s try it on the Stanford IMDB set…
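A hedged sketch of what that experiment looks like with gensim (file names, label handling, and hyperparameters are placeholders, not the exact setup from the blog post):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

def read_reviews(path, prefix):
    # Assumes one pre-tokenized review per line -- a simplification.
    with open(path) as f:
        for i, line in enumerate(f):
            yield TaggedDocument(line.split(), ["%s_%d" % (prefix, i)])

train_docs = (list(read_reviews("train_pos.txt", "pos")) +
              list(read_reviews("train_neg.txt", "neg")))

model = Doc2Vec(train_docs, dm=1, vector_size=400, min_count=1, epochs=20)

# Use the learned paragraph vectors as features for a sentiment classifier.
X = [model.dv[d.tags[0]] for d in train_docs]
y = [1 if d.tags[0].startswith("pos") else 0 for d in train_docs]
clf = LogisticRegression(max_iter=1000).fit(X, y)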

46 Using Doc2Vec Only see ~13% test error (compared to the reported 7.42%)
See my blog post for full details: Others have had similar issues (can get to 10% error) Code used in the paper is coming in the near future! (?) Gensim cannot infer new doc vecs, but that is also coming!

47 Conclusion

48 Will Word/Doc2Vec solve all my problems?!
No, but maybe!

49 “No Free Lunch Theorem9”
Applying machine learning is an art! Test many tools and pick the right one. 9The Lack of A Priori Distinctions Between Learning Algorithms, D.H. Wolpert, 1996

50 W/D2V find context-based continuous vector representations of text
Many applications! Information retrieval Document classification Recommendation algorithms

51 Thank you!

