Presentation on theme: "Sentiment analysis using deep learning methods" - Presentation transcript:

1 Sentiment analysis using deep learning methods
Antti Keurulainen

2 Sentiment analysis using deep learning methods
Two main approaches:
- Convolutional neural networks (CNN)
- Recurrent neural networks (RNN), which can be enhanced by using LSTM
Antti Keurulainen

3 Deep Learning
One or more hidden layers, with trainable parameters in those layers. An artificial network that is organized in hierarchical layers has the capability to build hierarchical representations of the input data. Antti Keurulainen

4 Convolutional neural network (CNN)
Simple example of a convolution operation. [Figure: a small input matrix (e.g. an image) and a 2x2 filter (kernel) with weights 0.2, 0.7, -0.5, 0.7; most of the input values were lost in transcription.] Antti Keurulainen

5 Convolutional neural network (CNN)
Simple example of a convolution operation. Applying the filter to the first input patch gives
s_1 = f(0.2*1 + 0.7*3 - 0.5*6 + 0.7*0) = f(-0.7)
Note: bias terms omitted! f represents some non-linear activation function. [Figure: feature map so far: f(-0.7).] Antti Keurulainen

6 Convolutional neural network (CNN)
Simple example of a convolution operation. Sliding the filter one step gives the next entry:
s_1 = f(0.2*1 + 0.7*3 - 0.5*6 + 0.7*0) = f(-0.7)
s_2 = f(0.2*3 + 0.7*5 - 0.5*0 + 0.7*2) = f(5.5)
Note: bias terms omitted! f represents some non-linear activation function. [Figure: feature map so far: f(-0.7), f(5.5).] Antti Keurulainen

7 Convolutional neural network (CNN)
Simple example of a convolution operation. The filter continues sliding over the whole input:
s_1 = f(0.2*1 + 0.7*3 - 0.5*6 + 0.7*0) = f(-0.7)
s_2 = f(0.2*3 + 0.7*5 - 0.5*0 + 0.7*2) = f(5.5)
s_3 = ...
Note: bias terms omitted! f represents some non-linear activation function. [Figure: feature map: f(-0.7), f(5.5), ...] Antti Keurulainen
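To make the sliding-window arithmetic concrete, here is a minimal numpy sketch. Assumptions: the full input matrix was lost in transcription, so the input below holds only the two patches implied by s_1 and s_2, and tanh stands in for the unspecified activation f.

```python
import numpy as np

# Partial input reconstructed from the slides' equations; only the patches
# behind s_1 and s_2 are recoverable from the transcript.
x = np.array([[1, 3, 5],
              [6, 0, 2]])
# 2x2 filter (kernel) with the weights shown on the slides
k = np.array([[0.2, 0.7],
              [-0.5, 0.7]])

def conv2d_valid(x, k):
    """Valid convolution, stride 1, bias omitted as on the slides."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

z = conv2d_valid(x, k)    # pre-activations: [[-0.7, 5.5]]
feature_map = np.tanh(z)  # tanh assumed for the non-linearity f
```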

8 Convolutional neural network (CNN)
๐‘ฆ = ๐‘˜=1 3 ๐‘–=1 5 ๐‘—=1 5 ๐‘ฅ ๐‘˜๐‘–๐‘— ๐œƒ ๐‘˜๐‘–๐‘— After convolution, some other operations are performed such as applying the activation function (nonlinearity) and pooling During training, the values that are used in the filters are updated and gradually learned The parameter sharing concept brings invariance Antti Keurulainen

9 Recurrent Neural Network (RNN)
Shallow RNN: [Figure: an RNN unrolled in time; inputs x_{t-1}, ..., x_{t+3} feed hidden states h_{t-1}, ..., h_{t+3} through input weights U, hidden states are connected recurrently through W, and outputs o_{t-1}, ..., o_{t+3} are produced through V and compared against targets y_{t-1}, ..., y_{t+3} by losses L_{t-1}, ..., L_{t+3}.] Source: Goodfellow, I., Bengio, Y., Courville, A., Deep Learning. Antti Keurulainen

10 Recurrent Neural Network (RNN)
๐‘ฅ ๐‘กโˆ’1 ๐‘ฅ ๐‘ก ๐‘ฅ ๐‘ก+1 ๐‘ฅ ๐‘ก+2 ๐‘ฅ ๐‘ก+3 โ„Ž 1 ๐‘กโˆ’1 โ„Ž 1 ๐‘ก โ„Ž 1 ๐‘ก+1 โ„Ž 1 ๐‘ก+2 โ„Ž 1 ๐‘ก+3 ๐‘œ ๐‘กโˆ’1 ๐‘œ ๐‘ก ๐‘œ ๐‘ก+1 ๐‘œ ๐‘ก+2 ๐‘œ ๐‘ก+3 . . . U W1 V1 ๐ฟ ๐‘กโˆ’1 ๐ฟ ๐‘ก ๐ฟ ๐‘ก+1 ๐ฟ ๐‘ก+2 ๐ฟ ๐‘ก+3 ๐‘ฆ ๐‘กโˆ’1 ๐‘ฆ ๐‘ก ๐‘ฆ ๐‘ก+1 ๐‘ฆ ๐‘ก+2 ๐‘ฆ ๐‘ก+3 โ„Ž 2 ๐‘กโˆ’1 โ„Ž 2 ๐‘ก โ„Ž 2 ๐‘ก+1 โ„Ž 2 ๐‘ก+2 โ„Ž 2 ๐‘ก+3 W2 V2 Deep RNN example: Antti Keurulainen

11 Vanishing gradient problem and LSTM
Problem: gradients propagate over many time steps, which involves repeated multiplication by the recurrent weight matrix -> vanishing or exploding gradients. A toy illustration follows the figure. [Figure: the unrolled RNN from slide 9.] Antti Keurulainen
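The following toy snippet (not from the slides) illustrates the effect; the 4-dimensional state, the scale factors 0.5 and 1.5, and the 50 steps are arbitrary choices:

```python
import numpy as np

# Backpropagating through T steps multiplies the gradient by the recurrent
# weight matrix T times (nonlinearities ignored here), so its norm shrinks
# toward zero or blows up depending on the matrix's largest singular value.
g0 = np.ones(4)
for scale in (0.5, 1.5):       # hypothetical recurrent weight scales
    W = scale * np.eye(4)
    g = g0.copy()
    for _ in range(50):        # 50 time steps
        g = W.T @ g
    print(scale, np.linalg.norm(g))  # 0.5 -> ~1e-15 (vanishes), 1.5 -> ~1e9 (explodes)
```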

12 โ„Ž ๐‘ก = ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ˆ ๐‘ ๐‘ฅ ๐‘ก + ๐‘Š ๐‘ โ„Ž ๐‘กโˆ’1 + ๐‘ ๐‘
Standard RNN cell โ„Ž ๐‘ก = ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ˆ ๐‘ ๐‘ฅ ๐‘ก + ๐‘Š ๐‘ โ„Ž ๐‘กโˆ’1 + ๐‘ ๐‘ ๐’‰ ๐’• Vanilla RNN ๐’‰ ๐’•โˆ’๐Ÿ ๐’‰ ๐’• ๐’•๐’‚๐’๐’‰ ๐‘ˆ ๐‘Š ๐’™ ๐’• Visualization idea by Christopher Olah Antti Keurulainen
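A minimal numpy sketch of this update, with assumed sizes (300-d input, 60-d hidden state); the function name rnn_step is hypothetical:

```python
import numpy as np

def rnn_step(x_t, h_prev, U_c, W_c, b_c):
    """One vanilla RNN step: h_t = tanh(U_c x_t + W_c h_{t-1} + b_c)."""
    return np.tanh(U_c @ x_t + W_c @ h_prev + b_c)

# Example with assumed sizes: 300-d input, 60-d hidden state
rng = np.random.default_rng(0)
U_c = rng.normal(scale=0.1, size=(60, 300))
W_c = rng.normal(scale=0.1, size=(60, 60))
b_c = np.zeros(60)
h_t = rnn_step(rng.normal(size=300), np.zeros(60), U_c, W_c, b_c)
```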

13 LSTM
s̃_t = tanh(U_c x_t + W_c h_{t-1} + b_c)   (candidate state)
f_t = σ(U_f x_t + W_f h_{t-1} + b_f)   (forget gate)
i_t = σ(U_i x_t + W_i h_{t-1} + b_i)   (input gate)
o_t = σ(U_o x_t + W_o h_{t-1} + b_o)   (output gate)
s_t = f_t ∘ s_{t-1} + i_t ∘ s̃_t   (cell state)
h_t = o_t ∘ tanh(s_t)   (hidden state)
[Figure: LSTM cell with the gates f_t, i_t, o_t controlling the cell state s_t. Visualization idea by Christopher Olah.] Antti Keurulainen

14 Sentiment analysis
Sentiment analysis is a collection of methods whose main intent is to determine the opinion or attitude expressed, for example, in a sentence of natural language. Antti Keurulainen

15 Sentiment analysis using CNNs
Analysis based on Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1746-1751. A simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. Good results are obtained by using pre-trained word vectors, and results are further improved by training the word vectors for specific tasks. Antti Keurulainen

16 Sentiment analysis using CNNs
Simple CNN model for sentiment analysis. [Figure: model architecture from Kim, Y. (2014), Convolutional Neural Networks for Sentence Classification.] Antti Keurulainen

17 Sentiment analysis using CNNs
- Multiple filter sizes (3, 4, 5) produce several feature maps (100 of each size) -> on the order of 0.3-0.4M parameters
- Max-over-time pooling is used to select the most important feature
- Two input channels are used, one with static word vectors and the other with trainable vectors
- A fully connected softmax layer on top produces probabilities for each class
- Dropout is used in the fully connected layer for regularization, with an L2-norm constraint on the weight vectors; early stopping is used
- Stochastic gradient descent updates use the Adadelta update rule
- Pre-trained word2vec vectors are used, trained on 100B words of Google News text, 300 dimensions
A minimal sketch of this architecture follows the list. Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Antti Keurulainen
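A minimal single-channel sketch of this architecture in tf.keras; the sentence length (50) and the ReLU activation are assumptions not fixed by the transcript:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Single-channel Kim-style CNN on pre-embedded sentences.
seq_len, emb_dim, n_classes = 50, 300, 2
inp = layers.Input(shape=(seq_len, emb_dim))              # embedded sentence
pooled = []
for size in (3, 4, 5):                                    # three filter sizes
    c = layers.Conv1D(100, size, activation='relu')(inp)  # 100 feature maps each
    pooled.append(layers.GlobalMaxPooling1D()(c))         # max-over-time pooling
h = layers.Concatenate()(pooled)                          # 300-d feature vector
h = layers.Dropout(0.5)(h)                                # dropout for regularization
out = layers.Dense(n_classes, activation='softmax')(h)    # class probabilities
model = tf.keras.Model(inp, out)
model.compile(optimizer='adadelta', loss='sparse_categorical_crossentropy')
```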

18 Sentiment analysis using CNNs
Models in [Kim2014]:
- CNN-rand: all words are initialized randomly and trained
- CNN-static: initialized with word2vec (unknown words initialized randomly) and kept static
- CNN-non-static: initialized with word2vec and trained further
- CNN-multichannel: initialized with word2vec; one channel stays static and the other channel is further trained
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Antti Keurulainen

19 Sentiment analysis using CNNs
Datasets [Kim 2014]:
- "Movie review data". Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, EMNLP 2002. Binary classification.
- "Stanford Sentiment Treebank 1". Extension of the above, with fine-grained labels added. Socher et al. 2013.
- "Stanford Sentiment Treebank 2". Same as above but with neutral sentences removed and binary labels.
- "Subjectivity dataset". 5,000 subjective and 5,000 objective processed sentences. Pang/Lee, ACL 2004.
- TREC question dataset: classifying a question type into 6 classes.
- Customer review dataset: reviews of various products like cameras, MP3 players etc. Hu & Liu 2004.
- MPQA dataset: opinion polarity subtask from the MPQA dataset. Wiebe et al. 2005.
Antti Keurulainen

20 Sentiment analysis using CNNs
Results [Kim 2014]. [Table: accuracy comparison across the model variants and datasets; not reproduced in the transcript.] Antti Keurulainen

21 Sentiment analysis using RNNs
Antti Keurulainen

22 Sentiment analysis using RNNs
Analysis based on [Wan2015]: Wang, X., Liu, Y., Sun, C., Wang, B., & Wang, X. (2015). Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1343-1353, Beijing, China. Association for Computational Linguistics. Twitter sentiment prediction, using a simple RNN or an LSTM recurrent network. Antti Keurulainen

23 Sentiment analysis using RNNs
- The word vectors created by co-occurrence statistics are not always suitable for sentiment analysis (e.g., the words "good" and "bad" are close in word2vec representations)
- Sentiments are expressed by phrases rather than individual words -> how to capture a representation of the whole sentence?
- Additional challenge: a recurrent neural network (RNN) has difficulties maintaining longer time dependencies -> LSTM networks
- It has been shown that further task-specific training of the pre-trained word vectors helps capture the polarity information of the sentences
Antti Keurulainen

24 Sentiment analysis using RNNs
Basic RNN architecture [Wan2015]. [Figure: the RNN architecture used in [Wan2015].] In [Wan2015], the sentence is represented by the hidden state of the last time step. Antti Keurulainen

25 Sentiment analysis using RNNs
RNN-FLT (Recurrent Neural Network with Fixed Lookup-Table): a simple implementation of the recurrent sentiment classifier. The forward pass and backpropagation equations appear as figures in the original slides; in their notation, f represents the sigmoid function, w are the weights, e are the word embeddings, v holds the hidden-to-output weights, t is the time step, and T is the last time step. The loss O is calculated using cross-entropy, and training is conducted with stochastic gradient descent (SGD). A rough sketch of the forward pass follows. Antti Keurulainen
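Since the original equations are not reproduced in the transcript, the following is only a rough numpy sketch of such a forward pass under the slide's notation; E (the fixed lookup table), the weight names, and the final prediction layer are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_flt_forward(word_ids, E, U, W, v):
    """Rough sketch: E is the fixed lookup table, U/W the input/recurrent
    weights, v the hidden-to-output weights (bias terms omitted)."""
    h = np.zeros(W.shape[0])
    for t in word_ids:                  # one step per word in the tweet
        h = sigmoid(U @ E[t] + W @ h)   # hidden state update, f = sigmoid
    return sigmoid(v @ h)               # polarity from the last hidden state
```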

26 Sentiment analysis using RNNs
RNN-TLT (Recurrent Neural Network with Trainable Lookup-Table) and LSTM-TLT: implementations that further train the pre-trained word vectors. In LSTM-TLT, the classifier uses LSTM blocks instead of regular RNN blocks. Each regular RNN block is replaced by an LSTM block -> much more complicated functionality -> helps combat the vanishing gradient problem. Antti Keurulainen

27 Sentiment analysis using RNNs
Experiments are run on the Stanford Twitter Sentiment corpus (STS) of positive and negative tweets. The manually labeled test set includes 177 negative and 182 positive tweets; training-set sentiment labels are derived from emoticons. 25-dimensional word vectors were trained with word2vec on 1.56M tweets from the training set; hidden layer size 60. Compared models:
- Non-neural classifiers (Naive Bayes, Maximum Entropy, Support Vector Machine)
- Neural Bag-of-Words, with a summation of word vectors as input
- Dynamic Convolutional Neural Network
- Recursive Autoencoder
- The models presented in this paper
Antti Keurulainen

28 Sentiment analysis using RNNs
Additional experiments are run on the human-labeled SemEval dataset. The dataset has a training set of 4,099 tweets, a development set of 735, and a test set of 1,742. Fixed word vectors are used, pre-trained with word2vec on the STS dataset, this time with 300 dimensions. Antti Keurulainen

29 Sentiment analysis using RNNs
Which words change most when training the pre-trained word2vec vectors? Antti Keurulainen

30 Sentiment analysis using RNNs
How do the sentiment words move in 2-D space during training? The 20 most negative and 20 most positive words were tracked during training. [Figures: word positions before tuning and after tuning.] Antti Keurulainen

31 Sentiment analysis experiments using python libraries and tensorflow
Antti Keurulainen

32 Sentiment analysis experiments using python libraries and tensorflow
Dataset: the IMDb movie review dataset, with 25,000 labeled reviews in the training set and 25,000 unlabeled reviews in the test set. Models:
- Bag of words with random forest (pandas, numpy, scikit-learn)
- Word2vec with random forest (pandas, numpy, scikit-learn, gensim)
- Word2vec with feed-forward network (pandas, numpy, gensim, tensorflow)
Antti Keurulainen

33 Sentiment analysis experiments using python libraries and tensorflow
Baseline with bag of words and random forest
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it:
- import the data into a pandas frame
- use BeautifulSoup to remove HTML tagging
- use a regular expression to remove non-letters
- convert to lowercase
- remove stopwords
Step 2: Create bag-of-words representations of the individual reviews (sklearn CountVectorizer).
Step 3: Fit a random forest model on the training set, run predictions, and submit to Kaggle.com for test-set accuracy. A sketch of this pipeline follows.
Accuracy: 85.6%
Antti Keurulainen
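A sketch of Steps 1-3, assuming the Kaggle file name labeledTrainData.tsv and 5,000 bag-of-words features (the actual feature count is not given in the transcript):

```python
import re
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # may require nltk.download('stopwords')
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv('labeledTrainData.tsv', sep='\t')  # assumed Kaggle file name
stops = set(stopwords.words('english'))

def clean(review):
    text = BeautifulSoup(review, 'html.parser').get_text()  # strip HTML tags
    text = re.sub('[^a-zA-Z]', ' ', text).lower()           # letters only, lowercase
    return ' '.join(w for w in text.split() if w not in stops)

docs = train['review'].apply(clean)
vec = CountVectorizer(max_features=5000)   # bag-of-words features (count assumed)
X = vec.fit_transform(docs)
clf = RandomForestClassifier(n_estimators=100).fit(X, train['sentiment'])
```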

34 Sentiment analysis experiments using python libraries and tensorflow
Word2vec with random forest
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it:
- import the data into a pandas frame
- use BeautifulSoup to remove HTML tagging
- use a regular expression to remove non-letters
- convert to lowercase
Step 2: Create word2vec representations of the individual words (gensim word2vec).
Step 3: Average all word vectors in a review to form one single vector per review (see the sketch below).
Step 4: Fit a random forest model on the training set, run predictions, and submit to Kaggle.com for test-set accuracy.
Accuracy: 83.3%
Antti Keurulainen
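A sketch of Steps 2-3 with gensim, using the hyperparameters listed on slide 36; `sentences` is assumed to be the list of tokenized, cleaned reviews:

```python
import numpy as np
from gensim.models import Word2Vec

# Train word2vec on the tokenized reviews (hyperparameters from slide 36).
model = Word2Vec(sentences, vector_size=300, window=10,
                 min_count=40, sample=1e-3)

def review_vector(tokens, model):
    """Average the vectors of the in-vocabulary words of one review."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([review_vector(tokens, model) for tokens in sentences])
# X can then be fed to the same RandomForestClassifier as in the baseline.
```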

35 Sentiment analysis experiments using python libraries and tensorflow
Word2vec with deep learning
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it:
- import the data into a pandas frame
- use BeautifulSoup to remove HTML tagging
- use a regular expression to remove non-letters
- convert to lowercase
Step 2: Create word2vec representations of the individual words (gensim word2vec).
Step 3: Average all word vectors in a review to form one single vector for the review.
Step 4: Fit a feed-forward deep learning model on the training set, run predictions, and submit to Kaggle.com for test-set accuracy. A sketch of the model follows.
Accuracy: 87.0%
Antti Keurulainen
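A sketch of a possible Step 4 model in tf.keras, matching slide 36 (3 hidden layers, ReLU, Adam, cross-entropy); the hidden layer widths are assumptions, since the transcript leaves them blank:

```python
import tensorflow as tf

# Feed-forward classifier on the averaged 300-d review vectors.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(300,)),
    tf.keras.layers.Dense(256, activation='relu'),    # widths assumed
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),   # positive / negative
])
model.compile(optimizer='adam',                       # Adam, as on slide 36
              loss='sparse_categorical_crossentropy', # cross-entropy loss
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10)  # number of training steps was also tuned
```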

36 A lot of hyperparameters and other decisions
- Remove stopwords? (yes for word2vec, no for sentiment analysis)
- Remove punctuation? (yes)
- Dimension of word vectors (300)
- Word2vec window size (10)
- Downsampling for frequent words (1e-3)
- Minimum word count for word2vec (40)
- Deep Learning (DL) number of layers (3)
- DL width of the hidden layers ( )
- DL activation functions (ReLU)
- DL use dropout (tried, did not help -> no)
- DL initialization (random uniform between 0 and 1)
- DL optimizer (Adam)
- DL Adam optimizer parameters: learning rate + 3 others
- DL number of training steps
- DL regularization method and its parameters (none)
- DL loss function (cross entropy)
Antti Keurulainen

37 Homework
Consider the CNN-based single-channel version of the model architecture presented in [Kim2014]. Consider a scenario where the pre-trained static word vectors have 300 dimensions, and three different filter sizes are used that span over 3, 4, and 5 words. Each filter size produces 100 feature maps. The feature maps are calculated using formula (2): the weight matrix is multiplied with the input vectors, a bias is added, and the result is passed through a non-linearity such as tanh. Then the feature maps are max-pooled, and the pooled results are connected to the final two-dimensional output layer in a fully connected manner. Calculate the number of trainable parameters (weights and biases) in this model. A sketch of one way to count them follows. Antti Keurulainen
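One way to count the parameters under the stated assumptions (a sketch, not an official answer key):

```python
# Parameter count for the homework setup: 300-d vectors, filter sizes 3/4/5
# with 100 feature maps each, and a 2-class fully connected output layer.
emb_dim, n_maps, n_classes = 300, 100, 2
conv = sum((h * emb_dim + 1) * n_maps for h in (3, 4, 5))  # filter weights + biases
fc = (3 * n_maps) * n_classes + n_classes                  # 300 pooled features -> output
print(conv, fc, conv + fc)  # 360300 + 602 = 360902, in the 0.3-0.4M range of slide 17
```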

