Automated Sentiment Analysis from Blogs: Predicting the Change in Stock Magnitude Saleh Alshepani (BH115) Supervisor : Dr Najeeb Abbas Al-Sammarraie.

1 Automated Sentiment Analysis from Blogs: Predicting the Change in Stock Magnitude Saleh Alshepani (BH115) Supervisor : Dr Najeeb Abbas Al-Sammarraie

2 Outline Definitions Problem Statement Significance of Study Research Questions Research Objectives Literature Review Research Methodology

3 Definitions Sentiment An emotion, thought, view, judgment or idea prompted or colored by emotion. Sentiment Classification (Polarity) Used to identify whether the opinion expressed in a text is positive or negative.

4 Definitions Sentiment Analysis is concerned with automatically identifying (extracting and classifying) the sentiment or opinion expressed in a given piece of typically unstructured text. It is the computational study of opinions, sentiments and emotions expressed in text.

5 Definitions Precision is the probability that a document predicted to be in class “positive” truly belongs to this class. Recall is the probability that a document belonging to class “positive” is classified into this class. Accuracy is the proportion of documents that are classified correctly.
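The three measures defined on this slide can be computed from a list of true labels and a list of predicted labels. The following is a minimal sketch; the function name `precision_recall_accuracy` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the presentation.

```python
def precision_recall_accuracy(gold, predicted, positive="positive"):
    """Return (precision, recall, accuracy) with respect to the `positive` class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # True positives: predicted positive and truly positive.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: predicted positive but truly another class.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: truly positive but predicted as another class.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    # Assumption: report 0.0 instead of failing when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy
```

For example, with three truly positive documents of which two are predicted positive, and one negative document predicted negative, precision is 1.0, recall is 2/3 and accuracy is 0.75.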

6 Problem Statement Given the sheer volume of blogging data already available online, as well as the enormous amounts being added every day, it would be impossible to read all of the blogs in a thousand lifetimes. Today, there are billions of words already published online and tens of millions more being added by millions of bloggers around the world every day. These blogs pertain to an incredibly diverse range of personal interests and professional and unprofessional pursuits.

7 Problem Statement (cont) Bloggers may spill their hearts out for the whole world to see, and be devastated or rejuvenated by the responses they receive – or both. When tens of millions of people are engaging in this type of behavior on a regular basis, there is clearly some valuable information that can be gained if the right analytical techniques are used in a thoughtful fashion.

8 Problem Statement (cont) Establishing a correlation between financial news articles and quantifiable movement of stock prices is a very difficult task. The difficulty remains even when information in the blogs does exhibit a visible impact on price.

9 Problem Statement (cont) Predicting future outcomes (such as the movement of stock prices) is therefore much more difficult, because the movement of stocks does not follow the same pattern as product or movie reviews. In product reviews, for example, people have actually used the product or have friends who have used it. They then voice their opinions based on the overall experience gained from using the product.

10 Problem Statement (cont) However, the case of stock movement is different because the movement of stocks is affected by a number of factors, such as political (government policies), economic (demand) and social (perception) factors. This leaves a large research gap, as studies relating to the change, movement and expected magnitude of such changes are lacking in both the computing world and other business settings.

11 Significance of Study Automatic extraction of sentiment is very important: It reduces the cost incurred by customer surveys. It provides real-time information that companies can use for real-time changes. It is a general representation of customers' thoughts about a company. It can be used for word-of-mouth marketing.

12 Significance of Study This research is important because: It presents an overview of the importance of automatic sentiment extraction to companies, the techniques for such extraction, and its influence on management. It will help expand the understanding of the factors that influence stock movements, as well as how such factors can be mitigated or enhanced to ensure the desired outcome from such movements. It will help expand the understanding of how sentiments extracted from blogs can be used to predict the movement of stocks, thus giving investors the edge they need to make the right, successful investments.

13 Research Questions What is automated sentiment extraction and analysis? How can sentiments be automatically extracted and analysed? How can automated sentiment extraction and analysis be used to predict the movement of stock price?

14 Research Objectives To deliver a comprehensive and critical review of the relevant peer-reviewed and scholarly literature concerning automated sentiment extraction and analysis. To enhance an algorithm for carrying out automated sentiment analysis from blogs. To analyze how sentiments gathered from blogs can be used to predict the movement of stocks.

15 Literature Review Sentiment Analysis deals with finding the opinion, or sentiment polarity (usually in terms of positive or negative), in text documents such as movie reviews or product reviews (Pang & Lee, 2008). Sentiment polarity classification is conducted at the phrase, sentence or document level. Phrase - capture multiple sentiments within a sentence (Wilson et al., 2005). Sentence - classify positive and negative sentiments for every subjective sentence (Hatzivassiloglou & Wiebe, 2000; Riloff & Wiebe, 2003; Yi et al., 2003; Mullen and Collier, 2004; Pang and Lee, 2004; Riloff, Patwardhan & Wiebe, 2006). Document - classify sentiments in news articles, web forum postings or movie reviews (Wiebe et al., 2001; Pang et al., 2002; Mullen and Collier, 2004; Pang and Lee, 2004; Whitelaw et al., 2005; Melville, Gryc & Lawrence, 2009; Zhang et al., 2011).

16 Sentiment Classification Techniques Analysis techniques are employed to recognize sentence, phrase, word and text meanings and to measure and predict emotional and psychological aspects of the texts or documents. They are domain specific (product reviews, movie reviews, news and blogs, etc).

17 Lexicon based Techniques (Kennedy and Inkpen, 2006; Turney, 2002; Kamps, Marx & Mokken, 2004) This approach makes use of a dictionary, or lexicon, of pre-tagged words. Each word in the blog text is compared against those in the dictionary, and the polarity value of each matching word is added to the “total polarity score” of the text. If the polarity score of a text is positive in relation to the dictionary, then the text is classified as positive; otherwise it is classified as negative.
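The scoring rule described on this slide can be sketched in a few lines. The lexicon below is a hypothetical toy dictionary invented for illustration (a real system would load a much larger pre-tagged lexicon), and the tokenizer is a simple assumption; neither comes from the presentation.

```python
import re

# Hypothetical toy lexicon of pre-tagged words: +1 = positive, -1 = negative.
TOY_LEXICON = {
    "good": 1, "gain": 1, "strong": 1,
    "bad": -1, "loss": -1, "weak": -1,
}

def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Return (total polarity score, label) for a piece of text."""
    # Assumption: lowercase the text and keep alphabetic runs as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute 0 to the total score.
    score = sum(lexicon.get(token, 0) for token in tokens)
    # As on the slide: a positive score means "positive", anything else "negative".
    label = "positive" if score > 0 else "negative"
    return score, label
```

Note that under this rule a text with no lexicon words scores 0 and is therefore labelled negative; a practical system might prefer a separate neutral class.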

18 Lexicon based Techniques It is difficult to discover the lexical information that works best, as a result of the large volume of work input into the system and because statement classification depends on the score it receives. Kennedy and Inkpen (2006) and Hatzivassiloglou and Wiebe (2000) used mainly hand-tagged adjectival lexicons and reported accuracy of 62% and 80% respectively. Kamps et al. (2004) and Andreevskaia et al. (2007) used the WordNet database to determine the polarity of words and reported accuracy of 64%.

19 Supervised Machine Learning Techniques This approach requires a training document of textual content, or a data corpus, which serves as a preparation document for classification learning. A series of feature vectors is chosen, and a collection of tagged corpora is used to train classifiers that can then be applied to an untagged corpus of text. Before training the classifier, you must select the words/features that you will use in your model (not all the words returned by the tokenization algorithm will be used, because many of them are irrelevant).
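One way to picture the feature selection step is to tokenize the training documents, drop irrelevant words, and keep the most widely used remaining words as the vocabulary. The sketch below is only an illustration: the stop-word list, the document-frequency ranking and the names `select_features` and `to_feature_vector` are assumptions, not the method used in the research.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "this"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def select_features(documents, max_features=1000, stop_words=STOP_WORDS):
    """Keep the words that appear in the most documents, ignoring stop words."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)) - stop_words)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_features]]

def to_feature_vector(text, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]
```

Each document then becomes a fixed-length vector over the selected vocabulary, which is the form most classifiers expect.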

20 Supervised Machine Learning Techniques The selection of features is important in order to increase the success rate of the classification. The most common features for this approach are unigrams (single words) and n-grams (two or more words). Support Vector Machines (SVMs), Maximum Entropy (ME) and the Naive Bayes algorithm are the most commonly employed classification algorithms (Witten & Frank, 2005), with accuracy ranging between 63% and 82%.

21 Supervised Machine Learning Techniques The chosen classifier must be trained on a set of pre-tagged data (the training set), in which the polarity of each item has already been determined. Each algorithm requires a different configuration, such as the number of selected features. Trial and error must be used to find the configuration that works best.
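To make the training step concrete, here is a minimal Naive Bayes classifier trained on pre-tagged text. It is a sketch under stated assumptions: the class name `NaiveBayesSentiment`, the use of term presence as the feature, and add-one (Laplace) smoothing controlled by `alpha` are illustrative choices, not the configuration reported in the presentation.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesSentiment:
    """Tiny Naive Bayes text classifier (illustrative, not production code)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumption)
        self.class_doc_counts = Counter()       # documents seen per label
        self.word_counts = defaultdict(Counter) # per-label word counts
        self.vocabulary = set()

    def train(self, labelled_docs):
        """labelled_docs: iterable of (text, label) pairs from the training set."""
        for text, label in labelled_docs:
            self.class_doc_counts[label] += 1
            for token in set(tokenize(text)):   # count each word once per document
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in set(tokenize(text)):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Changing `alpha` or the set of features kept in the vocabulary is the kind of configuration that, as the slide notes, has to be tuned by trial and error.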

22 Classification Technique used Which classification technique to choose depends heavily on the application, domain and language. Lexicon based techniques with large dictionaries enable us to achieve very good results. Nevertheless, they require a lexicon, which is not available in all languages. Machine Learning based techniques deliver good results; nevertheless, they require obtaining data for training the classifiers.

23 Sentiment and Stock Movement Much research has been done on understanding how activities in virtual communities are correlated with market outcomes in terms of the volume, disagreement and bullishness of postings (Antweiler & Frank, 2004; Das & Chen, 2007; Das et al., 2005; Sabherwal et al., 2008). There is a strong belief that sentiments from blogs can actually be used to predict stock value, but little evidence has been published to back that belief (Tumarkin & Whitelaw, 2001; Das & Chen, 2007; Antweiler & Frank, 2004).

24 Research Methodology The approach for this research will be experimental, and a machine learning technique will be used. An algorithm for automatic extraction of sentiments from blogs will be chosen and enhanced. It will be used to extract sentiments from the blogs of a number of companies. Then the magnitude of the stock price movement of those companies will be determined in relation to the sentiments that were extracted from their corresponding blogs.

25 Classification Algorithm The following set of classification features will be used: term presence and negation. Term presence will be used because it outperforms the use of term frequency (Pang et al., 2002). Negation is used because it can potentially reverse the sentiment (Pang and Lee, 2008).
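A minimal sketch of these two features follows. The slide does not say how negation will be encoded, so the scheme here is an assumption: words after a negation word are prefixed with `NOT_` until the next punctuation mark, a convention commonly used in the sentiment classification literature. The negation word list and function names are hypothetical.

```python
import re

# Hypothetical list of negation cues; contractions ending in "n't" are also caught.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
PUNCTUATION = set(".,;:!?")

def mark_negation(text):
    """Return tokens, prefixing words in a negated scope with 'NOT_'."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    marked, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation ends the negated scope
            continue
        if token in NEGATION_WORDS or token.endswith("n't"):
            marked.append(token)      # keep the negation word itself
            negating = True
            continue
        marked.append("NOT_" + token if negating else token)
    return marked

def term_presence(text):
    """Term presence: each (possibly negated) term counts once, however often it occurs."""
    return set(mark_negation(text))
```

With this encoding, "good" and "NOT_good" become distinct features, so a classifier can learn opposite polarities for them.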

26 Classification Algorithm For term presence, uni-grams and bi-grams will be used because they capture sentiment in different ways. Uni-grams are best for capturing coverage of the data, while bi-grams are better for capturing patterns. Other features, such as Part-of-Speech (POS), bag-of-words, pattern and punctuation, will also be experimented with. A feature or combination of features will then be chosen according to which performs best.
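As a small illustration of how uni-gram and bi-gram presence features can be extracted from a blog post, consider the sketch below. The tokenizer and the function names are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigram_bigram_presence(text):
    """Set of uni-grams and bi-grams present in the text (term presence)."""
    tokens = tokenize(text)
    return set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
```

For the text "Shares fell sharply", the uni-grams give coverage of individual words, while the bi-gram "fell sharply" keeps a pattern that the single words lose.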

27 Classification Algorithm The following classification algorithms will be used: Support Vector Machines (SVMs), Maximum Entropy (ME) and the Naive Bayes algorithm. A voting scheme between the three will determine the sentiment of the blog.
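The slide does not spell out the voting scheme, so the sketch below assumes simple majority voting: each classifier casts one vote and the most common label wins. The classifiers are represented as plain callables, and the names `majority_vote` and `ensemble_predict` are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    # With three classifiers and two labels a tie cannot occur.
    label, _ = Counter(predictions).most_common(1)[0]
    return label

def ensemble_predict(classifiers, features):
    """Ask every classifier (a callable returning a label) and combine the votes."""
    return majority_vote([classify(features) for classify in classifiers])
```

In use, `classifiers` would hold the prediction functions of the trained SVM, Maximum Entropy and Naive Bayes models, and `features` the feature representation of one blog post.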

28 Stock Movement Magnitude Data will be collected from the daily closing prices of a specific number of companies. A period of six months will be used, and the period will be extended if it does not yield enough blogs for the experiment. The magnitude of price change will be calculated between day t and day t-1. The relationship between the outcome (positive, negative) of the extracted sentiment and the magnitude of the stock price change for the same period will be analyzed to find any correlation between sentiments and stock price.
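The calculation outlined on this slide can be sketched as follows. Several details are assumptions, since the slide does not fix them: the change between day t-1 and day t is taken as the relative change in closing price, its magnitude as the absolute value, sentiment is encoded as +1 (positive) or -1 (negative), and the Pearson coefficient stands in for "correlation". All function names are hypothetical.

```python
import math

def daily_changes(closing_prices):
    """Signed relative change (p_t - p_{t-1}) / p_{t-1} for each day t >= 1.

    Assumes every closing price is non-zero.
    """
    return [(curr - prev) / prev
            for prev, curr in zip(closing_prices, closing_prices[1:])]

def daily_magnitudes(closing_prices):
    """Magnitude (absolute value) of each daily relative change."""
    return [abs(change) for change in daily_changes(closing_prices)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: report no correlation for a constant series
    return cov / math.sqrt(var_x * var_y)
```

A typical use would pair each day's encoded sentiment with that day's change or magnitude, for example `pearson(sentiment_scores, daily_changes(prices))`, where `sentiment_scores` has one entry per change.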
