Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree
Oct 10th, 2014
Bo-Hyun Kim, Sr. Software Engineer, with Lina Chen, Sr. Software Engineer


1 Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree
Oct 10th, 2014
Bo-Hyun Kim, Sr. Software Engineer, with Lina Chen, Sr. Software Engineer
HP Big Data Business Unit
#GHC14

2 What to Expect
• Sentiment Analysis
  − What is it?
  − Why is it interesting?
  − How HP Vertica Pulse works
  − Achieving greater accuracy
  − A different point of view using the most-mentioned word tree

3 What I Expect
• A 5-star rating on the GHC app
• I just expect you to enjoy and learn!

4 Sentiment Analysis
• In plain English
  − Sentiment analysis, the process of automatically detecting if a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”), is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]

5 Gimme Examples!
• Also known as:
  − Opinion Mining
  − Text Mining
• Determine people’s general opinion
  − “I just got a new car, and I’m loving it”
  − “My new car isn’t as fast as I thought.”

6 Why are we interested?
• Increasing (every minute!) web usage
  − Articles
  − Blogs
  − Comments
• Power of Social Media
  − Online Shopping
  − Customer Reviews
  − Recommended products on Amazon
  − How other people feel about the product

7 Product Review

8 Data… Data… Data…

9 HP Vertica Pulse

10 How to Analyze?
• Lexicon-based approach – HP Labs [Zhang et al., 2011]
• Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion
• Determine the Semantic Orientation score of opinion lexicons

  Word      Semantic Orientation Value
  Fabulous  +3
  Good      +1
  Bad
  Nasty     -3
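The lookup step can be illustrated with a minimal Python sketch. It is not the HP Vertica Pulse implementation: the function names are hypothetical, the lexicon holds only the three values that survive in the table above, and treating unknown words as neutral (0) is an assumption.

```python
import re

# Semantic orientation values taken from the table above. Words that are not
# in the lexicon are treated as neutral (0); that default is an assumption.
LEXICON = {"fabulous": 3, "good": 1, "nasty": -3}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def orientation_sum(text, lexicon=LEXICON):
    """Sum the semantic orientation values of all opinion words in the text."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

print(orientation_sum("The screen is fabulous and the sound is good"))
```

A plain sum ignores which entity each opinion word refers to; it only shows how a lexicon turns words into numbers.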

11 Sentiment Scoring
• Input: text or sentence
• Output: for each attribute or entity, a sentiment score ranging from -1 to 1
  − -1: Negative sentiment
  −  0: Neutral sentiment
  −  1: Positive sentiment
• Entity-level lexicon-based sentiment scoring

12 Limitation
• Semantic Orientation value(‘missed’) = -1
• Gives more weight to words located closer to the entity being scored
• Accuracy can suffer
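One way to picture both the entity-level scoring and its limitation is a distance-weighted sum. The sketch below is an assumption, not the formula used by HP Vertica Pulse: each opinion word contributes its orientation value divided by its token distance to the entity, and the result is normalized into the range -1 to 1. The function name and the normalization are hypothetical.

```python
import re

# Values from the slides: 'missed' = -1, plus the table entries that survive.
LEXICON = {"fabulous": 3, "good": 1, "nasty": -3, "missed": -1}

def entity_score(text, entity, lexicon=LEXICON):
    """Distance-weighted sentiment for one entity, normalized to [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok == entity.lower()]
    if not positions:
        return 0.0
    weighted = 0.0  # signed contributions
    total = 0.0     # absolute contributions, used for normalization
    for i, tok in enumerate(tokens):
        value = lexicon.get(tok, 0)
        if value == 0:
            continue
        # Closer opinion words get more weight; guard against distance 0.
        distance = max(min(abs(i - p) for p in positions), 1)
        weighted += value / distance
        total += abs(value) / distance
    return weighted / total if total else 0.0

print(entity_score("I missed my phone", "phone"))
```

Here "I missed my phone" scores fully negative for "phone" even though the writer is arguably expressing attachment to it, which is the kind of error the slide refers to.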

13 Improve accuracy
• Accuracy is what we strive for!
• More robust pre-processing
  − Prune data to fit different types of user opinion (e.g., Twitter vs. YouTube comments)
• Naïve Bayes Classifier Training
• Tune accordingly

14 Data Set
• Test dataset
  − Collected by Stanford students
  − In 2009
  − Over 3 million tweets with tested scores
  − Analyzed 3,500 tweets
• Collected dataset
  − HP Vertica Pulse Twitter Connector
  − In 2014
  − Total of 1.2 million tweets

15 Data Pruning
• Remove
  − Job postings: #job, #jobs, #tweetmyjob
  − Links: http://this.is/nogood
  − Duplicates
  − Twitter-specific characters: RT, @, #
  − Emoticons: “I hate my life :-)” (sarcasm is a widespread disease)
• After pruning
  − ~287,000 tweets, 24% of the 1.2 million tweets
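The pruning rules listed above could be sketched in Python roughly as follows. The regular expressions and the function name are assumptions made for illustration; in particular, this sketch drops job-posting tweets entirely, strips only the "RT", "@" and "#" characters rather than whole mentions, and recognizes only a small set of standalone emoticons.

```python
import re

JOB_TAGS = re.compile(r"#(?:jobs?|tweetmyjob)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"\bRT\b")
# Standalone emoticons such as :-) ;) =( :P (a deliberately small set).
EMOTICON = re.compile(r"(?<!\S)[:;=]-?[)(\]\[DPp/\\|]+(?!\S)")

def prune(tweets):
    """Apply the pruning rules and return the surviving, cleaned tweets."""
    seen = set()
    kept = []
    for tweet in tweets:
        if JOB_TAGS.search(tweet):      # drop job postings entirely
            continue
        text = URL.sub(" ", tweet)      # remove links
        text = EMOTICON.sub(" ", text)  # remove emoticons
        text = RETWEET.sub(" ", text)   # remove the RT marker
        text = text.replace("@", " ").replace("#", " ")
        text = " ".join(text.split())   # collapse whitespace
        key = text.lower()
        if not text or key in seen:     # drop empties and duplicates
            continue
        seen.add(key)
        kept.append(text)
    return kept

sample = [
    "RT @bob: I love my new car http://this.is/nogood",
    "Hiring now #jobs",
    "I hate my life :-)",
    "RT @bob: I love my new car http://this.is/nogood",
]
print(prune(sample))
```

Duplicates are detected after cleaning, so two retweets that differ only in their link or emoticon collapse into a single entry.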

16 Naïve Bayes Classifier

17 Naïve Bayes Classifier
• Results:
  − Final accuracy: 0.788
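The slides do not show how the classifier was built, so the following is only a generic sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, written with the Python standard library. The class name, the whitespace tokenization, and the toy training data are all assumptions; the 0.788 accuracy above comes from the presenters' own experiment, not from this code.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)  # log prior
            n_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                logp += math.log((count + 1) / (n_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

def accuracy(model, labeled_texts):
    """Fraction of texts whose predicted label matches the given label."""
    correct = sum(model.predict(text) == label for text, label in labeled_texts)
    return correct / len(labeled_texts)

model = NaiveBayes()
model.train([
    ("i love my new car", "pos"), ("loving it", "pos"), ("great car", "pos"),
    ("i hate my life", "neg"), ("car is not fast", "neg"), ("bad service", "neg"),
])
print(model.predict("love it"))
```

In practice the accuracy would be measured on held-out labeled tweets, never on the tweets used for training.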

18 Tuning Pulse
• Positive words
• Negative words
• Neutral words
• White lists
• Stop words
• Synonym mappings
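How such tuning lists might interact can be sketched in plain Python. This is not the HP Vertica Pulse API: the function names are hypothetical, the white list is interpreted here as the set of attributes allowed to receive a score, and the ±1 values given to user-added words are assumptions.

```python
def normalize(tokens, stop_words, synonyms):
    """Drop stop words, then map each remaining token to its canonical synonym."""
    return [synonyms.get(tok, tok) for tok in tokens if tok not in stop_words]

def tuned_lexicon(base, positive, negative, neutral):
    """Return a copy of the base lexicon with user-supplied overrides applied."""
    lexicon = dict(base)
    lexicon.update({word: 1 for word in positive})   # assumed magnitude
    lexicon.update({word: -1 for word in negative})  # assumed magnitude
    lexicon.update({word: 0 for word in neutral})    # e.g. neutralize 'missed'
    return lexicon

def scored_attributes(tokens, white_list):
    """Keep only the tokens that the white list allows to be scored."""
    return [tok for tok in tokens if tok in white_list]

tokens = normalize(["my", "automobile", "is", "slow"],
                   stop_words={"my", "is"},
                   synonyms={"automobile": "car"})
lexicon = tuned_lexicon({"missed": -1, "good": 1},
                        positive={"loving"}, negative={"slow"}, neutral={"missed"})
print(tokens, lexicon, scored_attributes(tokens, {"car"}))
```

Marking "missed" as neutral in this way is one possible response to the limitation noted on slide 12.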

19 Accuracy Comparison
• Sentiment scores generated for each phase

20 Trend/Targeted Analysis
• Targeted dataset analysis can help improve accuracy
• Identify the most-mentioned words
  − Use the most-recurrent words to narrow the scope of analysis
• Find new trends
  − Government healthcare (2009) vs. Obamacare (2014)
• Are we looking at the targeted data?
  − “Solve healthcare challenges with technology!”
  − “Healthcare After ObamaCare”
  − “Get affordable healthcare at HealthCare.gov”
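Identifying the most-mentioned words amounts to a frequency count. The sketch below uses the three example tweets from this slide; the stop-word list, the function name, and the choice to count each word at most once per tweet are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"with", "after", "at", "get"}  # illustrative list only

def most_mentioned(tweets, top_n=3, stop_words=STOP_WORDS):
    """Return the top_n words ranked by the number of tweets mentioning them."""
    counts = Counter()
    for tweet in tweets:
        # A set counts each word at most once per tweet.
        words = set(re.findall(r"[a-z]+", tweet.lower())) - stop_words
        counts.update(words)
    return counts.most_common(top_n)

tweets = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
]
print(most_mentioned(tweets, top_n=1))
```

All three tweets mention "healthcare", yet only some of them concern the same topic, which is why the slide asks whether the data being analyzed is really the targeted data.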

21 Generating Tree
• Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
  − Homonyms that machines do not understand
  − More accurate scores based on user interest
• Generate the tree using Text Search
  − Merge stemmed words, e.g. query, queries, querying…
  − Lucene: Apache open source

22 Tree View
[Figure: word tree with node labels healthcare, obamacare, !(Obamacare), obama, !(Obama), health, !(health)]
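The tree construction is not spelled out in the transcript, so the sketch below is one plausible reading: at each node, find the most-mentioned stem and split the tweets into those that contain it and those that do not, matching labels such as obamacare and !(Obamacare). The crude suffix-stripping `stem` function merely stands in for a real stemmer such as the ones shipped with Lucene; all names and the sample tweets are hypothetical.

```python
import re
from collections import Counter

def stem(word):
    """Crude suffix stripping so that query, queries and querying merge."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("y") and len(word) > 3:
        word = word[:-1]
    return word

def to_stems(tweet):
    """Return the set of stems that occur in a tweet."""
    return {stem(w) for w in re.findall(r"[a-z]+", tweet.lower())}

def build_tree(docs, used=frozenset(), depth=2):
    """Recursively split docs on the most-mentioned stem not used so far."""
    counts = Counter(s for doc in docs for s in doc if s not in used)
    if depth == 0 or not counts:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    word, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with_word = [d for d in docs if word in d]
    without = [d for d in docs if word not in d]
    return {
        "word": word,
        "count": len(with_word),
        "with": build_tree(with_word, used | {word}, depth - 1),
        "without": build_tree(without, used | {word}, depth - 1),
    }

docs = [to_stems(t) for t in [
    "Healthcare after Obamacare",
    "Obamacare and healthcare queries",
    "Querying healthcare costs",
    "Health tips",
]]
tree = build_tree(docs)
print(tree["word"], tree["with"]["word"], tree["without"]["word"])
```

Running sentiment analysis separately on each branch would then give scores scoped to, for example, tweets about healthcare that do or do not mention Obamacare.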

23 Thank you
Questions?
bohyun@hp.com
bohyun.j.kim@gmail.com
Many thanks to*:
Tim Donar, Solution Engineer
Beth Favini, Tech Pubs Sr. Manager
Judith Plummer, Tech Pubs Editor in Chief
* In alphabetical order

24 Got Feedback?
Rate and review the session using the GHC Mobile App.
To download, visit www.gracehopper.org

