Opinion Mining on the Web 2.0 Characteristics of User Generated Content and Their Impacts ITEC 547 Text Mining Ass. Professor: Nazife Dimililer Name: Feras.

1 Opinion Mining on the Web 2.0
Characteristics of User Generated Content and Their Impacts
ITEC 547 Text Mining
Ass. Professor: Nazife Dimililer
Name: Feras Allababidi
ID: 145416

2 Introduction
Opinion mining is the analysis of people’s opinions, sentiments, attitudes and emotions.

3 Web 2.0 Made it Easy
Because Web 2.0 allows user-generated content, people express their opinions on everything on the web, and it has become important to understand and analyze that feedback.

4 Importance
Why: to get answers to new geopolitical, social and business-related questions. Examples...

5 Challenges
Besides the typical challenges known from natural language processing and text processing, opinion mining faces several challenges of its own:
• Noisy texts: User-generated content in social media tends to be less grammatically correct; it is informally written and contains spelling mistakes. These texts often use emoticons, abbreviations or unorthodox capitalization.
• Language variations: Texts in user-generated content typically contain irony and sarcasm; they lack contextual information but presuppose implicit knowledge about a specific topic.

6 Challenges
• Relevance and boilerplate: Relevant content on web pages is usually surrounded by irrelevant elements such as advertisements, navigational components or previews of other articles; discussions and comment threads can drift to non-relevant topics.
• Target identification: Search-based approaches to opinion mining often face the problem that the topic of the retrieved document does not necessarily match the mentioned object.
• Complexity and rate of change of opinions.

7 Structure
Opinion mining has been investigated mainly at three different levels:
1. Document level
2. Sentence level
3. Entity/aspect level

8 Structure cont.
An opinion is defined as a quintuple $(E_i, A_{ij}, S_{ijkl}, H_k, T_l)$:
• $E_i$: Name of the entity
• $A_{ij}$: Aspect of the entity
• $S_{ijkl}$: Sentiment on the aspect (positive, negative, or neutral)
• $H_k$: Opinion holder
• $T_l$: Time at which the opinion was expressed
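
The quintuple above maps naturally onto a small record type. The following is a minimal sketch, assuming Python; the class name `Opinion`, the helper `sentiment_by_aspect`, and the field names are illustrative choices and do not come from the paper.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date

VALID_SENTIMENTS = ("positive", "negative", "neutral")


@dataclass(frozen=True)
class Opinion:
    """One opinion quintuple (E_i, A_ij, S_ijkl, H_k, T_l)."""
    entity: str     # E_i: name of the entity
    aspect: str     # A_ij: aspect of the entity
    sentiment: str  # S_ijkl: positive, negative, or neutral
    holder: str     # H_k: opinion holder
    time: date      # T_l: when the opinion was expressed

    def __post_init__(self):
        # Reject labels outside the three sentiment classes.
        if self.sentiment not in VALID_SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment!r}")


def sentiment_by_aspect(opinions):
    """Count sentiment labels for every (entity, aspect) pair."""
    counts = defaultdict(Counter)
    for op in opinions:
        counts[(op.entity, op.aspect)][op.sentiment] += 1
    return dict(counts)
```

Grouping by `(entity, aspect)` is what aspect-level opinion mining ultimately reports; document-level and sentence-level analysis would collapse the `aspect` field.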

9 Technical Approaches
• Sentiment classification
• Feature-based opinion mining (or aspect-based opinion mining)
• Comparison-based opinion mining

10 Paper Objective
The objective of this paper is to investigate the differences between social media channels and to discuss the impact of their characteristics on opinion mining approaches.

11 Paper Methodology
• Identify the most popular approaches for opinion mining in the scientific field and their underlying principles of detecting and analyzing text.
• Identify and deduce criteria from the literature to exhibit differences between the different kinds of social media sources regarding possible impacts on the quality of opinion mining.
• Do an empirical analysis based on the deduced criteria in order to determine the differences between several social media channels:
  • Social network services (Facebook)
  • Microblogs (Twitter)
  • Comments on weblogs
  • Product reviews (Amazon and other product review sites)
• In the last step, correlate the social media source types with applicable opinion mining approaches based on their respective characteristics.

12 Algorithms Used
• Supervised learning
• Unsupervised learning
• Partially supervised learning
• Latent variable models (Hidden Markov Model, HMM)
• Conditional Random Fields (CRF)
• Latent Semantic Association (LSA)
• Pointwise Mutual Information (PMI)

13 Algorithms: Hidden Markov Model HMM
• A formal foundation for building probabilistic models of linear sequences. HMMs are an example of stochastic processes: processes that generate random sequences of outcomes or states according to certain probabilities. Markov processes are distinguished by being memoryless: their next state depends only on their current state, not on the history that led them there.
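
To make the memoryless property concrete, here is a minimal sketch of the forward algorithm, which computes the probability of an observation sequence under an HMM. The two hidden states, the three observable words, and all probabilities are made-up toy values for illustration; they are not taken from the paper.

```python
# Toy HMM: hidden sentiment states emit words. All numbers are hypothetical.
STATES = ("positive", "negative")
START_P = {"positive": 0.5, "negative": 0.5}
TRANS_P = {
    "positive": {"positive": 0.8, "negative": 0.2},
    "negative": {"positive": 0.3, "negative": 0.7},
}
EMIT_P = {
    "positive": {"good": 0.7, "bad": 0.1, "ok": 0.2},
    "negative": {"good": 0.1, "bad": 0.6, "ok": 0.3},
}


def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) for a non-empty observation sequence."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Memoryless step: the update uses only the previous alpha,
        # never the full history of earlier states.
        alpha = {
            s: emit_p[s][obs] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

A call such as `forward(["good", "ok", "bad"], STATES, START_P, TRANS_P, EMIT_P)` returns the likelihood of that word sequence summed over all hidden state paths.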

14 Algorithms: Conditional Random Fields CRF
• A probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is to define a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences.
• $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in the underlying graph.
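
The idea of a conditional distribution over label sequences can be illustrated with a tiny linear-chain CRF evaluated by brute force. This is a sketch under stated assumptions: the label set, the emission and transition weights, and the function names are hypothetical, and a real implementation would compute the normalizer with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ("POS", "NEG")

# Hypothetical feature weights: (word, label) and (previous label, label).
EMIT = {
    ("great", "POS"): 2.0, ("great", "NEG"): -1.0,
    ("awful", "NEG"): 2.0, ("awful", "POS"): -1.0,
}
TRANS = {
    ("POS", "POS"): 0.5, ("NEG", "NEG"): 0.5,
    ("POS", "NEG"): -0.5, ("NEG", "POS"): -0.5,
}


def score(words, labels):
    """Unnormalized log-score of one label sequence for the given words."""
    emit = sum(EMIT.get((w, y), 0.0) for w, y in zip(words, labels))
    # Only adjacent labels interact: each label depends on its neighbors.
    trans = sum(TRANS.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return emit + trans


def conditional_prob(words, labels):
    """p(labels | words): normalize over all label sequences for these words."""
    z = sum(
        math.exp(score(words, cand))
        for cand in itertools.product(LABELS, repeat=len(words))
    )
    return math.exp(score(words, tuple(labels))) / z
```

Note that the normalizer is computed per observation sequence, which is exactly what distinguishes the conditional model from a joint model over labels and observations.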

15 Algorithms: Latent Semantic Association LSA

16 Algorithms: Pointwise Mutual Information PMI
• An approach to finding collocations. PMI measures how much one word tells us about another, that is, how much information we gain about one word by observing the other.
• A collocation is an expression of two or more words that corresponds to some conventional way of saying something. Example: I’ll be in touch.
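
PMI for a word pair is commonly defined as $\log_2 \frac{p(x, y)}{p(x)\,p(y)}$. The sketch below estimates it for adjacent word pairs from raw counts; the function name `pmi_scores` and the simple relative-frequency estimates are assumptions made for illustration, not the procedure used in the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return base-2 PMI for every adjacent word pair in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[x] / n_uni    # relative frequency of each word
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs that occur together more often than their individual frequencies predict get a positive score, which is why high-PMI pairs are collocation candidates. On small corpora the estimate is unreliable for rare words, so a minimum count threshold is usually applied.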

17 Empirical Analysis
• Focused on a specific brand (Samsung)
• Specific time period: between June 15, 2011 and January 28, 2013
• Data labeled manually by four different human labelers
• Sources were taken in four different languages
• Number of sources for each medium:
  • Facebook: 410 postings, using the API
  • Twitter: 287 tweets, using the API
  • Blogs: 387 blog posts
  • Discussion forums: 417 posts from 4 different forums, collected manually
  • Product reviews: 433 reviews from Amazon and two product review pages, using a web crawler

18 Evaluation Criteria

19 Results of Survey: Facebook
• Length of postings: 19 words on Facebook, compared to 119 in product reviews
• Emoticons and Internet slang: Emoticon use is highest at 27.8%, while slang use is, surprisingly, lowest at only 8.3%
• Grammatical and orthographical correctness: Second-highest error ratio at 42%
• Aspects and details: 33% of postings mention one or more aspects. Postings are mainly on entity level (65.4%).
• Subjectivity: Lowest subjectivity at 67.3%, while 26.1% are objective
• Opinion holder: Between 95% and 97.6% reveal the author as the opinion holder
• Topic relatedness: Lowest at 82.3%; 1.1% contain both topic-related and non-topic-related content

20 Results of Survey: Twitter
• Length of postings: Shortest at 14 words, compared to 119 in product reviews (the longest)
• Emoticons and Internet slang: Emoticon use is second lowest at 24.4%, while slang use is highest at 20.2%
• Grammatical and orthographical correctness: Highest error ratio at 48.8%
• Aspects and details: 60.6% contain one or more aspects. Postings are mainly on entity level (56.6%).
• Subjectivity: Highest subjectivity at 82.9%, while 12.8% are objective
• Opinion holder: Between 95% and 97.6% reveal the author as the opinion holder
• Topic relatedness: 95.3%; 0% contain both topic-related and non-topic-related content

21 Results of Survey: Blogs
• Emoticons and Internet slang: Emoticon use is second highest at 27.6%, very close to Facebook, while slang use is 12.8%, higher than on Facebook
• Grammatical and orthographical correctness: Lowest error ratio at 35.4%
• Aspects and details: 55.3% go into detail. 5.6% contain aspects as well as opinions on entity level.
• Subjectivity: 69.3% are subjective, while 19.6% are objective
• Opinion holder: Between 95% and 97.6% reveal the author as the opinion holder
• Topic relatedness: 92.6%; 1.1% contain both topic-related and non-topic-related content

22 Results of Survey: Product Reviews
• Length of postings: Longest at 119 words
• Emoticons and Internet slang: Emoticon use is lowest at only 20.1%, while slang use is 12.8%, higher than on Facebook
• Grammatical and orthographical correctness: Second-lowest error ratio at 37.2%
• Aspects and details: 39.6% of product review postings go into detail, and 27.0% contain aspects as well as opinions on entity level
• Subjectivity: 71.7% are subjective and 2.9% are objective, leaving 25.4% that are both
• Opinion holder: In 90% of postings the author is the opinion holder
• Topic relatedness: 93.1%; 5.8% contain both topic-related and non-topic-related content

23 Impact on Opinion Mining
• Blogs:
  • Many research papers that focus on blogs do not explain how comments on the blog posts are taken into consideration.
  • Depending on the type of blog (corporate blog vs. j-blog), both the blog posting and the blog comments can be interesting sources for opinion mining.

24 Impact on Opinion Mining
• Product reviews:
  • Several researchers have proposed models to identify aspects and sentiments.
  • Few assume that all of the words in a sentence cover one single topic.
• Social networks (Facebook): Because users can interact with each other and respond to questions, and because of the number of grammatical mistakes, the challenges are similar to those of discussion forums. More research is required.

25 Impact on Opinion Mining
• Microblogs (Twitter): Many grammatical errors, short sentences, heavy use of hashtags and other abbreviations.
• Researchers mainly use supervised or semi-supervised learning.
• Davidov et al. use Twitter characteristics and language conventions as features.
• Zhang et al. combine lexicon-based and learning-based methods for Twitter sentiment analysis.
• The use of part-of-speech features does not seem to be useful in the microblogging domain.
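
As an illustration of what "Twitter characteristics and language conventions as features" can look like in practice, here is a minimal sketch that counts a few surface conventions in a tweet. The regular expressions and feature names are assumptions chosen for this example; this is not the feature set of Davidov et al. or Zhang et al.

```python
import re

# Simplified patterns; real emoticon and hashtag inventories are much larger.
EMOTICON = re.compile(r"[:;=]-?[()DPp]")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
ELONGATED = re.compile(r"(\w)\1{2,}")  # same character three or more times


def tweet_features(text):
    """Count simple Twitter-specific surface features in one tweet."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_hashtags": len(HASHTAG.findall(text)),
        "n_mentions": len(MENTION.findall(text)),
        "n_emoticons": len(EMOTICON.findall(text)),
        "n_elongated": len(ELONGATED.findall(text)),
        # Fully capitalized alphabetic words of length two or more.
        "n_all_caps": sum(
            1 for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()
        ),
    }
```

Counts like these could be fed to a supervised classifier alongside or instead of part-of-speech features, which the slide above reports as less useful for microblogs.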

26 Further Research
Further research work should be conducted to:
(i) Measure and compare the factual implications of the characteristics of social media on the performance of the different opinion mining approaches.
(ii) Conduct more research work on alternative (statistical / mathematical) approaches.

27 Resources
• AI and Opinion Mining: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5475086
• Opinion Mining on the Web 2.0 – Characteristics of User Generated Content and Their Impacts: https://online.tugraz.at/tug_online/voe_main2.getVollText?pDocumentNr=336152&pCurrPk=71378

28 The End
Thank you for listening. Any questions?

