Download presentation
Presentation is loading. Please wait.
Published byOliver Tate Modified over 8 years ago
1
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois at Chicago * HP Labs 1. Introduction 4. Experiments Parsing indirect relations is error-prone for Web corpora. Thus we only use direct relation to extract opinion words and feature candidates in our application. 2.3. Weakness 3. Proposed Techniques To deal with the problem of double propagation, we propose a novel method to mine features, which consists of two steps: feature extraction and feature ranking. 3.1 Feature Extraction We still adopt double propagation idea to populate feature candidates. But two improvements based on part-whole relation patterns and a “no” pattern are made to find features which double propagation cannot find. So there are three kinds of feature indicators: Double propagation Part-whole relation pattern A part-whole pattern indicates one object is part of another object. It is a good indicator for features if the class concept word (the “whole” part) is known. “no” pattern a specific pattern for product review and forum posts. People often express their comments or opinions on features by this short pattern (e.g. no noise) 2. Existing Techniques 2.1 Double Propagation (Qiu et al 2010) Observation : Opinion words are often used to modify features ; opinion words and features themselves have relations in opinionated expressions too, Method: Double propagation assumes that features are nouns/noun phrases and opinion words are adjectives Opinion words can be recognized by identified features, and features can be identified by known opinion words. The extracted opinion words and features are utilized to identify new opinion words and new features which are used again to extract more opinion words and features. This propagation or bootstrapping process ends when no more opinion words or features can be found. The opinion word/feature relations can be identified via a dependency parser based on the dependency grammar. 2.2. Dependency Grammar It describes the dependency relations between words in a sentence 3.2 Feature Ranking The basic idea is to rank the extracted feature candidates by feature importance. If a feature candidate is correct and important, it should be ranked high. For unimportant feature or noise, it should be ranked low. We identify two major factors affecting the feature importance. Feature relevance: it describes how possible a feature candidate is a correct feature. Feature frequency: a feature is important, if appears frequently in opinion documents. We find that there is a mutual enforcement relation between opinion words, part-whole relation and “no” patterns and features. If an adjective modifies many correct features, it is highly possible to be a good opinion word. Similarly, if a feature candidate can be extracted by many opinion words, part-whole patterns, or “no” pattern, it is also highly likely to be a correct feature. The Web page ranking algorithm HITS is applicable. 4. Conclusions A new method to deal with problems of double propagation. Part-whole and “no” patterns are used to increase recall and then ranks the extracted feature candidates by feature importance, which is determined by feature relevance and frequency. HITS was applied to compute feature relevance. We used 4 diverse corpora to evaluate the techniques. They were obtained from a commercial company. The data were crawled and extracted from multiple online message boards and blogs discussing different products and services. An important task of opinion mining is to extract people’s opinions on features/attributes of an entity. The sentence, “I love the GPS function of Motorola Droid”, expresses a positive opinion on the “GPS function” of the Motorola phone. “GPS function” is the feature. 3.3. The Whole Algorithm Step 1 : Extract products features using double propagation, part-whole patterns and “no” patterns Step 2 : Compute feature score using HITS without considering frequency Step 3 : The final score function considering the feature frequency is given as follows S = S(f) * log( freq(f) ) Freq (f) is the frequency count of feature f, and S(f) is the authority score of the feature f. Step 3 : The final score function considering the feature frequency is given as follows S = S(f) * log( freq(f) ) Freq (f) is the frequency count of feature f, and S(f) is the authority score of the feature f. Direct relations: it represents that one word depends on the other word directly or they both depend on a third word directly (e.g. “ The camera has a good lens” ) Indirect relations: it represents that one word depends on the other word through other words or they both depend on a third word indirectly For large corpora, double propagation may introduce a lotof noise ( error propagation). For small corpora, it may miss some important features ( features are not modified by any opinion word). A recently proposed unsupervised technique for extracting features from reviews (1) Phrase pattern NP + Prep + CP (e.g. battery of the camera) CP + with + NP (e.g. mattress with a cover) NP CP or CP NP (e.g. mattress pad ) (2) Sentence pattern CP verb NP (e.g. the phone has a big screen) Rank features Video Picture quality GPS function. Feature extraction double propagation, part-whole relation, “no” pattern” Feature ranking HITS algorithm Algorithms Corpus 1 Corpus 2 Corpus n … Opinion Lexicon Class Concept Word Dependency relations Feature extraction and ranking Table 1. Descriptions of the 4 corpora Table 2. Results of 1000 sentences Table 3. Precision at top 50 feature candidates Bipartite graph and HITS algorithm
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.