Sentiment Classification using Word Sub-Sequences and Dependency Sub-Trees
Shotaro Matsumoto, Hiroya Takamura and Manabu Okumura, Tokyo Institute of Technology
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005), May 18-20, 2005
(Made with OpenOffice.org)
Table of Contents
1. Motivation
2. Our Approach
3. Experiments
4. Results and Discussion
5. Conclusion and Future Work
1. Motivation: Background, Document Sentiment Classification, Early Studies, Issue, Objective
Background
Online grass-roots reviews are increasing rapidly and contain useful reputation information. There are far too many such documents to read them all, so mining reputation from these documents automatically is important.
Document Sentiment Classification
The task of classifying a whole document according to the positive or negative polarity of its opinion (desirable or undesirable).
Two Steps for the Classification
1. Feature extraction: convert a document into a feature vector that preserves the features of the original document.
2. Binary classification: classify the feature vector as having positive or negative sentiment polarity.
Early Studies
[Pang 02] Features: unigrams in the document. Classifiers: Naïve Bayes, a maximum entropy model, and Support Vector Machines (SVMs). Showed that SVMs are superior to the others.
[Pang 04] Features: unigrams obtained from an extracted summary. Classifier: SVMs.
[Mullen 04] Features: unigrams, unigrams of lemmatized words, and prior knowledge from the Internet and a thesaurus. Classifier: SVMs. Obtained better results than [Pang 02].
Issue
In these early studies, a document is represented as a bag of words: a text is regarded as a set of words. Word order and the syntactic relations between words in a sentence, which are intuitively important for the classification, are therefore discarded.
Objective
We propose a method for extracting word order and syntactic relations between words as features, using frequent sub-patterns in sentences as these features.
2. Our Approach: Overview, Word Sub-Sequence, Dependency Sub-Tree, Frequent Sub-Pattern
Overview
We use a word sequence and a dependency tree as structured representations of a sentence, and we extract frequent sub-patterns from sentences as features for the classification.
Word Sub-Sequence
A word sequence S: simply the sequence of words that represents a sentence; it preserves word order within the sentence.
A word sub-sequence S' of a word sequence S: obtained by removing zero or more words from the original sequence; it preserves the word order of the original sentence.
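As a concrete illustration (not the authors' code), the sub-sequences of a short word sequence can be enumerated with `itertools.combinations`; removing words never reorders the ones that remain:

```python
from itertools import combinations

def word_subsequences(words):
    """Enumerate all non-empty sub-sequences of a word sequence.

    Removing zero or more words always preserves the original order,
    so each result is an order-preserving sub-sequence S' of S.
    """
    subs = set()
    for length in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), length):
            subs.add(tuple(words[i] for i in idx))
    return subs

S = ["this", "film", "is", "good"]
subs = word_subsequences(S)
# ("film", "good") is a sub-sequence even though "is" was removed;
# ("good", "film") is not, because it reverses the original order.
```

A sequence of n distinct words has 2^n - 1 non-empty sub-sequences (15 here), which is why restricting attention to frequent sub-patterns, as the next slides do, matters.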
Dependency Sub-Tree
A dependency tree D: expresses the dependencies between words in a sentence through child-parent relationships of nodes; it preserves the syntactic relations between words in the sentence.
A dependency sub-tree D' of a dependency tree D: obtained by removing zero or more nodes from the original tree; it preserves syntactic relations between words in the original sentence.
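As a simplified sketch (the paper mines sub-trees of arbitrary size; here we show only the smallest ones), a dependency tree can be stored as a child-to-head mapping, and each (head, dependent) edge is a minimal non-trivial sub-tree. The parse below is assumed for illustration:

```python
def dependency_edges(parent_of):
    """Extract word pairs connected by a dependency edge.

    `parent_of` maps each word to its head word (the root maps to None).
    Each (head, dependent) pair is the smallest non-trivial dependency
    sub-tree; larger sub-trees are obtained by removing fewer nodes.
    """
    return {(head, dep) for dep, head in parent_of.items() if head is not None}

# Assumed dependency tree for "this film is good":
# "is" is the root; "film" and "good" depend on "is"; "this" depends on "film".
tree = {"is": None, "film": "is", "good": "is", "this": "film"}
edges = dependency_edges(tree)
```

Unlike sub-sequences, these features survive word-order changes that keep the same syntactic relations, which is exactly what the dependency representation is meant to capture.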
Frequent Sub-Pattern
The number of all sub-patterns (sub-sequences or sub-trees) is too large, so we use only frequent sub-patterns.
Definitions:
A sentence contains a pattern if and only if the pattern is a sub-sequence or a sub-tree of the sentence.
The support of a pattern is the number of sentences in the dataset that contain it.
A pattern is frequent if its support is at least a given support threshold. (In our experiments, we fixed the support threshold at 10.)
As implementations for mining frequent sub-patterns, we use Kudo's PrefixSpan and FREQT.
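The paper uses PrefixSpan and FREQT for the actual mining; the support definition itself, for the sub-sequence case, can be sketched naively like this (the toy dataset and threshold of 3 are for illustration only; the paper's threshold is 10):

```python
def is_subsequence(pattern, sentence):
    """True if `pattern` occurs in `sentence` as an order-preserving sub-sequence."""
    words = iter(sentence)
    return all(w in words for w in pattern)  # `in` advances the iterator

def support(pattern, sentences):
    """The support of a pattern: how many sentences in the dataset contain it."""
    return sum(is_subsequence(pattern, s) for s in sentences)

sentences = [
    ["this", "film", "is", "good"],
    ["this", "film", "is", "not", "good"],
    ["is", "this", "film", "good"],
]
# The paper's threshold of 10 is too large for this toy dataset,
# so we use 3 here purely for illustration.
THRESHOLD = 3
frequent = support(("film", "good"), sentences) >= THRESHOLD
```

A real miner like PrefixSpan avoids scoring every candidate separately by growing patterns prefix-by-prefix and pruning any prefix whose support already falls below the threshold.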
3. Experiments: Movie Review Dataset, Features, Classifiers and Tests
Movie Review Dataset
Dataset 1 (used in [Pang 02] and [Mullen 04]): 690 positive and 690 negative reviews, written in English; evaluated with 3-fold cross-validation.
Dataset 2 (used in [Pang 04]): 1000 positive and 1000 negative reviews, written in English; evaluated with 10-fold cross-validation.
Features
We employ the following features and their combinations for the classification.
Bag-of-words features:
Unigrams (e.g. "good", "film"): uni. Only unigram patterns that appear in at least 2 distinct sentences are used.
Bigrams (e.g. "very good", "film is"): bi. Only bigram patterns that appear in at least 2 distinct sentences are used.
Frequent sub-pattern features:
Word sub-sequences: seq
Dependency sub-trees: dep
Features of lemmatized words:
As in the extraction of uni, bi, seq, and dep, we also extract uni_l, bi_l, seq_l, and dep_l from lemmatized words.
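The uni and bi features with their "at least 2 distinct sentences" filter can be sketched like this (a minimal illustration, not the authors' extraction code); counting each n-gram at most once per sentence is what makes the count a number of distinct sentences:

```python
from collections import Counter

def ngram_features(sentences, n, min_sentences=2):
    """Collect n-grams that appear in at least `min_sentences` distinct sentences."""
    counts = Counter()
    for words in sentences:
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)  # a set, so each n-gram counts once per sentence
    return {g for g, c in counts.items() if c >= min_sentences}

docs = [
    ["this", "film", "is", "good"],
    ["the", "film", "is", "boring"],
]
uni = ngram_features(docs, 1)  # unigrams in at least 2 sentences
bi = ngram_features(docs, 2)   # bigrams in at least 2 sentences
```

The lemmatized variants (uni_l, bi_l, and so on) would run the same extraction after replacing each word with its lemma.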
Classifiers and Tests (1/2)
Classifier: SVMs, a binary classifier based on supervised learning.
Kernel function: linear kernel.
Performance depends closely on the learning parameter C (the soft-margin parameter), so we carry out three kinds of experiments.
Classifiers and Tests (2/2)
Test 1: fix C = 1. The result is used for comparison with the early studies.
Test 2: take the best accuracy with C ∈ {e^-2.0, e^-1.5, ..., e^2.0}. This observes the potential performance of the features; we use the result to find the most effective combination with the bag-of-words features.
Test 3: predict a proper value of C from the training data. This observes the practical performance of the features.
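The Test 2 protocol, the nine-point C grid searched under k-fold cross-validation (k = 3 or 10 for the two datasets), can be sketched without any SVM library; `evaluate` below is a placeholder standing in for training a linear-kernel SVM with soft margin C and scoring it on the held-out fold:

```python
import math

# Test 2 searches C over {e^-2.0, e^-1.5, ..., e^2.0}: nine values in steps of 0.5.
C_GRID = [math.exp(k * 0.5) for k in range(-4, 5)]

def kfold_indices(n_examples, k):
    """Split example indices into k folds for cross-validation."""
    folds = [[] for _ in range(k)]
    for i in range(n_examples):
        folds[i % k].append(i)
    return folds

def best_C(evaluate, n_examples, k):
    """Pick the C with the highest mean held-out accuracy.

    `evaluate(C, train_idx, test_idx)` is a placeholder for training an
    SVM with soft margin C on `train_idx` and scoring it on `test_idx`.
    """
    folds = kfold_indices(n_examples, k)

    def mean_acc(C):
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(evaluate(C, train_idx, test_idx))
        return sum(scores) / k

    return max(C_GRID, key=mean_acc)
```

With a real implementation, `evaluate` would be replaced by an actual SVM trainer; the grid and fold bookkeeping stay the same.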
4. Results and Discussion: Results, Discussion
Results (1/2)
Results for dataset 1:
vs. [Pang 02]: 82.9% → 87.3% (error reduction: 26%)
vs. [Mullen 04]: 84.6% → 87.3% (error reduction: 18%)
Results (2/2)
Results for dataset 2:
vs. [Pang 04]: 87.1% → 92.9% (error reduction: 45%)
Discussion
The results of Test 1 show that our method is effective.
Accuracy by features: bow + dep ≈ bow + dep + seq (93%) >> bow + seq (89%) > bow (87%)
Lemmatized features are not always more effective than the original ones.
5. Conclusion and Future Work: Conclusion, Future Work
Conclusion
We proposed a method for incorporating word order and syntactic relations between words in a sentence into document sentiment classification, using frequent word sub-sequences and dependency sub-trees as features. Experimental results on movie review datasets show that our classifiers obtained the best results yet published on these datasets.
Future Work (1/2)
Negative and interrogative sentences:
Affirmative sentence: This film is good. (1)
Negative sentence: This film is not good. (2)
Interrogative sentence: Is this film good? (3)
All sub-patterns in sentence (1) are also contained in sentence (2), and there is similarly a large overlap of patterns between (1) and (3). Distinguishing these sentence types would solve this problem.
Future Work (2/2)
Incorporating discourse structure in a document.
Example (positive movie review): "The scenario is simplistic. But I love this film."
From the word "but", we can tell that "I love this film" is a more important sentence than "The scenario is simplistic" for sentiment classification.
Thank you.
Examples of Weighted Patterns
A positive (+) weight indicates positive sentiment polarity; a negative (-) weight indicates negative sentiment polarity. The absolute value of each weight indicates how large the feature's contribution is.
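With a linear kernel, the decision function is w · x + b, so each pattern feature gets a single signed weight that can be read off directly. A minimal sketch of ranking patterns by weight; the patterns and weight values below are invented for illustration, not taken from the paper:

```python
def top_patterns(weights, k=2):
    """Rank patterns by learned weight: most positive first, most negative last.

    With a linear kernel, the sign of each feature's weight gives its
    sentiment polarity and its absolute value gives the size of its
    contribution. The weights below are invented for illustration only.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[-k:]

weights = {
    ("great",): 1.3,
    ("love", "film"): 0.9,
    ("is", "boring"): -1.1,
    ("waste",): -1.6,
}
positive, negative = top_patterns(weights)
```

This kind of inspection is only straightforward for linear kernels; with nonlinear kernels there is no single per-feature weight to read.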
A Word Sequence = A Clause
Sentences are too long to be used for mining frequent sub-sequences, so instead of whole sentences we used the clauses of sentences as word sequences. As in the figure on the right, we split a sentence into a main clause and subordinate clauses using information from the parse tree. In addition, we removed stopwords: conjunctions, prepositions, numbers, etc.
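The clause splitting depends on the parser, but the stopword-removal step can be illustrated on its own. The stopword set below is hypothetical; the slide names conjunctions, prepositions, and numbers but does not give the actual list used:

```python
# Hypothetical stopword set covering the categories named on the slide
# (conjunctions and prepositions); the paper's actual list is not given.
STOPWORDS = {"and", "but", "or", "of", "in", "on", "to", "with"}

def remove_stopwords(clause):
    """Drop stopwords and bare numbers from a clause before sub-sequence mining."""
    return [w for w in clause if w.lower() not in STOPWORDS and not w.isdigit()]

clause = ["one", "of", "the", "best", "films", "of", "2004"]
filtered = remove_stopwords(clause)
```

Shorter, filtered clauses keep the number of candidate sub-sequences manageable for the frequent-pattern miner.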
References
[Pang 02] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP 2002.
[Pang 04] Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL 2004.
[Mullen 04] Tony Mullen and Nigel Collier. Sentiment Analysis using Support Vector Machines with Diverse Information Sources. EMNLP 2004.