A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:


1 A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source: LREC 2008

2 Motivation Most available resources for subjectivity analysis focus on English –Lexicons –Manually labeled corpora Previously proposed methods –Rely on advanced language processing tools, such as syntactic parsers and information extraction

3 Goal Minimize the required resources Build a subjectivity lexicon by using –a small seed set of subjective words –an online dictionary –a small raw corpus –A bootstrapping process ranks new candidate words based on a similarity measure

4 Related Work Turney (2002) starts with seeds and uses a PMI similarity measure to grow the seed list from Web data –Annotated for polarity –Web data used as a very large corpus

5 Bootstrapping Bootstrapping starts from manually selected seeds Depends on an online dictionary The lexicon is expanded with related words at each iteration Candidates are filtered by using a measure of word similarity

6 Seed set 60 seeds are used as the seed set here –evenly sampled from verbs, nouns, adjectives and adverbs Manually selected from 1. the XI-th grade curriculum for Romanian Language and Literature 2. translations of instances appearing in the OpinionFinder strong subjective lexicon A similar seed set can easily be obtained for any other language

7 Sample of initial seed set

8 Dictionary Use an online Romanian dictionary For each seed word, collect all the open-class words appearing in its definition as new related words –include synonyms and antonyms if available –expand all the possible meanings of a seed word –candidates from incorrect meanings are filtered out by using the measure of word similarity
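
The candidate-collection step on this slide can be sketched as follows. This is only an illustration under stated assumptions: the entry layout (`senses`, `synonyms`, `antonyms`), the part-of-speech tag names, and the English toy entry are hypothetical, since the slides do not describe the dictionary's format or how the definitions were tagged.

```python
# Hypothetical tag names for the four open word classes.
OPEN_CLASSES = {"noun", "verb", "adjective", "adverb"}

def related_words(entry):
    """Collect candidate words from one dictionary entry.

    `entry` is an assumed structure: every sense is a list of
    (word, part_of_speech) pairs taken from its definition.
    All senses are expanded, as the slide describes.
    """
    words = []
    for sense in entry.get("senses", []):
        words += [w for w, pos in sense if pos in OPEN_CLASSES]
    words += entry.get("synonyms", [])
    words += entry.get("antonyms", [])
    # Remove duplicates while keeping first-seen order.
    seen = set()
    unique = []
    for w in words:
        if w not in seen:
            seen.add(w)
            unique.append(w)
    return unique

# Toy entry (invented for illustration, not from the Romanian dictionary).
toy_entry = {
    "senses": [
        [("feeling", "noun"), ("of", "preposition"), ("great", "adjective"), ("pleasure", "noun")],
        [("lucky", "adjective"), ("or", "conjunction"), ("fortunate", "adjective")],
    ],
    "synonyms": ["glad"],
    "antonyms": ["sad"],
}
candidates = related_words(toy_entry)
```

Words that come from an unrelated sense are not removed here; in the method described on the slides, that is left to the similarity filter.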

9 Bootstrapping iterations Continue to the next iteration until a maximum number of iterations is reached Part-of-speech information is not maintained throughout the bootstrapping process
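
The overall loop described on slides 5 to 9 can be sketched as below. This is a minimal sketch, not the authors' implementation: the function names, the toy dictionary, and the toy similarity scores are hypothetical, and stopping early when an iteration adds nothing new is an added assumption (the slides only mention a maximum number of iterations).

```python
def bootstrap(seeds, related_words, similarity, threshold=0.4, max_iterations=5):
    """Grow a lexicon from seed words.

    related_words(word) -> iterable of candidate words (e.g. from a dictionary)
    similarity(candidate, seeds) -> score against the original seeds
    """
    seeds = set(seeds)
    lexicon = set(seeds)
    frontier = set(seeds)            # words to expand in this iteration
    for _ in range(max_iterations):
        candidates = set()
        for word in frontier:
            candidates.update(related_words(word))
        candidates -= lexicon        # keep only words not seen before
        accepted = {c for c in candidates if similarity(c, seeds) > threshold}
        if not accepted:             # assumption: stop when nothing new passes
            break
        lexicon |= accepted
        frontier = accepted          # only accepted candidates are expanded
    return lexicon

# Toy stand-ins (invented for illustration only).
TOY_DICTIONARY = {"happy": ["glad", "table"], "glad": ["joyful"], "joyful": ["happy"]}
TOY_SCORES = {"glad": 0.7, "joyful": 0.6, "table": 0.1}

def toy_related(word):
    return TOY_DICTIONARY.get(word, [])

def toy_similarity(candidate, seeds):
    return TOY_SCORES.get(candidate, 0.0)

lexicon = bootstrap({"happy"}, toy_related, toy_similarity)
```

In this toy run, "table" is reachable from the dictionary but is rejected by the filter, so it is never expanded further.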

10 Filtering Calculate a measure of similarity between the original seeds and each of the possible candidates Two corpus-based measures of similarity 1. Pointwise Mutual Information (PMI) 2. Latent Semantic Analysis (LSA) –both methods provided similar results –the LSA method was significantly faster and required less training data Candidates with an LSA score higher than 0.4 are kept for expansion
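
For reference, the first of the two measures, pointwise mutual information, compares how often two words co-occur with how often they would co-occur if they were independent. The sketch below uses the standard definition with a base-2 logarithm; how the authors counted co-occurrences (window size, corpus) is not stated on the slides, so the count arguments are assumptions.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).

    count_xy: number of contexts containing both words
    count_x, count_y: number of contexts containing each word
    total: total number of contexts
    """
    if count_xy == 0:
        return float("-inf")   # the words never co-occur
    return math.log2((count_xy * total) / (count_x * count_y))

# Words that co-occur exactly as often as chance predicts score 0.
independent = pmi(count_xy=1, count_x=10, count_y=100, total=1000)
# Words that co-occur ten times more often than chance score log2(10).
associated = pmi(count_xy=10, count_x=20, count_y=50, total=1000)
```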

11 LSA Used to evaluate semantic similarity Automatically acquired in an unsupervised way from a corpus –Corpus: e.g., British National Corpus –Latent Semantic Analysis (LSA) yields a vector space model –Factor analysis and dimension reduction –Allows for a homogeneous representation of words, word sets, sentences and texts as vectors, between which a similarity measure can then be computed

12 LSA Here the LSA module was trained on a half-million word Romanian corpus –Consisting of a manually translated version of the SemCor balanced corpus (Miller et al., 1993).
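
Once words are represented as vectors in the reduced LSA space, similarity is typically computed as the cosine of the angle between two vectors. The sketch below shows only that last step: the dimension-reduction step itself is omitted, the two-dimensional vectors are invented for illustration, and representing a word set by the mean of its word vectors is an assumption rather than something the slides specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0             # convention for zero vectors
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Mean vector of a word set (assumed way to represent a set of words)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical reduced vectors for two seed words and one candidate.
seed_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vector = [1.0, 0.0]
score = cosine(candidate_vector, centroid(seed_vectors))
```

A score such as this one is what would be compared against the similarity threshold when filtering candidates.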

13 Variable filter The subjectivity lexicons consist of a ranked list of candidates –in decreasing order of similarity A variable filtering threshold can be used to select the most closely related candidates –used thresholds: 0.40, 0.50, 0.55, 0.60
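
The effect of the variable threshold can be illustrated with a short sketch. The word names and scores below are invented; only the four threshold values come from the slide. Treating the threshold as a strict "higher than" comparison follows the wording used for the 0.4 cutoff and is otherwise an assumption.

```python
def filter_ranked(scores, threshold):
    """Return words ranked by decreasing similarity, keeping scores above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked if score > threshold]

# Hypothetical candidate scores.
toy_scores = {"w1": 0.62, "w2": 0.57, "w3": 0.52, "w4": 0.45, "w5": 0.30}

# A stricter threshold yields a smaller, more closely related lexicon.
sizes = {t: len(filter_ranked(toy_scores, t)) for t in (0.40, 0.50, 0.55, 0.60)}
```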

14 Evaluation Subjectivity lexicon –LSA similarity threshold of 0.5 –five bootstrapping iterations –Resulted in a lexicon of 3,913 entries Used in a rule-based sentence-level subjectivity classifier –Subjective sentence: contains three or more entries that appear in the subjectivity lexicon Gold-standard corpus –Consisting of 504 Romanian sentences –manually annotated for subjectivity –The agreement between the two annotators is 0.83 (κ = 0.67) –Resulting in a gold-standard dataset with 272 (54%) subjective sentences and 232 (46%) objective sentences
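
The classification rule on this slide can be sketched in a few lines. The threshold of three entries comes from the slide; the tokenization (lowercasing, splitting on whitespace, stripping punctuation), the choice to count every matching token rather than distinct entries, and the English toy lexicon are assumptions made for illustration.

```python
PUNCTUATION = ".,;:!?\"'()"

def classify(sentence, lexicon, min_hits=3):
    """Label a sentence subjective if it contains at least `min_hits` lexicon entries."""
    tokens = [token.strip(PUNCTUATION) for token in sentence.lower().split()]
    hits = sum(1 for token in tokens if token in lexicon)   # counts repeated matches
    return "subjective" if hits >= min_hits else "objective"

# Toy lexicon (invented; the real lexicon is Romanian with 3,913 entries).
toy_lexicon = {"wonderful", "love", "terrible", "hate"}

label_a = classify("I love this wonderful, wonderful film.", toy_lexicon)
label_b = classify("The film is wonderful.", toy_lexicon)
```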

15 Lexicon Acquisition Entries in the lexicon

16 Sentence-level subjectivity classification Using the lexicon alone –the rule-based subjectivity classifier reaches an overall F-measure of 61.69%
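
As a reminder of how an F-measure is obtained, the sketch below computes precision, recall, and their harmonic mean for one class from gold and predicted labels. The labels are invented for illustration, and how the "overall" figure on the slide was aggregated across classes is not stated, so this shows only the per-class computation.

```python
def precision_recall_f(gold, predicted, positive="subjective"):
    """Precision, recall, and F-measure for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical labels for four sentences.
toy_gold = ["subjective", "subjective", "subjective", "objective"]
toy_pred = ["subjective", "subjective", "objective", "subjective"]
p, r, f = precision_recall_f(toy_gold, toy_pred)
```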

17 Compare our results with other rule-based methods Mihalcea et al., 2007 –subjectivity lexicon: translation of the English subjectivity lexicon –2,282 entries with a confidence label of strong, neutral or weak as flagged by the OpinionFinder lexicon The bootstrapping method achieves a significant improvement of 18.03% in the overall F-measure

18 Conclusion Quickly generate a large subjectivity lexicon Used to build rule-based sentence-level subjectivity classifiers This system proposes a possible path towards identifying subjectivity in low-resource languages Future work –variations of the bootstrapping mechanism –other similarity measures

