
1 Text Representation & Text Classification for Intelligent Information Retrieval. Ning Yu, School of Library and Information Science, Indiana University at Bloomington

2 Outline  The big picture  A specific problem – opinion detection

3 Intelligent information retrieval  Characteristics  Not restricted to keyword matching and Boolean search  Deals with natural language queries and advanced search criteria  Coarse-to-fine levels of granularity  Automatically organizes/evaluates/interprets the solution space  User-centered, e.g., adapts to the user's learning habits  Etc.

4 Intelligent information retrieval  System preferences  Various sources of evidence  Natural language processing  Semantic web technologies  Automatic text classification  Etc.

5 Intelligent IR system diagram

6 A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content (dissertation work)

7 Growing demand for online opinions  Enormous body of user-generated content  About anything, published anywhere and at any time  Useful for literature review, decision making, market monitoring, etc.

8 Major approaches for opinion detection

9 What's Essential? Labeled Data! And lots of it!  Labeled data is needed to acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, n-grams with n > 1, linguistic collocations, stylistic features, contextual features); to generate complex patterns (e.g., "good amount") that approximate the context of words; to build and evaluate opinion detection systems; and to compare opinion detection strategies with high confidence.
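As a small illustration of one of these feature families, the sketch below builds binary unigram and bigram (bag-of-words) features with scikit-learn. The example sentences are hypothetical, and the POS, collocation, stylistic, and contextual features listed above are not covered here.

```python
# Sketch: binary unigram + bigram features for candidate opinion sentences.
# The sentences below are made-up examples, not taken from the study's datasets.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The battery life is a good amount better than expected.",
    "The package arrived on Tuesday.",
]

# binary=True records presence/absence rather than raw counts,
# matching the binary feature values mentioned on slide 15.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(sentences)

print(X.shape)                                 # (2, number of distinct uni/bigrams)
print(vectorizer.get_feature_names_out()[:10]) # a peek at the extracted features
```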

10 Challenges for opinion detection  Shortage of opinion-labeled data: manual annotation is tedious, error-prone, and difficult to scale up  Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another

11 Motivations & research question  Unlabeled user-generated content that contains opinions is easy to collect  Semi-Supervised Learning (SSL) requires only a small amount of labeled data to automatically label unlabeled data, and has achieved promising results in NLP studies  Research question: Is SSL effective for opinion detection both in sparse-data situations and for domain adaptation?

12 Datasets & data split
Splits: SSL = Labeled (1-5%) + Unlabeled (90%) + Evaluation (5%); Full SL = Labeled (95%) + Evaluation (5%); Baseline Supervised Learning (SL) = Labeled (1-5%) + Evaluation (5%)
Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
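A minimal sketch of how a split like this could be produced, assuming the sentences and labels are already loaded in memory. The function name ssl_split and its parameters are hypothetical, not taken from the dissertation.

```python
# Sketch of the labeled / unlabeled / evaluation split described on the slide
# (1-5% labeled, ~90% unlabeled, 5% evaluation). Corpus loading is assumed.
from sklearn.model_selection import train_test_split

def ssl_split(sentences, labels, labeled_frac=0.05, eval_frac=0.05, seed=42):
    # Hold out the evaluation set first.
    rest_x, eval_x, rest_y, eval_y = train_test_split(
        sentences, labels, test_size=eval_frac, random_state=seed, stratify=labels)
    # From the remainder, keep a small labeled seed; the rest is treated as unlabeled.
    unlabeled_x, labeled_x, _, labeled_y = train_test_split(
        rest_x, rest_y, test_size=labeled_frac / (1 - eval_frac),
        random_state=seed, stratify=rest_y)
    return labeled_x, labeled_y, unlabeled_x, (eval_x, eval_y)
```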

13 Two major SSL methods: Self-training  Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.  Limitation: Auto-labeled data may be biased by the particular opinion classifier.
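A minimal self-training loop in the spirit of this slide, assuming features are already extracted as sparse matrices (e.g., with the vectorizer sketched earlier) and using scikit-learn's MultinomialNB as a stand-in Naive Bayes classifier. The threshold, batch size, and iteration count are illustrative assumptions, not the dissertation's settings.

```python
# Sketch of self-training: train on the labeled seed, add the classifier's most
# confident predictions on unlabeled data to the labeled pool, and repeat.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled,
               iterations=10, per_round=100, threshold=0.9):
    X_l, y_l, X_u = X_labeled, np.asarray(y_labeled), X_unlabeled
    clf = MultinomialNB()
    for _ in range(iterations):
        if X_u.shape[0] == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        confidence = proba.max(axis=1)
        # Take the most confident predictions, but only those above the threshold.
        top = np.argsort(-confidence)[:per_round]
        top = top[confidence[top] >= threshold]
        if top.size == 0:
            break
        X_l = vstack([X_l, X_u[top]])
        y_l = np.concatenate([y_l, clf.classes_[proba[top].argmax(axis=1)]])
        # Remove the newly auto-labeled rows from the unlabeled pool.
        keep = np.setdiff1d(np.arange(X_u.shape[0]), top)
        X_u = X_u[keep]
    clf.fit(X_l, y_l)  # final fit on the expanded labeled pool
    return clf
```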

14 Two major SSL methods: Co-training  Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.  Limitation: It is not always easy to create two different classifiers.
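A minimal co-training loop along these lines, assuming two pre-built feature views (for example, the unigram and bigram views mentioned on the next slide) stored as sparse matrices. The classifier choice, batch size, and nomination rule are illustrative assumptions, not the dissertation's exact procedure.

```python
# Sketch of co-training: two classifiers, each trained on its own feature view,
# label unlabeled examples for each other in rounds.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def co_train(L1, L2, y_labeled, U1, U2, iterations=10, per_round=50):
    # L1/L2 and U1/U2 are the labeled and unlabeled feature matrices for the two views.
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    y = np.asarray(y_labeled)
    for _ in range(iterations):
        if U1.shape[0] == 0:
            break
        clf1.fit(L1, y)
        clf2.fit(L2, y)
        # Each classifier nominates its most confident unlabeled examples.
        chosen, new_labels = [], []
        for clf, U in ((clf1, U1), (clf2, U2)):
            proba = clf.predict_proba(U)
            top = np.argsort(-proba.max(axis=1))[:per_round]
            chosen.append(top)
            new_labels.append(clf.classes_[proba[top].argmax(axis=1)])
        # Deduplicate nominations, then move them (in both views) to the labeled pool.
        idx, first = np.unique(np.concatenate(chosen), return_index=True)
        labels = np.concatenate(new_labels)[first]
        L1, L2 = vstack([L1, U1[idx]]), vstack([L2, U2[idx]])
        y = np.concatenate([y, labels])
        keep = np.setdiff1d(np.arange(U1.shape[0]), idx)
        U1, U2 = U1[keep], U2[keep]
    return clf1, clf2
```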

15 Experimental design  General settings for SSL  Naïve Bayes classifier for self-training  Binary values for unigram and bigram features  Co-training strategies:  Unigrams and bigrams (content vs. context)  Two randomly split feature/training sets  A character-based language model (CLM) and a bag-of-words model (BOW)
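One way to construct two deliberately different views for co-training, loosely mirroring the BOW vs. character-level pairing above: a character n-gram vectorizer approximates a character-based view, and a word-unigram vectorizer gives a bag-of-words view. The texts are made-up examples, and the actual CLM used in the dissertation may differ from a simple character n-gram vectorizer.

```python
# Sketch: two contrasting feature views built with scikit-learn vectorizers.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["This movie was surprisingly good!", "The film runs 120 minutes."]  # hypothetical

word_view = CountVectorizer(ngram_range=(1, 1), binary=True)                       # BOW view
char_view = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), binary=True)   # character view

X_word = word_view.fit_transform(texts)
X_char = char_view.fit_transform(texts)
print(X_word.shape, X_char.shape)
```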

16 Results: Overall  For movie reviews and news articles, co-training proved to be most robust  For blog posts, SSL showed no benefits over SL due to the low initial accuracy

17 Results: Movie reviews  Both self-training and co-training can improve opinion detection performance  Co-training is more effective than self-training

18 Results: Movie reviews (cont.)  The more different the two classifiers, the better the performance

19 Results: Domain transfer (movie reviews -> blog posts)  For a difficult domain (e.g., blogs), simple self-training alone is promising for tackling the domain transfer problem.

20 Contributions  Expands the spectrum of SSL applications to opinion detection  Investigating which SSL model best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation  Guidelines and evaluation baselines support later studies applying SSL algorithms to opinion detection  The approach is extensible to other data domains, non-English texts, and other text mining tasks

21 Thank you!  (Cartoon captions, www.CartoonStock.com: "All my opinions are posted on my online blog." "A grade of 85 or higher will get you favorable mention on my blog." "If you want a second opinion, I'll ask my computer.")

