Queensland University of Technology

Queensland University of Technology Y. Li, A. Algarni, and N. Zhong, "Mining Positive and Negative Patterns for Relevance Feature Discovery", 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC (KDD 2010), pp. 753-762. Presented by Prof. Yuefeng Li, Queensland University of Technology, Brisbane, QLD 4001, Australia. y2.li@qut.edu.au

Outline Introduction: what is relevance feature discovery; term-based models; why patterns? Are patterns effective for relevance feature discovery? And the new solution. The Deploying Method: the definition; deploying higher-level patterns to low-level terms. Low-level Features: specificity and exhaustivity; calculating the specificity score; classification rules; revising the weights of the low-level features. Evaluation. Conclusion.

Introduction Relevance is a fundamental concept: topical relevance describes a document's relevance to a given query; user relevance describes a document's relevance to a user. The objective of relevance feature discovery is to find useful features, available in a training set, for describing what users want.

Term Based Models Popular term-based IR models: the Rocchio algorithm; probabilistic models; Okapi BM25; language models (model-based methods and relevance models). Their advantages: efficient computation and mature theories for term weighting. Phrases have also been used in some IR models, as phrases are more discriminative and carry more "semantics" than words. Many researchers have shown that phrases are useful, and even crucial, for query expansion when building good ranking functions.
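As an illustration of these term-based models, here is a minimal sketch of Okapi BM25 scoring. The parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not taken from this slide deck:

```python
import math

def bm25_score(query_terms, doc_terms, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    df maps each term to its document frequency in the collection;
    n_docs is the collection size.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)          # term frequency in this document
        if tf == 0 or t not in df:
            continue
        # rare terms get a higher idf weight
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        # tf saturation with document-length normalisation
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```

A document containing a query term scores above zero; one without it scores zero, which is the behaviour the ranking relies on.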

Why Patterns? The challenging issue in using phrases is finding useful ones for text mining and classification: phrases naturally have inferior statistical properties to words, and a large number of them are redundant or noisy. Patterns can be a promising alternative to phrases. Like words, patterns enjoy good statistical properties, and data mining has developed techniques (maximal patterns, closed patterns and master patterns) for removing redundant and noisy patterns.

Are Patterns Effective for Relevance Feature Discovery? Pattern Taxonomy Models (PTM) use closed sequential patterns and have shown a certain improvement in effectiveness. There are two challenging issues in using patterns in text mining. The low-support problem: given a topic, large patterns are more specific to the topic but have low support (low frequency); if we decrease the minimum support, many noisy patterns are discovered. The misinterpretation problem: the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable when using discovered patterns to answer what users want; for example, a highly frequent pattern is usually a general pattern for a topic.
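The redundancy-removal idea mentioned above can be sketched with closed patterns: a pattern is closed if no proper super-pattern has the same support. The sketch below works over itemset patterns for simplicity (the paper itself uses closed sequential patterns):

```python
def closed_patterns(patterns):
    """Keep only the closed patterns.

    `patterns` maps each pattern (a frozenset of terms) to its support.
    A pattern is dropped if some proper super-pattern has equal support,
    since it then carries no extra information.
    """
    closed = {}
    for p, sup in patterns.items():
        if not any(p < q and sup == sq for q, sq in patterns.items()):
            closed[p] = sup
    return closed
```

For example, if {a} and {a, b} both have support 3, only {a, b} survives; a larger pattern with strictly lower support is kept as well.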

The New Solution Features at two levels: low-level terms and higher-level patterns. An innovative approach evaluates the weights of terms according to both their specificity and their distributions in the higher-level features, where the higher-level features include both positive and negative patterns.

The Deploying Method What is deploying: evaluating the weights (supports) of low-level terms based on their distributions (appearances) in higher-level patterns, where the higher-level patterns are closed patterns that appear frequently in paragraphs. It provides a method for interpreting discovered patterns and a new way of weighting terms. It is an efficient and effective way of using patterns to solve problems, especially large patterns.

How Deploying Works Let SP1, SP2, ..., SPn be the sets of discovered closed sequential patterns for all documents di ∈ D+ (i = 1, 2, ..., n), where n = |D+|.

How Deploying Works cont.
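A minimal sketch of the deploying idea: each closed pattern discovered in a positive document spreads its support over its constituent terms, and a term accumulates these contributions across all patterns and documents. The even split by pattern length is an illustrative choice, not necessarily the paper's exact weighting:

```python
from collections import defaultdict

def deploy(pattern_sets):
    """Deploy higher-level patterns onto low-level terms.

    pattern_sets: one dict per positive document, mapping each closed
    pattern (a frozenset of terms) to its support in that document.
    Each pattern spreads its support evenly over its terms, so a term's
    weight is the sum of its shares across all patterns and documents.
    """
    weights = defaultdict(float)
    for patterns in pattern_sets:          # one entry per document in D+
        for pattern, support in patterns.items():
            for term in pattern:
                weights[term] += support / len(pattern)
    return dict(weights)
```

A term occurring in many, or highly supported, patterns thus ends up with a larger weight than one occurring in a single rare pattern.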

Specificity of Low-Level Features A term's specificity describes the extent to which the term focuses on the topic that users want. For example, "JDK" is more specific than "LIB" for describing "Java Programming Language". Basically, the specificity of terms is based on their positions in a concept hierarchy: terms in the upper part of the LCSH (Library of Congress Subject Headings) hierarchy are more general; otherwise, they are more specific. In many cases, a term's specificity is measured relative to the topic under discussion. For example, "knowledge discovery" is a general term in the data mining community; however, it may be a specific term when we talk about information technology.

Definition of Specificity The concept of relevance is subjective. It is easy for human beings to judge relevance; however, it is very difficult to use such concepts for interpreting relevance features in text documents. We define the specificity of a given term t in the training set D = D+ ∪ D- as follows:
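Since the slide's formula is not reproduced here, the following is only a plausible sketch of such a specificity score, not the paper's exact definition: a term appearing in many positive documents and few negative ones scores close to 1, and vice versa close to -1.

```python
def specificity(term, pos_docs, neg_docs):
    """Illustrative specificity score in [-1, 1].

    pos_docs / neg_docs are collections of documents (sets of terms)
    from D+ and D-. The score is the difference between the fraction
    of positive and negative documents containing the term.
    This formulation is an assumption for illustration only.
    """
    p = sum(1 for d in pos_docs if term in d) / max(len(pos_docs), 1)
    n = sum(1 for d in neg_docs if term in d) / max(len(neg_docs), 1)
    return p - n
```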

Classification Rules Based on the specificity score, terms can be categorized into three groups using the following classification rules, where G is the set of general terms, T+ the set of specific positive terms, and T- the set of specific negative terms. (Diagram: the specificity axis, running from 1 down to -0.5, partitioned into the regions T+, G and T-.)
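A sketch of the classification rules, with purely illustrative thresholds theta_pos and theta_neg (the actual cut-offs depend on the paper's definition of specificity):

```python
def classify_terms(spe, theta_pos=0.5, theta_neg=-0.2):
    """Split terms into T+ (specific positive), G (general) and
    T- (specific negative) by thresholding their specificity scores.

    spe maps each term to its specificity score; the threshold values
    here are illustrative placeholders.
    """
    t_pos = {t for t, s in spe.items() if s >= theta_pos}
    t_neg = {t for t, s in spe.items() if s <= theta_neg}
    general = set(spe) - t_pos - t_neg
    return t_pos, general, t_neg
```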

Negative Feedback In general, negative and positive documents may share some background concepts or noisy knowledge. There are two main issues in using negative relevance feedback: How to select constructive negative examples, so as to reduce noise and the space of negative examples? How to use the selected negative examples to refine the knowledge discovered in D+?

The Selection of Constructive Negative Examples Offender documents are negative documents that are most likely to be classified as positive.

How to select constructive negative examples? Re-rank the negative examples using the low-level features extracted from positive feedback. Select the top-K documents as offenders. Extract higher-level patterns and low-level terms from the selected negative examples, using the same method as for mining the positive documents.
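The selection steps above can be sketched as follows; the scoring function (summing positive-feature weights over a document's terms) is an assumption for illustration:

```python
def select_offenders(neg_docs, term_weights, k):
    """Pick the top-k 'offender' documents from the negative examples.

    neg_docs: negative documents as sets of terms.
    term_weights: low-level term weights extracted from positive
    feedback. Negatives that score highest under the positive features
    are the ones most likely to be classified as positive.
    """
    def score(doc):
        return sum(term_weights.get(t, 0.0) for t in doc)
    return sorted(neg_docs, key=score, reverse=True)[:k]
```

The selected offenders would then be mined for their own patterns and terms, exactly as the positive documents were.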

Weight Revision of Low-level Features Generally, the specific terms are more important; however, general terms are also necessary for describing what users want. For that reason, we increase the weights of the specific positive terms and reduce the weights of the specific negative terms, as follows:
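A hedged sketch of this weight revision: boost the specific positive terms, penalise the specific negative ones, and leave general terms unchanged. The boost factor and the exact update rule are illustrative assumptions, not the paper's formula:

```python
def revise_weights(weights, t_pos, t_neg, boost=2.0):
    """Revise low-level term weights after negative feedback.

    t_pos / t_neg are the specific positive / negative term sets from
    the classification rules; terms in neither set are general (G)
    and keep their weight. `boost` is an illustrative factor.
    """
    revised = {}
    for t, w in weights.items():
        if t in t_pos:
            revised[t] = w * boost           # amplify specific positives
        elif t in t_neg:
            revised[t] = w - abs(w) * boost  # push specific negatives down
        else:
            revised[t] = w                   # general terms unchanged
    return revised
```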

EVALUATION Text data: the Reuters Corpus Volume 1 (RCV1), a total of 806,791 documents, is used to test the effectiveness of the proposed model. Topics: 50 TREC assessor topics. The documents are treated as plain text after pre-processing: stop-words are removed according to a given stop-word list, and terms are stemmed using the Porter stemming algorithm.

Results Results of the proposed model against the baseline models for all 50 assessing topics

Discussions

Percentage changes of RFD over the best baseline (PTM) on the five measures:

            top-20    MAP      F-score   p/b      IAP
RFD         0.557     0.493    0.470     0.472    0.513
PTM         0.496     0.444    0.439     0.430    0.464
Rocchio     0.474     0.420    0.452
SVM         0.453     0.409    0.421     0.408    0.435
BM25        0.445     0.407    0.414     0.428
% change    12.30%    11.18%   6.92%     9.75%    10.44%

Statistical information for RFD with different values of K:

            Avg training docs           Avg extracted terms          Avg term weights
K           Pos.    Neg.    Offenders   T+      G       T-           w(t+)   w(tg)   w(t-)      MAP
K=|D+|/2    12.780  41.300  6.540       23.540  22.360  231.780      4.158   1.400   -0.551     0.493
K=|D-|                      39.920      14.200  15.280  539.360      1.858   0.890   -71.202    0.278
K=|D+|                      10.180      20.780  20.740  280.180      3.060   1.271   -2.965     0.463

Statistical information for RFD and PTM:

RFD average number of extracted terms:  T+ = 23.54, G = 22.36, T- = 231.78
RFD average weights before revision:    w(t+) = 2.842, w(tg) = 1.400, w(t-) = 0.320
RFD average weights after revision:     w(t+) = 4.158, w(t-) = -0.551
PTM (terms extracted from D+):          T = 156.9, average w(t) = 1.452

Offenders selection: results of using different values of K on the RCV1 dataset.

Classification rules: results of using different groups of terms over all 50 assessing topics.

Weight revision: weight distributions before and after the revision for the extracted features.

Conclusion Compared with the state-of-the-art models, the proposed model is consistent and very significant on all five measures over all 50 assessing topics. The negative (offender) selection approach is satisfactory. The use of negative relevance feedback is very significant for relevance feature discovery: it can balance the percentages of specific and general terms, thereby reducing noise.

Questions?