Evaluating Retrieval Systems with Findability Measurement Shariq Bashir, PhD Student, Vienna University of Technology

Agenda
–Document Findability
–Calculating the Findability Measure
–GINI Coefficient
–Query Creation for the Findability Measure
–Experiments

Document Findability
High findability of each and every document in the collection is considered an important factor in legal and patent retrieval settings. For example, in patent retrieval, the inaccessibility of a single related patent document can lead to a wrong patent application being approved.

Document Findability: Easy vs. Hard Findability
–A patent is called easily findable if it is accessible in the top-ranked results of several of its relevant queries.
–The further a patent falls from the top-ranked results, the harder its findability.
–Why? Because users are mostly interested in only the top-ranked results (say, the top 30).

Document Findability
Consider two retrieval systems (RS1, RS2) and three patents (P1, P2, P3). The following table shows the findability values of the three patents within the top 30 results. It is clear that RS2 makes the patents more evenly findable than RS1 (under RS1, P1 is never found at all).

        P1   P2   P3
RS1      0    2    7
RS2      4    7    5

What Makes Documents Hard to Find: System Bias
–Bias is the term used in IR when a retrieval system gives preference to certain features of documents while ranking the results of queries.
–For example, PageRank is biased toward larger in-link counts; BM25, BM25F, and TF-IDF are biased toward large term frequencies.
–Why is bias dangerous? Because under bias some documents will be highly findable, while the rest will be very hard to find.

Bias and Findability Analysis
–We can capture the bias of different retrieval systems using findability analysis.
–If a system has less bias, it will make individual documents more findable.
–Findability evaluation vs. precision-based evaluation: findability evaluation cannot be used at the level of individual queries. It is a large-scale evaluation, used only for capturing the bias of retrieval systems.

Findability Measure
Given a collection of documents D and a large set of queries Q, let k_dq be the rank of document d ∈ D in the result set of query q ∈ Q, and let c denote the maximum rank that a user is willing to proceed down. The function f(k_dq, c) returns 1 if k_dq <= c, and 0 otherwise. The findability of a document is then

r(d) = Σ_{q ∈ Q} f(k_dq, c)
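A minimal sketch of this measure in Python (the helper run_query and its signature are illustrative assumptions, not from the slides):

def findability(docs, queries, run_query, c=30):
    """Compute r(d) = sum over q in Q of f(k_dq, c) for each document d.

    run_query(q) is assumed to return a ranked list of document ids;
    f(k_dq, c) is 1 when d appears at rank k_dq <= c, and 0 otherwise.
    """
    r = {d: 0 for d in docs}
    for q in queries:
        for d in run_query(q)[:c]:  # only ranks 1..c contribute
            if d in r:
                r[d] += 1
    return r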

GINI Coefficient
To view the bias of a retrieval system as a single value, we can use the GINI coefficient. With the findability scores r(d_i) sorted in ascending order and N the total number of documents,

G = [ Σ_{i=1}^{N} (2i − N − 1) · r(d_i) ] / [ (N − 1) · Σ_{j=1}^{N} r(d_j) ]

If G = 0, there is no bias, because all documents are equally findable. If G = 1, only one document is findable, and all other documents have r(d) = 0.
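A sketch of this computation in Python, assuming the Gini formulation used in the retrievability literature (Azzopardi & Vinay, CIKM '08, cited in the references):

def gini(r_values):
    """Gini coefficient over findability scores r(d), in [0, 1].

    G = sum_i (2i - N - 1) * r(d_i) / ((N - 1) * sum_j r(d_j)),
    with the scores sorted in ascending order and i = 1..N.
    """
    r = sorted(r_values)
    n = len(r)
    total = sum(r)
    if n < 2 or total == 0:
        return 0.0  # degenerate cases: no measurable bias
    num = sum((2 * (i + 1) - n - 1) * v for i, v in enumerate(r))
    return num / ((n - 1) * total)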

Bias with Findability (Example)

        r(d) with RS1   r(d) with RS2
d1             2               9
d2             0               7
d3             6              12
d4             5              14
d5            34              18
d6             4              11
d7            39              19

GINI Coefficient with Lorenz Curve
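Feeding the table's scores through the gini() sketch above illustrates the contrast. The GINI values on the original slide are not legible in this transcript, so the numbers in the comments are computed under the formula assumed above:

rs1 = [2, 0, 6, 5, 34, 4, 39]     # r(d) for d1..d7 under RS1
rs2 = [9, 7, 12, 14, 18, 11, 19]  # r(d) for d1..d7 under RS2
print(round(gini(rs1), 3))  # 0.678: findability concentrated on d5 and d7
print(round(gini(rs2), 3))  # 0.211: findability spread far more evenly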

Bias of Retrieval Systems: Experiment Setting
–We used the patents listed under United States Patent Classification (USPC) classes 433 (Dentistry), 424 (Drug, bio-affecting and body treating compositions), 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing), and 423 (Chemistry of inorganic compounds).

Experiment Setting
Retrieval systems used:
–The Okapi retrieval function (BM25).
–Exact match model (Exact).
–TF-IDF (TFIDF).
–Language modeling with term smoothing for pseudo-relevance feedback selection (LM).
–Kullback-Leibler divergence (KLD).
–Term selection value (Robertson and Walker) (QE TS).
–Pseudo-relevance feedback document selection using a clustering approach (Cluster).
–For all query expansion models, we used the top 35 documents for pseudo-relevance feedback and 50 terms for query expansion (a generic sketch follows).
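As a rough illustration of this shared setup, a generic pseudo-relevance feedback expansion might look as follows; the frequency-based term scoring here is a placeholder, not the actual LM, KLD, or QE TS formulas from the slides:

from collections import Counter

def expand_query(query, run_query, doc_terms, n_docs=35, n_terms=50):
    """Generic pseudo-relevance feedback: take the top n_docs results,
    score their terms (here simply by frequency), and append the best
    n_terms of them to the original query."""
    feedback_docs = run_query(query)[:n_docs]
    counts = Counter(t for d in feedback_docs for t in doc_terms[d])
    original = set(query.split())
    expansion = [t for t, _ in counts.most_common() if t not in original]
    return query + " " + " ".join(expansion[:n_terms])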

Experiment Setting: Query Creation for Findability Analysis
In query creation, we try to reflect the approach of patent examiners: how they create their query sets during a "patent invalidity search".

Experiment Setting: Approach 1 (single frequent terms)
–First, we extract all single frequent terms from the Claim sections that have support greater than some threshold.
–Then we combine these single frequent terms into two-, three-, and four-term combinations to construct longer queries, as in the sketch below.
–Use a patent (A) as a query for searching its related documents.
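A sketch of this query generation, assuming whitespace tokenization and an illustrative support threshold:

from collections import Counter
from itertools import combinations

def build_queries(claim_text, min_support=3, max_len=4):
    """Approach 1: frequent single terms from the Claim section,
    combined into two-, three-, and four-term queries."""
    counts = Counter(claim_text.lower().split())
    frequent = sorted(t for t, c in counts.items() if c >= min_support)
    queries = list(frequent)  # single-term queries
    for n in range(2, max_len + 1):
        queries += [" ".join(combo) for combo in combinations(frequent, n)]
    return queries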

Experiment Setting Terms with Support >= 3

Experiment Setting: Approach 2
–If a patent contains many rare terms, then we cannot find all of its similar patents using queries collected from that single patent document alone.
–In this query creation approach, we therefore construct queries by taking patent relatedness into account.

Experiment Setting: Approach 2 Steps
–(Step 1): For each patent, group all of its related patents into a set R using a k-nearest-neighbor approach.
–(Step 2): Then, using this R, construct its language model to find the dominant terms that can retrieve the documents in R, where P_jm(t|R) is the (smoothed) probability of term t in the set R and P(t|corpus) is the probability of term t in the whole collection. This is similar to how terms in language modeling (query expansion) are used for bringing up relevant documents; a sketch follows this list.
–(Step 3): Combine single terms into two-, three-, and four-term combinations to construct longer queries.
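A sketch of Step 2. The exact scoring formula is not legible in this transcript, so ranking terms by how much their Jelinek-Mercer smoothed probability in R exceeds their corpus probability is an assumption:

from collections import Counter

def dominant_terms(related_docs, corpus_docs, lam=0.7, top_k=50):
    """Rank terms of the related set R against the whole collection.

    P_jm(t|R) = lam * P(t|R) + (1 - lam) * P(t|corpus); terms are
    ordered by P_jm(t|R) / P(t|corpus). Assumes every term of R also
    occurs in the corpus (R is part of it), so P(t|corpus) > 0.
    Each document is a list of tokens.
    """
    r_counts = Counter(t for d in related_docs for t in d)
    c_counts = Counter(t for d in corpus_docs for t in d)
    r_total = sum(r_counts.values())
    c_total = sum(c_counts.values())

    def score(t):
        p_r = r_counts[t] / r_total
        p_c = c_counts[t] / c_total
        return (lam * p_r + (1 - lam) * p_c) / p_c

    return sorted(r_counts, key=score, reverse=True)[:top_k]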

Experiment Setting: Properties of Queries Used in Experiments (CQG 1: Approach 1; CQG 2: Approach 2)

Bias of Retrieval Systems with Patent Collection (433, 424) with Query Creation Approach 1

Bias of Retrieval Systems with Patent Collection (433, 424) with Query Creation Approach 2

GINI Index of Retrieval Systems with Patent Collection (433, 424)

GINI Index of Retrieval Systems with Patent Collection (422, 423)

Future Work
We are working toward improving the findability of patents using a query expansion approach. We have results showing that selecting better documents for pseudo-relevance feedback can improve the findability of documents. Considering an externally provided ontology in query expansion can also play a role in improving findability.

References
Leif Azzopardi, Vishwa Vinay. Retrievability: an evaluation measure for higher order information access tasks. CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, October 26-30, 2008, Napa Valley, California, USA.
Chris Jordan, Carolyn Watters, Qigang Gao. Using controlled query generation to evaluate blind relevance feedback algorithms. JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, Chapel Hill, NC, USA.
Tonya Custis, Khalid Al-Kofahi. A new approach for evaluating query expansion: query-document term mismatch. SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 23-27, 2007, Amsterdam, The Netherlands.