
Analyzing Retrieval Models using Retrievability Measurement
Shariq Bashir
Supervisor: ao. Univ. Prof. Dr. Andreas Rauber
Institute of Software Engineering and Interactive Systems, Vienna University of Technology

Outline
- Introduction to Retrievability (Findability) Measure
- Setup for Experiments
- Findability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Introduction
- Retrieval systems are used for searching information
- They rely on retrieval models for ranking documents
- How to select the best retrieval model?
- Evaluate retrieval models
- State of the art:
  – effectiveness analysis, or
  – efficiency (speed/memory)

Effectiveness Measures
- (Precision, Recall, MAP) depend upon
  – few topics
  – few judged documents
- Suitable for precision-oriented retrieval tasks
- Less suitable for recall-oriented retrieval tasks
  – (e.g. patent or legal retrieval)

Findability Measure
- Considers all documents
- The goal is to maximize the findability of documents
- Documents with higher findability under a retrieval model are easier to find than documents with lower findability
- Applications
  – offers another measure for comparing retrieval models
  – identifies the subsets of documents that are hard or easy to find

Findability Measure
- Factors that affect findability
  1. The user query
     – [Query = Data Mining books] vs. [Query = Han Kamber books] for searching the book "Data Mining Concepts and Techniques"
  2. The maximum number of top links/docs checked
  3. The ranking strategy of the retrieval model

Retrievability Measure [Leif Azzopardi and Vishwa Vinay, CIKM 2008]
- Given a collection D of documents and a query set Q
- Retrievability of d ∈ D: r(d) = Σ_{q ∈ Q} f(k_dq, c)
- k_dq: rank of d ∈ D in the result set of query q ∈ Q
- c: the point in the rank list where the user will stop
- f(k_dq, c) = 1 if k_dq <= c, and 0 otherwise
- Gini-Coefficient: summarizes the findability scores (see the Gini-Coefficient appendix slide)
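A minimal sketch of how these retrievability scores could be computed, assuming a hypothetical search(query, cutoff) function that returns the ranked list of document ids for a query under some retrieval model (the function name and data layout are illustrative, not from the slides):

```python
from collections import defaultdict

def retrievability(queries, search, c=100):
    """Compute r(d) = sum over queries of f(k_dq, c), where f is 1 if
    document d appears within the top-c results of the query."""
    r = defaultdict(int)
    for q in queries:
        for rank, doc_id in enumerate(search(q, cutoff=c), start=1):
            if rank <= c:
                r[doc_id] += 1  # f(k_dq, c) = 1 because k_dq <= c
    return r
```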

Outline
- Introduction to Findability Measure
- Setup for Experiments
- Retrievability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Setup for Experiments
- Collections
  1. TREC Chemical Retrieval Track Collection 2009 (TREC-CRT)
  2. USPTO Patent Collections
     – USPC Class 433 (Dentistry) (DentPat)
     – USPC Class 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing) (ChemAppPat)
  3. Austrian News Dataset (ATNews)
- TREC-CRT and ATNews are more skewed; the USPTO collections are less skewed

Setup for Experiments
- Retrieval Models
  – Standard retrieval models: TFIDF, NormTFIDF, BM25, SMART
  – Language models: Jelinek-Mercer Smoothing (JM), Dirichlet Smoothing (DirS), Two-Stage Smoothing (TwoStage), Absolute Discounting Smoothing (AbsDis)
- Query Generation
  – All sections of the patent documents
  – Terms with document frequency (df) > 25% are removed
  – All 3-term and 4-term combinations, as sketched below
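A sketch of the query generation step described above, under the assumption that queries are simply all 3- and 4-term combinations of a document's remaining terms after high-df terms are dropped (helper names are illustrative):

```python
from itertools import combinations

def generate_queries(doc_terms, doc_freq, num_docs, max_df_ratio=0.25, sizes=(3, 4)):
    """Build the exhaustive query set for one document: drop terms whose
    document frequency exceeds 25% of the collection, then emit every
    3-term and 4-term combination of the remaining terms."""
    kept = sorted({t for t in doc_terms
                   if doc_freq.get(t, 0) / num_docs <= max_df_ratio})
    queries = []
    for k in sizes:
        queries.extend(combinations(kept, k))
    return queries
```

The number of combinations grows very quickly with vocabulary size, which is exactly the length bias the normalized scoring function introduced later addresses.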

Setup for Experiments
[Figure: documents ordered by increasing vocabulary size for the TREC-CRT, ATNews, ChemAppPat, and DentPat collections]

Outline
- Introduction to Retrievability Measure
- Setup for Experiments
- Findability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Findability Scoring Functions
- Standard findability scoring function r(d)
  – Does not consider the difference in document vocabulary size
  – Biased towards long documents
  – With r(d), Doc2 has higher findability than Doc5
  – But, due to its small vocabulary, Doc5 cannot generate as large a query subset
- Example (all 3-term combinations, findability percentage):
  – Doc2 = 3600/6545 = 0.55
  – Doc5 = 90/120 = 0.75

Findability Scoring Functions
- Normalized findability r^(d)
  – Normalize r(d) relative to the number of queries generated from d
  – This accounts for the difference between document lengths
  – With φ(d) the set of queries generated from d: r^(d) = r(d) / |φ(d)|
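Continuing the earlier sketch, the normalization could look like the following (assuming queries_per_doc maps each document id to the number of queries generated from it; the exact symbol used in the slides is not recoverable, so the mapping name is illustrative):

```python
def normalized_retrievability(r, queries_per_doc):
    """r_hat(d) = r(d) / |queries generated from d|, so long documents that
    simply generate more queries are not automatically more findable."""
    return {d: r.get(d, 0) / max(queries_per_doc[d], 1) for d in queries_per_doc}
```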

Findability Scoring Functions
- Comparison between r(d) and r^(d)
  – Retrieval models ordered by Gini-Coefficients (retrieval bias)
  – Findability ranks of documents

Findability Scoring Functions Retrieval Model c=100 r(d) Retrieval Model c=100 r^(d) BM250.48DirS0.69 TwoStage0.49AbsDis0.69 DirS0.51JM0.69 AbsDis0.56BM NormTFIDF0.57TwoStage0.72 JM0.59NormTFIDF0.72 TFIDF0.78TFIDF0.94 SMART0.92SMART0.95 TREC-CRT ChemAppPat Retrieval Model c=10 r(d) Retrieval Model c=10 r^(d) BM250.33JM0.37 AbsDis0.34BM DirS0.36AbsDis0.38 TwoStage0.37DirS0.39 JM0.39TwoStage0.42 NormTFIDF0.40NormTFIDF0.42 TFIDF0.47TFIDF0.56 SMART0.85SMART0.56  Correlation between r(d) and in Terms of Gini-Coefficients Retrieval Models are ordered by r(d) and r^(d)

Findability Scoring Functions
- Correlation between r(d) and r^(d) in terms of documents' findability ranks
  – TREC-CRT and ATNews: the correlation between r(d) and r^(d) is low (high difference), due to the large difference between document lengths
  – ChemAppPat and DentPat: the correlation between r(d) and r^(d) is high (low difference), because the document lengths do not differ much
[Figure: correlation between r(d) and r^(d)]

Findability Scoring Functions
- Which findability function is better, r(d) or r^(d)?
  – On the Gini-Coefficient alone it is difficult to decide
- Known-item experiment:
  – Order the documents by findability score and partition them into 30 buckets (from low-findability buckets to high-findability buckets)
  – Pick 40 random docs (known items) per bucket
  – Generate one query per document, of length 4 to 6
  – The goal is to search for each known item using its own query
  – Effectiveness per bucket is measured through Mean Reciprocal Rank (MRR), from low MRR to high MRR effectiveness
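A sketch of the bucket-level known-item evaluation described above, assuming a hypothetical search(query) returning a ranked list of document ids (names and data layout are illustrative):

```python
def bucket_mrr(known_items, search, num_buckets=30):
    """known_items: list of (doc_id, query, findability_score).
    Sort by findability, split into buckets, and report MRR per bucket."""
    items = sorted(known_items, key=lambda x: x[2])
    size = max(len(items) // num_buckets, 1)
    mrr_per_bucket = []
    for b in range(num_buckets):
        bucket = items[b * size:(b + 1) * size]
        rr = []
        for doc_id, query, _ in bucket:
            ranking = search(query)
            rr.append(1.0 / (ranking.index(doc_id) + 1) if doc_id in ranking else 0.0)
        mrr_per_bucket.append(sum(rr) / len(rr) if rr else 0.0)
    return mrr_per_bucket
```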

Retrievability Scoring Functions
- Which findability function is better, r(d) or r^(d)?
  – Expected result: high-findability buckets should have higher effectiveness, since their documents are easier to find than those in low-findability buckets, i.e. a positive correlation with MRR
  – The r^(d) buckets show a stronger positive correlation with MRR than the r(d) buckets
[Figure: correlation between findability and MRR on TREC-CRT and ChemAppPat]

Outline
- Introduction to Findability Measure
- Setup for Experiments
- Findability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Query Characteristics and Findability
[Diagram, current findability analysis style: Q = query set -> findability scores of documents -> Gini-Coefficients]
- Queries do not all have the same quality
- Some queries are more specific (target oriented) than others
- What is the effect of query quality on findability?
- Need to analyze findability with different query-quality subsets
- Creating query-quality subsets
  – Supervised quality labels: we do not have supervised labels
  – Query characteristics (QC): query result list size, query term frequencies in the documents, query quality based on query performance prediction methods
  – For each QC, the large query set is partitioned into 50 subsets

Query Characteristics and Findability
- Query subsets by query quality
  – Q = query set; query quality is predicted with the Simplified Clarity Score (SCS) [He & Ounis, SPIRE 2004]
  – Q is ordered by SCS score and partitioned into 50 subsets
  – Findability analysis is run separately on each of the 50 query subsets
[Figure, TREC-CRT collection: x-axis = query subsets ordered from low to high SCS score, y-axis = Gini-Coefficients; low-SCS subsets give high Gini-Coefficients, high-SCS subsets give low Gini-Coefficients]
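For reference, a minimal sketch of the Simplified Clarity Score as it is usually defined (query-term probability in the query versus in the collection); this follows the common formulation of He & Ounis and may differ in detail from the exact variant used in the slides:

```python
import math

def simplified_clarity_score(query_terms, collection_tf, collection_len):
    """SCS = sum over query terms of P(w|q) * log2(P(w|q) / P(w|C)),
    with P(w|q) = qtf / |q| and P(w|C) = collection term frequency / collection length."""
    score = 0.0
    qlen = len(query_terms)
    for w in set(query_terms):
        p_wq = query_terms.count(w) / qlen
        p_wc = collection_tf.get(w, 0) / collection_len
        if p_wc > 0:
            score += p_wq * math.log2(p_wq / p_wc)
    return score
```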

Outline
- Introduction to Findability Measure
- Setup for Experiments
- Retrievability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Document Features and Findability
- Findability analysis requires processing an exhaustive query set: large processing time and large computation resources
- Idea: exploit the relationship between document features and findability scores
- Can we predict findability without processing an exhaustive set of queries?
  – Does not require heavy processing
  – Only predicts findability ranks; it cannot predict Gini-Coefficients

Document Features and Findability
- Three classes of document features are considered
  – Surface-level features: based on term frequencies within documents and term document frequencies within the collection
  – Features based on term weights: based on the term weighting strategy of the retrieval model
  – Density around nearest neighbors: based on the density around a document's nearest neighbors

Document Features and Findability
Surface-Level Features
  1. NATF: average of the normalized term frequencies of a document
  2. freq: counts the number of high-frequency terms of a document (tf_{t,d}/|d| > 0.03)
  3. NATF_freq: computes NATF over only the frequent terms of the (freq) feature
  4. GC_terms: computes the term-frequency inequality between the terms of a document
  5. freq_GC: the total number of terms of a document that increase the GC_terms score above a given threshold
  6. ADF: considers the average document frequency of the terms
  7. freq_low_df: counts the number of terms of a document whose document frequency is < 5% of the total collection size
  8. ADF_freq: computes the ADF score only over the freq_low_df terms
  9. Document Length: document length
  10. Vocabulary Size: total number of unique terms
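A sketch of how a few of these surface-level features could be computed from a per-document term-frequency dictionary and a collection-wide document-frequency dictionary (features follow the table above where their definitions are explicit; helper names are illustrative):

```python
def surface_features(term_freqs, doc_freq, num_docs):
    """term_freqs: {term: tf within this document}; doc_freq: {term: df in collection}."""
    doc_len = sum(term_freqs.values())
    vocab = len(term_freqs)
    # NATF: average of tf normalized by document length
    natf = sum(tf / doc_len for tf in term_freqs.values()) / vocab
    # freq: number of terms with tf/|d| > 0.03
    freq = sum(1 for tf in term_freqs.values() if tf / doc_len > 0.03)
    # ADF: average document frequency of the document's terms
    adf = sum(doc_freq.get(t, 0) for t in term_freqs) / vocab
    # freq_low_df: terms whose df is below 5% of the collection size
    freq_low_df = sum(1 for t in term_freqs if doc_freq.get(t, 0) < 0.05 * num_docs)
    return {"NATF": natf, "freq": freq, "ADF": adf,
            "freq_low_df": freq_low_df, "doc_len": doc_len, "vocab_size": vocab}
```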

Document Features and Findability
[Figure: surface-level feature correlations for TREC-CRT and ChemAppPat]

Document Features and Findability
- Combining multiple features
  – No single feature performs best for all collections and all retrieval models
  – It is worth analyzing to what extent combining multiple features increases the correlation
  – Regression tree, 50%/50% training/testing split (see the sketch below)
[Table: correlation obtained by combining multiple features vs. correlation of the best single feature, and the % increase in correlation]
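A sketch of the feature-combination experiment with a regression tree, assuming a feature matrix X (document features) and target y (findability scores or ranks); scikit-learn and Spearman correlation are used here as plausible stand-ins, since the slides do not name the exact implementation:

```python
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def combined_feature_correlation(X, y):
    """Train a regression tree on half of the documents and report the rank
    correlation between predicted and true findability on the other half."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    corr, _ = spearmanr(tree.predict(X_test), y_test)
    return corr
```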

Outline
- Introduction to Findability Measure
- Setup for Experiments
- Findability Scoring Functions
- Relationship between Findability and Query Characteristics
- Relationship between Findability and Document Features
- Relationship between Findability and Effectiveness Measures

Relationship between Findability and Effectiveness
- Findability measure: the goal is maximizing findability; does not need relevance judgments
- Effectiveness measures (Recall, Precision, MAP): the goal is maximizing effectiveness; depend upon relevance judgments
- Does any relationship exist between the two? Does maximizing findability imply maximizing effectiveness?
- If a relationship exists:
  – automatic ranking of retrieval models
  – tuning/increasing retrieval model effectiveness on the basis of the findability measure

Relationship between Findability and Effectiveness
- Retrieval models considered
  – Standard retrieval models and language models
  – Low-level IR features (tf, idf, doc length, vocabulary size, collection frequency)
  – Term-proximity based retrieval models
Term-proximity features
  1. f1 (SumMinDist): sum of the minimum distances of all query term pairs
  2. f2 (SumMaxDist): sum of the maximum distances of all query term pairs
  3. f3 (AvgDist): average of the sum of all query term pair distances in the document
  4. f4 (MinDistCount): number of query term pairs with a minimum distance of less than 4 terms
  5. f5 (AvgPairDist): similar to f3, but averages the sum of distances between all query term pairs and all single terms of the query
  6. f6 (CoOccurrence): counts the co-occurrences of query term pairs within a window of less than 4 terms
  7. f7 (PairCoOccurrence): counts the co-occurrences of query term pairs with single query terms within a window of less than 10 terms
  8. f8 (MinCover): shortest text segment in the document that covers all terms of the query at least once
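A sketch of two of these proximity features over a tokenized document, assuming term positions are 0-based token indices (implementation details such as tie handling are illustrative):

```python
from itertools import combinations

def positions(doc_tokens):
    pos = {}
    for i, t in enumerate(doc_tokens):
        pos.setdefault(t, []).append(i)
    return pos

def sum_min_dist(query_terms, doc_tokens):
    """f1: sum over all query term pairs of the minimum distance between
    any occurrence of the two terms in the document."""
    pos = positions(doc_tokens)
    total = 0
    for a, b in combinations(set(query_terms), 2):
        if a in pos and b in pos:
            total += min(abs(i - j) for i in pos[a] for j in pos[b])
    return total

def min_cover(query_terms, doc_tokens):
    """f8: length of the shortest window containing every query term at least once."""
    terms = set(query_terms)
    best = None
    for start in range(len(doc_tokens)):
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in terms:
                seen.add(doc_tokens[end])
            if seen == terms:
                length = end - start + 1
                best = length if best is None else min(best, length)
                break
    return best
```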

[Table, features and retrieval models ranked two ways:
 ordered by Gini-Coefficient (lowest bias first): 1 PairCoOccurrence 0.39, 2 MinDistCount 0.45, 3 CoOccurrence 0.49, 4 BM25, 5 SumMinDist 0.55, 6 TwoStage 0.56, 7 DirS 0.57, 8 MinCover 0.58, 9 AbsDis 0.60, then JM, NormTFIDF, ntf(d,q), AvgPairDist, AvgDist, SumMaxDist, |d|, sdf(d,q), scf(d,q), TFIDF, tf(d,q), SMART, T_d;
 ordered by effectiveness: 1 JM, 2 DirS, 3 TwoStage, 4 AbsDis, 5 MinDistCount, 6 BM25, 7 CoOccurrence, 8 PairCoOccurrence, 9 AvgPairDist, then MinCover, SumMinDist, ntf(d,q), SumMaxDist, AvgDist, NormTFIDF, SMART, sdf(d,q), tf(d,q), TFIDF, scf(d,q), |d|, T_d (0.001)]

Relationship between Findability and Effectiveness
- Correlation = …
- A correlation exists
- It is not perfect, but retrieval models with low retrieval bias consistently appear in at least the top half of the ranks

Relationship between Findability and Effectiveness
- Tuning parameter values over findability
  – Retrieval models contain parameters
  – They control query term normalization or smooth the document relevance score in the case of unseen query terms
  – We tune the parameter values over findability
  – We then examine the effect on the Gini-Coefficient and on Recall/Precision/MAP
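A sketch of such a parameter sweep for BM25's length-normalization parameter b, assuming hypothetical helpers run_retrieval(model, b, queries) returning retrievability scores and gini(scores) as in the appendix slide; the idea is simply to pick the b that minimizes retrieval bias without consulting relevance judgments:

```python
def tune_b_over_findability(queries, run_retrieval, gini, steps=11):
    """Sweep BM25's b from 0 to 1 and keep the value with the lowest Gini-Coefficient."""
    results = []
    for i in range(steps):
        b = i / (steps - 1)
        r_scores = run_retrieval("BM25", b, queries)   # {doc_id: r(d)}
        results.append((gini(list(r_scores.values())), b))
    best_gini, best_b = min(results)
    return best_b, best_gini
```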

Relationship between Findability and Effectiveness
[Figure: parameter b values are varied between 0 and 1]

Relationship between Findability and Effectiveness
[Figure: for JM, parameter values are varied between 0 and 1]

Relationship between Findability and Effectiveness
- Evolving a retrieval model using Genetic Programming and findability
  – Genetic Programming is a branch of soft computing
  – It helps to solve exhaustive search-space problems
[Diagram, Genetic Programming loop: initial population built by randomly combining IR retrieval features -> select the best retrieval models (findability measure as fitness) -> recombination (crossover, mutation) -> next generation; repeat until 100 generations are complete]

Relationship between Findability and Effectiveness
- Evolving a retrieval model using Genetic Programming and findability
  – Solutions (retrieval models) are represented as tree structures
  – Tree nodes are either operators (+, /, *) or ranking features
  – Ranking features: low-level retrieval features, term-proximity based retrieval features, constant values (0.1 to 1)
  – 100 generations are evolved, with 50 solutions per generation
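A minimal sketch of such a tree-structured individual and its random construction; the feature set, depth limit, and scoring hook are illustrative placeholders rather than the exact setup from the slides:

```python
import random

OPERATORS = ["+", "*", "/"]
FEATURES = ["tf", "idf", "doc_len", "SumMinDist", "MinCover"]  # illustrative names

def random_tree(depth=3):
    """Grow a random ranking-function tree: internal nodes are operators,
    leaves are ranking features or constants in [0.1, 1]."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.5:
            return round(random.uniform(0.1, 1.0), 2)          # constant leaf
        return random.choice(FEATURES)                          # feature leaf
    return (random.choice(OPERATORS), random_tree(depth - 1), random_tree(depth - 1))

def score(tree, feature_values):
    """Evaluate the tree for one (document, query) pair given its feature values."""
    if isinstance(tree, float):
        return tree
    if isinstance(tree, str):
        return feature_values[tree]
    op, left, right = tree
    a, b = score(left, feature_values), score(right, feature_values)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    return a / b if b != 0 else 0.0   # protected division

population = [random_tree() for _ in range(50)]  # 50 solutions per generation
```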

Relationship between Findability and Effectiveness
- Evolving a retrieval model using Genetic Programming and findability
  – Two correlation analyses are tested
  – (1) Relationship between findability and effectiveness on the basis of the fittest individual of each generation
  – (2) Relationship between findability and effectiveness on the basis of the average fitness of each generation

Relationship between Findability and Effectiveness
- Evolving a retrieval model using Genetic Programming and findability
  – (First): relationship between findability and effectiveness on the basis of the fittest individual of each generation

Relationship between Findability and Effectiveness
- Evolving a retrieval model using Genetic Programming and findability
  – (Second): relationship between findability and effectiveness on the basis of the average fitness of each generation
  – Generations with a low average Gini-Coefficient also show high effectiveness

Conclusions
- Findability considers all documents, not just a small set of judged documents
- We propose a normalized findability scoring function that produces better findability ranks of documents
- Analysis of findability and query characteristics
  – Different ranges of query characteristics show different retrieval bias
- Analysis of findability and document features
  – Suitable for predicting document findability ranks
- Relationship between findability and effectiveness
  – Findability can be used for automatic ranking of retrieval models
  – It can be used to fine-tune IR systems in an unsupervised manner

Future Work
- Query popularity and findability
  – We currently do not differentiate between popular and unpopular queries
- Visualizing findability
  – Documents that are highly findable with one model
  – Documents that are highly findable with multiple models
  – Documents that are not findable with any model
- Effect of retrieval bias in k-Nearest Neighbor classification
  – Highly findable samples also affect the classification voting in k-NN

Thank You

Gini-Coefficient
- The Gini-Coefficient measures the retrievability inequality between documents
- It also represents retrieval bias
- It provides a bird's-eye view
- If G = 0, there is no bias
- If G = 1, only one document is findable and all other documents have r(d) = 0
Example:
  Document | r(d) with RS1 | r(d) with RS2
  D1       | 2             | 9
  D2       | 0             | 7
  D3       | 6             | 12
  D4       | 5             | 14
  D5       | 34            | 18
  D6       | 4             | 11
  D7       | 39            | 19
  Gini-Coefficient | …     | …
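A sketch of computing the Gini-Coefficient over a set of retrievability scores, using the standard sorted-value formulation (which may differ in minor details from the exact formula used in the slides):

```python
def gini(scores):
    """Gini-Coefficient of retrievability scores: 0 means every document is
    equally retrievable, values near 1 mean retrieval is heavily biased."""
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * r_i / (n * sum(r)), with 1-based i over sorted r
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / (n * total)
```

For the example table above, this formulation yields roughly 0.58 for RS1 and 0.18 for RS2, consistent with the point that RS2 spreads retrievability more evenly across the documents.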

Findability Scoring Functions
[Figure: results on ChemAppPat and TREC-CRT]

Document Features and Retrievability
- Features based on term weights
  – Document terms are weighted by the retrieval model
  – The terms are then added into inverted lists
  – Term weights in the inverted lists are sorted by decreasing score
  1. ATW: computes the average of the term weights of a document
  2. ATRP: computes the average of the term rank positions in the inverted lists
  3. VTRP: variance of the term rank positions in the inverted lists
  4. DiffMedainWeights: computes the average difference between a document's term weights and the median term weight
  5. LowRankRatio: computes how many terms of a document appear in the top 200 rank positions of the sorted inverted lists

Document Features and Retrievability
- On highly skewed collections, these features have good correlation
- On less skewed collections, these features do not have good correlation
- This may be because, in less skewed collections, the term weights of documents are less extreme due to almost similar document lengths
[Figure: results on TREC-CRT and ChemAppPat]

Document Features and Retrievability
- Document-density based features
  – These features are based on the average density around the k-nearest neighbors of a document
  – k is set to 50, 100, and 150
  – Density is computed both with all terms of a document and with its top 40 (highest-frequency) terms
  1. AvgDensity(k=50): average density with 50 nearest neighbors
  2. AvgDensity(k=100): average density with 100 nearest neighbors
  3. AvgDensity(k=150): average density with 150 nearest neighbors
  4. AvgDensity-Top40Terms(k=50): average density with 50 nearest neighbors and top 40 terms
  5. AvgDensity-Top40Terms(k=100): average density with 100 nearest neighbors and top 40 terms
  6. AvgDensity-Top40Terms(k=150): average density with 150 nearest neighbors and top 40 terms
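A sketch of an average-density feature, interpreting "density around the k nearest neighbors" as the mean cosine similarity between a document vector and its k most similar documents (this interpretation and the vector layout are assumptions, since the slides do not spell out the similarity measure):

```python
import math

def cosine(u, v):
    dot = sum(u[t] * w for t, w in v.items() if t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_density(doc_id, doc_vectors, k=50):
    """Mean similarity of a document to its k nearest neighbors in the collection."""
    target = doc_vectors[doc_id]
    sims = sorted((cosine(target, v) for d, v in doc_vectors.items() if d != doc_id),
                  reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0
```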

Query Expansion and Retrievability
- Query expansion methods are investigated for improving retrievability
- Terms for expansion are identified via Pseudo-Relevance Feedback (PRF)
- Baseline results are promising (PRF selection with the top-N docs)
- We further propose two PRF selection approaches
  – based on document clustering
  – based on the similarity of the retrieved documents with the query patent
[Diagram: the query q is processed, PRF extracts expansion terms E from the top-ranked documents (e.g. {D9, D4, D3, D5}), and the expanded query q ∪ E is processed again; PRF documents are selected by 1. Top-N, 2. document clustering, or 3. query patent similarity]

Query Expansion and Retrievability
- Baseline approaches
  – Both approaches rely on the top documents of the query result list for PRF
  – Query expansion based on Language Modeling: terms for expansion are ranked according to the sum of divergences between the documents they occur in and the importance of the terms in the whole collection
  – Query expansion based on Kullback-Leibler Divergence: expansion terms are ranked according to their relative rareness in the PRF set as opposed to the whole collection
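A sketch of Kullback-Leibler based term scoring for expansion, following the usual formulation where a term's score compares its probability in the PRF set against its probability in the whole collection (the exact variant used in the slides may differ):

```python
import math

def kld_expansion_terms(prf_tf, collection_tf, collection_len, top_k=10):
    """prf_tf: term frequencies pooled over the PRF documents;
    collection_tf: term frequencies over the whole collection.
    Score(t) = P(t|PRF) * log(P(t|PRF) / P(t|C)); return the top_k terms."""
    prf_len = sum(prf_tf.values())
    scores = {}
    for t, tf in prf_tf.items():
        p_prf = tf / prf_len
        p_col = collection_tf.get(t, 0) / collection_len
        if p_col > 0:
            scores[t] = p_prf * math.log(p_prf / p_col)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```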

Query Expansion and Retrievability
[Figure: results on TREC-CRT and ChemAppPat]

Query Expansion and Retrievability
[Diagram, clustering-based PRF selection: the top-N documents of the ranked list for query q are clustered (e.g. D9: ( ), D4: (D9, D3), D3: (D9, D4, D5), D5: (D9, D3), D11: (D3, D5), D2: (D9, D4, D3, D5), D1: (D9, D4, D3, D5), D14: (D4)); the documents are re-sorted by cluster size (D2, D1, D3, D4, D5, D11, D14, D9), expansion terms E are extracted from the top of this re-sorted list (e.g. {D2, D1, D3, D4}), and the expanded query q ∪ E is processed]

Query Expansion and Retrievability
- Constructing clusters
  – Clusters are constructed offline, to avoid large processing time
  – Each document builds its own cluster with other documents using k-Nearest Neighbors
[Diagram: documents of the ranked list with their cluster sizes, re-sorted by cluster size as in the previous slide]

Query Expansion and Retrievability
- PRF selection via query patent similarity
  – In prior-art search, patent examiners usually extract query terms from a given query patent
  – Due to the complex structure of patent documents, finding relevant keywords is always a difficult problem
  – Missing terms are another problem
  – Query expansion can help to overcome this problem
  – Query expansion depends on the PRF documents
  – PRF documents are ranked on the basis of their similarity to the query patent

Query Expansion and Retrievability
[Diagram, PRF selection via query patent similarity: the documents in the ranked list of query q are re-ranked by their similarity to the query patent (e.g. D1, D3, D5, D2, D11, D9, D14, D4), expansion terms E are extracted from the top of this re-ranked list (e.g. {D1, D3, D5, D2}), and the expanded query q ∪ E is processed]

Query Expansion and Retrievability
- The full query patent can be used for ranking the PRF documents
  – However, a full query patent may contain thousands of terms, which can also be distributed over documents not relevant to the query
- How can the best terms be identified from the query patent?
- We try to separate good terms from bad terms using term classification
[Diagram: ranked documents are re-ordered by similarity to the query patent]

Query Expansion and Retrievability
- Training dataset for term classification
  – 30 random prior-art (PA) topics from TREC-CRT
  – From each of the 30 PA topics, short queries of length 4 (based on high TF) are used as search queries
  – Baseline score: for each query, PRF documents are ranked according to the query relevance scores; the resulting effectiveness score is used as the baseline score

Query Expansion and Retrievability
- Identifying good, neutral, and bad terms
  – For each training query q, compute the baseline effectiveness score with PRF-based query expansion
  – For each unique term T of the query patent, expand the query to q ∪ T, run PRF-based expansion again, and compute the (q ∪ T) score
  – If the (q ∪ T) score > baseline score, then T is a good term
  – If the (q ∪ T) score = baseline score, then T is a neutral term
  – If the (q ∪ T) score < baseline score, then T is a bad term
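A sketch of this labeling loop, assuming a hypothetical effectiveness(query) helper that runs the query with PRF-based expansion and returns its effectiveness score (all names are illustrative):

```python
def label_terms(query, patent_terms, effectiveness, tolerance=1e-9):
    """Label each candidate query-patent term as good, neutral, or bad by
    comparing the expanded query's effectiveness against the baseline."""
    baseline = effectiveness(query)
    labels = {}
    for term in set(patent_terms):
        score = effectiveness(query + [term])   # q ∪ T
        if score > baseline + tolerance:
            labels[term] = "good"
        elif score < baseline - tolerance:
            labels[term] = "bad"
        else:
            labels[term] = "neutral"
    return labels
```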

Query Expansion and Retrievability
- PRF selection via query patent similarity
  – Terms are classified (predicted) using term features
  – The features are extracted from the query patents, based on the proximity distribution of the expansion term (T) and the query terms
  – J48 is used for classification
  – The overall accuracy on positively classified samples is 83%

Query Expansion and Retrievability
- PRF selection via query patent similarity: results on the TREC-CRT collection
  – CCGen = PRF selection using clustering
  – QP-TS = PRF selection using query patent similarity


Document Features and Retrievability
- The positive correlation indicates that documents with low retrievability mostly lie in low-density areas
- Their nearest neighbors have mostly similar term weights, which makes them low-retrievable
- On highly skewed collections, these features have good correlation
- On less skewed collections, these features do not have good correlation
- This may be because, in less skewed collections, the term weights of documents are less extreme due to similar document lengths
[Figure: results on TREC-CRT and ChemAppPat]