
Slide 1: Analyzing Retrieval Models using Retrievability Measurement
Shariq Bashir
Supervisor: ao. Univ. Prof. Dr. Andreas Rauber
Institute of Software Engineering and Interactive Systems, Vienna University of Technology
bashir@ifs.tuwien.ac.at | http://www.ifs.tuwien.ac.at/~bashir/

Slide 2: Outline
 Introduction to the Retrievability (Findability) Measure
 Setup for Experiments
 Findability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 3: Introduction
 Retrieval systems are used to search for information.
 They rely on retrieval models to rank documents.
 How do we select the best retrieval model?
 State of the art in evaluating retrieval models:
 – Effectiveness analysis, or
 – Efficiency (speed/memory)

Slide 4: Effectiveness Measures
 Effectiveness measures (Precision, Recall, MAP) depend upon:
 – A few topics
 – A few judged documents
 Suitable for precision-oriented retrieval tasks.
 Less suitable for recall-oriented retrieval tasks (e.g. patent or legal retrieval).

Slide 5: Findability Measure
 Considers all documents.
 The goal is to maximize the findability of documents.
 Documents are easier to find under a retrieval model with higher findability than under one with lower findability.
 Applications:
 – Offers another measure for comparing retrieval models.
 – Identifies subsets of documents that are hard or easy to find.

Slide 6: Findability Measure
 Factors that affect findability:
 1. The user query, e.g. [Query = Data Mining books] vs. [Query = Han Kamber books] when searching for the book "Data Mining: Concepts and Techniques".
 2. The maximum number of top links/documents the user checks.
 3. The ranking strategy of the retrieval model.

Slide 7: Retrievability Measure
 [Leif Azzopardi and Vishwa Vinay, CIKM 2008]
 Given a collection D of documents and a query set Q, the retrievability of a document d ∈ D is
   r(d) = Σ_{q∈Q} f(k_{dq}, c)
 where k_{dq} is the rank of d ∈ D in the result list of query q ∈ Q, c is the point in the rank list where the user stops, and f(k_{dq}, c) = 1 if k_{dq} ≤ c, and 0 otherwise.
 The Gini coefficient summarizes the findability scores over the whole collection.
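A minimal sketch of the computation, assuming a hypothetical run_query(q) that returns a ranked list of document ids for query q:

```python
from collections import defaultdict

def retrievability(doc_ids, queries, run_query, c=100):
    """r(d) = sum over q in Q of f(k_dq, c), where f(k_dq, c) = 1
    if d is ranked at position <= c for query q, and 0 otherwise."""
    r = defaultdict(int)
    for q in queries:
        ranked = run_query(q)        # ranked list of doc ids for q
        for d in ranked[:c]:         # only the top-c ranks count
            r[d] += 1
    # documents never retrieved in any top-c list keep r(d) = 0
    return {d: r[d] for d in doc_ids}
```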

Slide 8: Outline
 Introduction to the Findability Measure
 Setup for Experiments
 Retrievability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 9: Setup for Experiments
 Collections:
 1. TREC Chemical Retrieval Track Collection 2009 (TREC-CRT)
 2. USPTO Patent Collections:
    – USPC Class 433, Dentistry (DentPat)
    – USPC Class 422, Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing (ChemAppPat)
 3. Austrian News Dataset (ATNews)
 TREC-CRT and ATNews are more skewed; the USPTO collections are less skewed.

Slide 10: Setup for Experiments
 Retrieval models:
 – Standard retrieval models: TFIDF, NormTFIDF, BM25, SMART
 – Language models: Jelinek-Mercer Smoothing (JM), Dirichlet Smoothing (DirS), Two-Stage Smoothing (TwoStage), Absolute Discounting Smoothing (AbsDis)
 Query generation (see the sketch below):
 – All sections of the patent documents are used.
 – Terms with document frequency (df) > 25% are removed.
 – All 3- and 4-term combinations form the query set.
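A sketch of the query generation step, assuming df maps each term to its document frequency and n_docs is the collection size (both names are illustrative):

```python
from itertools import combinations

def generate_queries(doc_terms, df, n_docs, max_df_ratio=0.25):
    """Yield all 3- and 4-term combination queries from one document,
    after dropping terms whose df exceeds 25% of the collection."""
    vocab = sorted(t for t in set(doc_terms)
                   if df[t] / n_docs <= max_df_ratio)
    for k in (3, 4):
        for combo in combinations(vocab, k):
            yield " ".join(combo)
```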

Slide 11: Setup for Experiments
 [Figure: vocabulary-size distributions of the four collections (TREC-CRT, ATNews, ChemAppPat, DentPat); documents ordered by increasing vocabulary size.]

Slide 12: Outline
 Introduction to the Retrievability Measure
 Setup for Experiments
 Findability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 13: Findability Scoring Functions
 The standard findability scoring function r(d):
 – Does not account for differences in document vocabulary size.
 – Is biased towards long documents.
 – Example (all 3-term combinations): with r(d), Doc2 has higher findability than Doc5; yet Doc2 is found by only 3600 of its 6545 generated queries (0.55), while Doc5, whose small vocabulary yields a much smaller query set, is found by 90 of its 120 queries (0.75).

Slide 14: Findability Scoring Functions
 Normalized findability:
 – Normalize r(d) relative to the number of queries generated from d:
   r̂(d) = r(d) / |φ(d)|
 – where φ(d) is the set of queries generated from d. This accounts for the difference between document lengths.
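A one-function sketch of the normalization, where n_queries_per_doc stands in for |φ(d)| (an illustrative name):

```python
def normalized_findability(r, n_queries_per_doc):
    """r_hat(d) = r(d) / |phi(d)|: raw findability divided by the
    number of queries generated from d, removing the length bias."""
    return {d: score / n_queries_per_doc[d]
            for d, score in r.items() if n_queries_per_doc[d] > 0}
```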

Slide 15: Findability Scoring Functions
 Comparison between r(d) and r̂(d):
 – Retrieval models ordered by Gini coefficient (retrieval bias).
 – Findability ranks of documents.

Slide 16: Findability Scoring Functions
 Correlation between r(d) and r̂(d) in terms of Gini coefficients; retrieval models are ordered by r(d) and by r̂(d):

 TREC-CRT (c = 100)
 | Retrieval Model | r(d) | Retrieval Model | r̂(d) |
 | BM25      | 0.48 | DirS      | 0.69 |
 | TwoStage  | 0.49 | AbsDis    | 0.69 |
 | DirS      | 0.51 | JM        | 0.69 |
 | AbsDis    | 0.56 | BM25      | 0.71 |
 | NormTFIDF | 0.57 | TwoStage  | 0.72 |
 | JM        | 0.59 | NormTFIDF | 0.72 |
 | TFIDF     | 0.78 | TFIDF     | 0.94 |
 | SMART     | 0.92 | SMART     | 0.95 |

 ChemAppPat (c = 10)
 | Retrieval Model | r(d) | Retrieval Model | r̂(d) |
 | BM25      | 0.33 | JM        | 0.37 |
 | AbsDis    | 0.34 | BM25      | 0.38 |
 | DirS      | 0.36 | AbsDis    | 0.38 |
 | TwoStage  | 0.37 | DirS      | 0.39 |
 | JM        | 0.39 | TwoStage  | 0.42 |
 | NormTFIDF | 0.40 | NormTFIDF | 0.42 |
 | TFIDF     | 0.47 | TFIDF     | 0.56 |
 | SMART     | 0.85 | SMART     | 0.56 |

Slide 17: Findability Scoring Functions
 Correlation between r(d) and r̂(d) in terms of document findability ranks:
 – TREC-CRT and ATNews: the correlation between r(d) and r̂(d) is low (large difference), because document lengths differ widely.
 – ChemAppPat and DentPat: the correlation between r(d) and r̂(d) is high (small difference), because document lengths differ little.

Slide 18: Findability Scoring Functions
 Which findability function is better, r(d) or r̂(d)?
 – On the Gini coefficient alone it is difficult to decide.
 Known-item experiment (see the sketch below):
 – Order the documents by findability score and partition them into 30 buckets, from low-findability to high-findability buckets.
 – From each bucket, draw 40 random documents (known items) and generate one query per document of length 4 to 6.
 – The goal is to find each known item using its own query.
 – Effectiveness on the known items is measured by Mean Reciprocal Rank (MRR): low-findability buckets should yield low MRR, high-findability buckets high MRR.
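A sketch of the bucket evaluation, again assuming a hypothetical run_query; known_items is a list of (query, target document id) pairs, here drawn from the 40 sampled documents per bucket:

```python
def mean_reciprocal_rank(known_items, run_query):
    """MRR over (query, target) known-item pairs."""
    total = 0.0
    for query, target in known_items:
        ranked = run_query(query)
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(known_items) if known_items else 0.0

def bucket_mrr(docs_by_findability, known_items, run_query, n_buckets=30):
    """Split the findability-ordered document list into buckets and
    compute MRR per bucket from the known items falling in it."""
    size = len(docs_by_findability) // n_buckets
    scores = []
    for b in range(n_buckets):
        bucket = set(docs_by_findability[b * size:(b + 1) * size])
        items = [(q, d) for q, d in known_items if d in bucket]
        scores.append(mean_reciprocal_rank(items, run_query))
    return scores
```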

Slide 19: Retrievability Scoring Functions
 Which findability function is better, r(d) or r̂(d)?
 – Expected result: high-findability buckets should have high effectiveness, since their documents are easier to find than those in low-findability buckets, i.e. a positive correlation with MRR.
 – The r̂(d) buckets show a stronger positive correlation with MRR than the r(d) buckets (on TREC-CRT and ChemAppPat).

Slide 20: Outline
 Introduction to the Findability Measure
 Setup for Experiments
 Findability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 21: Query Characteristics and Findability
 Current findability analysis style: a query set Q yields findability scores for the documents, summarized by the Gini coefficient.
 Queries do not all have the same quality; some queries are far more specific (target-oriented) than others.
 What is the effect of query quality on findability? We need to analyze findability with query subsets of different quality.
 Creating query-quality subsets:
 – Supervised quality labels: not available here.
 – Query characteristics (QC): query result list size, query term frequencies in the documents, and query quality estimated by query performance prediction methods.
 – For each QC, the large query set is partitioned into 50 subsets.

Slide 22: Query Characteristics and Findability
 Query subsets by query quality:
 – Query quality is predicted with the Simplified Clarity Score (SCS) [He & Ounis, SPIRE 2004].
 – Q is ordered by SCS score and partitioned into 50 subsets; findability analysis is run on each subset.
 Results on the TREC-CRT collection (x-axis: query subsets ordered from low to high SCS score; y-axis: Gini coefficient):
 – Subsets with low SCS scores produce high Gini coefficients.
 – Subsets with high SCS scores produce low Gini coefficients.
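A sketch of SCS as defined by He & Ounis, SCS(q) = Σ_t P(t|q) · log2(P(t|q)/P(t|C)), where P(t|q) is the term's share of the query and P(t|C) its share of the collection; coll_tf and coll_len are illustrative names:

```python
import math

def simplified_clarity_score(query_terms, coll_tf, coll_len):
    """SCS(q) = sum_t P(t|q) * log2(P(t|q) / P(t|C))."""
    n = len(query_terms)
    score = 0.0
    for t in set(query_terms):
        p_tq = query_terms.count(t) / n          # P(t|q)
        p_tc = coll_tf.get(t, 0) / coll_len      # P(t|C)
        if p_tc > 0:
            score += p_tq * math.log2(p_tq / p_tc)
    return score
```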

Slide 23: Outline
 Introduction to the Findability Measure
 Setup for Experiments
 Retrievability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 24: Document Features and Findability
 Findability analysis requires processing an exhaustive query set: long processing times and large computational resources.
 Question: is there a relationship between document features and findability scores? Can we predict findability without processing an exhaustive set of queries?
 – Feature-based prediction does not require heavy processing.
 – It can only predict findability ranks; it cannot predict Gini coefficients.

Slide 25: Document Features and Findability
 Three classes of document features are considered:
 – Surface-level features, based on term frequencies within documents and term document frequencies within the collection.
 – Features based on term weights, derived from the term-weighting strategy of the retrieval model.
 – Features based on the density around the nearest neighbors of documents.

Slide 26: Document Features and Findability
 Surface-level features:
 | #  | Feature | Description |
 | 1  | NATF | Average of the normalized term frequencies of a document |
 | 2  | freq | Number of high-frequency terms of a document (tf_{t,d}/|d| > 0.03) |
 | 3  | NATF_freq | NATF computed over the frequent terms of the freq feature |
 | 4  | GC_terms | Term-frequency inequality between the terms of a document |
 | 5  | freq_GC | Number of terms of a document that push the GC_terms score above 0.25 |
 | 6  | ADF | Average document frequency of the terms |
 | 7  | freq_low_df | Number of terms of a document with document frequency < 5% of the collection size |
 | 8  | ADF_freq | ADF score computed only over the freq_low_df terms |
 | 9  | Document Length | Document length |
 | 10 | Vocabulary Size | Total number of unique terms |
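A sketch of a few of these features; the deck does not pin down NATF's normalization, so tf divided by the document's maximum tf is an assumption here, and term_freqs/df are illustrative names:

```python
def surface_features(term_freqs, df, n_docs):
    """A few surface-level features for one document.
    term_freqs: {term: tf}; df: collection document frequencies."""
    doc_len = sum(term_freqs.values())
    vocab = list(term_freqs)
    max_tf = max(term_freqs.values())
    return {
        # NATF: one plausible reading, tf normalized by the max tf
        "NATF": sum(tf / max_tf for tf in term_freqs.values()) / len(vocab),
        # freq: terms whose length-normalized tf exceeds 0.03
        "freq": sum(1 for tf in term_freqs.values() if tf / doc_len > 0.03),
        # ADF: average document frequency of the document's terms
        "ADF": sum(df[t] for t in vocab) / len(vocab),
        # freq_low_df: terms with df below 5% of the collection size
        "freq_low_df": sum(1 for t in vocab if df[t] / n_docs < 0.05),
        "doc_length": doc_len,
        "vocab_size": len(vocab),
    }
```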

Slide 27: Document Features and Findability
 [Figure: correlation of the surface-level features with document findability ranks on TREC-CRT and ChemAppPat.]

Slide 28: Document Features and Findability
 Combining multiple features:
 – No single feature performs best for all collections and all retrieval models.
 – It is worth analyzing to what extent combining multiple features increases the correlation.
 – A regression tree is trained with a 50%/50% training/testing split (see the sketch below); the correlation obtained by combining multiple features is compared against the correlation of the best single feature, and the percentage increase in correlation is reported.
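A sketch of the combination step; the slide names only "regression tree", so scikit-learn's DecisionTreeRegressor and Spearman rank correlation are one plausible realization. X is the document-feature matrix and y the findability scores:

```python
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def combined_feature_correlation(X, y):
    """Train a regression tree on a 50%/50% split and report the rank
    correlation of predicted vs. true findability on the test half."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
    tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    corr, _ = spearmanr(tree.predict(X_te), y_te)
    return corr
```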

Slide 29: Outline
 Introduction to the Findability Measure
 Setup for Experiments
 Findability Scoring Functions
 Relationship between Findability and Query Characteristics
 Relationship between Findability and Document Features
 Relationship between Findability and Effectiveness Measures

Slide 30: Relationship between Findability and Effectiveness
 Findability measure: the goal is to maximize findability; it does not need relevance judgments.
 Effectiveness measures (Recall, Precision, MAP): the goal is to maximize effectiveness; they depend upon relevance judgments.
 Does any relationship exist between the two, i.e. does maximizing findability also maximize effectiveness?
 If such a relationship exists:
 – Automatic ranking of retrieval models.
 – Tuning/increasing retrieval model effectiveness on the basis of the findability measure.

Slide 31: Relationship between Findability and Effectiveness
 Retrieval models compared:
 – Standard retrieval models and language models.
 – Low-level IR features (tf, idf, document length, vocabulary size, collection frequency).
 – Term-proximity based retrieval models:
 | # | Feature | Description |
 | 1 | f1 (SumMinDist) | Sum of minimum distances of all query term pairs |
 | 2 | f2 (SumMaxDist) | Sum of maximum distances of all query term pairs |
 | 3 | f3 (AvgDist) | Average of the sum of all query term pair distances in the document |
 | 4 | f4 (MinDistCount) | Number of query term pairs with a minimum distance of less than 4 terms |
 | 5 | f5 (AvgPairDist) | Like f3, but averages the sum of distances between all query term pairs and all single query terms |
 | 6 | f6 (CoOccurrence) | Frequency of co-occurrence of query term pairs within a window of less than 4 terms |
 | 7 | f7 (PairCoOccurrence) | Frequency of co-occurrence of query term pairs with single query terms within a window of less than 10 terms |
 | 8 | f8 (MinCover) | Shortest text segment in the document that covers all query terms at least once |
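As one example, f8 (MinCover) can be computed with a standard sliding-window scan; a minimal sketch:

```python
def min_cover(doc_tokens, query_terms):
    """Length of the shortest document span containing every query
    term at least once (0 if the document has no full cover)."""
    needed = {t: 0 for t in set(query_terms)}
    missing = len(needed)
    best, left = 0, 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            if needed[tok] == 0:
                missing -= 1
            needed[tok] += 1
        while missing == 0:                 # shrink window from the left
            span = right - left + 1
            if best == 0 or span < best:
                best = span
            lt = doc_tokens[left]
            if lt in needed:
                needed[lt] -= 1
                if needed[lt] == 0:
                    missing += 1
            left += 1
    return best
```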

Slide 32: Relationship between Findability and Effectiveness
 Rankings by retrieval bias (Gini coefficient, c = 100) and by effectiveness (Recall@100):
 | #  | Feature | Gini (c=100) | #  | Feature | Recall@100 |
 | 1  | PairCoOccurrence | 0.39 | 1  | JM | 0.184 |
 | 2  | MinDistCount | 0.45 | 2  | DirS | 0.177 |
 | 3  | CoOccurrence | 0.49 | 3  | TwoStage | 0.174 |
 | 4  | BM25 | 0.52 | 4  | AbsDis | 0.170 |
 | 5  | SumMinDist | 0.55 | 5  | MinDistCount | 0.156 |
 | 6  | TwoStage | 0.56 | 6  | BM25 | 0.156 |
 | 7  | DirS | 0.57 | 7  | CoOccurrence | 0.147 |
 | 8  | MinCover | 0.58 | 8  | PairCoOccurrence | 0.139 |
 | 9  | AbsDis | 0.60 | 9  | AvgPairDist | 0.134 |
 | 10 | JM | 0.62 | 10 | MinCover | 0.130 |
 | 11 | NormTFIDF | 0.62 | 11 | SumMinDist | 0.126 |
 | 12 | ntf(d,q) | 0.63 | 12 | ntf(d,q) | 0.107 |
 | 13 | AvgPairDist | 0.66 | 13 | SumMaxDist | 0.107 |
 | 14 | AvgDist | 0.67 | 14 | AvgDist | 0.106 |
 | 15 | SumMaxDist | 0.68 | 15 | NormTFIDF | 0.082 |
 | 16 | |d| | 0.74 | 16 | SMART | 0.074 |
 | 17 | sdf(d,q) | 0.85 | 17 | sdf(d,q) | 0.042 |
 | 18 | scf(d,q) | 0.85 | 18 | tf(d,q) | 0.016 |
 | 19 | TFIDF | 0.91 | 19 | TFIDF | 0.008 |
 | 20 | tf(d,q) | 0.92 | 20 | scf(d,q) | 0.002 |
 | 21 | SMART | 0.93 | 21 | |d| | 0.001 |
 | 22 | T_d | 0.99 | 22 | T_d | 0.001 |

Slide 33: Relationship between Findability and Effectiveness
 Rank correlations between the Gini ordering and the Recall@100 ordering across the four collections: 0.80, 0.75, 0.80, 0.73.
 A correlation exists. It is not perfect, but retrieval models with low retrieval bias consistently appear in at least the top half of the effectiveness ranking.

Slide 34: Relationship between Findability and Effectiveness
 Tuning parameter values over findability:
 – Retrieval models contain parameters that control query term normalization or smooth the document relevance score for unseen query terms.
 – We tune the parameter values over findability and examine the effect on the Gini coefficient and on Recall/Precision/MAP.

Slide 35: Relationship between Findability and Effectiveness
 BM25: the length-normalization parameter b is varied between 0 and 1.
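For reference, b enters BM25's length normalization as below; this is a common variant of the weighting, not necessarily the exact formula used in the experiments. Setting b = 0 turns length normalization off, b = 1 applies it fully:

```python
import math

def bm25_term_score(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    """BM25 score contribution of one term; the swept parameter b
    controls how strongly document length is normalized."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm
```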

Slide 36: Relationship between Findability and Effectiveness
 JM smoothing: the smoothing parameter is varied between 0 and 1.

Slide 37: Relationship between Findability and Effectiveness
 Evolving a retrieval model using genetic programming (GP) and findability:
 – Genetic programming is a branch of soft computing that helps to solve exhaustive search-space problems.
 – The GP loop: build an initial population by randomly combining IR ranking features; select the best retrieval models using the findability measure as the fitness function; recombine them (crossover, mutation) to form the next generation; repeat until 100 generations are complete.

Slide 38: Relationship between Findability and Effectiveness
 Evolving a retrieval model using genetic programming and findability:
 – Each solution (retrieval model) is represented as a tree.
 – Tree nodes are either operators (+, /, *) or ranking features.
 – Ranking features: low-level retrieval features, term-proximity based retrieval features, and constant values (0.1 to 1).
 – 100 generations are evolved with 50 solutions per generation (see the sketch below).
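A compact sketch of the tree representation, with operators as internal nodes and features or constants as leaves; the growth policy and feature names are illustrative:

```python
import random

OPS = {"+": lambda a, b: a + b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b if b else 0.0}

def random_tree(features, depth=3):
    """Grow a random ranking tree: internal nodes are operators,
    leaves are ranking features or constants in [0.1, 1]."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.5:
            return random.choice(features)            # feature leaf
        return round(random.uniform(0.1, 1.0), 2)     # constant leaf
    op = random.choice(list(OPS))
    return (op,
            random_tree(features, depth - 1),
            random_tree(features, depth - 1))

def evaluate(tree, feat_values):
    """Score one document/query pair by recursive evaluation."""
    if isinstance(tree, tuple):
        op, lhs, rhs = tree
        return OPS[op](evaluate(lhs, feat_values),
                       evaluate(rhs, feat_values))
    return feat_values[tree] if isinstance(tree, str) else tree
```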

Slide 39: Relationship between Findability and Effectiveness
 Evolving a retrieval model using genetic programming and findability; two correlation analyses are tested:
 – (1) The relationship between findability and effectiveness on the basis of the fittest individual of each generation.
 – (2) The relationship between findability and effectiveness on the basis of the average fitness of each generation.

Slide 40: Relationship between Findability and Effectiveness
 Evolving a retrieval model using genetic programming and findability:
 – (First) The relationship between findability and effectiveness on the basis of the fittest individual of each generation.

Slide 41: Relationship between Findability and Effectiveness
 Evolving a retrieval model using genetic programming and findability:
 – (Second) The relationship between findability and effectiveness on the basis of the average fitness of each generation.
 – Generations with a low average Gini coefficient also have high effectiveness in terms of Recall@100.

Slide 42: Conclusions
 Findability considers all documents, not just a small set of judged documents.
 We propose a normalized findability scoring function that produces better findability ranks of documents.
 Analysis of findability and query characteristics: different ranges of query characteristics exhibit different retrieval bias.
 Analysis of findability and document features: the features are suitable for predicting document findability ranks.
 Relationship between findability and effectiveness:
 – Findability can be used for automatic ranking of retrieval models.
 – It can be used to fine-tune IR systems in an unsupervised manner.

Slide 43: Future Work
 Query popularity and findability: we are not yet differentiating between popular and unpopular queries.
 Visualizing findability:
 – Documents that are highly findable with one model.
 – Documents that are highly findable with multiple models.
 – Documents that are not findable with any model.
 Effect of retrieval bias in k-nearest-neighbor classification: highly findable samples also affect the classification voting in k-NN.

Slide 44: Thank You


Slide 46: Gini Coefficient
 The Gini coefficient measures the retrievability inequality between documents; it represents the retrieval bias and provides a bird's-eye view.
 If G = 0, there is no bias; if G = 1, only one document is findable and all other documents have r(d) = 0.
 Example with two ranking systems:
 | Document | r(d) with RS1 | r(d) with RS2 |
 | D1 | 2  | 9  |
 | D2 | 0  | 7  |
 | D3 | 6  | 12 |
 | D4 | 5  | 14 |
 | D5 | 34 | 18 |
 | D6 | 4  | 11 |
 | D7 | 39 | 19 |
 | Gini coefficient | 0.58 | 0.18 |
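A sketch of the computation over sorted scores; running it on the two columns above reproduces 0.58 and 0.18:

```python
def gini(values):
    """Gini coefficient of findability scores: 0 = no bias,
    1 = all retrievability mass on a single document."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x
               for i, x in enumerate(xs, start=1)) / (n * total)

print(round(gini([2, 0, 6, 5, 34, 4, 39]), 2))     # RS1 -> 0.58
print(round(gini([9, 7, 12, 14, 18, 11, 19]), 2))  # RS2 -> 0.18
```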

Slide 47: Findability Scoring Functions
 [Figure: additional r(d) vs. r̂(d) comparison on ChemAppPat and TREC-CRT.]

Slide 48: Document Features and Retrievability
 Features based on term weights:
 – Terms of documents are weighted by the retrieval model and then added into inverted lists; term weights in the inverted lists are sorted by decreasing score.
 | # | Feature | Description |
 | 1 | ATW | Average of the term weights of a document |
 | 2 | ATRP | Average of the term rank positions in the inverted lists |
 | 3 | VTRP | Variance of the term rank positions in the inverted lists |
 | 4 | DiffMedianWeights | Average difference between a document's term weights and the median term weight |
 | 5 | LowRankRatio | Number of a document's terms that appear in the top 200 rank positions of the sorted inverted lists |

Slide 49: Document Features and Retrievability
 On highly skewed collections (e.g. TREC-CRT), these features correlate well with findability; on less skewed collections (e.g. ChemAppPat), they do not.
 This may be because in less skewed collections the term weights of documents are less extreme, owing to nearly uniform document lengths.

Slide 50: Document Features and Retrievability
 Document-density based features:
 – These features are based on the average density of the k-nearest neighbors of a document, with k = 50, 100, and 150.
 – Density is computed both with all terms of a document and with its top 40 (highest-frequency) terms.
 | # | Feature | Description |
 | 1 | AvgDensity(k=50) | Average density with 50 nearest neighbors |
 | 2 | AvgDensity(k=100) | Average density with 100 nearest neighbors |
 | 3 | AvgDensity(k=150) | Average density with 150 nearest neighbors |
 | 4 | AvgDensity-Top40Terms(k=50) | Average density with 50 nearest neighbors and top 40 terms |
 | 5 | AvgDensity-Top40Terms(k=100) | Average density with 100 nearest neighbors and top 40 terms |
 | 6 | AvgDensity-Top40Terms(k=150) | Average density with 150 nearest neighbors and top 40 terms |
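A brute-force sketch of the density features, assuming doc_vectors is a dense document-term (or reduced) matrix; the deck does not name a similarity measure, so cosine similarity is an assumption:

```python
import numpy as np

def avg_knn_density(doc_vectors, k=50):
    """AvgDensity(k): mean cosine similarity of each document to its
    k nearest neighbors (O(n^2); fine for modest collections)."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)        # exclude the document itself
    top_k = np.sort(sims, axis=1)[:, -k:]  # k largest similarities per row
    return top_k.mean(axis=1)
```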

Slide 51: Query Expansion and Retrievability
 Query expansion methods are investigated for improving retrievability.
 Terms for expansion are identified via pseudo-relevance feedback (PRF): the query q is processed, expansion terms E are extracted from the top-ranked PRF documents, and the expanded query q ∪ E is processed again.
 Baseline results (PRF selection with the top-N documents) are promising.
 We further propose two PRF selection approaches:
 – Based on document clustering.
 – Based on the similarity of retrieved documents to the query patent.

Slide 52: Query Expansion and Retrievability
 Baseline approaches (both rely on the top documents of the query result list for PRF):
 – Query expansion based on language modeling: terms are ranked by the sum of divergences between the documents they occur in and the importance of the terms in the whole collection.
 – Query expansion based on Kullback-Leibler divergence: terms are ranked by their relative rareness in the PRF set as opposed to the whole collection.
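A sketch of the KLD term-ranking step; prf_tf/prf_len describe the concatenated PRF documents and coll_tf/coll_len the whole collection (illustrative names):

```python
import math

def kld_expansion_terms(prf_tf, prf_len, coll_tf, coll_len, n_terms=20):
    """Rank expansion candidates by
    score(t) = P(t|PRF) * log(P(t|PRF) / P(t|C))."""
    scored = []
    for t, tf in prf_tf.items():
        p_prf = tf / prf_len
        p_coll = coll_tf.get(t, 0) / coll_len
        if p_coll > 0:
            scored.append((p_prf * math.log(p_prf / p_coll), t))
    return [t for _, t in sorted(scored, reverse=True)[:n_terms]]
```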

Slide 53: Query Expansion and Retrievability
 [Figure: query expansion results on TREC-CRT and ChemAppPat.]

Slide 54: Query Expansion and Retrievability
 PRF selection using clustering:
 – The query is processed and the top N documents of the ranked list are clustered: each document forms a cluster with the other top documents that are its neighbors (e.g. D2 clusters with D9, D4, D3, D5).
 – The documents are then sorted by cluster size, and the expansion terms E are extracted from the top documents of this reordering (e.g. {D2, D1, D3, D4} instead of the original top ranks).
 – The expanded query q ∪ E is processed again.

Slide 55: Query Expansion and Retrievability
 Constructing the clusters:
 – Clusters are constructed offline, to avoid long processing times at query time.
 – Each document forms its own cluster with other documents using k-nearest neighbors; at query time, the top-N documents are simply sorted by the size of their clusters.

Slide 56: Query Expansion and Retrievability
 PRF selection via query patent similarity:
 – In prior-art search, patent examiners usually extract query terms from a given query patent.
 – Due to the complex structure of patent documents, finding relevant keywords is a difficult problem; missing terms are another problem.
 – Query expansion can help to overcome this, and it depends on the PRF documents.
 – Here, the PRF documents are ranked on the basis of their similarity to the query patent.

Slide 57: Query Expansion and Retrievability
 The query is processed; the retrieved documents are re-ranked by their similarity to the query patent (e.g. D1, D3, D5, D2, ...); expansion terms E are extracted from the top documents of this re-ranking; and the expanded query q ∪ E is processed again.

Slide 58: Query Expansion and Retrievability
 The full query patent can be used for ranking the PRF documents.
 – However, a full query patent may contain thousands of terms, which may also be distributed across documents not relevant to the query.
 How do we identify the best terms from the query patent?
 We try to separate the good terms from the bad terms using term classification.

Slide 59: Query Expansion and Retrievability
 Training dataset for term classification:
 – 30 random prior-art (PA) topics from TREC-CRT.
 – From each of the 30 PA topics, short queries of length 4 (based on the highest-tf terms) are used as search queries.
 – Baseline score: for each query, the PRF documents are ranked according to their query relevance scores, and the resulting effectiveness score is used as the baseline.

Slide 60: Query Expansion and Retrievability
 Identifying good, neutral, and bad terms:
 – For each training query q, run the PRF pipeline (process query, pseudo-relevance feedback, query expansion, process expanded query) to obtain the baseline effectiveness score.
 – For each unique term T of the query patent, run the same pipeline with the expanded query q ∪ T and compare scores:
   If the (q ∪ T) score > baseline score, then T is a good term.
   If the (q ∪ T) score = baseline score, then T is a neutral term.
   If the (q ∪ T) score < baseline score, then T is a bad term.
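The labeling rule as a sketch, with run_and_score standing in for the full PRF-plus-evaluation pipeline (an illustrative callback):

```python
def label_term(term, query_terms, baseline_score, run_and_score, eps=1e-9):
    """good / neutral / bad label for a query-patent term, based on
    whether adding it to the query beats the baseline PRF score."""
    score = run_and_score(query_terms + [term])   # effectiveness of q U {T}
    if score > baseline_score + eps:
        return "good"
    if score < baseline_score - eps:
        return "bad"
    return "neutral"
```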

Slide 61: Query Expansion and Retrievability
 PRF selection via query patent similarity:
 – Terms are classified (predicted) using term features identified from the query patents, based on the proximity distribution of the expansion term T and the query terms.
 – J48 is used for classification.
 – The overall accuracy on positively classified samples is 83%.

Slide 62: Query Expansion and Retrievability
 PRF selection via query patent similarity; results on the TREC-CRT collection.
 – CCGen = PRF selection using clustering.
 – QP-TS = PRF selection using query patent similarity.

Slide 63: Query Expansion and Retrievability
 Further results on the TREC-CRT collection (same legend: CCGen = PRF selection using clustering; QP-TS = PRF selection using query patent similarity).

Slide 64: Document Features and Retrievability
 The positive correlation indicates that documents with low retrievability mostly lie in low-density areas: their nearest neighbors have mostly similar term weights, which makes them hard to retrieve.
 On highly skewed collections (TREC-CRT), these features correlate well; on less skewed collections (ChemAppPat), they do not.
 This may be because in less skewed collections the term weights of documents are less extreme, owing to similar document lengths.

