Large Scale Findability Analysis


1 Large Scale Findability Analysis
Shariq Bashir, PhD Candidate, Department of Software Technology and Interactive Systems

2 Agenda
Large Scale Findability Experiments: more than a million patents are used for indexing, and Findability is analyzed using all possible queries of a patent.
Findability Analysis with different query classes.
Issues in Queries used for Findability Analysis.

3 Introduction
Patent retrieval is a recall-oriented domain, so the Findability of each and every patent in the collection is an important factor. There is a need to analyze how many patents are hard or easy to find in the collection.
Findability Measurement: the analysis is based on Findability, a measurement in IR for analyzing how easily a document can be found in a collection. With it we can identify Low and High Findable subsets, compare different retrieval systems to see which is better at finding patents, and identify system bias, for example whether a system prefers shorter documents over longer ones or the reverse.

4 Large Scale Patents Findability Experiments
In related Findability experiments, the analysis is usually performed on a random set of queries, for example a random set of 200 queries of 2, 3 or 4 terms taken from each patent. However, this does not make clear whether we are testing the query generation approach or the retrieval system.
Large Scale Experiments: rather than taking random queries, the experiments are performed using all possible queries of a patent. We considered all possible 3-term queries (using the AND operator). 1.2 million patents are used for indexing (with full text). The TFIDF retrieval model is used for ranking patents against queries. A rank cutoff factor of c = 100 is used for the analysis.
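The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in the experiments: the in-memory `docs` dictionary, the function name `findability`, the raw `tf * log(N / df)` weighting and the tie-breaking by document id are all hypothetical choices made for the example.

```python
from itertools import combinations
import math


def findability(doc_id, docs, c=100, query_len=3):
    """Count the queries built from doc_id's terms that rank doc_id in the top c.

    docs maps a document id to its list of tokens (a toy in-memory index).
    Returns (found, total): queries that retrieve doc_id within the cutoff,
    and the total number of queries generated from doc_id.
    """
    n_docs = len(docs)
    # Term frequencies per document and document frequencies per term.
    tf = {d: {} for d in docs}
    df = {}
    for d, tokens in docs.items():
        for t in tokens:
            tf[d][t] = tf[d].get(t, 0) + 1
        for t in tf[d]:
            df[t] = df.get(t, 0) + 1

    def score(d, query):
        # Assumed TF-IDF variant: raw term frequency times log(N / df).
        return sum(tf[d][t] * math.log(n_docs / df[t]) for t in query)

    found = total = 0
    # Every unordered combination of query_len distinct terms of the document.
    for query in combinations(sorted(tf[doc_id]), query_len):
        total += 1
        # Boolean AND: keep only documents that contain every query term.
        matches = [d for d in docs if all(t in tf[d] for t in query)]
        # Highest score first; ties are broken by document id (an assumption).
        ranked = sorted(matches, key=lambda d: (-score(d, query), d))
        if doc_id in ranked[:c]:
            found += 1
    return found, total
```

Dividing `found` by `total` gives the share of a patent's queries that retrieve it within the cutoff. At the scale described here (1.2 million full-text patents), a real inverted index would be needed instead of the linear scan over `docs`.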

5 Large Scale Patents Findability Experiments
Since the space of all possible queries is very large, we could process only a small number of patents. A set of Low and High Findable patents is used for the large scale analysis. We take these patents from our previous experiments, which were based on a small random set of queries.
Motivation: we want to check whether Low Findable patents are really hard to find, or whether there is a fault in the query generation approach.

6 Patents
Low Findable patents (#, Patent ID): 1 US A, 2 US A, 3 US A, 4 US A, 5 US A, 6 US A, 7 US A, 8 US A, 9 US A, 10 US A
High Findable patents (#, Patent ID): 1 US A, 2 US A, 3 US A, 4 US A, 5 US A, 6 US A, 7 US A, 8 US A, 9 US A, 10 US A

7 Findability Results Analysis (Percentage in all Queries)
Limitation of the numeric score: it does not provide an accurate analysis. For example, consider the two patents below. By numeric score, Patent A has a larger Findability score than Patent B, but it has a very poor Findability Percentage.
Patent | # Unique Terms | Total Queries | Findability Percentage (of Total Queries) | Findability Numeric Score
A | 578 | 32 Million | 1% | 320,000
B | 60 | 34,220 | 95% | 32,509
So, in the next slides, the analysis is based on the Findability Percentage over all queries of a patent. Moreover, for clarity, the analysis is divided into four query classes. What is the Findability Percentage in those queries which can retrieve < 500 patents, >= 500 & <= 1000 patents, > 1000 & <= 1500 patents, or > 1500 patents?
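The totals in this example agree with the number of unordered 3-term combinations of a patent's unique terms: C(60, 3) = 34,220 and C(578, 3) = 32,016,576, about 32 million. The sketch below recomputes the comparison. The function name `findability_summary` is hypothetical, and treating the numeric score as a count of queries that retrieve the patent is an assumption drawn from the figures on the slide.

```python
from math import comb


def findability_summary(unique_terms, found_queries, query_len=3):
    """Return (total_queries, numeric_score, findability_percentage).

    total_queries is the number of unordered query_len-term combinations;
    numeric_score is assumed to be the count of queries that find the patent.
    """
    total = comb(unique_terms, query_len)
    percentage = 100.0 * found_queries / total
    return total, found_queries, percentage


# Patent A: many unique terms, large numeric score, low percentage.
print(findability_summary(578, 320_000))
# Patent B: few unique terms, small numeric score, high percentage.
print(findability_summary(60, 32_509))
```

The raw count grows with the size of the query space, so a long patent can outscore a short one while being found by a far smaller fraction of its own queries. Normalizing by the total number of queries removes that effect.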

8 Queries Distribution
A large percentage of queries in both sets (Low and High Findable) can retrieve more than 1500 patents: 79% in Low Findable patents and 65% in High Findable patents.
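A distribution like this can be computed by assigning each query to one of the four classes used in this analysis, according to the number of patents it retrieves. The following is a minimal sketch. The names `query_class` and `class_distribution` are hypothetical, and the input is assumed to be a plain list of result-set sizes, one per query.

```python
from collections import Counter

# The four query classes, keyed by the number of patents a query retrieves.
CLASSES = ("< 500", ">= 500 & <= 1000", "> 1000 & <= 1500", "> 1500")


def query_class(n_retrieved):
    """Map a result-set size to its query class."""
    if n_retrieved < 500:
        return CLASSES[0]
    if n_retrieved <= 1000:
        return CLASSES[1]
    if n_retrieved <= 1500:
        return CLASSES[2]
    return CLASSES[3]


def class_distribution(result_sizes):
    """Return the percentage of queries that fall in each class."""
    counts = Counter(query_class(n) for n in result_sizes)
    total = len(result_sizes)
    return {name: 100.0 * counts[name] / total for name in CLASSES}
```

For example, `class_distribution([10, 500, 1000, 1001, 1500, 1501, 9000, 20000])` puts one query in the first class, two in each of the middle classes and three in the last.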

9 Findability Percentage
High Findable patents: Average = 53.7%. Out of every 100 queries, a patent can be found from about 54 queries.
Low Findable patents: Average = 3.9%. Out of every 100 queries, a patent can be found from only about 4 queries.

10 Findability Distribution in Queries
In what type of queries do patents have a higher Findability Percentage? In Low Findable patents, queries which can retrieve < 500 patents account for a larger share of Findability than the other classes, although only 7% of the queries in the whole query set retrieve < 500 patents, while 79% of queries retrieve > 1500 patents. These results yield two important findings. First, Low Findable patents have a very poor Findability Percentage (3.9%). Second, within that 3.9% of queries, most can retrieve < 500 patents.
[Figures: based on average; based on individual patents]

11 Findability Distribution in Queries
In High Findable patents, queries which can retrieve > 1500 patents account for a larger share of Findability than the other classes (65% of queries retrieve > 1500 patents).
[Figures: based on average; based on individual patents]

12 Findability Distribution in Queries
Low Findable Patent = (Patent ID = US A) High Findable Patent = (Patent ID = US A)

13 Findability Percentage in Different Queries
Queries which can retrieve > 1500 patents.
Low Findable patents: on average, 79% of queries can retrieve > 1500 patents. Among these queries, patents are findable from only 1.1% of queries. Out of 100 queries, a patent is findable from about one query.
High Findable patents: on average, 65% of queries can retrieve > 1500 patents. Among these queries, patents are findable from 49% of queries. Out of 100 queries, a patent is findable from about 49 queries.

14 Findability Percentage in Different Queries
Queries which can retrieve > 1000 & <= 1500 patents.
High Findable patents: on average, 8% of queries can retrieve > 1000 & <= 1500 patents. Among these queries, patents are findable from 67% of queries. Out of 100 queries, a patent is findable from about 67 queries.
Low Findable patents: on average, 5.5% of queries can retrieve > 1000 & <= 1500 patents. Among these queries, patents are findable from only 5.3% of queries. Out of 100 queries, a patent is findable from about 5 queries.

15 Findability Percentage in Different Queries
Queries which can retrieve >= 500 & <= 1000 patents.
High Findable patents: on average, 13% of queries can retrieve >= 500 & <= 1000 patents. Among these queries, patents are findable from 52% of queries. Out of 100 queries, a patent is findable from about 52 queries.
Low Findable patents: on average, 8.5% of queries can retrieve >= 500 & <= 1000 patents. Among these queries, patents are findable from only 7% of queries. Out of 100 queries, a patent is findable from about 7 queries.

16 Findability Percentage in Different Queries
Queries which can retrieve < 500 patents.
High Findable patents: on average, 14% of queries can retrieve < 500 patents. Among these queries, patents are findable from 65% of queries. Out of 100 queries, a patent is findable from about 65 queries.
Low Findable patents: on average, 7% of queries can retrieve < 500 patents. Among these queries, patents are findable from only 16% of queries. Out of 100 queries, a patent is findable from about 16 queries.
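As a rough consistency check, the per-class figures can be combined by weighting each class's Findability Percentage by that class's share of queries. The sketch below assumes that the 65%, 8%, 13% and 14% shares belong to the High Findable set and the 79%, 5.5%, 8.5% and 7% shares to the Low Findable set (each group sums to 100%). It also assumes a plain weighted mean, which is not necessarily how the reported averages were computed. The function name `weighted_findability` is hypothetical.

```python
def weighted_findability(classes):
    """Combine per-class results into one overall Findability Percentage.

    classes is a list of (share_of_queries_percent, findability_percent)
    pairs, one pair per query class.
    """
    total_share = sum(share for share, _ in classes)
    return sum(share * found for share, found in classes) / total_share


# Class order: > 1500, > 1000 & <= 1500, >= 500 & <= 1000, < 500.
high = [(65, 49), (8, 67), (13, 52), (14, 65)]
low = [(79, 1.1), (5.5, 5.3), (8.5, 7), (7, 16)]

print(round(weighted_findability(high), 1))
print(round(weighted_findability(low), 1))
```

The weighted means come out near, but not equal to, the reported averages of 53.7% and 3.9%. That is expected if the reported figures are averaged per patent rather than pooled over all queries.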

17 Issues in Queries used for Findability Analysis
It is very time consuming to analyze Findability using all possible queries of a patent. What about other combinations: 4 terms, 5 terms, 6 terms? What about other Boolean operators (OR, NOT)? How can we prune irrelevant queries? Query performance prediction, such as the query clarity score, may help us prune irrelevant queries. A query log can help us build simulated queries.

18 Thank You

