
Large Scale Findability Analysis Shariq Bashir PhD-Candidate Department of Software Technology and Interactive Systems.


1 Large Scale Findability Analysis Shariq Bashir PhD-Candidate Department of Software Technology and Interactive Systems

2 Agenda
- Large Scale Findability Experiments
  - 1 million patents used for indexing
  - Findability is analyzed using all possible queries of a patent
- Findability Analysis
  - With different query classes
  - Impact of frequent terms on Findability
- Issues in Queries used for Findability Analysis

3 Introduction
- Patent retrieval is a recall-oriented domain.
- The Findability of each and every patent in the collection is considered an important factor.
- There is a need to analyze how many patents are hard or easy to find in the collection.
- Findability Measurement:
  - The analysis is based on Findability Measurement.
  - Findability is a measurement in IR used for analyzing how easily a document can be found in a collection.
- It can identify Low and High Findable subsets.
- It can compare different retrieval systems and show which one is better for finding patents.
- It can identify the bias of a system, i.e. whether the system prefers shorter documents over longer ones, or longer over shorter.

4 Large Scale Patents Findability Experiments
- In related Findability experiments, the analysis is usually performed on a random set of queries, for example a random set of 200 queries of 2, 3 or 4 terms taken from each patent.
- However, this does not make clear whether we are testing the query generation approach or the retrieval system.
- Large Scale Experiments:
  - Rather than taking random queries, experiments are performed using all possible queries of a patent.
  - We considered all possible 3-term queries (using the AND operator).
  - 1 million patents are used for indexing (with full text).
  - The TFIDF retrieval model is used for ranking patents against queries.
  - A rank cutoff factor of c = 100 is used for the analysis.
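The setup on this slide can be sketched on a toy collection. This is only an illustration under stated assumptions: the slides do not give the exact TFIDF weighting or the scoring function, so the sketch assumes a plain tf * log(N/df) score and counts a query as a hit when the patent is ranked within the cutoff c. The names `build_index`, `rank` and `findability` are hypothetical and not taken from the presentation.

```python
import math
from collections import Counter
from itertools import combinations


def build_index(docs):
    """docs maps doc_id -> list of terms. Returns (term frequencies, document frequencies)."""
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df


def rank(query, tf, df):
    """Boolean AND filter, then rank by an assumed tf * log(N/df) score (best first)."""
    n_docs = len(tf)
    scored = []
    for doc_id, counts in tf.items():
        if all(term in counts for term in query):
            score = sum(counts[t] * math.log(n_docs / df[t]) for t in query)
            scored.append((-score, doc_id))  # doc_id breaks ties deterministically
    scored.sort()
    return [doc_id for _, doc_id in scored]


def findability(doc_id, tf, df, c=100, n_terms=3):
    """Enumerate every n_terms-term AND query built from the document's unique terms.

    Returns (queries that rank the document within cutoff c, total queries).
    """
    terms = sorted(tf[doc_id])
    found = total = 0
    for query in combinations(terms, n_terms):
        total += 1
        if doc_id in rank(query, tf, df)[:c]:
            found += 1
    return found, total


# Toy collection; real experiments would index the full text of the patents.
docs = {
    "d1": ["a", "b", "c", "d"],
    "d2": ["a", "b", "c", "c"],
    "d3": ["a", "b", "x"],
}
tf, df = build_index(docs)
found, total = findability("d1", tf, df, c=1)
print(f"d1 is findable from {found} of {total} queries ({100 * found / total:.0f}%)")
```

Dividing `found` by `total` gives a percentage of the kind used in the later slides, while `found` alone corresponds to a count-style numeric score.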

5 Large Scale Patents Findability Experiments
- Since the space of all possible queries is very large, we could process only a small number of patents.
- A set of Low and High Findable patents is used for the large-scale analysis.
  - We take these patents from our previous experiments, which were based on a small random set of queries.
- Motivation:
  - We want to verify whether Low Findable patents are really low Findable, or whether there is a fault in the query generation approach.

6 Patents
#    Patent ID (Low Findable)    Patent ID (High Findable)
1    US-4299687-A                US-4175011-A
2    US-4318912-A                US-4085890-A
3    US-4127578-A                US-4002425-A
4    US-4136079-A                US-4154025-A
5    US-4031023-A                US-4052415-A
6    US-4034106-A                US-4110128-A
7    US-4034087-A                US-4166008-A
8    US-4229478-A                US-4067813-A
9    US-4087551-A                US-4009156-A
10   US-4082851-A                US-4147736-A

7 Findability Results Analysis (Percentage in all Queries)
- Limitation of the numeric score:
  - It does not provide an accurate analysis.
  - For example, consider two patents, A and B:

    #    Unique Terms    Total Queries    Findability Percentage / Total Queries    Findability Numeric Score
    A    578             32 million       1%                                        320,000
    B    60              34,220           95%                                       32,509

  - Using the numeric score, Patent A has a larger Findability score than Patent B, but it has a very poor Findability Percentage.
  - So, in the next slides, the analysis is based on the Findability Percentage using all queries of a patent.
  - Moreover, for clearer understanding, the analysis is divided into four factors: what is the Findability Percentage in those queries
    - which can retrieve < 500 patents,
    - which can retrieve >= 500 & <= 1000 patents,
    - which can retrieve > 1000 & <= 1500 patents,
    - which can retrieve > 1500 patents.
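The two example patents on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "Total Queries" is the number of unordered 3-term combinations of a patent's unique terms and that the numeric score is the number of queries from which the patent is findable; the function name `summarize` is hypothetical.

```python
from math import comb


def summarize(unique_terms, findability_pct, n_terms=3):
    """Return (total n_terms-term queries, implied numeric score) for one patent."""
    total_queries = comb(unique_terms, n_terms)
    numeric_score = round(total_queries * findability_pct / 100)
    return total_queries, numeric_score


for name, unique_terms, pct in [("A", 578, 1.0), ("B", 60, 95.0)]:
    total, score = summarize(unique_terms, pct)
    print(f"Patent {name}: {total:,} queries, {pct}% findable, score ~ {score:,}")
```

Patent A has 32,016,576 possible queries (the slide rounds this to 32 million) and Patent B has 34,220, so A's numeric score is roughly ten times B's even though A is findable from only 1% of its queries.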

8 Queries Distribution
- A large percentage of queries in both sets (Low and High Findable) retrieve more than 1500 patents:
  - 79% in Low Findable patents.
  - 65% in High Findable patents.
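A distribution like this one can be computed by bucketing each query by the size of its result set, using the four classes defined on slide 7. This is a minimal sketch; the function names and the sample result sizes are made up for illustration.

```python
from collections import Counter

CLASSES = ["< 500", ">= 500 & <= 1000", "> 1000 & <= 1500", "> 1500"]


def query_class(n_retrieved):
    """Map the number of patents a query retrieves to one of the four classes."""
    if n_retrieved < 500:
        return CLASSES[0]
    if n_retrieved <= 1000:
        return CLASSES[1]
    if n_retrieved <= 1500:
        return CLASSES[2]
    return CLASSES[3]


def class_distribution(result_sizes):
    """Percentage of queries falling into each class (all zeros for no queries)."""
    total = len(result_sizes)
    counts = Counter(query_class(n) for n in result_sizes)
    if total == 0:
        return {label: 0.0 for label in CLASSES}
    return {label: 100.0 * counts[label] / total for label in CLASSES}


# Hypothetical result-set sizes for ten queries of one patent.
sizes = [10, 499, 500, 1000, 1001, 1500, 1501, 9000, 20000, 50000]
print(class_distribution(sizes))
```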

9 Findability Percentage
- Low Findable patents: average = 3.9%. Out of every 100 queries, a patent is findable from only about 4.
- High Findable patents: average = 53.7%. Out of every 100 queries, a patent is findable from about 54.

10 Findability Distribution in Queries
- In what type of queries do patents have a higher Findability Percentage?
- In Low Findable patents, queries that retrieve < 500 patents account for a larger share of the Findability Percentage than the other classes.
  - (But only 7% of the queries in the whole query set retrieve < 500 patents [...] 1500 patents.)
- Based on these results, we can draw two important findings.
  - First, Low Findable patents have a very poor Findability Percentage (3.9%).
  - Second, within this 3.9% of queries, most queries retrieve < 500 patents.
- Figure panels: Based on Average; Based on Individual Patents.

11 Findability Distribution in Queries
- In High Findable patents, queries that retrieve > 1500 patents account for a larger share of the Findability Percentage than the other classes.
  - (65% of the queries retrieve > 1500 patents.)
- Figure panels: Based on Average; Based on Individual Patents.

12 Findability Distribution in Queries
- Low Findable patent: Patent ID = US-4299687-A
- High Findable patent: Patent ID = US-4085890-A

13 Findability Percentage in Different Queries
Queries which can retrieve > 1500 patents.
- Low Findable patents:
  - On average, 79% of queries retrieve > 1500 patents.
  - Among these queries, patents are findable from only 1.1%: out of 100 queries, a patent is findable from about one.
- High Findable patents:
  - On average, 65% of queries retrieve > 1500 patents.
  - Among these queries, patents are findable from 49%: out of 100 queries, a patent is findable from about 49.

14 Findability Percentage in Different Queries
Queries which can retrieve > 1000 & <= 1500 patents.
- Low Findable patents:
  - On average, 5.5% of queries retrieve > 1000 & <= 1500 patents.
  - Among these queries, patents are findable from only 5.3%: out of 100 queries, a patent is findable from about 5.
- High Findable patents:
  - On average, 8% of queries retrieve > 1000 & <= 1500 patents.
  - Among these queries, patents are findable from 67%: out of 100 queries, a patent is findable from about 67.

15 Findability Percentage in Different Queries
Queries which can retrieve >= 500 & <= 1000 patents.
- High Findable patents:
  - On average, 13% of queries retrieve >= 500 & <= 1000 patents.
  - Among these queries, patents are findable from 52%: out of 100 queries, a patent is findable from about 52.
- Low Findable patents:
  - On average, 8.5% of queries retrieve >= 500 & <= 1000 patents.
  - Among these queries, patents are findable from only 7%: out of 100 queries, a patent is findable from about 7.

16 Findability Percentage in Different Queries
Queries which can retrieve < 500 patents.
- High Findable patents:
  - On average, 14% of queries retrieve < 500 patents.
  - Among these queries, patents are findable from 65%: out of 100 queries, a patent is findable from about 65.
- Low Findable patents:
  - On average, 7% of queries retrieve < 500 patents.
  - Among these queries, patents are findable from only 16%: out of 100 queries, a patent is findable from about 16.

17 Effect of Individual Terms on Findability
- A patent contains many unique terms.
- Are patents findable from most of their terms, or does a small number of terms have the major impact on Findability?
- This factor analyzes the effect of a patent's individual terms on its Findability score.
- Does removing a small percentage of frequent terms from the queries remove a large percentage of the Findability score?
- What is the effect of this factor on Low Findable and High Findable patents?
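One way to measure this effect is to drop the most frequent terms and recount the queries that still find the patent. The sketch below is an assumption about the procedure, not the author's implementation: it takes "frequent" to mean highest term frequency within the patent, and it treats a query as removed when it contains any dropped term. The function name and the sample data are hypothetical.

```python
from collections import Counter


def score_after_removal(found_queries, term_freq, fraction):
    """Count the hit queries that survive removing the top `fraction` of frequent terms.

    found_queries: term tuples that retrieved the patent within the rank cutoff.
    term_freq: mapping term -> frequency in the patent (assumed notion of "frequent").
    """
    n_remove = int(len(term_freq) * fraction)
    removed = {term for term, _ in Counter(term_freq).most_common(n_remove)}
    return sum(1 for query in found_queries if removed.isdisjoint(query))


# Hypothetical patent: four unique terms, three queries that found it.
term_freq = {"a": 10, "b": 5, "c": 1, "d": 1}
found_queries = [("a", "b", "c"), ("b", "c", "d"), ("a", "c", "d")]
for fraction in (0.0, 0.25, 0.5):
    remaining = score_after_removal(found_queries, term_freq, fraction)
    print(f"remove top {fraction:.0%} of terms -> {remaining} findable queries left")
```

A steep drop in the remaining count for a small `fraction` would indicate that a few frequent terms carry most of a patent's Findability.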

18 Effect of Individual Terms on Findability
- On Low Findable patents, removing a small percentage of frequent terms decreases Findability more quickly than on High Findable patents.

19 Issues in Queries used for Findability Analysis
- It is very time-consuming to analyze Findability using all possible queries of a patent.
  - What about other combinations: 4 terms, 5 terms, 6 terms, ...?
  - What about other Boolean operators (OR, NOT)?
- How can we prune irrelevant queries?
  - Query performance prediction, such as the clarity score, may help us in pruning irrelevant queries.
- Query logs can help us in building simulated queries.
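The cost concern can be made concrete by counting the queries per patent for longer query lengths. This is a rough sketch that assumes queries are unordered combinations of a patent's unique terms, as in the 3-term experiments; `query_space` is a hypothetical helper, and 578 and 60 are the unique-term counts of the two example patents on slide 7.

```python
from math import comb


def query_space(unique_terms, lengths=(3, 4, 5, 6)):
    """Number of distinct AND queries of each length for one patent."""
    return {k: comb(unique_terms, k) for k in lengths}


for unique_terms in (60, 578):
    for k, n_queries in query_space(unique_terms).items():
        print(f"{unique_terms} unique terms, {k}-term queries: {n_queries:,}")
```

The count grows by more than an order of magnitude with each added term, which is why pruning or simulating queries becomes necessary beyond 3 terms.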

