Large Scale Findability Analysis


Large Scale Findability Analysis
Shariq Bashir, PhD Candidate
Department of Software Technology and Interactive Systems

Agenda
- Large Scale Findability Experiments
- Findability Analysis
  - Millions of patents used for indexing
  - Findability is analyzed using all possible queries of a patent
  - Findability analysis with different query classes
- Issues in the queries used for findability analysis

Introduction
Patent retrieval is a recall-oriented domain:
- The findability of each and every patent in the collection is considered an important factor.
- We need to analyze how many patents in the collection are hard or easy to find.

Findability Measurement
The analysis is based on a findability measurement. Findability is an IR measure of how easily a document can be found in a collection. With it we can:
- figure out low and high findable subsets;
- compare different retrieval systems, i.e. which is better at finding patents than the other;
- identify the bias of a system, i.e. whether it prefers shorter documents over longer ones, or longer over shorter.
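As a concrete illustration of the measure described above, here is a minimal Python sketch: a document's findability is the fraction of its queries that retrieve it within a rank cutoff c. The function name and the toy ranked lists are illustrative assumptions, not the deck's actual implementation.

```python
from typing import Dict, List

def findability(doc_id: str,
                results: Dict[str, List[str]],
                c: int) -> float:
    """Fraction of queries whose top-c ranked results contain doc_id."""
    if not results:
        return 0.0
    hits = sum(1 for ranking in results.values() if doc_id in ranking[:c])
    return hits / len(results)

# Toy example: three queries, rank cutoff c = 2.
results = {
    "q1": ["US-4299687-A", "US-4175011-A"],
    "q2": ["US-4175011-A", "US-4299687-A"],
    "q3": ["US-4175011-A", "US-4085890-A"],
}
print(findability("US-4175011-A", results, c=2))  # 1.0 (found by all 3 queries)
print(findability("US-4299687-A", results, c=2))  # ~0.667 (found by 2 of 3)
```

The deck's "Findability Percentage" is exactly this fraction expressed per 100 queries.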

Large Scale Patents Findability Experiments
In related findability experiments, the analysis is usually performed on a random set of queries, for example a random set of 200 queries of 2, 3, or 4 terms from each patent. However, this does not make clear whether we are testing the query generation approach or the retrieval system.

Large Scale Experiments
- Rather than taking random queries, experiments are performed using all possible queries of a patent.
- We consider all possible 3-term queries (using the AND operator).
- 1.2 million patents are used for indexing (with full text).
- The TF-IDF retrieval model is used for ranking patents against queries.
- Rank cutoff factor c = 100 is used for the analysis.
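The "all possible 3-term queries" space above can be enumerated with standard combinatorics. This sketch (the helper name is illustrative) shows the enumeration and why the space explodes:

```python
from itertools import combinations
from math import comb

def three_term_queries(doc_terms):
    """All unordered 3-term conjunctive (AND) queries over a document's
    unique terms. With n unique terms there are C(n, 3) such queries."""
    unique = sorted(set(doc_terms))
    return [" AND ".join(combo) for combo in combinations(unique, 3)]

terms = ["valve", "pressure", "fluid", "seal"]
queries = three_term_queries(terms)
print(len(queries))          # C(4, 3) = 4
print(comb(578, 3))          # 32016576
```

For a patent with 578 unique terms, as in the Patent A example later in the deck, C(578, 3) = 32,016,576 queries — the "32 Million" query space quoted there, which is why only a small set of patents can be processed exhaustively.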

Large Scale Patents Findability Experiments
Since the space of all possible queries is very large, we could process only a small number of patents. A set of low and high findable patents is used for the large-scale analysis. We take these patents from our previous experiments, which were based on a small random set of queries.

Motivation: we want to make sure whether the low findable patents are really low findable, or whether there is a fault in the query generation approach.

Patents

  #   Patent ID (Low Findable)   Patent ID (High Findable)
  1   US-4299687-A               US-4175011-A
  2   US-4318912-A               US-4085890-A
  3   US-4127578-A               US-4002425-A
  4   US-4136079-A               US-4154025-A
  5   US-4031023-A               US-4052415-A
  6   US-4034106-A               US-4110128-A
  7   US-4034087-A               US-4166008-A
  8   US-4229478-A               US-4067813-A
  9   US-4087551-A               US-4009156-A
 10   US-4082851-A               US-4147736-A

Findability Results Analysis (Percentage over all Queries)

Limitation of a numeric score: it does not provide an accurate analysis. For example, consider two patents:

  Patent   # Unique Terms   Total Queries   Findability Percentage   Findability Numeric Score
  A        578              32 million      1%                       320,000
  B        60               34,220          95%                      32,509

Using the numeric score, Patent A has a larger findability score than Patent B, but a very poor findability percentage. So, in the next slides, the analysis is based on the findability percentage over all queries of a patent.

Moreover, for clearer understanding, the analysis is divided into four factors: the findability percentage in those queries which can retrieve
- < 500 patents;
- >= 500 and <= 1000 patents;
- > 1000 and <= 1500 patents;
- > 1500 patents.
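The four query classes above can be expressed as a small bucketing helper (a sketch; the function name and bucket labels are illustrative, not from the deck):

```python
def query_class(num_retrieved: int) -> str:
    """Assign a query to one of the deck's four classes, by how many
    patents the query retrieves."""
    if num_retrieved < 500:
        return "<500"
    if num_retrieved <= 1000:   # >= 500 and <= 1000
        return "500-1000"
    if num_retrieved <= 1500:   # > 1000 and <= 1500
        return "1000-1500"
    return ">1500"

print(query_class(42))    # <500
print(query_class(500))   # 500-1000
print(query_class(1200))  # 1000-1500
print(query_class(9999))  # >1500
```

Each query in a patent's query space falls into exactly one class, so the per-class percentages on the following slides sum to 100%.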

Queries Distribution
A large percentage of queries in both sets (low and high findable) can retrieve more than 1500 patents:
- 79% in low findable patents;
- 65% in high findable patents.

Findability Percentage
- High findable patents: average = 53.7%. Out of every 100 queries, the patent is findable from 54 queries.
- Low findable patents: average = 3.9%. Out of every 100 queries, the patent is findable from only 4 queries.

Findability Distribution in Queries
In what type of queries do patents have a higher findability percentage?
In low findable patents, queries retrieving < 500 patents contribute more to the findability percentage than the other classes, even though only 7% of the whole query set retrieves < 500 patents, while 79% of queries retrieve > 1500 patents.
Based on these results, we can draw two important findings: first, low findable patents have a very poor findability percentage (3.9%); second, within that 3.9% of queries, most retrieve < 500 patents.
(Charts: based on average; based on individual patents.)

Findability Distribution in Queries
In high findable patents, queries which retrieve > 1500 patents contribute more to the findability percentage than the other classes (65% of queries retrieve > 1500 patents).
(Charts: based on average; based on individual patents.)

Findability Distribution in Queries
- Low findable patent example: US-4299687-A
- High findable patent example: US-4085890-A

Findability Percentage in Different Queries
Queries which can retrieve > 1500 patents:
- Low findable patents: on average, 79% of queries retrieve > 1500 patents. Among these queries, patents are findable from only 1.1%; out of 100 such queries, a patent is findable from almost one.
- High findable patents: on average, 65% of queries retrieve > 1500 patents. Among these queries, patents are findable from 49%; out of 100 such queries, a patent is findable from almost 49.

Findability Percentage in Different Queries
Queries which can retrieve > 1000 and <= 1500 patents:
- High findable patents: on average, 8% of queries retrieve > 1000 and <= 1500 patents. Among these queries, patents are findable from 67%; out of 100 such queries, a patent is findable from almost 67.
- Low findable patents: on average, 5.5% of queries retrieve > 1000 and <= 1500 patents. Among these queries, patents are findable from only 5.3%; out of 100 such queries, a patent is findable from almost 5.

Findability Percentage in Different Queries
Queries which can retrieve >= 500 and <= 1000 patents:
- High findable patents: on average, 13% of queries retrieve >= 500 and <= 1000 patents. Among these queries, patents are findable from 52%; out of 100 such queries, a patent is findable from almost 52.
- Low findable patents: on average, 8.5% of queries retrieve >= 500 and <= 1000 patents. Among these queries, patents are findable from only 7%; out of 100 such queries, a patent is findable from almost 7.

Findability Percentage in Different Queries
Queries which can retrieve < 500 patents:
- High findable patents: on average, 14% of queries retrieve < 500 patents. Among these queries, patents are findable from 65%; out of 100 such queries, a patent is findable from almost 65.
- Low findable patents: on average, 7% of queries retrieve < 500 patents. Among these queries, patents are findable from only 16%; out of 100 such queries, a patent is findable from almost 16.

Issues in Queries used for Findability Analysis
- It is very time consuming to analyze findability using all possible queries of a patent.
- What about other combinations: 4-term, 5-term, 6-term queries?
- What about other Boolean operators (OR, NOT)?
- How can we prune irrelevant queries?
  - Query performance prediction, such as the query clarity score, may help us prune irrelevant queries.
  - A query log can help us build simulated queries.
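The clarity score mentioned above measures how focused a query is: the KL divergence between a query language model (estimated here, for simplicity, from the query's top-ranked documents) and the collection language model. A low-clarity query looks like the collection as a whole and is a candidate for pruning. This is a simplified sketch under that assumption, not the deck's implementation:

```python
from collections import Counter
from math import log2

def clarity_score(feedback_docs, collection_docs):
    """Simplified query clarity score: KL divergence between a unigram
    model of the top-ranked (feedback) documents and a unigram model of
    the whole collection. Each doc is a list of terms; feedback_docs is
    assumed to be a subset of collection_docs, so every feedback term
    has nonzero collection probability."""
    q_counts = Counter(w for d in feedback_docs for w in d)
    c_counts = Counter(w for d in collection_docs for w in d)
    q_total = sum(q_counts.values())
    c_total = sum(c_counts.values())
    score = 0.0
    for w, n in q_counts.items():
        p_q = n / q_total
        p_c = c_counts[w] / c_total
        score += p_q * log2(p_q / p_c)
    return score

collection = [["valve", "fluid", "seal"], ["valve", "pump"],
              ["seal", "ring"], ["engine", "pump"]]
focused = [collection[0]]   # top results of a sharp, specific query
diffuse = collection        # top results that mirror the whole collection
print(clarity_score(focused, collection) > clarity_score(diffuse, collection))  # True
```

A pruning strategy could then drop candidate queries whose clarity falls below a threshold before running the expensive findability computation.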

Thank You