Large Scale Findability Analysis
Shariq Bashir, PhD Candidate
Department of Software Technology and Interactive Systems

Agenda
- Large Scale Findability Experiments
  - One million patents used for indexing
  - Findability analyzed using all possible queries of a patent
- Findability Analysis
  - Across different query classes
  - Impact of frequent terms on findability
- Issues in the queries used for findability analysis

Introduction
Patent retrieval is a recall-oriented domain: the findability of each and every patent in the collection is considered an important factor, so there is a need to analyze how many patents are hard or easy to find in a collection.
Findability Measurement:
- The analyses below are based on a findability measurement.
- Findability is an IR measure of how easily a document can be found in a collection.
- It can identify low and high findable subsets.
- It can compare different retrieval systems: which one is better at finding patents than another.
- It can identify system bias, i.e., whether a system gives preference to shorter documents over longer ones, or longer over shorter.
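In the retrievability literature (Azzopardi and Vinay, 2008) this measure is usually formalized as follows, with all queries weighted equally; here Q is the set of queries, k_dq the rank of document d for query q, and c the rank cutoff:

    r(d) = \sum_{q \in Q} f(k_{dq}, c), \qquad
    f(k_{dq}, c) = \begin{cases} 1 & \text{if } k_{dq} \le c \\ 0 & \text{otherwise} \end{cases}

The "findability percentage" reported in later slides then corresponds to r(d) divided by the total number of queries |Q| generated for that patent.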

Large Scale Patents Findability Experiments
In related findability experiments, the analysis is usually performed on a random set of queries, for example a random set of 200 queries of 2, 3, or 4 terms from each patent. However, this does not tell us whether we are testing the query generation approach or the retrieval system.
Large Scale Experiments:
- Rather than taking random queries, the experiments are performed using all possible queries of a patent.
- We consider all possible 3-term queries (using the AND operator).
- 1 million patents are used for indexing (with full text).
- The TFIDF retrieval model is used for ranking patents against queries.
- A rank cutoff factor of c = 100 is used for the analysis.
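A minimal sketch of this setup in Python; `search` is a hypothetical helper standing in for the TFIDF engine (assumed to return the IDs of the top-`cutoff` patents for a conjunctive query), and all names are illustrative rather than the authors' actual code:

    from itertools import combinations

    C = 100  # rank cutoff factor used in the analysis

    def all_three_term_queries(patent_terms):
        # Every possible 3-term AND query over the patent's unique terms.
        return combinations(sorted(set(patent_terms)), 3)

    def findability(patent_id, patent_terms, search):
        # Fraction of the patent's possible queries that retrieve it within rank C.
        # search(terms, cutoff) is assumed to run the AND query under TFIDF
        # and return the top `cutoff` patent IDs.
        findable = total = 0
        for query in all_three_term_queries(patent_terms):
            total += 1
            if patent_id in search(query, cutoff=C):
                findable += 1
        return findable / total if total else 0.0

For a patent with 578 unique terms this already means roughly 32 million queries (C(578, 3)), which is why only a small set of patents can be processed exhaustively.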

Large Scale Patents Findability Experiments
Since the space of all possible queries is very large, we could process only a small number of patents. A set of low and high findable patents is used for the large scale analysis.
- We take these patents from our previous experiments, which were based on a small random set of queries.
Motivation:
- We want to make sure whether low findable patents are really low findable, or whether there is a fault in the query generation approach.

Patents

    #    Patent ID (Low Findable)    Patent ID (High Findable)
    1    US A                        US A
    2    US A                        US A
    3    US A                        US A
    4    US A                        US A
    5    US A                        US A
    6    US A                        US A
    7    US A                        US A
    8    US A                        US A
    9    US A                        US A
    10   US A                        US A

Findability Results Analysis (Percentage over all Queries)
Limitation of the numeric score:
- It does not provide an accurate analysis. For example, consider two patents:

    Patent   #Unique Terms   Total Queries   Findability Percentage   Findability Numeric Score
    A        578             32 Million      1%                       320,000
    B        60              34,220          95%                      32,509

- By the numeric score, Patent A has a larger findability score than Patent B, yet it has a very poor findability percentage.
- So, in the next slides, the analysis is based on the findability percentage over all queries of a patent.
- Moreover, for clearer understanding, the analysis is divided into four factors: the findability percentage within those queries which can retrieve
  - < 500 patents,
  - >= 500 & <= 1000 patents,
  - > 1000 & <= 1500 patents,
  - > 1500 patents.
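A sketch of how the per-bin percentages might be tallied, assuming each executed query reports both the ranked results and the total number of matching patents (all names illustrative):

    def bin_label(n_matching):
        # The four result-set-size classes used in the analysis.
        if n_matching < 500:
            return "<500"
        if n_matching <= 1000:
            return "500-1000"
        if n_matching <= 1500:
            return "1000-1500"
        return ">1500"

    def findability_by_bin(patent_id, query_results):
        # query_results: iterable of (top_ranked_ids, n_matching) pairs, one per query.
        # Returns {bin: (findable, total)}, from which a percentage per bin follows.
        stats = {}
        for top_ranked_ids, n_matching in query_results:
            label = bin_label(n_matching)
            findable, total = stats.get(label, (0, 0))
            stats[label] = (findable + (patent_id in top_ranked_ids), total + 1)
        return stats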

Queries Distribution
A large percentage of queries in both sets (low and high findable) can retrieve more than 1500 patents:
- 79% in low findable patents.
- 65% in high findable patents.

Findability Percentage
- Low findable patents: average = 3.9%. Out of every 100 queries, the patent is findable from only about 4 queries.
- High findable patents: average = 53.7%. Out of every 100 queries, the patent is findable from about 54 queries.

Findability Distribution in Queries
In what type of queries do patents have a higher findability percentage?
- In low findable patents, queries retrieving < 500 patents account for more of the findability percentage than the others (but only 7% of the queries in the whole query set can retrieve < 500 patents).
- Based on these results, we can draw two important findings:
  - First, low findable patents have a very poor findability percentage (3.9%).
  - Second, within those 3.9% of queries, most queries retrieve < 500 patents.
(Charts: based on average, and based on individual patents.)

Findability Distribution in Queries
- In high findable patents, queries which can retrieve > 1500 patents account for more of the findability percentage than the others (65% of the queries retrieve > 1500 patents).
(Charts: based on average, and based on individual patents.)

Findability Distribution in Queries
- Low findable patent: Patent ID US A
- High findable patent: Patent ID US A

Findability Percentage in Different Queries
Queries which can retrieve > 1500 patents:
- Low findable patents: on average, 79% of queries retrieve > 1500 patents. Across all such queries, patents are findable from only 1.1% of them; out of 100 queries, a patent is findable from about one query.
- High findable patents: on average, 65% of queries retrieve > 1500 patents. Across all such queries, patents are findable from 49% of them; out of 100 queries, a patent is findable from about 49 queries.

Findability Percentage in Different Queries
Queries which can retrieve > 1000 & <= 1500 patents:
- Low findable patents: on average, 5.5% of queries retrieve (> 1000 & <= 1500) patents. Across all such queries, patents are findable from only 5.3% of them; out of 100 queries, a patent is findable from about 5 queries.
- High findable patents: on average, 8% of queries retrieve (> 1000 & <= 1500) patents. Across all such queries, patents are findable from 67% of them; out of 100 queries, a patent is findable from about 67 queries.

Findability Percentage in Different Queries
Queries which can retrieve >= 500 & <= 1000 patents:
- High findable patents: on average, 13% of queries retrieve (>= 500 & <= 1000) patents. Across all such queries, patents are findable from 52% of them; out of 100 queries, a patent is findable from about 52 queries.
- Low findable patents: on average, 8.5% of queries retrieve (>= 500 & <= 1000) patents. Across all such queries, patents are findable from only 7% of them; out of 100 queries, a patent is findable from about 7 queries.

Findability Percentage in Different Queries
Queries which can retrieve < 500 patents:
- High findable patents: on average, 14% of queries retrieve < 500 patents. Across all such queries, patents are findable from 65% of them; out of 100 queries, a patent is findable from about 65 queries.
- Low findable patents: on average, 7% of queries retrieve < 500 patents. Across all such queries, patents are findable from only 16% of them; out of 100 queries, a patent is findable from about 16 queries.

Effect of Individual Terms on Findability
- A patent contains many unique terms. Are patents findable from most of their terms, or does a small number of terms create the major impact on findability?
- This factor analyzes the effect of a patent's individual terms on its findability score.
- Does removing a small percentage of frequent terms from the queries decrease a large percentage of the findability score?
- What is the effect of this factor on low findable and high findable patents? (A sketch of the procedure follows below.)
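A minimal sketch of this analysis, assuming collection-level document frequencies are available: the most frequent fraction of the patent's terms is dropped, the query set is regenerated, and findability is recomputed (names illustrative, reusing the findability() sketch from the experiment-setup slide):

    def drop_frequent_terms(patent_terms, doc_freq, fraction):
        # Remove the `fraction` most collection-frequent of the patent's unique terms.
        # doc_freq maps a term to its document frequency in the collection.
        terms = sorted(set(patent_terms), key=lambda t: doc_freq.get(t, 0), reverse=True)
        cut = int(len(terms) * fraction)
        return terms[cut:]

    # e.g. findability after removing the top 5% most frequent terms:
    # reduced = drop_frequent_terms(patent_terms, doc_freq, fraction=0.05)
    # score = findability(patent_id, reduced, search)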

Effect of Individual Terms on Findability
On low findable patents, removing a small percentage of frequent terms decreases findability quickly, compared to high findable patents.

Issues in Queries used for Findability Analysis
- It is very time consuming to analyze findability using all possible queries of a patent.
  - What about other combinations: 4 terms, 5 terms, 6 terms, ...?
  - What about other Boolean operators (OR, NOT)?
- How can we prune irrelevant queries?
  - Query performance prediction, such as the clarity score, may help us prune irrelevant queries (see the sketch below).
- Query logs can help us build simulated queries.
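As an illustration of the clarity-score idea (Cronen-Townsend et al., 2002): clarity is the KL divergence between a query language model and the collection language model, and low-clarity (vague) queries are candidates for pruning. A simplified sketch that takes both models as given term-probability dictionaries; the published method additionally estimates the query model from top-ranked documents:

    import math

    def clarity_score(query_model, collection_model):
        # KL divergence (in bits) between the query and collection language models.
        # Low clarity suggests a vague query that could be pruned from the
        # findability query set before execution.
        score = 0.0
        for term, p_q in query_model.items():
            p_c = collection_model.get(term, 0.0)
            if p_q > 0 and p_c > 0:
                score += p_q * math.log2(p_q / p_c)
        return score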