Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.

Slides:

Advertisements

Similar presentations

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

Advertisements

1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.

PRES A Score Metric for Evaluating Recall- Oriented IR Applications Walid Magdy Gareth Jones Dublin City University SIGIR, 22 July 2010.

A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.

Overview of Collaborative Information Retrieval (CIR) at FIRE 2012 Debasis Ganguly, Johannes Leveling, Gareth Jones School of Computing, CNGL, Dublin City.

Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.

Omni Font OCR Error Correction with Effect on Retrieval Walid Magdy 1 Kareem Darwish 2 1 Faculty of Engineering, Cairo University, Egypt 1 School of Computing,

Evaluating Search Engine

Information Retrieval in Practice

Search Engines and Information Retrieval

Information Retrieval Review

Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,

Cross Language IR Philip Resnik Salim Roukos Workshop on Challenges in Information Retrieval and Language Modeling Amherst, Massachusetts, September 11-12,

Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.

Evaluating the Performance of IR Sytems

Evaluation Information retrieval Web. Purposes of Evaluation System Performance Evaluation efficiency of data structures and methods operational profile.

1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.

Overview of Search Engines

A New Approach for Cross- Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande.

Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.

Search Engines and Information Retrieval Chapter 1.

Leveraging Conceptual Lexicon ： Query Disambiguation using Proximity Information for Patent Retrieval Date : 2013/10/30 Author : Parvaz Mahdabi, Shima.

Evaluation Experiments and Experience from the Perspective of Interactive Information Retrieval Ross Wilkinson Mingfang Wu ICT Centre CSIRO, Australia.

©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.

1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.

Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval Walid Magdy, Jinming Min, Johannes Leveling, Gareth.

1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.

Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval Walid Magdy, Jinming Min, Johannes Leveling, Gareth.

A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.

A Study on Query Expansion Methods for Patent Retrieval Walid MagdyGareth Jones Centre for Next Generation Localisation School of Computing Dublin City.

Evaluating Search Engines in chapter 8 of the book Search Engines Information Retrieval in Practice Hongfei Yan.

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.

Applying the KISS Principle with Prior-Art Patent Search Walid Magdy Gareth Jones Dublin City University CLEF-IP, 22 Sep 2010.

Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.

1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)

Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.

Probabilistic Query Expansion Using Query Logs Hang Cui Tianjin University, China Ji-Rong Wen Microsoft Research Asia, China Jian-Yun Nie University of.

Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,

Chapter 6: Information Retrieval and Web Search

Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.

Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.

Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.

A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.

From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.

Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.

1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,

Performance Measures. Why to Conduct Performance Evaluation? 2 n Evaluation is the key to building effective & efficient IR (information retrieval) systems.

Information Retrieval

Advantages of Query Biased Summaries in Information Retrieval by A. Tombros and M. Sanderson Presenters: Omer Erdil Albayrak Bilge Koroglu.

Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq

Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.

Copyright © 2013 by Educational Testing Service. All rights reserved. Evaluating Unsupervised Language Model Adaption Methods for Speaking Assessment ShaSha.

Achieving Semantic Interoperability at the World Bank Designing the Information Architecture and Programmatically Processing Information Denise Bedford.

Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.

Information Retrieval Quality of a Search Engine.

Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.

Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:

Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,

Information Retrieval Lecture 3 Introduction to Information Retrieval (Manning et al. 2007) Chapter 8 For the MSc Computer Science Programme Dell Zhang.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,

Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval Walid Magdy Kareem Darwish Mohsen Rashwan.

Information Storage and Retrieval Fall Lecture 1: Introduction and History.

Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance Hello everyone,

Walid Magdy Gareth Jones

Statistical Methods for Text Error Correction

IR Theory: Evaluation Methods

CSE 635 Multimedia Information Retrieval

Introduction to Search Engines

Presentation transcript:

Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of Computing Dublin City University 5 July 2011

This Talk Is not an introduction to Information Retrieval (IR) Does not require experience in IR Is not highly technical Is not about my PhD work only Gives overview on some IR tasks I worked on

Outline Information Retrieval Printed Documents Search OCR text Search OCRless Search Patent Search

Information Retrieval Information Retrieval (IR) = Search Role: retrieve answer to user’s information need Objective: find relevant content at top ranks (usually) The definition of relevant differs across users/tasks Various search tasks (Web search is the most common) Examples: Web search: webpages, images, news, …. Library search: digital books, scientific papers, …. Social search: friends, posts, tweets, …. Speech search, printed documents search, patent search …………….. Introduction

Outline Information Retrieval Printed Documents Search OCR text Search OCRless Search Patent Search

Printed Document Search Many books are only available in printed form Massive efforts is moving toward digitization Digitization is for: Availability & Information Retrieval OCR is the main enabling technology OCR systems is far from perfect, especially for languages of complex orthography ( e.g. Arabic: WER=40% ) There is need to create high quality retrieval systems to enable reaching information in these books Printed Document Search

OCR-based IR Printed Document Search

Approaches Search OCR text OCR error correction Query garbling Multi-OCR text fusion Search images of text (OCRless) Printed Document Search

OCR Error Correction using Error Model OCR Search OCR text Language Model Manually corrected version Corrected text Character Error Model Use for search Error Reduction: 60 to 70% (1:1 vs. m:n character alignment) Significant improvement for retrieval effectiveness Indistinguishable results from when searching clean text Error Reduction: 60 to 70% (1:1 vs. m:n character alignment) Significant improvement for retrieval effectiveness Indistinguishable results from when searching clean text

Query Garbling using Error Model OCR Search Character Error Model Use for search QueryQuery, Ouery, Qucry, …. Significant improvement for retrieval effectiveness Still worse than when searching clean text Significant improvement for retrieval effectiveness Still worse than when searching clean text

OCR Error Correction using Edit Distance OCR Search OCR text Language Model Dictionary Edit Distance Corrected text Use for search Error Reduction: 56% (vs. 70% when using error model) Significant improvement for retrieval effectiveness Indistinguishable results from when searching clean text Error Reduction: 56% (vs. 70% when using error model) Significant improvement for retrieval effectiveness Indistinguishable results from when searching clean text

Multi-OCR Text Fusion OCR Search Language Model OCR text 1 OCR text 2 Fused text Use for search WER fused << min{WER OCR } Fusion of OCR documents using the same OCR system but at different scan resolutions reduces the WER Significant improvement in retrieval results WER fused << min{WER OCR } Fusion of OCR documents using the same OCR system but at different scan resolutions reduces the WER Significant improvement in retrieval results

OCR Search Recognition errors in OCR text degrades retrieval Different methods of text processing can overcome the negative effect of retrieval and improves search Some training and resources are needed which can be manual correction, trained language model, or both Research Outcomes Publications (ACM TOIS, Springer IR, EMNLP, SPIRE, …) MSc degree OCR Search

Searching Printed Document without OCR OCRless Search Text Domain Image Domain X Domain Effectiveness & Efficiency

Scenario (Index Phase) OCRless Search … Index of IDs Cluster ID Cluster Segment to elements Clustering Create IDs document Indexing

Scenario (Query Phase) OCRless Search الإيمان syn(1284, 21, 673, 1208) syn(430, 4, 6412, 3094) syn(231, 9011, 32, 721) syn(40, 110, 2213, 2214) List of ranked documents Draw query Replace with candidate IDs and formulate query Search Index of IDs

Architecture OCRless Search Clusters of elements Index of IDs Text Query List of Candidate IDs for each element with scoring Ranked Results Index Phase Query Phase

OCRless Effective and fast Robust to OCR errors No training resources required Language independent Research Outcomes Patent (filed by Microsoft in 2008) Publication (SPIRE) TechFest Demo The same engine for searching printed documents in: Arabic, English, Chinese, Hebrew, and Hieroglyphic OCRless English

Outline Information Retrieval Printed Documents Search OCR text Search OCRless Search Patent Search

Given a patent application, check if the invention described is novel Patent Search Patent Collection QueryResults list Patent application Several languages Many results to check A System and Method for … ………………… ………………… ………………… ………………… ………………… ………………… ………………..

Properties Task: Find related patents to an invention (check novelty) Nature: Recall-oriented search task Objective: Find all possible relevant documents Search time: takes much longer Users: Patent examiners (experts in field of search) Involves cross-language search Huge effort & amount of money for search IR evaluation campaigns: NTCIR, CLEF, TREC Patent Search

State-of-the-art Patent application  Query (80% of research) Which fields in patents to be considered in query formulation Query terms weighting Keywords extraction Cross-language patent search (10%) Translation dictionaries Mixed language index Retrieval models, query expansion, image search.. (10%) Avg. achieved MAP ~ 0.1 Contribution: Evaluation and Cross-language search Patent Search

Evaluation Recall is the objective Precision is also important Huge # documents checked ( documents) Evaluation: average precision (AP)!! Focuses on finding relevant documents early in ranked list Has weak reflection of recall Patent Search Evaluation

Example For a topic with 4 relevant docs and 1 st 100 docs to be examined: System1: relevant ranks = {1} System2: relevant ranks = {50, 51, 53, 54} System3: relevant ranks = {1, 2, 3, 4} AP system1 = 0.25 AP system2 = AP system3 = 1 R system1 = 0.25 R system2 = 1 R system3 = 1 We need a metric that reflects recall and ranking quality in one measure Patent Search Evaluation

Patent Retrieval Evaluation Score n: number of relevant docs r i : the rank at which the i th relevant document is retrieved N max : max number of checked docs Patent Search Evaluation

PRES Gives higher score for systems achieving higher recall and better average relative ranking Dependent on user’s potential/effort (N max ) Very robust with incomplete relevance judgements. Used in the CLEF-IP evaluation task. Research Outcomes Publications (SIGIR, CLEF) License agreement for CLEF-IP organisers to use PRES Currently the standard metric for evaluating patent search Patent Search Evaluation

Cross-Language Patent Search Patent queries are very long Dictionary-based translation quality < MT MT takes significant time Domain specific data required Limited resources for many language pairs Problems: time and resources Cross-Language Patent Search

Idea Manual translation: MT output: MT evaluation: MT sucks IR evaluation: MT rocks Questions: Can we create an MT4IR system? What benefits can be achieved? he are an great ideas to applied stem by information retrieving It is a great idea to apply stemming in information retrieval great idea apply stem information retriev great idea appli stem information retriev i Cross-Language Patent Search

Current Approach vs New Approach Search Translate Process Train MT Query (lang x) Index (lang y) MT Model (lang x  y) Parallel Corpus Results (lang y) Query (lang y) Query (lang y, no stop words, and stemmed) lang x, Cross-Language Patent Search

Experimentation English patent collection French patent topics 8M parallel sentences from patent domain Test new approach (processed MT) vs ordinary approach (ordinary MT) Multiple training sets: 8M, 800k, 80k, 8k, and 2k Test retrieval effectiveness and processing time Baselines: Google translate: PRES MaTrEx (8M training set): PRES, tr time = 31mins/topic Cross-Language Patent Search

Results (retrieval effectiveness) Cross-Language Patent Search

Results (OOV) E.g. play, plays, played, playing Cross-Language Patent Search

Results (translation time) 5 times faster9 times faster20 times faster Cross-Language Patent Search

MT4IR Much 2 faster than ordinary MT Similar retrieval results Better with limited MT training resources Research Outcomes Publications (ECIR, SIGIR) Patent (filed by DCU in 2011) Cross-Language Patent Search

Conclusion Search is not only about the web Many search tasks have different natures and challenges Sometimes solution for a problem in one task can be useful to improve performance for another one Thinking of problems differently usually leads to novel and effective results It does not have to be complicated to be a good idea Conclusion

Thank you