A Study on Query Expansion Methods for Patent Retrieval. Walid Magdy, Gareth Jones. Centre for Next Generation Localisation, School of Computing, Dublin City University.

A Study on Query Expansion Methods for Patent Retrieval
Walid Magdy, Gareth Jones
Centre for Next Generation Localisation, School of Computing, Dublin City University
24 October 2011

Outline (Agenda)
- What is the Problem? (Motivation)
- Why Patents? (Patent Characteristics)
- Current Solutions (Prior Work)
- Testing Existing Approaches (Applying Standard QE)
- New Approach (Novel Method)
- Results (Outcome)
- Conclusion (Findings)

Why Patents?
- Challenging wording
  - Vague and general terms
  - Strange combinations of terms
- No defined query (which words should be selected for search?)
- Low retrieval effectiveness
- Recall-oriented IR task
- Hypothesis: QE → better query/doc match → better results

Prior Work
- Pseudo Relevance Feedback (PRF) (Kishida K., NTCIR-3; Itoh H., NTCIR-4)
  - QE using the Rocchio formula: no significant improvement
  - QE using the Taylor formula: no significant improvement
  - Reweighting query terms using PRF: no significant improvement
- Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al., NTCIR-5)
  - QE for individual claims from the same patent topic: significant improvement, but not applicable to other patent search tasks
- Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)
  - Enriches queries to improve the retrievability of patents with a low chance of retrieval, but not tested on a real patent search task
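For reference, the Rocchio update mentioned above is usually written as below. This is the standard textbook form, not notation taken from the cited NTCIR papers: q_0 is the original query vector, D_r and D_nr are the sets of documents treated as relevant and non-relevant, and alpha, beta, gamma are tuning weights. In pseudo relevance feedback, D_r is simply the top-ranked documents of the initial run, and gamma is often set to 0.

```latex
\vec{q}_{\mathrm{new}}
  = \alpha\,\vec{q}_{0}
  + \frac{\beta}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d}
  - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}\in D_{nr}}\vec{d}
```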

Testing QE for Prior-Art Patent Search
- CLEF-IP 2010
  - 1.35M patents from the EPO
  - 1.35K English patent topics
  - Collection contains EN/FR/DE patents, with translations of titles and claims in the three languages
- Expand query by: PRF vs. WordNet
- Baseline (BL): (Magdy et al., 2011) without citation extraction (full patent description section as query)
- MAP and PRES were used for evaluation
- BL: 0.14 MAP, 0.486 PRES (baseline row of the results tables)

Applying Pseudo Relevance Feedback
- The PRF implementation in Indri was used
- Different numbers of feedback (FB) terms and docs were tested
[Figure: MAP and PRES for different numbers of FB terms and docs, compared with the BL values]
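The idea behind PRF can be sketched in a few lines of Python. This is a minimal frequency-based illustration, not Indri's feedback model: the function name `prf_expand`, its parameters `fb_docs` and `fb_terms`, and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, fb_docs=2, fb_terms=3):
    """Expand a query with frequent terms from the top-ranked documents.

    ranked_docs is a list of token lists, best-ranked first, as returned
    by some initial retrieval run (hypothetical here).
    """
    counts = Counter()
    for doc in ranked_docs[:fb_docs]:
        counts.update(doc)
    original = set(query_terms)
    # Rank candidates by frequency, breaking ties alphabetically.
    candidates = sorted(
        (term for term in counts if term not in original),
        key=lambda term: (-counts[term], term),
    )
    return list(query_terms) + candidates[:fb_terms]


# Toy initial ranking (stemmed tokens); only the top two documents are used.
ranked = [
    ["wast", "heat", "stream", "filter", "filter"],
    ["heat", "exchang", "filter", "pipe"],
    ["engin", "motor", "fuel"],
]
expanded = prf_expand(["wast", "heat"], ranked, fb_docs=2, fb_terms=2)
print(expanded)
```

If the top-ranked documents are mostly non-relevant, the added terms pull the query off topic, which is the usual explanation for PRF failing when initial precision is low.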

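Using WordNet for Expansion
- Expand terms in the query using synonyms and hyponyms for nouns and verbs (NS = noun synonyms, NH = noun hyponyms, VS = verb synonyms, VH = verb hyponyms)
- Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35k topics

Sample of 100 topics:
Run             MAP      %change   PRES    %change
Baseline        0.1668   NA        0.584   NA
Rows for NS, NS+NH, NS+VS and NS+NH+VS+VH follow on the slide; their values are missing from this transcript.

Full topic set:
Run             MAP      %change   PRES    %change
Baseline        0.1399   NA        0.486   NA
The WordNet (NS) row follows on the slide; its values are missing from this transcript.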
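A thesaurus-based expansion of this kind can be sketched as follows. The `THESAURUS` dictionary is a hypothetical stand-in for WordNet lookups, and `expand` is an illustrative helper, not the code used in the study. The sketch mainly shows how quickly the query grows once hyponyms are added.

```python
# Hypothetical toy thesaurus; a real run would query WordNet instead.
THESAURUS = {
    "motor": {"synonyms": ["engine"], "hyponyms": ["turbine", "diesel"]},
    "tube": {"synonyms": ["pipe"], "hyponyms": ["catheter"]},
}


def expand(query_terms, use_synonyms=True, use_hyponyms=False):
    """Return the query with synonyms and/or hyponyms added after each term."""
    expanded = []
    for term in query_terms:
        expanded.append(term)
        entry = THESAURUS.get(term, {})
        if use_synonyms:
            expanded.extend(entry.get("synonyms", []))
        if use_hyponyms:
            expanded.extend(entry.get("hyponyms", []))
    return expanded


query = ["motor", "tube", "heat"]
ns = expand(query)                        # synonyms only
ns_nh = expand(query, use_hyponyms=True)  # synonyms plus hyponyms
print(len(query), len(ns), len(ns_nh))
```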

Standard QE Approaches
- PRF
  - Significant degradation in retrieval effectiveness
  - This can be expected, given the low initial retrieval precision
- WordNet
  - Statistically significant degradation of results, but with some successful instances (31% of topics)
  - Large reduction in retrieval speed, since the average query size is at least 5 times larger (34 times larger for NS+NH+VS+VH)
- A new effective and efficient QE method is required!

Automatically Generated SynSet
Steps: Align Sentences, Remove Stopwords, Stem Words, Align Terms, Backoff Alignment

Example:
- English fields: process for eliminating foreign matter from a waste heat stream
- French transl.: procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue
- After stopword removal and stemming (EN): process elimin foreign matter wast heat stream
- After stopword removal and stemming (FR): procéd élimin mati étrangèr cour chaleur perdu
- EN → FR terms dic.: elimin: élimin 0.71, elimin 0.13
- FR → EN terms dic.: élimin: remov 0.71, elimin 0.14
- EN → EN terms dic. (two entries shown on the slide): elimin: remov 0.6, elimin 0.16; elimin: remov 0.85, elimin 0.15
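One way to read the last step is as a pivot through the French dictionary: an English term is mapped to its French translations and back to English, and the probabilities are multiplied and summed. The sketch below assumes exactly that composition plus a renormalisation and a cut-off (`min_prob`). These are illustrative assumptions, and the slide does not spell out the actual backoff procedure, so the numbers it produces are not expected to match the slide.

```python
def build_synsets(en_fr, fr_en, min_prob=0.1):
    """Derive EN -> EN synonym sets by pivoting through French.

    en_fr maps an English stem to {French stem: probability};
    fr_en maps a French stem to {English stem: probability}.
    """
    synsets = {}
    for en_term, fr_probs in en_fr.items():
        scores = {}
        for fr_term, p_fr in fr_probs.items():
            for en_cand, p_en in fr_en.get(fr_term, {}).items():
                # Sum over all French pivots that lead to this candidate.
                scores[en_cand] = scores.get(en_cand, 0.0) + p_fr * p_en
        total = sum(scores.values())
        if total == 0:
            continue
        kept = {t: s / total for t, s in scores.items() if s / total >= min_prob}
        if not kept:
            continue
        kept_total = sum(kept.values())
        # Renormalise so the surviving candidates sum to 1.
        synsets[en_term] = {t: p / kept_total for t, p in kept.items()}
    return synsets


# Dictionary entries taken from the slide's "elimin" example.
EN_FR = {"elimin": {"élimin": 0.71, "elimin": 0.13}}
FR_EN = {"élimin": {"remov": 0.71, "elimin": 0.14}}

synsets = build_synsets(EN_FR, FR_EN)
for term, prob in sorted(synsets["elimin"].items(), key=lambda kv: -kv[1]):
    print(term, round(prob, 2))
```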

Samples of the Output
Term      SynSet (stem and probability)
motor     motor 0.64, engin 0.36
weight    weight 0.86, wt 0.14
travel    travel 0.67, move 0.19, displac 0.14
color     color 0.56, colour 0.25, dye 0.19
link      link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12
cloth     fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14
tube      tube 0.88, pipe 0.12
area      area 0.4, zone 0.23, region 0.2, surfac 0.17
game      set 0.6, game 0.4
play      set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1

SynSet QE Results
- 8M parallel EN/FR sentences were extracted from the EPO patent collection to generate the SynSets
- Two runs were carried out:
  - Usynset: expanding the query using SynSet without weights
  - Wsynset: using the SynSet probabilities as weights for the terms in the query

Run             MAP      %change   PRES    %change
Baseline        0.1399   NA        0.486   NA
The Wsynset and Usynset rows follow on the slide; their values are missing from this transcript.
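The difference between the two runs can be illustrated with a small sketch. The `SYNSETS` entries are copied from the sample output slide. The function `expand_query` and the (term, weight) representation are assumptions for illustration, since the transcript does not show how the queries were actually formulated for the search engine.

```python
# SynSet entries copied from the sample output slide.
SYNSETS = {
    "motor": {"motor": 0.64, "engin": 0.36},
    "tube": {"tube": 0.88, "pipe": 0.12},
}


def expand_query(terms, synsets, weighted):
    """Return (term, weight) pairs for the expanded query.

    weighted=False mimics a Usynset-style run (every term gets weight 1.0);
    weighted=True mimics a Wsynset-style run (SynSet probabilities as weights).
    Terms without a SynSet are kept unchanged with weight 1.0.
    """
    expanded = []
    for term in terms:
        for syn, prob in synsets.get(term, {term: 1.0}).items():
            expanded.append((syn, prob if weighted else 1.0))
    return expanded


query = ["motor", "tube", "heat"]
usynset = expand_query(query, SYNSETS, weighted=False)
wsynset = expand_query(query, SYNSETS, weighted=True)
print(usynset)
print(wsynset)
```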

SynSet Expansion
- Significantly better MAP, but significantly worse PRES, i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and lower recall
- Some topics were improved (34% of topics), but some were degraded (39% of topics)
- Significantly more efficient than PRF and WordNet (query size is only 60% larger)
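How one run can score better on MAP yet worse on PRES is easier to see on a toy topic. The sketch below computes average precision (the per-topic component of MAP) and PRES following my reading of the PRES definition by Magdy and Jones (SIGIR 2010), in which relevant documents missing from the top `n_max` results are assumed to sit just after the cut-off. The function names and the two example rankings are illustrative assumptions.

```python
def average_precision(relevant_ranks, n_relevant):
    """Average precision from the 1-based ranks of the retrieved relevant docs."""
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / n_relevant


def pres(relevant_ranks, n_relevant, n_max):
    """PRES for one topic, with n_max the maximum number of results checked."""
    ranks = sorted(r for r in relevant_ranks if r <= n_max)
    found = len(ranks)
    # Relevant documents not retrieved are placed right after the cut-off.
    missing = [n_max + i for i in range(found + 1, n_relevant + 1)]
    mean_rank = (sum(ranks) + sum(missing)) / n_relevant
    return 1 - (mean_rank - (n_relevant + 1) / 2) / n_max


# Toy topic with 4 relevant documents and a cut-off of 100 results.
run_a = [10, 20, 30, 40]  # finds all four, but not near the top
run_b = [1, 2]            # finds two at the very top, misses the other two

ap_a, ap_b = average_precision(run_a, 4), average_precision(run_b, 4)
pres_a, pres_b = pres(run_a, 4, 100), pres(run_b, 4, 100)
print(ap_a, pres_a)
print(ap_b, pres_b)
```

Run B wins on average precision because its two hits are at the very top, while run A wins on PRES because it retrieves all four relevant documents, which is the same pattern the slide reports for SynSet expansion against the baseline.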

Deeper Look at SynSet
- No features with high correlation to SynSet QE success
- Initial retrieval quality of the BL does not relate to the performance of QE
[Table: per-topic results with columns Topic ID, Baseline, Wsynset and %change; only the "PAC" topic-ID prefix and two "∞" change entries survive in this transcript]

Conclusions
- PRF is not effective for patent prior-art search
- WordNet QE for patent search:
  - Leads to overall significant degradation of retrieval
  - Has some positive impact on the retrieval of some topics
  - High computational cost
- SynSet QE for patent search:
  - The most effective and efficient QE technique among those tested
  - Significant improvement at very high ranks, but significant degradation of overall ranking and recall
  - No indication of when it fails/succeeds
- SynSet can be used as a lexical resource for patent examiners

Future Work
- More analysis to better understand when QE fails/succeeds
- Applying SynSet to real patent examiners' queries rather than automatically formulated queries
- Combining different QE methods
- Alternative methods for query modification, for example query reduction (QR)

Please Check in CIKM Poster Session
- Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.
- Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.

Thank you