Indri at TREC 2004: UMass Terabyte Track Overview Don Metzler University of Massachusetts, Amherst.

Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
An Overview of the Indri Search Engine Don Metzler Center for Intelligent Information Retrieval University of Massachusetts, Amherst Joint work with Trevor.
Beyond Bags of Words: A Markov Random Field Model for Information Retrieval Don Metzler.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR '09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM'11 Speaker : I-Chih Chiu Advisor : Dr. Koh Jia-ling
Information Retrieval in Practice
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
Maximum Likelihood Estimation for Information Thresholding Yi Zhang & Jamie Callan Carnegie Mellon University
Incorporating Language Modeling into the Inference Network Retrieval Framework Don Metzler.
A Markov Random Field Model for Term Dependencies Donald Metzler and W. Bruce Croft University of Massachusetts, Amherst Center for Intelligent Information.
Modern Information Retrieval
MetaSearch R. Manmatha, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst.
An investigation of query expansion terms Gheorghe Muresan Rutgers University, School of Communication, Information and Library Science 4 Huntington St.,
Scalable Text Mining with Sparse Generative Models
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Improving web image search results using query-relative classifiers Josip Krapac, Moray Allan, Jakob Verbeek, Frédéric Jurie.
Information Retrieval in Practice
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR2006.
Retrieval and Feedback Models for Blog Feed Search SIGIR 2008 Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
CSE 6331 © Leonidas Fegaras Information Retrieval and Web Search Engines Leonidas Fegaras.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Estimating Topical Context by Diverging from External Resources SIGIR’13, July 28–August 1, 2013, Dublin, Ireland. Presenter: SHIH, KAI WUN Romain Deveaud.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Probabilistic Query Expansion Using Query Logs Hang Cui Tianjin University, China Ji-Rong Wen Microsoft Research Asia, China Jian-Yun Nie University of.
Effective Query Formulation with Multiple Information Sources
Chapter 6: Information Retrieval and Web Search
Parallel and Distributed Searching. Lecture Objectives Review Boolean Searching Indicate how Searches may be carried out in parallel Overview Distributed.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
University of Malta CSA3080: Lecture 6 © Chris Staff CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin, L. Page Presenter: Abhishek Taneja.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
INFILE CEA LIST ELDA Univ. Lille 3 - Geriico Overview of the INFILE track at CLEF 2009 multilingual INformation FILtering Evaluation.
INDRI - Overview Don Metzler Center for Intelligent Information Retrieval University of Massachusetts, Amherst.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Semantic vs. Positions: Utilizing Balanced Proximity in Language Model Smoothing for Information Retrieval Rui Yan, Han Jiang, Mirella Lapata,
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Evaluating High Accuracy Retrieval Techniques Chirag Shah, W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
NTNU Speech Lab Dirichlet Mixtures for Query Estimation in Information Retrieval Mark D. Smucker, David Kulp, James Allan Center for Intelligent Information.
Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM '09) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
The Cross Language Image Retrieval Track: ImageCLEF Breakout session discussion.
LIST – DTSI – Interfaces, Cognitics and Virtual Reality Unit The INFILE project: a crosslingual filtering systems evaluation campaign Romaric.
Topic by Topic Performance of Information Retrieval Systems Walter Liggett National Institute of Standards and Technology TREC-7 (1999)
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
Michael Bendersky, W. Bruce Croft Dept. of Computer Science Univ. of Massachusetts Amherst Amherst, MA SIGIR
The Loquacious User: A Document-Independent Source of Terms for Query Expansion Diane Kelly et al. University of North Carolina at Chapel Hill.
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
UIC at TREC 2006: Blog Track Wei Zhang Clement Yu Department of Computer Science University of Illinois at Chicago.
Usefulness of Quality Click-through Data for Training Craig Macdonald, Iadh Ounis Department of Computing Science University of Glasgow, Scotland, UK.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Information Retrieval and Extraction 2009 Term Project – Modern Web Search Advisor: 陳信希 TA: 蔡銘峰&許名宏.
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia Department of Electronic, Electrical, and Computer Engineering, USN Lab
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Microsoft Research Cambridge,
Information Retrieval in Practice
Search Engine Architecture
Experiments for the CL-SR task at CLEF 2006
Implementation Issues & IR Systems
Compact Query Term Selection Using Topically Related Text
A Markov Random Field Model for Term Dependencies
John Lafferty, Chengxiang Zhai School of Computer Science
Relevance and Reinforcement in Interactive Browsing
INF 141: Information Retrieval
Presentation transcript:

Indri at TREC 2004: UMass Terabyte Track Overview Don Metzler University of Massachusetts, Amherst

Terabyte Track Summary
GOV2 test collection
– Collection size: 25,205,179 documents (426 GB)
– Index size: 253 GB (includes compressed collection)
– Index time: 6 hours (parallel across 6 machines), ~12 GB/hr
– Vocabulary size: 49,657,854
– Total terms: 22,811,162,783
Parsing
– No index-time stopping
– Porter stemmer
– Normalization (U.S. => US, etc.)
Topics
– 50 .gov-related standard TREC ad hoc topics
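To make the parsing choices concrete, here is a minimal sketch of the index-time token pipeline described above (no stopping, acronym normalization, Porter stemming). It is illustrative only: the regex rule for collapsing dotted acronyms is an assumption, and NLTK's PorterStemmer stands in for Indri's built-in stemmer.

```python
import re
from nltk.stem import PorterStemmer  # stands in for Indri's built-in Porter stemmer

stemmer = PorterStemmer()

def parse(text):
    """Index-time parsing as the slide describes: no stopword removal,
    normalization of dotted acronyms (U.S. -> US), then Porter stemming."""
    tokens = []
    for raw in text.split():
        if re.fullmatch(r"(?:[A-Za-z]\.)+", raw):
            token = raw.replace(".", "")   # hypothetical acronym rule
        else:
            token = re.sub(r"[^0-9A-Za-z]", "", raw)
        if token:
            tokens.append(stemmer.stem(token.lower()))
    return tokens

print(parse("Indexing the U.S. GOV2 collection"))
# -> ['index', 'the', 'us', 'gov2', 'collect']
```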

UMass Runs
– indri04QL: query likelihood
– indri04QLRM: query likelihood + pseudo-relevance feedback
– indri04AW: "adaptive window"
– indri04AWRM: "adaptive window" + pseudo-relevance feedback
– indri04FAW: "adaptive window" + fields

indri04QL / indri04QLRM
Query likelihood
– Standard query likelihood run
– Smoothing parameter trained on TREC 9 and TREC 10 main web track data
– Example: #combine( pearl farming )
Pseudo-relevance feedback
– Estimate a relevance model from the top n documents of the initial retrieval
– Augment the original query with these terms
– Formulation: #weight( 0.5 #combine( Q_ORIGINAL ) 0.5 #combine( Q_RM ) )
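As a concrete sketch of the QLRM run, the snippet below scores expansion terms from the top-n documents (weighting each document's term distribution by its retrieval score, in the spirit of Lavrenko and Croft's relevance models) and emits the #weight formulation from the slide. The function names, the score-weighted estimate, and the expansion-term count are illustrative assumptions, not the exact UMass implementation.

```python
from collections import Counter

def relevance_model(top_docs, n_terms=10):
    """Score expansion terms from the top-ranked documents: each document
    contributes its term distribution, weighted by its retrieval score.
    A simplified, unsmoothed sketch of relevance-model estimation."""
    p_w = Counter()
    for doc_text, score in top_docs:
        tokens = doc_text.split()
        for w in tokens:
            p_w[w] += score / len(tokens)
    return [w for w, _ in p_w.most_common(n_terms)]

def qlrm_query(original_terms, rm_terms):
    """The slide's formulation:
    #weight( 0.5 #combine( Q_ORIGINAL ) 0.5 #combine( Q_RM ) )"""
    return "#weight( 0.5 #combine( %s ) 0.5 #combine( %s ) )" % (
        " ".join(original_terms), " ".join(rm_terms))

# Toy top-n documents as (text, query-likelihood score) pairs:
top = [("pearl farming oyster beds pearl culture", 0.7),
       ("farming subsidies pearl industry", 0.3)]
print(qlrm_query(["pearl", "farming"], relevance_model(top, n_terms=3)))
```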

indri04AW / indri04AWRM
Goal:
– Given only a title query, automatically construct an Indri query
– What's the best we can do without query expansion techniques?
Can we model query term dependence using Indri proximity operators?
– Ordered window ( #N )
– Unordered window ( #uwN )

Related Work
InQuery query formulations
– Hand-crafted formulations
– Used part-of-speech tagging and placed noun phrases into the #phrase operator
Dependence models
– van Rijsbergen's tree dependence
– Dependence language models
Google

Assumptions
Assumption 1
– Query terms are likely to appear in relevant documents
Assumption 2
– Query terms are likely to appear ordered in relevant documents
Assumption 3
– Query terms are likely to appear in close proximity to one another in relevant documents

Assumption 1: "Query terms are likely to appear in relevant documents"
Proposed feature (for every term q in our query)
Indri representation:
– q
Elements from our example:
– TREC
– terabyte
– track

Assumption 2: "Query terms are likely to appear ordered in relevant documents"
Proposed feature (for every contiguous subphrase q_i … q_{i+k}, k ≥ 1, in our query)
Indri representation:
– #1( q_i … q_{i+k} )
Elements from our example:
– #1( TREC terabyte )
– #1( terabyte track )
– #1( TREC terabyte track )

Assumption 3: "Query terms are likely to appear in close proximity to one another in relevant documents"
Proposed feature (for every non-singleton subset of query terms)
Indri representation:
– #uwN( q_1 … q_k ), with N = 4*k
Elements from our example:
– #uw8( TREC terabyte )
– #uw8( terabyte track )
– #uw8( TREC track )
– #uw12( TREC terabyte track )
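The three assumptions map directly onto three feature generators. Below is a minimal sketch, assuming plain whitespace-tokenized query terms, that enumerates the single-term, ordered-window, and unordered-window elements exactly as in the TREC terabyte track example; it is reused in the parameter-tying sketch two slides down.

```python
from itertools import combinations

def dependence_features(terms):
    """Generate the three feature sets from Assumptions 1-3:
      A1: single terms                        ->  q
      A2: contiguous subphrases (length >= 2) ->  #1( q_i ... q_{i+k} )
      A3: non-singleton term subsets          ->  #uwN( q_1 ... q_k ), N = 4*k
    """
    single = list(terms)
    ordered = ["#1( %s )" % " ".join(terms[i:j])
               for i in range(len(terms))
               for j in range(i + 2, len(terms) + 1)]
    unordered = ["#uw%d( %s )" % (4 * len(s), " ".join(s))
                 for k in range(2, len(terms) + 1)
                 for s in combinations(terms, k)]
    return single, ordered, unordered

print(dependence_features(["TREC", "terabyte", "track"]))
```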

Putting it all together…
#weight( w_0 TREC w_1 terabyte w_2 track
         w_3 #1( TREC terabyte ) w_4 #1( terabyte track ) w_5 #1( TREC terabyte track )
         w_6 #uw8( TREC terabyte ) w_7 #uw8( terabyte track ) w_8 #uw8( TREC track )
         w_9 #uw12( TREC terabyte track ) )
Too many parameters!

Parameter Tying
Assume that all weights of the same type have equal value. Resulting query:
#weight( w_TERM #combine( TREC terabyte track )
         w_ORDERED #combine( #1( TREC terabyte ) #1( terabyte track ) #1( TREC terabyte track ) )
         w_UNORDERED #combine( #uw8( TREC terabyte ) #uw8( terabyte track ) #uw8( TREC track ) #uw12( TREC terabyte track ) ) )
Now only 3 parameters!
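Here is a sketch of the tied formulation, reusing dependence_features() from the earlier sketch; the default weights are the trained values reported on the Weights slide below.

```python
def tied_query(terms, w_term=1.5, w_ordered=0.1, w_unordered=0.3):
    """Assemble the parameter-tied query, reusing dependence_features()
    from the earlier sketch.  The default weights are the trained values
    reported on the Weights slide below."""
    single, ordered, unordered = dependence_features(terms)
    return ("#weight( %s #combine( %s ) %s #combine( %s ) %s #combine( %s ) )"
            % (w_term, " ".join(single),
               w_ordered, " ".join(ordered),
               w_unordered, " ".join(unordered)))

print(tied_query(["TREC", "terabyte", "track"]))
```

Tying collapses the ten free weights of the untied query into one weight per feature class, which is what makes the direct search for MAP-optimal weights on the next slide tractable.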

Setting the weights
– Training data: WT10g / TREC web queries
– Objective function: mean average precision (MAP)
– Method: coordinate ascent via line search
Likely to find only local maxima
– Global only if MAP is a convex function of the weights
Requires evaluating a lot of queries
– Not really a problem with a fast retrieval engine
– Able to use all of the training data
Other methods to try:
– MaxEnt (maximize likelihood)
– SVM (maximize margin)
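The training loop can be sketched as plain coordinate ascent with a grid line search over one weight at a time. Here eval_map is a stand-in for the expensive step the slide mentions, running the training queries through the engine and computing mean average precision; the grid, sweep count, and toy objective below are assumptions for illustration.

```python
def coordinate_ascent(eval_map, init, grid, n_sweeps=10):
    """Maximize MAP by line search over one weight at a time, holding the
    others fixed, and sweeping until no single change improves the score.
    Finds a local maximum, as the slide notes."""
    weights = list(init)
    best = eval_map(weights)
    for _ in range(n_sweeps):
        improved = False
        for i in range(len(weights)):
            for v in grid:
                trial = weights[:i] + [v] + weights[i + 1:]
                score = eval_map(trial)
                if score > best:
                    best, weights, improved = score, trial, True
        if not improved:
            break
    return weights, best

# eval_map would run the training queries through the engine and compute
# mean average precision; this toy objective peaks at (1.5, 0.1, 0.3):
def toy_map(w):
    return -((w[0] - 1.5) ** 2 + (w[1] - 0.1) ** 2 + (w[2] - 0.3) ** 2)

grid = [i / 10 for i in range(21)]  # 0.0, 0.1, ..., 2.0
print(coordinate_ascent(toy_map, [1.0, 1.0, 1.0], grid))
```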

Weights
The following weights were found:
– w_TERM = 1.5
– w_ORDERED = 0.1
– w_UNORDERED = 0.3
These are indicative of the relative importance of each feature, and of the degree to which idf factors overweight the phrase features.

indri04FAW
Combines evidence from different fields
– Fields indexed: anchor, title, body, and header (h1, h2, h3, h4)
– Formulation: #weight( 0.15 Q_ANCHOR 0.25 Q_TITLE 0.10 Q_HEADING 0.50 Q_BODY )
Needs to be explored in more detail
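A small sketch of how the per-field mixture might be rendered. Each Q_FIELD stands for the query evaluated against one field's representation; here the terms are restricted per field with Indri-style term.field context restriction, which should be treated as an assumed rendering rather than the exact run formulation.

```python
def field_query(terms, field_weights):
    """Render the slide's per-field mixture.  Each Q_FIELD is approximated
    as the bag of query terms restricted to one field via Indri-style
    term.field context restriction (an assumed rendering)."""
    parts = []
    for field, w in field_weights:
        restricted = " ".join("%s.%s" % (t, field) for t in terms)
        parts.append("%s #combine( %s )" % (w, restricted))
    return "#weight( %s )" % " ".join(parts)

print(field_query(["pearl", "farming"],
                  [("anchor", 0.15), ("title", 0.25),
                   ("heading", 0.10), ("body", 0.50)]))
```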

Other Approaches
Glasgow
– Terrier
– Divergence from randomness model
University of Melbourne
– Document-centric integral impacts
CMU
– Used Indri
University of Amsterdam
– Only indexed titles and anchor text
– 20 minutes to index, 1 second to query
– Very poor effectiveness

Indri Terabyte Track Results (T = title, D = description, N = narrative)
The original slide shows two tables, one for P10 and one for MAP, reporting the QL, QLRM, AW, and AWRM runs across the T, TD, and TDN query-field combinations; the numeric values did not survive in this transcript. Italicized values denoted statistical significance over QL.

Best Run per Group
The original slide plots MAP for the best run submitted by each group: uogTBQEL, cmuapfs2500, indri04AWRM, MU04tb4, THUIRtb4, humT04l, zetplain, iit00t, MSRAt1, sabir04ta2, mpi04tb07, DcuTB04Base, nn04tint, pisa3, UAmsT04TBm1, apl04w4tdn, irttbtl; the MAP values themselves are not recoverable from this transcript.

Conclusions
– Indri was competitive in both efficiency and effectiveness on the Terabyte track
– The "adaptive window" approach is a promising way of formulating queries
– More work must be done on combining other forms of evidence to further improve effectiveness: fields, prior information, link analysis