Lemur Toolkit Introduction


Lemur Toolkit Introduction http://net.pku.edu.cn/~wbia Bo Peng (彭波) pb@net.pku.edu.cn School of Electronics Engineering and Computer Science, Peking University 3/21/2011

Recap: Information Retrieval Models. Vector Space Model; probabilistic models; language model.

Some formulas for Sim (VSM): dot product, cosine, Dice, Jaccard. [Diagram: document D and query Q as vectors over terms t1, t2, t3, separated by angle θ]

BM25 (Okapi system), Robertson et al. Considers tf, qtf, and document length; performs document length normalization and TF saturation. k1, k2, k3, b are tuning parameters; qtf is the query term frequency; dl is the document length; avdl is the average document length.
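The equation on this slide was an image and did not survive the transcript. For reference (a standard formulation, not recovered from the slide), the Okapi BM25 score from Robertson et al., with the parameters named above, is:

score(Q,D) = Σ_{t∈Q} log( (N − df_t + 0.5) / (df_t + 0.5) ) · ( (k1 + 1)·tf_t / (K + tf_t) ) · ( (k3 + 1)·qtf_t / (k3 + qtf_t) )

where K = k1·( (1 − b) + b·dl/avdl ), N is the number of documents, and df_t is the document frequency of term t. The k2 correction term, k2·|Q|·(avdl − dl)/(avdl + dl), is usually set to 0.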

Standard Probabilistic IR. [Diagram: an information need is expressed as a query, which is matched against documents d1, d2, …, dn of the document collection]

IR based on Language Models (LM). [Diagram: the query is generated from models of documents d1, d2, …, dn of the collection] A query generation process: for an information need, imagine an ideal document; imagine what words could appear in that document; formulate a query using those words.

Language Modeling for IR. Estimate a multinomial probability distribution from the text, then smooth the distribution with one estimated from the entire collection: P(w|D) = (1 − λ) P(w|D) + λ P(w|C)
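Written out (standard formulations, not from the original slide): Jelinek-Mercer smoothing interpolates with a fixed λ, while Dirichlet smoothing makes the interpolation weight depend on document length via a pseudo-count μ:

P_λ(w|D) = (1 − λ)·P_ML(w|D) + λ·P(w|C)
P_μ(w|D) = ( c(w;D) + μ·P(w|C) ) / ( |D| + μ )

Here c(w;D) is the count of w in D and |D| is the document length; the same μ reappears later as the prior strength in the Indri model.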

Query Likelihood. Estimate the probability that the document generated the query terms: P(Q|D) = ∏_{q∈Q} P(q|D)

Kullback-Leibler Divergence. Estimate models for the document and the query and compare them: KL(Q‖D) = Σ_w P(w|Q) log( P(w|Q) / P(w|D) )

Question: Among the three classic information retrieval models, which is the best choice when designing your retrieval system? How can you tune the model parameters to achieve optimal performance? And when you have a new idea about a retrieval problem, how can you validate it?

A Brief History of IR. Slides from Prof. Ray Larson, School of Information, University of California, Berkeley: http://courses.sims.berkeley.edu/i240/s11/

Experimental IR systems: probabilistic indexing (Maron and Kuhns, 1960); SMART (Gerard Salton at Cornell; vector space model, 1970s); SIRE at Syracuse; I3R (Croft); Cheshire I (1990); TREC (1992); InQuery; Cheshire II (1994); MG (1995?); Lemur (2000?)

Historical Milestones in IR Research: 1958 Statistical Language Properties (Luhn); 1960 Probabilistic Indexing (Maron & Kuhns); 1961 Term association and clustering (Doyle); 1965 Vector Space Model (Salton); 1968 Query expansion (Rocchio, Salton); 1972 Statistical Weighting (Sparck Jones); 1975 2-Poisson Model (Harter, Bookstein, Swanson); 1976 Relevance Weighting (Robertson, Sparck Jones); 1980 Fuzzy sets (Bookstein); 1981 Probability without training (Croft)

Historical Milestones in IR Research (cont.): 1983 Linear Regression (Fox); 1983 Probabilistic Dependence (Salton, Yu); 1985 Generalized Vector Space Model (Wong, Raghavan); 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.); 1990 Latent Semantic Indexing (Dumais, Deerwester); 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr); 1992 TREC (Harman); 1992 Inference networks (Turtle, Croft); 1994 Neural networks (Kwok); 1998 Language Models (Ponte, Croft)

Information Retrieval – Historical View. Research: Boolean model, statistics of language (1950s); vector space model, probabilistic indexing, relevance feedback (1960s); probabilistic querying (1970s); fuzzy set/logic, evidential reasoning (1980s); regression, neural nets, inference networks, latent semantic indexing, TREC (1990s). Industry: DIALOG, Lexis-Nexis, STAIRS (Boolean based); the information industry (O($B)); Verity TOPIC (fuzzy logic); Internet search engines (O($100B?)) (vector space, probabilistic)

Research Systems Software: INQUERY (Croft); OKAPI (Robertson); PRISE (Harman) http://potomac.ncsl.nist.gov/prise; SMART (Buckley); MG (Witten, Moffat); CHESHIRE (Larson) http://cheshire.berkeley.edu; the LEMUR toolkit; Lucene; others

Lemur Project. Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman: [1] INDRI – Overview; [2] Lemur Toolkit Tutorial @ SIGIR 2006

Zoology 101. Lemurs are primates found only in Madagascar; there are 50 species (17 of them endangered). The ring-tailed lemur is Lemur catta.

Zoology 101 The indri is the largest type of lemur When first spotted the natives yelled “Indri! Indri!” Malagasy for "Look!  Over there!"

About The Lemur Project The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project. The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search. http://www.lemurproject.org

Installation http://www.lemurproject.org
Linux, OS X: extract software/lemur-4.12.tar.gz, then
./configure --prefix=/install/path
make
make install
Windows: run software/lemur-4.12-install.exe; documentation is in windoc/index.html

Installation. Use Lemur 4.12. A Java Runtime (JDK 6) is needed for the evaluation tool. Add the binaries to the PATH environment variable (Linux: modify ~/.bash_profile; Windows: My Computer / Properties…).

Indexing Document Preparation Indexing Parameters Time and Space Requirements More information can be found at: http://www.lemurproject.org/tutorials/begin_indexing-1.html

Two Index Formats.
KeyFile: term positions; metadata; offline incremental; InQuery query language.
Indri: term positions; metadata; fields / annotations; online incremental; InQuery and Indri query languages.
(Blue text on the original slide indicates settings specific to Indri applications; refer to the online documentation for other Lemur application parameters.)

Indexing – Document Preparation Document Formats: The Lemur Toolkit can inherently deal with several different document format types without any modification: TREC Text TREC Web Plain Text Microsoft Word(*) Microsoft PowerPoint(*) HTML XML PDF Mbox (*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.

Indexing – Document Preparation. If your documents are not in a format that the Lemur Toolkit can inherently process: if necessary, extract the text from the document, then wrap the plain text in TREC-style wrappers:
<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
or, for more advanced users, write your own parser to extend the Lemur Toolkit. (Plain text documents can also be indexed directly.)

Indexing – Parameters. Basic usage to build an index: IndriBuildIndex <parameter_file>. The parameter file includes options for: where to find your data files; where to place the index; how much memory to use; stopwords, stemming, fields; and many other parameters.

Indexing – Parameters. The standard parameter file is an XML document:
<parameters>
<option></option>
…
</parameters>

Indexing – Parameters. The <corpus> block tells IndriBuildIndex where to find your source files and what type to expect (the older BuildIndex application instead takes <dataFiles>, the name of a file listing the data files to index):
<parameters>
<corpus>
<path>/path/to/source/files</path>
<class>trectext</class>
</corpus>
</parameters>
<path>: (required) the path to the source files (absolute or relative). <class>: (optional) the document type to expect; if omitted, IndriBuildIndex will attempt to guess the file type based on the file's extension.

Indexing – Parameters. The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index. If the index does not exist, a new one is created; if it already exists, new documents are appended into it.
<parameters>
<index>/path/to/the/index</index>
</parameters>

Indexing – Parameters. <memory> defines a soft limit on the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for gigabytes.
<parameters>
<memory>256M</memory>
</parameters>

Indexing – Parameters. For IndriBuildIndex, stopwords are defined within a <stopper> block, with the individual stopwords enclosed in <word> tags (Lemur Classic applications instead take <stopwords>filename</stopwords>):
<parameters>
<stopper>
<word>first_word</word>
<word>next_word</word>
…
<word>final_word</word>
</stopper>
</parameters>

Indexing – Parameters. Term stemming can be applied while indexing via the <stemmer> tag; specify the stemmer type in the <name> tag within it. Stemmers included with the Lemur Toolkit are the Krovetz stemmer and the Porter stemmer.
<parameters>
<stemmer>
<name>krovetz</name>
</stemmer>
</parameters>
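Putting the preceding slides together, a minimal complete IndriBuildIndex parameter file might look like the following (the paths and the three-word stopword list are placeholders to adapt):

<parameters>
  <corpus>
    <path>/data/trec/ap89</path>
    <class>trectext</class>
  </corpus>
  <index>/data/indexes/ap89</index>
  <memory>256M</memory>
  <stopper>
    <word>a</word>
    <word>an</word>
    <word>the</word>
  </stopper>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
</parameters>

Then build the index with: IndriBuildIndex build_params.xml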

Retrieval Parameters Query Formatting Interpreting Results

Retrieval – Parameters. Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file>. The parameter file includes options for: where to find the index; the query or queries; how much memory to use; formatting options; and many other parameters.

Retrieval - Parameters The <index> parameter tells IndriRunQuery/RetEval where to find the repository. <parameters> <index>/path/to/the/index</index> </parameters>

Retrieval – Parameters. The <query> parameter specifies a query, in plain text or using the Indri query language:
<parameters>
<query>
<number>1</number>
<text>this is the first query</text>
</query>
<query>
<number>2</number>
<text>another query to run</text>
</query>
</parameters>
Queries for the Lemur Classic applications can instead come in a doc-formatted query file, e.g. (from the CACM collection):
<DOC>
<DOCNO> 1 </DOCNO>
What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?
</DOC>
<DOC>
<DOCNO> 2 </DOCNO>
I am interested in articles written either by Prieve or Udo Pooch. Prieve, B. Pooch, U.
</DOC>
Try running a simple query from the command line now.

Retrieval – Query Formatting. TREC-style topics cannot be processed directly by IndriRunQuery/RetEval. Format the queries accordingly: format them by hand, or write a script to extract the fields (Python is lovely for this).

Retrieval – Parameters To specify a maximum number of results to return, use the <count> tag: <parameters> <count>50</count> </parameters>

Retrieval – Parameters. Result formatting options: IndriRunQuery/RetEval has built-in formatting specifications for the TREC and INEX retrieval tasks.

Retrieval – Parameters. TREC formatting directives: <runID>, a string specifying the ID for a query run, used in TREC scorable output; <trecFormat>, true to produce TREC scorable output, false (the default) otherwise.
<parameters>
<runID>runName</runID>
<trecFormat>true</trecFormat>
</parameters>
These are also available as command line options, e.g. -runID=runName.
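Combining the retrieval options shown so far, a complete IndriRunQuery parameter file for a TREC-style run might look like this (the index path, query number, and query text are placeholders):

<parameters>
  <index>/data/indexes/ap89</index>
  <count>1000</count>
  <runID>myRun</runID>
  <trecFormat>true</trecFormat>
  <query>
    <number>51</number>
    <text>airbus subsidies</text>
  </query>
</parameters>

Run it and capture the output: IndriRunQuery retrieval_params.xml > myRun.results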

Outputting INEX Result Format. Must be wrapped in <inex> tags. <participant-id>: the participant-id attribute used in submissions; <task>: the task attribute (default CO.Thorough); <query>: the query attribute (default automatic); <topic-part>: the topic-part attribute (default T); <description>: the contents of the description tag.
<parameters>
<inex>
<participant-id>LEMUR001</participant-id>
</inex>
</parameters>

Retrieval – Evaluation. To use trec_eval, format the IndriRunQuery results with the appropriate trec_eval formatting directives in the parameter file: <runID>runName</runID> <trecFormat>true</trecFormat>. The resulting output will be in standard TREC format, ready for evaluation: <queryID> Q0 <DocID> <rank> <score> <runID>
150 Q0 AP890101-0001 1 -4.83646 runName
150 Q0 AP890101-0015 2 -7.06236 runName
Now run the canned queries and evaluate them with trec_eval.
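trec_eval itself takes the relevance judgments (qrels) file and the run file as arguments; a typical invocation (file names are placeholders) is:

trec_eval qrels.ap89 myRun.results

which prints MAP, precision at various cutoffs, and other standard measures for the run.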

Use RetEval for TF.IDF. First run ParseToFile to convert doc-formatted queries into queries: ParseToFile paramfile queryfile (see http://www.lemurproject.org/lemur/parsing.html#parsetofile). Parameter file template:
<parameters>
<docFormat>web</docFormat>
<outputFile>filename</outputFile>
<stemmer>stemmername</stemmer>
<stopwords>stopwordfile</stopwords>
</parameters>
For example:
<parameters>
<docFormat>web</docFormat>
<outputFile>queries.reteval</outputFile>
<stemmer>krovetz</stemmer>
</parameters>

Use RetEval for TF.IDF. Then run RetEval: RetEval paramfile (see http://www.lemurproject.org/lemur/retrieval.html#RetEval). Parameter file template (<retModel> is 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity):
<parameters>
<index>index</index>
<retModel>0</retModel>
<textQuery>query filename</textQuery>
<resultCount>1000</resultCount>
<resultFile>tfidf.res</resultFile>
</parameters>
For example:
<parameters>
<index>index</index>
<retModel>0</retModel>
<textQuery>queries.reteval</textQuery>
<resultCount>1000</resultCount>
<resultFile>tfidf.res</resultFile>
</parameters>

Evaluate Results. TREC qrels are the ground truth: relevance judged by human assessors.

ireval tool: java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" result qrels > pr.result
Demo: data collection; the Lemur GUI (indexer, retrieval); the command line:
D:\work\idx>ParseToFile ../parse_query_param ../topics.51-100.desc
D:\work\idx>RetEval simple_tfidf_param
D:\work\idx>java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" res.simple_tfidf qrel.ap >pr.simple_tfidf

Using the Lemur API & Extending Lemur

Task. When you have a new idea about a retrieval problem, how can you validate it?

Introducing the API. Lemur "Classic" API: many objects, highly customizable; you may want to use this when you want to change how the system works; support for clustering, distributed IR, summarization. Indri API: two main objects; best for integrating search into larger applications; supports the Indri query language, XML retrieval, "live" incremental indexing, and parallel retrieval.

Lemur “Classic” API Primarily useful for retrieval operations Most indexing work in the toolkit has moved to the Indri API Indri indexes can be used with Lemur “Classic” retrieval applications Extensive documentation and tutorials on the website (more are coming)

Lemur Index Browsing. The Lemur API gives access to the index data (e.g. inverted lists, collection statistics). IndexManager::openIndex returns a pointer to an index object; it detects what kind of index you wish to open and returns the appropriate kind of index class. Index::docInfoList (inverted list), termInfoList (document vector), termCount, documentCount.

Lemur: DocInfoList Index::docInfoList( int termID ) Returns an iterator to the inverted list for termID. The list contains all documents that contain termID, including the positions where termID occurs.

Lemur: TermInfoList Index::termInfoList( int docID ) Returns an iterator to the direct list for docID. The list contains term numbers for every term contained in document docID, and the number of times each word occurs. (use termInfoListSeq to get word positions)

Lemur Index Browsing. Index::termCount: termCount() gives the total number of terms indexed; termCount( int id ) gives the total number of occurrences of term number id. Index::docCount: docCount() gives the number of documents indexed; docCount( int id ) gives the number of documents that contain term number id.

Lemur Index Browsing. Index::term: term( char* s ) converts a term string to a number; term( int id ) converts a term number to a string. Index::document: document( char* s ) converts a doc string to a number; document( int id ) converts a doc number to a string.

Lemur Index Browsing. Index::docLength( int docID ): the length, in number of terms, of document number docID. Index::docLengthAvg: the average indexed document length. Index::termCountUnique: the size of the index vocabulary.
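A minimal sketch tying the index-browsing calls above together: open an index, look up a term, and walk its inverted list. The class and method names are the ones from these slides; the header names and the startIteration/hasMore/nextEntry iteration idiom are assumptions based on the Lemur 4.x Classic API, so verify them against the headers of your version.

#include <iostream>
#include "IndexManager.hpp"   // assumed header for IndexManager
#include "DocInfoList.hpp"    // assumed header for DocInfoList

int main() {
  // openIndex detects the index type and returns the appropriate class.
  lemur::api::Index* index =
      lemur::api::IndexManager::openIndex("/data/indexes/ap89/index.key");

  // Convert the term string to its internal term ID.
  lemur::api::TERMID_T termID = index->term("retrieval");
  std::cout << "occurs " << index->termCount(termID) << " times in "
            << index->docCount(termID) << " of "
            << index->docCount() << " documents\n";

  // Walk the inverted list: every document containing the term.
  lemur::api::DocInfoList* dl = index->docInfoList(termID);
  dl->startIteration();                        // assumed iteration idiom
  while (dl->hasMore()) {
    lemur::api::DocInfo* di = dl->nextEntry();
    std::cout << index->document(di->docID())  // external document ID
              << " tf=" << di->termCount() << "\n";
  }
  delete dl;
  delete index;
  return 0;
}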

Browsing Posting List

Lemur Retrieval. Retrieval method classes:
TFIDFRetMethod: BM25
SimpleKLRetMethod: KL-divergence
InQueryRetMethod: simplified InQuery
CosSimRetMethod: cosine
CORIRetMethod: CORI
OkapiRetMethod: Okapi
IndriRetMethod: Indri (wraps QueryEnvironment)

Lemur Retrieval. RetMethodManager::runQuery takes: query (text of the query); index (pointer to a Lemur index); modeltype ("cos", "kl", "indri", etc.); stopfile (filename of your stopword list); stemtype (stemmer); datadir (not currently used); func (only used for the Arabic stemmer). Internally it calls RetMethodManager::createModel and then model->scoreCollection().
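A sketch of a call, with arguments matching the parameter list above. The return type and exact signature are assumptions (Lemur 4.x returns a ranked IndexedRealVector of internal docID/score pairs), so check RetMethodManager.hpp before relying on this:

#include "IndexManager.hpp"      // assumed header
#include "RetMethodManager.hpp"  // assumed header

// Open the index first (see the index-browsing sketch above).
lemur::api::Index* index =
    lemur::api::IndexManager::openIndex("/data/indexes/ap89/index.key");

// ASSUMED return type; runQuery creates the model and scores the collection.
lemur::api::IndexedRealVector* results =
    lemur::api::RetMethodManager::runQuery(
        "language models for retrieval",  // query: text of the query
        index,                            // index: pointer to a Lemur index
        "kl",                             // modeltype: "cos", "kl", "indri", ...
        "stopwords.txt",                  // stopfile: your stopword list
        "krovetz",                        // stemtype: stemmer
        "",                               // datadir: not currently used
        "");                              // func: Arabic stemmer only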

Lemur Retrieval Method lemur::api::RetrievalMethod Class Reference

TextQueryMethod A text query retrieval method is determined by specifying the following elements: A method to compute the query representation A method to compute the doc representation The scoring function A method to update the query representation based on a set of (relevant) documents

TextQueryMethod Scoring Function. Given a query q = (q1, q2, ..., qN) and a document d = (d1, d2, ..., dN), where q1, ..., qN and d1, ..., dN are terms: s(q,d) = g( w(q1,d1,q,d) + ... + w(qN,dN,q,d), q, d ). The function w gives the weight of each matched term; the function g makes it possible to perform any further transformation of the sum of the weights of all matched terms based on the "summary" information of a query or a document (e.g., document length).

TextQueryMethod Scoring Function. ScoreFunction::matchedTermWeight computes the score contribution of a matched term; ScoreFunction::adjustedScore performs score adjustment (e.g., appropriate length normalization).

Lemur: Other tasks Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM, DirichletUnigramLM, etc.

More Stories about Indri

Why Can’t We All Get Along? Structured Data and Information Retrieval Prof. Bruce Croft Center for Intelligent Information Retrieval (CIIR) University of Massachusetts Amherst Slides from Prof. Bruce Croft dbirday-croft.ppt

Similarities and Differences. Common interest in providing efficient access to information on a very large scale; indexing and optimization are key topics for both. Until recently, concern about the effectiveness (accuracy) of access was the domain of IR. The focus on structured vs. unstructured data is historically true but less relevant today. Statistical inference and ranking are central to IR and are becoming more important in DB.

Similarities and Differences. IR systems have focused on providing access to information rather than answers (e.g. Web search); evaluation is typically based on topical relevance and user relevance rather than correctness (except QA). IR works with multiple databases but not multiple relations. IR query languages are more like calculus than algebra. Integrity, security, and concurrency are central for DB, less so in IR.

What is the Goal? An IR system with extended capability for structured data, i.e. extend the IR model to include combination of evidence from structured and unstructured components of complex objects (documents), with a backend database system used to store the objects (cf. "one hand clapping"). Many applications look like this (e.g. desktop search, web shopping), and users seem to prefer this approach (simple queries or forms, plus ranking).

Combination of Evidence. Example: Where was George Washington born? Returns a ranked list of sentences containing the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity. "Structured Query" vs. TextQuery.

Indri – A Candidate IR System. Indri is a separate, downloadable component of the Lemur Toolkit. Influences: INQUERY [Callan et al. '92] (inference network framework, structured query language); Lemur [http://www.lemurproject.org] (language modeling (LM) toolkit); Lucene [http://jakarta.apache.org/lucene/docs/index.html] (popular off-the-shelf Java-based IR system, based on heuristic retrieval models). Designed for new retrieval environments, i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval. (Turtle, H., "Inference Networks for Document Retrieval," Ph.D. dissertation, 1991.)

Zoology 101 The indri is the largest type of lemur When first spotted the natives yelled “Indri! Indri!” Malagasy for "Look!  Over there!"

Design Goals Off the shelf (Windows, *NIX, Mac platforms) Simple to set up and use Fully functional API w/ language wrappers for Java, etc… Robust retrieval model Inference net + language modeling [Metzler and Croft ’04] Powerful query language Designed to be simple to use, yet support complex information needs Provides “adaptable, customizable scoring” Scalable Highly efficient code Distributed retrieval Incremental update

Model Based on original inference network retrieval framework [Turtle and Croft ’91] Casts retrieval as inference in simple graphical model Extensions made to original model Incorporation of probabilities based on language modeling rather than tf.idf Multiple language models allowed in the network (one per indexed context)

Model. [Inference network diagram] From top to bottom: model hyperparameters α,β per context (body, title, h1; observed); document node D (observed); context language models θtitle, θbody, θh1; representation nodes r1 … rN for each context (terms, phrases, etc.); belief nodes q1, q2 (#combine, #not, #max); information need node I (itself a belief node).

[Inference network diagram, repeated from the previous slide]

P( r | θ ): the probability of observing a term, phrase, or feature given a context language model. The ri nodes are binary; assume r ~ Bernoulli( θ ), "Model B" [Metzler, Lavrenko, Croft '04]. Nearly any model may be used here: tf.idf-based estimates (INQUERY), mixture models. (Metzler, D. and Croft, W.B., "Combining the Language Model and Inference Network Approaches to Retrieval," Information Processing and Management Special Issue on Bayesian Networks and Information Retrieval, 40(5), 735-750, 2004.)

[Inference network diagram, repeated]

P( θ | α, β, D ): the prior over a context language model is determined by α, β. Assume P( θ | α, β ) ~ Beta( α, β ), the Bernoulli's conjugate prior, with αr = μ P( r | C ) + 1 and βr = μ P( ¬r | C ) + 1, where μ is a free parameter.

[Inference network diagram, repeated]

P( q | r ) and P( I | r ) Belief nodes are created dynamically based on query Belief node estimates are derived from standard link matrices Combine evidence from parents in various ways Allows fast inference by making marginalization computationally tractable Information need node is simply a belief node that combines all network evidence into a single value Documents are ranked according to P( I | α, β, D)

Belief Inference. [Diagram: belief node q with parent nodes p1, p2, …, pn]

Example: #AND. The link matrix gives P(Q = true | A, B): it is 0 for (false, false), (false, true), and (true, false), and 1 for (true, true). Marginalizing over the parents therefore gives bel(Q) = P(A = true) · P(B = true).

Other Belief Operators
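The operator table for this slide was lost in the transcript. For reference (the standard INQUERY/Indri closed forms, not recovered from the slide), given parent beliefs p1, …, pn and weights wi with W = Σi wi:

#not(p) = 1 − p
#or(p1 … pn) = 1 − ∏i (1 − pi)
#and(p1 … pn) = ∏i pi
#max(p1 … pn) = maxi pi
#sum(p1 … pn) = (1/n)·Σi pi
#wsum(w1 p1 … wn pn) = Σi wi·pi / W
#combine(p1 … pn) = ( ∏i pi )^(1/n)
#weight(w1 p1 … wn pn) = ∏i pi^(wi/W)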

Query Language Extension of INQUERY query language “Structured” query language Term weighting Ordered / unordered windows Synonyms Additional features Language modeling motivated constructs Added flexibility to deal with fields via contexts Generalization of passage retrieval (extent retrieval)

Document Representation.
<html>
<head><title>Department Descriptions</title></head>
<body>
The following list describes …
<h1>Agriculture</h1> …
<h1>Chemistry</h1> …
<h1>Computer Science</h1> …
<h1>Electrical Engineering</h1> …
…
<h1>Zoology</h1>
</body>
</html>
<title> context: <title>department descriptions</title>. <title> extents: 1. department descriptions.
<body> context: <body>the following list describes … <h1>agriculture</h1> … </body>. <body> extents: 1. the following list describes <h1>agriculture</h1> …
<h1> context: <h1>agriculture</h1> <h1>chemistry</h1> … <h1>zoology</h1>. <h1> extents: 1. agriculture, 2. chemistry, …, 36. zoology.

Terms.
Stemmed term, e.g. dog: matches all occurrences of dog (and its stems).
Surface term, e.g. "dogs": matches exact occurrences of dogs (without stemming).
Term group (synonym group), e.g. <"dogs" canine>: matches all occurrences of dogs (without stemming) or canine (and its stems).
POS qualified term, e.g. <"dogs" canine>.NNS: same as the previous, except matches must also be tagged with the NNS POS tag.

Proximity.
#odN(e1 … em) or #N(e1 … em), e.g. #od5(dog cat) or #5(dog cat): all occurrences of dog and cat appearing ordered within a window of 5 words.
#uwN(e1 … em), e.g. #uw5(dog cat): all occurrences of dog and cat that appear in any order within a window of 5 words.
#phrase(e1 … em), e.g. #phrase(#1(willy wonka) #uw3(chocolate factory)): system dependent implementation (defaults to #odm).
#syntax:xx(e1 … em), e.g. #syntax:np(fresh powder): system dependent implementation.

Context Restriction.
dog.title: all occurrences of dog appearing in the title context.
dog.title,paragraph: all occurrences of dog appearing in both a title and a paragraph context (may not be possible).
<dog.title dog.paragraph>: all occurrences of dog appearing in either a title context or a paragraph context.
#5(dog cat).head: all matching windows contained within a head context.

Context Evaluation.
dog.(title): the term dog evaluated using the title context as the document.
dog.(title, paragraph): the term dog evaluated using the concatenation of the title and paragraph contexts as the document.
dog.figure(paragraph): the term dog restricted to figure tags within the paragraph context.

Belief Operators. INQUERY → INDRI: #sum / #and → #combine; #wsum* → #weight; #or → #or; #not → #not; #max → #max. (*) #wsum is still available in INDRI, but should be used with discretion.

Extent Retrieval.
#combine[section](dog canine): evaluates #combine(dog canine) for each extent associated with the section context.
#combine[title, section](dog canine): same as the previous, except it is evaluated for each extent associated with either the title context or the section context.
#sum(#sum[section](dog)): returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent.
#max(#sum[section](dog)): same as the previous, except it returns the maximum score.

Ad Hoc Retrieval. Flat documents: query likelihood retrieval, q1 … qN ≡ #combine( q1 … qN ). SGML/XML documents: can retrieve either documents or extents; context restrictions and context evaluations allow exploitation of document structure.

Web Search Homepage / known-item finding Use mixture model of several document representations [Ogilvie and Callan ’03] Example query: Yahoo! #combine( #wsum( 0.2 yahoo.(body) 0.5 yahoo.(inlink) 0.3 yahoo.(title) ) )

Example Indri Web Query:
#weight(
  0.1 #weight( 1.0 #prior(pagerank) 0.75 #prior(inlinks) )
  1.0 #weight(
    0.9 #combine(
      #wsum( 1 stellwagen.(inlink) 1 stellwagen.(title) 3 stellwagen.(mainbody) 1 stellwagen.(heading) )
      #wsum( 1 bank.(inlink) 1 bank.(title) 3 bank.(mainbody) 1 bank.(heading) ) )
    0.1 #combine(
      #wsum( 1 #uw8( stellwagen bank ).(inlink) 1 #uw8( stellwagen bank ).(title) 3 #uw8( stellwagen bank ).(mainbody) 1 #uw8( stellwagen bank ).(heading) ) ) ) )

Question Answering More expressive passage- and sentence-level retrieval Example: Where was George Washington born? #combine[sentence]( #1( george washington ) born #any:place) Returns a ranked list of sentences containing the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity

Indri Examples Paragraphs from news feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom #filreq(#band( NewsFeed.doctype #date:between(1991 2000) ) #combine[paragraph]( #any:person #any:money InfoCom ) )

Getting Help http://www.lemurproject.org Central website, tutorials, documentation, news http://www.lemurproject.org/phorum Discussion board, developers read and respond to questions README file in the code distribution Paul Ogilvie Trevor Strohman Don Metzler

Thank You! Q&A