Why Can’t We All Get Along? (Structured Data and Information Retrieval)
Bruce Croft
Computer Science Department, University of Massachusetts Amherst

Overview
History of structured data in IR
Conceptual similarities and differences
What is the goal?
The Indri System
Examples using IR for structured data
– XML retrieval
– Relevance models
– Entity retrieval

History
IR systems have had Boolean field restrictions since the 1970s
– metadata: date, type, source, keywords
– content structure: title, body
Implementing IR systems using a relational DBMS first done in the 70s
– Crawford and McCleod
– efficiency issues with this approach persisted until the 90s (e.g. DeFazio et al, SIGIR 95)
– Inquery IR system successfully used an object management system (Brown, SIGIR 95)

History
Modifying the DBMS model to incorporate probabilities to integrate DB/IR
– e.g. probabilistic relational algebra (Fuhr and Rolleke, ACM TOIS 1994)
– e.g. probabilistic datalog (Fuhr, SIGIR 95)
Text retrieval as a SQL function in commercial DBMSs
– e.g. Oracle, early 90s

History
Ranked retrieval of “complex” documents
– e.g. office documents with structure and significant text content (Croft, Krovetz and Turtle, IPM 1990)
– Bayesian inference net model to combine evidence from different parts of document structure (Croft and Turtle, EDBT 1992)
– e.g. marked-up documents (Croft, Smith, and Turtle, SIGIR 1992)
XML retrieval
– INEX (2002)

Similarities and Differences
Common interest in providing efficient access to information on a very large scale
– indexing and optimization key topics
Until recently, concern about effectiveness (accuracy) of access was the domain of IR
Focus on structured vs. unstructured data is historically true but less relevant today
Statistical inference and ranking are central to IR, becoming more important in DB

Similarities and Differences
IR systems have focused on providing access to information rather than answers
– e.g. Web search
– evaluation typically based on topical relevance and user relevance rather than correctness (except QA)
IR works with multiple databases but not multiple relations
IR query languages are more like calculus than algebra
Integrity, security, concurrency are central for DB, less so in IR

What is the Goal?
One unified information system?
– i.e. a single conceptual and formal framework to support the entire range of information needs
– at least a grand challenge
– or is it the Web?
An integrated DB/IR system?
– i.e. extend the database model to fully support statistical inference and ranking
– a major challenge given established systems and models

What is the Goal?
An IR system with extended capability for structured data
– i.e. extend the IR model to include combination of evidence from structured and unstructured components of complex objects (documents)
– backend database system used to store objects (cf. “one hand clapping”)
– many applications look like this (e.g. desktop search, web shopping)
– users seem to prefer this approach (simple queries or forms, and ranking)

What is the Goal?
What about important database functionality?
– source data can be stored in databases
– extended IR system will construct separate indexes
What about optimization?
– search engines worry about optimization!
– can incorporate ideas from DB optimization
What about updates?
– search engines worry about updates!
– backend database system still available
What about joins?
– interesting. Treat IR objects as a view?

Indri – A Candidate IR System
Indri is a separate, downloadable component of the Lemur Toolkit
Influences
– INQUERY [Callan et al. ’92]: inference network framework, query language
– Lemur: language modeling (LM) toolkit
– Lucene: popular off-the-shelf Java-based IR system, based on heuristic retrieval models
Designed for new retrieval environments
– i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval

Zoology 101
The indri is the largest type of lemur.
When first spotted, the natives yelled “Indri! Indri!”
Malagasy for “Look! Over there!”

Design Goals
Off the shelf (Windows, *NIX, Mac platforms)
– simple to set up and use
– fully functional API w/ language wrappers for Java, etc.
Robust retrieval model
– inference net + language modeling [Metzler and Croft ’04]
Powerful query language
– designed to be simple to use, yet support complex information needs
– provides “adaptable, customizable scoring”
Scalable
– highly efficient code
– distributed retrieval
– incremental update

Model
Based on the original inference network retrieval framework [Turtle and Croft ’91]
Casts retrieval as inference in a simple graphical model
Extensions made to the original model
– incorporation of probabilities based on language modeling rather than tf.idf
– multiple language models allowed in the network (one per indexed context)

Model
[Inference network diagram: document node D (observed); model hyperparameters (α,β)_title, (α,β)_body, (α,β)_h1 (observed); context language models θ_title, θ_body, θ_h1; representation nodes r_1 … r_N per context (terms, phrases, etc.); belief nodes q_1, q_2 (#combine, #not, #max); information need node I (a belief node)]

P( r | θ )
Probability of observing a term, phrase, or feature given a context language model
– the r_i nodes are binary
Assume r ~ Bernoulli( θ )
– “Model B” [Metzler, Lavrenko, Croft ’04]

P( θ | α, β, D )
Prior over the context language model determined by α, β
Assume P( θ | α, β ) ~ Beta( α, β )
– Bernoulli’s conjugate prior
– α_r = μP( r | C ) + 1
– β_r = μP( ¬r | C ) + 1
– μ is a free parameter
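
To make the estimation concrete: with α_r and β_r set as above, the MAP of the Beta posterior works out to a Dirichlet-style smoothed term estimate. A minimal sketch, assuming this MAP form (function and parameter names are illustrative, not Indri’s API):

```python
def p_r_given_D(tf, doc_len, cf, coll_len, mu=2500):
    """MAP estimate of P(r | alpha, beta, D) under the Beta-Bernoulli model.

    With alpha_r = mu*P(r|C) + 1 and beta_r = mu*P(~r|C) + 1, the mode of
    Beta(alpha_r + tf, beta_r + doc_len - tf) reduces to Dirichlet-style
    smoothing of the in-context estimate with the collection model:
        (tf + mu * P(r|C)) / (|D| + mu)
    """
    p_r_C = cf / coll_len                      # background probability P(r|C)
    return (tf + mu * p_r_C) / (doc_len + mu)

# e.g. a term occurring 5 times in a 1,000-word context, 20,000 times
# in a 10^8-word collection:
print(p_r_given_D(tf=5, doc_len=1000, cf=20000, coll_len=10**8))
```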

P( q | r ) and P( I | r )
Belief nodes are created dynamically based on the query
Belief node estimates are derived from standard link matrices
– combine evidence from parents in various ways
– allows fast inference by making marginalization computationally tractable
The information need node is simply a belief node that combines all network evidence into a single value
Documents are ranked according to P( I | α, β, D )

Example: #AND
[Diagram: parent nodes A and B feed belief node Q]

P(Q=true | a, b)   A       B
0                  false   false
0                  false   true
0                  true    false
1                  true    true
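
Marginalizing over link matrices like the one above yields simple closed forms, which is what makes inference fast. A hedged sketch of the standard forms from the inference network literature, plus the geometric-mean behavior commonly used to describe Indri’s #combine/#weight (names and the usage example are illustrative):

```python
import math

# Closed forms obtained by marginalizing over the standard link matrices.
def b_and(ps):      return math.prod(ps)                        # all parents true
def b_or(ps):       return 1.0 - math.prod(1.0 - p for p in ps)
def b_not(p):       return 1.0 - p
def b_max(ps):      return max(ps)
def b_wsum(ps, ws): return sum(w * p for w, p in zip(ws, ps)) / sum(ws)

# Indri's #combine / #weight behave like (weighted) geometric means:
def b_combine(ps):  return math.prod(ps) ** (1.0 / len(ps))
def b_weight(ps, ws):
    total = sum(ws)
    return math.prod(p ** (w / total) for w, p in zip(ws, ps))

print(b_and([0.8, 0.5]), b_or([0.8, 0.5]), b_combine([0.8, 0.5]))
```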

Query Language
Extension of the INQUERY query language
“Structured” query language
– term weighting
– ordered / unordered windows
– synonyms
Additional features
– language modeling motivated constructs
– added flexibility to deal with fields via contexts
– generalization of passage retrieval (extent retrieval)

Document Representation
[Example: an XML document. The title reads “Department Descriptions”; the body begins “The following list describes …” and contains headings Agriculture, Chemistry, Computer Science, Electrical Engineering, …, Zoology. Each tag type defines a context, and each tag occurrence defines an extent within it: the heading context has 36 extents (1. agriculture, 2. chemistry, …, 36. zoology), the body context has one extent (“the following list describes agriculture …”), and the title context has one extent (“department descriptions”).]
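
To illustrate the contexts/extents view, a small sketch that derives the same structure from a toy XML document (the tag names and flat string representation are assumptions; a real index stores token spans, not strings):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

doc = """<doc><title>department descriptions</title>
<body>the following list describes ...
<h1>agriculture</h1> ... <h1>chemistry</h1> ... <h1>zoology</h1> ...
</body></doc>"""

# A context collects all text governed by one tag type; each occurrence
# of the tag is one extent.
contexts = defaultdict(list)
for elem in ET.fromstring(doc).iter():
    if elem.tag != "doc":
        text = " ".join("".join(elem.itertext()).split())
        contexts[elem.tag].append(text)  # one extent per tag occurrence

for ctx, extents in contexts.items():
    print(ctx, "->", extents)
```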

Terms

Type                         Example    Matches
Stemmed term                 dog        All occurrences of dog (and its stems)
Surface term                 “dogs”     Exact occurrences of dogs (without stemming)
Term group (synonym group)              All occurrences of dogs (without stemming) or canine (and its stems)
POS qualified term           .NNS       Same as previous, except matches must also be tagged with the NNS POS tag

Proximity

Type                            Example                                            Matches
#odN (e1 … em) or #N (e1 … em)  #od5 (dog cat) or #5 (dog cat)                     All occurrences of dog and cat appearing ordered within a window of 5 words
#uwN (e1 … em)                  #uw5 (dog cat)                                     All occurrences of dog and cat that appear in any order within a window of 5 words
#phrase (e1 … em)               #phrase ( #1 (willy wonka) #uw3 (chocolate factory))   System dependent implementation (defaults to #odm)
#syntax:xx (e1 … em)            #syntax:np (fresh powder)                          System dependent implementation
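
A rough sketch of the window semantics, assuming per-term position lists from the index (the exact window arithmetic in Indri may differ by one; this brute-force version is illustrative only):

```python
from itertools import product

def od_match(positions, n):
    """#odN sketch: terms in the given order, each at most n words
    after the previous one. positions[i] lists the i-th term's positions."""
    return any(all(0 < b - a <= n for a, b in zip(c, c[1:]))
               for c in product(*positions))

def uw_match(positions, n):
    """#uwN sketch: one occurrence of every term, in any order,
    inside a window of n words."""
    return any(max(c) - min(c) < n and len(set(c)) == len(c)
               for c in product(*positions))

# dog at positions [3, 40], cat at [6]:
print(od_match([[3, 40], [6]], 5))  # True - matches via (3, 6)
print(uw_match([[6], [3, 40]], 5))  # True - order does not matter
```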

Context Restriction

Example                Matches
dog.title              All occurrences of dog appearing in the title context
dog.title,paragraph    All occurrences of dog appearing in both a title and a paragraph context (may not be possible)
                       All occurrences of dog appearing in either a title context or a paragraph context
#5 (dog cat).head      All matching windows contained within a head context

Context Evaluation

Example                  Evaluated
dog.(title)              The term dog evaluated using the title context as the document
dog.(title, paragraph)   The term dog evaluated using the concatenation of the title and paragraph contexts as the document
dog.figure(paragraph)    The term dog restricted to figure tags within the paragraph context

Belief Operators

INQUERY        INDRI
#sum / #and    #combine
#wsum*         #weight
#or            #or
#not           #not
#max           #max

* #wsum is still available in INDRI, but should be used with discretion

Extent Retrieval

Example                                 Evaluated
#combine [section](dog canine)          Evaluates #combine (dog canine) for each extent associated with the section context
#combine [title, section](dog canine)   Same as previous, except evaluated for each extent associated with either the title context or the section context
#sum ( #sum [section](dog))             Returns a single score that is the #sum of the scores returned from #sum (dog) evaluated for each section extent
#max ( #sum [section](dog))             Same as previous, except returns the maximum score

Extent Retrieval Example
Example document with three section extents:
– Introduction: “Statistical language modeling allows formal methods to be applied to information retrieval. …”
– Multinomial Model: “Here we provide a quick review of multinomial language models. …”
– Multiple-Bernoulli Model: “We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. …”
Query: #combine[section]( dirichlet smoothing )

SCORE   DOCID   BEGIN   END
0.50    IR…     …       …
…       IR…     …       …
…       IR…     …       …

1. Treat each section extent as a “document”
2. Score each “document” according to #combine( … )
3. Return a ranked list of extents
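
The three steps map directly onto a few lines of code. A minimal sketch, assuming extents carry their text and a background model p_coll is available (names hypothetical; Indri computes this from its index, not from raw strings):

```python
def score_extents(extents, query, p_coll, mu=2500):
    """Sketch of #combine[section](...): treat each (docid, begin, end, text)
    extent as a document, score it as a geometric mean of Dirichlet-smoothed
    term probabilities, and return a ranked list. p_coll(t) = P(t|C)."""
    ranked = []
    for docid, begin, end, text in extents:
        words = text.lower().split()
        score = 1.0
        for t in query:
            p_t = (words.count(t) + mu * p_coll(t)) / (len(words) + mu)
            score *= p_t ** (1.0 / len(query))
        ranked.append((score, docid, begin, end))
    return sorted(ranked, reverse=True)

# e.g. ranked = score_extents(section_extents, ["dirichlet", "smoothing"],
#                             p_coll=lambda t: 1e-5)
```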

Indri Examples
“Where was George Washington born?”
#combine[sentence]( #1( george washington ) born #any:place )
Paragraphs from news feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom:
#filreq(
  #band( NewsFeed.doctype #date:between( ) )
  #combine[paragraph]( #any:person #any:money InfoCom ) )

Example Indri Web Query
#weight(
  0.1 #weight(
    1.0 #prior(pagerank)
    0.75 #prior(inlinks) )
  1.0 #weight(
    0.9 #combine(
      #wsum( 1 stellwagen.(inlink)
             1 stellwagen.(title)
             3 stellwagen.(mainbody)
             1 stellwagen.(heading) )
      #wsum( 1 bank.(inlink)
             1 bank.(title)
             3 bank.(mainbody)
             1 bank.(heading) ) )
    0.1 #combine(
      #wsum( 1 #uw8( stellwagen bank ).(inlink)
             1 #uw8( stellwagen bank ).(title)
             3 #uw8( stellwagen bank ).(mainbody)
             1 #uw8( stellwagen bank ).(heading) ) ) ) )

Examples of Using IR for Structured Data
XML search
Relevance models for incomplete data
Extracted entity retrieval

XML Search
The INEX workshop is similar to TREC but focused on XML documents
Queries contain varying degrees of structural specification
– not clear that these queries are realistic; an earlier study showed that people are not good at remembering structure
– document structure can provide valuable evidence for content representation

Example INEX Query (“NEXI”)
[example NEXI query not preserved in the transcript]

Hierarchical Language Models
Estimate a language model for each component of a document tree (Ogilvie 2004, 2005)
Smooth using a weighted mixture of a background model, a document model, a parent model, and a mixture of the children models (see the sketch after the diagram below)

Hierarchical Language Models
[Tree diagram: document node with model P(w|θ_doc); children include title P(w|θ_title), body P(w|θ_body), and bibliography P(w|θ_bib); the body contains section 1 … section n, and each section has a section title and paragraph 1 … paragraph n]
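
A minimal sketch of that mixture, assuming simple MLE component models and illustrative fixed weights (the cited work estimates these more carefully):

```python
from collections import Counter

class Node:
    """One component of the document tree (title, body, section, ...)."""
    def __init__(self, text="", parent=None):
        self.counts = Counter(text.lower().split())
        self.parent, self.children = parent, []
        if parent:
            parent.children.append(self)

    def mle(self, w):
        n = sum(self.counts.values())
        return self.counts[w] / n if n else 0.0

def p_w_node(w, node, doc, p_bg, lam=(0.3, 0.2, 0.2, 0.2, 0.1)):
    """P(w|theta_node) as a weighted mixture of the node's own estimate,
    a background model, the document model, the parent model, and the
    average of the children models. Weights are illustrative, not tuned."""
    l_self, l_bg, l_doc, l_par, l_kids = lam
    parent_p = node.parent.mle(w) if node.parent else doc.mle(w)
    kids_p = (sum(c.mle(w) for c in node.children) / len(node.children)
              if node.children else 0.0)
    return (l_self * node.mle(w) + l_bg * p_bg(w) + l_doc * doc.mle(w)
            + l_par * parent_p + l_kids * kids_p)

doc = Node("statistical language modeling for retrieval")
sec = Node("dirichlet smoothing of language models", parent=doc)
print(p_w_node("smoothing", sec, doc, p_bg=lambda t: 1e-4))
```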

Does it work?
[Result charts from (Ogilvie, 2003) not preserved in the transcript]

Indri INEX extensions
Indri incorporates hierarchical language models
Allows weights to be set for different language models and component types
Query language extended to reference parent and child extents
– use the .\field operator to access a child reference
– use the ./field operator to access a parent reference
– use the .//field operator to access an ancestor reference
– e.g. #combine[section]( bootstrap #combine[./title]( methodology ) )

Relevance Models for Incomplete Data
Relevance models (Lavrenko, 2001) are used for query expansion in IR, based on generative LMs
Estimate dependencies between words based on a training set or initial ranking
Recently extended to semi-structured data for applications where records are missing data (Lavrenko, Yi, Allan, 2006)
– e.g. NSDL collection with fields title, description, subject, content, audience
– 24% of 650,000 records have no subject field, 30% no author, 96% no audience

Relevance Models for Incomplete Data
Basic process is to estimate relevance models for each field based on training data for a query, then rank test records based on comparison to the relevance models
A relevance model estimates how likely it is that a word occurs in a field of a record, given that a record matches the specified query fields
Ranking is done using a weighted cross-entropy (see the sketch below)
– weights reflect the importance of each field
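
A hedged sketch of that ranking computation (the dict representation, field weights, and smoothing floor are all assumptions, not the cited paper’s exact formulation):

```python
import math

def weighted_cross_entropy(record_lms, rel_models, field_weights, eps=1e-9):
    """Score a record: sum over fields of
    w_f * sum_w P(w|R_f) * log P(w|record_f).
    Models are plain term->probability dicts; eps floors unseen terms
    (a real system would smooth properly). Higher is better."""
    score = 0.0
    for field, rel_lm in rel_models.items():
        rec_lm = record_lms.get(field, {})
        score += field_weights.get(field, 1.0) * sum(
            p * math.log(rec_lm.get(w, eps)) for w, p in rel_lm.items())
    return score

# rank test records against the per-field relevance models for a query:
# ranked = sorted(records, reverse=True,
#                 key=lambda r: weighted_cross_entropy(r, rel_models, weights))
```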

Relevance Models for Incomplete Data
In the NSDL experiment, 127 queries of the form {subject=‘philosophy’ AND audience=‘high school’}
In the test collection, all records had subject and audience field values removed
Retrieved records had precision of 30% in the top 10, compared to 15% for a baseline that ranked text records containing all fields
Shows the potential of probabilistic models for this type of application
– can also generate structured queries (Calado et al, CIKM 02)

Extracted Entity Retrieval
Information extraction extracts structure from text
– e.g. names, addresses, email addresses, CVs, publications, tables
Creates semi-structured (and noisy) data rather than databases
– table extraction can be the basis for question answering (Wei, Croft and McCallum, 2006)
– publication extraction is the basis of CITESEER-like systems (e.g. REXA, McCallum, 2005)
– person extraction can be the basis for “expert finding”

Expert Finding
Evaluated in the TREC Enterprise Track
People are represented by text that co-occurs with names
– which names? what text?
People are ranked for a query using the text “profile”
Relevance model approach is effective
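
One way to read the “profile” idea as code, a rough sketch only (the window size, single-token names, and query-likelihood scoring are assumptions; the slide notes a relevance model approach works well):

```python
import math
from collections import Counter, defaultdict

def build_profiles(docs, names, window=20):
    """Pool the words around each name mention into that person's
    text 'profile' (a pseudo-document of co-occurring text)."""
    profiles = defaultdict(Counter)
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if w in names:
                profiles[w].update(words[max(0, i - window): i + window + 1])
    return profiles

def rank_experts(profiles, query, p_bg, mu=1000):
    """Rank people by Dirichlet-smoothed query likelihood against profiles."""
    def score(prof):
        n = sum(prof.values())
        return math.prod((prof[t] + mu * p_bg(t)) / (n + mu) for t in query)
    return sorted(((score(p), name) for name, p in profiles.items()),
                  reverse=True)
```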

Conclusion
For many applications involving retrieval of semi-structured data, the right approach is an IR system based on a probabilistic retrieval model as the front-end, and a database system as the back-end
– but the IR system is not implemented using the database system
“Right” means gives effective results and supports users’ world view
IR systems based on language models (e.g. Indri) are good candidates