Web-Mining Agents Probabilistic Information Retrieval Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)

Slides:



Advertisements
Similar presentations
Information Retrieval and Organisation Chapter 11 Probabilistic Information Retrieval Dell Zhang Birkbeck, University of London.
Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Probabilistic Information Retrieval Part I: Survey Alexander Dekhtyar department of Computer Science University of Maryland.
INSTRUCTOR: DR.NICK EVANGELOPOULOS PRESENTED BY: QIUXIA WU CHAPTER 2 Information retrieval DSCI 5240.
The Probabilistic Model. Probabilistic Model n Objective: to capture the IR problem using a probabilistic framework; n Given a user query, there is an.
Probabilistic Information Retrieval Chris Manning, Pandu Nayak and
Probabilistic Information Retrieval
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Introduction to Information Retrieval Information Retrieval and Data Mining (AT71.07) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor:
CpSc 881: Information Retrieval
Lecture 11: Probabilistic Information Retrieval
Probabilistic Ranking Principle
Information Retrieval Models: Probabilistic Models
IR Models: Overview, Boolean, and Vector
Chapter 7 Retrieval Models.
Hinrich Schütze and Christina Lioma
ISP 433/533 Week 2 IR Models.
Information Retrieval Modeling CS 652 Information Extraction and Integration.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Probabilistic IR Models Based on probability theory Basic idea : Given a document d and a query q, Estimate the likelihood of d being relevant for the.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Probabilistic Information Retrieval Part II: In Depth Alexander Dekhtyar Department of Computer Science University of Maryland.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Chapter 2Modeling 資工 4B 陳建勳. Introduction.  Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
Modeling Modern Information Retrieval
Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.
Vector Space Model CS 652 Information Extraction and Integration.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
CS276A Text Retrieval and Mining Lecture 10. Recap of the last lecture Improving search results Especially for high recall. E.g., searching for aircraft.
IR Models: Review Vector Model and Probabilistic.
Other IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models boolean vector.
Probabilistic Models in IR Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Using majority of the slides from.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
IR Theory: Relevance Feedback. Relevance Feedback: Example  Initial Results Search Engine2.
Probabilistic Ranking Principle Hongning Wang
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.
C.Watterscsci64031 Probabilistic Retrieval Model.
Set Theoretic Models 1. IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 6 22 Oct 2002.
Introduction n IR systems usually adopt index terms to process queries n Index term: u a keyword or group of selected words u any word (more general) n.
Probabilistic Model n Objective: to capture the IR problem using a probabilistic framework n Given a user query, there is an ideal answer set n Querying.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
1 Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof.
Probabilistic Information Retrieval
CS276 Information Retrieval and Web Search
Probabilistic Retrieval Models
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Web-Mining Agents Probabilistic Information Retrieval
Information Retrieval Models: Probabilistic Models
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
John Lafferty, Chengxiang Zhai School of Computer Science
Text Retrieval and Mining
Recuperação de Informação B
CS 430: Information Discovery
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Recuperação de Informação B
Recuperação de Informação B
Recuperação de Informação B
Information Retrieval and Web Design
Advanced information retrieval
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Presentation transcript:

Web-Mining Agents Probabilistic Information Retrieval Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)

Acknowledgements Slides taken from: –Introduction to Information Retrieval Christopher Manning and Prabhakar Raghavan 2

Query How exact is the representation of the document ? How exact is the representation of the query ? How well is query matched to data? How relevant is the result to the query ? Document collection Document Representation Query representation Query Answer TYPICAL IR PROBLEM 3

Why probabilities in IR? User Information Need Documents Document Representation Query Representation Query Representation How to match? In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties? Uncertain guess of whether document has relevant content Understanding of user need is uncertain 4

Probabilistic Approaches to IR Probability Ranking Principle (Robertson, 70ies; Maron, Kuhns, 1959) Information Retrieval as Probabilistic Inference (van Rijsbergen & co, since 70ies) Probabilistic Indexing (Fuhr & Co., late 80ies-90ies ) Bayesian Nets in IR (Turtle, Croft, 90ies) Success : varied 5

Probability Ranking Principle Collection of Documents User issues a query A set of documents needs to be returned Question: In what order to present documents to user ?Question: In what order to present documents to user ? 6

Probability Ranking Principle Question: In what order to present documents to user ?Question: In what order to present documents to user ? Intuitively, want the “best” document to be first, second best - second, etc… Need a formal way to judge the “goodness” of documents w.r.t. queries. Idea: Probability of relevance of the document w.r.t. queryIdea: Probability of relevance of the document w.r.t. query 7

Let us recap probability theory Bayesian probability formulas Odds: 8

Odds vs. Probabilities 9

Probability Ranking Principle Let x be a document in the retrieved collection. Let R represent relevance of a document w.r.t. given (fixed) query and let NR represent non-relevance. p(x|R), p(x|NR) - probability that if a relevant (non-relevant) document is retrieved, it is x. Need to find p(R|x) - probability that a retrieved document x is relevant. p(R),p(NR) - prior probability of retrieving a relevant or non- relevant document, respectively 10

Probability Ranking Principle Ranking Principle (Bayes’ Decision Rule): If p(R|x) > p(NR|x) then x is relevant, otherwise x is not relevant Note: 11

Probability Ranking Principle Claim: PRP minimizes the average probability of error If we decide NR If we decide R p(error) is minimal when all p(error|x) are minimimal. Bayes’ decision rule minimizes each p(error|x). 12

Probability Ranking Principle More complex case: retrieval costs. –C - cost of retrieval of relevant document –C’ - cost of retrieval of non-relevant document –let d, be a document Probability Ranking Principle: if for all d’ not yet retrieved, then d is the next document to be retrieved 13

PRP: Issues (Problems?) How do we compute all those probabilities? –Cannot compute exact probabilities, have to use estimates. –Binary Independence Retrieval (BIR) See below Restrictive assumptions –“Relevance” of each document is independent of relevance of other documents. –Most applications are for Boolean model. 14

Bayesian Nets in IR Bayesian Nets is the most popular way of doing probabilistic inference. What is a Bayesian Net ? How to use Bayesian Nets in IR? J. Pearl, “Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference”, Morgan-Kaufman,

Bayesian Nets for IR: Idea Document Network Query Network Large, but Compute once for each document collection Small, compute once for every query d1 dn d2 t1t2 tn r1r2 r3 rk di -documents ti - document representations ri - “concepts” tn’ I q2 q1 cm c2c1 ci - query concepts qi - high-level concepts I - goal node 16

Example: “reason trouble –two” Hamlet Macbeth reasondouble reasontwo ORNOT User query trouble Document Network Query Network 17

Bayesian Nets for IR: Roadmap Construct Document Network (once !) For each query –Construct best Query Network –Attach it to Document Network –Find subset of di’s which maximizes the probability value of node I (best subset). –Retrieve these di’s as the answer to query. 18

Bayesian Nets in IR: Pros / Cons More of a cookbook solution Flexible:create-your- own Document (Query) Networks Relatively easy to update Generalizes other Probabilistic approaches –PRP –Probabilistic Indexing Best-Subset computation is NP-hard –have to use quick approximations –approximated Best Subsets may not contain best documents Where do we get the numbers ? Pros Cons 19

Relevance models Given: PRP Goal: Estimate probability P(R|q,d) Binary Independence Retrieval (BIR): –Manyone –Many documents D - one query q –Estimate P(R|q,d) by considering whether d in D is relevant for q Binary Independence Indexing (BII): –Onemany –One document d - many queries Q –Estimate P(R|q,d) by considering whether a document d is relevant for a query q in Q 20

Binary Independence Retrieval Traditionally used in conjunction with PRP “Binary” = Boolean: documents are represented as binary vectors of terms: – – iff term i is present in document x. “Independence”: terms occur in documents independently Different documents can be modeled as same vector. 21

Binary Independence Retrieval Queries: binary vectors of terms Given query q, –for each document d need to compute p(R|q,d). –replace with computing p(R|q,x) where x is vector representing d Interested only in ranking Will use odds: 22

Binary Independence Retrieval Using Independence Assumption: Constant for each query Needs estimation So : 23

Binary Independence Retrieval Since x i is either 0 or 1: Let Assume, for all terms not occuring in the query (q i =0) Then... 24

All matching terms Non-matching query terms Binary Independence Retrieval All matching terms All query terms 25

Binary Independence Retrieval Constant for each query Only quantity to be estimated for rankings Retrieval Status Value: 26

Binary Independence Retrieval All boils down to computing RSV. So, how do we compute c i ’s from our data ? 27

Binary Independence Retrieval Estimating RSV coefficients. For each term i look at the following table: Estimates: 28

Binary Independence Indexing “Learning” from queries –More queries: better results p(q|x,R) - probability that if document x had been deemed relevant, query q had been asked The rest of the framework is similar to BIR 29

Binary Independence Indexing vs. Binary Independence Retrieval BIRBIR BIIBII Many Documents, One Query Bayesian Probability: Varies: document representation Constant: query (representation) One Document, Many Queries Bayesian Probability Varies: query Constant: document 30

Estimation – key challenge If non-relevant documents are approximated by the whole collection, then r i (prob. of occurrence in non- relevant documents for query) is n/N and –log (1– r i )/r i = log (N– n)/n ≈ log N/n = IDF! p i (probability of occurrence in relevant documents) can be estimated in various ways: –from relevant documents if know some Relevance weighting can be used in feedback loop –constant (Croft and Harper combination match) – then just get idf weighting of terms –proportional to prob. of occurrence in collection more accurately, to log of this (Greiff, SIGIR 1998) We have a nice theoretical foundation of wTD.IDF 31

Iteratively estimating p i  Assume that p i constant over all x i in query –p i = 0.5 (even odds) for any given doc  Determine guess of relevant document set: –V is fixed size set of highest ranked documents on this model (note: now a bit like tf.idf!)  We need to improve our guesses for p i and r i, so –Use distribution of x i in docs in V. Let V i be set of documents containing x i p i = |V i | / |V| –Assume if not retrieved then not relevant r i = (n i – |V i |) / (N – |V|)  Go to 2. until converges then return ranking 32

Probabilistic Relevance Feedback  Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above.  Interact with the user to refine the description: learn some definite members of R and NR  Reestimate p i and r i on the basis of these –Or can combine new information with original guess (use Bayesian prior):  Repeat, thus generating a succession of approximations to R. κ is prior weight 33

PRP and BIR: The lessons Getting reasonable approximations of probabilities is possible. Simple methods work only with restrictive assumptions: –term independence –terms not in query do not affect the outcome –boolean representation of documents/queries –document relevance values are independent Some of these assumptions can be removed 34

Food for thought Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback 35

Good and Bad News Standard Vector Space Model –Empirical for the most part; success measured by results –Few properties provable Probabilistic Model Advantages –Based on a firm theoretical foundation –Theoretically justified optimal ranking scheme Disadvantages –Binary word-in-doc weights (not using term frequencies) –Independence of terms (can be alleviated) –Amount of computation –Has never worked convincingly better in practice 36