Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 15-2: Learning to Rank

Overview ❶ Learning Boolean Weights ❷ Learning Real-Valued Weights ❸ Rank Learning as Ordinal Regression

Outline ❶ Learning Boolean Weights ❷ Learning Real-Valued Weights ❸ Rank Learning as Ordinal Regression

Main Idea
The aim of term weights (e.g. tf-idf) is to measure term salience. Summing the term weights for a document gives a measure of the document's relevance to a query, and hence a basis for ranking it. Think of this as a text classification problem: term weights can be learned from training examples that have been judged for relevance. This methodology falls under a general class of approaches known as machine-learned relevance or learning to rank.

Learning weights: main methodology
Given a set of training examples, each of which is a tuple of a query q, a document d, and a relevance judgment for d on q. In the simplest case, R(d, q) is either relevant (1) or nonrelevant (0); more sophisticated cases use graded relevance judgments. Learn weights from these examples, so that the learned scores approximate the relevance judgments in the training examples. Example: weighted zone scoring.

What is weighted zone scoring?
Given a query and a collection where documents have three zones (a.k.a. fields): author, title, body. Weighted zone scoring requires a separate weight for each zone, e.g. g_1, g_2, g_3. Not all zones are equally important, e.g. author < title < body → g_1 = 0.2, g_2 = 0.3, g_3 = 0.5 (so that they add up to 1). The score for a zone is 1 if the query term occurs in that zone, 0 otherwise (Boolean).
Example: the query term appears in title and body only, so the document score is (0.3 · 1) + (0.5 · 1) = 0.8.

Weighted zone scoring: let's generalise
Given q and d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1] by computing a linear combination of document zone scores, where each zone contributes a value. Consider a set of documents, each of which has l zones. Let g_1, ..., g_l ∈ [0, 1] such that Σ_{i=1}^{l} g_i = 1. For 1 ≤ i ≤ l, let s_i be the Boolean score denoting a match (or non-match) between q and the i-th zone; e.g. s_i could be any Boolean function that maps the presence of query terms in a zone to 0 or 1. The weighted zone score (a.k.a. ranked Boolean retrieval) is then score(d, q) = Σ_{i=1}^{l} g_i · s_i.
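To make the ranked Boolean formula concrete, here is a minimal Python sketch (not part of the lecture); the zone names and the weights 0.2/0.3/0.5 follow the example on the previous slide, and the document text is made up.

```python
def zone_match(query_terms, zone_text):
    """Boolean zone score s_i: 1 if any query term occurs in the zone, else 0."""
    zone_tokens = set(zone_text.lower().split())
    return 1 if any(t.lower() in zone_tokens for t in query_terms) else 0

def weighted_zone_score(query_terms, zones, g):
    """score(d, q) = sum_i g_i * s_i, where the weights g_i sum to 1."""
    assert abs(sum(g.values()) - 1.0) < 1e-9
    return sum(g[zone] * zone_match(query_terms, text) for zone, text in zones.items())

# Example from the previous slide: g = (0.2, 0.3, 0.5), query term in title and body only.
doc = {"author": "J. Smith",
       "title": "learning to rank",
       "body": "methods that learn to rank documents from judged examples"}
g = {"author": 0.2, "title": 0.3, "body": 0.5}
print(round(weighted_zone_score(["rank"], doc, g), 3))  # 0.3*1 + 0.5*1 = 0.8
```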

Weighted zone scoring and learning weights
Weighted zone scoring may be viewed as learning a linear function of the Boolean match scores contributed by the various zones. Bad news: the labour-intensive assembly of user-generated relevance judgments from which to learn the weights, especially in a dynamic collection (such as the Web). Good news: we can reduce the problem of learning the weights g_i to a simple optimisation problem.

Learning weights in weighted zone scoring
Simple case: let documents have two zones, title and body. The weighted zone scoring formula we saw before: score(d, q) = Σ_i g_i · s_i (2). Given q and d, we compute the Boolean match scores s_T(d, q) and s_B(d, q), depending on whether the title or body zone of d matches query q. Using a constant g ∈ [0, 1], we then compute a score between 0 and 1 for each (d, q) pair:
score(d, q) = g · s_T(d, q) + (1 − g) · s_B(d, q)   (3)

Learning weights: determine g from training examples
Training examples are triples of the form Φ_j = (d_j, q_j, r(d_j, q_j)). A given training document d_j and a given training query q_j are assessed by a human who decides r(d_j, q_j) (either relevant or nonrelevant).

Learning weights: determine g from training examples
For each training example Φ_j we have Boolean values s_T(d_j, q_j) and s_B(d_j, q_j) that we use to compute a score:
score(d_j, q_j) = g · s_T(d_j, q_j) + (1 − g) · s_B(d_j, q_j)   (4)

Learning weights
We compare this computed score score(d_j, q_j) with the human relevance judgment for the same document-query pair (d_j, q_j). We quantise each relevant judgment as 1 and each nonrelevant judgment as 0. We define the error of the scoring function with weight g as
ε(g, Φ_j) = (r(d_j, q_j) − score(d_j, q_j))^2   (5)
The total error over a set of training examples is then
Σ_j ε(g, Φ_j)   (6)
The problem of learning the constant g from the given training examples thus reduces to picking the value of g that minimises the total error.

Exercise: find the value of g that minimises the total error ε
❶ Quantise: relevant as 1, nonrelevant as 0.
❷ Compute the scores: score(d_j, q_j) = g · s_T(d_j, q_j) + (1 − g) · s_B(d_j, q_j).
❸ Compute the total error: Σ_j ε(g, Φ_j), where ε(g, Φ_j) = (r(d_j, q_j) − score(d_j, q_j))^2.
❹ Pick the value of g that minimises the total error.

Exercise solution
❶ Compute the scores score(d_j, q_j):
score(d_1, q_1) = g · 1 + (1 − g) · 1 = g + 1 − g = 1
score(d_2, q_2) = g · 0 + (1 − g) · 1 = 1 − g
score(d_3, q_3) = g · 0 + (1 − g) · 1 = 1 − g
score(d_4, q_4) = g · 0 + (1 − g) · 0 = 0
score(d_5, q_5) = g · 1 + (1 − g) · 1 = g + 1 − g = 1
score(d_6, q_6) = g · 0 + (1 − g) · 1 = 1 − g
score(d_7, q_7) = g · 1 + (1 − g) · 0 = g
❷ Compute the total error Σ_j ε(g, Φ_j):
(1 − 1)^2 + (0 − (1 − g))^2 + (1 − (1 − g))^2 + (0 − 0)^2 + (1 − 1)^2 + (1 − (1 − g))^2 + (0 − g)^2
❸ Pick the value of g that minimises the total error: range g over [0, 1] and pick the value that minimises the error.
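A minimal sketch of this last step in Python (not part of the lecture): the seven (s_T, s_B, r) triples are read off the computed scores and error terms above, and the grid step of 0.01 is an arbitrary choice.

```python
# (s_T, s_B, r) triples read off the computed scores and error terms above.
examples = [
    (1, 1, 1),  # score = 1,     r = 1
    (0, 1, 0),  # score = 1 - g, r = 0
    (0, 1, 1),  # score = 1 - g, r = 1
    (0, 0, 0),  # score = 0,     r = 0
    (1, 1, 1),  # score = 1,     r = 1
    (0, 1, 1),  # score = 1 - g, r = 1
    (1, 0, 0),  # score = g,     r = 0
]

def total_error(g):
    """Sum of squared errors (r - score)^2 over all training examples."""
    return sum((r - (g * s_t + (1 - g) * s_b)) ** 2 for s_t, s_b, r in examples)

# Range g over [0, 1] in steps of 0.01 and keep the value with the smallest error.
best_g = min((i / 100 for i in range(101)), key=total_error)
print(best_g, total_error(best_g))  # 0.25 0.75
```

Here the total error simplifies to (g − 1)^2 + 3g^2, which is minimised at g = 1/4.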

Outline ❶ Learning Boolean Weights ❷ Learning Real-Valued Weights ❸ Rank Learning as Ordinal Regression

A simple example of machine-learned scoring
So far, we considered a case where we had to combine Boolean indicators of relevance. Now consider more general factors that go beyond Boolean functions of query term presence in document zones.

A simple example of machine-learned scoring
Setting: the scoring function is a linear combination of two factors:
❶ the vector space cosine similarity between query and document (denoted α)
❷ the minimum window width within which the query terms lie (denoted ω)
Query term proximity is often very indicative of topical relevance and gives an implementation of implicit phrases. Thus we have one factor that depends on the statistics of query terms in the document as a bag of words, and another that depends on proximity weighting.

A simple example of machine-learned scoring
Given a set of training examples, each with a judgment r(d_j, q_j), we compute for each example the vector space cosine similarity α and the window width ω. The result is a training set with two real-valued features (α, ω).

The same thing seen graphically on a 2-D plane: the training examples plotted as points in the α-ω plane, marked R (relevant) and N (nonrelevant).

A simple example of machine-learned scoring
Again, let's say relevant = 1 and nonrelevant = 0. We now seek a scoring function that combines the values of the features to generate a value that is (close to) 0 or 1, and we wish this function to agree with our set of training examples as much as possible. Without loss of generality, a linear classifier will use a linear combination of features of the form
Score(d, q) = Score(α, ω) = aα + bω + c,   (7)
with the coefficients a, b, c to be learned from the training data.

A simple example of machine-learned scoring
The function Score(α, ω) represents a plane "hanging above" the figure. Ideally this plane assumes values close to 1 above the points marked R, and values close to 0 above the points marked N.

A simple example of machine-learned scoring
We use thresholding: for a (query, document) pair, we pick a threshold value θ; if Score(α, ω) > θ, we declare the document relevant, otherwise nonrelevant. As we know from SVMs, all points that satisfy Score(α, ω) = θ form a line (shown dashed in the figure): a linear classifier that separates relevant from nonrelevant instances.

A simple example of machine-learned scoring
Thus, the problem of making a binary relevant/nonrelevant judgment given training examples turns into one of learning the dashed line in the figure that separates relevant from nonrelevant training examples. In the α-ω plane, this line can be written as a linear equation involving α and ω, with two parameters (slope and intercept). We have already seen linear classification methods for choosing this line. Provided we can build a sufficiently rich collection of training samples, we can thus avoid hand-tuning scoring functions altogether. The bottleneck is maintaining a suitably representative set of training examples, whose relevance assessments must be made by experts.
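As an illustration of learning such a line, here is a minimal sketch (not from the lecture) that fits the coefficients a, b, c of equation (7) by least squares on a small made-up set of (α, ω, r) triples and then thresholds the score at θ = 0.5; a real system would use one of the linear classifiers discussed earlier and far more data.

```python
import numpy as np

# Illustrative training data (made up): cosine similarity alpha, window width omega, judgment r.
train = np.array([
    # alpha, omega, r
    [0.9,  3.0, 1.0],
    [0.7,  4.0, 1.0],
    [0.4, 12.0, 0.0],
    [0.2, 25.0, 0.0],
    [0.8,  2.0, 1.0],
    [0.3, 15.0, 0.0],
])
alpha, omega, r = train[:, 0], train[:, 1], train[:, 2]

# Least-squares fit of Score(alpha, omega) = a*alpha + b*omega + c, i.e. equation (7).
A = np.column_stack([alpha, omega, np.ones_like(alpha)])
(a, b, c), *_ = np.linalg.lstsq(A, r, rcond=None)

theta = 0.5  # decision threshold
scores = a * alpha + b * omega + c
print((scores > theta).astype(int))  # [1 1 0 0 1 0], matching r for this toy data
```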

Result ranking by machine learning
The above ideas can be readily generalized to functions of many more than two variables. In addition to cosine similarity and query term window, there are lots of other indicators of relevance, e.g. PageRank-style measures, document age, zone contributions, document length, etc. If these measures can be calculated for a training document collection with relevance judgments, any number of such measures can be used to train a machine learning classifier.

Features released by Microsoft Research on 16 June
Zones: body, anchor, title, url, whole document.
Features: query term number, query term ratio, stream length, idf, sum of term frequency, min of term frequency, max of term frequency, mean of term frequency, variance of term frequency, sum of stream length normalized term frequency, min of stream length normalized term frequency, max of stream length normalized term frequency, mean of stream length normalized term frequency, variance of stream length normalized term frequency, sum of tf*idf, min of tf*idf, max of tf*idf, mean of tf*idf, variance of tf*idf, boolean model, vector space model, BM25, LMIR.ABS, LMIR.DIR, LMIR.JM, number of slashes in url, length of url, inlink number, outlink number, PageRank, SiteRank, QualityScore, QualityScore2, query-url click count, url click count, url dwell time.
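Datasets with feature sets like this (e.g. the MSLR-WEB collections) are commonly distributed in an SVMlight-style text format, one judged query-document pair per line: a relevance label, a qid field, then numbered feature:value pairs. A minimal parsing sketch under that assumption (the example line is made up):

```python
def parse_letor_line(line):
    """Parse one 'label qid:Q 1:v1 2:v2 ...' line into (label, qid, feature dict)."""
    parts = line.split('#', 1)[0].split()   # drop any trailing comment
    label = int(parts[0])
    qid = parts[1].split(':', 1)[1]
    features = {int(k): float(v) for k, v in (p.split(':', 1) for p in parts[2:])}
    return label, qid, features

label, qid, feats = parse_letor_line("2 qid:10 1:0.03 2:0.12 3:0.98")
print(label, qid, feats)  # 2 10 {1: 0.03, 2: 0.12, 3: 0.98}
```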

Result ranking by machine learning
However, approaching IR ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) and regression problems (where a real number is predicted). In between is the specialised field of ordinal regression, where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort.

Outline ❶ Learning Boolean Weights ❷ Learning Real-Valued Weights ❸ Rank Learning as Ordinal Regression

IR ranking as ordinal regression
Why formulate IR ranking as an ordinal regression problem? Because documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness; hence the problem is easier, since only a ranking is required rather than an absolute measure of relevance. This is especially germane in web search, where the ranking at the very top of the results list is exceedingly important.
Structural SVM for IR ranking: such work has been pursued using the structural SVM framework, where the class being predicted is a ranking of results for a query.

The construction of a ranking SVM
We begin with a set of judged queries. For each training query q, we have a set of documents returned in response to the query, which have been totally ordered by a person for relevance to the query. We construct a vector of features ψ_j = ψ(d_j, q) for each document/query pair, using features such as those discussed above, and many more. For two documents d_i and d_j, we then form the vector of feature differences:
Φ(d_i, d_j, q) = ψ(d_i, q) − ψ(d_j, q)   (8)

The construction of a ranking SVM
By hypothesis, one of d_i and d_j has been judged more relevant. If d_i is judged more relevant than d_j, denoted d_i ≺ d_j (d_i should precede d_j in the results ordering), then we assign the vector Φ(d_i, d_j, q) the class y_ijq = +1; otherwise −1. The goal then is to build a classifier which will return
w^T Φ(d_i, d_j, q) > 0 iff d_i ≺ d_j   (9)
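A minimal sketch of this pairwise construction (not from the lecture), assuming per-document feature vectors ψ(d, q) and graded judgments are already in hand: it forms the difference vectors Φ of equation (8), labels them ±1 as in (9), and trains a linear SVM. Scikit-learn's LinearSVC stands in here for a dedicated ranking-SVM solver, and the feature values are made up.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_examples(psi, relevance):
    """Turn feature vectors and judgments for ONE query into difference vectors
    Phi(d_i, d_j, q) = psi_i - psi_j with labels +1 / -1."""
    X, y = [], []
    for i, j in combinations(range(len(psi)), 2):
        if relevance[i] == relevance[j]:
            continue                      # no preference between equally judged documents
        diff = psi[i] - psi[j]
        label = 1 if relevance[i] > relevance[j] else -1
        X.append(diff);  y.append(label)
        X.append(-diff); y.append(-label)  # add the mirrored pair for balance
    return np.array(X), np.array(y)

# Illustrative data for one query: rows are psi(d_j, q), e.g. (cosine, BM25, PageRank).
psi = np.array([[0.9, 12.0, 0.4],
                [0.5,  7.0, 0.2],
                [0.1,  2.0, 0.1]])
relevance = [2, 1, 0]                     # graded judgments: d1 > d2 > d3

X, y = pairwise_examples(psi, relevance)
w = LinearSVC(C=1.0).fit(X, y).coef_[0]
# At ranking time, documents for a query are simply sorted by w^T psi(d, q).
print(np.argsort(-psi @ w))               # expected: [0 1 2], most relevant first
```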

Ranking SVM
This approach has been used to build ranking functions which outperform standard hand-built ranking functions in IR evaluations on standard data sets. See the references for papers that present such results (page 316).

Note: linear vs. nonlinear weighting
Both of the methods we have seen use a linear weighting of document features that are indicators of relevance, as has most work in this area. Much of traditional IR weighting involves nonlinear scaling of basic measurements (such as log-weighting of term frequency, or idf). At present, machine learning is very good at producing optimal weights for features in a linear combination, but it is not good at coming up with good nonlinear scalings of basic measurements. This area remains the domain of human feature engineering.
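For instance, the sublinear tf and idf scalings mentioned here are fixed nonlinear transforms chosen by hand; a learned linear model can weight such features but does not discover the log scaling itself. A small sketch of the standard formulas (base-10 logs, following the usual textbook convention; not specific to this lecture):

```python
import math

def wf(tf):
    """Sublinear tf scaling: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

# Ten occurrences count only twice as much as one occurrence, not ten times as much.
print(wf(1), wf(10))                 # 1.0  2.0
print(wf(10) * idf(100, 1_000_000))  # 2.0 * 4.0 = 8.0
```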

Recap
The idea of learning ranking functions has been around for a number of years, but it is only very recently that sufficient machine learning knowledge, training document collections, and computational power have come together to make this method practical and exciting. While skilled humans can do a very good job of defining ranking functions by hand, hand tuning is difficult, and it has to be done again for each new document collection and class of users.