Information Retrieval as Structured Prediction University of Massachusetts Amherst Machine Learning Seminar April 29 th, 2009 Yisong Yue Cornell University.

Slides:

Advertisements

Similar presentations

A Support Vector Method for Optimizing Average Precision

Advertisements

Pattern Recognition and Machine Learning

3.6 Support Vector Machines

Online Max-Margin Weight Learning with Markov Logic Networks Tuyen N. Huynh and Raymond J. Mooney Machine Learning Group Department of Computer Science.

Dynamic Programming 25-Mar-17.

Diversified Retrieval as Structured Prediction

ICML 2009 Yisong Yue Thorsten Joachims Cornell University

Structured Prediction and Active Learning for Information Retrieval

Cognitive Radio Communications and Networks: Principles and Practice By A. M. Wyglinski, M. Nekovee, Y. T. Hou (Elsevier, December 2009) 1 Chapter 12 Cross-Layer.

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

1 LP, extended maxflow, TRW OR: How to understand Vladimirs most recent work Ramin Zabih Cornell University.

ACM SIGIR 2009 Workshop on Redundancy, Diversity, and Interdependent Document Relevance, July 23, 2009, Boston, MA 1 Modeling Diversity in Information.

Evaluating the Robustness of Learning from Implicit Feedback Filip Radlinski Thorsten Joachims Presentation by Dinesh Bhirud

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

CHAPTER 13: Alpaydin: Kernel Machines

Generative Models Thus far we have essentially considered techniques that perform classification indirectly by modeling the training data, optimizing.

Text Categorization.

Abbas Edalat Imperial College London Contains joint work with Andre Lieutier (AL) and joint work with Marko Krznaric (MK) Data Types.

CSE3201/4500 Information Retrieval Systems

Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.

Minimum Vertex Cover in Rectangle Graphs

Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.

ECG Signal processing (2)

Application of Ensemble Models in Web Ranking

Diversified Retrieval as Structured Prediction Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09) SIGIR 2009 Workshop Yisong Yue Cornell.

Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?

An Introduction of Support Vector Machine

Linear Classifiers (perceptrons)

Learning for Structured Prediction Overview of the Material TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A A.

Machine Learning & Data Mining CS/CNS/EE 155 Lecture 8: Structural SVMs Part 2 & General Structured Prediction 1.

Structured SVM Chen-Tse Tsai and Siddharth Gupta.

An Introduction of Support Vector Machine

Machine learning continued Image source:

Machine Learning & Data Mining CS/CNS/EE 155 Lecture 2: Review Part 2.

An Introduction to Structural SVMs and its Application to Information Retrieval Yisong Yue Carnegie Mellon University.

Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.

Learning to Rank: New Techniques and Applications Martin Szummer Microsoft Research Cambridge, UK.

Support Vector Machines (SVMs) Chapter 5 (Duda et al.)

Prénom Nom Document Analysis: Linear Discrimination Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Reduced Support Vector Machine

Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Ensemble Learning (2), Tree and Forest

Online Search Evaluation with Interleaving Filip Radlinski Microsoft.

SVM by Sequential Minimal Optimization (SMO)

Introduction to variable selection I Qi Yu. 2 Problems due to poor variable selection: Input dimension is too large; the curse of dimensionality problem.

Jifeng Dai 2011/09/27.  Introduction  Structural SVM  Kernel Design  Segmentation and parameter learning  Object Feature Descriptors  Experimental.

Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.

Classification and Ranking Approaches to Discriminative Language Modeling for ASR Erinç Dikici, Murat Semerci, Murat Saraçlar, Ethem Alpaydın 報告者：郝柏翰 2013/01/28.

Karthik Raman, Pannaga Shivaswamy & Thorsten Joachims Cornell University 1.

Universit at Dortmund, LS VIII

Machine Learning in Ad-hoc IR. Machine Learning for ad hoc IR We’ve looked at methods for ranking documents in IR using factors like –Cosine similarity,

Optimizing Average Precision using Weakly Supervised Data Aseem Behl IIIT Hyderabad Under supervision of: Dr. M. Pawan Kumar (INRIA Paris), Prof. C.V.

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.

Dd Generalized Optimal Kernel-based Ensemble Learning for HS Classification Problems Generalized Optimal Kernel-based Ensemble Learning for HS Classification.

Support Vector Machines Tao Department of computer science University of Illinois.

Support Vector Machines. Notation Assume a binary classification problem. –Instances are represented by vector x   n. –Training examples: x = (x 1,

NTU & MSRA Ming-Feng Tsai

Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

Support Vector Machines Reading: Textbook, Chapter 5 Ben-Hur and Weston, A User’s Guide to Support Vector Machines (linked from class web page)

Non-separable SVM's, and non-linear classification using kernels Jakob Verbeek December 16, 2011 Course website:

Lecture 7: Constrained Conditional Models

Learning to Rank Shubhra kanti karmaker (Santu)

Structured Learning of Two-Level Dynamic Rankings

Jonathan Elsas LTI Student Research Symposium Sept. 14, 2007

Learning to Rank with Ties

Presentation transcript:

Information Retrieval as Structured Prediction University of Massachusetts Amherst Machine Learning Seminar April 29 th, 2009 Yisong Yue Cornell University Joint work with: Thorsten Joachims, Filip Radlinski, and Thomas Finley

Supervised Learning Find function from input space X to output space Y such that the prediction error is low. Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters… x y 1 GATACAACCTATCCCCGTATATATATTCTA TGGGTATAGTATTAAATCAATACAACCTAT CCCCGTATATATATTCTATGGGTATAGTAT TAAATCAATACAACCTATCCCCGTATATAT ATTCTATGGGTATAGTATTAAATCAGATAC AACCTATCCCCGTATATATATTCTATGGGT ATAGTATTAAATCACATTTA x y x y 7.3

Examples of Complex Output Spaces Natural Language Parsing –Given a sequence of words x, predict the parse tree y. –Dependencies from structural constraints, since y has to be a tree. The dog chased the cat x S VPNP DetNV NP DetN y

Part-of-Speech Tagging –Given a sequence of words x, predict sequence of tags y. –Dependencies from tag-tag transitions in Markov model. Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging. The rain wet the cat x DetNV N y Examples of Complex Output Spaces

Information Retrieval –Given a query x, predict a ranking y. –Dependencies between results (e.g. avoid redundant hits) –Loss function over rankings (e.g. Average Precision) SVM x 1.Kernel-Machines 2.SVM-Light 3.Learning with Kernels 4.SV Meppen Fan Club 5.Service Master & Co. 6.School of Volunteer Management 7.SV Mattersburg Online … y

Goals of this Talk Learning to Rank –Optimizing Average Precision –Diversified Retrieval Structured Prediction –Complex Retrieval Goals –Structural SVMs (Supervised Learning)

Learning to Rank At intersection of ML and IR First generate input features

Learning to Rank Design a retrieval function f(x) = w T x –(weighted average of features) For each query q –Score all s q,d = w T x q,d –Sort by s q,d to produce ranking Which weight vector w is best?

Conventional SVMs Input: x (high dimensional point) Target: y (either +1 or -1) Prediction: sign(w T x) Training: subject to: The sum of slacks smooth upper bound for 0/1 loss

2 pairwise disagreements Optimizing Pairwise Agreements

Pairwise Preferences SVM Such that: Large Margin Ordinal Regression [Herbrich et al., 1999] Can be reduced to time [Joachims, 2005] Pairs can be reweighted to more closely model IR goals [Cao et al., 2006]

Mean Average Precision Consider rank position of each relevant doc –K 1, K 2, … K R Compute for each K 1, K 2, … K R Average precision = average of Ex: has AvgPrec of MAP is Average Precision across multiple queries/rankings

Optimization Challenges Rank-based measures are multivariate –Cannot decompose (additively) into document pairs –Need to exploit other structure Defined over rankings –Rankings do not vary smoothly –Discontinuous w.r.t model parameters –Need some kind of relaxation/approximation

[Yue & Burges, 2007]

Structured Prediction Let x be a structured input (candidate documents) Let y be a structured output (ranking) Use a joint feature map to encode the compatibility of predicting y for given x. –Captures all the structure of the prediction problem Consider linear models: after learning w, we can make predictions via

Linear Discriminant for Ranking Let x = (x 1,…x n ) denote candidate documents (features) Let y jk = {+1, -1} encode pairwise rank orders Feature map is linear combination of documents. Prediction made by sorting on document scores w T x i

Structural SVM Let x denote a structured input (candidate documents) Let y denote a structured output (ranking) Standard objective function: Constraints are defined for each incorrect labeling y over the set of documents x. [Tsochantaridis et al., 2005]

Minimizing Hinge Loss Suppose for incorrect y: Then: [Tsochantaridis et al., 2005]

Structural SVM for MAP Minimize subject to where ( y jk = {-1, +1} ) and Sum of slacks is smooth upper bound on MAP loss. [Yue et al., SIGIR 2007]

Too Many Constraints! For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g., An incorrect labeling would be any other ranking, e.g., This ranking has Average Precision of about 0.8 with (y) ¼ 0.2 Intractable number of rankings, thus an intractable number of constraints!

Cutting Plane Method Original SVM Problem Exponential constraints Most are dominated by a small set of important constraints Structural SVM Approach Repeatedly finds the next most violated constraint… …until set of constraints is a good approximation. [Tsochantaridis et al., 2005]

Cutting Plane Method Original SVM Problem Exponential constraints Most are dominated by a small set of important constraints Structural SVM Approach Repeatedly finds the next most violated constraint… …until set of constraints is a good approximation. [Tsochantaridis et al., 2005]

Cutting Plane Method Original SVM Problem Exponential constraints Most are dominated by a small set of important constraints Structural SVM Approach Repeatedly finds the next most violated constraint… …until set of constraints is a good approximation. [Tsochantaridis et al., 2005]

Cutting Plane Method Original SVM Problem Exponential constraints Most are dominated by a small set of important constraints Structural SVM Approach Repeatedly finds the next most violated constraint… …until set of constraints is a good approximation. [Tsochantaridis et al., 2005]

Finding Most Violated Constraint A constraint is violated when Finding most violated constraint reduces to Highly related to inference/prediction:

Finding Most Violated Constraint Observations MAP is invariant on the order of documents within a relevance class –Swapping two relevant or non-relevant documents does not change MAP. Joint SVM score is optimized by sorting by document score, w T x j Reduces to finding an interleaving between two sorted lists of documents [Yue et al., SIGIR 2007]

Finding Most Violated Constraint Start with perfect ranking Consider swapping adjacent relevant/non-relevant documents [Yue et al., SIGIR 2007]

Finding Most Violated Constraint Start with perfect ranking Consider swapping adjacent relevant/non-relevant documents Find the best feasible ranking of the non-relevant document [Yue et al., SIGIR 2007]

Finding Most Violated Constraint Start with perfect ranking Consider swapping adjacent relevant/non-relevant documents Find the best feasible ranking of the non-relevant document Repeat for next non-relevant document [Yue et al., SIGIR 2007]

Finding Most Violated Constraint Start with perfect ranking Consider swapping adjacent relevant/non-relevant documents Find the best feasible ranking of the non-relevant document Repeat for next non-relevant document Never want to swap past previous non-relevant document [Yue et al., SIGIR 2007]

Finding Most Violated Constraint Start with perfect ranking Consider swapping adjacent relevant/non-relevant documents Find the best feasible ranking of the non-relevant document Repeat for next non-relevant document Never want to swap past previous non-relevant document Repeat until all non-relevant documents have been considered [Yue et al., SIGIR 2007]

Structural SVM for MAP Treats rankings as structured objects Optimizes hinge-loss relaxation of MAP –Provably minimizes the empirical risk –Performance improvement over conventional SVMs Relies on subroutine to find most violated constraint –Computationally compatible with linear discriminant

Structural SVM Recipe Joint feature map Inference method Loss function Loss-augmented (most violated constraint)

Need for Diversity (in IR) Ambiguous Queries –Users with different information needs issuing the same textual query –Jaguar or Apple –At least one relevant result for each information need Learning Queries –User interested in a specific detail or entire breadth of knowledge available [Swaminathan et al., 2008] –Results with high information diversity

Example Choose K documents with maximal information coverage. For K = 3, optimal set is {D1, D2, D10}

Diversity via Set Cover Documents cover information –Assume information is partitioned into discrete units. Documents overlap in the information covered. Selecting K documents with maximal coverage is a set cover problem –NP-complete in general –Greedy has (1-1/e) approximation [Khuller et al., 1997]

Diversity via Subtopics Current datasets use manually determined subtopic labels –E.g., Use of robots in the world today Nanorobots Space mission robots Underwater robots –Manual partitioning of the total information –Relatively reliable –Use as training data

Weighted Word Coverage Use words to represent units of information More distinct words = more information Weight word importance Does not depend on human labeling Goal: select K documents which collectively cover as many distinct (weighted) words as possible –Greedy selection yields (1-1/e) bound. –Need to find good weighting function (learning problem).

Example D1D2D3Best Iter D1 Iter 2 Marginal Benefit V1V2V3V4V5 D1XXX D2XXX D3XXXX WordBenefit V11 V22 V33 V44 V55 Document Word Counts

Example D1D2D3Best Iter D1 Iter 2--23D3 Marginal Benefit V1V2V3V4V5 D1XXX D2XXX D3XXXX WordBenefit V11 V22 V33 V44 V55 Document Word Counts

Related Work Comparison Essential Pages [Swaminathan et al., 2008] –Uses fixed function of word benefit –Depends on word frequency in candidate set Our goals –Automatically learn a word benefit function Learn to predict set covers Use training data Minimize subtopic loss –No prior ML approach (to our knowledge)

Linear Discriminant x = (x 1,x 2,…,x n ) - candidate documents y – subset of x V(y) – union of words from documents in y. Discriminant Function: (v,x) – frequency features (e.g., ¸ 10%, ¸ 20%, etc). Benefit of covering word v is then w T (v,x) [Yue & Joachims, ICML 2008]

Linear Discriminant Does NOT reward redundancy –Benefit of each word only counted once Greedy has (1-1/e)-approximation bound Linear (joint feature space) –Allows for SVM optimization (Used more sophisticated discriminant in experiments.) [Yue & Joachims, ICML 2008]

Weighted Subtopic Loss Example: –x 1 covers t 1 –x 2 covers t 1,t 2,t 3 –x 3 covers t 1,t 3 Motivation –Higher penalty for not covering popular subtopics # DocsLoss t1t1 31/2 t2t2 11/6 t3t3 21/3 [Yue & Joachims, ICML 2008]

Finding Most Violated Constraint Encode each subtopic as an additional word to be covered. Use greedy algorithm: (1-1/e)-approximation

TREC Experiments TREC 6-8 Interactive Track Queries Documents labeled into subtopics. 17 queries used, –considered only relevant docs –decouples relevance problem from diversity problem 45 docs/query, 20 subtopics/query, 300 words/doc Trained using LOO cross validation

TREC 6-8 Interactive Track Retrieving 5 documents

Can expect further benefit from having more training data.

Learning Set Cover Representations Given: –Manual partitioning of a space subtopics –Weighting for how items cover manual partitions subtopic labels + subtopic loss –Automatic partitioning of the space Words Goal: –Weighting for how items cover automatic partitions –The (greedy) optimal covering solutions agree

Summary Information Retrieval as Structured Prediction –Can be used to reason about complex retrieval goals –E.g., mean average precision, diversity –Software & papers available at Beyond Supervised Learning –Training data expensive to gather –Current work on interactive learning (with users) Thanks to Andrew McCallum, David Mimno & Marc Cartwright Work supported by NSF IIS , Microsoft Graduate Fellowship, and Yahoo! KTC Grant.

Extra Slides

Structural SVM Training STEP 1: Solve the SVM objective function using only the current working set of constraints. STEP 2: Using the model learned in STEP 1, find the most violated constraint from the global set of constraints. STEP 3: If the constraint returned in STEP 2 is violated by more than epsilon, add it to the working set. Repeat STEP 1-3 until no additional constraints are added. Return the most recent model that was trained in STEP 1. STEP 1-3 is guaranteed to loop for at most O(1/epsilon^2) iterations. [Tsochantaridis et al., 2005] *This is known as a cutting plane method.

Structural SVM Training Suppose we only solve the SVM objective over a small subset of constraints (working set). Some constraints from global set might be violated. When finding a violated constraint, only y is free, everything else is fixed –ys and xs fixed from training –w and slack variables fixed from solving SVM objective Degree of violation of a constraint is measured by:

SVM-map

Experiments Used TREC 9 & 10 Web Track corpus. Features of document/query pairs computed from outputs of existing retrieval functions. (Indri Retrieval Functions & TREC Submissions) Goal is to learn a recombination of outputs which improves mean average precision.

SVM-div

Result #18 Top of First Page Bottom of First Page Results From 11/27/2007 Query: Jaguar

More Sophisticated Discriminant Documents cover words to different degrees –A document with 5 copies of Microsoft might cover it better than another document with only 2 copies. Use multiple word sets, V 1 (y), V 2 (y), …, V L (y) Each V i (y) contains only words satisfying certain importance criteria. [Yue & Joachims, ICML 2008]

More Sophisticated Discriminant Separate i for each importance level i. Joint feature map is vector composition of all i Greedy has (1-1/e)-approximation bound. Still uses linear feature space. [Yue & Joachims, ICML 2008]

Essential Pages [Swaminathan et al., 2008] Benefit of covering word v with document x i Importance of covering word v Intuition: - Frequent words cannot encode information diversity. - Infrequent words do not provide significant information x = (x 1,x 2,…,x n ) - set of candidate documents for a query y – a subset of x of size K (our prediction).

Approximate Constraint Generation Theoretical guarantees no longer hold. –Might not find an epsilon-close approximation to the feasible region boundary. Performs well in practice.

TREC Experiments 12/4/1 train/valid/test split –Approx 500 documents in training set Permuted until all 17 queries were tested once Set K=5 (some queries have very few documents) SVM-div – uses term frequency thresholds to define importance levels SVM-div2 – in addition uses TFIDF thresholds

TREC Interactive Track Results MethodLoss Random0.469 Okapi0.472 Unweighted Model0.471 Essential Pages0.434 SVM-div0.349 SVM-div MethodsW / T / L SVM-div vs Ess. Pages 14 / 0 / 3 ** SVM-div2 vs Ess. Pages 13 / 0 / 4 SVM-div vs SVM-div2 9 / 6 / 2

Approximate constraint generation seems to work perform well.

Synthetic Dataset Trec dataset very small Synthetic dataset so we can vary retrieval size K 100 queries 100 docs/query, 25 subtopics/query, 300 words/doc 15/10/75 train/valid/test split

Consistently outperforms Essential Pages