Diversified Retrieval as Structured Prediction Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09) SIGIR 2009 Workshop Yisong Yue Cornell University Joint work with Thorsten Joachims
Need for Diversity (in IR) Ambiguous Queries –Different information needs using same query –“Jaguar” –At least one relevant result for each information need Learning Queries –User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008] –Results with high information diversity
Optimizing Diversity Interest in information retrieval –[Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008] Requires inter-document dependencies –Impossible with standard independence assumptions –E.g., probability ranking principle No consensus on how to measure diversity.
This Talk A method for representing and optimizing information coverage Discriminative training algorithm –Based on structural SVMs Appropriate forms of training data –Requires sufficient granularity (subtopic labels) Empirical evaluation
Choose top 3 documents. Individual Relevance: D3, D4, D1. Pairwise Sim MMR: D3, D1, D2. Best Solution: D3, D1, D5.
How to Represent Information? Discrete feature space to represent information –Decomposed into “nuggets” For query q and its candidate documents: –All the words (title words, anchor text, etc) –Cluster memberships (topic models / dim reduction) –Taxonomy memberships (ODP) We will focus on words and title words.
Weighted Word Coverage More distinct words = more information Weight word importance Will work automatically w/o human labels Goal: select K documents which collectively cover as many distinct (weighted) words as possible –Budgeted max coverage problem (Khuller et al., 1997) –Greedy selection yields (1-1/e) bound. –Need to find good weighting function (learning problem).
Example
Word benefit: V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5
Document word counts (X = document contains the word):
      V1  V2  V3  V4  V5
D1            X   X   X
D2        X       X   X
D3    X   X   X   X
Marginal benefit:
         D1   D2   D3   Best
Iter 1   12   11   10   D1
Iter 2   --    2    3   D3
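The greedy selection in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `greedy_cover` is hypothetical, and the word sets for D1 to D3 are inferred from the marginal benefits shown on the slide (12, 11, 10, then 2 and 3).

```python
def greedy_cover(doc_words, word_benefit, k):
    """Pick k documents greedily by marginal weighted word coverage."""
    selected, covered = [], set()
    for _ in range(min(k, len(doc_words))):
        best_doc, best_gain = None, -1
        for doc, words in doc_words.items():
            if doc in selected:
                continue
            # Only words not yet covered contribute to the marginal benefit.
            gain = sum(word_benefit[v] for v in words - covered)
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        selected.append(best_doc)
        covered |= doc_words[best_doc]
    return selected


# Word benefits from the slide; document word sets are inferred (assumption).
word_benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
doc_words = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}
print(greedy_cover(doc_words, word_benefit, 2))  # ['D1', 'D3']
```

D1 wins the first iteration with benefit 12; once V3, V4, and V5 are covered, D3 adds 3 (V1 and V2) while D2 adds only 2 (V2).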
How to Weight Words? Not all words created equal –“the” Conditional on the query –“computer” is normally fairly informative… –…but not for the query “ACM” Learn weights based on the candidate set –(for a query)
Prior Work Essential Pages [Swaminathan et al., 2008] –Uses fixed function of word benefit –Depends on word frequency in candidate set: local version of TF-IDF; frequent words get low weight (not important for diversity); rare words get low weight (not representative)
Linear Discriminant $x = (x_1, x_2, \ldots, x_n)$ - candidate documents $v$ – an individual word $\phi(v, x)$ – feature vector describing word $v$ with respect to $x$ We will use thousands of such features
Linear Discriminant $x = (x_1, x_2, \ldots, x_n)$ - candidate documents $y$ – subset of $x$ (the prediction) $V(y)$ – union of words from documents in $y$. Discriminant Function: $w^T \Psi(x, y) = \sum_{v \in V(y)} w^T \phi(v, x)$ Benefit of covering word $v$ is then $w^T \phi(v, x)$
Linear Discriminant Does NOT reward redundancy –Benefit of each word only counted once Greedy has (1-1/e)-approximation bound Linear (joint feature space) –Suitable for SVM optimization
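A rough sketch of how such a discriminant might be evaluated, assuming a toy feature map: the two frequency-threshold features in `phi` (word appears in at least 25% or at least 75% of the candidates) and the weights `w` are invented for illustration and are not the features used in the talk.

```python
def phi(word, docs):
    """Hypothetical features: is `word` in >= 25% / >= 75% of candidate docs?"""
    frac = sum(word in d for d in docs) / len(docs)
    return [1.0 if frac >= 0.25 else 0.0, 1.0 if frac >= 0.75 else 0.0]


def score(w, docs, subset):
    """Sum w . phi(v, x) over the union of words covered by the chosen subset."""
    covered = set()
    for i in subset:
        covered |= docs[i]
    # Each covered word is counted once, however many documents contain it.
    return sum(sum(wi * fi for wi, fi in zip(w, phi(v, docs))) for v in covered)


docs = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"a"}]
w = [1.0, -1.0]  # illustrative weights: very common words earn no benefit

print(score(w, docs, [0, 1]))  # 2.0
print(score(w, docs, [0, 3]))  # 1.0, same as selecting document 0 alone
```

Adding document 3 to document 0 contributes nothing because its only word is already covered, which is the "does NOT reward redundancy" property.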
More Sophisticated Discriminant Documents “cover” words to different degrees –A document with 5 copies of “Thorsten” might cover it better than another document with only 2 copies. Use multiple word sets, $V_1(y), V_2(y), \ldots, V_L(y)$. Each $V_i(y)$ contains only words satisfying certain importance criteria. Requires more sophisticated joint feature map.
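One way to picture the multiple word sets is the sketch below. The importance criterion is an assumption made for illustration (a word enters level l if some selected document contains it at least `thresholds[l]` times); the slide does not state the actual criteria, and `word_sets` is a hypothetical helper.

```python
from collections import Counter


def word_sets(selected_docs, thresholds):
    """Return [V_1(y), ..., V_L(y)] under an assumed count-based criterion:
    level l holds words occurring >= thresholds[l] times in some selected doc."""
    levels = [set() for _ in thresholds]
    for tokens in selected_docs:
        counts = Counter(tokens)
        for level, t in enumerate(thresholds):
            levels[level] |= {v for v, c in counts.items() if c >= t}
    return levels


# Two toy documents given as token lists.
y = [["thorsten"] * 5 + ["svm"], ["thorsten"] * 2 + ["svm"] * 2]
V = word_sets(y, thresholds=[1, 2, 5])
print(V[2])  # {'thorsten'}
```

Under this assumed criterion, only the document with 5 copies of "thorsten" places the word in the strictest set, so a joint feature map that sums word features per level can reward it more than the document with 2 copies.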
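Conventional SVMs Input: $x$ (high dimensional point) Target: $y$ (either +1 or -1) Prediction: $\mathrm{sign}(w^T x)$ Training: $\min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i$ subject to: $\forall i:\; y_i (w^T x_i) \ge 1 - \xi_i$ The sum of slacks $\sum_i \xi_i$ upper bounds the accuracy loss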
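The claim that the slacks bound the accuracy loss can be checked numerically. The sketch below assumes the standard soft-margin objective with the slack term scaled by C/N and no bias term; the data and the function name `svm_objective` are made up for illustration.

```python
def svm_objective(w, xs, ys, C):
    """Return (objective, slacks, errors) for a linear SVM without bias."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    margins = [y * dot(w, x) for x, y in zip(xs, ys)]
    # Smallest slack satisfying y_i * (w . x_i) >= 1 - slack_i, slack_i >= 0.
    slacks = [max(0.0, 1.0 - m) for m in margins]
    errors = sum(1 for m in margins if m <= 0)
    objective = 0.5 * dot(w, w) + C / len(xs) * sum(slacks)
    return objective, slacks, errors


w = [1.0, 0.0]
xs = [[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0]]
ys = [1, 1, 1]
objective, slacks, errors = svm_objective(w, xs, ys, C=3.0)
print(slacks, errors, objective)  # [0.0, 0.5, 2.0] 1 3.0
```

Every misclassified point has margin at most 0 and therefore slack at least 1, so the slack sum (2.5 here) is never smaller than the number of training errors (1 here).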
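Structural SVM Formulation Input: $x$ (candidate set of documents) Target: $y$ (subset of $x$ of size $K$) Same objective function: $\min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i$ Constraints for each incorrect labeling $y'$: $\forall i,\ \forall y' \ne y^{(i)}:\; w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y^{(i)}, y') - \xi_i$, where $\Psi$ is the joint feature map and $\Delta$ is the loss. Score of best $y$ at least as large as incorrect $y'$ plus loss Requires new training algorithm [Tsochantaridis et al., 2005]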
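Training algorithms of this kind repeatedly look for the labeling whose constraint is most violated. The brute-force search below is only a sketch of that idea for tiny candidate sets; it is not the training procedure from the talk, and `most_violated`, `score_fn`, and `loss_fn` are hypothetical names standing in for the model score and the loss against the true labeling.

```python
from itertools import combinations


def most_violated(n_docs, k, y_true, score_fn, loss_fn):
    """Brute-force the size-k subset y' maximizing
    loss_fn(y') + score_fn(y') - score_fn(y_true); returns (y', slack needed)."""
    best_y, best_violation = None, 0.0
    base = score_fn(y_true)
    for y_alt in combinations(range(n_docs), k):
        if set(y_alt) == set(y_true):
            continue  # constraints only range over incorrect labelings
        violation = loss_fn(y_alt) + score_fn(y_alt) - base
        if violation > best_violation:
            best_y, best_violation = y_alt, violation
    return best_y, best_violation


# Toy scores and losses for single-document predictions (illustrative values).
scores = {0: 1.0, 1: 0.75, 2: 0.25}
losses = {0: 0.0, 1: 0.5, 2: 0.5}
y_hat, slack = most_violated(
    n_docs=3, k=1, y_true=(0,),
    score_fn=lambda y: scores[y[0]],
    loss_fn=lambda y: losses[y[0]],
)
print(y_hat, slack)  # (1,) 0.25
```

Here the true labeling outscores document 1 by 0.25, but the loss of 0.5 demands a larger margin, so a slack of 0.25 is needed. Enumerating all subsets is exponential in general, which is why practical training needs a more efficient search.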
Weighted Subtopic Loss Example: –x1 covers t1 –x2 covers t1, t2, t3 –x3 covers t1, t3 Motivation –Higher penalty for not covering popular subtopics –Mitigates effects of label noise in tail subtopics
Subtopic   # Docs   Loss
t1         3        1/2
t2         1        1/6
t3         2        1/3
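The table's loss values can be reproduced with a short sketch. It assumes each subtopic's weight is its document count divided by the total count over all subtopics (3/6, 1/6, 2/6), which matches the numbers on the slide; the function names are hypothetical.

```python
from fractions import Fraction


def subtopic_weights(doc_topics):
    """Weight each subtopic by its share of all (document, subtopic) labels."""
    counts = {}
    for topics in doc_topics.values():
        for t in topics:
            counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: Fraction(c, total) for t, c in counts.items()}


def weighted_subtopic_loss(selected, doc_topics):
    """Sum the weights of subtopics that no selected document covers."""
    weights = subtopic_weights(doc_topics)
    covered = set()
    for d in selected:
        covered |= doc_topics[d]
    return sum((wt for t, wt in weights.items() if t not in covered), Fraction(0))


doc_topics = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
print(weighted_subtopic_loss(["x1"], doc_topics))  # 1/2
print(weighted_subtopic_loss(["x3"], doc_topics))  # 1/6
print(weighted_subtopic_loss(["x2"], doc_topics))  # 0
```

Selecting only x1 misses t2 and t3 and costs 1/6 + 1/3 = 1/2, while selecting x3 misses only the rare subtopic t2 and costs 1/6.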
Diversity Training Data TREC 6-8 Interactive Track –Queries with explicitly labeled subtopics –E.g., “Use of robots in the world today” Nanorobots Space mission robots Underwater robots –Manual partitioning of the total information regarding a query
Experiments TREC 6-8 Interactive Track Queries Documents labeled into subtopics. 17 queries used, –considered only relevant docs –decouples relevance problem from diversity problem 45 docs/query, 20 subtopics/query, 300 words/doc Trained using LOO cross validation
Can expect further benefit from having more training data.
Moving Forward Larger datasets –Evaluate relevance & diversity jointly Different types of training data –Our framework can define loss in different ways –Can we leverage clickthrough data? Different feature representations –Build on top of topic modeling approaches? –Can we incorporate hierarchical retrieval?
References & Code/Data “Predicting Diverse Subsets Using Structural SVMs” –[Yue & Joachims, ICML 2008] Source code and dataset available online –http://projects.yisongyue.com/svmdiv/ Work supported by NSF IIS-0713483, Microsoft Fellowship, and Yahoo! KTC Grant.