Diversified Retrieval as Structured Prediction

1 Diversified Retrieval as Structured Prediction
Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09), SIGIR 2009 Workshop. Yisong Yue, Cornell University. Joint work with Thorsten Joachims.

2 Need for Diversity (in IR)
Ambiguous queries: different information needs expressed with the same query (e.g., “Jaguar”); we want at least one relevant result for each information need. Learning queries: the user is interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008]; we want results with high information diversity.
References: A. Swaminathan, C. Mathew and D. Kirovski. “Essential Pages.” MSR Technical Report, 2008.

3 Optimizing Diversity
Growing interest in information retrieval [Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008]. Optimizing diversity requires modeling inter-document dependencies, which is impossible under standard independence assumptions (e.g., the probability ranking principle). There is no consensus on how to measure diversity.

4 This Talk
A method for representing and optimizing information coverage. A discriminative training algorithm based on structural SVMs. Appropriate forms of training data, which require sufficient granularity (subtopic labels). Empirical evaluation.

5 Choose top 3 documents
(Illustrative example.) Ranking by individual relevance: D3, D4, D1. MMR, using pairwise similarity: D3, D1, D2. Best solution: D3, D1, D5.

6 How to Represent Information?
A discrete feature space to represent information, decomposed into “nuggets”. For a query q and its candidate documents: all the words (title words, anchor text, etc.); cluster memberships (topic models / dimensionality reduction); taxonomy memberships (ODP). We will focus on words and title words.

7 Weighted Word Coverage
More distinct words = more information, so we weight word importance; this works automatically without human labels. Goal: select K documents which collectively cover as many distinct (weighted) words as possible. This is the budgeted max coverage problem (Khuller et al., 1997), and greedy selection yields a (1-1/e)-approximation bound. We still need to find a good weighting function (the learning problem). A sketch of the greedy step follows.
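
A minimal sketch of this greedy selection (illustrative Python, not the authors' released code), assuming documents are given as word sets and word weights as a dict:

```python
def greedy_select(docs, word_weight, k):
    """Pick k documents that maximize weighted distinct-word coverage.

    docs: list of sets of words, one set per candidate document
    word_weight: dict mapping word -> nonnegative importance weight
    Returns the indices of the selected documents.
    """
    selected, covered = [], set()
    for _ in range(k):
        remaining = [i for i in range(len(docs)) if i not in selected]
        if not remaining:
            break
        # Marginal benefit of a document: total weight of its not-yet-covered words.
        best = max(remaining,
                   key=lambda i: sum(word_weight.get(v, 0.0)
                                     for v in docs[i] - covered))
        selected.append(best)
        covered |= docs[best]
    return selected

# Example: with uniform weights this reduces to plain distinct-word coverage.
docs = [{"jaguar", "car", "engine"}, {"jaguar", "cat", "habitat"}, {"car", "engine"}]
weights = {w: 1.0 for d in docs for w in d}
print(greedy_select(docs, weights, 2))  # -> [0, 1]
```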

8–9 Example
Word benefits: V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5. Documents D1, D2, D3 each cover a subset of the words (the slide shows the document–word incidence table).
Marginal benefit, iteration 1: D1 = 12, D2 = 11, D3 = 10. Best: D1.
Marginal benefit, iteration 2: D1 = --, D2 = 2, D3 = 3 (only words not already covered by D1 still count). Best: D3.

10 How to Weight Words?
Not all words are created equal: “the” carries little information. Importance is conditional on the query: “computer” is normally fairly informative, but not for the query “ACM”. So we learn weights based on the candidate set (for a query).

11 Prior Work: Essential Pages [Swaminathan et al., 2008]
Uses a fixed function of word benefit that depends on word frequency in the candidate set:
- Local version of TF-IDF
- Frequent words get low weight (not important for diversity)
- Rare words get low weight (not representative)

12 Linear Discriminant
x = (x1, x2, …, xn) – the candidate documents; v – an individual word. Features φ(v, x) describe how the candidate set covers v; we will use thousands of such features.

13 Linear Discriminant
x = (x1, x2, …, xn) – candidate documents; y – a subset of x (the prediction); V(y) – the union of words from the documents in y. Discriminant function: F(x, y) = wᵀΨ(x, y) = Σ_{v ∈ V(y)} wᵀφ(v, x). The benefit of covering word v is then wᵀφ(v, x).
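
A minimal sketch of this discriminant with a hypothetical φ: the frequency-threshold indicators below are illustrative stand-ins (the actual system uses thousands of features describing how the candidate set covers each word):

```python
import numpy as np

def phi(v, docs):
    """Hypothetical word features; docs is the candidate set (one word set per doc)."""
    df = sum(v in d for d in docs) / len(docs)  # fraction of docs containing v
    return np.array([1.0,                       # constant: v is covered at all
                     float(df >= 0.05),         # v appears in at least 5% of docs
                     float(df >= 0.25)])        # v appears in at least 25% of docs

def discriminant(w, docs, y):
    """F(x, y) = sum over v in V(y) of w . phi(v, x), with y indexing into docs."""
    V = set().union(*(docs[i] for i in y))      # each distinct word counted once
    return sum(float(w @ phi(v, docs)) for v in V)
```

Note that the per-word benefit wᵀφ(v, x) can be precomputed and passed to the greedy routine above as its word weighting.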

14 Linear Discriminant
Does NOT reward redundancy: the benefit of each word is only counted once. Greedy selection has a (1-1/e)-approximation bound. The discriminant is linear in the joint feature space, making it suitable for SVM optimization.

15–16 More Sophisticated Discriminant
Documents “cover” words to different degrees: a document with 5 copies of “Thorsten” might cover it better than another document with only 2 copies. So we use multiple word sets V1(y), V2(y), …, VL(y), where each Vi(y) contains only the words satisfying certain importance criteria. This requires a more sophisticated joint feature map; see the sketch below.
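
One way to realize the multiple word sets (the importance criteria are left abstract on the slide; the per-level term-frequency thresholds below are assumptions for illustration):

```python
TF_THRESHOLDS = [1, 2, 5]  # assumed levels: a doc covers v at level i if tf >= threshold i

def multilevel_word_sets(docs_tf, y):
    """V_1(y), ..., V_L(y): the words covered by selection y at each level.

    docs_tf: list of dicts mapping word -> term frequency, one per document
    y: indices of the selected documents
    """
    word_sets = []
    for t in TF_THRESHOLDS:
        v_i = set()
        for j in y:
            v_i |= {v for v, tf in docs_tf[j].items() if tf >= t}
        word_sets.append(v_i)
    return word_sets
```

The joint feature map then concatenates one coverage term per level, Ψ(x, y) = (Σ_{v ∈ V1(y)} φ1(v, x), …, Σ_{v ∈ VL(y)} φL(v, x)); slide 27 calls this the vector composition of the φi.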

17 Conventional SVMs
Input: x (a high-dimensional point). Target: y (either +1 or -1). Prediction: sign(wᵀx). Training minimizes a regularized objective subject to margin constraints, and the sum of slacks upper-bounds the accuracy loss; the standard form is written out below.
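
The training problem the slide refers to is the standard soft-margin SVM:

```latex
% Standard soft-margin SVM training problem:
\min_{w,\,\xi \ge 0} \ \tfrac{1}{2}\lVert w \rVert^2 + \tfrac{C}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \forall i:\ y_i \, (w^\top x_i) \ge 1 - \xi_i
```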

18 Structural SVM Formulation
Input: x (candidate set of documents). Target: y (a subset of x of size K). Same objective function as the conventional SVM, with constraints for each incorrect labeling y′: the score of the correct y must be at least as large as the score of any incorrect y′ plus its loss. Requires a new training algorithm [Tsochantaridis et al., 2005]; the standard form is written out below.
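
In the standard form of [Tsochantaridis et al., 2005], the objective and constraints read:

```latex
% Structural SVM training problem (margin-rescaling form):
\min_{w,\,\xi \ge 0} \ \tfrac{1}{2}\lVert w \rVert^2 + \tfrac{C}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad \forall i,\ \forall y' \ne y_i:\
w^\top \Psi(x_i, y_i) \ \ge\ w^\top \Psi(x_i, y') + \Delta(y', y_i) - \xi_i
```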

19 Weighted Subtopic Loss
Example: x1 covers t1; x2 covers t1, t2, t3; x3 covers t1, t3. Subtopic weights are proportional to the number of documents covering each subtopic: t1 – 3 docs, loss weight 1/2; t2 – 1 doc, loss weight 1/6; t3 – 2 docs, loss weight 1/3. Motivation: higher penalty for not covering popular subtopics; mitigates the effects of label noise in tail subtopics. A sketch of this loss follows.
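
A short sketch that reproduces the slide's numbers; the normalization (each subtopic's document count divided by the total number of document-subtopic incidences) is inferred from the example:

```python
from collections import Counter

def subtopic_weights(doc_subtopics):
    """Weight each subtopic in proportion to how many documents cover it."""
    counts = Counter(t for s in doc_subtopics for t in s)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def weighted_subtopic_loss(doc_subtopics, y):
    """Total weight of the subtopics left uncovered by the selected subset y."""
    weights = subtopic_weights(doc_subtopics)
    covered = set().union(*(doc_subtopics[i] for i in y))
    return sum(wt for t, wt in weights.items() if t not in covered)

# The slide's example: x1 covers {t1}, x2 covers {t1,t2,t3}, x3 covers {t1,t3}.
docs = [{"t1"}, {"t1", "t2", "t3"}, {"t1", "t3"}]
print(subtopic_weights(docs))             # {'t1': 1/2, 't2': 1/6, 't3': 1/3}
print(weighted_subtopic_loss(docs, [0]))  # only x1 selected -> 1/6 + 1/3 = 0.5
```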

20 Diversity Training Data
TREC 6-8 Interactive Track: queries with explicitly labeled subtopics. E.g., “Use of robots in the world today”: nanorobots, space mission robots, underwater robots. A manual partitioning of the total information regarding a query.

21 Experiments
TREC 6-8 Interactive Track queries, with documents labeled into subtopics. 17 queries were used, and only relevant documents were considered, which decouples the relevance problem from the diversity problem. Roughly 45 docs/query, 20 subtopics/query, and 300 words/doc. Trained using leave-one-out (LOO) cross-validation.

22 TREC 6-8 Interactive Track
(Results figure: retrieving 5 documents.)

23 Can expect further benefit from having more training data.

24 Moving Forward
Larger datasets: evaluate relevance and diversity jointly. Different types of training data: our framework can define the loss in different ways; can we leverage clickthrough data? Different feature representations: can we build on top of topic modeling approaches, and can we incorporate hierarchical retrieval?

25 References & Code/Data
“Predicting Diverse Subsets Using Structural SVMs” [Yue & Joachims, ICML 2008]. Source code and dataset are available online. Work supported by NSF IIS, a Microsoft Fellowship, and a Yahoo! KTC grant.

26 Extra Slides

27 More Sophisticated Discriminant
Separate i for each importance level i. Joint feature map  is vector composition of all i Greedy has (1-1/e)-approximation bound. Still uses linear feature space.

28 Maximizing Subtopic Coverage
Goal: select K documents which collectively cover as many subtopics as possible. Perfect selection takes (n choose K) time; the problem is basically set cover, and greedy gives a (1-1/e)-approximation bound as a special case of max coverage (Khuller et al., 1997). A toy comparison of exhaustive and greedy selection follows.
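
For intuition, a toy comparison of exhaustive search over all (n choose K) subsets against the greedy heuristic, on plain unweighted subtopic coverage (illustrative code):

```python
from itertools import combinations

def coverage(doc_subtopics, y):
    """Number of distinct subtopics covered by the documents indexed by y."""
    return len(set().union(*(doc_subtopics[i] for i in y)))

def brute_force(doc_subtopics, k):
    """Optimal selection: examines all (n choose k) subsets."""
    return max(combinations(range(len(doc_subtopics)), k),
               key=lambda y: coverage(doc_subtopics, y))

def greedy(doc_subtopics, k):
    """(1 - 1/e)-approximate selection with only n * k coverage evaluations."""
    y = []
    for _ in range(k):
        rest = [i for i in range(len(doc_subtopics)) if i not in y]
        y.append(max(rest, key=lambda i: coverage(doc_subtopics, y + [i])))
    return y
```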

29 Learning Set Cover Representations
Given: a manual partitioning of a space (subtopics); a weighting for how items cover the manual partitions (subtopic labels + subtopic loss); and an automatic partitioning of the space (words). Goal: learn a weighting for how items cover the automatic partitions such that the (greedy) optimal covering solutions agree.

30 Essential Pages

31 Essential Pages
x = (x1, x2, …, xn) – the set of candidate documents for a query; y – a subset of x of size K (our prediction). The slide defines the benefit of covering word v with document xi and the importance of covering word v [Swaminathan et al., 2008]. Intuition: frequent words cannot encode information diversity, and infrequent words do not provide significant information.
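
The benefit and importance formulas themselves were figures on the original slide and are given in [Swaminathan et al., 2008]; the function below is only one illustrative weighting consistent with the stated intuition, not the paper's definition:

```python
import math

def word_importance(df):
    """Illustrative only -- NOT the Essential Pages formula.

    df: fraction of the candidate documents containing the word.
    df * log(1/df) goes to 0 both for very rare words (df -> 0) and
    for ubiquitous words (df = 1), matching the intuition above.
    """
    if df <= 0.0 or df >= 1.0:
        return 0.0
    return df * math.log(1.0 / df)
```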

32 Structural SVMs

33 Minimizing Hinge Loss
Suppose the model prefers an incorrect y′, i.e., wᵀΨ(x, y′) ≥ wᵀΨ(x, y). Then the training constraint wᵀΨ(x, y) ≥ wᵀΨ(x, y′) + Δ(y′, y) − ξ forces ξ ≥ Δ(y′, y): the slack upper-bounds the loss of the prediction [Tsochantaridis et al., 2005].
References: Y. Yue, T. Finley, F. Radlinski, T. Joachims. “A Support Vector Method for Optimizing Average Precision.” In Proceedings of SIGIR, 2007.

34 Finding Most Violated Constraint
A constraint is violated when wᵀΨ(x, y) + ξ < wᵀΨ(x, y′) + Δ(y′, y). Finding the most violated constraint therefore reduces to the loss-augmented problem: argmax over y′ of wᵀΨ(x, y′) + Δ(y′, y).

35 Finding Most Violated Constraint
Encode each subtopic as an additional “word” to be covered, then use greedy prediction to find an approximately most violated constraint; a sketch follows.
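
A sketch of the resulting separation oracle (illustrative Python: word_benefit plays the role of the learned per-word benefits wᵀφ(v, x), and topic_weight is the weighted subtopic loss from slide 19):

```python
def most_violated(docs, word_benefit, doc_subtopics, topic_weight, k):
    """Greedy approximation of argmax over y' of w.Psi(x, y') + Delta(y', y)."""
    def objective(y):
        words = set().union(*(docs[i] for i in y))
        score = sum(word_benefit.get(v, 0.0) for v in words)
        topics = set().union(*(doc_subtopics[i] for i in y))
        # Delta(y', y): total weight of the subtopics that y' fails to cover.
        loss = sum(wt for t, wt in topic_weight.items() if t not in topics)
        return score + loss
    selected = []
    for _ in range(k):
        rest = [i for i in range(len(docs)) if i not in selected]
        selected.append(max(rest, key=lambda i: objective(selected + [i])))
    return selected
```

Because Δ decomposes over subtopics, treating each subtopic as an extra “word” (the slide's trick) folds the loss into the same greedy coverage computation.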

36 Illustrative Example
Original SVM problem: exponentially many constraints, most of which are dominated by a small set of “important” constraints. Structural SVM approach: repeatedly find the next most violated constraint until the working set of constraints is a good approximation; see the sketch below.
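
The loop this illustrates, sketched in Python; all names are placeholders (fit_qp solves the QP restricted to the current working sets, separation_oracle is the greedy routine above, delta is the loss, psi the joint feature map):

```python
def cutting_plane(examples, fit_qp, separation_oracle, delta, psi, eps=1e-3):
    """Grow per-example constraint sets until none is violated by more than eps."""
    working = [[] for _ in examples]            # constraint working set per example
    w, xi = fit_qp(examples, working)           # solve QP over current constraints
    while True:
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat = separation_oracle(x, y, w)  # (approximately) most violated y'
            gap = delta(y_hat, y) - float(w @ (psi(x, y) - psi(x, y_hat)))
            if gap > xi[i] + eps:               # violated beyond the current slack
                working[i].append(y_hat)
                added = True
        if not added:
            return w                            # every constraint satisfied to eps
        w, xi = fit_qp(examples, working)       # re-solve with the enlarged sets
```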

40 Approximate Constraint Generation
Theoretical guarantees no longer hold: we might not find an epsilon-close approximation to the feasible region boundary. It nevertheless performs well in practice.

41 Approximate constraint generation seems to perform well.

42 Experiments

43 TREC Experiments
12/4/1 train/validation/test split, with approximately 500 documents in the training set; splits were permuted until all 17 queries were tested once. Set K = 5 (some queries have very few documents). SVM-div uses term frequency thresholds to define importance levels; SVM-div2 additionally uses TF-IDF thresholds.

44 TREC Results

Method            Loss
Random            0.469
Okapi             0.472
Unweighted Model  0.471
Essential Pages   0.434
SVM-div           0.349
SVM-div2          0.382

Methods                        W / T / L
SVM-div vs. Essential Pages    14 / 0 / 3 **
SVM-div2 vs. Essential Pages   13 / 0 / 4
SVM-div vs. SVM-div2            9 / 6 / 2

45 Synthetic Dataset
The TREC dataset is very small, so we built a synthetic dataset that lets us vary the retrieval size K: 100 queries; 100 docs/query, 25 subtopics/query, 300 words/doc; 15/10/75 train/valid/test split.

46 Consistently outperforms Essential Pages

