
1 An Introduction to Structural SVMs and its Application to Information Retrieval Yisong Yue Carnegie Mellon University

2 Supervised Learning Find a function from input space X to output space Y such that the prediction error is low. (The slide illustrates three examples: a news article x with class label y = 1, a DNA sequence x with its annotation y, and an input x with real-valued output y = 7.3.)

3 Examples of Complex Output Spaces Part-of-Speech Tagging –Given a sequence of words x, predict a sequence of tags y. –Dependencies from tag–tag transitions in a Markov model. Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging. (Figure: x = “The rain wet the cat”, y = Det N V Det N.)

4 Examples of Complex Output Spaces Natural Language Parsing –Given a sequence of words x, predict the parse tree y. –Dependencies from structural constraints, since y has to be a tree. (Figure: x = “The dog chased the cat”, y = a parse tree with S → NP VP, NP → Det N, VP → V NP.)

5 Examples of Complex Output Spaces Information Retrieval –Given a query x, predict a ranking y. –Dependencies between results (e.g., avoid redundant hits). –Loss function over rankings (e.g., Average Precision). (Figure: for the query x = “SVM”, the ranking y lists 1. Kernel-Machines, 2. SVM-Light, 3. Learning with Kernels, 4. SV Meppen Fan Club, 5. Service Master & Co., 6. School of Volunteer Management, 7. SV Mattersburg Online, …)

6 Examples of Complex Output Spaces Multi-label Prediction, Protein Sequence Alignment, Noun Phrase Co-reference Clustering, Rankings in Information Retrieval, Inference in Graphical Models, …and many more.

7 1st Order Sequence Labeling Given: –scoring function S(x, y_1, y_2) –input example x = (x_1,…,x_n) Find the sequence y = (y_1,…,y_n) maximizing the sum of local scores: h(x) = argmax_y Σ_i S(x, y_{i-1}, y_i). Solved with dynamic programming (Viterbi). “Hypothesis Function”

8 Some Formulation Restrictions Assume S is parameterized linearly by some weight vector w in R^D. This means that S(x, y_1, y_2) = w^T φ(x, y_1, y_2) for some feature function φ, so h(x) = argmax_y Σ_i w^T φ(x, y_{i-1}, y_i). “Hypothesis Function”

9 Joint Feature Map From the last slide: h(x) = argmax_y Σ_i w^T φ(x, y_{i-1}, y_i). Joint feature map: Ψ(x, y) = Σ_i φ(x, y_{i-1}, y_i). Our hypothesis function: h(x) = argmax_y w^T Ψ(x, y). “Linear Discriminant Function”

10 Structured Prediction Learning Problem Efficient Inference/Prediction –Viterbi in sequence labeling –CKY parser for parse trees –Belief propagation for Markov random fields –Sorting for ranking Efficient Learning/Training –Learn parameters w from training data {(x_i, y_i)}_{i=1..N} –Solution: use the Structural SVM framework –Can also use Perceptrons, CRFs, MEMMs, M^3Ns, etc.

11 Conventional SVMs Input: x (high dimensional point) Target: y (either +1 or -1) Prediction: sign(w^T x) Training: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to: y_i (w^T x_i) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i. The sum of slacks upper bounds the 0/1 loss!

12 Structural SVM Let x denote a structured input (sentence) Let y denote a structured output (POS tags) Standard objective function: minimize (1/2)||w||^2 + C Σ_i ξ_i. Constraints are defined for each incorrect labeling y' over each x: w^T Ψ(x_i, y_i) ≥ w^T Ψ(x_i, y') + Δ(y_i, y') − ξ_i, i.e., Score(y_i) ≥ Score(y') + Loss(y') − Slack. [Tsochantaridis et al., 2005]

13 Interpreting Constraints Suppose the model scores an incorrect y' at least as high as the correct y_i: w^T Ψ(x_i, y') ≥ w^T Ψ(x_i, y_i). Then the constraint w^T Ψ(x_i, y_i) ≥ w^T Ψ(x_i, y') + Δ(y_i, y') − ξ_i forces the slack to absorb the loss: ξ_i ≥ Δ(y_i, y'). [Tsochantaridis et al., 2005]

14 Adapting to Sequence Labeling Minimize (1/2)||w||^2 + C Σ_i ξ_i subject to w^T Ψ(x_i, y_i) ≥ w^T Ψ(x_i, y') + Δ(y_i, y') − ξ_i for every training example i and every incorrect labeling y', where Ψ(x, y) = Σ_j φ(x, y_{j-1}, y_j) and Δ(y_i, y') counts the number of incorrect tags (Hamming loss). The sum of slacks upper bounds the loss. Too many constraints!

15 Structural SVM Training The trick is to not enumerate all constraints. Suppose we only solve the SVM objective over a small subset of constraints (the working set). –This is efficient: the restricted problem is equivalent to solving a standard SVM. But some constraints from the global set might be violated.

16 Structural SVM Training STEP 1: Solve the SVM objective function using only the working set of constraints W. STEP 2: Using the model learned in STEP 1, find the most violated constraint from the global set of constraints. STEP 3: If the constraint returned in STEP 2 is violated by more than ε, add it to W. Repeat STEPs 1–3 until no additional constraints are added. Return the most recent model trained in STEP 1. STEPs 1–3 are guaranteed to loop for at most O(1/ε) iterations. [Joachims et al., 2009] *This is known as a “cutting plane” method.
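A minimal sketch of this loop in Python; `solve_svm`, `loss_aug_argmax`, `psi`, and `loss` are hypothetical stand-ins for the QP solver, the loss-augmented search, the joint feature map, and the task loss described on the surrounding slides.

import numpy as np

def cutting_plane_train(examples, solve_svm, loss_aug_argmax, psi, loss, eps=1e-3):
    """solve_svm(W) returns (w, slack) for the QP restricted to constraints W;
    loss_aug_argmax(w, x, y) returns the most violated labeling y' (STEP 2)."""
    W = []                                    # working set of constraints
    w, slack = solve_svm(W)                   # STEP 1 with an empty working set
    while True:
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat = loss_aug_argmax(w, x, y)  # STEP 2
            margin = np.dot(w, psi(x, y) - psi(x, y_hat))
            if loss(y, y_hat) - margin > slack[i] + eps:  # STEP 3
                W.append((i, y_hat))
                added = True
        if not added:
            return w                          # no constraint violated by > eps
        w, slack = solve_svm(W)               # re-solve STEP 1 over the new W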

17–20 Illustrative Example (animation) Original SVM Problem –Exponentially many constraints –Most are dominated by a small set of “important” constraints Structural SVM Approach –Repeatedly finds the next most violated constraint… –…until the set of constraints is a good approximation. *This is known as a “cutting plane” method.


21 Finding Most Violated Constraint A constraint is violated when w^T Ψ(x_i, y_i) − w^T Ψ(x_i, y') < Δ(y_i, y') − ξ_i. Finding the most violated constraint reduces to argmax_{y'} [ Δ(y_i, y') + w^T Ψ(x_i, y') ]. Highly related to inference: “Loss augmented inference”

22 Sequence Labeling Revisited Finding the most violated constraint, argmax_{y'} [ Δ(y_i, y') + w^T Ψ(x_i, y') ], can be solved using Viterbi! With Hamming loss, Δ decomposes over positions just like the per-position scores, so the same dynamic program applies.
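A sketch of loss-augmented Viterbi under the Hamming-loss setting above; the score arrays `init` and `trans` are assumed precomputed stand-ins for the w^T φ terms.

import numpy as np

def loss_augmented_viterbi(init, trans, y_true):
    """init[b]      : score of starting with tag b
    trans[i, a, b]  : score of tag a at position i followed by tag b at i+1
    y_true          : gold tag sequence; Hamming loss adds +1 per wrong tag.
    Because the loss decomposes per position, plain Viterbi still works."""
    n, n_tags = trans.shape[0] + 1, init.shape[0]
    delta = np.zeros((n, n_tags))             # best augmented score per tag
    back = np.zeros((n, n_tags), dtype=int)   # back-pointers
    delta[0] = init + (np.arange(n_tags) != y_true[0])   # score + position loss
    for i in range(1, n):
        aug = (np.arange(n_tags) != y_true[i]).astype(float)
        scores = delta[i - 1][:, None] + trans[i - 1]    # prev tag x next tag
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + aug
    y = [int(delta[-1].argmax())]             # follow back-pointers
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]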

23 Structural SVM Recipe Four ingredients: –Joint feature map Ψ(x, y) –Inference method h(x) = argmax_y w^T Ψ(x, y) –Loss function Δ(y, y') –Loss-augmented inference (finding the most violated constraint)

24 Structural SVMs for Rankings Predicting rankings is important in IR Must provide the four ingredients: –How to represent the joint feature map Ψ? –How to perform inference? –What loss function to optimize for? –How to find the most violated constraint?

25 Joint Feature Map for Ranking Let x = (x_1,…,x_n) denote the candidate documents. Let y_{jk} ∈ {+1, −1} encode pairwise rank orders (+1 if document j is ranked above document k). The feature map accumulates pairwise feature differences of documents: Ψ(x, y) = Σ_{j,k} y_{jk} (x_j − x_k). Inference is made by sorting on document scores w^T x_i.
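A small sketch of this feature map and the sorting-based inference. Names are illustrative, and the generic pair set from the slide is used; the SIGIR 2007 formulation restricts the sum to relevant/non-relevant pairs and normalizes it.

import numpy as np

def ranking_psi(X, y_pairs):
    """X: (n, d) matrix of candidate-document feature vectors.
    y_pairs: dict mapping (j, k) -> +1 if doc j is ranked above doc k, else -1.
    Accumulates y_jk * (x_j - x_k) over document pairs."""
    psi = np.zeros(X.shape[1])
    for (j, k), y_jk in y_pairs.items():
        psi += y_jk * (X[j] - X[k])
    return psi

def rank_by_score(w, X):
    """Inference: sort documents by w^T x_i, highest score first."""
    return np.argsort(-X @ w)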

26–27 Multivariate Loss Functions Information Retrieval focuses on ranking-centric performance measures: –Normalized Discounted Cumulative Gain –Precision @ K –Mean Average Precision –Expected Reciprocal Rank These measures all depend on the entire ranking.


28 Mean Average Precision Consider the rank position of each relevant doc: K_1, K_2, … K_R. Compute Precision@K for each K_1, K_2, … K_R. Average Precision = the average of those P@K values. For example, relevant docs at ranks 1, 3, and 5 give AvgPrec = (1/1 + 2/3 + 3/5)/3 ≈ 0.76. MAP is Average Precision averaged across multiple queries/rankings.
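The definition translates directly into a few lines of code; a small helper with illustrative names:

def average_precision(ranking, relevant):
    """Average Precision as defined above: Precision@K at each rank K_r
    holding a relevant document, averaged over the R relevant documents.
    `ranking` is a list of doc ids, `relevant` a set of relevant doc ids."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)       # Precision@K at this K_r
    return sum(precisions) / len(relevant)

# e.g. relevant docs at ranks 1, 3, 5:
# average_precision(['a', 'b', 'c', 'd', 'e'], {'a', 'c', 'e'})
# = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76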

29 (figure slide; content not captured in the transcript) [Yue & Burges, 2007]

30 Structural SVM for MAP Minimize (1/2)||w||^2 + C Σ_i ξ_i subject to w^T Ψ(x_i, y_i) ≥ w^T Ψ(x_i, y') + Δ(y_i, y') − ξ_i for every incorrect ranking y' (with y_{jk} ∈ {−1, +1} as before), where Δ(y_i, y') = 1 − AvgPrec(y'). The sum of slacks is a smooth upper bound on MAP loss. [Yue et al., SIGIR 2007]

31 Too Many Constraints! For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g., (R, R, R, N, N, …). An incorrect labeling is any other ranking, e.g., one with a non-relevant document interleaved among the relevant ones. The slide's example ranking has an Average Precision of about 0.8, so Δ(y') ≈ 0.2. There is an intractable number of rankings, and thus an intractable number of constraints!

32 Finding Most Violated Constraint Observations: MAP is invariant to the order of documents within a relevance class –Swapping two relevant (or two non-relevant) documents does not change MAP. The joint SVM score is optimized by sorting by document score, w^T x_j. The problem reduces to finding an interleaving between two sorted lists of documents. [Yue et al., SIGIR 2007]

33–37 Finding Most Violated Constraint (greedy interleaving, shown step by step) ► Start with the perfect ranking. Consider swapping adjacent relevant/non-relevant documents. Find the best feasible ranking of the first non-relevant document. Repeat for the next non-relevant document; never swap it past the previous non-relevant document. Repeat until all non-relevant documents have been considered. [Yue et al., SIGIR 2007]
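A sketch of this procedure. For clarity it re-evaluates the objective Δ(y_i, y') + w^T Ψ(x_i, y') at each candidate position instead of computing the incremental swap deltas used in the actual algorithm; `cut[k]` counts how many relevant documents stay above non-relevant document k.

import numpy as np

def avg_prec(labels):
    """labels in rank order, True = relevant; assumes >= 1 relevant doc."""
    hits, total = 0, 0.0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits

def most_violated_map(w, X_rel, X_non):
    s_rel = np.sort(X_rel @ w)[::-1]   # relevant doc scores, descending
    s_non = np.sort(X_non @ w)[::-1]   # non-relevant doc scores, descending
    R, N = len(s_rel), len(s_non)

    def objective(cut):
        # pairwise part of w^T Psi: +(s_i - s_k) when relevant doc i is
        # ranked above non-relevant doc k, -(s_i - s_k) otherwise
        pair = sum(np.sum(s_rel[:c] - s_non[k]) - np.sum(s_rel[c:] - s_non[k])
                   for k, c in enumerate(cut)) / (R * N)
        # build the rel/non-rel pattern implied by `cut` to get the AP loss
        labels = []
        for i in range(R + 1):
            labels += [False] * sum(1 for c in cut if c == i)
            if i < R:
                labels.append(True)
        return (1.0 - avg_prec(labels)) + pair

    cut = [R] * N     # perfect ranking: every non-rel doc below all rel docs
    lower = 0         # never swap past the previously placed non-rel doc
    for k in range(N):
        best = max(range(lower, R + 1),
                   key=lambda c: objective(cut[:k] + [c] + cut[k + 1:]))
        cut[k] = best
        lower = best
    return cut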


39 Need for Diversity (in IR) Ambiguous Queries –Different information needs using the same query (e.g., “Jaguar”) –Want at least one relevant result for each information need Learning Queries –User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008] –Want results with high information diversity

40–46 Choose top 3 documents (animation) Individual Relevance: D3, D4, D1 Greedy Coverage Solution: D3, D1, D5


47–48 Example

Document word counts (X = document contains the word):

          V1   V2   V3   V4   V5
    D1    .    .    X    X    X
    D2    .    X    .    X    X
    D3    X    X    X    X    .

Word benefits: V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Marginal benefit:

              D1   D2   D3   Best
    Iter 1    12   11   10   D1
    Iter 2    --    2    3   D3

49 Prior Work Essential Pages [Swaminathan et al., 2008] –Uses a fixed function of word benefit –Depends on word frequency in the candidate set (a local version of TF-IDF) –Frequent words get low weight (not important for diversity) –Rare words get low weight (not representative)

50 Joint Feature Map x = (x_1, x_2,…,x_n) – candidate documents y – subset of x V(y) – union of words from the documents in y Joint feature map: Ψ(x, y) = Σ_{v ∈ V(y)} φ(v, x) φ(v, x) – frequency features (e.g., >10%, >20%, etc.) The benefit of covering word v is then w^T φ(v, x) [Yue & Joachims, ICML 2008]

51 Joint Feature Map Does NOT reward redundancy –The benefit of each word is only counted once Greedy inference has a (1 − 1/e)-approximation bound –Due to h(x) being monotone submodular Linear (joint feature space) –Allows for SVM optimization (A more sophisticated discriminant was used in the experiments.) [Yue & Joachims, ICML 2008]
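A sketch of the greedy inference; `docs` (document id -> word set) and `word_benefit` (word -> the learned benefit w^T φ(v, x)) are illustrative container choices.

def greedy_coverage(docs, word_benefit, k):
    """Pick k documents maximizing the total benefit of the union of
    covered words. Monotone submodularity of this coverage objective is
    what gives greedy its (1 - 1/e) approximation guarantee."""
    chosen, covered = [], set()
    for _ in range(k):
        def marginal(d):
            # benefit of the words d adds beyond those already covered
            return sum(word_benefit[v] for v in docs[d] - covered)
        best = max((d for d in docs if d not in chosen), key=marginal)
        chosen.append(best)
        covered |= docs[best]
    return chosen

# Reproduces the worked example on slides 47-48:
# docs = {'D1': {'V3','V4','V5'}, 'D2': {'V2','V4','V5'},
#         'D3': {'V1','V2','V3','V4'}}
# benefit = {'V1': 1, 'V2': 2, 'V3': 3, 'V4': 4, 'V5': 5}
# greedy_coverage(docs, benefit, 2)  ->  ['D1', 'D3']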

52 Weighted Subtopic Loss Example: –x_1 covers t_1 –x_2 covers t_1, t_2, t_3 –x_3 covers t_1, t_3 Motivation –Higher penalty for not covering popular subtopics

    Subtopic   # Docs   Loss
    t1         3        1/2
    t2         1        1/6
    t3         2        1/3

[Yue & Joachims, ICML 2008]
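The loss in the table translates into a few lines (illustrative names):

def weighted_subtopic_loss(covered, subtopic_counts):
    """Each missed subtopic t is penalized in proportion to the number of
    candidate documents covering it, so popular subtopics cost more to miss."""
    total = sum(subtopic_counts.values())
    return sum(n / total for t, n in subtopic_counts.items() if t not in covered)

# Example from the slide: counts {t1: 3, t2: 1, t3: 2} give per-subtopic
# losses 3/6 = 1/2, 1/6, and 2/6 = 1/3 respectively.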

53 Finding Most Violated Constraint Encode each subtopic as an additional “word” to be covered. Then use the same greedy algorithm, now maximizing the loss-augmented objective Δ(y_i, y') + w^T Ψ(x_i, y').

54 TREC 6–8 Interactive Track Retrieving 5 documents (results figure not captured in the transcript)

55 Sentiment Classification Product reviews, movie reviews, political speeches, discussion forums What is the sentiment, and why? –What are the supporting sentences?

56 Identifying the Supporting Sentences Suppose we could extract the supporting sentences. (The slide's example shows classification accuracy improving from 87% to 98%.) How can we do this automatically?

57 Joint Feature Map x = (x_1, x_2,…,x_n) – sentences of an article s – subset of supporting sentences y – sentiment label {+1, −1} Joint feature map: e.g., Ψ(x, s, y) = (y / |s|) Σ_{j ∈ s} x_j, a normalized sum of the selected sentences' features, signed by the label. [Yessenalina, Yue & Cardie, EMNLP 2010]

58 Inference h(x|w) = argmax_y max_s w^T Ψ(x, s, y) –jointly pick the label and the supporting sentences that maximize the score. [Yessenalina, Yue & Cardie, EMNLP 2010]

59 Latent Variable SVMs Input: x = (x_1, x_2,…,x_n) – sentences of an article Output: y (either +1 or -1), s (supporting sentences) Training: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to: max_s w^T Ψ(x_i, s, y_i) ≥ max_s w^T Ψ(x_i, s, y') + Δ(y_i, y') − ξ_i for all y' ≠ y_i. Not convex! (The max over the latent variable on the left-hand side makes the constraint a difference of convex functions.) [Yu & Joachims, 2009] [Yessenalina, Yue & Cardie, 2010]

60 Training Using CCCP First initialize s_(i) for each training instance. (1) Train with the latent variables fixed: minimize (1/2)||w||^2 + C Σ_i ξ_i s.t. w^T Ψ(x_i, s_(i), y_i) ≥ max_s w^T Ψ(x_i, s, y') + Δ(y_i, y') − ξ_i. (2) Infer s_(i) from w: s_(i) = argmax_s w^T Ψ(x_i, s, y_i). Repeat (1) and (2). Step (1) requires finding the most violated constraint. [Yu & Joachims, 2009] [Yessenalina, Yue & Cardie, 2010]
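A sketch of the alternation; `train_structural_svm` and `infer_latent` are hypothetical stand-ins for the solvers described on the surrounding slides, and the all-sentences initialization is only one possible choice.

def cccp_train(examples, infer_latent, train_structural_svm, n_rounds=10):
    """infer_latent(w, x, y) fills in the best supporting-sentence set s
    for a labeled example; train_structural_svm solves the (now convex)
    structural SVM with the latent variables held fixed."""
    # initialize s_(i) for each training instance (e.g., all sentences)
    latents = [set(range(len(x))) for x, y in examples]
    w = None
    for _ in range(n_rounds):
        # (1) train with s_(i) fixed; uses most-violated-constraint search
        w = train_structural_svm(examples, latents)
        # (2) re-infer s_(i) from the new w
        latents = [infer_latent(w, x, y) for x, y in examples]
    return w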

61 Movie Reviews Corpus

    Initialization         Method                      Accuracy
    Baseline               Standard SVM                88.6%
    Annotated Rationales   Zaidan et al., 2007         92.2%
                           SVM-sle                     93.2%
    Opinion Finder         Yessenalina et al., 2009    91.8%
                           SVM-sle                     92.5%

Congressional Debates Corpus

    Initialization         Method                      Accuracy
    Baseline               Standard SVM                70.0%
    Opinion Finder         Thomas et al., 2006         71.3%
                           Bansal et al., 2008         75.0%
                           SVM-sle                     77.7%

62 (example article figure) Green sentences denote the most positive supporting sentences; underlined sentences denote the least subjective sentences.

63 Structural SVMs Train parameterized structured prediction models. Requires 4 ingredients: –Joint feature map (linear) –Inference –Loss function –Finding the most violated constraint Applications in information retrieval: –Optimizing Mean Average Precision in rankings –Optimizing diversity –Predicting sentiment with latent explanations Work supported by NSF IIS-0713483, a Microsoft Fellowship, and a Yahoo! KSC Award.

