“By the User, For the User, With the Learning System”: Learning from User Interactions. Karthik Raman, December 12, 2014. Joint work with Thorsten Joachims, Pannaga Shivaswamy and Tobias Schnabel.


1 “By the User, For the User, With the Learning System”: Learning from User Interactions. Karthik Raman. December 12, 2014. Joint work with Thorsten Joachims, Pannaga Shivaswamy and Tobias Schnabel.

2 Age of the Web & Data. Learning is important for today's information systems: search engines, recommendation systems, social networks and news sites, smart homes, robots, and more. It is difficult to collect expert labels for learning; instead, learn from the user's interactions. User feedback is timely, plentiful and easy to get, and it reflects the user's (not experts') preferences.

3 Interactive Learning with Users. Users and the system work jointly on the task (same goal); the system is not a passive observer of the user, and the two complement each other: the SYSTEM (e.g., a search engine) is good at computation but knowledge-poor, while the USER(s) are poor at computation but knowledge-rich. The system takes an action (e.g., presents a ranking); the user interacts and provides feedback (e.g., clicks). We need to develop learning algorithms in conjunction with plausible models of user behavior.

4 Agenda for This Talk. Designing algorithms for interactive learning with users that are applicable in practice and have theoretical guarantees. Outline: 1. Handling weak, noisy and biased user feedback (Coactive Learning). 2. Predicting complex structures: modeling dependence across items/documents (Diversity).

5 Agenda for This Talk. Designing algorithms for interactive learning with users that are applicable in practice and have theoretical guarantees. Outline: 1. Handling weak, noisy and biased user feedback (Coactive Learning) [RJSS ICML'13]. 2. Predicting complex structures: modeling dependence across items/documents (Diversity).

6 User Feedback? (Building a search engine for arXiv.) NOISE: a document may receive some clicks even if it is irrelevant. CONTEXT BIAS: a click on a document may just reflect the poor quality of the surrounding documents. POSITION BIAS: the higher a document is ranked, the more clicks it gets; a click shows the document is better than those above it, but says nothing about those below. [Joachims et al., TOIS '07]

7 Implicit Feedback from the User. A click on the presented ranking implies an improved ranking: move the clicked document up.

8 Coactive Learning Model. The SYSTEM (e.g., a search engine) receives a context x_t (e.g., a query) and presents an object y_t (e.g., a ranking); from the USER's interaction it receives an improved object ȳ_t. The user has utility U(x_t, y_t). COACTIVE: U(x_t, ȳ_t) ≥ α U(x_t, y_t). Contrast with the feedback assumed by other online learning models. FULL INFORMATION: U(x_t, y_1), U(x_t, y_2), ... BANDIT: U(x_t, y_t). OPTIMAL: y*_t = argmax_y U(x_t, y).

9 Preference Perceptron.
1. Initialize weight vector w.
2. Get context x and present the best y (as per the current w).
3. Get feedback and construct the (move-to-top) feedback object.
4. Perceptron update to w: w += Φ(feedback) - Φ(presented).
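The four steps above can be sketched in a few lines; the feature map `phi`, the candidate set, and the feedback construction are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch of one Preference Perceptron round. Objects are scored by a
# linear utility w . phi(y); the feedback object comes from the user's clicks.
def present(w, candidates, phi):
    # Step 2: present the highest-scoring object under the current weights.
    return max(candidates, key=lambda y: sum(wi * fi for wi, fi in zip(w, phi(y))))

def update(w, phi_feedback, phi_presented):
    # Step 4: w += Phi(feedback) - Phi(presented)
    return [wi + fb - pr for wi, fb, pr in zip(w, phi_feedback, phi_presented)]
```

With move-to-top feedback, Φ(feedback) differs from Φ(presented) only in the features affected by promoting the clicked document, so the update nudges w toward what the user preferred.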

10 Theoretical Analysis. Analyze the algorithm's regret, i.e., the total sub-optimality relative to the optimal prediction y*_t. Characterize feedback as α-informative. This is not an assumption: all user feedback can be characterized this way. α indicates the quality of the feedback, and ξ_t is the slack variable (i.e., how far the received feedback falls below α quality).
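The formulas on this slide are images in the transcript; a reconstruction of the two definitions, consistent with the coactive learning papers the talk cites:

```latex
% Average regret after T rounds, relative to the optimal predictions y_t^*:
\mathrm{REG}_T = \frac{1}{T}\sum_{t=1}^{T}\bigl(U(x_t, y_t^*) - U(x_t, y_t)\bigr)

% \alpha-informative feedback: the improved object \bar{y}_t closes at least
% an \alpha-fraction of the gap to the optimum, up to a slack \xi_t \ge 0:
U(x_t, \bar{y}_t) - U(x_t, y_t) \;\ge\; \alpha\,\bigl(U(x_t, y_t^*) - U(x_t, y_t)\bigr) - \xi_t
```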

11 Regret Bound for the Preference Perceptron. For any α and w* satisfying the α-informative characterization, the algorithm's regret splits into a slack component and a term that converges as √T (the same rate as with optimal feedback). The bound is independent of the number of dimensions and changes gracefully with α.
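The bound itself is an image on the original slide; a reconstruction of its shape, assuming a linear utility U(x, y) = w*ᵀφ(x, y) with ‖φ(x, y)‖ ≤ R:

```latex
\mathrm{REG}_T \;\le\;
\underbrace{\frac{1}{\alpha T}\sum_{t=1}^{T}\xi_t}_{\text{slack component}}
\;+\;
\underbrace{\frac{2R\,\|w^*\|}{\alpha\sqrt{T}}}_{\text{converges as } 1/\sqrt{T}}
```

Note how both terms scale as 1/α (graceful degradation with feedback quality) and neither depends on the feature dimension, matching the slide's annotations.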

12 How Does It Do in Practice? Performed a user study on full-text search on arxiv.org. Goal: learning a ranking function. Win ratio: interleaved comparison with a (non-learning) baseline; a higher ratio is better (1 indicates similar performance). The Preference Perceptron performs poorly and is not stable: the feedback received has large slack values (for any reasonably large α).

13 Illustrative Example. Say the user is an imperfect judge of relevance, with a 20% error rate. Documents d_1, d_2, ..., d_N, of which d_1 is the only relevant one. Feature values: Φ(d_1) = (1, 0) and Φ(d_{2..N}) = (0, 1).

14 Illustrative Example (contd.). The Preference Perceptron oscillates, and averaging or regularization cannot help either. For N = 10, averaged over 1000 runs, the average rank of the relevant document:

Method                           Avg. rank of rel. doc
Preference Perceptron            9.36
Averaged Preference Perceptron   9.37
3PR (our method)                 2.08

15 Key Idea: Perturbation. What if we randomly swap adjacent pairs, e.g., the first two results, and update only when the lower document of a pair is clicked? The algorithm becomes stable: swapping reinforces the correct w at the small cost of presenting a sub-optimal object.

16 Perturbed Preference Perceptron for Ranking (3PR).
1. Initialize weight vector w.
2. Get context x and find the best y (as per the current w).
3. Perturb y and present a slightly different solution y': swap adjacent pairs with probability p_t.
4. Observe user feedback and construct pairwise feedback.
5. Perceptron update to w: w += Φ(feedback) - Φ(presented).
A constant p_t = 0.5 can be used, or p_t can be determined dynamically.
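Steps 2 through 5 can be sketched as follows; `phi` and the document list are illustrative stand-ins, and the pairwise update shown is the move-up signal for a clicked lower document:

```python
import random

def rank_3pr(w, docs, phi, p=0.5):
    # Step 2: rank by the current weight vector (linear scores, descending).
    y = sorted(docs, key=lambda d: sum(wi * fi for wi, fi in zip(w, phi(d))),
               reverse=True)
    # Step 3: perturb by swapping each adjacent pair (positions 0-1, 2-3, ...)
    # with probability p.
    for i in range(0, len(y) - 1, 2):
        if random.random() < p:
            y[i], y[i + 1] = y[i + 1], y[i]
    return y

def update_3pr(w, phi_clicked, phi_above):
    # Steps 4-5: update only when the LOWER document of a pair is clicked,
    # since that click is evidence it should outrank the document above it.
    return [wi + c - a for wi, c, a in zip(w, phi_clicked, phi_above)]
```

Swapping only adjacent pairs keeps the presented ranking close to the model's best guess, which is why the perturbation costs only a vanishing term in the regret bound.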

17 3PR Regret Bound. Under the α-informative feedback characterization, the regret can be bounded: 3PR obtains better ξ_t values (lower slacks) than the Preference Perceptron, at the cost of a vanishing term.

18 How Well Does It Work? The arXiv study was repeated, now with 3PR: its cumulative win ratio over the (non-learning) baseline grows with the number of feedback interactions.

19 Does This Work? The system has been running for more than a year with no manual intervention, and 3PR's cumulative win ratio stays above the baseline. [Raman et al., 2013]

20 Agenda for This Talk. Designing algorithms for interactive learning with users that are applicable in practice and have theoretical guarantees. Outline: 1. Handling weak, noisy and biased user feedback (Coactive Learning). 2. Predicting complex structures: modeling dependence across items/documents (Diversity) [RSJ KDD'12].

21 Intrinsically Diverse User. A single user with multiple interests, e.g., economy, sports and technology.

22 Challenge: Redundancy. A lack of diversity leads to some of the user's interests being ignored, e.g., a results page full of economy stories with nothing about sports or tech.

23 Previous Work. Extrinsic diversity. Non-learning approaches: MMR (Carbonell et al., SIGIR '98), Less is More (Chen et al., SIGIR '06). Learning approaches: SVM-Div (Yue and Joachims, ICML '08), which requires relevance labels for all user-document pairs. Ranked Bandits (Radlinski et al., ICML '08) use online learning with an array of (decoupled) multi-armed bandits, but learn very slowly in practice. Slivkins et al. (JMLR '13) couple the arms together, but do not generalize across queries and hard-code the notion of diversity, which cannot be adjusted. Linear Submodular Bandits (Yue et al., NIPS '12) generalize across queries but require cardinal utilities.

24 Modeling Dependencies Using Submodular Functions. KEY: for a given query and word, the marginal benefit of additional documents diminishes (e.g., a coverage function). Use the greedy algorithm: at each iteration, choose the document that maximizes the marginal benefit. This is simple and efficient, and gives a constant-factor approximation.

25 Predicting Diverse Rankings. A diversity-seeking user. Document word counts: d1: economy 3, usa 4, finance 2; d2: usa 3, soccer 2, world cup 2; d3: usa 4, politics 3, economy 2; d4: gadgets 2, technology 4, ipod 2. Word weights: economy 1.5, usa 1.2, soccer 1.6, technology 1.1.

26 Predicting Diverse Rankings: MAX(x). Word weights: economy 1.5, usa 1.2, soccer 1.6, technology 1.5. Initial marginal benefits: d1: 9.3, d2: 6.8, d3: 7.8, d4: 6.0.

27 Predicting Diverse Rankings (contd.). Greedy selects d1 (marginal benefit 9.3). The per-word MAX of column for the ranking (d1) is: economy 3, usa 4, soccer 0, technology 0.

28 Predicting Diverse Rankings (contd.). With d1's words covered, the marginal benefits become: d1: 0.0, d2: 3.2, d3: 0.0, d4: 6.0.

29 Predicting Diverse Rankings (contd.). Greedy next selects d4 (marginal benefit 6.0). The per-word MAX for the ranking (d1, d4) is: economy 3, usa 4, soccer 0, technology 4.

30 Predicting Diverse Rankings (contd.). Finally, greedy adds d2 (marginal benefit 3.2), giving the ranking (d1, d4, d2) with per-word MAX: economy 3, usa 4, soccer 2, technology 4. Other submodular functions that are less stringent in penalizing redundancy, e.g. log() or sqrt(), can also be used.
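The greedy walkthrough above can be reproduced in a few lines; this sketch uses the word counts from the example and the weights from the MAX(x) slides (economy 1.5, usa 1.2, soccer 1.6, technology 1.5), with unlisted words weighted 0:

```python
# Greedy ranking under the MAX-coverage submodular utility.
WEIGHTS = {"economy": 1.5, "usa": 1.2, "soccer": 1.6, "technology": 1.5}
DOCS = {
    "d1": {"economy": 3, "usa": 4, "finance": 2},
    "d2": {"usa": 3, "soccer": 2, "world cup": 2},
    "d3": {"usa": 4, "politics": 3, "economy": 2},
    "d4": {"gadgets": 2, "technology": 4, "ipod": 2},
}

def marginal(doc, covered):
    # Weighted increase of the per-word MAX over the ranking chosen so far.
    return sum(WEIGHTS.get(w, 0.0) * max(0, c - covered.get(w, 0))
               for w, c in doc.items())

def greedy_ranking():
    ranking, covered = [], {}
    while len(ranking) < len(DOCS):
        best = max((d for d in DOCS if d not in ranking),
                   key=lambda d: marginal(DOCS[d], covered))
        ranking.append(best)
        for w, c in DOCS[best].items():
            covered[w] = max(covered.get(w, 0), c)
    return ranking
```

The first-round marginal benefits come out to 9.3, 6.8, 7.8 and 6.0 as on the slides, and the greedy order is d1, d4, d2, d3.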

31 Diversifying Perceptron.
1. Initialize weight vector w.
2. Get context x and find the best y (as per the current w), using the greedy algorithm to make the prediction.
3. Observe the user's implicit feedback (clicks on the presented ranking y) and construct the feedback object (an improved ranking y').
4. Perceptron update to w: w += Φ(feedback) - Φ(presented).
5. Clip weights to ensure non-negativity.
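Steps 4 and 5 can be sketched as below; the aggregate feature vectors of the two rankings are passed in as plain lists, and the clipping step is what distinguishes this from the plain Preference Perceptron update:

```python
# Diversifying Perceptron update (sketch): perceptron step, then clip.
def diversifying_update(w, phi_feedback, phi_presented):
    # Step 4: w += Phi(feedback) - Phi(presented)
    w = [wi + fb - pr for wi, fb, pr in zip(w, phi_feedback, phi_presented)]
    # Step 5: clip to keep weights non-negative, so the learned coverage
    # utility stays monotone submodular and greedy prediction remains valid.
    return [max(0.0, wi) for wi in w]
```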

32 Diversifying Perceptron: Regret Bound. Under the same feedback characterization, the regret w.r.t. the optimal solution can be bounded; the bound contains an additional term due to the greedy approximation.

33 Can We Learn to Diversify? Submodularity helps cover more intents.

34 Other Results. Robust and efficient: robust to noise, to weakly informative feedback, and to model misspecification. Achieves the performance of supervised learning, despite not being provided the true labels and receiving only partial feedback.

35 Other Applications of Coactive Learning.

36 Extrinsic Diversity: Predicting Socially Beneficial Rankings. Social Perceptron algorithms: improved convergence rates for single-query diversification over the state of the art, and the first algorithm for (extrinsic) diversification across queries using human interaction data. [RJ ECML '14]

37 Robotics: Trajectory Planning. Learn good trajectories for manipulation tasks on the fly. [Jain et al., NIPS '13]

38 Future Directions.

39 Personalized Education. MOOCs generate a lot of student interactions: lectures and material, forum participation, peer grading [RJ KDD '14, LAS '15], and question answering and practice tests. Goal: maximize student learning of concepts. Challenges: testing on concepts students have difficulty with, and keeping students engaged (motivated).

40 Recommender Systems. Collaborative filtering / matrix factorization. Challenges: learning from observed user actions (biased preferences rather than cardinal utilities), and bilinear utility models for leveraging feedback to help other users as well.

41 Short-Term Personalization. This talk was mostly about long-term personalization, but we can also personalize based on shorter-term context. Complex search tasks require multiple user searches; for example, a query like "remodeling ideas" is often followed by queries like "cost of typical remodel", "kitchen remodel", "paint colors", etc. [RBCT SIGIR '13]. Challenge: less signal to learn from.

42 Summary. Studied how to work with noisy, biased feedback, and how to model item dependencies and learn complex structures. Designed algorithms for interactive learning with users that work well in practice and have theoretical guarantees: robustness to noise, biases and model misspecification; efficient algorithms that learn fast; end-to-end live evaluation; and theoretical analysis of the algorithms (which helps debugging!).

43 Thank You! Questions?

44 References.
A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: learning optimally diverse rankings over large document collections. JMLR, 2013.
Y. Yue and C. Guestrin. Linear submodular bandits and their application to diversified retrieval. NIPS, 2012.
F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. ICML, 2008.
P. Shivaswamy and T. Joachims. Online structured prediction via coactive learning. ICML, 2012.

45 References (contd.).
T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 2007.
Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. ICML, 2008.
J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR, 1998.
H. Chen and D. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. SIGIR, 2006.

46 References (contd.).
K. Raman, P. Shivaswamy, and T. Joachims. Online learning to diversify from implicit feedback. KDD, 2012.
K. Raman, T. Joachims, P. Shivaswamy, and T. Schnabel. Stable coactive learning via perturbation. ICML, 2013.
K. Raman and T. Joachims. Learning socially optimal information systems from egoistic users. ECML, 2013.

47 Effect of Swap Probability. Robust to changes in the swap probability: even some swapping helps, and the dynamic strategy performs best.

48 Benchmark Results. On the Yahoo! search dataset; PrefP[pair] is 3PR without perturbation. 3PR performs well.

49 Effect of Noise. Robust to noise: minimal change in performance, while other algorithms are more sensitive.

50 Effect of Perturbation. Perturbation has only a small effect, even for a fixed p (p = 0.5).

51 Stability on arXiv. Few common results in the top 10 after 100 learning iterations.

52 General Proof Technique. Bound the 2-norm of the weight vector w_T; relate the inner product of w* and w_T to the regret; then use the feedback characterization.
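The three steps above follow the standard perceptron-style argument; a sketch, assuming w_1 = 0 and ‖φ(x, y)‖ ≤ R:

```latex
% Step 1: bound the norm. Since y_t maximizes w_t^\top \phi(x_t, \cdot),
% the cross term of each update is non-positive, so each round adds at most 4R^2:
\|w_{t+1}\|^2 \le \|w_t\|^2 + 4R^2 \quad\Rightarrow\quad \|w_{T+1}\| \le 2R\sqrt{T}

% Step 2: relate the inner product to the regret. Each update adds the
% utility gain of the feedback:
{w^*}^\top w_{T+1} = \sum_{t=1}^{T} \bigl( U(x_t, \bar{y}_t) - U(x_t, y_t) \bigr)

% Step 3: lower-bound this sum via \alpha-informativeness, upper-bound the
% inner product by Cauchy-Schwarz ({w^*}^\top w_{T+1} \le \|w^*\|\,\|w_{T+1}\|),
% and solve for the regret.
```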

53 Coactive Learning in Real Systems.

54 Feature Aggregation. Different submodular aggregation functions can be combined, each with its own learned weight per word. For the ranking (d1, d4, d2):

Word         d1  d4  d2  MAX of col.  Col. sum  SQRT of col. sum
economy       3   0   0            3         3              1.73
usa           4   0   3            4         7              2.65
soccer        0   0   2            2         2              1.41
technology    0   4   4            4         8              2.82

Per-word weights (MAX / SQRT / COL SUM): economy 1.5 / 3.7 / 0.5; usa 1.2 / 4.8 / 2.3; soccer 1.6 / 3.2 / 4.1; technology 1.5 / 4.9 / 0.4.

55 General Submodular Utility (CIKM '11). Given a ranking θ = (d_1, d_2, ..., d_k) and a concave function g.

56 Use Implicit Feedback. A click on the presented ranking y yields an improved ranking y'.

57 Robustness to Model Mismatch. Works even if the modeling function and the user's utility function mismatch.

58 Effect of Feedback Quality.

59 Effect of Feedback Noise.

60 Comparison to Supervised Learning.

61 Bandits for Ranking. The top-K bandits problem: in each iteration, play K distinct arms. Probabilistic feedback: MAB algorithms assume feedback is received every round; if feedback is not assured each round, then in rounds with no feedback it is better to exploit. Key ideas: a dynamic explore/exploit tradeoff that incorporates the uncertainty of receiving feedback.

