Presenting work by various authors, and own work in collaboration with colleagues at Microsoft and the University of Amsterdam.

Presentation transcript:

1 Presenting work by various authors, and own work in collaboration with colleagues at Microsoft and the University of Amsterdam

2 Example task: find the best news articles for a user context; optimize click-through rate. Example task: tune ad display parameters (e.g., mainline reserve) to optimize revenue. Example task: improve ranking of query auto-completion (QAC) to optimize suggestion usage. Typical approach: lots of offline tuning + A/B testing.

3 [Kohavi et al. ’09, ‘12] Example: which search interface results in higher revenue?

4

5

6 Image adapted from: https://www.flickr.com/photos/prayitnophotography/

7

8 Address the key challenge: how to balance exploration and exploitation – explore to learn, exploit to benefit from what has been learned. This is a reinforcement learning problem in which actions do not affect future states.

9 Example

10

11 Both arms are promising; uncertainty is higher for C. Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.
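One classic rule for balancing expected payoff against uncertainty is an upper confidence bound, e.g. UCB1. The slide does not name a specific algorithm, so the following is an illustrative sketch with a made-up click simulation, not the method from the talk:

```python
import math
import random

def ucb1(pull, n_arms, n_rounds):
    """Pick the arm maximizing mean reward plus an uncertainty bonus."""
    counts = [0] * n_arms   # times each arm has been pulled
    sums = [0.0] * n_arms   # cumulative reward per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1     # initialize: pull each arm once
        else:
            # exploit high empirical means, explore rarely-pulled arms
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return counts, sums

# Toy simulation: arm 1 has the highest click probability, so it should
# end up pulled most often while the others are still explored a little.
random.seed(0)
probs = [0.1, 0.8, 0.3]
counts, sums = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                    n_arms=3, n_rounds=2000)
```

The bonus term shrinks as an arm is pulled more, which is exactly the "higher uncertainty for C" intuition on the slide: an arm with few pulls keeps a large bonus and stays competitive.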

12 [Li et al. ‘12]

13 Contextual bandits [Li et al. ‘12] Example results: Balancing exploration and exploitation is crucial for good results.
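The Li et al. line of work is in the LinUCB family: explore over a small set of arms while learning a payoff model in a large context-feature space. The sketch below is a minimal illustration under assumed conditions (the linear payoff model, arm set, and toy reward function are made up, not the paper's setup):

```python
import numpy as np

def linucb_choose(contexts, A, b, alpha=1.0):
    """Score each arm by predicted payoff plus an uncertainty bonus.

    A[a], b[a] hold per-arm ridge-regression statistics; contexts[a]
    is the feature vector describing arm a in the current context."""
    scores = []
    for a, x in enumerate(contexts):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                    # ridge estimate per arm
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # prediction uncertainty
        scores.append(theta @ x + bonus)
    return int(np.argmax(scores))

def linucb_update(A, b, arm, x, reward):
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Toy run with known linear payoffs, so the learned weights can be checked.
rng = np.random.default_rng(0)
true_theta = [np.array([0.8, 0.1]), np.array([0.2, 0.7])]
d, n_arms = 2, 2
A = [np.eye(d) for _ in range(n_arms)]
b = [np.zeros(d) for _ in range(n_arms)]
for _ in range(1000):
    contexts = [rng.random(d) for _ in range(n_arms)]
    arm = linucb_choose(contexts, A, b)
    x = contexts[arm]
    linucb_update(A, b, arm, x, true_theta[arm] @ x)
```

The action space stays small (two arms), but the model generalizes across contexts through the shared feature space — the point made on slide 14.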

14 1) Balance exploration and exploitation, to ensure continued learning while applying what has been learned 2) Explore in a small action space, but learn in a large contextual space

15 Illustrated Sutra of Cause and Effect "E innga kyo" by Unknown - Woodblock reproduction, published in 1941 by Sinbi-Shoin Co., Tokyo. Licensed under Public domain via Wikimedia Commons.

16 Problem: estimate the effects of mainline reserve changes. [Bottou et al. ’13]

17 controlled experiment counterfactual reasoning

18 Key idea: estimate what would have happened if a different system (a different distribution over parameter values) had been used, using importance sampling. Step 1: factorize the joint distribution based on the known causal graph. Step 2: compute the estimates by importance sampling, reweighting logged outcomes by the ratio of the target and logging distributions. [Bottou et al. ’13, Precup et al. ’00]
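Step 2 can be sketched in a few lines. The actions, probabilities, and reward model below are invented for illustration (the real system samples mainline reserve values from continuous distributions); the estimator itself is standard inverse-propensity-scored importance sampling:

```python
import random

def counterfactual_estimate(logs, target_prob, logging_prob):
    """Importance-sampling estimate of the reward a *different* parameter
    distribution would have obtained, computed from logged data alone.

    logs: (action, reward) pairs collected under the logging distribution;
    the ratio target/logging reweights each logged sample."""
    total = 0.0
    for action, reward in logs:
        w = target_prob(action) / logging_prob(action)  # importance weight
        total += w * reward
    return total / len(logs)

# Toy example: two hypothetical mainline-reserve settings, 0 and 1.
# Logging policy picks each with probability 0.5; setting 1 pays more.
random.seed(0)
reward_mean = {0: 0.2, 1: 0.6}
logs = []
for _ in range(10000):
    a = random.randint(0, 1)
    r = 1.0 if random.random() < reward_mean[a] else 0.0
    logs.append((a, r))

# Estimate the reward of a target policy that picks setting 1 w.p. 0.9.
# True value: 0.9 * 0.6 + 0.1 * 0.2 = 0.56.
est = counterfactual_estimate(logs,
                              target_prob=lambda a: 0.9 if a == 1 else 0.1,
                              logging_prob=lambda a: 0.5)
```

Because only the ratio of the two distributions enters, all other factors in the causal factorization cancel — this is why Step 1's factorization makes the estimate computable from logs.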

19 [Bottou et al. ’13] Counterfactual reasoning allows analysis over a continuous range of parameter values.

20 1) Leverage known causal structure and importance sampling to reason about “alternative realities” 2) Bound estimator error to distinguish between uncertainty due to low sample size and exploration coverage
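Point 2 above can be illustrated with a clipped estimator. This is a simplified sketch in the spirit of the bounded estimators in [Bottou et al. ’13], not their exact bound: the confidence-interval width reflects sample-size uncertainty, while the fraction of clipped samples signals poor exploration coverage.

```python
import math
import random

def clipped_ips(logs, target_prob, logging_prob, clip=10.0):
    """Clipped importance sampling with a crude normal-approx CI.

    Samples whose importance weight exceeds `clip` contribute nothing;
    a large clipped fraction means the logging policy barely explored
    the region the target policy favors."""
    vals = []
    n_clipped = 0
    for action, reward in logs:
        w = target_prob(action) / logging_prob(action)
        if w > clip:
            n_clipped += 1
            vals.append(0.0)          # clipped: dropped from the estimate
        else:
            vals.append(w * reward)
    n = len(logs)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    half_width = 2.0 * math.sqrt(var / n)   # ~95% CI on the kept mass
    return mean, half_width, n_clipped / n

# Toy usage: logging and target identical, so no weight exceeds the clip
# and all uncertainty is due to finite sample size.
random.seed(0)
logs = []
for _ in range(5000):
    a = random.randint(0, 1)
    r = 1.0 if random.random() < 0.5 else 0.0
    logs.append((a, r))
mean, half_width, clipped_frac = clipped_ips(
    logs, target_prob=lambda a: 0.5, logging_prob=lambda a: 0.5)
```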

21

22 Compare two rankings: 1) generate an interleaved (combined) ranking; 2) observe user clicks; 3) credit clicks to the original rankers to infer the outcome. [Slide figure: two rankings of documents 1–4 merged into one interleaved list.] Example: optimize QAC ranking.
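The three steps can be sketched with team-draft interleaving — one common interleaving scheme; the slide does not say which variant is used, so treat this as an illustrative choice:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Rankers alternately 'draft' their best remaining document into a
    combined list; each slot remembers which team contributed it."""
    interleaved, team = [], []
    while len(interleaved) < len(ranking_a):
        # coin flip decides which ranker picks first this round
        order = ['A', 'B'] if rng.random() < 0.5 else ['B', 'A']
        for t in order:
            ranking = ranking_a if t == 'A' else ranking_b
            doc = next(d for d in ranking if d not in interleaved)
            interleaved.append(doc)
            team.append(t)
            if len(interleaved) == len(ranking_a):
                break
    return interleaved, team

def credit_clicks(team, clicked_positions):
    """Infer the outcome: which ranker contributed more clicked docs."""
    a = sum(1 for i in clicked_positions if team[i] == 'A')
    b = sum(1 for i in clicked_positions if team[i] == 'B')
    return 'A' if a > b else 'B' if b > a else 'tie'

random.seed(0)
ranked_a = ['d1', 'd2', 'd3', 'd4']
ranked_b = ['d2', 'd3', 'd4', 'd1']
combined, team = team_draft_interleave(ranked_a, ranked_b)
```

Randomizing which team drafts first removes position bias between the two rankers, so credited clicks give an unbiased preference signal.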

23 Learning approach: dueling bandit gradient descent (DBGD) optimizes a weight vector for weighted-linear combinations of ranking features. A candidate ranker is generated by randomly sampling the unit sphere around the current best weight vector; relative listwise feedback comparing the two is obtained using interleaving. [Yue & Joachims ’09]
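A minimal sketch of one DBGD step, with the interleaving comparison abstracted into a `compare` callback (the toy oracle below, which prefers weights closer to a hidden target, stands in for real interleaved click feedback and is purely an assumption for testing):

```python
import numpy as np

def dbgd_step(w, compare, delta=0.3, lr=0.1, rng=None):
    """Perturb the current weight vector along a random unit direction,
    duel the candidate against the current ranker, and take a small step
    toward the candidate if it wins the comparison."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(w.shape)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    candidate = w + delta * u       # exploratory candidate ranker
    if compare(candidate, w):       # True if candidate wins interleaving
        w = w + lr * u              # exploit: small step in that direction
    return w

# Toy check: the 'interleaving' oracle prefers weights nearer a target.
rng = np.random.default_rng(0)
target = np.array([0.8, 0.2, 0.5])
better = lambda c, w: np.linalg.norm(c - target) < np.linalg.norm(w - target)
w = np.zeros(3)
for _ in range(300):
    w = dbgd_step(w, compare=better, rng=rng)
```

Note the two step sizes: the exploration step `delta` decides how different the candidate ranker is, while the smaller exploitation step `lr` keeps the live system from moving too aggressively on a single noisy comparison.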

24 Approach: candidate pre-selection (CPS) – generate many candidates and select the most promising one. [Hofmann et al. ’13c]
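The pre-selection idea can be sketched as follows. In the actual CPS method, candidates are compared on historical interaction data; the simple click-agreement score below is a stand-in for that, not the paper's estimator:

```python
import numpy as np

def preselect_candidate(w, history, n_candidates=10, delta=1.0, rng=None):
    """Sample several candidate rankers around w and keep the one scoring
    best on logged (features, clicked) data, so that the single live
    interleaving comparison is spent on a promising candidate."""
    rng = rng or np.random.default_rng()
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        u = rng.standard_normal(w.shape)
        u /= np.linalg.norm(u)
        cand = w + delta * u
        # proxy score: agreement between candidate scores and past clicks
        score = sum(clicked * (cand @ x) for x, clicked in history)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy usage: history of (feature vector, clicked) pairs.
rng = np.random.default_rng(1)
history = [(rng.random(3), rng.integers(0, 2)) for _ in range(20)]
w = np.zeros(3)
cand = preselect_candidate(w, history, rng=rng)
```

This reuses logged feedback as structure, which is why the slide reports that it improves sample efficiency over plain DBGD's single random candidate.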

25 Informational click model. [Hofmann et al. ’13b, Hofmann et al. ’13c] From earlier work: learning from relative listwise feedback is robust to noise. Here: adding structure dramatically improves performance further.

26 1) Avoid combinatorial action space by exploring in parameter space 2) Reduce variance using relative feedback 3) Leverage known structures for sample-efficient learning

27 Contextual bandits Systematic approach to balancing exploration and exploitation; contextual bandits explore in small action space but optimize in large context space. Counterfactual reasoning Leverages causal structure and importance sampling for “what if” analyses. Online learning to rank Avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.

28 Applications: assess action and solution spaces in a given application, collect and learn from exploration data, increase experimental agility. Try this (at home): try the open-source code samples; the Living labs challenge allows experimentation with online learning and evaluation methods. Challenge: labs.net/challenge/ Code: https://bitbucket.org/ilps/lerot

29

30

