Presenting work by various authors, and own work in collaboration with colleagues at Microsoft and the University of Amsterdam
Example task: Find best news articles based on user context; optimize click-through rate.
Example task: Tune ad display parameters (e.g., mainline reserve) to optimize revenue.
Example task: Improve ranking of QAC (query auto-completion) to optimize suggestion usage.
Typical approach: lots of offline tuning + A/B testing.
[Kohavi et al. ’09, ’12] Example: which search interface results in higher revenue?
Key challenge addressed: how to balance exploration and exploitation (explore to learn, exploit to benefit from what has been learned).
This is a reinforcement learning problem in which actions do not affect future states.
Both arms are promising; uncertainty is higher for C.
Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.
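One way to make "expected payoff plus uncertainty" concrete is the UCB1 rule, sketched below. This is an illustrative choice and not necessarily the algorithm shown on the slides; the function names `ucb1_select` and `run_ucb1` and the simulated click probabilities are assumptions made for this example.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    # Play every arm once before relying on the bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        # Mean payoff so far plus a bonus that shrinks as the arm is played more.
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(click_probs, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms (hypothetical click probabilities)."""
    rng = random.Random(seed)
    counts = [0] * len(click_probs)
    sums = [0.0] * len(click_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums
```

An arm with a lower observed mean can still be selected when it has been tried only rarely, which is the "higher uncertainty" case from the slide.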
[Li et al. ’12]
Contextual bandits [Li et al. ’12]
Example results: balancing exploration and exploitation is crucial for good results.
1) Balance exploration and exploitation, to ensure continued learning while applying what has been learned.
2) Explore in a small action space, but learn in a large contextual space.
Illustrated Sutra of Cause and Effect "E innga kyo" by Unknown - Woodblock reproduction, published in 1941 by Sinbi-Shoin Co., Tokyo. Licensed under Public domain via Wikimedia Commons.
Problem: estimate effects of mainline reserve changes. [Bottou et al. ’13]
Controlled experiment vs. counterfactual reasoning
Key idea: estimate what would have happened if a different system (distribution over parameter values) had been used, using importance sampling.
Step 1: factorize based on the known causal graph. This works because: [Bottou et al. ’13]
Step 2: compute estimates using importance sampling.
Example distributions: [Precup et al. ’00]
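The importance-sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator from the cited work: the logging and counterfactual distributions are both assumed to be Gaussians over a single parameter, the reward model in `simulate_logs` is made up for the demonstration, and all function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def counterfactual_estimate(logged, log_mean, log_std, new_mean, new_std):
    """Estimate the mean reward under a new parameter distribution.

    `logged` is a list of (parameter value, observed reward) pairs that were
    collected while sampling the parameter from N(log_mean, log_std).
    """
    total = 0.0
    for value, reward in logged:
        # Reweight each logged outcome by how much more (or less) likely the
        # new distribution is to pick this parameter value.
        weight = (gaussian_pdf(value, new_mean, new_std)
                  / gaussian_pdf(value, log_mean, log_std))
        total += weight * reward
    return total / len(logged)


def simulate_logs(n, log_mean, log_std, rng):
    """Hypothetical data: reward equals the parameter value plus small noise."""
    logs = []
    for _ in range(n):
        value = rng.gauss(log_mean, log_std)
        reward = value + rng.gauss(0.0, 0.1)
        logs.append((value, reward))
    return logs
```

With the toy reward model above, the expected reward under a distribution equals its mean, so an estimate for N(0.5, 0.8) computed from logs collected under N(0, 1) should land close to 0.5. The estimate is only reliable where the logging distribution explored enough, which is why the new distribution here is kept narrower than the logging one.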
[Bottou et al. ’13] Counterfactual reasoning allows analysis over a continuous range.
1) Leverage known causal structure and importance sampling to reason about “alternative realities”.
2) Bound estimator error to distinguish between uncertainty due to low sample size and exploration coverage.
Compare two rankings:
1) Generate interleaved (combined) ranking
2) Observe user clicks
3) Credit clicks to original rankers to infer outcome
[Figure: two rankings of documents 1-4 and their interleaved combination]
Example: optimize QAC ranking
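The three steps can be sketched with team-draft interleaving, one of several interleaving methods; the slide does not say which variant is used, so this choice is an assumption. The function names and the "a"/"b" team labels are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Step 1: build a combined ranking and remember who contributed each slot."""
    interleaved, teams = [], []
    count_a = count_b = 0
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        # The ranker with fewer contributions picks next; ties are broken randomly.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        source = ranking_a if a_turn else ranking_b
        doc = next((d for d in source if d not in interleaved), None)
        if doc is None:
            # This ranker has nothing new left, so the other one contributes.
            a_turn = not a_turn
            source = ranking_a if a_turn else ranking_b
            doc = next(d for d in source if d not in interleaved)
        interleaved.append(doc)
        teams.append("a" if a_turn else "b")
        if a_turn:
            count_a += 1
        else:
            count_b += 1
    return interleaved, teams


def infer_outcome(teams, clicked_positions):
    """Steps 2-3: credit each observed click to the ranker that owns the slot."""
    clicks_a = sum(1 for p in clicked_positions if teams[p] == "a")
    clicks_b = sum(1 for p in clicked_positions if teams[p] == "b")
    if clicks_a > clicks_b:
        return "a"
    if clicks_b > clicks_a:
        return "b"
    return "tie"
```

For example, `team_draft_interleave([1, 2, 3, 4], [2, 3, 4, 1], random.Random(1))` returns a ranking containing all four documents together with a parallel list of team labels; passing the clicked positions to `infer_outcome` yields the preferred ranker for that impression.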
Dueling bandit gradient descent (DBGD) optimizes a weight vector for weighted linear combinations of ranking features. [Yue & Joachims ’09]
Learning approach: sample the unit sphere around the current best weight vector to generate a random candidate ranker.
Relative listwise feedback is obtained using interleaving.
[Figure: current best weight vector and randomly generated candidate, plotted over feature 1 and feature 2]
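A minimal sketch of the DBGD loop follows. The comparison is passed in as a callable `candidate_wins`, a hypothetical stand-in for an interleaving experiment with real users; the step sizes `delta` and `alpha` and their default values are assumptions chosen for illustration.

```python
import math
import random


def sample_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dbgd(initial_weights, candidate_wins, iterations,
         delta=1.0, alpha=0.1, seed=0):
    """Dueling bandit gradient descent over a linear ranker's weight vector.

    candidate_wins(current, candidate) should return True when the candidate
    ranker is preferred, e.g. according to an interleaving comparison.
    """
    rng = random.Random(seed)
    w = list(initial_weights)
    for _ in range(iterations):
        u = sample_unit_vector(len(w), rng)
        # Exploratory candidate: a step of size delta in a random direction.
        candidate = [wi + delta * ui for wi, ui in zip(w, u)]
        if candidate_wins(w, candidate):
            # Move the current best a smaller step alpha toward the winner.
            w = [wi + alpha * ui for wi, ui in zip(w, u)]
    return w
```

In a simulation, `candidate_wins` can be replaced by a noise-free oracle that prefers whichever weight vector is closer to a hidden target; the learned weights then drift toward that target.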
Approach: candidate pre-selection (CPS) [Hofmann et al. ’13c]
Generate many candidates and select the most promising one.
[Figure: candidates plotted over feature 1 and feature 2]
Informational click model [Hofmann et al. ’13b, Hofmann et al. ’13c]
From earlier work: learning from relative listwise feedback is robust to noise.
Here: adding structure dramatically improves performance further.
1) Avoid combinatorial action space by exploring in parameter space.
2) Reduce variance using relative feedback.
3) Leverage known structures for sample-efficient learning.
Contextual bandits: Systematic approach to balancing exploration and exploitation; contextual bandits explore in a small action space but optimize in a large context space.
Counterfactual reasoning: Leverages causal structure and importance sampling for “what if” analyses.
Online learning to rank: Avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.
Applications: Assess action and solution spaces in a given application, collect and learn from exploration data, increase experimental agility.
Try this (at home): Try open-source code samples; the Living Labs challenge allows experimentation with online learning and evaluation methods.
Challenge: labs.net/challenge/
Code: https://bitbucket.org/ilps/lerot