
1 Quality of Experience Control Strategies for Scalable Video Processing Wim Verhaegh, Clemens Wüst, Reinder J. Bril, Christian Hentschel, Liesbeth Steffens Philips Research Laboratories, the Netherlands

2 Overview: Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

3 Introduction
- Video processing in software
- User expects high-quality output
- Deadlines on the completion times of frames
- Many video algorithms show a highly fluctuating load
- Given a fixed processing-time budget, lower than the worst-case load: how can we make the best of it?

4 Introduction: Our Approach
1. Asynchronous, work-preserving processing, using buffers
2. Scalable Video Algorithm (SVA): frames can be processed at different quality levels, trading off picture quality against processing needs
3. Soft real-time task, hence we allow occasional deadline misses
4. QoS trade-off between deadline misses, picture quality, and quality fluctuations; the QoS measure reflects user-perceived quality

5 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

6 QoS Control Problem: SVA-controller interaction. For each new frame, the controller selects a quality level; the SVA then processes the frame at that level, and the cycle repeats.
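
A minimal sketch of this interaction loop, under assumptions of my own: the class names, the toy cost model, and the naive "raise quality after success, drop it after a miss" heuristic are purely illustrative and are not one of the control strategies from the talk.

```python
import random

class StubSVA:
    """Toy stand-in for a scalable video algorithm: higher quality levels cost more time."""
    def process_frame(self, quality_level, budget):
        cost = (quality_level + 1) * random.uniform(0.5, 1.5)   # fluctuating per-frame load
        return {"used": cost, "deadline_miss": cost > budget}

class StubController:
    """Toy controller: drop the quality level after a miss, raise it otherwise."""
    def __init__(self, num_levels):
        self.num_levels = num_levels
        self.level = 0
    def select_quality_level(self):
        return self.level
    def observe(self, outcome):
        if outcome["deadline_miss"]:
            self.level = max(0, self.level - 1)
        else:
            self.level = min(self.num_levels - 1, self.level + 1)

sva, controller, budget = StubSVA(), StubController(num_levels=4), 3.0
for frame in range(5):                  # one controller decision per frame
    q = controller.select_quality_level()
    outcome = sva.process_frame(q, budget)
    controller.observe(outcome)
```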

7 QoS Control Problem: real-time processing example (timeline of frames processed against periodic deadlines, showing a deadline miss and a period in which the task is blocked).

8 QoS Control Problem: revenue for a processed frame
Sum of:
- a reward for the selected quality level
- a penalty for each deadline miss
- a penalty for changing the quality level (depending on the previous and current quality levels)
Example from the slide: with a deadline-miss penalty of 10,000, the revenue of the example frame is -10,092.
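
A sketch of such a revenue function, assuming the reward and quality-change penalty tables are passed in as parameters. The table values below are illustrative placeholders, not the values used in the talk; only the 10,000 deadline-miss penalty comes from the slide.

```python
def revenue(quality, prev_quality, deadline_misses,
            reward_per_level, miss_penalty=10_000, change_penalty=None):
    """Revenue of one processed frame: quality reward minus miss and change penalties."""
    r = reward_per_level[quality]
    r -= miss_penalty * deadline_misses
    if change_penalty is not None and quality != prev_quality:
        r -= change_penalty[prev_quality][quality]
    return r

# Placeholder tables, for illustration only:
rewards = [2, 4, 6, 8]                                            # reward per quality level 0..3
changes = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]  # change penalty [prev][new]
print(revenue(1, 3, 1, rewards, change_penalty=changes))          # 4 - 10000 - 2 = -9998
```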

9 QoS Control Problem: QoS measure
- Average revenue per frame
- Reflects the user-perceived quality, provided that the revenue parameters are well chosen
- At each decision point, select the quality level
- Goal: maximize the QoS measure
- Difficult on-line problem: what will the future bring?

10 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

11 Reinforcement Learning: agent-environment interaction. The agent (controller) sends an action to the environment (SVA); the environment returns the next state and a revenue.

12 Reinforcement Learning: agent's goal
- Maximize the expected return
- Discounted return at time step t (infinite time horizon, discount parameter $\gamma$): $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
Selecting actions
- Policy $\pi$: stores for each state a single action to be chosen

13 Reinforcement Learning: Markov Decision Process
- We assume a memoryless state signal: the current state and action predict the next state and revenue
- Hence, the reinforcement learning task is a Markov Decision Process (MDP)
- We assume a finite MDP: finite state set, finite action set
- One-step dynamics: transition probabilities $P^{a}_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}$ and expected revenues $R^{a}_{ss'} = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}$

14 Reinforcement Learning: value functions
- State value of state $s$ under policy $\pi$: $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\}$
- Action value of state $s$ and action $a$ under policy $\pi$: $Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\}$
- A policy $\pi$ is better than or equal to a policy $\pi'$ if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all states $s$
- We are looking for an optimal policy $\pi^{*}$

15 Reinforcement Learning: solution approach
- Compute an optimal policy OFFLINE (= before run time)
  - Requires transition probabilities
  - Requires expected revenues
  - Algorithms: policy iteration, value iteration, ...
- Compute an optimal policy ONLINE, at the discrete time steps, using the experienced states and revenues
  - Algorithms: SARSA, Q-Learning, ...

16 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

17 Offline Approach. Diagram: from the state at a start point, selecting a quality level leads, via transition probabilities and expected revenues, to the state at the next start point. If the transition probabilities and expected revenues are known, an optimal policy can be computed.

18 Offline Approach
- Decision moments = start points
- State = (progress interval (discrete!), previous quality level)
- Action = select quality level
- Transition probabilities and expected revenues: computed using processing-time statistics
- Progress: measure for the amount of budget left until the deadline of the frame to be processed
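
A minimal sketch of forming such a state, assuming progress is normalized against a maximum value and split into equal-width intervals; the number of intervals and the helper names are illustrative, not taken from the paper.

```python
def discretize_progress(progress, max_progress, num_intervals=8):
    """Map a continuous progress value in [0, max_progress] to a discrete interval index."""
    clipped = max(0.0, min(progress, max_progress))
    return min(int(clipped / max_progress * num_intervals), num_intervals - 1)

def make_state(progress, max_progress, prev_quality):
    """State used by the offline policy: (discrete progress interval, previous quality level)."""
    return (discretize_progress(progress, max_progress), prev_quality)
```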

19 Offline Approach
- Given this model, we use value iteration to compute an optimal policy for a particular value of the budget
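
For reference, a generic value-iteration sketch over a finite MDP. The data layout (P[s][a] as a list of (probability, next_state, expected_revenue) triples), the discount factor, and the stopping threshold are assumptions for illustration, not the paper's actual implementation.

```python
def value_iteration(P, gamma=0.95, eps=1e-6):
    """Generic value iteration: returns state values V and a greedy policy.
    P[s][a] is a list of (probability, next_state, expected_revenue) triples."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                                for p, s2, r in actions[a]))
              for s, actions in P.items()}
    return V, policy
```

The offline strategy would run something like this once per candidate budget value, with the transition probabilities and expected revenues derived from the processing-time statistics.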

20 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

21 Online Approach: Q-Learning
- Based on learning Q-values
- State = (progress, previous quality level)
- Action = select quality level
- At each decision point:
  - Given the state transition, action, and revenue, the controller first updates (learns) the Q-value $Q(s,a)$
  - Next, given the new state $s'$, the controller selects the action $a$ for which $Q(s',a)$ is maximal
- Default Q-Learning: only one Q-value is updated per step, and exploring actions are also needed
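
A generic sketch of the tabular Q-Learning update and epsilon-greedy selection the slide refers to; the learning rate, discount factor, and exploration rate are textbook defaults, not values from the paper.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def q_update(s, a, revenue, s_next, actions, alpha=0.1, gamma=0.95):
    """Standard Q-Learning update: move Q(s,a) toward revenue + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (revenue + gamma * best_next - Q[(s, a)])

def select_action(s, actions, epsilon=0.05):
    """Epsilon-greedy selection: mostly the best known action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```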

22 Online Approach: state space
- Progress (continuous) and previous quality level
- We learn Q-values only for a small set of states
- To select the quality level, given the state, we interpolate between the learned Q-values

23 Online Approach: learning
- The progress delta from time step t to t+1 is given for the current state
- Calculate the delta for all progress points

24 Online Approach: learning (cont'd)
- Estimate the effect of other actions
- Hence: all Q(s,a) values are updated in each step (no exploration needed)
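
A rough sketch of the interpolation idea for action selection: Q-values are stored only at a grid of progress points per (previous quality level, action) pair, and the value at an arbitrary progress is obtained by linear interpolation. The grid size and array layout are assumptions for illustration; the paper's full update, which propagates the observed progress delta to all grid points and actions, is not reproduced here.

```python
import numpy as np

NUM_LEVELS, NUM_POINTS, MAX_PROGRESS = 4, 9, 4.0
grid = np.linspace(0.0, MAX_PROGRESS, NUM_POINTS)       # progress grid points
Qtab = np.zeros((NUM_LEVELS, NUM_LEVELS, NUM_POINTS))   # [prev_level, action, grid point]

def q_value(prev_level, action, progress):
    """Q-value at an arbitrary progress, by linear interpolation over the grid."""
    return np.interp(progress, grid, Qtab[prev_level, action])

def best_action(prev_level, progress):
    """Pick the quality level with the highest interpolated Q-value."""
    values = [q_value(prev_level, a, progress) for a in range(NUM_LEVELS)]
    return int(np.argmax(values))
```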

25 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

26 Handling Load Fluctuations
- Both approaches implicitly assume that the processing times of successive frames are mutually independent
  - TRUE for stochastic load fluctuations
  - NOT TRUE for structural load fluctuations
- Result: both approaches perform sub-optimally under structural load fluctuations

27 Handling Load Fluctuations: scaled budget enhancement
1. At each decision point, compute the complication factor = processing time of the frame / expected processing time for the applied quality level
2. Filter out the stochastic load fluctuations (leaving the structural load)
3. Compute the scaled budget = budget / structural load
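
A sketch of these three steps. The slides do not specify the filter, so an exponentially weighted moving average is used here purely as an illustrative choice, and the smoothing factor is an assumption.

```python
class ScaledBudget:
    """Track the structural load and scale the processing-time budget accordingly."""
    def __init__(self, budget, smoothing=0.05):
        self.budget = budget
        self.smoothing = smoothing
        self.structural_load = 1.0   # running estimate of the structural load

    def update(self, proc_time, expected_proc_time):
        complication = proc_time / expected_proc_time                                    # step 1
        self.structural_load += self.smoothing * (complication - self.structural_load)  # step 2
        return self.budget / self.structural_load                                       # step 3
```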

28 Handling Load Fluctuations: scaled budget enhancement (cont'd)
Adapt the offline strategy:
- Compute a policy for many different values of the budget b
- During run time, at each decision point:
  - Compute the scaled budget
  - Compute the state of the SVA
  - Apply the policy corresponding to the scaled budget, and use the state to select the quality level
  - Interpolate between policies
Adapt the online strategy:
- Add the scaled budget directly to the state
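
A minimal sketch of applying a precomputed per-budget policy at run time. For simplicity it selects the policy whose budget is nearest to the current scaled budget, rather than interpolating between policies as the slide suggests; the data layout (each policy as a dict from state to quality level) is an assumption.

```python
import bisect

def quality_from_policies(policies, budgets, scaled_budget, state):
    """policies[i] is the offline policy computed for budgets[i] (sorted ascending);
    pick the policy whose budget is nearest to the current scaled budget."""
    i = bisect.bisect_left(budgets, scaled_budget)
    if i == 0:
        return policies[0][state]
    if i == len(budgets):
        return policies[-1][state]
    lo, hi = budgets[i - 1], budgets[i]
    nearest = i - 1 if scaled_budget - lo <= hi - scaled_budget else i
    return policies[nearest][state]
```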

29 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

30 Simulation Experiments
- Scalable MPEG-2 decoder
- TriMedia MHz platform
- Quality levels based on IDCT pruning
- Sequence `TV' (five episodes of 'Allo 'Allo, 230,936 frames, 2.5 hours)
- Latency: 3 periods (= work ahead of at most 2 periods)
- Control strategies: OFFLINE, OFFLINE*, ONLINE*, Q0,…,Q3
- For each control strategy, we simulate processing sequence `TV' for a fixed value of the processing-time budget
- Revenues: based on input of video experts (slide 8)

31 Simulation experiments: average revenue (chart)

32 Simulation experiments: deadline misses (chart)

33 Simulation experiments: quality-level usage (chart)

34 Simulation experiments: budget usage (chart)

35 Simulation experiments: cross-trace simulations (chart)

36 Introduction, QoS Control Problem, Reinforcement Learning, Offline Approach, Online Approach, Handling Load Fluctuations, Simulation Experiments, Conclusion

37 Conclusion
- Problem
  - Video processing algorithm with highly fluctuating load
  - Fixed processing-time budget, lower than worst-case needs
  - How to optimize the user-perceived quality?
- Approach
  - Asynchronous, work-preserving processing
  - Scalable video algorithm
  - QoS trade-off: deadline misses, processing quality, quality fluctuations
- Control strategies
  - Offline, online, scaled budget enhancement
- Simulation experiments
  - OFFLINE* and ONLINE* perform close to optimum
  - OFFLINE* and ONLINE* are independent of the applied statistics