
Flight Time Allocation Using Reinforcement Learning
Ville Mattila and Kai Virtanen
Systems Analysis Laboratory, Helsinki University of Technology
www.sal.tkk.fi
Ville.A.Mattila@tkk.fi

Abstract
Fighter aircraft are maintained periodically on the basis of accumulated usage hours. In a fleet of aircraft, the timing of the maintenance therefore depends on the allocation of flight time. The timing is also subject to a number of uncertainties such as failures of the aircraft. A fleet with limited maintenance resources is faced with a design problem in assigning the aircraft to flight missions so that the overall amount of maintenance needs will not exceed the maintenance capacity. We consider the assignment of aircraft to flight missions as a Markov Decision Problem over a finite time horizon. The average availability of aircraft is taken as the optimization criterion. We describe the fleet operations with a simulation model. An efficient assignment policy is obtained using a Reinforcement Learning technique called Q-learning that presents actions to the simulation and observes the resulting system behavior. We compare the performance of the Q-learning algorithm to a set of heuristic assignment rules using problems involving varying numbers of aircraft and types of periodic maintenance. Moreover, we consider the possibilities of practical implementation of the produced solutions.

1. The Flight Time Allocation Problem

Problem setting
[Diagram of daily fleet operations. Labels: start of day; a fraction of aircraft assigned to flight missions; air base; flight missions; end of day; mission-capable aircraft to base; periodic maintenance after fixed number of flight hours; limited maintenance capacity]
Which assignment preserves aircraft availability?

The flight time allocation problem
The timing of periodic maintenance depends on the assignment of aircraft to flight missions, i.e., the allocation of flight time
→ Problem: how to allocate flight time so that aircraft availability is preserved

Availability as performance indicator
Availability: the proportion of mission-capable aircraft to the total size of the fleet
One of the primary performance indicators of operational capability in actual maintenance-related decision making
We consider average availability over a finite time horizon
–Need to study operational capability given a certain initial state
–Operational environment remains the same for a limited amount of time

Difficulty of flight time allocation
Uncertainties
–Maintenance duration
–Accumulated flight hours during missions
–Unplanned maintenance through failure repairs
–Unplanned maintenance through battle damage repairs
Dimension of the problem
–Potentially a large number of aircraft
–Different types of periodic maintenance
–Multiple maintenance facilities of different levels

2. Problem formulation

Formulation as a Markov Decision Problem
[Diagram: chain of aircraft states 0, 1, 2, …, m-1, m]
State of a single aircraft: days in use since last periodic maintenance (0, 1, …, m-1) or periodic maintenance (m)
State of the fleet: the number of aircraft in each state
Transition: assigned to flight missions
Transition: maintenance completed
Action: the number of aircraft assigned to flight missions from each state
Performance criterion: the number of aircraft in maintenance / total fleet size

System state
Denote the size of the fleet with $N$
A single aircraft
–State $i \in [0, m-1]$: ‘the number of stages in use since last periodic maintenance’
–State $m$: ‘aircraft in maintenance’
–$m$ equals the maintenance interval of the aircraft
The aircraft fleet
–State $s = (s_0, s_1, \ldots, s_m)$, where $s_i$ denotes the number of aircraft in state $i$
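The fleet state can be illustrated with a minimal sketch that counts aircraft per individual state. The function name `fleet_state` and the list-of-integers input are assumptions made for this illustration; they do not appear in the slides.

```python
from collections import Counter

def fleet_state(aircraft_states, m):
    """Build s = (s_0, ..., s_m) from a list of per-aircraft states.

    Each aircraft state is an integer in 0..m, where m means
    'in maintenance' (hypothetical encoding for this sketch).
    """
    if any(i < 0 or i > m for i in aircraft_states):
        raise ValueError("aircraft states must lie in 0..m")
    counts = Counter(aircraft_states)
    return tuple(counts.get(i, 0) for i in range(m + 1))

# Four aircraft with maintenance interval m = 2:
# one in state 0, two in state 1, one in maintenance.
print(fleet_state([0, 1, 1, 2], m=2))
```

The entries of the tuple always sum to the fleet size N, so N does not need to be stored separately.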

Actions
$d$: the number of aircraft to be assigned to perform flight missions
Action $a = (a_0, a_1, \ldots, a_{m-1})$, where $a_i$ is the number of aircraft assigned from state $i$
The set of admissible actions in state $s$ of the aircraft fleet:
–At most $d$ aircraft are assigned
–If the number of available aircraft is less than $d$, all are assigned
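One way to enumerate the admissible actions is sketched below. It reads the two rules as "assign exactly min(d, number of available aircraft)", which is an interpretation of the slide rather than something it states outright; the function name `admissible_actions` is hypothetical.

```python
from itertools import product

def admissible_actions(state, d):
    """List every action a = (a_0, ..., a_{m-1}) allowed in fleet state s.

    Assumption: exactly min(d, available) aircraft are assigned, and
    a_i can never exceed s_i. The last entry of `state` (aircraft in
    maintenance) cannot be assigned.
    """
    usable = state[:-1]
    target = min(d, sum(usable))
    ranges = [range(min(n, target) + 1) for n in usable]
    return [a for a in product(*ranges) if sum(a) == target]

print(admissible_actions((1, 2, 1), d=1))
```

Brute-force enumeration like this is only practical for small m and d; the number of candidate actions grows quickly with both.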

Simulation of the aircraft fleet
Current state $s$, action $a$, resulting state $s'$
Maintenance capacity $M$, expected duration $D$
State transitions
–Completed maintenance:
    for k = 1 to min(M, s_m)
        draw z ~ U(0,1)
        if z < 1/D: s_0' = s_0 + 1, s_m' = s_m - 1
–Usage of aircraft:
    for k = 1 to m-1
        s_k' = s_k' + a_{k-1} - a_k
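The two transition rules can be sketched as a single simulation step. The slide lists the usage update only for states 1 to m-1; the sketch assumes the boundary cases follow the same pattern (state 0 loses a_0 aircraft and the maintenance state m gains a_{m-1}). The name `simulate_step` and the `rng` parameter are hypothetical.

```python
import random

def simulate_step(state, action, M, D, rng=random):
    """Simulate one stage: maintenance completions, then aircraft usage."""
    m = len(state) - 1
    new = list(state)

    # Completed maintenance: each of at most M aircraft being serviced
    # finishes with probability 1/D, giving an expected duration of D stages.
    for _ in range(min(M, state[m])):
        if rng.random() < 1.0 / D:
            new[0] += 1
            new[m] -= 1

    # Usage: every assigned aircraft advances one state. For 1 <= k <= m-1
    # this yields s_k' = s_k + a_{k-1} - a_k as on the slide; the updates of
    # s_0 and s_m are an assumption of this sketch.
    for k in range(m):
        new[k] -= action[k]
        new[k + 1] += action[k]
    return tuple(new)

print(simulate_step((1, 2, 1), (0, 1), M=1, D=2, rng=random.Random(0)))
```

Completions are drawn from the old state, so an aircraft that enters maintenance in a stage cannot also leave it in the same stage.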

Optimization criterion
Immediate reward $r(s,a,s')$ is the aircraft availability in $s'$
Optimization criterion: average availability over a finite number of stages
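As a small worked example, the reward and the optimization criterion can be computed directly from fleet state vectors. The sketch assumes the state encoding s = (s_0, ..., s_m) with the last entry counting aircraft in maintenance; the function names are hypothetical.

```python
def availability(state):
    """Proportion of mission-capable aircraft in a fleet state."""
    fleet_size = sum(state)
    in_maintenance = state[-1]
    return (fleet_size - in_maintenance) / fleet_size

def average_availability(visited_states):
    """Average of the immediate rewards over the visited states s'."""
    rewards = [availability(s) for s in visited_states]
    return sum(rewards) / len(rewards)

print(availability((1, 2, 1)))
print(average_availability([(1, 2, 1), (2, 2, 0)]))
```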

3. The reinforcement learning approach

Learning optimal flight time allocation policy
The learning algorithm: present an action to the simulation and observe its benefit on the basis of the simulated response
Simulation of the system: simulate the state transition and reward following the execution of the action
[Diagram: the learning algorithm passes an action to the simulation, which returns the system state and reward; repeat]
Learned policy: actions that produce the greatest immediate and expected future rewards

Q-learning
The value of executing action $a$ in state $s$ and following the optimal policy from then on is stored in the Q-factor $Q(s,a)$ of the state-action pair
The factors $Q(s,a)$ are updated as follows:
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \left[ r(s,a,s') + \gamma \max_{a' \in A(s')} Q(s',a') \right]$$
where $\alpha \in (0,1)$ is the step size and $\gamma \in (0,1)$ the discount factor
The learned policy:
$$\pi(s) = \arg\max_{a \in A(s)} Q(s,a)$$
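A minimal sketch of the tabular update follows, using the standard Q-learning rule Q(s,a) ← (1-α)Q(s,a) + α[r + γ max Q(s',a')]. The dictionary layout keyed by (state, action) pairs and the name `q_update` are assumptions of this illustration, not the authors' implementation.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Apply one standard Q-learning update to the look-up table Q.

    Q maps (state, action) pairs to values; unseen pairs default to 0.
    next_actions is the admissible action set in s_next.
    """
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)
print(q_update(Q, (1, 2, 1), (0, 1), 0.75, (1, 1, 2), [(0, 1), (1, 0)],
               alpha=0.5, gamma=0.98))
```

The greedy policy is then read off the table by picking, in each state, the admissible action with the largest stored value.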

Details of the learning algorithm
Action selection
–with probability e, select the a for which Q(s,a) is highest, i.e., a greedy action
–with probability 1-e, select any other a randomly from A(s)
Step size
–set as a function of V(s,a), where V(s,a) denotes the number of times the pair (s,a) has been visited
Discounting
–Q-learning is actually a technique for discounted total reward
–Can however optimize average reward if the discount factor is sufficiently high
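The action selection rule can be sketched as follows. Note that e here is the probability of the greedy choice, as on the slide, which is the reverse of the more common epsilon-greedy convention. The function name, the dictionary layout of `Q`, and the fallback to the greedy action when it is the only admissible one are assumptions of this sketch.

```python
import random

def select_action(Q, s, actions, e, rng=random):
    """Pick the greedy action with probability e, otherwise another one.

    Q maps (state, action) pairs to values; missing pairs count as 0.
    """
    greedy = max(actions, key=lambda a: Q.get((s, a), 0.0))
    others = [a for a in actions if a != greedy]
    if not others or rng.random() < e:
        return greedy
    return rng.choice(others)

Q = {((1, 2, 1), (0, 1)): 0.5}
print(select_action(Q, (1, 2, 1), [(0, 1), (1, 0)], e=0.9))
```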

Heuristic policies
Can provide efficient solutions for many complex problems
Can act as a reference for the policy produced by Q-learning
Two simple policies are considered
–‘advance’: flight time is allocated to aircraft with least time to maintenance
–‘postpone’: flight time is allocated to aircraft with most time to maintenance
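Both heuristics can be sketched as a greedy fill over the aircraft states. The sketch assumes that "least time to maintenance" means the highest state index below m and "most time to maintenance" the lowest; the function name `heuristic_action` and the string rule names are hypothetical.

```python
def heuristic_action(state, d, rule):
    """Return the action chosen by the 'advance' or 'postpone' heuristic."""
    usable = state[:-1]  # the last entry counts aircraft in maintenance
    if rule == "advance":
        order = range(len(usable) - 1, -1, -1)  # closest to maintenance first
    elif rule == "postpone":
        order = range(len(usable))              # furthest from maintenance first
    else:
        raise ValueError("rule must be 'advance' or 'postpone'")

    action = [0] * len(usable)
    remaining = min(d, sum(usable))
    for i in order:
        take = min(usable[i], remaining)
        action[i] = take
        remaining -= take
    return tuple(action)

print(heuristic_action((1, 2, 1), 1, "advance"))
print(heuristic_action((1, 2, 1), 1, "postpone"))
```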

4. Results

Example problem
Problem instance
–N = 4: the number of aircraft
–m = 2: maintenance interval
–d = 1: number of aircraft assigned to flight missions
–M = 1: maintenance capacity
–D = 2: expected duration of maintenance
–L = 50: number of stages
–Initial state s(0) = [1 2 1]
Learning parameters
–e = 0.9: probability of choosing a greedy action
–γ = 0.98: the discount factor
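For orientation, the example instance can be put through a complete tabular Q-learning loop. This is a sketch under several assumptions that the slides do not confirm: the step size is taken as 1/V(s,a), exactly min(d, available) aircraft are assigned per stage, state 0 and state m are updated like the intermediate states, and the Q-table ignores the stage index. All function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Problem instance and learning parameters as listed on the slide.
N, m, d, M, D, L = 4, 2, 1, 1, 2, 50
S0 = (1, 2, 1)
E, GAMMA = 0.9, 0.98

def actions(s):
    """Admissible actions: assign exactly min(d, available) aircraft (assumed)."""
    usable = s[:-1]
    target = min(d, sum(usable))
    ranges = (range(min(n, target) + 1) for n in usable)
    return [a for a in product(*ranges) if sum(a) == target]

def step(s, a, rng):
    """One simulated stage: maintenance completions, then usage."""
    new = list(s)
    for _ in range(min(M, s[m])):
        if rng.random() < 1.0 / D:
            new[0] += 1
            new[m] -= 1
    for k in range(m):
        new[k] -= a[k]
        new[k + 1] += a[k]
    return tuple(new)

def train(replications=20, seed=0):
    """Run Q-learning over repeated L-stage periods; return Q and averages."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    visits = defaultdict(int)
    averages = []
    for _ in range(replications):
        s, total = S0, 0.0
        for _ in range(L):
            acts = actions(s)
            greedy = max(acts, key=lambda a: Q[(s, a)])
            others = [a for a in acts if a != greedy]
            a = greedy if (not others or rng.random() < E) else rng.choice(others)
            s_next = step(s, a, rng)
            r = (N - s_next[m]) / N            # availability in s'
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]       # assumed step-size rule
            best_next = max(Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + GAMMA * best_next)
            s, total = s_next, total + r
        averages.append(total / L)
    return Q, averages

Q, averages = train()
print(averages[-1])
```

With 20 replications of the 50-stage period the loop performs 1000 state transitions in total. The printed average depends on the random seed and on the assumptions above, so it should not be read as a reproduction of the reported results.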

Convergence of average reward
A convergent solution is obtained after 1000 state transitions
–20 replications of the 50-day time period
The average availability over the time period exceeds that of the simple heuristic policies

Availability under the different policies
The learned policy
–Maintains higher availability than the heuristic policies in the beginning
–Matches the availability of the heuristics during later stages

Characterizing the learned policy
Since m was taken very small, the learned solution can be characterized with the ‘advance’ and the ‘postpone’ heuristic policies as follows:
–if the number of aircraft in maintenance is equal to or more than the capacity: → ‘postpone’
–if the maintenance facility is idle: if $s_2 > 1$ → ‘advance’, else → ‘postpone’

5. Conclusions

Contributions
Insight into a difficult problem actually faced by fleet commanders
Flight time allocation as a means for timing maintenance
–Has not been considered as a dynamic problem, to the best of our knowledge
–Has not been considered with RL techniques

The reinforcement learning approach
Results of the reinforcement learning approach for the studied problem instances are promising
–A convergent policy is found
–The obtained policy outperforms simple heuristic policies
–Learning time is manageable for fleet sizes of up to 16 aircraft

Extensions to the model
A number of extensions to the presented model are likely required in order to describe more realistic scenarios
Of particular interest are the effects of
–Additional uncertainties such as battle damage
–An operational environment that evolves through time
–Violations of the Markovian property of states

Analysis of obtained policies
The purpose of studying the flight time allocation problem is to obtain new insight for the use of human decision makers
Until now, Q-learning has been implemented as a look-up table version
–Q-factors are stored explicitly → representation of the learned policy requires large storage space
–Post-learning analysis is needed to build intuition of efficient policies
The future challenge is to represent policies in a compact form that allows both
–Efficient learning
–Intuitive representation to human decision makers

