Adaptive Partition Scheduling
Part 1: Why we did it
Cool stuff from QNX
A. Danko, November 27, 2018


Yet another thread scheduler. Why?
The story begins with a customer: “We can use QNX! We need ARINC653!!!!!! HELP!”

ARINC 653 Partition Scheduler and “special” IPC
Shiny New Toy

Partition scheduler (ARINC 653)
- Very popular in fixed military systems
- Each partition is guaranteed a percentage of CPU
- Priorities are only meaningful within a partition

Shortcomings include
- Detailed RMA required to verify system
- Overload of IPC FIFO input queue
- Failures include denial of service and CPU quota exhaustion
- Monolithic design within one partition
- Hard to retrofit to existing 1-cpu applications
- Inefficient use of total CPU. Runs idle when tasks are ready.
- Increased interrupt latency
- Does not address shared entities such as a file system
- Restrictive programming model. No DMA

[Figure: ARINC 653 partition scheduler and “special” IPC, with partitions POSIX 30%, OTHER 50%, JAVA 20%]
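The “runs idle when tasks are ready” shortcoming can be illustrated with a toy model. This is only a sketch: it assumes a hard-wall scheduler that gives each partition a fixed slot in a 100-tick frame, and the function name `simulate_hard_wall` and the demand figures are hypothetical. The partition names and percentages are the ones from the figure.

```python
def simulate_hard_wall(shares, demand, frame_ticks=100):
    """Run one frame of a fixed (hard-wall) partition schedule.

    shares: partition -> percent of the frame reserved for it.
    demand: partition -> ticks of work that are ready to run (made-up numbers).
    Returns (ticks run per partition, idle ticks, unserved ticks per partition).
    """
    ran, unserved, idle = {}, {}, 0
    for name, percent in shares.items():
        slot = frame_ticks * percent // 100
        wanted = demand.get(name, 0)
        used = min(slot, wanted)
        ran[name] = used
        idle += slot - used          # an unused slot cannot be lent to anyone
        unserved[name] = wanted - used
    return ran, idle, unserved


shares = {"POSIX": 30, "OTHER": 50, "JAVA": 20}
demand = {"POSIX": 60, "OTHER": 10, "JAVA": 20}   # hypothetical load
ran, idle, unserved = simulate_hard_wall(shares, demand)
print("ran:", ran)
print("idle ticks:", idle)
print("unserved:", unserved)
```

With this load the CPU sits idle for 40 ticks of the frame while POSIX still has 30 ticks of ready work it is not allowed to run.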

Real-world examples of partitioning for QNX customers
- Selling a portion of throughput: Customer 2’s network load cannot hurt Customer 1
- Security, untrusted applications: downloaded applications from the Web cannot hurt the system
- Locked system recovery: emergency recovery shell

Hard-wall scheduler not required. Do we need any new scheduler?

[Figures: a router shared by Customer 1 and Customer 2, each with TCP/IP, protocol, and application (50% / 50%); a car system with NAV, radio, etc. alongside 3rd-party applications (malware?) (80% / 20%); a HOG App alongside bash (90% / 10%)]

Evolution of schedulers
Timeline, with the “Yes, but” for each step:
- Priority pre-emptive (SCHED_FIFO): system locks up
- Timeslicing (SCHED_RR): backhoes and Mother’s Day
- Time-varying priority (SCHED_SPORADIC): untuneable for more than 1 application
- Really clever time-varying: US military Satcom
- Fair-share scheduling: hard to manage share interactions
- Adaptive configuration: not invented, until now

Evolution: Lessons learned
- Numerical priorities are chosen by applications, but system scheduling behavior must be designed globally
- Degradation and overload: priorities are not constants. Importance of work depends on circumstances.
  - Modes: normal operation, restart, emergency maintenance
- Scheduling strategy needs to be based on unit of work, but what we have is communicating threads
- Must measure real-time behavior. 0.1% accuracy
- Want to specify shares as global percentages
- Applications don’t get to pick their importance or shares. System engineers do.
- Need to throttle CPU usage without losing realtime latencies

Adaptive Partition Scheduling
What is Partitioning?

General answer
- Separation of work
- To isolate:
  - cpu usage
  - memory usage
  - system resource usage
  - failures

QNX answer
- POSIX compatible design which can be applied to existing systems with little or no recoding
- A global hard real-time scheduler with overload protection and CPU guarantees
- Separation of work based on “working for common purpose”
- Runtime typed memory and kernel object guarantees and limits
  - With full inheritance and accounting for all children
- Persistent storage (file system) guarantees and limits
- Process model for fault isolation
- Dynamic configuration

Principles
- Scheduler must not trigger an overload
  - Overhead may not increase with # of threads
- Real-time during underload
  - Same behavior as today
- Real-time during overload
  - At least for interrupt handling
- Must also be a fair-share scheduler
  - Global scheduler algorithm, globally configured
- Must mesh with current QNX architecture
  - Preemptive priority, individual thread scheduling
  - Heavy use of message passing
- Easy to drop onto existing applications
  - Can’t be a “bag on the side”
- Simple enough for customers to use
  - Engineerable
- Reconfigure on the fly

[Figure: throughput vs. offered load]
[Insert picture of Juggling Watermelons here]

Overconstrained problem?
Nope:
- Implemented in QNX 6.3.2
- Actually works
- See “How it Works” in Part 2.

Adaptive Partition Scheduling
Part 2: How it works

- What it does:
  - Counting time
  - Who’s got time
  - Real time
  - Out of time
  - Free time
  - Borrowed time
  - Equal time
- How it does it
- API
- Why is it secure?
- Why is it cool?

Counting time

What does 14% cpu mean?
- CPU usage is calculated over a sliding window.

Accuracy:
- Counting ticks is not enough. “Micro-billing” is used to track actual CPU utilization even when threads don’t use their whole timeslice.
  - Micro- and nanosecond resolution
  - Threads are billed based on real usage, not statistics
- “windowsize” is configurable as an argument to the kernel at boot
  - Trade off maximum READY-state latency against accuracy of CPU budgeting
  - 100 ms window -> 1% accuracy or better
- Internal arithmetic accurate to 0.5% or better

Partition usage: ns of cpu time executed during the last sliding window, expressed as a percentage
Partition budget: guaranteed percentage of cpu time, balanced over the sliding window

[Figure: sliding window from T = -100ms to T = now]
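A rough sketch of what sliding-window micro-billing could look like. The class name `PartitionMeter`, its methods, and the interval-list representation are illustrative assumptions, not the kernel’s data structures; only the 100 ms window and the nanosecond billing come from the slide.

```python
from collections import deque


class PartitionMeter:
    """Tracks CPU time billed to one partition over a sliding window."""

    def __init__(self, window_ns=100_000_000):    # 100 ms window
        self.window_ns = window_ns
        self.runs = deque()                       # (start_ns, end_ns), in time order

    def bill(self, start_ns, end_ns):
        """Record that a thread of this partition ran in [start_ns, end_ns)."""
        self.runs.append((start_ns, end_ns))

    def usage_percent(self, now_ns):
        """CPU time used during the last window, as a percentage of the window."""
        window_start = now_ns - self.window_ns
        # Forget intervals that ended before the window began.
        while self.runs and self.runs[0][1] <= window_start:
            self.runs.popleft()
        # Count only the part of each interval that lies inside the window.
        used = sum(min(end, now_ns) - max(start, window_start)
                   for start, end in self.runs if start < now_ns)
        return 100.0 * used / self.window_ns


meter = PartitionMeter()
meter.bill(0, 14_000_000)              # ran for 14 ms
meter.bill(20_000_000, 20_250_000)     # ran 250 us, then blocked mid-timeslice
print(meter.usage_percent(100_000_000))
print(meter.usage_percent(107_000_000))
```

The 250 µs run is billed as 0.25% of the window. A scheme that only counted 1 ms ticks would have to round it to 0% or 1%, which is the reason the slide gives for billing actual execution time.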

Who’s got time: Partition Membership
- QNX scheduler partition
  - Set of threads working for a common purpose
  - Set of initial processes/threads designated by customer + all subsequent children
- Guest members
  - Server’s cpu time billed to client
  - Resmgr threads temporarily join partition of sender thread
- Not locked to a static set of code
  - OS services are part of whatever partition they need to be, hence the name “adaptive partition”

Who’s got time: Partition Inheritance
[Figure: a file system process whose receive threads serve messages from threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions have CPU budget available]

- Resource manager threads work on behalf of the sender
- Priority and adaptive partition are inherited on receive
- Execution time in the server is billed to the client’s partition
- This allows proper accounting for shared resources
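The inheritance rule can be sketched as a small model. Everything here is hypothetical naming (`Thread`, `receive`, `reply`, `run`, the "System" partition), and restoring the server’s own partition at reply time is an assumption based on the word “temporarily” on the membership slide.

```python
class Thread:
    def __init__(self, name, partition, priority):
        self.name = name
        self.home = (partition, priority)    # what the thread was created with
        self.partition, self.priority = partition, priority


def receive(server, client):
    """Server thread picks up a client message and inherits its context."""
    server.partition, server.priority = client.partition, client.priority


def reply(server):
    """Work done: the server thread returns to its own partition and priority."""
    server.partition, server.priority = server.home


def run(thread, ns, bills):
    """Bill executed time to whatever partition the thread is in right now."""
    bills[thread.partition] = bills.get(thread.partition, 0) + ns


bills = {}
fs = Thread("fs-worker", "System", 10)
player = Thread("player", "Multi-media", 11)

receive(fs, player)          # fs now runs at priority 11 in "Multi-media"
run(fs, 300_000, bills)      # file system work is charged to the client
reply(fs)
run(fs, 50_000, bills)       # housekeeping is charged to the server's own partition
print(bills)
```

The file system never needs a budget of its own for client work: the 300 µs spent reading on behalf of the player comes out of the Multi-media partition.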

Real time: Behavior under normal load
[Figure: threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application), shown as Blocked, Ready, or Running with their priorities; both partitions have CPU budget available]

- Hard real-time scheduler under normal load
- Running thread selected as highest priority READY thread
- No delay on scheduling if adaptive partition has budget

Out of time: Behavior under overload
[Figure: the same two partitions; Adaptive Partition 1 (Multi-media) has CPU budget available, Adaptive Partition 2 (Java application) has exceeded its CPU budget]

- Highest priority READY thread in a partition with budget runs
- No delay on scheduling if adaptive partition has budget

Free Time: Behavior with unused CPU
[Figure: Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application) have exceeded their CPU budgets; Adaptive Partition 3 has CPU budget available but no READY threads]

- If no partitions with remaining budget have READY threads, the highest priority READY thread is selected to run from the other partitions
- This allows “free” time to be given based upon priority
- “Free” time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)
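The three behaviors (normal load, overload, free time) can be summarized in one selection rule. This is a simplified sketch, not the kernel’s `select_thread()`: the tuple layout, the thread names, and the 40/40/20 budgets are all hypothetical, and payback of free time is not modeled.

```python
def select_thread(ready, usage, budget):
    """Pick the next thread to run.

    ready:  list of (priority, partition, thread_name) for READY threads.
    usage:  partition -> percent of CPU used in the current window.
    budget: partition -> guaranteed percent of CPU.
    """
    if not ready:
        return None
    in_budget = [t for t in ready if usage[t[1]] < budget[t[1]]]
    # Partitions with budget go first; if none of them has a READY thread,
    # hand out "free" time purely by priority.
    candidates = in_budget or ready
    return max(candidates, key=lambda t: t[0])


budget = {"Multi-media": 40, "Java": 40, "Partition 3": 20}
ready = [(10, "Multi-media", "decoder"), (11, "Java", "gc")]

normal = select_thread(ready, {"Multi-media": 20, "Java": 30, "Partition 3": 0}, budget)
overload = select_thread(ready, {"Multi-media": 30, "Java": 45, "Partition 3": 0}, budget)
free = select_thread(ready, {"Multi-media": 45, "Java": 50, "Partition 3": 0}, budget)
print(normal[2], overload[2], free[2])
```

Under normal load the priority-11 Java thread wins. Once Java is over budget, the priority-10 decoder runs instead. When both are over budget and Partition 3 has nothing ready, priority decides again.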

Borrowed Time: Critical Threads
[Figure: Adaptive Partition 1 (Multi-media) has CPU budget available; Adaptive Partition 2 (Air Bag Control) has exceeded its CPU budget but contains a critical thread]

- Critical threads still run (based on priority) even if the partition has no budget
- Critical threads provide deterministic scheduling even in overload
- Critical threads are given a critical budget and can go into short-term debt
- Critical time is accounted and has to be repaid
- Exceeding the critical budget is considered an error and causes notification/action
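A minimal sketch of the critical-budget bookkeeping, under stated assumptions: the class and method names are invented for illustration, the 10% budget and 2 ms critical budget are made-up values, and repayment of critical time is left out.

```python
class Partition:
    def __init__(self, name, budget_pct, critical_budget_ns=0):
        self.name = name
        self.budget_pct = budget_pct
        self.critical_budget_ns = critical_budget_ns
        self.usage_pct = 0.0         # maintained elsewhere by window accounting
        self.critical_used_ns = 0    # critical time consumed in this window

    def may_run(self, thread_is_critical):
        """A thread is eligible if its partition has budget, or it is critical."""
        return self.usage_pct < self.budget_pct or thread_is_critical

    def bill(self, ns, thread_is_critical):
        """Charge critical time when a critical thread runs while out of budget.

        Returns True once the critical budget has been exceeded, which the
        slide treats as an error that causes a notification or action.
        """
        if thread_is_critical and self.usage_pct >= self.budget_pct:
            self.critical_used_ns += ns
        return self.critical_used_ns > self.critical_budget_ns


airbag = Partition("Air Bag Control", budget_pct=10, critical_budget_ns=2_000_000)
airbag.usage_pct = 12.0                       # partition is over budget

print(airbag.may_run(thread_is_critical=False))
print(airbag.may_run(thread_is_critical=True))
first = airbag.bill(1_500_000, thread_is_critical=True)
second = airbag.bill(1_000_000, thread_is_critical=True)
print(first, second)
```

An ordinary thread in the over-budget partition has to wait, while the critical thread keeps running. Its second burst pushes critical usage to 2.5 ms, past the 2 ms allowance, so `bill` reports the overrun.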

Equal time. How to choose between partitions of equal priority
- Unimportant? Many threads run at default priority, therefore equal priority
- Possible algorithms:
  - round robin
  - favor partition with most free time
  - favor longest waiter
- Requirement: minimize latencies during underload
- WBN: divide free time by % cpu share
- Solution: interleave partitions by ratio of partition shares
- We found a clever way to do that, so it’s in the patent.
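The patented method is not described in the slides. As a stand-in, here is a well-known technique with the same goal, smooth weighted round-robin, which interleaves choices in proportion to their weights. The function name and the 50/30/20 shares are hypothetical.

```python
from collections import Counter


def interleave(shares, steps):
    """Yield partition names so that each appears in proportion to its share.

    Smooth weighted round-robin: a generic technique, NOT the QNX algorithm.
    """
    total = sum(shares.values())
    credit = dict.fromkeys(shares, 0)
    for _ in range(steps):
        for name, share in shares.items():
            credit[name] += share        # every partition earns credit each step
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total          # the winner pays for the slot it got
        yield chosen


shares = {"Multi-media": 50, "Java": 30, "Other": 20}   # made-up shares
order = list(interleave(shares, 10))
print(order)
print(Counter(order))
```

Over ten slots the partitions run 5, 3, and 2 times, and the turns are spread out rather than bunched, which keeps the wait for any one partition short.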

How it does it

[Figure: the microkernel (scheduler, messaging, process creation) with libmod_aps.a hooked into each; a per-partition ready queue; scheduler entry points: clock interrupt handler, ready(), block(), select_thread()]

For all partitions p, define
  m(p) -> (bud(p) || crit(p), prio(p), run_t/wsize/bud(p))
Then schedule ps, defined by
  ps -> rdy(ps) and (m(ps) < m(pi)) for all i != s
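One way to read the merit function is as a tuple compared field by field. The sketch below assumes that reading: the slide does not say how the tuple is ordered, so the encoding here (smaller is better, priority negated) is a guess chosen to make `m(ps) < m(pi)` literal. All names are hypothetical, and budgets are assumed to be greater than zero.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    name: str
    budget_pct: float        # bud(p), as a percentage of the window (> 0)
    run_ns: int              # run_t: CPU time used in the current window
    top_prio: int            # prio(p): highest-priority READY thread, -1 if none
    critical: bool = False   # crit(p): that thread may run on critical budget


def merit(p, window_ns):
    used_fraction = p.run_ns / window_ns / (p.budget_pct / 100)
    has_budget = used_fraction < 1.0
    # Smaller tuples are better: eligible partitions first, then higher
    # priority, then the partition that has used the least of its budget.
    return (not (has_budget or p.critical), -p.top_prio, used_fraction)


def pick_partition(partitions, window_ns=100_000_000):
    ready = [p for p in partitions if p.top_prio >= 0]   # rdy(p)
    return min(ready, key=lambda p: merit(p, window_ns), default=None)


mm = PartitionState("Multi-media", 50, 20_000_000, 10)
java = PartitionState("Java", 50, 10_000_000, 10)
print(pick_partition([mm, java]).name)   # equal priority: least-used budget wins

java.run_ns, java.top_prio = 60_000_000, 11
print(pick_partition([mm, java]).name)   # Java is over budget despite priority 11

java.critical = True
print(pick_partition([mm, java]).name)   # a critical thread makes Java eligible again
```

Because the first field dominates, a partition without budget or a critical thread only wins when no eligible partition is ready, which matches the free-time behavior.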

Algorithm summary
- A partition sees real-time behaviour when under budget
  - Only limited when another partition must get its guarantee
- Fair-share scheduling at or over budget
  - Equal prio partitions are interleaved
  - Budgets balanced in much less than windowsize
- Free time (above budget) is given out:
  - By default: in real-time mode
  - Optionally: by ratio of budgets
- Critical threads run even if out of budget
  - Criticality is inherited

Overhead: Fancy, but is it fast?
Scheduling overhead increases with:
- number of partitions
- number of messages/sec
- number of clock interrupts/sec, i.e. ClockPeriod()

It does not increase with the number of threads.

Free or almost free operations:
- Inheriting partition as part of message receive
- Joining a thread to a partition
- Dynamically changing budgets

Computational requirements:
- 32-bit multiply, 64-bit add
- no floating point
- no divides
- no address space swapping
- short-circuit calculation of merit function
- no inter-cpu messaging on SMP
- history-less algorithm

Overhead typically 1% of total cpu

APIs
- Control of the Adaptive Partitioning Scheduler is done through a kernel API
  - API allows associating a thread with a partition
  - Used to launch processes within a partition
  - Children inherit parent’s partition
- Dynamic capabilities are part of the design
  - Budgets may be changed at run time, with instant effect
  - Threads may join/unjoin partitions freely
  - APIs to attach an event triggered on critical budget overrun
- Selectable security
  - API is restricted to privileged processes (root)
  - Must be called from within the default (system) partition
- Partitions are created with a budget (normal and possibly critical)
- API provided to “lock down” partition configuration
  - Prevent creation of new partitions or modification of budgets

API 2: Launching applications
1. Build file
     schedaps MyPartition 20
     [schedaps=MyPartition] /bin/myApp
2. Command line
     aps create -b20 MyPartition
     on -Xaps=MyPartition /bin/myApp
3. Momentics IDE4: drag and drop
4. #include <sys/sched_aps.h>
     Full programmatic interface: configure, get stats, launch, secure

Why is AP Secure?
- AP enforces budgets every clock interrupt
- Root can be required to do configuration changes
- Partition creation by subdivision of parent
  - It’s not possible to create a sub-partition greater than a parent
  - Not even root can violate this rule
- Configuration can be locked
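The subdivision rule can be sketched as a small budget table. This is a model only: `PartitionTable`, the "System" default partition owning 100%, and the exception types are assumptions made for illustration, not the `<sys/sched_aps.h>` interface.

```python
class PartitionTable:
    """Budgets in percent; new partitions are carved out of a parent."""

    def __init__(self):
        self.budget = {"System": 100}    # assumed: default partition owns all CPU
        self.locked = False

    def create(self, name, percent, parent="System"):
        if self.locked:
            raise PermissionError("partition configuration is locked")
        if name in self.budget:
            raise ValueError("partition already exists")
        if percent > self.budget[parent]:
            raise ValueError("child budget cannot exceed the parent's")
        self.budget[parent] -= percent   # the child's share comes out of the parent
        self.budget[name] = percent

    def lock(self):
        self.locked = True


table = PartitionTable()
table.create("MyPartition", 20)
try:
    table.create("Greedy", 30, parent="MyPartition")   # larger than its parent
except ValueError as err:
    print("rejected:", err)

table.lock()
try:
    table.create("Late", 5)
except PermissionError as err:
    print("rejected:", err)
print(table.budget, sum(table.budget.values()))
```

Because every new budget is subtracted from its parent, the budgets always sum to 100%, so no caller, however privileged, can promise more CPU than exists.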

Why is this cool?: Engineerable
- Identifying units of work: partition inheritance
  - Identify code that starts up applications
  - Inheritance figures out the rest
  - Filesystems etc. do not require a separately engineered cpu share
  - Customer need not analyze budgets for OS components
- Global share management: % cpu
  - cpu shares are defined in units customers are used to
  - Percentage gets us off the hook for accounting for different clock speeds
- Realtime when you need it: critical threads
  - Interrupts and important events still get handled on time
- Secure
  - Budgets, especially critical budgets, are set globally by root, not by applications
  - “to err is human, but …”

Part 3. The Slick Demo
