Computer Networks Lab, Korea University. Se-Hee Whang
Overview
Today's trend: chip multiprocessors (CMPs); large-scale CMPs need finer-grain parallelism
Existing fine-grain schedulers:
Software-only implementations: flexible, but slow and not scalable
Hardware-only implementations: fast, but inflexible and costly
This paper's approach: hardware-aided user-level schedulers
H/W: Asynchronous Direct Messages (ADM)
S/W: scalable message-passing schedulers
ADM schedulers are as fast as H/W schedulers and as flexible as S/W schedulers
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Fine-grain parallelism: divide the work into many small tasks and distribute them across cores
Advantages: exposes more parallelism, avoids load imbalance
Disadvantages: large scheduling overhead, poor locality
Work-stealing scheduler
One task queue per thread; threads enqueue and dequeue tasks from their own queue
If a thread runs out of work, it tries to steal tasks from another thread's queue
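As a rough illustration of the scheme above, here is a minimal single-threaded simulation of work stealing (names like `Worker` and `get_task` are illustrative, not from the paper):

```python
import random
from collections import deque

class Worker:
    """One task queue per thread; steal from others when out of work."""
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()

    def get_task(self, workers):
        if self.queue:
            return self.queue.pop()          # LIFO pop from own queue (locality)
        victims = [w for w in workers if w is not self and w.queue]
        if victims:
            victim = random.choice(victims)
            return victim.queue.popleft()    # steal from the other end
        return None                          # out of work everywhere

# Simulate: 4 workers, all tasks initially on worker 0
workers = [Worker(i) for i in range(4)]
workers[0].queue.extend(range(8))
done = []
while True:
    progressed = False
    for w in workers:
        t = w.get_task(workers)
        if t is not None:
            done.append(t)
            progressed = True
    if not progressed:
        break
print(sorted(done))  # all 8 tasks complete despite the initial imbalance
```

Owners pop LIFO for cache locality while thieves steal FIFO from the opposite end, the classic work-stealing design the slides assume.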
Components of work-stealing
Queues
Policies: effective load-rebalancing methods
Communication
Comparing the two schemes
The cost of queues and policies is cheap; the cost of communication through shared memory is increasingly expensive!
Carbon [ISCA '07]
One hardware LIFO task queue per core; special instructions to enqueue/dequeue tasks
Components:
Global Task Unit (GTU): centralized queues for fast stealing
Local Task Units: one small task buffer per core to hide GTU latency
A H/W scheduler is useless if its fixed policy does not match the application!
Approaches to fine-grain scheduling
S/W only (OpenMP, TBB, Cilk, X10, …): S/W queues & policies, S/W communication → high overhead, flexible, no extra H/W
H/W only (Carbon, GPUs, …): H/W queues & policies, H/W communication → low overhead, inflexible, special-purpose H/W
H/W aided (Asynchronous Direct Messages, ADM): S/W queues & policies, H/W communication → low overhead, flexible, general-purpose H/W
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Asynchronous Direct Messages (ADM)
Messaging between threads, tailored to scheduling and control
Requirements:
Short messages with low overhead → send from / receive into registers, independent of the coherence protocol
Simultaneous communication and computation → asynchronous messages with user-level interrupts
General-purpose H/W → a generic interface allows reuse
ADM Microarchitecture
One ADM unit per core
Receive buffer holds messages until they are dequeued by the thread
Send buffer holds sent messages pending acknowledgement
Thread ID Translation Buffer translates TID → core ID on sends
Small structures (16-32 entries) → scalable
ADM Instruction Set Architecture
Send & receive are atomic (single instruction)
Send completes when the message is copied to the send buffer
Receive blocks if the buffer is empty; peek does not block, enabling polling
A user-level interrupt is generated when a message is received
No stack switching; handler code saves only part of the context (the registers it uses) → fast
Interrupts can be disabled to preserve atomicity w.r.t. message reception

Instruction | Description
adm_send r1, r2 | Sends a message of (r1) words (0-6) to the thread with ID (r2)
adm_peek r1, r2 | Returns the source and message length at the head of the rx buffer
adm_rx r1, r2 | Dequeues the message at the head of the rx buffer
adm_ei / adm_di | Enable / disable receive interrupts
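Since adm_send/adm_peek/adm_rx are hardware instructions, they cannot be run directly; the following is a toy software model of the receive-side semantics only (class and method names are mine, and the 16-entry buffer size is taken from the microarchitecture slide):

```python
from collections import deque

BUF_ENTRIES = 16   # small receive buffer, per the slides (16-32 entries)

class ADMUnit:
    """Toy model of one core's ADM receive-buffer semantics."""
    def __init__(self):
        self.rx = deque()

    def deliver(self, src, words):
        # Real hardware would back-pressure the sender's send buffer
        # when full; here we just report whether delivery succeeded.
        if len(self.rx) >= BUF_ENTRIES:
            return False
        assert 0 <= len(words) <= 6       # adm_send payloads are 0-6 words
        self.rx.append((src, words))
        return True

    def adm_peek(self):
        # Non-blocking: returns (source, length) without dequeuing
        if not self.rx:
            return None
        src, words = self.rx[0]
        return (src, len(words))

    def adm_rx(self):
        # Dequeues the message at the head of the rx buffer
        return self.rx.popleft() if self.rx else None

unit = ADMUnit()
unit.deliver(src=3, words=[42, 7])
print(unit.adm_peek())   # (3, 2): source thread 3, 2-word message
print(unit.adm_rx())     # (3, [42, 7])
```

The peek/rx split mirrors the ISA: peek supports polling without committing to a dequeue, while rx consumes the head message.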
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
ADM Schedulers
Message-passing schedulers that replace the scheduler of other parallel runtimes, so scheduling is not the application programmer's responsibility
Threads can perform two roles:
Worker: executes the parallel phase, enqueues & dequeues tasks
Manager: controls task stealing & parallel-phase termination
Centralized scheduler: one manager controls all workers
Centralized Scheduler: Updates
The manager keeps approximate task counts for each worker
Workers notify the manager only when their task count crosses an exponential threshold (1, 2, 4, 8, …), keeping update traffic low
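The update filtering above can be sketched as follows (a minimal sketch assuming power-of-two thresholds; the helper name `crossed_threshold` is illustrative):

```python
def crossed_threshold(old, new):
    """Notify the manager only when the task count crosses a
    power-of-two boundary (1, 2, 4, 8, ...), so updates stay rare."""
    def level(n):
        return n.bit_length()   # 0->0, 1->1, 2..3->2, 4..7->3, ...
    return level(old) != level(new)

# A worker enqueuing 20 tasks sends only a handful of updates:
updates = []
count = 0
for _ in range(20):
    old, count = count, count + 1
    if crossed_threshold(old, count):
        updates.append(count)
print(updates)  # [1, 2, 4, 8, 16] -> 5 update messages instead of 20
```

The manager's view is therefore approximate but always within a factor of two of the true count, which is enough for load balancing.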
Centralized Scheduler: Steals
For load balancing, the manager directs steals at the worker with the most tasks
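A minimal sketch of the manager's victim choice, using the approximate per-worker counts it maintains (the dictionary and `pick_victim` are illustrative names, not from the paper):

```python
# Manager-side victim selection over approximate task counts
approx_tasks = {0: 1, 1: 9, 2: 0, 3: 4}   # thread id -> approx task count

def pick_victim(counts, thief):
    """Direct the steal at the most-loaded worker other than the thief."""
    candidates = {t: n for t, n in counts.items() if t != thief and n > 0}
    if not candidates:
        return None                        # nothing worth stealing
    return max(candidates, key=candidates.get)

print(pick_victim(approx_tasks, thief=2))  # 1: the most-loaded worker
```

Because the manager sees all counts, it can direct steals instead of the random probing a pure S/W work-stealer must do.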
Hierarchical Scheduler
The centralized scheduler does all communication through messages, enabling directed stealing and task prefetching, but it does not scale beyond about 16 threads
Hierarchical scheduler: workers and managers form a tree
Hierarchical Scheduler: Steals
Steals can span multiple levels of the tree
Scales to large thread counts (hundreds of threads)
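A toy sketch of a multi-level steal, assuming a 2-level tree with fan-out 4 (the layout and `steal_victim` helper are my illustration of the idea, not the paper's exact protocol):

```python
# A steal request first looks inside the requester's own group;
# only if the whole group is empty does it escalate to the parent.
FANOUT = 4
counts = [0, 0, 0, 0,  3, 0, 5, 0]   # 8 workers, two groups of 4

def steal_victim(thief):
    group = thief // FANOUT
    local = range(group * FANOUT, (group + 1) * FANOUT)
    best = max((w for w in local if w != thief), key=lambda w: counts[w])
    if counts[best] > 0:
        return best                   # resolved at the leaf manager
    # escalate: the parent manager searches the other groups
    others = [w for w in range(len(counts)) if w // FANOUT != group]
    best = max(others, key=lambda w: counts[w])
    return best if counts[best] > 0 else None

print(steal_victim(0))  # worker 6: local group empty, parent redirects
```

Keeping most steals within a group bounds the traffic any single manager sees, which is what lets the tree scale.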
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Evaluation: Setup
Simulated machine: tiled CMP
32, 64, or 128 in-order dual-thread SPARC cores (64 to 256 threads)
3-level cache hierarchy, directory coherence
Benchmarks:
Loop-parallel: canneal, cg, gtfold
Task-parallel: maxflow, mergesort, ced, hashjoin
Evaluation: Results
Carbon and ADM both have small overheads that scale
ADM matches Carbon → no need for a dedicated H/W scheduler
Takeaway: scheduling policies should not be fixed in H/W