
1 Achieving Multiprogramming Scalability of Parallel Programs on Intel SMP Platforms: Nanothreading in the Linux Kernel
Christos D. Antonopoulos, Panagiotis E. Hadjidoukas, Theodore S. Papatheodorou, Dimitrios S. Nikolopoulos, Ioannis E. Venetis, Eleftherios D. Polychronopoulos
High Performance Information Systems Laboratory, Department of Computer Engineering and Informatics, University of Patras, Greece
ParCo’99, Delft, The Netherlands, August

2 Motivation
- Proliferation of IA-based SMPs for parallel and mainstream computing.
- Multithreading: the POSIX Threads (1003.1c) standard; engineering and desktop applications.
- Multiprogramming: simultaneous execution of parallel and sequential jobs; workload diversity.
- Poor integration of multithreading with multiprogramming:
  - multiprogramming-oblivious runtime systems
  - multithreading-oblivious operating system kernels
  - poor performance of parallel programs in non-dedicated environments

3 Adaptability of Parallel Applications in non-dedicated Multiprogrammed Environments
- Lightweight communication path between the runtime system and the operating system.
- Communication of critical scheduling events, such as allocation and preemption of processors, from the operating system to the application.
- Communication of the application-level degree of parallelism to the operating system, to guide processor allocation.
- One-to-one mapping: user-level threads to kernel threads, kernel threads to physical processors.
- Fast resuming of maliciously preempted user-level threads that execute on the critical path of the application.

4 The Nanothreading Interface
- Communication between the kernel and the runtime system through loads and stores in shared memory:
  - minimal overhead, no additional context switches or kernel crossings
  - fast cloning of execution vehicles (EVs) and processor assignment
- Polling of critical scheduling information from user space:
  - actual number of allocated processors
  - actual state of the owned kernel threads
- Adaptation to kernel scheduler interventions:
  - automatic adjustment of thread granularity
  - identification and resuming of preempted user-level threads that execute on the critical path
  - minimization of idle time for maximum utilization
- Effective dynamic space and time sharing.
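As a rough illustration of what "communication through loads and stores" means on the runtime side, the sketch below wraps the polled fields behind trivial accessors. The field names follow the next slide; the struct, accessor names, and volatile qualifiers are assumptions made for illustration, not the library's actual interface.

```c
/* Hypothetical view of the interface from the runtime side: pointers into
 * the shared arena page, read and written with plain (volatile) loads and
 * stores -- no system call, no extra context switch. */
struct nano_iface {
    volatile int *n_cpus_requested;         /* written by the application */
    const volatile int *n_cpus_current;     /* written only by the kernel */
    const volatile int *n_cpus_preempted;   /* written only by the kernel */
};

/* Request processors: a single store into the R/W segment. */
static inline void nano_request_cpus(struct nano_iface *ni, int n)
{
    *ni->n_cpus_requested = n;
}

/* Poll the number of processors actually allocated: a single load. */
static inline int nano_cpus_granted(const struct nano_iface *ni)
{
    return *ni->n_cpus_current;
}
```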

5 The Shared Arena
[Diagram: a VM page shared between the application (user space) and the OS scheduler (kernel space). The R/W segment, written by the adaptive parallel program, holds n_cpus_requested; the R/O segment, written by the kernel, holds n_cpus_current, n_cpus_preempted, and the per-EV state (running, preempted, or blocked). The EVs act as workers or idlers on the allocated CPUs, while non-adaptive parallel and sequential programs are scheduled conventionally.]
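A C sketch of one possible layout of the arena page in the diagram: the three counters and the per-EV states are named on the slide, while the types, the MAX_EVS bound, and the exact placement of the R/W and R/O segments are assumptions.

```c
/* Sketch of the shared-arena layout shown in the diagram. */
#define MAX_EVS 16                          /* assumed upper bound on EVs */

enum ev_kstate { EV_RUNNING, EV_PREEMPTED, EV_BLOCKED };   /* set by kernel */

struct shared_arena {
    /* R/W segment: written by the (adaptive) application */
    int n_cpus_requested;                   /* degree of parallelism wanted */

    /* R/O segment: written by the OS scheduler, only read by the app */
    int n_cpus_current;                     /* processors currently granted */
    int n_cpus_preempted;                   /* EVs preempted by the kernel  */
    enum ev_kstate ev_state[MAX_EVS];       /* per-EV running/preempted/blocked */
};
```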

6 General Functionality
- The application communicates its requests for processors.
- EVs upcall to the user-level scheduler upon assignment of processors.
- Notification of EV state at the program level from the runtime system: worker, idler.
- Notification of EV state at the kernel level from the OS: running, preempted.
- Polling of the shared arena:
  - at the initiation of parallel execution phases (see the sketch below)
  - at "safe" execution points of the user-level scheduler
- Intra-program priority scheduling: idlers hand off their CPU in favor of:
  - preempted workers
  - recently unblocked threads
  - EVs belonging to other applications
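A hedged sketch of the polling at the start of a parallel phase, reusing struct shared_arena from the slide-5 sketch. The helper spawn_workers() is a hypothetical placeholder for whatever runtime call activates that many worker user-level threads.

```c
static void spawn_workers(int n);          /* hypothetical runtime hook */

/* Before a parallel phase: match the number of workers to the number of
 * processors the kernel actually granted (automatic granularity adjustment). */
static void start_parallel_phase(struct shared_arena *arena, int wanted)
{
    arena->n_cpus_requested = wanted;      /* app -> kernel: desired parallelism */
    int have = arena->n_cpus_current;      /* kernel -> app: processors granted  */
    if (have < 1)
        have = 1;                          /* always make forward progress */
    spawn_workers(have);
}
```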

7 Kernel Implementation Issues in Linux 2.0
- Shared arena:
  - pinned memory page
  - application-side copy with R/W privileges
  - trusted kernel-side copy with R/O privileges
  - copy-on-write of the R/O fields to reduce TLB flushes
- EV cloning in the kernel:
  - batch creation of the EVs that serve a single nanothreading application
  - instruction pointer set to upcall to the user-level scheduler
- Additional functionality:
  - share groups
  - binding of kernel threads to processors
  - explicit blocking/unblocking through a counting semaphore
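How the runtime obtains its mapping of the pinned arena page is specific to the kernel interface; the sketch below shows one plausible user-side shape of that step. The device name "/dev/nanothread" and the mmap protocol are assumptions for illustration and not the actual Linux 2.0 patch interface.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical: map the pinned arena page into the application. */
static void *arena_map(void)
{
    int fd = open("/dev/nanothread", O_RDWR);      /* assumed device */
    if (fd < 0)
        return NULL;

    /* Application-side mapping with R/W permission; the kernel keeps its
     * own trusted view of the fields that are R/O for the application. */
    void *p = mmap(NULL, (size_t)getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```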

8 Kernel Scheduler Modifications
- Nanothreading scheduler:
  - invoked upon changes in the workload of nanothreading jobs and upon time-quantum expiration
  - two-level scheduling
- Three-phase scheduling (see the sketch below):
  1. Assignment of a number of runnable EVs to processors: dynamic time/space sharing.
  2. Indirect assignment of specific CPUs to nanothreading applications: nanothreading applications compete with non-nanothreading applications; processor locality.
  3. Selection of the specific kernel threads to run on each physical processor: affinity scheduling; priority order: preempted workers, then preempted idlers, then voluntarily suspended idlers.
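A toy user-space model of the three phases listed above. The data structures, the equal-share policy in phase 1, and the packing in phase 2 are assumptions for illustration; the real code lives inside the modified Linux 2.0 scheduler and also honours processor locality and competition with non-nanothreading jobs.

```c
#include <stddef.h>

enum ev_state { EV_RUNNING, EV_PREEMPTED_WORKER,
                EV_PREEMPTED_IDLER, EV_SUSPENDED_IDLER };

struct ev  { enum ev_state state; };
struct app { int cpus_requested, cpus_granted; struct ev *evs; int nevs; };
struct cpu { struct app *app; struct ev *ev; };

/* Phase 1: how many processors each nanothreading job gets (dynamic
 * time/space sharing); here a naive equal share, capped by the request
 * read from the job's shared arena. */
static void phase1_allocate(struct app *apps, int napps, int ncpus)
{
    int share = napps ? ncpus / napps : 0;
    for (int i = 0; i < napps; i++)
        apps[i].cpus_granted = apps[i].cpus_requested < share
                             ? apps[i].cpus_requested : share;
}

/* Phase 2: hand specific CPUs to applications; this toy simply packs
 * them, whereas the real scheduler prefers previously used CPUs. */
static void phase2_place(struct app *apps, int napps,
                         struct cpu *cpus, int ncpus)
{
    int c = 0;
    for (int i = 0; i < napps; i++)
        for (int g = 0; g < apps[i].cpus_granted && c < ncpus; g++)
            cpus[c++].app = &apps[i];
}

/* Phase 3: pick the kernel thread (EV) to run on one CPU, using the
 * priority order from the slide. */
static struct ev *phase3_pick(struct cpu *c)
{
    static const enum ev_state order[] = { EV_PREEMPTED_WORKER,
                                           EV_PREEMPTED_IDLER,
                                           EV_SUSPENDED_IDLER };
    for (size_t p = 0; p < sizeof order / sizeof order[0]; p++)
        for (int i = 0; c->app && i < c->app->nevs; i++)
            if (c->app->evs[i].state == order[p])
                return &c->app->evs[i];
    return NULL;                       /* nothing runnable for this app */
}
```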

9 Handoffs and Blocking of Threads in the Kernel
- Handoff scheduling:
  - triggered at idling points of the user-level scheduling loop (see the sketch below)
  - equivalent to the third phase of the nanothreads scheduler
  - may resume an EV from another program to maximize utilization (yielding)
- Blocking:
  - blocking activates local scheduling
  - unblocked threads are resumed immediately or marked as high-priority preempted threads
  - applications with blocked EVs run with lower priority
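A hedged sketch of the handoff decision at an idling point, reusing struct shared_arena from the slide-5 sketch. The helpers find_preempted_worker(), find_unblocked_thread(), handoff_to(), and yield_processor() are hypothetical placeholders; the real handoff goes through the nanothreading kernel interface.

```c
static int  find_preempted_worker(struct shared_arena *a);  /* hypothetical */
static int  find_unblocked_thread(void);                    /* hypothetical */
static void handoff_to(int ev);                             /* hypothetical */
static void yield_processor(void);   /* may resume another program's EV */

/* Called from an idling point of the user-level scheduling loop. */
static void idler_handoff(struct shared_arena *arena)
{
    int ev;
    if ((ev = find_preempted_worker(arena)) >= 0)
        handoff_to(ev);              /* 1. preempted workers first         */
    else if ((ev = find_unblocked_thread()) >= 0)
        handoff_to(ev);              /* 2. then recently unblocked threads */
    else
        yield_processor();           /* 3. else yield, possibly to another
                                        application, to keep the CPU busy */
}
```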

10 Runtime System Modifications
- Initialization: shared-arena setup; communication of maximum processor requirements.
- Polling of the shared arena before initiating parallel execution.
- Polling of the shared arena at idling points: handoff scheduling.
- Non-blocking synchronization with concurrent queues: immunity to preemptions of user-level threads by the OS (see the sketch below).
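The point of non-blocking synchronization here is that an EV preempted in the middle of a queue operation never leaves a lock held, so the application's remaining EVs keep making progress. Below is a minimal sketch of the idea using a Treiber stack and C11 atomics; it is an illustration, not the library's actual concurrent queue, and a production version would also have to deal with the ABA problem (e.g. with tagged pointers).

```c
#include <stdatomic.h>
#include <stddef.h>

struct nthread {                     /* a user-level thread descriptor */
    struct nthread *next;
    /* ... saved context, stack pointer, etc. ... */
};

static _Atomic(struct nthread *) ready_top = NULL;

/* Push never holds a lock, so a preemption mid-operation cannot stall
 * the other EVs of the application. */
static void ready_push(struct nthread *t)
{
    struct nthread *old = atomic_load(&ready_top);
    do {
        t->next = old;
    } while (!atomic_compare_exchange_weak(&ready_top, &old, t));
}

static struct nthread *ready_pop(void)
{
    struct nthread *old = atomic_load(&ready_top);
    while (old != NULL &&
           !atomic_compare_exchange_weak(&ready_top, &old, old->next))
        ;                            /* on failure 'old' is refreshed; retry */
    return old;                      /* NULL when the list is empty */
}
```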

11 Performance Evaluation
- Quad Pentium Pro SMP (Compaq ProLiant 5500):
  - 4 Pentium Pro processors clocked at 200 MHz
  - 512 KB L2 cache per processor
  - 512 MB main memory
- Linux kernel version 2.0
- Nanothreads Runtime Library
- Multiprogrammed workloads:
  - multiple copies of the SPLASH-2 LU, Volrend, FFT, and Raytrace applications, with task-queue or master-slave execution paradigms
  - 2-, 4-, and 8-way multiprogramming

12 Results SPLASH-2 LU
[Chart: average turnaround time in seconds vs. multiprogramming degree (1-way, 2-way, 4-way, 8-way), comparing the native Linux kernel with the nanothreading kernel.]

13 Results SPLASH-2 Volrend
[Chart: average turnaround time in seconds vs. multiprogramming degree (1-way, 2-way, 4-way, 8-way), native Linux kernel vs. nanothreading kernel.]

14 Results SPLASH-2 Raytrace
[Chart: average turnaround time in seconds vs. multiprogramming degree (1-way, 2-way, 4-way, 8-way), native Linux kernel vs. nanothreading kernel.]

15 Results SPLASH-2 FFT
[Chart: average turnaround time in seconds vs. multiprogramming degree (1-way, 2-way, 4-way, 8-way), native Linux kernel vs. nanothreading kernel.]

16 Ongoing and Future Work
- Porting to 2.2 kernels.
- Evaluation with non-homogeneous and I/O-centric workloads.
- Integration with the OpenMP standard.
- Integration with out-of-core multithreading runtime libraries:
  - POSIX threads (1003.1c)
  - WWW servers
  - database servers
  - JVM
- For more information:

