The Impact of Performance Asymmetry in Multicore Architectures. Saisanthosh Balakrishnan, Ravi Rajwar, Michael Upton, Konrad Lai. UW-Madison and Intel Corp.

The Impact of Performance Asymmetry in Multicore Architectures
Saisanthosh Balakrishnan, Ravi Rajwar, Michael Upton, Konrad Lai
UW-Madison and Intel Corp.
32nd Annual International Symposium on Computer Architecture

2 Performance asymmetry: difference in compute power of processors
- Architectural differences
- Micro-architectural parameters
- Other
- Heat: thermal throttling
Why need asymmetry now?
- CMP / many cores as commodity systems
- Run a variety of workloads
- Good serial performance and high throughput
- Optimal energy consumption
Assume an asymmetric multicore system
[Diagram: cores labeled F F S S]

3 Asymmetry & MT workloads
- Scalable? Performance vs. compute power; N procs., different configs. Need to utilize asymmetry and perform better.
- Stable? Performance over the same/many runs; N procs., same config. Need predictable and robust performance.
[Diagrams: core layouts labeled F F S S and S S S S]

4 The problems
- Programmers: algorithm, correctness, thread partitioning. Don't reason about asymmetry.
- Characteristics of threads: partitioning, synchronization barriers, interference, lifetime.
- Scheduling of threads: OS kernel, library, application, DB/Web servers, managed runtime systems (Java, .NET).

5 Contributions
Asymmetry negatively affects applications
- Studied many workloads on real hardware
- Observed unpredictable workload behavior
This can be fixed by
- Evaluating threads' work partitioning
- Scheduling threads with awareness of asymmetry

6 Outline
- Asymmetry and Performance
- Evaluation Methodology
- Asymmetric Configurations
- Workloads and Results

7 Evaluation methodology
Asymmetry in real hardware
- Intel 4-way 3-GHz Xeon
- Different cores run at different frequencies
- Software controlled
Benefits
- Long real-time runs (no simulations)
- Workloads are set up according to specs
- Representative of other forms of asymmetry: communication, micro-architecture, etc.

8 Configurations
Symmetric
- all fast: F F F F
- all slow: S S S S
Asymmetric
- 1 slow: F F F S
- 2 slow: F F S S
- 3 slow: F S S S
F = full frequency
S = one-eighth of full frequency (in talk and paper)
S = one-fourth of full frequency (in paper)
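
To make the five configurations concrete, the sketch below sums the normalized core frequencies of a 4-core machine for each one. Treating summed frequency as "compute power" is a simplifying assumption (memory-bound work will not scale linearly with clock speed), and the names `compute_power`, `CONFIGS`, and `POWER` are hypothetical, not from the talk.

```python
from fractions import Fraction

FAST = Fraction(1)        # full frequency, normalized to 1
SLOW = Fraction(1, 8)     # one-eighth of full frequency (the setting used in the talk)
NUM_CORES = 4             # the 4-way machine described in the methodology slide


def compute_power(num_slow, slow_speed=SLOW):
    """Sum of normalized core frequencies with num_slow slow cores (an assumed proxy)."""
    if not 0 <= num_slow <= NUM_CORES:
        raise ValueError("num_slow must be between 0 and NUM_CORES")
    return (NUM_CORES - num_slow) * FAST + num_slow * slow_speed


# Configuration names follow the slide; the values are the number of slow cores.
CONFIGS = {"all fast": 0, "1 slow": 1, "2 slow": 2, "3 slow": 3, "all slow": 4}
POWER = {name: compute_power(n) for name, n in CONFIGS.items()}

if __name__ == "__main__":
    for name, power in POWER.items():
        print(f"{name:9s} {float(power):.3f}")
```

Under this proxy, replacing a single fast core with a one-eighth-speed core removes less than a quarter of the machine's aggregate frequency, so any larger slowdown points at scheduling or partitioning rather than at raw capacity.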

9 Studying impact
- Scalability: perf. metric across configurations (all slow, 3 slow, 2 slow, 1 slow, all fast)
- Stability: perf. metric on an asymmetric configuration over the same or many runs
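
The slides describe scalability and stability only through plots. As a rough sketch of how the two properties could be quantified, the code below uses the coefficient of variation across repeated runs for stability and a monotonicity check across configurations for scalability. Both choices are assumptions made here for illustration, not the metrics used in the talk, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(runs):
    """Spread of a performance metric over repeated runs of one configuration.

    0.0 means every run performed identically; larger values mean less stable.
    """
    return pstdev(runs) / mean(runs)


def is_scalable(perf_by_power):
    """True if performance never drops as compute power increases.

    perf_by_power is a list of (compute_power, performance) pairs,
    one pair per configuration.
    """
    ordered = sorted(perf_by_power)
    return all(later[1] >= earlier[1] for earlier, later in zip(ordered, ordered[1:]))
```

A workload would then count as stable when the first value stays near zero on an asymmetric configuration, and as scalable when the second check holds from "all slow" through "all fast".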

10 Workloads evaluated
Workloads: SPECjbb, SPECjAppServer, Apache, Zeus, TPC-H, SPECOMP, H.264, PMake
Categories
- Middle-tier business apps. (throughput parallel)
- Webservers (throughput parallel)
- Task-based parallelization
- Embarrassingly parallel

11 Impact of asymmetry
[Table: one row per workload (SPECjbb, SPECjAppServer, Apache, Zeus, TPC-H, SPECOMP, H.264, PMake) with columns Scalable, Stable, and Fix; the per-cell marks are not recoverable.]

12 Workloads
- Managed runtime system (BEA JRockit & Sun HotSpot)
- Windows 2003 and Linux
- 2 GCs: Parallel and Gen. Concurrent
- Only minor GC
- Up to 20 threads
- Minimal communication

13 SPECjbb
- Problem: interference from the runtime system (JVM, GC)
- Fix: kernel scheduler moves jobs from slow to fast cores if one is free
[Figures: Stability (JRockit/Gencon GC) on 2 slow; 4 runs with kernel fix]
Scalable? Stable? (verdict marks not recoverable)
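
The kernel fix is stated only as "moves jobs from slow to fast if free". A minimal sketch of such a policy follows, assuming one job per core and a scheduler that can see every core's speed. The `Core` class and `migrate_to_idle_fast_cores` function are hypothetical illustrations, not the kernel patch used in the study.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Core:
    name: str
    speed: float                 # relative frequency, e.g. 1.0 (fast) or 0.125 (slow)
    job: Optional[str] = None    # None means the core is idle


def migrate_to_idle_fast_cores(cores: List[Core]) -> List[Tuple[str, str, str]]:
    """Move jobs off the slowest busy cores onto strictly faster idle cores.

    Returns the migrations performed as (job, from_core, to_core) tuples.
    """
    moves = []
    while True:
        idle = [c for c in cores if c.job is None]
        busy = [c for c in cores if c.job is not None]
        if not idle or not busy:
            return moves
        target = max(idle, key=lambda c: c.speed)   # fastest idle core
        source = min(busy, key=lambda c: c.speed)   # slowest busy core
        if target.speed <= source.speed:
            return moves                            # no migration would help
        moves.append((source.job, source.name, target.name))
        target.job, source.job = source.job, None
```

Each migration lands a job on a strictly faster core, so the loop always terminates. A real scheduler would also have to weigh migration cost and cache affinity, which this sketch ignores.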

14 Workloads
- Webserver on Linux
- Thread-based vs. event-based model
- ApacheBench
- Raw perf. with static page
- Light and heavy loads

15 Apache
- Problem: under light load, threads can land on fast or slow cores
- No issues under heavy load
- Fixes: kernel scheduler, or shorter lifetime of threads
[Figure: Scalability & Stability (light load)]
Stable? Scalable? (verdict marks not recoverable)

16 Zeus
- Unpredictable under both heavy and light loads
- Superior perf. on symmetric configs.
- Problem: aggressive application-level scheduling
[Figure: Scalability & Stability]
Stable? Scalable? (verdict marks not recoverable)

17 Workloads
- OMP: scientific app., loop-based parallelization; Intel Fortran, OpenMP on Linux
- H.264: media encoding; OpenMP on Windows 2003
- PMake: parallel make of the Linux kernel

18 SPECOMP
- OpenMP schedules tasks assuming equal-performance processors
- Problem: fast processors are held up by slow ones
- Fix: change the scheduling of tasks to on-demand
- Downside: overheads
[Figure: Scalability with app. fix]
Scalable? Stable? (verdict marks not recoverable)
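
The difference between up-front and on-demand task assignment can be illustrated with a small simulation. The sketch below assumes equal-sized tasks, a barrier at the end of the loop, and zero dispatch overhead, so it leaves out the "overheads" downside noted on the slide. The function names are hypothetical, and the code models the general idea rather than the OpenMP runtime itself.

```python
import heapq


def static_makespan(num_tasks, task_work, speeds):
    """Equal up-front split: each core gets the same share; the barrier waits for the slowest."""
    base, extra = divmod(num_tasks, len(speeds))
    finish_times = []
    for index, speed in enumerate(speeds):
        share = base + (1 if index < extra else 0)
        finish_times.append(share * task_work / speed)
    return max(finish_times)


def on_demand_makespan(num_tasks, task_work, speeds):
    """On-demand: whichever core becomes free first takes the next task."""
    free_at = [(0.0, index) for index in range(len(speeds))]
    heapq.heapify(free_at)
    end = 0.0
    for _ in range(num_tasks):
        now, index = heapq.heappop(free_at)
        done = now + task_work / speeds[index]
        end = max(end, done)
        heapq.heappush(free_at, (done, index))
    return end
```

With 64 unit-sized tasks on the "1 slow" configuration (`speeds=[1.0, 1.0, 1.0, 0.125]`), the static split finishes at 128 time units because the slow core holds the barrier, while on-demand assignment finishes at 24.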

19 H.264 & PMake
- H.264 slows down significantly with 1 slow proc.
- Speeds up with 1 fast proc.
- PMake linearly scalable on all configurations
[Figure panels: H.264; PMake]
Scalable? Stable? (verdict marks not recoverable)

20 Impact of asymmetry
[Table columns: Scalable, Stable, Fix (legend: App. fix, Kernel fix); the per-cell marks are not recoverable.]
SPECjbb
- Interference from runtime system. Garbage collector dependent. Concurrent GC causes more problems.
- Fix: Migrate tasks from slow to fast core if one is free. Inspect runtime software, interference between threads (GC).
SPECjAppServer
- Robust, multi-tier application. Feedback tunes the workload. Very responsive to interference, small heaps etc.
Apache
- Thread serves many requests to reduce overheads. Problems with light load. Threads can map to fast or slow proc.
- Fix: Migrate tasks from slow to fast core if one is free. Or, handle few requests and recycle threads. High overhead, low perf.
Zeus
- Superior perf. in symmetric system. Unpredictable on asymm. with heavy and light loads. Independent application scheduling.
- Fix: Reconsider application scheduling.
TPC-H
- Query parallelization not aware of asymm. Intra-query parallelization worsens stability.
- Fix: Approx. application change by reducing degree of parallelization. Fix application scheduler. Consider asymm. in query optimization engine.
SPECOMP
- OpenMP based parallelization with sync. barriers. Fast cores held by slow.
- Fix: Assign tasks on-demand instead of up-front. Make OpenMP understand asymm.
H.264
- Robust application. Heavy utilization. Threads well-balanced and abundant.
PMake
- Multi-programming with several tasks.

21 Conclusions
Asymmetric systems
- Good for energy and performance
- But can introduce unpredictability
Software needs to understand asymmetry
- Evaluate the application's work partitioning
- Scheduling of tasks; mostly no other changes
- Possibly feedback-based
Suitable asymmetry
- Many slow and few fast processors

Questions?