Embracing Heterogeneity with Dynamic Core Boosting. Hyoun Kyu Cho and Scott Mahlke, University of Michigan Electrical Engineering and Computer Science.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science 1 Embracing Heterogeneity with Dynamic Core Boosting Hyoun Kyu Cho and Scott Mahlke University of Michigan May 20, 2014

2 Parallel Programming
[Figure labels: Workload; Core1, Core2, Core3, Core4]

3 Workload Imbalance Among Threads
Asymmetric S/W
– Control flow divergence
– Non-deterministic memory latencies
– Synchronization operations
Asymmetric H/W
– Heterogeneous multicores
– Core-to-core process variation

4 Performance Impact of Asymmetric H/W
[Figure: symmetric 8 cores vs. 8 cores w/ variations]

5 CPU Time Wasted for Synchronization
[Figure panels: Homogeneous, Heterogeneous]

6 Thread Criticality due to Workload Imbalance
[Figure: per-thread timelines for T1 to T5; labels: Idle, Barrier, time]

7 Accelerating Critical Path w/ Core Boosting
[Figure: per-thread timelines for T1 to T5; labels: Idle, Barrier, time]
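The idea on slides 6 and 7 can be illustrated with a small calculation. The sketch below is not from the presentation: it assumes each thread's arrival time at the barrier is simply its work divided by its core speed, so the barrier releases when the slowest (critical) thread arrives. The names `arrival_time`, `critical_thread`, and `barrier_time` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Time for one thread to reach the barrier: work divided by core speed.
 * This simple model is an assumption made for illustration. */
static double arrival_time(double work, double speed)
{
    assert(speed > 0.0);
    return work / speed;
}

/* Index of the critical thread, i.e. the one arriving last. Assumes n > 0. */
size_t critical_thread(const double *work, const double *speed, size_t n)
{
    size_t crit = 0;
    for (size_t i = 1; i < n; i++)
        if (arrival_time(work[i], speed[i]) >
            arrival_time(work[crit], speed[crit]))
            crit = i;
    return crit;
}

/* The barrier releases when the critical thread arrives. */
double barrier_time(const double *work, const double *speed, size_t n)
{
    size_t crit = critical_thread(work, speed, n);
    return arrival_time(work[crit], speed[crit]);
}
```

With five threads of work {4, 6, 5, 3, 4} on equal-speed cores, the second thread is critical and the barrier takes 6 time units. Boosting only that thread's core by 1.5x moves its arrival to 4, so the barrier time drops to 5 and a different thread becomes critical.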

8 Modeling Workload Imbalance & Boosting

9 Boosting Assignment
Data parallel programs
Pipeline parallel programs
[Figure labels: Worker; Stage1, Stage2, Stage3, Stage4]

10 Boosting Data Parallel Programs
Greedy scheduling
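The slide names greedy scheduling without giving details. One plausible reading, offered here only as a sketch, is that at each decision point the single boost goes to the unfinished thread with the largest estimated remaining time. The tick-based simulation, the integer work units, and the names `pick_boost_target` and `ticks_to_barrier` are all assumptions for illustration, not the authors' implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the unfinished thread with the largest estimated remaining time
 * (remaining work / core speed), or -1 if every thread is done.
 * Speeds are assumed to be positive. */
int pick_boost_target(const int *remaining, const int *speed, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (remaining[i] <= 0)
            continue;
        /* Compare remaining[i]/speed[i] with remaining[best]/speed[best]
         * by cross-multiplying, to stay in integer arithmetic. */
        if (best < 0 || (long)remaining[i] * speed[best] >
                        (long)remaining[best] * speed[i])
            best = (int)i;
    }
    return best;
}

/* Simulate one barrier interval in discrete ticks. Each tick, the chosen
 * thread runs at speed * boost_num / boost_den; the others run at speed.
 * Returns the number of ticks until all threads finish.
 * The remaining[] array is consumed. */
int ticks_to_barrier(int *remaining, const int *speed, size_t n,
                     int boost_num, int boost_den)
{
    assert(boost_den > 0);
    int ticks = 0;
    for (;;) {
        int target = pick_boost_target(remaining, speed, n);
        if (target < 0)
            return ticks;
        for (size_t i = 0; i < n; i++) {
            if (remaining[i] <= 0)
                continue;
            int s = speed[i];
            if ((int)i == target)
                s = s * boost_num / boost_den;
            remaining[i] -= s;
            if (remaining[i] < 0)
                remaining[i] = 0;
        }
        ticks++;
    }
}
```

For four threads with work {12, 18, 12, 12} on cores of speed 2, the interval takes 9 ticks with no boost (`boost_num = boost_den = 1`) and 6 ticks when the lagging thread is boosted by 3/2 each tick.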

11 Boosting Pipeline Parallel Programs
Epoch-based scheduling
– Monitors CPU utilization with H/W performance counter
– Assigns boosting budget at the end of epoch
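The slide does not say how the boosting budget is divided among pipeline stages. As a rough sketch under that gap, the function below splits a budget of boost slices for the next epoch in proportion to each stage's busy cycles in the epoch just ended, and gives any rounding remainder to the busiest stage. The proportional policy, the slice unit, and the name `assign_boost_budget` are assumptions; on real hardware the busy-cycle counts would come from performance counters.

```c
#include <assert.h>
#include <stddef.h>

/* At the end of an epoch, divide `budget` boost slices among pipeline stages
 * in proportion to their measured busy cycles (a stand-in for CPU
 * utilization). Any rounding remainder goes to the busiest stage.
 * Writes the per-stage allocation into out[0..n_stages-1]. */
void assign_boost_budget(const unsigned long long *busy_cycles,
                         size_t n_stages, unsigned budget, unsigned *out)
{
    assert(busy_cycles != NULL && out != NULL);
    unsigned long long total = 0;
    size_t busiest = 0;
    for (size_t i = 0; i < n_stages; i++) {
        total += busy_cycles[i];
        if (busy_cycles[i] > busy_cycles[busiest])
            busiest = i;
        out[i] = 0;
    }
    if (n_stages == 0 || total == 0)
        return;                 /* nothing ran, so nothing to boost */

    unsigned given = 0;
    for (size_t i = 0; i < n_stages; i++) {
        out[i] = (unsigned)(budget * busy_cycles[i] / total);
        given += out[i];
    }
    out[busiest] += budget - given;
}
```

With busy cycles {900, 300, 600, 200} and a budget of 10 slices, the proportional shares round down to {4, 1, 3, 1}, and the one leftover slice goes to the first stage, giving {5, 1, 3, 1}.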

12 Dynamic Core Boosting

13 Progress Monitoring Example
    …
    pthread_barrier_wait(barrier);
    /* number of iterations between progress reports for this loop */
    period = calc_period_LID_007(start, end);
    for (i = start; i < end; i++) {
        …
        compute(…);
        if (side_exit) {
            /* leaving the loop early: report the loop as complete */
            SET_PROGRESS_TO(MAX_PROGRESS_007);
            break;
        }
        /* report one step of progress every `period` iterations */
        if (((end - i) % period) == 0)
            PROGRESS_STEP_FORWARD;
    }
    pthread_barrier_wait(barrier);
    …
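The slide leaves the progress macros and `calc_period_LID_007` undefined. The following is a minimal, self-contained sketch of how such instrumentation could behave, not the authors' code: it assumes progress is a small counter that reaches `MAX_PROGRESS` when the loop completes, and that the period is the iteration count divided by `MAX_PROGRESS`, rounded up. `MAX_PROGRESS`, `progress_t`, `calc_period`, and `monitored_loop` are hypothetical names, and the barriers are omitted so the example runs single-threaded.

```c
#include <assert.h>

#define MAX_PROGRESS 8   /* hypothetical number of progress steps per loop */

/* Hypothetical per-thread progress counter that a runtime could read. */
typedef struct { int progress; } progress_t;

/* Iterations per progress step, rounded up so that at most MAX_PROGRESS
 * steps are reported over the whole loop. */
static int calc_period(int start, int end)
{
    int iters = end - start;
    int period = (iters + MAX_PROGRESS - 1) / MAX_PROGRESS;
    return period > 0 ? period : 1;
}

/* Run the instrumented loop and return the progress count at its end.
 * side_exit_at is the iteration that leaves the loop early, or -1 for none
 * (this assumes start >= 0). */
int monitored_loop(int start, int end, int side_exit_at)
{
    progress_t p = { 0 };
    int period = calc_period(start, end);
    for (int i = start; i < end; i++) {
        /* the loop body (compute) would run here */
        if (i == side_exit_at) {
            p.progress = MAX_PROGRESS;   /* early exit counts as complete */
            break;
        }
        if (((end - i) % period) == 0)
            p.progress++;                /* one step every `period` iterations */
    }
    return p.progress;
}
```

For a loop over 0 to 800 the period is 100, so the counter steps at i = 0, 100, ..., 700 and ends at 8. A side exit at iteration 250 also leaves the counter at 8, which matches the slide's intent that an early exit should not look like a lagging thread.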

14 Evaluation Methodology
Asymmetry emulation with Dynamic Binary Translation
– Slow down proportionally instead of accelerating
8 cores with frequency variation
– 1 core boosted, boosting rate = 1.5x
Compares
– Heterogeneous
– Reactive
– DCB
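"Slow down proportionally instead of accelerating" can be read as follows: since the test machine cannot run faster than its native speed, native speed stands in for the fastest boosted core, and every other core is slowed by the matching ratio. The function below sketches that arithmetic. The normalization and the name `emulated_speed_fraction` are assumptions based on the slide's one-line description, not a documented part of the authors' platform.

```c
#include <assert.h>

/* Fraction of native speed at which a core should run under emulation.
 * rel_freq is the core's relative frequency, max_rel_freq is the largest
 * relative frequency among all cores, and boost_rate is the speedup a
 * boosted core would get (1.5 on the slide). The fastest boosted core maps
 * to native speed (1.0); everything else is scaled down proportionally. */
double emulated_speed_fraction(double rel_freq, double max_rel_freq,
                               double boost_rate, int boosted)
{
    assert(rel_freq > 0.0 && max_rel_freq >= rel_freq && boost_rate >= 1.0);
    double target = rel_freq * (boosted ? boost_rate : 1.0);
    double fastest = max_rel_freq * boost_rate;
    return target / fastest;
}
```

Under this reading, with a boosting rate of 1.5x a boosted full-frequency core runs at native speed, an unboosted full-frequency core is slowed to about 0.67 of native speed, and an unboosted core at 0.8 relative frequency is slowed to about 0.53.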

15 Performance Improvement

16 Synchronization Overheads

17 Thread Arrival Time

18 Conclusion
DCB mitigates workload imbalance in performance-asymmetric CMPs
– Accelerating critical threads
– Coordinating compiler, runtime, and architecture for near-optimal assignment
Overall, improves performance by 33%, outperforming a reactive boosting scheme by 10%

19 Thank you!

20 Core Boosting with Frequency Scaling
Transition time < 10ns [Dreslinski '12]

21 Asymmetry Emulation with DBT

22 Evaluation Platform Accuracy