Understanding Performance, Power and Energy Behavior in Asymmetric Processors Nagesh B Lakshminarayana Hyesoon Kim School of Computer Science Georgia Institute of Technology


Understanding Performance, Power and Energy Behavior in Asymmetric Processors
Nagesh B Lakshminarayana, Hyesoon Kim
School of Computer Science, Georgia Institute of Technology

2 Outline
Background and Motivation
Thread Interactions
Dynamic Scheduling
Asymmetry-Aware Scheduling
Conclusion and Future Work

3 Heterogeneous Architectures
A particularly interesting class of parallel machines is heterogeneous architectures:
–Multiple types of Processing Elements (PEs) available on the same machine
[Figure: PE A and PE B connected by an interconnect]

4 Heterogeneous Architectures
Heterogeneous architectures are becoming very common, e.g., the IBM Cell processor.
[Figure: a fast core, several slow cores, and a special accelerator; the focus of this talk is asymmetric processors with fast and slow cores]

5 Machine Configurations
All-slow (SMP): all processors running at their lowest frequency
Half-half (AMP): half of the processors running at their highest frequency, the rest at their lowest frequency
All-fast (SMP): all processors running at their highest frequency
Machine-I: 2-socket 1.87 GHz quad-core Intel Xeon, 4 MB L2 cache, 8 GB RAM, 40 GB HDD, RHEL 5
Machine-II: 4-socket 2 GHz quad-core AMD Opteron, L3 cache, 32 GB RAM, 1 TB HDD, RHEL 4
Machine-I experiments use 8 threads; Machine-II experiments use 16 threads. AMPs are emulated using SpeedStep/PowerNow.

6 Power Measurement
Total system power consumption is measured using an Extech Power Analyzer.
[Figure: measurement setup with the experiment machine, a Windows logging machine, the power analyzer, a power cable, a serial cable, and the power socket]

7 PARSEC Benchmark Suite
Desktop-oriented multithreaded benchmark suite
–Animation, data mining, and financial analysis workloads
–Pthreads and OpenMP implementations

8 Performance of PARSEC Benchmarks
On average, the performance of half-half is between that of all-slow and all-fast.
[Figure: execution time of the benchmarks, grouped into slow-limited, middle-perf, and unstable categories]

9 Classification of Benchmarks
[Figure: execution behavior around a barrier for each class: (a) slow-limited, (b) middle-perf, (c) unstable]

10 Energy Consumption of PARSEC
In half-half/all-slow, total energy consumption is higher even though the average power consumed might be lower.
[Figure: energy consumption of the slow-limited and middle-perf benchmarks]

11 Behavior of PARSEC Benchmarks
Observations:
–Different applications behave differently on AMPs
–Usually, an SMP with fast processors saves energy

12 Why do different applications behave differently on AMPs?

13 Outline
Background and Motivation
Thread Interactions
Dynamic Scheduling
Asymmetry-Aware Scheduling
Conclusion and Future Work

14 Thread Interactions
Sources of thread interactions:
–Critical sections
–Barriers

15 Critical Sections (CS)
Threads wait to enter critical sections.
[Figure: Case (a) a thread doing useful work inside the critical section; Case (b) other threads waiting to enter it]

16 Barriers
Threads wait at barriers for other threads to finish.
[Figure: execution timeline with two barriers; threads idle at each barrier until the slowest thread arrives]

17 Effect of Critical Section Length
As critical section length increases, the average power consumed decreases.
[Figure: normalized power consumption of a CS-limited application]

18 Effect of Critical Section Length
[Figure: normalized execution time of a CS-limited application]

19 Effect of Critical Section Length
The performance of AMPs is sensitive to CS length.
[Figure: normalized execution time of a CS-limited application]

20 Effect of Critical Section Length
Energy consumption shows the same trend.
[Figure: normalized energy consumption of a CS-limited application]

21 Effect of Critical Section Frequency
Both the length and the frequency of critical sections affect performance and energy consumption.
As CS frequency increases, the performance difference between half-half and all-fast shrinks.
If the majority of execution time is spent waiting for locks, it is acceptable to have a few slow processors.
Results are available in the paper.

22 Effect of Barriers
With few barriers, half-half performs similarly to all-slow.
With a large number of barriers, half-half performs similarly to all-fast.
Results are available in the paper.

23 Outline
Background and Motivation
Thread Interactions
Dynamic Scheduling
Asymmetry-Aware Scheduling
Conclusion and Future Work

24 Dynamic Scheduling
Motivation: better run-time adaptivity.
Each thread requests more work after completing its assigned work.
Examples: OpenMP, Intel Threading Building Blocks.
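The request-more-work-on-completion idea can be sketched with a shared chunk counter (chunk and thread counts are made up); this is the essence of OpenMP's `schedule(dynamic)`: a thread on a fast core drains the queue faster and so automatically takes a larger share of the chunks.

```c
/* Minimal sketch of dynamic scheduling (hypothetical chunk/thread
   counts): threads pull the next chunk index from a shared counter
   until the queue is drained. The assertions confirm every chunk is
   processed exactly once, regardless of how the threads interleave. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NCHUNKS  64
#define NTHREADS 4

static pthread_mutex_t m = pthread_mutex_initializer_hack;
#undef pthread_mutex_initializer_hack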

25 Dynamic Scheduling Can help improve performance and reduce energy consumption in AMPs Should be preferred to static and guided policies Machine configuration Normalized Execution Time Normalized Energy Consumption Static/Dynamic 1 GHz (SMP) GHz (SMP) GHz (SMP) GHz (SMP) GHz (SMP) GHz, 2 GHz (AMP)1.00/ / GHz, 2 GHz (AMP)0.83/ / GHz, 2 GHz (AMP)0.71/ / GHz, 2 GHz (AMP)0.59/ /0.63 Parallel-for application

26 Outline Background and Motivation Thread Interactions Dynamic Scheduling Asymmetry Aware Scheduling Conclusion and Future Work

27 Scheduling in AMPs Longest Job to a Fast Processor First (LJFPF) [Lakshminarayana’08] barrier Fast core Slow core

28 How Does the Scheduler Know Length of work? Current mechanism: application sends task length information On-going work: Prediction mechanism

29 LJFPF ITK: Medical image processing applications (OpenSource) MultiRegistration (Registration method) –kernel with 50 iterations –50 iterations divided among 8 threads Normalized Execution TimeNormalized Energy Consumption

30 Outline Background and Motivation Thread Interactions Dynamic Scheduling Asymmetry Aware Scheduling Conclusion and Future Work

31 Conclusion & Future Work Conclusion Evaluated the performance/energy consumption behavior of multithreaded applications in AMPs For symmetric workloads –With little thread interaction: SMP with fast processors –With a lot of thread interaction: AMP could be better For asymmetric threads – AMP could provide lowest energy consumption Future Work Predict application characteristics and use predicted information for thread scheduling on AMPs

32 Thank you!