Scaling and Packing on a Chip Multiprocessor
Vincent W. Freeh, Tyler K. Bletsch, Freeman L. Rawson, III
Austin Research Laboratory

Introduction
Want to save power without a performance hit.
Dynamic Voltage and Frequency Scaling (DVFS):
– Slow down the CPU
– Linear speed loss, quadratic CPU power drop
– Efficient, but limited range
– A number of fixed p-states
CPU Packing:
– Run a workload on fewer CPU cores
– Linear speed loss, linear CPU power drop
– Less efficient, but greater range
– A number of fixed configurations
What about using both? (The sketch below shows how a p-state can be requested from software.)
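On Linux, p-states like these are typically exposed through the cpufreq sysfs interface. A minimal sketch, assuming the "userspace" governor is available and the program runs as root; the paths are the standard cpufreq files, and the 1.0 GHz target is illustrative rather than taken from the talk:

```c
/* Minimal sketch: request a fixed CPU frequency via Linux cpufreq.
 * Assumes the "userspace" governor is available and we run as root. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    fclose(f);
    return 0;
}

int main(void) {
    /* Hand frequency control to userspace, then pick a p-state. */
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                "userspace");
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                "1000000");  /* in kHz: 1.0 GHz, the lowest p-state in the talk */
    return 0;
}
```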

Hardware architecture
16 nodes: complete systems, each with:
– 2 CPU sockets per node (physical dies)
– 2 cores per socket (4 total cores per node)
4-level memory hierarchy:
– L1 & L2 cache: per-core
– Local memory: per-socket
– Remote memory: reachable over the HyperTransport bus (illustrated in the sketch below)
[Diagram: two AMD64 sockets, each with two cores (per-core L1 instruction/data and L2 caches) and 1 GB of local memory, linked by HyperTransport.]
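The local/remote split means memory placement matters on this machine. As an illustration not taken from the talk, libnuma can pin an allocation to one socket's local memory (link with -lnuma):

```c
/* Illustrative only: allocate on a specific NUMA node with libnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t bytes = 64 * 1024 * 1024;
    /* Place the buffer in socket 0's local memory; cores on socket 1
     * would reach it over HyperTransport (a remote access). */
    double *buf = numa_alloc_onnode(bytes, 0);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = 0.0;
    numa_free(buf, bytes);
    return 0;
}
```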

P-states and configurations
Scaling:
– The entire socket must scale together
– 5 p-states: every 200 MHz from 1.8 GHz down to 1.0 GHz
Packing:
– 5 configurations:
  – All four cores: ×4
  – Three cores: ×3
  – Cores 0 and 1: ×2
  – Cores 0 and 2: ×2*
  – One core: ×1
– For multi-node tests, prepend the number of nodes: 4×2 means 4 nodes with cores 0 and 1 active (8 total cores)
– Packing results "simulate" a full socket shutdown (subtract 20 W)
(A sketch of enforcing a packing configuration with CPU affinity follows below.)
[Diagram: two sockets, each with two cores and local memory, linked by HyperTransport.]
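Packing amounts to restricting which cores a workload may use. The slides do not say how the authors enforced it; a minimal sketch using the standard Linux affinity API to pin a process into the ×2 configuration:

```c
/* Minimal sketch: force the calling process into the ×2 packing
 * configuration (cores 0 and 1 only) using Linux CPU affinity. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* core 0 */
    CPU_SET(1, &set);   /* core 1: same socket, i.e. the ×2 config */
    /* For ×2*, set cores 0 and 2 (one core per socket) instead. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* The workload (and any children, which inherit the mask)
     * now runs packed onto the chosen cores. */
    return 0;
}
```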

Three application classes
CPU-bound:
– No communication, fits in cache
– 100% CPU utilization
– Similar to while(1){}
High-Performance Computing (HPC):
– Inter-node communication
– Significant memory usage
– Performance = execution time
Commercial:
– Constant servicing of remote requests
– Possibly significant memory usage
– Performance = throughput

(1) CPU-bound workloads
Workload:
– DAXPY: a small linear algebra kernel (see the sketch below)
– Representative of the entire class
Scaling:
– Linear slowdown
– Quadratic power cut
Packing:
– ×4 is most efficient
– ×2* is no good here
– ×3 is right out
– Single-socket configs ×1 and ×2 save power, but kill performance
[Chart: power (W) vs. throughput across the different p-states and configurations.]
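For reference, DAXPY is the BLAS level-1 operation y ← a·x + y. A minimal C version of the kernel; the slides do not give the authors' exact loop or problem size, so the sizes here are illustrative, chosen small enough to stay cache-resident as the slides describe:

```c
/* DAXPY: y = a*x + y, the kernel the talk uses as its CPU-bound workload. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 4096 };          /* small enough to fit in cache */
    static double x[N], y[N];
    for (size_t i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    for (long iter = 0; iter < 1000000; iter++)   /* keep the CPU busy */
        daxpy(N, 3.0, x, y);
    return 0;
}
```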

(2) HPC workloads: packing with fixed nodes
[Charts: power, energy, EDP, and time for each benchmark under each packing configuration; EDP is defined below.]
Annotations from the charts:
– ×2* often has no effect
– LU sees a ×2* speedup
– CG slows down, and CG's CPU utilization falls
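The slides never expand EDP; assuming it is the standard energy-delay product metric: EDP = E · T = P · T², where E is energy, T is execution time, and P is average power. Lower is better, and squaring T penalizes slowdowns more heavily than plain energy does.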

(2) HPC workloads: packing with fixed cores
[Charts: power, energy, EDP, and time under each configuration when the total core count is held fixed.]

(3) Commercial workloads
Scale first, then pack.
[Chart: power (W) vs. throughput (replies/second).]

Conclusions
Packing is less efficient than scaling
– Therefore: scale first, then pack (a sketch of this policy follows below)
Nothing can help CPU-bound apps
Memory- and I/O-bound workloads are scalable
Resource utilization affects (predicts?) the effectiveness of scaling and packing
Business workloads can benefit from scaling/packing
– Especially at low utilization levels
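As a hedged sketch of what "scale first, then pack" could look like as a policy loop: the talk proposes the ordering, not this implementation, and set_pstate(), set_active_cores(), and performance_acceptable() are hypothetical stand-ins for the cpufreq and affinity mechanisms sketched earlier.

```c
/* Hedged sketch of the "scale first, then pack" ordering from the
 * conclusions. The helpers below are hypothetical stubs so the sketch
 * compiles; real versions would drive cpufreq and CPU affinity. */

#define NUM_PSTATES 5   /* 1.8 GHz down to 1.0 GHz in 200 MHz steps */
#define NUM_CORES   4

static int current_pstate = 0, active_cores = NUM_CORES;
static void set_pstate(int p)       { current_pstate = p; }
static void set_active_cores(int n) { active_cores = n; }
static int  performance_acceptable(void) { return 1; /* placeholder check */ }

void reduce_power(void) {
    /* First exhaust the efficient knob: lower the frequency. */
    for (int p = 1; p < NUM_PSTATES; p++) {
        set_pstate(p);
        if (!performance_acceptable()) { set_pstate(p - 1); return; }
    }
    /* Only then use the less efficient, wider-range knob: pack. */
    for (int n = NUM_CORES - 1; n >= 1; n--) {
        set_active_cores(n);
        if (!performance_acceptable()) { set_active_cores(n + 1); return; }
    }
}
```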

Future work
How does resource utilization influence the effectiveness of scaling/packing?
– A predictive model based on resource usage?
– A power management engine based on resource usage?
Dynamic packing:
– Virtualization allows live migration
– Can this be used to do packing on the fly?