Erkan Çetiner. Outline Introduction Related Works Modeling Methodology Baseline Results DTM Techniques Conclusions.

INTRODUCTION

SMT (Simultaneous Multithreading): Allows instructions from multiple threads to be fetched and executed simultaneously in the same pipeline. This amortizes the cost of the pipeline by delivering more IPC (instructions per cycle). Even though SMT has shown good energy efficiency for most workloads, the significant boost in IPC increases power dissipation and possibly power density, so thermal behavior and cooling costs are major concerns.

CMP (Chip Multiprocessing): Instantiates multiple processor "cores" on a single die. Each core has private branch predictors and first-level caches, and shares a second-level, on-chip cache. For multiprogrammed workloads, it amortizes the cost of the die by allowing data sharing within a common L2 cache. Like SMT, CMP promises a boost in throughput. The replication of cores means that the area and power overhead to support extra threads is much greater with CMP than with SMT. For a given die size, a single-core SMT chip will therefore support a larger L2 cache than a multicore chip. Side effect for CMP: each added core on a chip increases power dissipation, so thermal behavior and cooling costs are also major concerns for CMP.

Why Compare Them? Both paradigms target increased throughput for multithreaded and multiprogrammed workloads, so it is worth comparing their performance, energy, and thermal behavior.

RELATED WORKS

Research Areas: Area overhead and energy efficiency of SMT. Energy efficiency and several power-aware optimizations for a multithreaded Alpha processor. Energy efficiency of SMT and CMP for multimedia workloads. Hybrid systems that include both SMT and CMP.

MODELING METHODOLOGY

Microarchitecture & Performance Modeling: Turandot/PowerTimer is used to model an out-of-order, superscalar processor with a resource configuration similar to current-generation microprocessors.

Microarchitecture & Performance Modeling: SMT is modeled by duplicating the data structures that correspond to duplicated resources and by increasing the sizes of shared critical resources such as the register file. A round-robin policy is used at various pipeline stages to decide which thread should go ahead. It is difficult to compare the performance of different CMP or SMT configurations, so a baseline is needed.
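The round-robin policy mentioned above can be illustrated with a minimal sketch. The function name `round_robin_pick` and the `ready` flags are assumptions made for illustration; the slides do not describe how the simulator implements the policy.

```python
def round_robin_pick(ready, last):
    """Pick the next ready thread after `last`, wrapping around.

    ready: list of booleans, one per hardware thread (hypothetical input).
    last:  index of the thread that was served most recently.
    Returns the chosen thread index, or None if no thread is ready.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last + offset) % n
        if ready[candidate]:
            return candidate
    return None
```

With two threads, the stage alternates between them while both are ready, and keeps serving the same thread if the other one is stalled.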

Benchmarks: 15 SPEC2000 benchmarks are used as single-thread benchmarks. The SimPoint toolset is used to get representative simulation points for 500 million instructions. A trace generation tool produces the final static traces by skipping the number of instructions given by SimPoint, after which 500 million instructions are simulated and captured. Pairs of single-thread benchmarks are combined to form dual-thread SMT and CMP benchmarks. Benchmarks are categorized as high IPC (> 0.9) or low IPC (< 0.9), as high temperature (peak temperature > 82°C) or low temperature (peak temperature < 82°C), and as floating-point or integer.
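The categorization above amounts to three independent threshold tests, sketched below. The 0.9 IPC and 82°C thresholds come from the slide; the function name, the label strings, and the treatment of values exactly at a threshold (the slide does not define it) are assumptions.

```python
IPC_THRESHOLD = 0.9        # from the slide
TEMP_THRESHOLD_C = 82.0    # peak temperature threshold from the slide

def categorize(ipc, peak_temp_c, is_floating_point):
    """Return the three category labels for one single-thread benchmark.

    Values exactly equal to a threshold fall into the "low" class here;
    that tie-breaking choice is an assumption, not stated in the slides.
    """
    return (
        "high-IPC" if ipc > IPC_THRESHOLD else "low-IPC",
        "high-temperature" if peak_temp_c > TEMP_THRESHOLD_C else "low-temperature",
        "floating-point" if is_floating_point else "integer",
    )
```

For example, a hypothetical integer benchmark with an IPC of 1.4 and a peak temperature of 84°C would be labeled high-IPC, high-temperature, integer.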

Power Model: Base energy models are derived from circuit-level power analysis. In this research, the analysis is performed at the macro level. Assumption: leakage power density is uniform for all units on the chip that are at the same temperature (more accurate leakage power models would give more accurate conclusions).

Temperature Model: HotSpot 2.0 is used. It models temperature using a circuit of thermal resistances and capacitances that are derived from the layout of the microarchitecture units. Assumption: at least one temperature sensor is provided for each microarchitecture block in the floorplan.
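To give a feel for what a thermal resistance and capacitance circuit computes, here is a one-node sketch: a single block with thermal resistance R to ambient and thermal capacitance C, integrated with forward Euler. This is not HotSpot's API or its actual network, which couples many nodes; every name and value below is an assumption chosen for illustration.

```python
def simulate_block_temperature(power_w, r_th, c_th, t_ambient, t_initial, dt, steps):
    """Integrate C * dT/dt = P - (T - T_ambient) / R with forward Euler.

    power_w:   dissipated power in watts
    r_th:      thermal resistance to ambient (K/W)
    c_th:      thermal capacitance (J/K)
    dt, steps: time step in seconds and number of steps
    Returns the block temperature after the last step.
    """
    temp = t_initial
    for _ in range(steps):
        d_temp = (power_w - (temp - t_ambient) / r_th) / c_th
        temp += d_temp * dt
    return temp
```

Under this model the temperature settles at T_ambient + P * R, so doubling the power of a block doubles its temperature rise above ambient; the time constant R * C sets how quickly it gets there.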

Chip Die Area & L2 Cache Size Selection: Selecting an appropriate L2 cache size is very important. The core area stays fixed in the experiment, so the number of cores and the L2 cache size determine the total chip die area. Because CMP requires additional chip area for the second core, its L2 cache must be smaller to achieve an equivalent die area.
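The equal-area trade-off can be sketched as simple arithmetic, assuming die area is just the sum of core area and L2 area and that L2 area grows linearly with capacity. Both assumptions, the function name, and the numbers in the usage note are hypothetical; the slides do not give per-core or per-megabyte areas.

```python
def l2_size_for_equal_area(die_area_mm2, core_area_mm2, n_cores, l2_area_per_mb_mm2):
    """Return the L2 capacity (MB) that fits next to n_cores on a fixed die.

    Assumes die area = n_cores * core area + L2 area, with L2 area
    proportional to capacity. Real floorplans have other overheads.
    """
    remaining = die_area_mm2 - n_cores * core_area_mm2
    if remaining < 0:
        raise ValueError("cores alone exceed the die area budget")
    return remaining / l2_area_per_mb_mm2
```

With made-up values of a 200 mm² die, 50 mm² per core, and 50 mm² per MB of L2, one core leaves room for 3 MB while two cores leave room for only 2 MB, which is the direction of the effect described above.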

BASELINE RESULTS

Some statistics: Chip area: 210 mm². L2 cache sizes: ST 2 MB, SMT 2 MB, CMP 1 MB.

Performance & Energy: CMP outperforms SMT for workloads with low L2 cache miss rates (87%-26%). SMT outperforms CMP for workloads with high miss rates (42%-22%).

Performance & Energy: Power overhead of SMT (38%-46%). Main reasons for the power growth: the increased resources SMT requires, and increased utilization due to the additional simultaneous instruction throughput. Power overhead of CMP (93%-71%). Main reason: the addition of an entire second processor. Judging by these metrics, CMP is the most energy-efficient for benchmarks with low L2 cache miss rates, and SMT is the most energy-efficient for benchmarks with high L2 cache miss rates.

Performance & Energy: With a smaller L2 cache and a high cache miss ratio, the program is memory-bound, hence SMT is better in terms of performance and energy. With a larger L2 cache and a low cache miss ratio, the program is not memory-bound, and CMP is better.

Temperature: SMT and CMP show relatively similar temperatures.

Temperature: So why does temperature increase for both of them? SMT processor: the temperature hotspots are largely due to the higher utilization factor of certain structures such as the integer register file. CMP processor: two cores are integrated, so the total power of the chip nearly doubles and hence the total amount of heat generated nearly doubles.

DTM TECHNIQUES

DTM (Dynamic Thermal Management) Constrained Techniques: To reduce packaging costs, the package is designed to sustain the thermal requirements of typical workloads, and DTM techniques are engaged when the temperature exceeds the design set point.

DTM Techniques: Dynamic voltage scaling, fetch-throttling, rename-throttling, register-file occupancy throttling.

Dynamic Voltage Scaling: Cuts voltage and frequency in response to thermal violations. Restores the high voltage and frequency when the temperature drops below the trigger threshold.
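The trigger behavior described above can be sketched as a threshold check run once per control interval. The operating points and the trigger temperature in the usage note are hypothetical; the slides do not state them. The helper for relative dynamic power uses the standard approximation that dynamic power scales with V² times f, which is general background rather than a result from the slides.

```python
def dvs_step(temp_c, trigger_c, high, low):
    """Choose the (voltage, frequency) pair for the next interval.

    Uses the low setting while the temperature is above the trigger and
    returns to the high setting once it is at or below it. No hysteresis
    is modeled; that simplification is an assumption.
    """
    return low if temp_c > trigger_c else high

def relative_dynamic_power(setting, nominal):
    """Dynamic power of `setting` relative to `nominal`, using P ~ V^2 * f."""
    v, f = setting
    v0, f0 = nominal
    return (v / v0) ** 2 * (f / f0)
```

For instance, scaling both voltage and frequency to 85% of nominal (hypothetical values) cuts dynamic power to roughly 61% of nominal under this approximation, at the cost of a 15% lower clock.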

Fetch-throttling: Limits how often the fetch stage is allowed to proceed, which reduces activity factors throughout the pipeline. Rename-throttling: Limits the number of instructions renamed each cycle.
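Both throttles can be sketched as small per-cycle gates. The duty-cycle form of fetch-throttling (fetch enabled for a fixed number of cycles out of every period) and all names below are assumptions for illustration; the slides only say that fetch frequency and rename width are limited.

```python
def fetch_allowed(cycle, active_cycles, period):
    """Fetch-throttling: allow fetch on `active_cycles` out of every `period` cycles."""
    return (cycle % period) < active_cycles

def renamed_this_cycle(pending, rename_limit):
    """Rename-throttling: rename at most `rename_limit` instructions per cycle."""
    return min(pending, rename_limit)
```

With one active cycle out of four, the front end fetches on a quarter of the cycles; with a rename limit of two, a group of five pending instructions takes three cycles to pass the rename stage.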

Register-File Occupancy-throttling: The register file is the hottest spot on the chip, and its power is proportional to its occupancy. To reduce the power of the register file, the number of usable register entries is limited to a fraction of the full size. All these techniques have a common property: by limiting the resources available to the processor, they cause it to slow down, consume less power, and finally cool to below the thermal trigger level.
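Register-file occupancy throttling can be sketched as a cap on how many physical registers may be in use at once. The function names, the rounding rule, and the example sizes are assumptions; the slides do not give the register file size or the fractions used.

```python
def allowed_register_entries(total_entries, fraction):
    """Number of register entries usable while throttled.

    `fraction` is the share of the full register file that stays
    available; rounding down (with a floor of one entry) is an assumption.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, int(total_entries * fraction))

def can_allocate(in_use, total_entries, fraction):
    """True if one more register may be allocated under the current cap."""
    return in_use < allowed_register_entries(total_entries, fraction)
```

When the cap is reached, rename has to stall until a register is freed, which is how this policy slows the processor down and lets the register file cool.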

Performance of DTM: For workloads with low or moderate miss ratios, CMP always gives the best performance regardless of the DTM technique. For workloads that are memory-bound, SMT always gives better performance.

Performance of DTM: For CMP, register-throttling and fetch-throttling work equally well. For SMT, register-throttling is the best technique, followed by rename-throttling and then global fetch-throttling.

Energy of DTM: Energy consumption is a critical design criterion for battery life and for energy utility costs (e.g. high-performance mobile laptops, and servers designed for throughput-oriented data centers such as the Google cluster architecture). The dominant trend is that global DTM techniques tend to have superior energy efficiency compared with local techniques for most configurations. Because of the global nature of the DTM mechanism, a larger portion of the chip is cooled, resulting in larger savings.

The SMT architecture is superior to the ST architecture for all DTM techniques except rename-throttling.

For CMP: at low L2 miss rates, CMP is always superior to SMT for all DTM configurations.

CONCLUSIONS

Conclusions: Both exhibit similar operating temperatures within current-generation process technologies, but their heating behaviors are different. SMT: heating is caused by localized heating within certain key microarchitectural structures such as the register file, due to increased utilization. CMP: heating is primarily caused by the global impact of increased energy output. CMP machines offer significantly more throughput than SMT machines for CPU-bound applications, and this leads to significant energy-efficiency savings despite a substantial increase in power dissipation.

Conclusions: In an equal-area comparison, the loss of L2 cache size hurts CMP's performance for L2-bound applications. CMP and SMT cores tend to perform better with different DTM techniques. In performance-oriented systems, localized DTM techniques work better for SMT cores and global DTM techniques work better for CMP cores. In energy-oriented systems, the global DVS thermal management technique offers significant energy savings.

REFERENCES

Yingmin Li, K. Skadron, D. Brooks, and Zhigang Hu. Performance, energy, and thermal considerations for SMT and CMP architectures. Dept. of Comput. Sci., Virginia Univ., Charlottesville, VA, USA.

Venkatesan Packirisamy, Yangchun Luo, Wei-lung Hung, Antonia Zhai, and Pen-chung Yew. Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective.

THANK YOU