Current Trends in CMP/CMT Processors Wei Hsu 7/26/2006

Trends in Emerging Systems   Industry and community could previously rely on increased frequency and micro- architecture innovations to steadily improve performance of computers each year   superscalar,   out-of-order issue,   on-chip caching,   deep pipelines supported by sophisticated branch predictors.

Trends in Emerging Systems (cont.)
- Processor designers have found it increasingly difficult to manage:
  - power dissipation
  - chip temperature
  - current swings
  - design complexity
  - decreasing transistor reliability
- These are physics problems, not necessarily innovation problems.

Moore’s Law

Performance Increase of Workstations
- Less than 1.5x every 18 months

The Power Challenge

As long as there is sufficient TLP

MultiCore becomes mainstream
- Commercial examples:
  Company    Chip
  IBM        Power4, Power5, PPC970, Cell
  Sun        UltraSPARC IV, UltraSPARC IV+, UltraSPARC T1
  Intel      Pentium D, Core Duo, Conroe
  AMD        Opteron, Athlon X2, Turion X2
  MS         Xbox 360 (3-core PPC)
  Raza       XLR (8 MIPS cores)
  Broadcom   SiByte (multiple MIPS cores)

Why MultiCore becomes mainstream
- TLP vs. ILP
- Physical limitations have caused serious heat dissipation problems.
- Memory latency continues to limit single-thread performance.
- Designers now push TLP (Thread-Level Parallelism) rather than ILP or higher clock frequency; e.g. the Sun UltraSPARC T1 trades single-thread performance for higher throughput to keep its server market.
- Server workloads are broadly characterized by high TLP, low ILP, and large working sets.

Why MultiCore becomes mainstream
- CMP with a shared cache can reduce the expensive coherence miss penalty: as L2/L3 caches become larger, coherence misses start to dominate performance for server workloads.
- SMP has been used successfully for years, so software is relatively mature for multicore chips.
- New applications tend to have high TLP, e.g. media apps, server apps, games, network processing, etc.
- One alternative to multicore is SoC.

Moore’s Law will continue to provide transistors
- Transistors will be used for more cores, caches, and new features.
- More cores to increase TLP; caches to address memory latency.

CMT (Chip Multi-Threading)
- CMT processors support many simultaneous hardware threads of execution:
  - SMT (Simultaneous Multi-Threading)
  - CMP (i.e. multi-core)
- CMT is about on-chip resource sharing (a topology sketch follows below):
  - SMT: threads share most resources
  - CMP: threads share pins and the bus to memory, and may also share L2/L3 caches
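To make the resource-sharing distinction concrete, here is a small, hedged sketch. It is Linux-specific: the sysfs paths are standard Linux conventions rather than anything from the slides, and reading only the first four logical CPUs is an assumption. Logical CPUs that list each other in thread_siblings_list are SMT strands sharing one core's pipeline; CPUs with different core_id values are separate CMP cores that share only chip-level resources (L2/L3, pins, memory bus).

    // Hedged, Linux-specific sketch: report SMT siblings vs. separate cores.
    #include <fstream>
    #include <iostream>
    #include <string>

    static std::string read_line(const std::string& path) {
        std::ifstream f(path);
        std::string s;
        std::getline(f, s);          // empty string if the file does not exist
        return s;
    }

    int main() {
        for (int cpu = 0; cpu < 4; ++cpu) {   // first few logical CPUs (assumption)
            std::string base =
                "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
            std::string core = read_line(base + "core_id");
            std::string sib  = read_line(base + "thread_siblings_list");
            if (core.empty()) break;          // no such CPU on this machine
            std::cout << "cpu" << cpu << ": core_id=" << core
                      << "  SMT siblings=" << sib << "\n";
        }
        return 0;
    }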

Single Instruction Issue Processors
- Reduced FU utilization due to memory latency, data dependences, or branch mispredictions.

Superscalar Processors
- Superscalar issue leads to higher performance, but lower FU utilization.

SMT (Simultaneous Multi-Threading) Processors
- Maximize FU utilization by issuing operations from two or more threads.
- Example: Pentium 4 Hyper-Threading

Vertical Multi-Threading
- [Figure: a D-cache miss causes stall cycles in a single-threaded pipeline]

Vertical Multi-Threading
- Switch to the 2nd thread on a long-latency event (e.g. an L2 cache miss).
- Example: Montecito uses event-driven MT.

Horizontal MT
- Thread switch occurs on every cycle.
- Example: Sun Niagara (T1) with 4 threads per core.

MT in Niagara T1
- Thread switch occurs on every cycle. The processor issues a single operation per cycle.

MT in Niagara T2
- Thread switch occurs on every cycle. The processor issues two operations per cycle.
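To illustrate the fine-grained (horizontal) switching described in the last few slides, here is a toy model, a sketch only: the 4 threads per core match Niagara T1, but the stall lengths, cycle count, and single-issue assumption are illustrative. Each simulated cycle the core issues from the next ready thread in round-robin order and skips threads stalled on memory.

    // Toy model of fine-grained (horizontal) multithreading with 4 hardware threads.
    #include <array>
    #include <cstdio>

    struct HwThread { int stall_cycles = 0; };     // >0 means waiting on memory

    int main() {
        std::array<HwThread, 4> threads{};         // 4 strands share one pipeline
        threads[1].stall_cycles = 3;               // pretend thread 1 missed in the cache
        int next = 0, issued = 0;

        for (int cycle = 0; cycle < 8; ++cycle) {
            bool picked = false;
            for (int k = 0; k < 4 && !picked; ++k) {   // round-robin search for a ready thread
                int t = (next + k) % 4;
                if (threads[t].stall_cycles == 0) {
                    std::printf("cycle %d: issue from thread %d\n", cycle, t);
                    next = (t + 1) % 4;
                    ++issued;
                    picked = true;
                }
            }
            for (auto& t : threads)                // stalled threads move toward refill
                if (t.stall_cycles > 0) --t.stall_cycles;
        }
        std::printf("issued %d ops in 8 cycles\n", issued);
        return 0;
    }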

CMT Evolution
- The Stanford Hydra CMP project started putting 4 MIPS processors on one chip in 1996.
- The DEC/Compaq Piranha project proposed to include 8 Alpha cores and an L2 cache on a single chip in 2000.
- SUN’s MAJC chip was a dual-core processor with a shared L1 cache, released in 1999.
- IBM’s Power4 is dual-core (2001), and Power5 is also dual-core, with each core 2-way SMT.
- SUN’s Gemini and Jaguar were dual-core processors (2003), Panther (2005) added a shared on-chip L2 cache, and Niagara (T1, 2006) is a 32-way CMT with 8 cores and 4 threads per core.
- Intel Montecito (Itanium 2 follow-on) will have two cores, with two threads per core.

CMT Design Trends
- [Figure: Sun CMT progression – Jaguar (2003), Panther (2005), Niagara T1 (2006)]

Multi-Core Software Support
- Multi-core demands threaded software.
- Importance of threading:
  - Do nothing: the OS is ready, and background jobs can also benefit.
  - Parallelize: unlock the potential (apps, libraries, compiler-generated threads).
- Key challenges:
  - Scalability
  - Correctness
  - Ease of programming

Multi-Core Software Challenges
- Scalability: OpenMP (for an SMP/CMP node), MPI (for clusters), or mixed (a minimal OpenMP sketch follows below).
- Correctness: thread checkers, thread profilers, performance analyzers, and memory-checker tools to simplify the creation and debugging of scalable, thread-safe code.
- Ease of programming: new programming models (e.g. a C++ template-based runtime library to simplify application writing with pre-built, tested algorithms and data structures); the transactional memory concept.
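As a minimal illustration of the OpenMP route to scalability mentioned above (a sketch, assuming a compiler with OpenMP support, e.g. built with -fopenmp; the array size is arbitrary), the reduction clause also sidesteps the shared-counter data race that the correctness tools above are meant to catch:

    // Minimal OpenMP sketch: spread a loop across the hardware threads of a CMP/CMT chip.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0);
        double sum = 0.0;

        // Each thread gets a chunk of the iterations; the reduction gives each
        // thread a private partial sum, avoiding a race on the shared total.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];

        std::printf("threads=%d  sum=%.1f\n", omp_get_max_threads(), sum);
        return 0;
    }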

CMT Optimization Challenges
- Traditional optimization assumes all the resources in a processor can be used.
- Prefetching may take bus bandwidth away from the other core (and the latency may be hidden anyway); see the sketch below.
- Code duplication/specialization may take away shared cache space.
- Speculative execution may take resources away from a second thread.
- Parallelization may reduce total throughput.
- Resource information is often determined at runtime. New policies and mechanisms are needed to maximize total performance.
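For concreteness, here is a sketch of the kind of software prefetching referred to above, using the GCC/Clang intrinsic __builtin_prefetch (the prefetch distance of 16 elements is an illustrative assumption). Each such prefetch consumes bus bandwidth and shared cache space that, on a CMT chip, another core or thread might have used.

    // Sketch of programmer-inserted software prefetching on a sequential scan.
    #include <cstdio>

    static long long sum_with_prefetch(const int* a, long long n) {
        long long s = 0;
        for (long long i = 0; i < n; ++i) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);   // pull a future element toward the cache
            s += a[i];
        }
        return s;
    }

    int main() {
        static int a[1 << 20];
        for (int i = 0; i < (1 << 20); ++i) a[i] = 1;
        std::printf("%lld\n", sum_with_prefetch(a, 1 << 20));
        return 0;
    }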

CMT Optimization Challenges
- I-cache optimization issues: in single-thread execution, I-cache misses often come from conflicts between procedures; in multi-threaded execution, the conflicts may come from different threads.
- Thread scheduling issues: should two threads be scheduled on two separate cores, or on the same core with SMT? Schedule for performance or schedule for power? (balanced vs. unbalanced scheduling) A Linux affinity sketch follows below.
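The scheduling question above can be explored experimentally. The following Linux-specific sketch pins two threads to chosen logical CPUs with pthread_setaffinity_np; whether two CPU numbers mean two separate cores or two SMT strands of the same core depends entirely on the machine, so the numbers 0 and 1 below are assumptions to be replaced after checking the topology.

    // Hedged, Linux-specific sketch: pin two threads to chosen logical CPUs.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    static void pin_to_cpu(std::thread& t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // Bind the thread to one logical CPU; the OS keeps it there afterwards.
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    static void work() {                       // stand-in compute-bound kernel
        volatile double x = 0.0;
        for (int i = 0; i < 100000000; ++i) x += 1.0;
    }

    int main() {
        std::thread t1(work), t2(work);
        pin_to_cpu(t1, 0);                     // assume CPUs 0 and 1 sit on different
        pin_to_cpu(t2, 1);                     // cores; for the SMT experiment, pick
                                               // two sibling CPUs of one core instead
        t1.join();
        t2.join();
        return 0;
    }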

Some Emerging Issues
- New low-power, high-performance cores: current cores reuse the design from the previous generation. This cannot last long, since supply-voltage scaling is no longer sufficient to meet the power requirement. New designs are called for to get low-power, high-performance cores.
- Off-chip bandwidth: how can we keep up with the demand for off-chip bandwidth (doubling every generation)? We cannot rely on increasing pin count (about 10% per generation), so bandwidth per pin must increase (roughly 2/1.1 ≈ 1.8x per generation to keep pace).

Some Emerging Issues (cont.)
- Homogeneous or heterogeneous cores: for workloads with sufficient TLP, multiple simple cores can deliver superior performance. However, how to deliver robust performance for single-thread jobs? A complex core plus many simple cores?
- Shared hardware accelerators:
  - network offload engines
  - cryptographic engines
  - XML parsing or processing?
  - FFT accelerator

New Research Opportunities with CMP/CMT
- Speculative threads
  - Use thread-level control speculation and runtime data-dependence checking to speed up single-program execution.
  - Recent studies have shown about 20% speedup potential from loop-level thread speculation on sequential code.
- Helper threads (a sketch follows below)
  - Use otherwise idle cores to run dynamic optimization threads, performance monitoring (or profiling) threads, or scout threads.
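A sketch of the helper-thread idea above, under stated assumptions: a scout thread on an otherwise idle core runs a fixed distance ahead of the main thread and touches data so it is warm in the shared cache. The 256-element run-ahead distance and the shared progress counter are illustrative, not a description of any shipped mechanism.

    // Helper/scout-thread sketch: prefetch ahead of a main-thread scan.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;
        std::vector<int> data(n, 1);
        std::atomic<size_t> main_pos{0};
        std::atomic<bool> done{false};

        // Scout thread: keep touching data a fixed distance ahead of the main thread.
        std::thread helper([&] {
            while (!done.load(std::memory_order_relaxed)) {
                size_t ahead = main_pos.load(std::memory_order_relaxed) + 256;
                if (ahead < n)
                    __builtin_prefetch(&data[ahead]);   // GCC/Clang intrinsic
            }
        });

        long long sum = 0;
        for (size_t i = 0; i < n; ++i) {                // main thread: the real work
            sum += data[i];
            main_pos.store(i, std::memory_order_relaxed);
        }
        done = true;
        helper.join();
        std::printf("sum=%lld\n", sum);
        return 0;
    }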

New Research Opportunities with CMP/CMT
- Monitoring threads
  - Monitoring threads can run on other cores to enforce correct execution of the main thread.
  - The main thread turns itself into a speculative thread until the monitoring thread verifies that the execution meets the requirements. If verification fails, the speculative execution aborts.

New Research Opportunities (Transient Fault Detection/Tolerance)
- Software redundant multi-threading (see the sketch below)
  - Use software-controlled redundancy to detect and tolerate transient faults. Optimizations are critical to minimize communication and synchronization.
  - Redundant threads run on multiple cores; this is different from SMT, where one error may corrupt both threads.
- Process-level redundancy
  - Check only at system calls to intercept faults that propagate to the output.
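A sketch of the software redundant multi-threading idea above: the same pure computation runs in two threads, which the OS (or an explicit affinity call) would ideally place on two different cores so that a single transient fault cannot corrupt both copies, and the two results are compared before anything reaches the output. Only the detection half is shown; recovery is reduced to an error exit.

    // Redundant-execution sketch: run the computation twice and compare.
    #include <cstdio>
    #include <cstdlib>
    #include <thread>

    static long long compute(long long n) {        // the function being protected
        long long s = 0;
        for (long long i = 1; i <= n; ++i) s += i * i;
        return s;
    }

    int main() {
        long long r1 = 0, r2 = 0;
        std::thread a([&] { r1 = compute(1000000); });
        std::thread b([&] { r2 = compute(1000000); });
        a.join();
        b.join();

        if (r1 != r2) {                            // copies disagree: a fault slipped in
            std::fprintf(stderr, "transient fault detected, re-execute or abort\n");
            return EXIT_FAILURE;
        }
        std::printf("result=%lld\n", r1);          // only checked results are committed
        return 0;
    }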

New Research Opportunities
- For software debugging: run a different path on the other core to increase path coverage.

Future of CMP/CMT
- Some companies already have 128- or 256-core CMPs on their roadmaps. It is hard to predict what will happen. High-end servers may be addressed by large-scale CMP, but the desktop and embedded markets may not be (perhaps small- or medium-scale CMP would be sufficient).
- Today’s architectures are more likely to be driven by the software market than by hardware vendors. Itanium is one example: even with Intel and HP behind it, it has not been very successful. A successful product sells itself.