Hyper-Threading, Chip Multiprocessors, and Both
Zoran Jovanovic


To Be Tackled in Multithreading
- Review of Threading Algorithms
- Hyper-Threading Concepts
- Hyper-Threading Architecture
- Advantages/Disadvantages

Threading Algorithms
- Time-slicing (fine grain)
  - A processor switches between threads at fixed time intervals.
  - High overhead, especially if one of the processes is in the wait state.
- Switch-on-event (coarse grain)
  - Task switching in case of long pauses.
  - While one process waits for data from a relatively slow source, CPU resources are given to other processes.

Threading Algorithms (cont.)
- Multiprocessing
  - Distribute the load over many processors.
  - Adds extra cost.
- Simultaneous multi-threading
  - Multiple threads execute on a single processor without switching.
  - Basis of Intel's Hyper-Threading technology.

Hyper-Threading Concept
- At any point in time, only part of the processor's resources is used to execute a thread's program code.
- Unused resources can be put to work, for example by executing another thread or application in parallel.
- Extremely useful in desktop and server applications where many threads are used.

Quick Recall: Many Resources IDLE!
[Figure: issue-slot utilization for an 8-way superscalar]
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
Slide source: John Kubiatowicz

[Figure: four panels]
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)

Simultaneous Multithreading (SMT)
Example: new Pentium with “Hyperthreading”
Key idea: exploit ILP across multiple threads!
  - i.e., convert thread-level parallelism into more ILP
  - exploit the following features of modern processors:
    - multiple functional units: modern processors typically have more functional units available than a single thread can utilize
    - register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute!
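The difference between running one thread, interleaving threads cycle by cycle, and SMT can be illustrated with a toy issue-slot count in Python. This is only a sketch under stated assumptions: the issue width, the per-cycle "ready instruction" counts, and all function names are hypothetical, and instructions that do not issue in a cycle are simply dropped rather than carried over, which a real pipeline would not do.

```python
ISSUE_WIDTH = 4  # hypothetical issue width (slots per cycle)

# Hypothetical data: how many instructions each thread could issue per cycle.
READY = {
    "A": [2, 0, 0, 3, 1, 0],
    "B": [1, 2, 2, 0, 0, 1],
}


def single_thread(ready, name):
    """Slots filled when only one thread runs on the superscalar core."""
    return sum(min(n, ISSUE_WIDTH) for n in ready[name])


def fine_grain(ready):
    """One thread per cycle, round robin, skipping threads with nothing ready."""
    names = list(ready)
    cycles = len(ready[names[0]])
    issued, start = 0, 0
    for c in range(cycles):
        for k in range(len(names)):
            idx = (start + k) % len(names)
            n = ready[names[idx]][c]
            if n > 0:
                issued += min(n, ISSUE_WIDTH)
                start = idx + 1  # next cycle starts with the following thread
                break
    return issued


def smt(ready):
    """All threads may fill slots in the same cycle, up to the issue width."""
    cycles = len(next(iter(ready.values())))
    return sum(
        min(sum(trace[c] for trace in ready.values()), ISSUE_WIDTH)
        for c in range(cycles)
    )


if __name__ == "__main__":
    total = ISSUE_WIDTH * len(READY["A"])
    print("single thread A:", single_thread(READY, "A"), "of", total)
    print("fine-grain     :", fine_grain(READY), "of", total)
    print("SMT            :", smt(READY), "of", total)
```

With this made-up data, fine-grain interleaving recovers the cycles in which a thread has nothing to issue, while SMT additionally fills leftover slots within a cycle.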

Hyper-Threading Architecture
- First used in the Intel Xeon MP processor.
- Makes a single physical processor appear as multiple logical processors.
- Each logical processor has a copy of the architecture state.
- Logical processors share a single set of physical execution resources.

Hyper-Threading Architecture
- Operating systems and user programs can schedule processes or threads to logical processors as if they were in a multiprocessing system with physical processors.
- From an architecture perspective we have to worry about the logical processors using shared resources.
  - Caches, execution units, branch predictors, control logic, and buses.
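Because logical processors are presented to the operating system like physical ones, a program normally sees only the logical count. A minimal Python sketch (the function name is illustrative; `os.cpu_count()` is the standard-library call and reports logical CPUs, not physical cores):

```python
import os


def logical_processor_count() -> int:
    """Logical processors reported by the OS; falls back to 1 if unknown."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("logical processors visible to the OS:", logical_processor_count())
```

On a machine with Hyper-Threading enabled, this number is typically larger than the number of physical cores; the standard library alone does not tell the two apart.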

Power5 dataflow...
- Why only two threads?
  - With four, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.
- Cost:
  - The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.

Advantages
- Extra architecture only adds about 5% to the total die area.
- No performance loss if only one thread is active.
- Increased performance with multiple threads.
- Better resource utilization.

Disadvantages
- To take advantage of hyper-threading performance, serial execution cannot be used.
  - Threads are non-deterministic and involve extra design effort.
  - Threads have increased overhead.
- Shared resource conflicts.

Multicore
Multiprocessors on a single chip

Basic Shared Memory Architecture
- Processors all connected to a large shared memory
  - Where are caches?
- Now take a closer look at structure, costs, limits, programming
[Figure: processors P1, P2, ..., Pn connected through an interconnect to memory]
Slide source: CS267 Lecture 6

What About Caching???
- Want high performance for shared memory: use caches!
  - Each processor has its own cache (or multiple caches)
  - Place data from memory into cache
  - Write-back cache: don't send all writes over the bus to memory
- Caches reduce average latency
  - Automatic replication closer to the processor
  - More important to a multiprocessor than a uniprocessor: latencies are longer
- Normal uniprocessor mechanisms to access data
  - Loads and stores form a very low-overhead communication primitive
- Problem: cache coherence!
[Figure: processors P1 ... Pn, each with a cache ($), on a bus shared with memory and I/O devices]
Slide source: John Kubiatowicz

Example Cache Coherence Problem
[Figure: processors P1, P2, P3, each with a cache ($), on a bus shared with memory and I/O devices; memory holds u = 5; numbered events 1 to 5 access u, event 3 writes u = 7, and the later reads are marked u = ?]
- Things to note:
  - Processors could see different values for u after event 3
  - With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
- How to fix with a bus: coherence protocol
  - Use the bus to broadcast writes or invalidations
  - Simple protocols rely on the presence of a broadcast medium
- Bus not scalable beyond about 64 processors (max)
  - Capacity, bandwidth limitations
Slide source: John Kubiatowicz
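The problem and the broadcast fix can be sketched with a toy Python model. Everything here is an assumption made for illustration: the class and function names are hypothetical, the event order (P1 reads u, P3 reads u, P3 writes u = 7, then P1 and P2 read u) is one reading of the garbled figure, and the "coherent" variant uses a simple write-through-plus-invalidate scheme rather than any particular real protocol.

```python
class Cache:
    """Toy per-processor cache in front of a shared memory dict."""

    def __init__(self, memory, bus=None):
        self.memory = memory
        self.lines = {}       # addr -> cached value
        self.bus = bus        # list of all caches on the bus, or None
        if bus is not None:
            bus.append(self)

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.bus is None:
            return  # write-back, no coherence: memory and other caches stay stale
        # Assumed fix: write through to memory and broadcast an invalidation.
        self.memory[addr] = value
        for other in self.bus:
            if other is not self:
                other.lines.pop(addr, None)


def run_events(coherent):
    """Replay the assumed event sequence; return what P1 and P2 finally read."""
    memory = {"u": 5}
    bus = [] if coherent else None
    p1, p2, p3 = (Cache(memory, bus) for _ in range(3))
    p1.read("u")        # event 1
    p3.read("u")        # event 2
    p3.write("u", 7)    # event 3
    return p1.read("u"), p2.read("u")   # events 4 and 5


if __name__ == "__main__":
    print("without coherence:", run_events(coherent=False))
    print("with invalidation:", run_events(coherent=True))
```

Without coherence, P1 hits its stale cached copy and P2 fetches the stale value from memory; with the broadcast, both read the value P3 wrote.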

Limits of Bus-Based Shared Memory
[Figure: processors (PROC), each with a cache, on a shared bus with memory (MEM) and I/O; figure labels: 5.2 GB/s and 140 MB/s]
- Assume a 1 GHz processor without cache
  => 4 GB/s instruction bandwidth per processor (32-bit)
  => 1.2 GB/s data bandwidth at 30% load-store
- Suppose a 98% instruction hit rate and a 95% data hit rate
  => 80 MB/s instruction bandwidth per processor
  => 60 MB/s data bandwidth per processor
  => 140 MB/s combined bandwidth
- Assuming 1 GB/s bus bandwidth
  => 8 processors will saturate the bus
Slide source: CS267 Lecture 6
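The arithmetic on this slide can be reproduced with a short Python sketch. The function and parameter names are illustrative, and the rate of one instruction per cycle and the 4-byte data word are assumptions implied by the slide's numbers rather than stated on it.

```python
import math


def bus_demand(clock_hz=1e9, instr_bytes=4, data_bytes=4,
               load_store_frac=0.30, inst_hit=0.98, data_hit=0.95,
               bus_bw=1e9):
    """Per-processor bus traffic (bytes/s) and processors needed to saturate the bus."""
    # Assumption: one instruction fetched per cycle.
    inst_bw = clock_hz * instr_bytes                    # 4 GB/s without a cache
    data_bw = clock_hz * load_store_frac * data_bytes   # 1.2 GB/s without a cache
    # Only cache misses reach the bus.
    inst_miss_bw = inst_bw * (1 - inst_hit)             # about 80 MB/s
    data_miss_bw = data_bw * (1 - data_hit)             # about 60 MB/s
    combined = inst_miss_bw + data_miss_bw              # about 140 MB/s
    return {
        "uncached": inst_bw + data_bw,
        "combined": combined,
        "processors_to_saturate": math.ceil(bus_bw / combined),
    }


if __name__ == "__main__":
    result = bus_demand()
    print(f"demand without cache: {result['uncached'] / 1e9:.1f} GB/s")
    print(f"bus traffic per CPU : {result['combined'] / 1e6:.0f} MB/s")
    print("processors to saturate a 1 GB/s bus:", result["processors_to_saturate"])
```

A 1 GB/s bus carries a little over seven such processors, so the eighth saturates it.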

Cache Organizations for Multi-cores
- L1 caches are always private to a core
- L2 caches can be private or shared
- Advantages of a shared L2 cache:
  - efficient dynamic allocation of space to each core
  - data shared by multiple cores is not replicated
  - every block has a fixed “home”, hence it is easy to find the latest copy
- Advantages of a private L2 cache:
  - quick access to private L2, good for small working sets
  - private bus to private L2, so less contention

A Reminder: SMT (Simultaneous Multi Threading) SMT vs. CMP

A Single Chip Multiprocessor
L. Hammond et al. (Stanford), IEEE Computer 97
For the same area (the area of a billion-transistor DRAM), three designs are considered: superscalar (SS), SMT, and CMP.
Superscalar and SMT are very complex:
- Wide
- Advanced branch prediction
- Register renaming
- OOO instruction issue
- Non-blocking data caches

SS and SMT vs. CMP: CPU Cores
Three main hardware design problems (of SS and SMT):
- Area increases quadratically with core complexity
  - Number of registers: O(instruction window size)
  - Register ports: O(issue width)
  - CMP solves this problem (area roughly linear in issue width)
- Longer cycle times
  - Long wires, many MUXes and crossbars
  - Large buffers, queues, and register files
  - Clustering (decreases ILP) or deep pipelining (branch misprediction penalties)
  - CMP allows a short cycle time (with little effort): cores are small and fast, but rely on software to schedule (poor ILP)
- Complex design and verification

SS and SMT vs. CMP: Memory
- A 12-issue SS or SMT requires a multiported data cache (4-6 ports): 2 x 128 Kbyte (2-cycle latency)
- CMP: 16 x 16 Kbyte (single-cycle latency), but the secondary cache is slower (multiported)
- Shared memory: write-through caches
[Figure: SMT and CMP memory organizations]

Performance Comparison
- Compress (integer applications): low ILP and no TLP
- Mpeg-2 (multimedia applications): high ILP and TLP, moderate memory requirement (parallelized by hand)
  + SMT utilizes core resources better
  + But CMP has 16 issue slots instead of 12
- Tomcatv (FP applications): large loop-level parallelism and large memory bandwidth (TLP by compiler)
  + CMP has large memory bandwidth on the primary cache
  - SMT's fundamental problem: unified and slow cache
- Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)

CMP Motivation
- How to utilize available silicon?
  - Speculation (aggressive superscalar)
  - Simultaneous Multithreading (SMT, Hyperthreading)
  - Several processors on a single chip
- What is a CMP (Chip MultiProcessor)?
  - Several processors (several masters)
  - Both shared and distributed memory architectures
  - Both homogeneous and heterogeneous processor types
- Why?
  - Wire delays
  - Diminishing returns from uniprocessors
  - Very long design and verification times for modern processors

A Single Chip Multiprocessor
L. Hammond et al. (Stanford), IEEE Computer 97
- TLP and PLP become widespread in future applications, which favours CMP
  - Various multimedia applications
  - Compilers and OS
- CMP: better performance with simple hardware
  - Higher clock rates, better memory bandwidth
  - Shorter pipelines
- SMT has better utilization, but CMP has more resources (no wide-issue logic)
- Although CMP is bad when there is no TLP and ILP (compress), SMT and SS are not much better

A Reminder: SMT (Simultaneous Multi Threading)
SMT:
- Pool of execution units (wide machine)
- Several logical processors
- Copy of state for each
- Multiple threads are running concurrently
- Better utilization and latency tolerance
CMP:
- Simple cores
- Moderate amount of parallelism
- Threads are running concurrently on different cores

SMT Dual-core: all four threads can run concurrently
[Figure: two identical cores, each with BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, and bus; each core runs two hardware threads (Thread 1 to Thread 4)]
