1 Pipelining for Multi-Core Architectures

2 Multi-Core Technology
Timeline of processor generations:
–2004: single core + cache
–2005: dual core (2 or more cores + cache)
–2007: multi-core (4 or more cores + cache)
The trend: roughly 2X more cores with each generation.

3 Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits bring:
–heat problems
–clock problems
–efficiency (stall) problems
Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; it would require the processor to:
–issue 3 or 4 data memory accesses per cycle,
–rename and access more than 20 registers per cycle, and
–fetch 12 to 24 instructions per cycle.
Many new applications are multithreaded
General trend in computer architecture: a shift toward more parallelism

4 Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can reorder instructions, pipeline them, split them into micro-instructions, do aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
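
To make the idea concrete, here is a minimal C++ sketch (not from the original slides) contrasting code the hardware can overlap with a serial dependence chain:

```cpp
// Independent operations: a superscalar core can issue the two adds in
// the same cycle, since neither depends on the other's result.
int ilp_friendly(int a, int b, int c, int d) {
    int x = a + b;
    int y = c + d;   // independent of x: can execute in parallel
    return x * y;    // only the multiply waits on both
}

// Dependent chain: each add consumes the previous result, so the
// hardware must execute the operations one after another.
int serial_chain(int a, int b, int c, int d) {
    int x = a + b;
    x = x + c;       // depends on x
    x = x + d;       // depends on the new x
    return x;
}
```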

5 Thread-level parallelism (TLP)
This is parallelism at a coarser scale
A server can serve each client in a separate thread (Web server, database server)
A computer game can do AI, graphics, and sound in three separate threads
Single-core superscalar processors cannot fully exploit TLP
Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
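
The game example from the slide might look like the following C++ sketch; the subsystem functions are hypothetical stubs, not part of the original material:

```cpp
#include <cstdio>
#include <thread>

// Hypothetical subsystem stubs standing in for real game code.
void run_ai()       { std::puts("AI tick"); }
void run_graphics() { std::puts("render frame"); }
void run_sound()    { std::puts("mix audio"); }

int main() {
    // One coarse-grained task per thread; on a multi-core or SMT
    // processor all three can make progress at the same time.
    std::thread ai(run_ai);
    std::thread gfx(run_graphics);
    std::thread snd(run_sound);
    ai.join();
    gfx.join();
    snd.join();
}
```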

6 What applications benefit from multi-core?
Database servers
Web servers (Web commerce)
Multimedia applications
Scientific applications, CAD/CAM
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
Each thread can run on its own core

7 More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
"Anything that can be threaded today will map efficiently to multi-core"
BUT: some applications are difficult to parallelize

8 Core 2 Duo Microarchitecture

9 Without SMT, only a single thread can run at any given time
[Pipeline diagram of one core: BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, L2 cache and control, bus. Thread 1 (floating point) occupies the pipeline alone.]

10 Without SMT, only a single thread can run at any given time
[Same pipeline diagram, now occupied only by Thread 2 (integer operation).]

11 SMT processor: both threads can run concurrently
[Same pipeline diagram, with Thread 1 (floating point) and Thread 2 (integer operation) sharing the core's resources.]

12 But: can't simultaneously use the same functional unit
[Same pipeline diagram, with Thread 1 and Thread 2 both trying to use the integer unit at once (marked IMPOSSIBLE).]
This scenario is impossible with SMT on a single core (assuming a single integer unit)

13 Multi-core: threads can run on separate cores
[Two copies of the pipeline diagram, one per core: Thread 1 runs on the first core, Thread 2 on the second.]

14 Multi-core: threads can run on separate cores
[The same two-core diagram: Thread 3 runs on the first core, Thread 4 on the second.]
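
Operating systems normally spread runnable threads across cores automatically, but the placement shown above can also be requested explicitly. A minimal, Linux-only sketch using the pthread affinity API (this API is an aside, not part of the slides):

```cpp
#include <pthread.h>   // pthread_setaffinity_np (Linux-specific)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>

// Pin the calling thread to one CPU, then do its work.
void worker(int cpu, const char* name) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    std::printf("%s running on core %d\n", name, cpu);
}

int main() {
    std::thread t1(worker, 0, "thread 1");  // thread 1 -> core 0
    std::thread t2(worker, 1, "thread 2");  // thread 2 -> core 1
    t1.join();
    t2.join();
}
```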

15 Combining Multi-core and SMT
Cores can be SMT-enabled (or not)
The different combinations:
–Single-core, non-SMT: standard uniprocessor
–Single-core, with SMT
–Multi-core, non-SMT
–Multi-core, with SMT: our fish machines
The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
Intel calls them "hyper-threads"

16 SMT Dual-core: all four threads can run concurrently
[Two-core diagram, each core SMT-enabled: the four threads run two per core, sharing each core's resources.]
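
From software, the total number of hardware threads can be queried portably; on a dual-core chip with 2-way SMT the call below typically reports 4 (it may report 0 if the count is unknown):

```cpp
#include <iostream>
#include <thread>

int main() {
    // Number of hardware threads the OS exposes, e.g. cores x SMT ways.
    std::cout << std::thread::hardware_concurrency()
              << " hardware threads\n";
}
```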

17 Multi-core and cache coherence
[Diagram: two dual-core cache designs. In the first, each core has private L1 and L2 caches in front of memory; examples: AMD Opteron, AMD Athlon, Intel Pentium D. The second adds an L3 cache between the private caches and memory; example: Intel Itanium 2.]

18 The cache coherence problem
Since we have private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores

19 The cache coherence problem
Suppose variable x initially contains 15213
[Diagram: a multi-core chip with four cores, each with one or more levels of cache, above a shared main memory holding x=15213.]

20 The cache coherence problem
Core 1 reads x
[Core 1's cache now holds x=15213; main memory still holds x=15213.]

21 The cache coherence problem
Core 2 reads x
[Core 1's and core 2's caches each hold x=15213.]

22 The cache coherence problem
Core 1 writes to x, setting it to 21660 (assuming write-through caches)
[Core 1's cache and main memory now hold x=21660, but core 2's cache still holds x=15213.]

23 The cache coherence problem
Core 2 attempts to read x… and gets a stale copy
[Core 2's cache returns x=15213, even though core 1's cache and main memory hold x=21660.]
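
In C++ terms, the same scenario is a data race unless the program adds synchronization; the hardware's coherence protocol plus an atomic variable guarantee that a reader sees a coherent value. A minimal sketch (the variable and values mirror the example above; nothing else is from the slides):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{15213};   // the shared variable from the example

int main() {
    std::thread core1([] {
        x.store(21660);      // "core 1 writes to x"
    });
    std::thread core2([] {
        // Depending on timing this prints 15213 or 21660, but never a
        // torn or incoherent value; once 21660 becomes visible, no core
        // can read the stale 15213 again.
        std::printf("core 2 read x = %d\n", x.load());
    });
    core1.join();
    core2.join();
}
```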

24 The Memory Wall Problem

25 Memory Wall
[Chart, 1980-2000: processor performance ("Moore's Law") improves ~60%/year (2X every 1.5 years), while DRAM performance improves ~9%/year (2X every 10 years). The processor-memory performance gap has grown about 50% per year since roughly 1982.]

26 Latency in a Single PC
[Chart: memory access time versus CPU cycle time over the years; their steadily growing ratio is labeled "THE WALL".]

27 Pentium 4 cache hierarchy
Load latencies, from the processor outward:
–L1 I (12Ki-uop trace cache) and L1 D (8 KiB): 2 cycles
–L2 cache (512 KiB): 19 cycles
–L3 cache (2 MiB): 43 cycles
–Memory: 206 cycles
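
Latency ladders like this are typically measured with a pointer-chasing microbenchmark: chase a random cycle through a buffer sized to fit in each cache level, so every load depends on the previous one. A rough C++ sketch (the buffer sizes are illustrative picks matching a hierarchy like the one above):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Pointer chasing: each load's address depends on the previous load's
// result, so accesses cannot overlap and we measure latency, not bandwidth.
double ns_per_access(std::size_t bytes) {
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's algorithm builds a random single-cycle permutation, so
    // the chase visits every slot and defeats the hardware prefetcher.
    std::mt19937_64 rng{42};
    for (std::size_t k = n - 1; k > 0; --k)
        std::swap(next[k], next[rng() % k]);

    const std::size_t steps = 20'000'000;
    std::size_t i = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    const auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = i;   // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    for (std::size_t kib : {8, 512, 2048, 65536})   // ~L1 D, L2, L3, memory
        std::printf("%6zu KiB: %5.1f ns/access\n", kib, ns_per_access(kib * 1024));
}
```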

28 Technology Trends

          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years

DRAM Generations
Year    Size       Cycle Time
1980    64 Kb      250 ns
1983    256 Kb     220 ns
1986    1 Mb       190 ns
1989    4 Mb       165 ns
1992    16 Mb      120 ns
1996    64 Mb      110 ns
1998    128 Mb     100 ns
2000    256 Mb     90 ns
2002    512 Mb     80 ns
2006    1024 Mb    60 ns
Over these generations, capacity improved about 16000:1 but latency only about 4:1.

29 Processor-DRAM Performance Gap Impact: Example
To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. The minimum cost of a full memory access, in wasted CPU cycles (or instructions), is the memory access time divided by the CPU cycle time, minus 1:

Year    CPU speed   CPU cycle   Memory access   Minimum CPU cycles
        (MHz)       (ns)        (ns)            (instructions) wasted
1986    8           125         190             190/125 - 1 = 0.5
1989    33          30          165             165/30 - 1 = 4.5
1992    60          16.6        120             120/16.6 - 1 = 6.2
1996    200         5           110             110/5 - 1 = 21
1998    300         3.33        100             100/3.33 - 1 = 29
2000    1000        1           90              90/1 - 1 = 89
2003    2000        0.5         80              80/0.5 - 1 = 159

30 Main Memory
Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh (~every 8 ms).
Cache uses SRAM (Static Random Access Memory):
–No refresh needed (6 transistors/bit vs. 1 transistor/bit for DRAM)
–Size: DRAM/SRAM is roughly 4-8; cost and cycle time: SRAM/DRAM is roughly 8-16
Main memory performance:
–Memory latency:
  Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU.
  Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to stabilize).
–Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU.

31 Architects Use Transistors to Tolerate Slow Memory
Cache:
–Small, fast memory
–Holds information (expected) to be used soon
–Mostly successful
Apply recursively:
–Level-one cache(s)
–Level-two cache
Most of a microprocessor's die area is cache!
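
"Used soon" is a bet on locality, which is why access order matters so much in practice. A small C++ sketch of the effect (illustrative only, not from the slides): both functions compute the same sum over a row-major matrix, but the first walks memory in the order it is laid out, while the second strides across it and, for large matrices, misses far more often:

```cpp
#include <vector>

// Row-by-row traversal touches consecutive addresses, so after the
// first access to each cache line the remaining accesses are hits.
long sum_row_major(const std::vector<int>& m, int rows, int cols) {
    long s = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Column-by-column traversal strides by `cols` elements per access;
// once the matrix outgrows the cache, nearly every access can miss.
long sum_col_major(const std::vector<int>& m, int rows, int cols) {
    long s = 0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```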

