
1 Lecture 16: Cache Innovations / Case Studies
Topics: prefetching, blocking, processor case studies (Section 5.2)

2 Prefetching

 Hardware prefetching can be employed for any of the cache levels.
 It can introduce cache pollution, so prefetched data is often placed in a separate prefetch buffer; this buffer must be looked up in parallel with the cache access.
 Aggressive prefetching increases "coverage" but reduces "accuracy", wasting memory bandwidth.
 Prefetches must be timely: issued sufficiently in advance to hide the memory latency, but not too early (to avoid pollution and eviction before use).
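The slide describes hardware prefetching, which is not programmable, but the same timeliness trade-off can be sketched in software. A minimal sketch using GCC/Clang's __builtin_prefetch; the prefetch distance PF_DIST is an assumption for illustration, not a value from the lecture:

    #include <stddef.h>

    /* Sum an array while prefetching ahead. PF_DIST is a hypothetical
       prefetch distance: large enough to hide memory latency, small
       enough that prefetched lines are not evicted before use (the
       "timeliness" trade-off from the slide). */
    #define PF_DIST 16

    long sum(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 1); /* read, low temporal locality */
            s += a[i];
        }
        return s;
    }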

3 Stream Buffers

 Simplest form of prefetch: on every miss, bring in multiple sequential cache lines.
 When you read the top of the stream buffer's queue, bring in the next line.
(Figure: an L1 cache backed by a stream buffer holding sequential lines.)
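A minimal simulation sketch of a single stream buffer; the buffer depth and line size are assumptions for illustration, not details from the lecture:

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_DEPTH   4    /* assumed buffer depth */
    #define LINE_SHIFT 6    /* assumed 64-byte lines */

    typedef struct {
        uint64_t lines[SB_DEPTH]; /* circular FIFO of sequential line addresses */
        int head;                 /* index of the oldest (next expected) line */
        bool valid;
    } StreamBuffer;

    /* On an L1 miss, check the head of the stream buffer. On a buffer hit,
       pop the head and fetch the next sequential line into the freed slot;
       on a buffer miss, restart the buffer at the missing line. */
    bool sb_lookup(StreamBuffer *sb, uint64_t addr) {
        uint64_t line = addr >> LINE_SHIFT;
        if (sb->valid && sb->lines[sb->head] == line) {
            uint64_t newest = sb->lines[(sb->head + SB_DEPTH - 1) % SB_DEPTH];
            sb->lines[sb->head] = newest + 1;    /* reuse the head's slot as the tail */
            sb->head = (sb->head + 1) % SB_DEPTH;
            return true;                         /* line supplied by the buffer */
        }
        for (int i = 0; i < SB_DEPTH; i++)       /* restart: refill with line+1.. */
            sb->lines[i] = line + 1 + i;
        sb->head = 0;
        sb->valid = true;
        return false;                            /* miss goes to the next level */
    }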

4 Stride-Based Prefetching

 For each load (indexed by its PC), keep track of the last address accessed by the load and a possibly consistent stride; each table entry holds a tag, prev_addr, stride, and state.
 An FSM detects a consistent stride and issues prefetches. Its states are init, transient, steady, and no-prediction; correct predictions move toward steady, while incorrect ones (which update the stride) move toward no-prediction.
(Figure: the FSM state diagram and the per-PC prediction table.)
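A sketch of the per-PC table update, loosely following Chen and Baer's reference prediction table; the table size, indexing, and exact transition details are simplifications:

    #include <stdint.h>

    typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } State;

    typedef struct {
        uint64_t tag;       /* load PC */
        uint64_t prev_addr; /* last address touched by this load */
        int64_t  stride;    /* last observed stride */
        State    state;
    } RptEntry;

    #define RPT_SIZE 256
    RptEntry rpt[RPT_SIZE];

    /* Update the entry for this load; return nonzero if a prefetch of
       addr + stride should be issued (only in the steady state). */
    int rpt_update(uint64_t pc, uint64_t addr) {
        RptEntry *e = &rpt[(pc >> 2) % RPT_SIZE];
        if (e->tag != pc) {                       /* new load: (re)initialize */
            e->tag = pc;
            e->prev_addr = addr;
            e->stride = 0;
            e->state = INIT;
            return 0;
        }
        int correct = (int64_t)(addr - e->prev_addr) == e->stride;
        State old = e->state;
        switch (old) {
        case INIT:      e->state = correct ? STEADY : TRANSIENT;  break;
        case TRANSIENT: e->state = correct ? STEADY : NO_PRED;    break;
        case STEADY:    e->state = correct ? STEADY : INIT;       break;
        case NO_PRED:   e->state = correct ? TRANSIENT : NO_PRED; break;
        }
        if (!correct && old != STEADY)            /* the "(update stride)" arcs */
            e->stride = (int64_t)(addr - e->prev_addr);
        e->prev_addr = addr;
        return e->state == STEADY;
    }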

5 Compiler Optimizations

 Loop interchange: loops can be re-ordered to exploit spatial locality.

    for (j=0; j<100; j++)
      for (i=0; i<5000; i++)
        x[i][j] = 2 * x[i][j];

 is converted to…

    for (i=0; i<5000; i++)
      for (j=0; j<100; j++)
        x[i][j] = 2 * x[i][j];

 With row-major storage, x[i][j] and x[i][j+1] are adjacent in memory, so making j the inner loop walks each cache line sequentially instead of striding across rows.

6 Exercise

 Re-organize data accesses so that a piece of data is used a number of times before moving on… in other words, artificially create temporal locality.

Original (matrix multiply):

    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
        r = 0;
        for (k=0; k<N; k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = r;
      }

Blocked version (block size B):

    for (jj=0; jj<N; jj+=B)
      for (kk=0; kk<N; kk+=B)
        for (i=0; i<N; i++)
          for (j=jj; j<min(jj+B,N); j++) {
            r = 0;
            for (k=kk; k<min(kk+B,N); k++)
              r = r + y[i][k] * z[k][j];
            x[i][j] = x[i][j] + r;
          }

(Figure: access patterns over the x, y, and z arrays for both versions.)

7 Exercise

 The blocked code from the previous slide, with its memory access patterns:
(Figure: successive snapshots of the x, y, and z access footprints as the blocked loops over jj and kk advance.)

8 Exercise

 The original code (slide 6) could have 2N³ + N² memory accesses: the N³ innermost iterations each read y[i][k] and z[k][j], plus N² writes of x. The blocked version has roughly 2N³/B + N², since each fetched element of y and z can be reused about B times before the loops move on.
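A self-contained version of both loops for experimentation; N, B, and the timing harness are choices of this sketch, not values from the lecture:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 512
    #define B 64            /* block size; tune so a block fits in the L1/L2 cache */
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    static double x[N][N], y[N][N], z[N][N];

    static void matmul_naive(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double r = 0;
                for (int k = 0; k < N; k++)
                    r += y[i][k] * z[k][j];
                x[i][j] = r;
            }
    }

    static void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < MIN(jj + B, N); j++) {
                        double r = 0;
                        for (int k = kk; k < MIN(kk + B, N); k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;
                    }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                y[i][j] = rand() / (double)RAND_MAX;
                z[i][j] = rand() / (double)RAND_MAX;
            }
        clock_t t0 = clock();
        matmul_naive();
        clock_t t1 = clock();
        for (int i = 0; i < N; i++)        /* blocked version accumulates into x, */
            for (int j = 0; j < N; j++)    /* so zero it first */
                x[i][j] = 0;
        clock_t t2 = clock();
        matmul_blocked();
        clock_t t3 = clock();
        printf("naive:   %.3fs\n", (t1 - t0) / (double)CLOCKS_PER_SEC);
        printf("blocked: %.3fs\n", (t3 - t2) / (double)CLOCKS_PER_SEC);
        return 0;
    }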

9 Case Study I: Sun's Niagara

 Commercial servers require high thread-level throughput and suffer from cache misses.
 Sun's Niagara focuses on:
   simple cores (low power, low design complexity, can accommodate more cores)
   fine-grain multi-threading (to tolerate long memory latencies)

10 Niagara Overview
(Figure: Niagara block diagram.)

11 SPARC Pipe

 No branch predictor
 Low clock speed (1.2 GHz)
 One FP unit shared by all cores

12 Case Study II: Intel Pentium 4

 Pursues high clock speed, ILP, and TLP.
 CISC instructions are translated into micro-ops and stored in a trace cache, avoiding re-translation every time.
 Uses register renaming with 128 physical registers.
 Supports up to 48 in-flight loads and 32 in-flight stores.
 Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle.
 A simple instruction has to traverse a 31-stage pipeline.
 A combining (hybrid) branch predictor with local and global histories.
 16 KB 8-way L1 (4-cycle for integers, 12-cycle for FP); 2 MB 8-way L2 (18-cycle).

13 Clock Rate vs. CPI: AMD Opteron vs. P4

 Comparing a 2.8 GHz AMD Opteron against a 3.8 GHz Intel P4, the Opteron provides a speedup of 1.08.
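Since performance scales as clock rate divided by CPI, the slide's numbers let us back out the CPI gap (an inference from the stated figures, not a measured value): speedup = (2.8/CPI_Opteron) / (3.8/CPI_P4) = 1.08, so CPI_P4 / CPI_Opteron = 1.08 × 3.8/2.8 ≈ 1.47. The P4's higher clock rate is more than offset by the extra cycles each instruction takes in its deeper pipeline.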

14 Case Study III: Intel Core Architecture

 Single-thread execution is still considered important, so out-of-order execution and speculation are very much alive; initial processors will have a few heavy-weight cores.
 To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than to the P4 (30 stages).
 Many transistors are invested in a large branch predictor to reduce wasted work (and hence wasted power).
 Similarly, SMT is not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter).

15 Case Study IV: More Processors

 Intel Nehalem: successor to Core, to be released in late '08 with 8 cores (45 nm technology); 4-wide issue, support for SMT.
 AMD Barcelona: 4 cores, issue width of 3; each core has a private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD has also announced 3-core chips).
 Sun Niagara2: 8 threads per core, up to 8 cores, … W, … GHz, 4 MB L2 (8 banks), 8 FP units.
 Sun Rock: in development (to be released in 2008); has support for "scout threads" and transactional memory; 16 cores, 2 threads/core.
 IBM Power6: 2 cores, 4.7 GHz; each core has a private 4 MB L2.

16 Example Look-Up

 Virtual page abc → physical page xyz (the translation is cached in the TLB).
 If each PTE is 8 bytes, the PTE for page abc sits at virtual address abc × 8 within the virtually-mapped page table; call this virtual address lmn.
 Virtual address lmn → physical address pqr, so the PTE itself is fetched from pqr in physical memory.
(Figure: the look-up walking through the TLB, virtual memory, and physical memory.)
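A sketch of the address arithmetic; PT_BASE and the field widths are hypothetical, purely for illustration:

    #include <stdint.h>

    #define PAGE_SHIFT 13                    /* assumed 8 KB pages */
    #define PT_BASE    0x200000000ULL        /* hypothetical virtual base of the page table */

    /* The PTE for virtual page vpn lives in the virtually-mapped page table
       at PT_BASE + vpn * 8 (8-byte PTEs). That address is itself virtual,
       so fetching the PTE may require its own translation. */
    uint64_t pte_vaddr(uint64_t vaddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        return PT_BASE + vpn * sizeof(uint64_t);
    }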

17 Alpha Address Mapping

 Virtual address layout: | unused bits (21) | Level-1 index (10 bits) | Level-2 index (10 bits) | Level-3 index (10 bits) | page offset (13 bits) |
 The page table base register points to the L1 page table; the L1 index selects a PTE that points to the L2 page table; the L2 index selects a PTE that points to the L3 page table; the L3 index selects the final PTE, which holds a 32-bit physical page number.
 Physical address (45 bits) = 32-bit physical page number concatenated with the 13-bit page offset.

18 Alpha Address Mapping

 Each PTE is 8 bytes; if the page size is 8 KB, a page can contain 1024 PTEs, so 10 bits are needed to index into each level.
 If the page size doubles (to 16 KB), each level holds 2048 PTEs (11 index bits) and the offset grows to 14 bits, so we need 3 × 11 + 14 = 47 bits of virtual address.
 Since a PTE stores only a 32-bit physical page number, the physical memory can be addressed by at most 32 + offset bits.
 The first two levels are in physical memory; the third is in virtual memory.
 Why the three-level structure? Even a flat structure would need PTEs for the PTEs, which would have to be stored in physical memory; more levels of indirection make it easier to dynamically allocate pages.
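A sketch of how the three level indices and the offset would be carved out of the virtual address; the field widths come from the slides, but the table-walk loop itself is a simplified illustration, not the real Alpha mechanism:

    #include <stdint.h>

    #define OFFSET_BITS 13  /* 8 KB pages */
    #define LEVEL_BITS  10  /* 1024 PTEs per 8 KB page-table page */
    #define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)

    /* Split a virtual address into its Alpha-style fields. */
    void split(uint64_t va, unsigned idx[3], uint64_t *offset) {
        *offset = va & ((1u << OFFSET_BITS) - 1);
        idx[2] = (va >> OFFSET_BITS) & LEVEL_MASK;                     /* L3 index */
        idx[1] = (va >> (OFFSET_BITS + LEVEL_BITS)) & LEVEL_MASK;      /* L2 index */
        idx[0] = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & LEVEL_MASK;  /* L1 index */
    }

    /* Hypothetical walk: here each level's PTE is treated as a pointer to
       the next table, and the final PTE holds a 32-bit physical page number. */
    uint64_t translate(const uint64_t *l1 /* page table base */, uint64_t va) {
        unsigned idx[3];
        uint64_t off;
        split(va, idx, &off);
        const uint64_t *l2 = (const uint64_t *)(uintptr_t)l1[idx[0]];
        const uint64_t *l3 = (const uint64_t *)(uintptr_t)l2[idx[1]];
        uint64_t ppn = l3[idx[2]] & 0xFFFFFFFFu;   /* 32-bit physical page number */
        return (ppn << OFFSET_BITS) | off;         /* 45-bit physical address */
    }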
