1 Lecture 26: Case Studies
Topics: processor case studies, Flash memory
Final exam stats:
- Highest 83, median 67
- 70+: 16 students; 60-69: 20 students
- 1st 3 problems and 7th problem: gimmes
- 4th problem (LSQ): half the students got full points
- 5th problem (cache hierarchy): 1 correct solution
- 6th problem (coherence): most got more than 15 points
- 8th problem (TM): very few mentioned frequent aborts, starvation, and livelock
- 9th problem (TM): no one got close to full points
- 10th problem (LRU): 1 “correct” solution with the tree structure

2 Finals Discussion: LSQ, Caches, TM, LRU

3 Case Study I: Sun’s Niagara
Commercial servers require high thread-level throughput and suffer from cache misses
Sun’s Niagara focuses on:
- simple cores (low power, low design complexity, room to accommodate more cores)
- fine-grain multi-threading (to tolerate long memory latencies; see the sketch below)
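
A minimal sketch of the fine-grain thread-selection policy described above: each cycle the core issues from the next ready thread in round-robin order, skipping threads stalled on memory. The 4-thread count, 30-cycle miss latency, and toy miss pattern are illustrative assumptions, not Niagara parameters.

#include <stdio.h>

#define NUM_THREADS 4          /* assumption: Niagara-style threads per core */

typedef struct {
    int stall_cycles_left;     /* >0 means the thread is waiting on memory */
    long instructions_retired;
} HwThread;

/* Pick the next ready thread, round-robin, starting after the last one issued. */
static int select_thread(HwThread t[], int last) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last + i) % NUM_THREADS;
        if (t[cand].stall_cycles_left == 0)
            return cand;
    }
    return -1;                 /* all threads stalled: the pipeline idles */
}

int main(void) {
    HwThread threads[NUM_THREADS] = {0};
    int last = NUM_THREADS - 1;

    for (long cycle = 0; cycle < 1000; cycle++) {
        for (int i = 0; i < NUM_THREADS; i++)       /* outstanding misses make progress */
            if (threads[i].stall_cycles_left > 0)
                threads[i].stall_cycles_left--;

        int tid = select_thread(threads, last);
        if (tid < 0)
            continue;                               /* no ready thread: lost cycle */
        last = tid;
        threads[tid].instructions_retired++;

        /* toy model: every 10th instruction of a thread misses for 30 cycles */
        if (threads[tid].instructions_retired % 10 == 0)
            threads[tid].stall_cycles_left = 30;
    }

    for (int i = 0; i < NUM_THREADS; i++)
        printf("thread %d retired %ld instructions\n", i, threads[i].instructions_retired);
    return 0;
}

With one thread, every 10th instruction would stall the pipe for 30 cycles; with four threads the selector almost always finds a ready thread, which is the latency-tolerance argument on the slide.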

4 Niagara Overview

5 SPARC Pipe
No branch predictor
Low clock speed (1.2 GHz)
One FP unit shared by all cores

6 Case Study II: Sun’s Rock
16 cores, each with 2 thread contexts
10 W per core (14 mm²); each core is in-order, 2.3 GHz (10-12 FO4), 65 nm; total of 240 W and 396 mm²!
New features: scout threads that prefetch while the main thread is stalled on a memory access (see the sketch below), and support for HTM (lazy versioning and eager conflict detection)
Each cluster of 4 cores shares a 32KB I-cache, two 32KB D-caches (one D-cache per pair of cores), and 2 FP units; the caches are 4-way p-LRU
The L2 cache is 2 MB, 4-banked, 8-way p-LRU; clusters are connected with a crossbar switch
Good read:
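
The scout-thread idea, sketched in software: while the main thread would be stalled on a miss, a scout runs ahead through the future instruction stream only to warm the caches, discarding its results. The linked-list example below uses the GCC/Clang __builtin_prefetch intrinsic and an assumed lookahead distance; it only illustrates the idea and is not Rock's hardware mechanism.

#include <stddef.h>

struct node {
    struct node *next;
    long payload;
};

long sum_list_with_runahead(struct node *head, int scout_distance) {
    long sum = 0;
    for (struct node *n = head; n != NULL; n = n->next) {
        /* "scout": run ahead of the main work and prefetch future nodes */
        struct node *ahead = n;
        for (int d = 0; d < scout_distance && ahead != NULL; d++) {
            ahead = ahead->next;
            if (ahead != NULL)
                __builtin_prefetch(ahead, 0 /* read */, 1 /* low temporal locality */);
        }
        /* main thread: the architecturally visible work */
        sum += n->payload;
    }
    return sum;
}

In Rock the scout is a hardware thread context that reuses the stalled core's pipeline, so the run-ahead costs no extra core; the prefetch loop above simply stands in for that behavior.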

7 Rock Overview

8 Case Study III: Intel Pentium 4
Pursues high clock speed, ILP, and TLP
CISC instructions are translated into micro-ops and stored in a trace cache to avoid re-translating them every time (see the sketch below)
Uses register renaming with 128 physical registers
Supports up to 48 in-flight loads and 32 in-flight stores
Rename/commit width of 3; up to 6 instructions can be dispatched to functional units every cycle
A simple instruction has to traverse a 31-stage pipeline
Combining branch predictor that uses local and global histories
16KB 8-way L1 (4-cycle access for ints, 12-cycle for FPs); 2MB 8-way L2 (18-cycle access)
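
A rough sketch of what "translate once, store the micro-ops" means: the fetch path first checks a cache of already-decoded micro-op traces and only invokes the x86 decoder on a miss. The direct-mapped organization, 6-uop line size, and decode_into_uops helper are assumptions for illustration, not the P4's actual trace-cache design.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TC_SETS   256          /* assumption: direct-mapped, 256 trace lines */
#define TRACE_LEN 6            /* assumption: up to 6 uops per trace line    */

typedef struct { uint32_t op; } uop_t;          /* stand-in for a decoded micro-op */

typedef struct {
    bool     valid;
    uint64_t tag;              /* starting fetch address of the trace */
    int      n_uops;
    uop_t    uops[TRACE_LEN];
} tc_line_t;

static tc_line_t trace_cache[TC_SETS];

/* Hypothetical decoder: turns the x86 instructions starting at pc into uops. */
extern int decode_into_uops(uint64_t pc, uop_t out[], int max_uops);

/* Fetch decoded uops for pc; a trace-cache hit skips x86 decode entirely. */
int fetch_uops(uint64_t pc, uop_t out[], int max_uops) {
    tc_line_t *line = &trace_cache[(pc >> 2) % TC_SETS];
    if (!(line->valid && line->tag == pc)) {    /* miss: decode once and install */
        line->n_uops = decode_into_uops(pc, line->uops, TRACE_LEN);
        line->tag    = pc;
        line->valid  = true;
    }
    int n = line->n_uops < max_uops ? line->n_uops : max_uops;
    memcpy(out, line->uops, (size_t)n * sizeof(uop_t));
    return n;
}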

9 Clock Rate vs. CPI: AMD Opteron vs. P4
2.8 GHz AMD Opteron vs. 3.8 GHz Intel P4: the Opteron provides a speedup of 1.08
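
Unpacking that speedup with the execution-time equation, and assuming both machines execute the same instruction count, shows how much lower the Opteron's CPI must be:

\[
T = \frac{IC \times CPI}{f}, \qquad
\text{Speedup} = \frac{T_{P4}}{T_{Opteron}}
= \frac{CPI_{P4}}{CPI_{Opteron}} \times \frac{f_{Opteron}}{f_{P4}} = 1.08
\;\Rightarrow\;
\frac{CPI_{P4}}{CPI_{Opteron}} = 1.08 \times \frac{3.8}{2.8} \approx 1.47
\]

In other words, the P4 needs roughly 47% more cycles per instruction, which more than cancels its 36% clock-speed advantage.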

10 Case Study IV: Intel Core Architecture
Single-thread execution is still considered important:
- out-of-order execution and speculation are very much alive
- initial processors will have a few heavy-weight cores
To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than to the P4 (30 stages)
Many transistors are invested in a large branch predictor to reduce wasted work (and hence power)
Similarly, SMT is not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

11 Case Study V: Intel Nehalem
Quad core, each core with 2 SMT threads
The 96-entry ROB of the Core 2 has been increased to 128 entries in Nehalem; ROB entries are dynamically allocated across threads
Lots of power modes; built-in power control unit
32KB L1 I- and D-caches; 256KB private L2 cache per core (10-cycle); 8MB shared L3 cache (~40 cycles)
L1 dTLB with 64 entries for 4KB pages and 32 entries for 4MB pages; 512-entry L2 TLB (small pages only)
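
A quick sanity check on those TLB sizes: their reach with 4KB pages (simple arithmetic, not stated on the slide) is

\[
\text{L1 dTLB reach} = 64 \times 4\,\text{KB} = 256\,\text{KB}, \qquad
\text{L2 TLB reach} = 512 \times 4\,\text{KB} = 2\,\text{MB}
\]

so the L1 dTLB's small-page reach happens to match the 256KB private L2 cache.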

12 Nehalem Memory Controller Organization
(Figure: four sockets connected by QPI; each socket holds four cores and three memory controllers, MC1-MC3, each with its own DIMMs)

13 Flash Memory
The technology is cost-effective enough that flash memory can now replace magnetic disks in laptops (such drives are known as solid-state disks)
Non-volatile; fast reads (~15 MB/sec, though slower than DRAM); a write requires an entire block to be erased first (about 100K erases are possible per block; block sizes can be … KB)
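
To make the erase-before-write constraint concrete, here is a minimal sketch of what updating a single page inside a flash block would entail if done in place; the 2KB page and 64-page block sizes and the function name are illustrative assumptions, and real SSDs instead use a flash translation layer that redirects writes to pre-erased blocks.

#include <string.h>

#define PAGE_SIZE       2048                     /* assumption: 2KB flash page    */
#define PAGES_PER_BLOCK 64                       /* assumption: 128KB erase block */
#define BLOCK_SIZE      (PAGE_SIZE * PAGES_PER_BLOCK)

/* In-place update of one page: the whole block must be read, erased
 * (the slow, endurance-limited step), and programmed back. */
void flash_update_page_in_place(unsigned char block[BLOCK_SIZE], int page,
                                const unsigned char data[PAGE_SIZE]) {
    static unsigned char copy[BLOCK_SIZE];

    memcpy(copy, block, BLOCK_SIZE);                   /* 1. read the whole block   */
    memcpy(copy + page * PAGE_SIZE, data, PAGE_SIZE);  /* 2. modify one page        */
    memset(block, 0xFF, BLOCK_SIZE);                   /* 3. erase: all bits -> 1   */
    memcpy(block, copy, BLOCK_SIZE);                   /* 4. program the block back */
}

Every such update consumes one of the block's ~100K erase cycles, which is why wear leveling and out-of-place writes matter for SSDs.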

14 Advanced Course
Spr’09: CS 7810: Advanced Computer Architecture
- co-taught by Al Davis and me
- lots of multi-core topics: cache coherence, TM, networks
- memory technologies: DRAM layouts, new technologies, memory controller design
- major course project on evaluating original ideas with simulators (can lead to publications)
- one programming assignment, take-home final

15 Case Studies: More Processors
AMD Barcelona: 4 cores, issue width of 3, each core has a private L1 (64 KB) and L2 (512 KB), shared L3 (2 MB), 95 W (AMD has also announced 3-core chips)
Sun Niagara2: 8 threads per core, up to 8 cores, … W, … GHz, 4 MB L2 (8 banks), 8 FP units
IBM Power6: 2 cores, 4.7 GHz, each core has a private 4 MB L2

16 Alpha Address Mapping
(Figure: the 64-bit virtual address is divided into 21 unused bits, three 10-bit level indices, and a 13-bit page offset; the page table base register locates the L1 page table, the L1 and L2 PTEs each point to the next level's table, and the L3 PTE supplies a 32-bit physical page number that is concatenated with the page offset to form the 45-bit physical address)

17 Alpha Address Mapping
Each PTE is 8 bytes; if the page size is 8KB, a page can hold 1024 PTEs, so 10 bits are needed to index into each level
If the page size doubles (to 16KB), we need 47 bits of virtual address: a 14-bit offset plus 3 × 11 index bits
Since a PTE stores only a 32-bit physical page number, at most 32 + 13 (offset) = 45 bits of physical address space can be addressed
The first two levels of the page table are in physical memory; the third is in virtual memory
Why the three-level structure? Even a flat page table would itself occupy many pages, and mapping those pages would require PTEs for the PTEs held in physical memory; more levels of indirection make it easier to allocate page-table pages dynamically
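
A software model of the walk just described, using the field widths from these slides (13-bit offset, three 10-bit indices, 8-byte PTEs whose low 32 bits hold the physical page number). The PTE layout and the read_pte helper are assumptions for illustration, not the exact Alpha format.

#include <stdint.h>

#define OFFSET_BITS 13                          /* 8KB pages                */
#define INDEX_BITS  10                          /* 1024 PTEs per 8KB table  */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)
#define PPN_MASK    0xFFFFFFFFull               /* 32-bit physical page no. */

typedef uint64_t pte_t;

/* Hypothetical helper: fetch the 8-byte PTE stored at a physical address. */
extern pte_t read_pte(uint64_t paddr);

uint64_t translate(uint64_t vaddr, uint64_t l1_table_base) {
    unsigned idx1 = (vaddr >> (OFFSET_BITS + 2 * INDEX_BITS)) & INDEX_MASK;
    unsigned idx2 = (vaddr >> (OFFSET_BITS + INDEX_BITS))     & INDEX_MASK;
    unsigned idx3 = (vaddr >>  OFFSET_BITS)                   & INDEX_MASK;

    pte_t l1 = read_pte(l1_table_base + 8ull * idx1);                    /* level 1 */
    pte_t l2 = read_pte(((l1 & PPN_MASK) << OFFSET_BITS) + 8ull * idx2); /* level 2 */
    pte_t l3 = read_pte(((l2 & PPN_MASK) << OFFSET_BITS) + 8ull * idx3); /* level 3 */

    /* 32-bit physical page number + 13-bit offset = 45-bit physical address */
    return ((l3 & PPN_MASK) << OFFSET_BITS) | (vaddr & OFFSET_MASK);
}

Each translation costs three dependent memory accesses, which is exactly what the TLB exists to avoid.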
