
High Performance Computing (CS 680)
Lecture 2a: Overview of High Performance Processors
Jeremy R. Johnson
*This lecture was derived from material in the text (HPC Chap. 1-2).

Introduction
Objective: To review recent developments in the design of high performance microprocessors and to indicate how these features affect program performance. An example program will be used to illustrate benchmarking techniques and the effect of compiler optimizations and code organization on performance. We will indicate how changes in software can improve performance by better utilizing the underlying hardware. Our goal for the course is to understand this behavior.
Topics
–pipelining
–instruction level parallelism, superscalar and out-of-order execution
–memory hierarchy: cache, virtual memory

RISC vs. CISC
CISC: the instruction set is made up of powerful instructions close to the primitives of a high-level language such as C or FORTRAN.
RISC: low-level instructions are emphasized. "RISC is a label most commonly used for a set of instruction set architecture characteristics chosen to ease the use of aggressive implementation techniques found in high-performance processors" (John Mashey).
RISC became prevalent in the mid-1980s (an earlier example is the CDC 6600), when more transistors and better compilers became available. The idea is to trade complex instructions for a faster clock rate and more room for extra registers, cache, and advanced performance techniques.

Characterizing RISC
–Instruction pipelining
–Pipelined floating point execution
–Uniform instruction length
–Delayed branching
–Load/store architecture
–Simple addressing modes

Pipelining
Instruction pipelining stages:
–Instruction Fetch (IF)
–Instruction Decode (ID)
–Operand Fetch (F)
–Execute (E)
–Writeback (W)

  IF ID F  E  W
     IF ID F  E  W
        IF ID F  E  W

Successive instructions overlap in the pipeline; once the pipeline is full, one instruction completes every cycle.
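A quick sketch (not from the slides) of why the overlap matters: with k stages and n instructions, a pipeline finishes in k + (n - 1) cycles instead of k × n.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction occupies the whole datapath for n_stages cycles.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # The first instruction fills the pipeline; afterwards one retires per cycle.
    return n_stages + (n_instructions - 1)

# 3 instructions through the 5-stage IF-ID-F-E-W pipeline in the diagram:
print(unpipelined_cycles(3, 5))  # 15
print(pipelined_cycles(3, 5))    # 7
```

As n grows, the pipelined time approaches one instruction per cycle, which is the ideal throughput of a single-issue pipeline.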

Branches and Hazards
If a branch is executed, the pipeline may need to be flushed, since the wrong instructions may have been started.

  IF ID F  E  W            (branch)
     IF ID F  E  W         (guess)
        IF ID F  E  W      (guess)
           IF ID F  E  W   (guess)
              IF ID F  E   (sure)

Until the branch outcome is known, the instructions fetched behind it are only a guess; if the guess is wrong they must be squashed and fetching restarts on the correct path.

Advanced Techniques
Superscalar processors
–issue more than one instruction per cycle
–the instructions must have no dependencies or hardware conflicts
–for example, an add can execute simultaneously with a mult
Superpipelining
–more stages in the pipeline
Out-of-order and speculative execution
–maintain program semantics but allow instructions to be computed in a different order
–the processor may need to guess which instruction to execute next
–relies on the distinction between when an instruction computes its result and when that result is committed

Post-RISC Pipeline
[Diagram: instruction fetch (IF) and decode (ID) with branch prediction, register renaming (RR), an instruction reorder buffer (IRB), execute (E), and in-order retirement (R).]

Memory Hierarchy
SRAM vs. DRAM
–small fast memory vs. large slow memory
–principle of locality
Registers
Cache (level 1)
Cache (level 2)
Main memory
Disk

Memory Access Speed on the DEC Alpha
Clock speed: 500 MHz (= 2 ns cycle time)
–Registers (2 ns)
–L1 on-chip (4 ns)
–L2 on-chip (5 ns)
–L3 off-chip (30 ns)
–Memory (220 ns)
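These latencies can be combined into an average memory access time, where each level's miss rate determines how often the next, slower level is consulted. A sketch using the Alpha latencies from the slide; the per-level miss rates here are hypothetical:

```python
def amat(hit_times_ns, miss_rates):
    # hit_times_ns: latency of each level, fastest first; the last level
    # always hits. miss_rates: miss rate at each level except the last.
    # AMAT = t1 + m1*(t2 + m2*(t3 + ...)), built from the innermost term out.
    t = hit_times_ns[-1]
    for hit, miss in zip(reversed(hit_times_ns[:-1]), reversed(miss_rates)):
        t = hit + miss * t
    return t

# L1/L2/L3/memory latencies from the slide, assumed 10% miss rate per level:
print(amat([4, 5, 30, 220], [0.1, 0.1, 0.1]))  # about 5 ns
```

Note how close the result stays to the L1 latency: as long as miss rates are low, the hierarchy gives nearly the speed of its fastest level with the capacity of its slowest.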

Common Framework for Memory Hierarchies
Question 1: Where can a block be placed?
–One place (direct mapped), a few places (set associative), or any place (fully associative)
Question 2: How is a block found?
–There are four methods: indexing, limited search, full search, or table lookup
Question 3: Which block should be replaced on a cache miss?
–Typically the least recently used or a random block
Question 4: What happens on writes?
–Write-through or write-back

Mapping to Cache
Cache: the level of the memory hierarchy between the CPU and main memory.
Direct-mapped cache: each memory block maps to exactly one location in the cache:
  (Block address) mod (Number of blocks in cache)
The number of blocks is typically a power of two, so the cache location is obtained from the low-order bits of the block address.
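A minimal sketch of the mapping rule above, showing why a power-of-two block count lets the hardware use low-order address bits instead of a division:

```python
def direct_mapped_index(block_address, num_blocks):
    # The mapping rule from the slide: (block address) mod (blocks in cache).
    return block_address % num_blocks

def index_via_low_bits(block_address, num_blocks):
    # When num_blocks is a power of two, mod is just masking low-order bits.
    return block_address & (num_blocks - 1)

# Block 22 (0b10110) in an 8-block cache lands in line 6 (0b110) either way:
print(direct_mapped_index(22, 8))  # 6
print(index_via_low_bits(22, 8))   # 6
```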

Locating Data in the Cache
Compare the tag (high-order address bits) stored at the cache index (the mapping) to the tag of the requested address to see if the element is currently in the cache.
A valid bit indicates whether the data in a cache line is valid.
A hit occurs when the data is in the cache; otherwise it is a miss.
The extra time required when a cache miss occurs is called the miss penalty.
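The tag-plus-valid-bit check can be sketched in a few lines. The 8-line cache size is a hypothetical choice matching the example slide; each line is modeled as a (valid, tag) pair:

```python
NUM_LINES = 8  # hypothetical direct-mapped cache with 8 lines

def lookup(cache, block_address):
    # Split the block address: low-order bits pick the line (index),
    # high-order bits are the tag that must match for a hit.
    index = block_address % NUM_LINES
    tag = block_address // NUM_LINES
    valid, stored_tag = cache[index]
    return valid and stored_tag == tag  # True -> hit, False -> miss

cache = [(False, 0)] * NUM_LINES       # all lines start invalid
print(lookup(cache, 22))               # False: cold cache, a miss
cache[22 % NUM_LINES] = (True, 22 // NUM_LINES)
print(lookup(cache, 22))               # True: now a hit
```

The valid bit is what distinguishes "line holds block with tag 0" from "line has never been filled", which is why a tag match alone is not enough.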

Example
A 32-word memory and an 8-word cache.

Cache Organization
Since the cache is smaller than memory, more than one address must map to the same line in the cache.
Direct-mapped cache
–address mod cache size (only one cache location where a memory address can be mapped)
Fully associative cache
–an address can be mapped anywhere in the cache
–needs a tag and an associative search to find whether an element is in the cache
Set-associative cache
–a compromise between the two extremes
–an element can map to several locations
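The three organizations are really one scheme with a varying set size. A sketch (my framing, not the slides'): an 8-line cache is 8 sets of 1 way when direct mapped, 4 sets of 2 ways when 2-way set associative, and 1 set of 8 ways when fully associative.

```python
def candidate_lines(block_address, num_sets, ways):
    # A block maps to exactly one set, then may occupy any of that set's ways.
    s = block_address % num_sets
    return [s * ways + w for w in range(ways)]

# The same 8-line cache under the three organizations, for block 22:
print(candidate_lines(22, 8, 1))  # [6]           direct mapped: one choice
print(candidate_lines(22, 4, 2))  # [4, 5]        2-way set associative
print(candidate_lines(22, 1, 8))  # all 8 lines   fully associative
```

More ways mean fewer conflicts but a wider associative search on every lookup, which is the hardware cost the compromise trades against.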

Model for Cache Misses
Compulsory misses (cold-start)
–cache misses caused by the first access to a block that has never been in the cache
Capacity misses
–cache misses caused when the cache cannot contain all the blocks needed during execution of a program
Conflict misses (collision)
–cache misses that occur in a set-associative or direct-mapped cache when multiple blocks compete for the same set; these misses are eliminated with a fully associative cache
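The three categories can be separated mechanically by comparing a direct-mapped cache against a fully associative LRU cache of the same size. This classifier is an illustrative sketch, not from the slides: first-touch misses are compulsory; misses the fully associative cache would have avoided are conflict; the rest are capacity.

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    direct = [None] * num_blocks      # direct-mapped: one block per line
    lru = OrderedDict()               # fully associative cache with LRU
    seen = set()                      # blocks ever touched (for compulsory)
    counts = {"compulsory": 0, "conflict": 0, "capacity": 0}
    for addr in trace:
        dm_hit = direct[addr % num_blocks] == addr
        fa_hit = addr in lru
        if fa_hit:
            lru.move_to_end(addr)     # refresh LRU position
        else:
            lru[addr] = True
            if len(lru) > num_blocks:
                lru.popitem(last=False)  # evict least recently used
        if not dm_hit:
            if addr not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1   # only the mapping was at fault
            else:
                counts["capacity"] += 1   # even full associativity misses
        direct[addr % num_blocks] = addr
        seen.add(addr)
    return counts

# Blocks 0 and 4 collide in a 4-block direct-mapped cache:
print(classify_misses([0, 4, 0, 4], 4))
# {'compulsory': 2, 'conflict': 2, 'capacity': 0}
```

The example trace fits easily in four blocks, so the fully associative cache hits on the repeats; the direct-mapped misses on them are therefore pure conflict misses, matching the slide's claim that full associativity eliminates them.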

Measuring Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
Memory stall clock cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = Reads/program × Read miss rate × Read miss penalty
Write-stall cycles = (Writes/program × Write miss rate × Write miss penalty) + Write buffer stalls (assumes a write-through cache)
If write buffer stalls are negligible and the write and read miss penalties are equal (the cost to fetch a block from memory), this simplifies to:
Memory stall clock cycles = Memory accesses/program × Miss rate × Miss penalty
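The simplified formula above is easy to turn into a worked example. All the input numbers here are hypothetical, chosen only to illustrate the arithmetic:

```python
def cpu_time_ns(exec_cycles, accesses, miss_rate, miss_penalty, cycle_time_ns):
    # Simplified model from the slide: memory stall cycles =
    # accesses * miss rate * miss penalty (write-buffer stalls ignored).
    stall_cycles = accesses * miss_rate * miss_penalty
    return (exec_cycles + stall_cycles) * cycle_time_ns

# 1M execution cycles, 300K memory accesses, 5% miss rate,
# 40-cycle miss penalty, 2 ns clock cycle:
print(cpu_time_ns(1_000_000, 300_000, 0.05, 40, 2))  # 3200000.0
```

With these numbers the 600,000 stall cycles add 60% to the execution time, which is why cache behavior dominates performance tuning on modern processors.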

Virtual Memory
Decouples physical addresses (memory locations) from the addresses used by a program. The programmer sees a large memory with the same virtual addresses, independent of where the program is actually placed in physical memory.
–Virtual-to-physical mapping is performed via a page table.
–Since page tables can themselves be in virtual memory, a single memory reference could require several table lookups.
–The TLB (translation lookaside buffer) is a cache for commonly used virtual-to-physical mappings.
Page fault
–when a page is not in memory it must be brought in (from disk)
–very slow (usually handled with OS intervention)
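A minimal sketch of the translation path described above, with a dictionary standing in for the page table and the TLB; the 4 KB page size is a hypothetical choice:

```python
PAGE_SIZE = 4096  # hypothetical 4 KB pages

def translate(vaddr, tlb, page_table):
    # Split the virtual address into a virtual page number and an offset,
    # then find the physical frame: TLB first, page table on a TLB miss.
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        frame = tlb[vpn]                  # TLB hit: fast path
    elif vpn in page_table:
        frame = page_table[vpn]           # TLB miss: walk the page table
        tlb[vpn] = frame                  # cache the mapping for next time
    else:
        # Page fault: the OS must bring the page in from disk (very slow).
        raise KeyError("page fault")
    return frame * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}   # virtual page -> physical frame
tlb = {}
print(translate(5, tlb, page_table))  # 28677 (frame 7, offset 5)
print(0 in tlb)                       # True: the mapping is now TLB-cached
```

The offset passes through unchanged because a page is the unit of mapping; only the page-number bits are translated.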

Improving Memory Performance
–Larger and wider caches
–Cache bypass
–Interleaved and pipelined memory systems
–Prefetching
–Post-RISC effects on memory
–New memory trends