Performance improvements ( 1 ) How to improve performance?
- Reduce the number of clock cycles per instruction, and/or
- Simplify the organization so that the clock cycle can be shorter, and/or
- Overlap the execution of instructions (pipelining)

Path length ( 2 ) The path length is the number of clock cycles needed to execute a set of operations. How can it be improved? For example, by reducing the number of clock cycles needed for fetching instructions.

Reducing the path length #1 ( 3 ) Merge the Interpreter Loop (Main1) into the end of each microcode sequence
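As a purely illustrative calculation (the instruction mix is an assumption, not from the slides): if an average IJVM instruction needs 4 microinstructions plus 1 extra cycle to go through Main1, folding Main1 into the end of each sequence removes 1 cycle out of 5, i.e. roughly a 20% shorter path length.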

Reducing the path length #2 (1) ( 4 ) Go from a two-bus design to a three-bus design

Reducing the path length #2 (2) ( 5 ) Go from a two-bus design to a three-bus design Improvement #1

Reducing the path length #3 ( 6 ) Have instructions fetched from memory by a specialized functional unit (IFU = Instruction Fetch Unit). Note that, for every instruction, the following may happen:
1. The PC is passed through the ALU and incremented
2. The PC is used to fetch the next byte in the instruction stream
3. Operands are read from memory
4. Operands are written to memory
5. The ALU does a computation and the results are stored back

Instruction Fetch Unit 1 ( 7 ) Problem: Some instructions must explicitly fetch additional operands from memory, one byte at a time. The ALU is then used on almost every instruction just to increment the PC or to assemble the additional fields. Solution: Introduce a second (simplified) ALU dedicated to fetching and processing instructions = Instruction Fetch Unit.

Instruction Fetch Unit 2 ( 8 ) Instruction MAR (IMAR): holds the address used for fetching (it is auto-incremented after every fetch). Data from memory is read in 4-byte blocks and pushed into a shift register (6 bytes deep).

Instruction Fetch Unit 3 ( 9 ) MBR1 and MBR2 are automatically loaded by the IFU. MBR2: the oldest 2 bytes read from memory. MBR1: the oldest byte read from memory.

Instruction Fetch Unit 4 ( 10 ) Instructions requiring a one-byte operand can read it from the MBR1 register. Instructions with a 16-bit operand (for example an offset) can get it directly from MBR2. When a read from MBR1 or MBR2 occurs, the shift register is shifted and new values for MBR1 and MBR2 are automatically set.

Instruction Fetch Unit 5 ( 11 ) The IFU keeps the PC up to date: it increments it by 1 on an MBR1 read and by 2 on an MBR2 read. The IFU ensures that the byte pointed to by the PC register is in MBR1. The Write PC signal also enables writing to the IMAR (the IMAR cannot be accessed by the microcode).

Instruction Fetch Unit 6 ( 12 ) The IFU behavior can be described by means of an FSM. The state number represents the number of bytes still available in the shift register. A word fetch is performed automatically by the IFU when fewer than 3 bytes are left in the shift register. The IFU also senses when MBR1 or MBR2 is read.

Instruction Fetch Unit 7 ( 13 ) What happens when a word is fetched? 4 consecutive bytes are read from memory, starting from the address in the IMAR. The bytes read are pushed into the shift register. The IMAR is incremented by 1 (note: the IMAR refers to word-addressed memory, so incrementing it by 1 means jumping 4 bytes ahead).

Instruction Fetch Unit 8 ( 14 ) Jumps in the code: when the PC is changed by the microcode (i.e. when there is a jump), the new value is also written into the IMAR, then:
1. The shift register is flushed
2. A new 4-byte word is fetched from memory and pushed into the shift register
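To make the shift-register behavior of slides 7-14 concrete, here is a minimal behavioral sketch in C. It only models the dataflow described above (6-byte shift register, MBR1/MBR2 views, automatic word fetch when fewer than 3 bytes remain, flush on a jump); the names (ifu_read_mbr1, ifu_write_pc, ...), the memory interface, and the endianness handling are illustrative assumptions, not the actual MIC-2 hardware.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative software model of the IFU; names and layout are assumed. */
typedef struct {
    uint8_t  queue[6];   /* 6-byte shift register                        */
    int      count;      /* bytes currently available (the FSM state)    */
    uint32_t imar;       /* word address of the next fetch               */
    uint32_t pc;         /* byte address, kept up to date by the IFU     */
    const uint8_t *mem;  /* byte-addressed view of memory (assumed)      */
} Ifu;

/* Fetch one 4-byte word at IMAR, push it into the queue, IMAR += 1. */
static void ifu_fetch_word(Ifu *f) {
    memcpy(&f->queue[f->count], &f->mem[f->imar * 4], 4);
    f->count += 4;
    f->imar  += 1;              /* word-addressed: +1 means 4 bytes ahead */
}

/* Refill automatically whenever fewer than 3 bytes are left. */
static void ifu_refill(Ifu *f) {
    while (f->count < 3)
        ifu_fetch_word(f);
}

/* Read MBR1: the oldest byte; the queue shifts and the PC advances by 1. */
static uint8_t ifu_read_mbr1(Ifu *f) {
    uint8_t b = f->queue[0];
    memmove(f->queue, f->queue + 1, (size_t)(--f->count));
    f->pc += 1;
    ifu_refill(f);
    return b;
}

/* Read MBR2: the oldest 2 bytes; the PC advances by 2.
 * Modeled here as two MBR1 reads, high byte first (IJVM-style operand). */
static uint16_t ifu_read_mbr2(Ifu *f) {
    uint16_t hi = ifu_read_mbr1(f);
    uint16_t lo = ifu_read_mbr1(f);
    return (uint16_t)((hi << 8) | lo);
}

/* Jump: the microcode writes the PC, which also loads the IMAR,
 * flushes the shift register, and starts a fresh word fetch. */
static void ifu_write_pc(Ifu *f, uint32_t new_pc) {
    f->pc    = new_pc;
    f->imar  = new_pc / 4;   /* simplification: assumes word-aligned targets */
    f->count = 0;            /* flush */
    ifu_fetch_word(f);
}
```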

MIC-2 ( 15 )

Pipelining ( 16 )

Stage 1 ( 17 )

Stage 2 ( 18 )

Stage 3 ( 19 )

Stage 4 ( 20 )

...further improvements? ( 21 ) Until now we have tried to get better performance by means of architectural improvements. We will now see an example of an implementation improvement: cache memory.

Motivation ( 22 ) The problem: the recent high rate of growth in processor speed is not matched by a corresponding speed-up in memories. The solution: caches, which hold the most recently used memory words in a small, fast memory.

Cache levels ( 23 ) Split cache

Cache policy 1 ( 24 ) Spatial Locality: Caches bring in more data than requested; for example, if data at address A is requested, A+1, A+2, ... are also read and stored in the cache. Useful in the case of sequential access.
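A small, hedged illustration in C of why spatial locality helps: a sequential traversal touches consecutive addresses, so each line fetched on a miss serves the next several accesses (the 64-byte line size and 4-byte int size are assumptions for the comment, not from the slides).

```c
/* Sequential access: roughly one miss per cache line, then hits for
 * the rest of the line (assuming 64-byte lines and 4-byte ints,
 * about 1 miss per 16 accesses). */
long sum_sequential(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];        /* a[i], a[i+1], ... share cache lines */
    return sum;
}
```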

Cache policy 2 ( 25 ) Temporal Locality: The most recently used values are kept in the cache. Useful in cases where the same variables are accessed multiple times within a loop
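And a correspondingly hedged sketch of temporal locality: the same small data structure is touched on every iteration, so after the first pass it tends to stay resident in the cache (the histogram example itself is illustrative, not from the slides).

```c
/* Temporal access: the same small counts[] array is reused on every
 * iteration, so it stays cached across the whole loop. */
void histogram(const unsigned char *data, int n, int counts[256]) {
    for (int i = 0; i < n; i++)
        counts[data[i]]++;  /* repeated reuse -> temporal locality */
}
```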

Cache model ( 26 ) Memory is divided into fixed-size blocks called cache lines (4 to 64 consecutive bytes). Lines are numbered starting from 0 (the number of lines depends on the size of a cache line). At any instant, some of these lines are held in the cache.

Cache hit or miss ( 27 ) When a memory address is accessed, the memory controller determines whether the referenced word is currently in the cache: if so, the value can be used directly (cache HIT); if not, some cache entry is evicted and the needed line is fetched from memory or from a lower-level cache (cache MISS).

Direct mapped caches 1 ( 28 ) Each entry can hold one cache line (e.g. 32 bytes; with 2^11 = 2048 entries the total size is 64 KB). The Valid bit determines whether the entry is valid (at boot every line is marked as invalid). The Data field contains a copy of the data in memory (e.g. 32 bytes). A memory address is decomposed as follows:
- TAG distinguishes cache lines that map to the same entry (16 bits)
- LINE references a cache entry (11 bits; 2^11 = 2048 possible entries)
- WORD and BYTE are used to reference data within the line (3+2 bits; 2^5 = 32 bytes)
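A minimal sketch in C of that address decomposition, under the slide's assumptions (32-bit address, 16-bit TAG, 11-bit LINE, 32-byte lines); the field names follow the slide, but the helper itself is illustrative:

```c
#include <stdint.h>

/* Address split for the direct-mapped cache on slide 28:
 * | TAG (16) | LINE (11) | WORD (3) | BYTE (2) |          */
typedef struct {
    uint32_t tag;    /* bits 31..16                              */
    uint32_t line;   /* bits 15..5  -> which of the 2048 entries */
    uint32_t word;   /* bits  4..2  -> which 4-byte word in line */
    uint32_t byte;   /* bits  1..0  -> which byte in the word    */
} AddrFields;

static AddrFields split_address(uint32_t addr) {
    AddrFields f;
    f.byte = addr        & 0x3;    /*  2 bits */
    f.word = (addr >> 2) & 0x7;    /*  3 bits */
    f.line = (addr >> 5) & 0x7FF;  /* 11 bits */
    f.tag  = addr >> 16;           /* 16 bits */
    return f;
}
```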

Direct mapped caches 2 ( 29 ) Given a memory address, there is only one place to look for it in the cache (given by the LINE field; 2^11 = 2048 possibilities); note that multiple memory lines can map to the same cache entry. Several 32-byte-wide cache lines (one every 2^16 = 64 KB) 'share' the same cache entry because they have the same LINE field (although they have distinct TAGs). The 16-bit TAG plus the 11-bit LINE field of the memory address uniquely identify a 32-byte-wide memory chunk called a cache line. The LINE field determines where the data is stored in the cache.

Direct mapped caches 3 ( 30 ) When memory is accessed (a lookup sketch follows below):
- The LINE field selects which cache entry to check.
- If the VALID bit is 1, the memory controller compares the TAG bits of the memory address with those of the entry: if they are equal, the remaining fields (WORD, BYTE) are used to access the corresponding data.
- If the entry is not valid or the TAG does not match, the data must be fetched from memory and stored in the cache.
- If the data in the cache entry to be replaced has been modified, the changes must be written back to memory before overwriting the entry.
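As a rough illustration (not the slide's hardware), a direct-mapped lookup that reuses the split_address/AddrFields helper sketched above; the CacheEntry layout and the miss handling left to the caller are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_ENTRIES 2048
#define LINE_BYTES  32

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
    bool     dirty;              /* line modified since it was loaded */
} CacheEntry;

static CacheEntry cache[NUM_ENTRIES];

/* Returns true on a hit and copies the requested 4-byte word into *out.
 * On a miss the caller would fetch the line from memory, writing the
 * old line back first if it is dirty. */
static bool cache_lookup(uint32_t addr, uint32_t *out) {
    AddrFields f = split_address(addr);          /* from the sketch above */
    CacheEntry *e = &cache[f.line];
    if (e->valid && e->tag == f.tag) {           /* HIT */
        memcpy(out, &e->data[f.word * 4], 4);
        return true;
    }
    return false;                                /* MISS */
}
```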

n-way Set-Associative Caches 1 ( 31 ) The LINE part of the memory address is used to reference a set of n cache entries; n memory chunks with the same LINE field but different TAGs can co-exist in the cache.

n-way Set-Associative Caches 2 ( 32 ) When memory is accessed (see the sketch after this list):
- The LINE field is used to select which set to check.
- Each of the n sub-entries in that set (an n-way associative cache has n sub-entries; 4 in the previous example) is checked for validity.
- Among the valid sub-entries, check whether one matches the TAG field: if so, access the data in the cache.
- If no sub-entry is valid or no TAG matches, the data must be fetched from memory and stored in the cache, possibly replacing an existing entry (problem: which replacement algorithm? e.g. Least Recently Used).
- If data in the cache is modified, the changes must also be written to memory; there are two policies for this: write-through (the data is immediately written to memory as well) and write-back (the write to memory is deferred).
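A hedged sketch of a 4-way lookup with LRU replacement; the age-counter scheme, the set/way sizing, and the function name are illustrative choices, not taken from the slides:

```c
#include <stdint.h>

#define NUM_SETS 512            /* illustrative: 2048 lines / 4 ways */
#define NUM_WAYS 4

typedef struct {
    int      valid;
    uint32_t tag;
    uint32_t age;                /* bumped on every access; smallest = LRU */
} Way;

static Way sets[NUM_SETS][NUM_WAYS];
static uint32_t now;             /* global access counter used for LRU ages */

/* Returns the way index on a hit, or -1 on a miss.
 * On a miss, *victim is set to the least recently used way to replace
 * (an invalid way has age 0, so it is chosen first). */
static int set_assoc_lookup(uint32_t tag, uint32_t set, int *victim) {
    int lru = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (sets[set][w].valid && sets[set][w].tag == tag) {
            sets[set][w].age = ++now;        /* HIT: refresh LRU age */
            return w;
        }
        if (sets[set][w].age < sets[set][lru].age)
            lru = w;                          /* track the oldest way */
    }
    *victim = lru;                            /* MISS: replace this way */
    return -1;
}
```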

Caches ( 33 ) Direct mapped caches are the most common: collisions can often be avoided easily (and even more easily if the data and instruction caches are separated). Set-associative caches require more circuitry: two-way and four-way caches perform well enough to make this extra circuitry worthwhile.

IMPORTANT ( 34 ) Don't forget to prepare your questions for next week's Q&A session!

... ( 35 ) "There are no Stupid Questions!"