
1 Performance improvements ( 1 ) How to improve performance?
- Reduce the number of cycles per instruction, and/or
- Simplify the organization so that the clock cycle can be shorter, and/or
- Overlap the execution of instructions (pipelining)
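
As a rough, made-up illustration (not from the slides), all three directions enter the basic product "execution time = instruction count x CPI x clock period". The C sketch below simply evaluates that product with invented numbers; names and values are illustrative only.

```c
#include <stdio.h>

/* Rough illustration only: execution time modelled as
 * instructions * CPI * clock period. All numbers are invented. */
int main(void) {
    double instructions = 1e9;  /* dynamic instruction count            */
    double cpi          = 4.0;  /* average clock cycles per instruction */
    double clock_ns     = 2.0;  /* clock period in nanoseconds          */

    double t_base = instructions * cpi * clock_ns;
    /* Fewer cycles per instruction and a shorter clock cycle both
     * shrink the product; pipelining lowers the *effective* CPI.    */
    double t_improved = instructions * (cpi / 2.0) * (clock_ns * 0.75);

    printf("baseline: %.1f ms, improved: %.1f ms\n",
           t_base / 1e6, t_improved / 1e6);
    return 0;
}
```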

2 Path length ( 2 ) The path length is the number of clock cycles needed to execute a set of operations. How can it be improved? For example, by reducing the number of clock cycles spent fetching instructions.

3 Reducing the path length #1 ( 3 ) Merge the Interpreter Loop (Main1) into the end of each microcode sequence

4 Reducing the path length #2 (1) ( 4 ) Go from a two-bus design to a three-bus design

5 Reducing the path length #2 (2) ( 5 ) Go from a two-bus design to a three-bus design Improvement #1

6 Reducing the path length #3 ( 6 ) Have instructions fetched from memory by a specialized functional unit (IFU = Instruction Fetch Unit). Note that, for every instruction, the following may happen:
1. The PC is passed through the ALU and incremented
2. The PC is used to fetch the next byte in the instruction stream
3. Operands are read from memory
4. Operands are written to memory
5. The ALU does a computation and the results are stored back

7 Instruction Fetch Unit 1 ( 7 ) Problem: Some instructions must explicitly fetch additional operands from memory, one byte at a time. The ALU is then used in almost every instruction just to increment the PC or to assemble the additional fields. Solution: Introduce a second (simplified) ALU just for fetching and processing instructions = Instruction Fetch Unit.

8 Instruction Fetch Unit 2 ( 8 ) Instruction MAR (IMAR): holds the address used for fetching (it is auto-incremented after every fetch). Data from memory is read in 4-byte blocks and pushed into a shift register (6 bytes deep).

9 Instruction Fetch Unit 3 ( 9 ) MBR1 and MBR2 are automatically loaded by the IFU:
- MBR2: the oldest 2 bytes read from memory
- MBR1: the oldest byte read from memory

10 Instruction Fetch Unit 4 ( 10 ) Instructions requiring a one-byte operand can read it from the MBR1 register. Instructions with a 16-bit operand (for example an offset) can get it directly from MBR2. When a read from MBR1 or MBR2 occurs, the shift register is 'shifted' and new values for MBR1 and MBR2 are automatically set.

11 Instruction Fetch Unit 5 ( 11 ) The IFU keeps the PC up to date: it increments it by 1 on an MBR1 read and by 2 on an MBR2 read. The IFU ensures that the byte pointed to by the PC register is in MBR1. The Write PC signal also enables writing to the IMAR (the IMAR cannot be accessed directly by the microcode).

12 Instruction Fetch Unit 6 ( 12 ) The IFU behavior can be described by means of an FSM. The state number represents the number of bytes still available in the shift register. A word fetch is automatically started by the IFU when fewer than 3 bytes are left in the shift register. The IFU also senses when MBR1 or MBR2 is read.

13 Instruction Fetch Unit 7 ( 13 ) What happens when a word is fetched?
- 4 consecutive bytes are read from memory, starting from the address in the IMAR
- The bytes read are pushed into the shift register
- The IMAR is incremented by 1 (Note: the IMAR refers to word-addressed memory, so incrementing it by 1 jumps 4 bytes ahead)

14 Instruction Fetch Unit 8 ( 14 ) Jumps in the code: when the PC is changed by the microcode (i.e. when there is a jump), the new value is also written into the IMAR, then:
1. The shift register is flushed
2. A new 4-byte word is fetched from memory and pushed into the shift register
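
To make slides 7-14 concrete, here is a hedged C sketch of the IFU behaviour just described: a 6-byte shift register filled from a word-addressed IMAR, MBR1/MBR2 reads that shift the register and advance the PC, a refill when fewer than 3 bytes remain, and a flush on jumps. All names (ifu_t, ifu_read_mbr1, memory, ...) and the operand byte ordering are my own illustrative assumptions, not the actual MIC-2 hardware.

```c
#include <stdint.h>
#include <string.h>

#define SHIFT_DEPTH 6           /* shift register is 6 bytes deep              */

static uint8_t memory[1 << 16]; /* toy 64 KiB byte-addressable backing store;
                                   addresses are assumed to stay inside it     */

typedef struct {
    uint8_t  bytes[SHIFT_DEPTH]; /* bytes[0] is the oldest byte                */
    int      count;              /* bytes currently available                  */
    uint32_t imar;               /* word address of the next fetch             */
    uint32_t pc;                 /* byte address, kept up to date by the IFU   */
} ifu_t;

/* Fetch one 4-byte word starting at IMAR*4 and push it into the shift
 * register; the IMAR is incremented by 1 (i.e. 4 bytes ahead). */
static void ifu_fetch_word(ifu_t *ifu) {
    uint32_t byte_addr = ifu->imar << 2;      /* word address -> byte address  */
    for (int i = 0; i < 4 && ifu->count < SHIFT_DEPTH; i++)
        ifu->bytes[ifu->count++] = memory[byte_addr + i];
    ifu->imar++;
}

/* Refill policy from slide 12: fetch when fewer than 3 bytes are left. */
static void ifu_maybe_fetch(ifu_t *ifu) {
    if (ifu->count < 3)
        ifu_fetch_word(ifu);
}

/* Shift the register by n bytes, discarding the oldest ones. */
static void ifu_shift(ifu_t *ifu, int n) {
    memmove(ifu->bytes, ifu->bytes + n, (size_t)(ifu->count - n));
    ifu->count -= n;
}

/* MBR1: oldest byte; reading it advances the PC by 1. */
static uint8_t ifu_read_mbr1(ifu_t *ifu) {
    uint8_t v = ifu->bytes[0];
    ifu_shift(ifu, 1);
    ifu->pc += 1;
    ifu_maybe_fetch(ifu);
    return v;
}

/* MBR2: oldest two bytes; reading it advances the PC by 2.
 * (Big-endian assembly of the 16-bit operand is an assumption.) */
static uint16_t ifu_read_mbr2(ifu_t *ifu) {
    uint16_t v = (uint16_t)((ifu->bytes[0] << 8) | ifu->bytes[1]);
    ifu_shift(ifu, 2);
    ifu->pc += 2;
    ifu_maybe_fetch(ifu);
    return v;
}

/* Jump (slide 14): writing the PC also loads the IMAR, flushes the
 * shift register and triggers a fresh word fetch. */
static void ifu_write_pc(ifu_t *ifu, uint32_t new_pc) {
    ifu->pc    = new_pc;
    ifu->imar  = new_pc >> 2;          /* byte address -> word address         */
    ifu->count = 0;                    /* 1. the shift register is flushed     */
    ifu_fetch_word(ifu);               /* 2. a fresh 4-byte word is fetched    */
    ifu_shift(ifu, (int)(new_pc & 3)); /* drop bytes that precede the new PC   */
    ifu_maybe_fetch(ifu);
}
```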

15 MIC-2 ( 15 )

16 Pipelining ( 16 )

17 Stage 1 ( 17 )

18 Stage 2 ( 18 )

19 Stage 3 ( 19 )

20 Stage 4 ( 20 )

21 ...further improvements? ( 21 ) Until now we have tried to get better performance by means of architectural improvements. We will now see an example of an implementation improvement: cache memory.

22 Motivation ( 22 ) The problem: the recent high rate of growth in processor speed has not been accompanied by a corresponding speed-up in memories. The solution: caches, which hold the most recently used memory words in a small, fast memory.

23 Cache levels ( 23 ) Split cache

24 Cache policy 1 ( 24 ) Spatial locality: caches bring in more data than requested; for example, if the data at address A is requested, A+1, A+2, ... are also read and stored in the cache. Useful in the case of sequential access.

25 Cache policy 2 ( 25 ) Temporal Locality: The most recently used values are kept in the cache. Useful in cases where the same variables are accessed multiple times within a loop
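
A small illustration of both policies (my own example, not from the slides): the loop below touches consecutive addresses (spatial locality) and reuses the same few variables on every iteration (temporal locality).

```c
#include <stddef.h>

/* Illustrative only: summing an array touches addresses A, A+1, A+2, ...
 * sequentially (spatial locality), while `sum` and `i` are reused on every
 * iteration (temporal locality), so after the first miss per cache line
 * almost all accesses hit in the cache. */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```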

26 Cache model ( 26 ) Memory is divided into fixed-size blocks called cache lines (4 to 64 consecutive bytes). Lines are numbered consecutively starting from 0 (the number of lines depends on the size of a cache line). At any instant, some of these lines are held in the cache.

27 Cache hit or miss ( 27 ) When a memory address is accessed, the memory controller determines whether the referenced word is currently in the cache:
- If so, the value can be used directly (cache HIT)
- If not, an existing line entry is evicted and the needed line is fetched from memory or from a lower-level cache (cache MISS)

28 Direct mapped caches 1 ( 28 ) Each entry can hold one cache line (e.g. 32 bytes; with 2048 entries the total size is 64 KB). The Valid bit determines whether the entry is valid (at boot every line is marked as invalid). The Data field contains a copy of the data in memory (e.g. 32 bytes). A memory address is decomposed as follows (see the sketch below):
- TAG distinguishes cache lines that map to the same entry (16 bits)
- LINE references a cache entry (11 bits; 2^11 = 2048 possible entries)
- WORD and BYTE reference the data within the line (3+2 bits; 2^5 = 32 bytes)
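
As a concrete reading of the field sizes above (for a 32-bit address: 16-bit TAG, 11-bit LINE, 3-bit WORD, 2-bit BYTE), here is a hedged C sketch of the address split; the helper names are mine, not from the course material.

```c
#include <stdint.h>

/* Field widths taken from the slide: 2 + 3 + 11 + 16 = 32 bits. */
enum { BYTE_BITS = 2, WORD_BITS = 3, LINE_BITS = 11, TAG_BITS = 16 };

static inline uint32_t addr_byte(uint32_t a) { return a & ((1u << BYTE_BITS) - 1); }
static inline uint32_t addr_word(uint32_t a) { return (a >> BYTE_BITS) & ((1u << WORD_BITS) - 1); }
static inline uint32_t addr_line(uint32_t a) { return (a >> (BYTE_BITS + WORD_BITS)) & ((1u << LINE_BITS) - 1); }
static inline uint32_t addr_tag (uint32_t a) { return  a >> (BYTE_BITS + WORD_BITS + LINE_BITS); }

/* Example: address 0x00012345 -> TAG 0x0001, LINE 0x11A, WORD 1, BYTE 1. */
```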

29 Direct mapped caches 2 ( 29 ) Given a memory address, there is only one place to look for it in the cache (given by the LINE field; 2^11 = 2048 possibilities); note that multiple memory lines can map to the same cache entry. Several 32-byte-wide cache lines (one every 2^16 = 64 KB) 'share' the same cache entry because they have the same LINE field (although they have distinct TAGs). The 16-bit TAG plus the 11-bit LINE from the memory address uniquely identify a 32-byte-wide memory chunk, called a cache line. The LINE field determines where the data is to be stored in the cache.

30 Direct mapped caches 3 ( 30 ) When memory is accessed:
- The LINE field is used to select which cache entry to check
- If the VALID bit is 1, the memory controller compares the TAG bits from the memory address with the TAG of the entry: if they are equal, the remaining fields (WORD, BYTE) are used to access the corresponding data.
- If the entry is not valid or the TAG does not match, the data must be fetched from memory and stored in the cache. If the data in the cache entry being replaced has been modified, the changes must be written back to memory before overwriting the entry.
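
The lookup procedure above can be summarised in code. This is a hedged simulation sketch, assuming the 2048-entry, 32-byte-line geometry from the previous slides and a write-back style dirty bit; all structure and function names are my own, and the toy backing store simply assumes addresses fit inside it.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define N_ENTRIES 2048
#define LINE_SIZE 32

typedef struct {
    bool     valid;              /* entry holds meaningful data              */
    bool     dirty;              /* entry was modified since it was loaded   */
    uint16_t tag;                /* TAG field of the cached line             */
    uint8_t  data[LINE_SIZE];    /* copy of the 32-byte memory line          */
} cache_entry_t;

static cache_entry_t cache[N_ENTRIES];
static uint8_t       memory[1 << 24];   /* toy 16 MB backing store           */

static void mem_read_line (uint32_t addr, uint8_t *dst)       { memcpy(dst, &memory[addr], LINE_SIZE); }
static void mem_write_line(uint32_t addr, const uint8_t *src) { memcpy(&memory[addr], src, LINE_SIZE); }

uint8_t cache_read_byte(uint32_t addr) {
    uint32_t line = (addr >> 5) & 0x7FF;     /* LINE field selects the entry   */
    uint16_t tag  = (uint16_t)(addr >> 16);  /* TAG field identifies the chunk */
    cache_entry_t *e = &cache[line];

    if (!(e->valid && e->tag == tag)) {      /* cache MISS                     */
        uint32_t old_addr = ((uint32_t)e->tag << 16) | (line << 5);
        if (e->valid && e->dirty)            /* write modified data back first */
            mem_write_line(old_addr, e->data);
        mem_read_line(addr & ~(uint32_t)(LINE_SIZE - 1), e->data);
        e->valid = true;
        e->dirty = false;
        e->tag   = tag;
    }
    /* cache HIT (or line just filled): WORD+BYTE select the byte in the line */
    return e->data[addr & (LINE_SIZE - 1)];
}
```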

31 n-way Set-Associative Caches 1 ( 31 ) The LINE part of the memory address is used to reference a cache entry; n memory chunks with the same LINE but different TAGs can co-exist in the cache.

32 n-way Set-Associative Caches 2 ( 32 ) When memory is accessed:
- The LINE field is used to select which cache entry (set) to check
- For every sub-entry (an n-way associative cache has n sub-entries; 4 in the previous example), check whether at least one is valid.
- Check whether, among the valid sub-entries, one matches the TAG field: if so, access the data in the cache.
- If no sub-entry is valid or no TAG matches, the data must be fetched from memory and stored in the cache, possibly replacing an existing entry (Problem: which replacement algorithm? e.g. Least Recently Used)
- If the data in the cache is modified, the changes must also be written to memory; there are two policies for this: write-through (the data is immediately written to memory as well) and write-back (the write to memory is deferred)
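
Again as a hedged simulation sketch (my own structures and names, assuming 4 ways, LRU replacement, and a write-back policy), the same lookup for an n-way set-associative cache; the line fill and write-back routines are stubbed out for brevity.

```c
#include <stdint.h>
#include <stdbool.h>

#define N_WAYS  4
#define N_SETS  512
#define LINE_SZ 32

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint32_t lru;               /* larger value = used more recently          */
    uint8_t  data[LINE_SZ];
} way_t;

static way_t    sets[N_SETS][N_WAYS];
static uint32_t tick;           /* global counter used as an LRU clock        */

/* A real simulator would move whole lines to/from memory here. */
static void fill_line (way_t *w, uint32_t line_addr) { (void)w; (void)line_addr; }
static void write_back(way_t *w, uint32_t set)       { (void)w; (void)set;       }

uint8_t sa_cache_read(uint32_t addr) {
    uint32_t set = (addr / LINE_SZ) % N_SETS;   /* LINE field selects the set  */
    uint32_t tag = addr / (LINE_SZ * N_SETS);   /* TAG identifies the chunk    */

    /* Check whether any of the n sub-entries holds the requested line. */
    for (int i = 0; i < N_WAYS; i++) {
        way_t *w = &sets[set][i];
        if (w->valid && w->tag == tag) {        /* cache HIT                   */
            w->lru = ++tick;
            return w->data[addr % LINE_SZ];
        }
    }

    /* Cache MISS: pick a victim (an invalid way, else the least recently used). */
    way_t *victim = &sets[set][0];
    for (int i = 1; i < N_WAYS; i++) {
        way_t *w = &sets[set][i];
        if (!victim->valid)
            break;                              /* an empty way is good enough */
        if (!w->valid || w->lru < victim->lru)
            victim = w;
    }
    if (victim->valid && victim->dirty)         /* write-back: flush old data  */
        write_back(victim, set);
    fill_line(victim, addr & ~(uint32_t)(LINE_SZ - 1));
    victim->valid = true;
    victim->dirty = false;
    victim->tag   = tag;
    victim->lru   = ++tick;
    return victim->data[addr % LINE_SZ];
}
```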

33 Caches ( 33 ) Direct-mapped caches are the most common: collisions can be avoided easily (and even more easily if the data and instruction caches are separated). Set-associative caches require more circuitry: two-way and four-way caches perform well enough to make this extra circuitry worthwhile.

34 IMPORTANT ( 34 ) Don't forget to prepare your questions for next week's Q&A session!

35 ... ( 35 ) "There are no Stupid Questions!"

