Basic Performance Parameters in Computer Architecture:


1 Basic Performance Parameters in Computer Architecture:

8 Good Old Moore’s Law: (Technology vs Architects)
For every 18–24 months: 2x transistors on the same chip area.
Processor speed doubles roughly every 18–24 months.
Energy per operation halves roughly every 18–24 months.
Memory capacity doubles every 18–24 months.

9 Instructions/sec → 2x / 2 yrs
Memory capacity → 2x / 2 yrs
Memory latency → 1.1x / 2 yrs

10 Cache Magic

11 Parameters for Metrics and Evaluation:
What does "better" mean in computer architecture? Is it the clock speed (GHz) or the memory size (GB)?
Latency and throughput are the two key performance parameters.
Latency: time taken from start to end of a task
Throughput: number of computations completed per second (#/second)

12 Comparing Performance of CPU X and Y:
X is N times faster than Y → Speedup = N
N = Speed [X] / Speed [Y]
N = Latency [Y] / Latency [X]
N = Throughput [X] / Throughput [Y]
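A minimal C sketch of these speedup ratios (the latencies and throughputs below are made-up numbers, only to show the arithmetic):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical measurements for machines X and Y. */
        double latency_x = 2.0, latency_y = 8.0;         /* seconds per task */
        double throughput_x = 0.5, throughput_y = 0.125; /* tasks per second */

        /* Speedup N of X over Y, computed both ways. */
        double n_from_latency    = latency_y / latency_x;       /* = 4.0 */
        double n_from_throughput = throughput_x / throughput_y; /* = 4.0 */

        printf("Speedup from latency:    %.2f\n", n_from_latency);
        printf("Speedup from throughput: %.2f\n", n_from_throughput);
        return 0;
    }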

13 Introduction to Caches:

14 Locality Principle:
Things that will happen soon are likely to be close to things that just happened.
Which of these are not good examples of locality?
1. It rained 3 times today, so it is likely to rain again.
2. I ate dinner at 7pm last week, so I will probably eat dinner around 7pm this week.
3. It was New Year's Eve yesterday, so it will probably be New Year's Eve today.

15 Memory Locality:

16 Accessed address X recently:
→ Likely to access X again soon (temporal locality)
→ Likely to access addresses close to X too (spatial locality)

17 Temporal & Spatial Locality Implementation:
for (j = 0; j < 1000 ; j++) print arr[j]
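A runnable C version of this access pattern (a minimal sketch; the array contents are arbitrary):

    #include <stdio.h>

    int main(void) {
        int arr[1000];

        for (int j = 0; j < 1000; j++)
            arr[j] = j;                 /* initialize the array */

        /* Spatial locality: arr[j] and arr[j+1] sit in the same or adjacent
           cache blocks, so sequential accesses mostly hit in the cache.
           Temporal locality: the loop code and the variable j are reused on
           every iteration. */
        long sum = 0;
        for (int j = 0; j < 1000; j++)
            sum += arr[j];

        printf("sum = %ld\n", sum);
        return 0;
    }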

18 Locality to Enhance Data Access:
Library: a repository that stores data; large, but slow to access.
Library accesses have temporal and spatial locality.
A student can:
1. Go to the library, find the information, and go home (does not benefit from locality)
2. Borrow the book (exploits locality)
3. Take all the books and build a library at home (expensive and high latency)

19 Cache Lookups:
A cache is fast and small, so not everything will fit.
On an access:
Cache hit: found in the cache → fast
Cache miss: not in the cache → access RAM (slow memory) → copy this location into the cache

20 Cache Performance:
Hit time → should be low; needs a small and fast cache
Miss rate → should be low; needs a large and/or smart cache
Miss penalty → main memory access time, large (10s–100s of cycles)
Miss time = Hit time + Miss penalty (RAM access time on a cache miss)
Average Memory Access Time (AMAT):
AMAT = Hit time + Miss rate x Miss penalty
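A small worked example of the AMAT formula in C (the hit time, miss rate, and miss penalty below are made-up values for illustration):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        double hit_time     = 2.0;    /* cycles                 */
        double miss_rate    = 0.10;   /* i.e. a 90% hit rate    */
        double miss_penalty = 100.0;  /* cycles to access RAM   */

        printf("AMAT = %.1f cycles\n",
               amat(hit_time, miss_rate, miss_penalty));   /* 12.0 */
        return 0;
    }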

21 Cache Size in Real Processors:
Complication: there are several caches in the processor.
L1 cache → directly services all RD/WR requests from the processor
Size: tens of KB
→ Large enough to get a ~90% hit rate
→ Small enough to hit in 1–3 cycles

23 Cache Organization: How to determine HIT or MISS?
How to determine what to kick out?
Block size / line size: the number of bytes in each cache entry, typically 32 to 128 bytes.
Has to be large enough to exploit spatial locality (one miss brings in neighboring data).
Can't be as large as 1 KB, or precious cache memory will sit unused.

24 Blocks in Cache and Main Memory:
A line is a cache slot into which a memory block can fit.
[Figure: memory addresses 0, 4, 8, 12, ... grouped into 16-byte blocks, each block mapping onto a cache line]
Each memory location holds 4 bytes, and the block size is 16 bytes.

25 Block Offset and Block Number:
The address the processor wants is split into a block number and a block offset.
1. Block number: tells which block we are trying to find in the cache.
2. Block offset: tells where within that block we are, so the correct data can be picked out.
Block size = 16 bytes = 2^4, so the offset is the low 4 bits of the address and the block number is the remaining upper bits.
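A minimal C sketch of this split, assuming the 16-byte block size from the slide (the example address is made up):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE   16u           /* bytes, 2^4       */
    #define OFFSET_BITS  4u            /* log2(BLOCK_SIZE) */

    int main(void) {
        uint32_t address = 0x1234;     /* example address (made up) */

        uint32_t block_number = address >> OFFSET_BITS;      /* upper bits */
        uint32_t block_offset = address & (BLOCK_SIZE - 1u); /* low 4 bits */

        printf("address      = 0x%X\n", address);
        printf("block number = 0x%X\n", block_number);  /* 0x123 */
        printf("block offset = 0x%X\n", block_offset);  /* 0x4   */
        return 0;
    }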

26 Cache Block Number Quiz:
32-byte block size
16-bit address generated by the processor
What is the block number corresponding to the above address? What is the block offset?

27 Cache Tag:
Each cache line has a tag that holds the block number of the block currently stored in that line.
To look up an address, compare its block number with every tag; a match means a cache hit.
On a hit, the offset then tells which bytes within the line should be supplied to the processor.
On a miss, the data is brought into the cache and its block number is written into the corresponding tag.

28 Valid Bit:
At boot-up the cache contains garbage data that was never brought from RAM; if a memory block number happened to match a leftover tag, that garbage would be returned.
Any initial value in the cache tag would be problematic, not just zero. Therefore each line gets a valid bit, initialized to 0.
Hit → (Tag == Block #) and (Valid == 1)
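A compilable C sketch of this hit check for a small fully associative cache (the line count and field widths are illustrative, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_LINES    4
    #define OFFSET_BITS  4                /* 16-byte blocks */

    struct cache_line {
        bool     valid;                   /* 0 at boot-up: line holds garbage */
        uint32_t tag;                     /* block number stored in this line */
        uint8_t  data[1 << OFFSET_BITS];
    };

    static struct cache_line cache[NUM_LINES];   /* valid bits start at 0 */

    /* Returns true on a hit and copies the requested byte into *out. */
    bool cache_lookup(uint32_t address, uint8_t *out) {
        uint32_t block_number = address >> OFFSET_BITS;
        uint32_t offset       = address & ((1u << OFFSET_BITS) - 1u);

        for (int i = 0; i < NUM_LINES; i++) {
            /* Hit only when the tag matches AND the line is valid. */
            if (cache[i].valid && cache[i].tag == block_number) {
                *out = cache[i].data[offset];
                return true;
            }
        }
        return false;                     /* miss: caller fetches from RAM */
    }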

29 Types of Caches:
Fully Associative: any block can be in any cache line; N lines means N comparisons (the fully flexible extreme of set associative)
Set Associative: N lines where a block can go (the middle ground)
Direct Mapped: a block can go into exactly 1 line (the fully rigid extreme of set associative)

30 Direct Mapped Cache:
[Figure: memory blocks 0, 1, 2, ... map onto a 4-line cache in round-robin fashion]
Each block of memory maps to a single cache location; blocks are matched to lines sequentially (block number modulo the number of lines).
The processor-generated address is split into TAG | INDEX | OFFSET:
Offset: where in the block we are, once the block is found
Index: where in the cache the block can be found (2 bits for a 4-line cache)
Tag: the remaining upper bits, stored to identify which block occupies the line
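A minimal C sketch of a direct-mapped lookup, assuming 16-byte blocks and a 4-line cache to match the 2-bit index above:

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 4                 /* 16-byte blocks           */
    #define INDEX_BITS  2                 /* 4 lines -> 2 index bits  */
    #define NUM_LINES   (1 << INDEX_BITS)

    struct line { bool valid; uint32_t tag; uint8_t data[1 << OFFSET_BITS]; };
    static struct line cache[NUM_LINES];

    bool dm_lookup(uint32_t address, uint8_t *out) {
        uint32_t offset = address & ((1u << OFFSET_BITS) - 1u);
        uint32_t index  = (address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);

        /* Only one possible line to check -- fast and cheap, but blocks
           that share an index conflict with each other. */
        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data[offset];
            return true;
        }
        return false;
    }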

31 Advantages/Disadvantages of Direct Mapped Cache:
Looks in only one place (1:1 mapping):
→ Fast: only one location is checked, so the hit time is good
→ Cheap: simpler design; only one tag comparator and valid bit check
→ Energy efficient: less power dissipation due to the smaller design
But a block must go in exactly one place:
→ Frequent accesses to A, B, A, B that map to the same cache line keep kicking each other out (conflict over one spot)
→ Therefore, a high miss rate

33 Set Associative Caches:
N-way set associative → a block can be in one of the N lines of its set.
[Figure: 8 lines organized as 4 sets (SET 0 – SET 3) of 2 lines each]
Example: 2-way set associative, N = 2, 2 lines per set. Within a set there are 2 lines that can hold the block.
A few bits of the block number select which set the block will go into.

34 Offset, Index, Tag for Set Associative Caches:
The address is still split into TAG | INDEX | OFFSET.
Index: determines which set to access (2 bits for 4 sets)
Offset: where in the block we are
Quiz: would a direct mapped cache of the same size have a smaller tag?
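A minimal C sketch of a 2-way set associative lookup, keeping the 16-byte blocks and 4 sets from the figure (all sizes are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 4                 /* 16-byte blocks          */
    #define INDEX_BITS  2                 /* 4 sets -> 2 index bits  */
    #define NUM_SETS    (1 << INDEX_BITS)
    #define WAYS        2                 /* 2-way set associative   */

    struct line { bool valid; uint32_t tag; uint8_t data[1 << OFFSET_BITS]; };
    static struct line cache[NUM_SETS][WAYS];

    bool sa_lookup(uint32_t address, uint8_t *out) {
        uint32_t offset = address & ((1u << OFFSET_BITS) - 1u);
        uint32_t set    = (address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);

        /* Check every way in the selected set: N comparisons per access. */
        for (int way = 0; way < WAYS; way++) {
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *out = cache[set][way].data[offset];
                return true;
            }
        }
        return false;
    }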

37 Fully Associative Cache:
No index bits, because the block can be placed in any line of the cache.
The address is split into just TAG | OFFSET.

39 Cache Summary:
Direct mapped → 1-way set associative
Fully associative → N-way set associative, where N = # of lines (one set, so no index bits)
Address split: TAG | INDEX | OFFSET
Offset = log2(block size) bits
Index = log2(# of sets) bits
Tag = the remaining address bits
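A small C sketch that computes these field widths for an arbitrary cache geometry (the geometry in main is made up):

    #include <stdio.h>

    /* log2 for exact powers of two */
    static unsigned log2u(unsigned x) {
        unsigned bits = 0;
        while (x > 1u) { x >>= 1; bits++; }
        return bits;
    }

    int main(void) {
        unsigned address_bits = 32;
        unsigned block_size   = 64;   /* bytes                 */
        unsigned num_lines    = 512;
        unsigned ways         = 8;    /* 8-way set associative */
        unsigned sets         = num_lines / ways;

        unsigned offset_bits = log2u(block_size);
        unsigned index_bits  = log2u(sets);
        unsigned tag_bits    = address_bits - index_bits - offset_bits;

        printf("offset = %u bits, index = %u bits, tag = %u bits\n",
               offset_bits, index_bits, tag_bits);   /* 6, 6, 20 */
        return 0;
    }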

40 Cache Replacement:
A cache miss in a full set → a new block needs a line in that set. Which block do we kick out?
→ Random
→ FIFO: kick out the block that has been in the set the longest
→ LRU: kick out the block that has not been used for the longest time

41 Implementing LRU (exploits locality):
Each line in a set gets an LRU counter next to its tag and valid bit.
Maintaining the counts is complicated: for an N-way set associative cache we need N counters of log2(N) bits each. Here, with 4 ways, we have four 2-bit counters counting from 0 to 3.
[Figure: blocks A, B, C, D in one set; on a miss the block whose counter is 0 (least recently used) is replaced, e.g. A is evicted and E takes its place]
Cost: N counters of log2(N) bits each
Energy: N counters change on every access, even on cache hits
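A minimal C sketch of the counter update for one set, assuming the usual scheme in which the accessed way gets the highest count and every way that was above it is decremented (an assumption about the exact bookkeeping, since the slide only shows the counters):

    #include <stdint.h>

    #define WAYS 4                        /* 4-way set: 2-bit counters, 0..3 */

    static uint8_t lru_counter[WAYS];     /* one counter per way in the set  */

    /* Call on every access (hit or miss fill) to the given way. */
    void lru_touch(int accessed_way) {
        uint8_t old = lru_counter[accessed_way];

        /* Ways that were more recent than the accessed one move down by one. */
        for (int w = 0; w < WAYS; w++)
            if (lru_counter[w] > old)
                lru_counter[w]--;

        /* The accessed way becomes the most recently used. */
        lru_counter[accessed_way] = WAYS - 1;
    }

    /* Victim selection on a miss: the way whose counter is 0 is the LRU. */
    int lru_victim(void) {
        for (int w = 0; w < WAYS; w++)
            if (lru_counter[w] == 0)
                return w;
        return 0;
    }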

42 Write Policy of Caches:
Do we insert blocks that we write (a write miss)? This is the allocate policy:
→ Write allocate: bring the block into the cache (helps if the write has locality with later reads/writes)
→ No write allocate: do not bring the block into the cache when it is written.
Do we write just to the cache, or also to memory (on a cache hit)?
→ Write through: update memory immediately
→ Write back: write to the cache only, and write to RAM when the cache block is replaced (writes with high locality only update the cache)

43 Write Back Caches:
Writing every replaced block back to RAM would be inefficient. Efficient implementation? Add a dirty bit to each cache line.
Dirty bit = 1 → block is dirty (written; needs to be written back on replacement)
Dirty bit = 0 → block is clean (not written since it was brought from RAM)
Blocks we didn't write → no need to write to RAM when replaced
Blocks we did write → write back to memory (RAM) when replaced
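A condensed C sketch of a write-back, write-allocate store path for a direct-mapped cache (a simplified model assuming the same 16-byte blocks and 4 lines as earlier; the RAM helpers are placeholder stubs):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  2
    #define BLOCK_SIZE  (1 << OFFSET_BITS)

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct line cache[1 << INDEX_BITS];

    /* Placeholder backing-store helpers (stubs for illustration). */
    static void ram_read_block(uint32_t blk, uint8_t *buf)        { (void)blk; memset(buf, 0, BLOCK_SIZE); }
    static void ram_write_block(uint32_t blk, const uint8_t *buf) { (void)blk; (void)buf; }

    void cache_store_byte(uint32_t address, uint8_t value) {
        uint32_t offset = address & (BLOCK_SIZE - 1u);
        uint32_t index  = (address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        uint32_t tag    = address >> (OFFSET_BITS + INDEX_BITS);
        struct line *l  = &cache[index];

        if (!(l->valid && l->tag == tag)) {            /* write miss             */
            if (l->valid && l->dirty)                  /* write-back: flush old  */
                ram_write_block((l->tag << INDEX_BITS) | index, l->data);
            ram_read_block(address >> OFFSET_BITS, l->data);  /* write-allocate  */
            l->valid = true;
            l->dirty = false;
            l->tag   = tag;
        }

        l->data[offset] = value;                       /* write only the cache   */
        l->dirty = true;                               /* RAM updated on eviction */
    }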

45 Multi-Level Caches (Cache Hierarchy):

49 Reducing the AMAT:
Reduce hit time, reduce miss rate, or reduce miss penalty.
Multi-level cache hierarchy: a miss in the L1 cache goes to the next-level cache, not straight to RAM.
L1 miss penalty != memory latency (RAM access)
L1 miss penalty = L2 hit time + L2 miss rate x L2 miss penalty
Can have L3, L4, etc.

50 AMAT with Cache Hierarchies:
AMAT = L1 hit time + L1 miss rate x L1 miss penalty
L1 miss penalty = L2 hit time + L2 miss rate x L2 miss penalty
L2 miss penalty = L3 hit time + L3 miss rate x L3 miss penalty
...
LLC (last-level cache) miss penalty = main memory latency
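A small C sketch that evaluates these formulas from the innermost level outward; the numbers in main are the two-level example from the next slide (2-cycle L1 hits, 90% L1 hit rate, 10-cycle L2 hits, 75% local L2 hit rate, 100-cycle memory):

    #include <stdio.h>

    struct level { double hit_time; double local_miss_rate; };

    /* AMAT of a hierarchy: the miss penalty of level i is the AMAT of level i+1,
       and the miss penalty of the last level is the main memory latency. */
    static double hierarchy_amat(const struct level *lvl, int n, double mem_latency) {
        double penalty = mem_latency;
        for (int i = n - 1; i >= 0; i--)
            penalty = lvl[i].hit_time + lvl[i].local_miss_rate * penalty;
        return penalty;
    }

    int main(void) {
        struct level caches[] = {
            { 2.0,  0.10 },   /* L1: 2-cycle hit, 90% local hit rate  */
            { 10.0, 0.25 },   /* L2: 10-cycle hit, 75% local hit rate */
        };
        printf("AMAT = %.2f cycles\n", hierarchy_amat(caches, 2, 100.0));  /* 5.50 */
        return 0;
    }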

51 Multi-Level Cache Performance (AMAT = Hit time + Miss rate x Miss penalty; memory latency = 100 cycles):
16 KB cache only: hit time 2 cycles, hit rate 90% → AMAT = 2 + 0.10 x 100 = 12
128 KB cache only: hit time 10 cycles, hit rate 97.5% → AMAT = 10 + 0.025 x 100 = 12.5
No cache: every access goes to memory → AMAT = 100
L1 = 16 KB with L2 = 128 KB: 2 cycles for an L1 hit, 12 cycles for an L2 hit; 90% hit rate in L1, 75% of L1 misses hit in L2
Net AMAT = 2 + 0.10 x (10 + 0.25 x 100) = 5.5

52 Hit Rate in L1/L2 Cache:
Same example as the previous slide. For the 128 KB L2:
97.5% is the global hit rate: out of all memory accesses, only 2.5% miss in both L1 and L2.
75% is the local hit rate: of the accesses that actually reach L2 (the 10% that miss in L1), 75% hit.

53 Global vs Local Hit Rate:
Global miss rate = number of misses in this cache / number of all memory accesses
Global hit rate = 1 - global miss rate
Local hit rate = number of hits in this cache / number of accesses that reach this cache
Another popular metric: misses per 1000 instructions (MPKI)
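A tiny C sketch of these definitions, using made-up event counts for an L2 cache:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical counts collected over a program run. */
        double all_memory_accesses = 1000000.0; /* loads + stores issued by the CPU       */
        double l2_accesses         = 100000.0;  /* accesses that missed L1 and reached L2 */
        double l2_misses           = 25000.0;   /* of those, how many also missed L2      */
        double instructions        = 5000000.0;

        double local_hit_rate   = (l2_accesses - l2_misses) / l2_accesses;  /* 0.75  */
        double global_miss_rate = l2_misses / all_memory_accesses;          /* 0.025 */
        double global_hit_rate  = 1.0 - global_miss_rate;                   /* 0.975 */
        double mpki             = l2_misses / (instructions / 1000.0);      /* 5.0   */

        printf("local hit rate  = %.3f\n", local_hit_rate);
        printf("global hit rate = %.3f\n", global_hit_rate);
        printf("MPKI            = %.1f\n", mpki);
        return 0;
    }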

54 Inclusion Property in Caches:
If a block is in the L1 cache, is it also in the L2 cache?
→ It may or may not be in L2 (the default, unless inclusion or exclusion is enforced)
→ It has to also be in L2 → inclusion
→ It cannot be in L2 → exclusion

55 Inclusion Property Accesses:
Example: the processor accesses blocks A, B, C, D, E from RAM; each access fills both L1 and L2, but L1 is smaller, so the two caches replace blocks at different times.
Because L2's LRU state is only updated on accesses that reach L2, L2 can evict a block that is still present in L1. The cache hierarchy by itself does not guarantee inclusion: L1 accesses != L2 accesses.
Solution: add an inclusion bit to each L2 line, set to 1 while the block also exists in L1; LRU replacement in L2 then skips lines whose inclusion bit is set.

56 Intel Montecito Chip:

57 Shared vs Private Cache in Multi-Core:
Which is better: 256 KB private L2 cache banks per core, or 1 MB of monolithic shared L2 cache?

58 Why Private L1 Cache only?
A shared L1 cache would see accesses from every core on a multi-core chip:
→ Creates too much traffic congestion → would need a multi-ported cache to handle it
→ L1 must be a single-cycle access: if a distant core (e.g. P4) has to reach a shared L1 location, the extra distance adds latency/clock cycles

59 Shared L2 Cache Disadvantages:
With private L2s, a miss does not necessarily go to RAM: the other cores' L2s are checked for the latest value. With a shared L2 there is only 1 copy of a location, so a miss on L2 means a RAM access.
→ Congestion on the shared L2 bus, since all cores compete for it
→ Higher latency: more wire delay to the shared cache controller [~20 cycles], plus waiting behind priority requests from other cores [20+5 cycles], vs [~10 cycles] for a private L2 cache

60 Shared L2 Cache Merits:
→ Higher hit rates: private L2s hold duplicate copies (the same block A in multiple caches), so the effective capacity is less than 1 MB and the hit rate is lower
→ Private L2 is a static allocation of 256 KB per core: if P1 needs 385 KB (> 256 KB) it suffers a high miss rate, while P2 needing only 150 KB (< 256 KB) leaves resources under-utilized
→ A shared L2 allocates capacity dynamically, as per each core's demands
→ Potentially faster cache coherence (it is easier to locate data on a miss)

61 Memory in Modern Processor (L1 Cache?):

