SE-292 High Performance Computing

Presentation on theme: "SE-292 High Performance Computing"— Presentation transcript:

1 SE-292 High Performance Computing
Memory Hierarchy R. Govindarajan L1

2 Memory Hierarchy Lec14

3 Memory Organization
Memory hierarchy:
CPU registers: few in number (typically 16/32/128), sub-cycle access time (nsec)
Cache memory: on-chip memory, 10's of KBytes (to a few MBytes), access time of a few cycles
Main memory: 100's of MBytes of storage, access time of several 10's of cycles
Secondary storage (e.g., disk): 100's of GBytes of storage, access time in msec
Lec14

4 Cache Memory; Memory Hierarchy
Recall: in discussing pipelining, we assumed that memory latency would be hidden so that memory appears to operate at processor speed. Cache memory is the hardware that makes this happen.
Design principle: Locality of Reference
Temporal locality: recently referenced objects are likely to be referenced again in the near future (equivalently, the least recently used objects are the least likely to be referenced)
Spatial locality: neighbours of recently referenced locations are likely to be referenced in the near future
Lec14

5 Cache Memory Exploits This
Cache: a hardware structure that provides the memory contents the processor references, directly and fast (most of the time).
[Diagram: the CPU sends an address to the Cache; the Cache supplies the data, backed by Main Memory.]
Lec14

6 Cache Design
Cache RAM: fast memory holding the cached data (typical size: 32KB)
Cache Directory: a table of `addresses I have'
Lookup (`Do I have it?') logic: checks an incoming address A against the directory
Lec14

7 How to do Fast Lookup?
Search algorithms; hashing: a hash table, indexed using a hash function applied to address A.
Which bits of the address should the hash function use?
Not the msbs: for a small program, everything would index into the same place (collisions).
Not the lsbs: A and its neighbours possibly differ only in these bits, and should be treated as one (they belong to the same block).
Lec14

8 Summing up
Cache is organized in terms of blocks: memory locations that share the same address bits other than the lsbs. Main memory is organized the same way.
The address is used as: tag | index into directory | block offset
Lec14

9 How It Works: Direct Mapping
[Diagram: the CPU's address is split into tag, index, and offset. The index selects one block in cache memory; the tag stored there is compared with the address tag. Same: hit (Case 1), data comes from the cache. Not same: miss (Case 2), the block is fetched from main memory.]
Lec14

10 Cache Terminology
Cache hit: a memory reference where the required data is found in the cache
Cache miss: a memory reference where the required data is not found in the cache
Hit ratio: # of hits / # of memory references
Miss ratio = (1 - hit ratio)
Hit time: time to access data in the cache
Miss penalty: time to bring a block into the cache
Lec14

11 Cache Organizations
Where can a block be placed in the cache? Direct mapped, set associative
How to identify a block in the cache? Tag, valid bit, tag-checking hardware
Replacement policy? LRU, FIFO, random, ...
What happens on writes?
Hit: when is main memory updated? Write-back, write-through
Miss: what happens on a write miss? Write-allocate, write-no-allocate
Lec14

12 Block Placement: Direct Mapping
A memory block goes to a unique cache block: (memory block no.) mod (# cache blocks)
Example: with 8 cache blocks, memory block 14 maps to cache block 14 mod 8 = 6.
L15

13 Identifying Memory Block (DM Cache)
Assume a 32-bit address space, 16 KB cache, 32-byte cache block size.
Offset field -- identifies bytes within a cache block: offset bits = log2(32) = 5 bits
No. of cache blocks = 16KB / 32 = 512, so index bits = log2(512) = 9 bits
Tag -- identifies which memory block is in this cache block -- the remaining bits (= 18 bits)
Address layout: Tag 18 bits | Index 9 bits | Offset 5 bits
L15
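As a concrete illustration of this 18/9/5 split (a small C sketch of my own, not from the slides, for this 16 KB direct-mapped cache with 32-byte blocks):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 5      /* 32-byte blocks          */
    #define INDEX_BITS  9      /* 512 blocks in the cache */

    int main(void) {
        uint32_t addr   = 0xA008;   /* e.g., the address of A[1] in the later example */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                /* low 5 bits  */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);/* next 9 bits */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);              /* top 18 bits */
        printf("tag=%u index=%u offset=%u\n", tag, index, offset);  /* prints tag=2 index=256 offset=8 */
        return 0;
    }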

14 Accessing Block (DM Cache)
[Diagram: the address is split as Tag 18 bits | Index 9 bits | Offset 5 bits. The index selects one directory entry, which holds a stored tag plus valid (V) and dirty (D) bits. The stored tag is compared (=) with the address tag; Cache Hit = (tags equal) AND (valid). On a hit, the offset selects the required data within the block.]
L15

15 Block Placement: Set Associative
A memory block goes to a unique set -- (memory block no.) mod (# sets) -- and within the set it may occupy any cache block.
Example: with 4 sets, memory blocks 3, 7, 11, and 15 all map to set 3.
L15

16 Identifying Memory Block (Set-Associative Cache)
Assume a 32-bit address space, 16 KB cache, 32-byte cache block size, 4-way set-associative.
Offset field -- identifies bytes within a cache block: offset bits = log2(32) = 5 bits
No. of sets = (cache blocks) / 4 = 512 / 4 = 128, so index bits = log2(128) = 7 bits
Tag -- identifies which memory block is in this cache block -- the remaining bits (= 20 bits)
Address layout: Tag 20 bits | Index 7 bits | Offset 5 bits
L15

17 Accessing Block (2-way Set-Associative)
[Diagram: for the same 16 KB cache organized 2-way set-associative (256 sets), the address is split as Tag 19 bits | Index 8 bits | Offset 5 bits. The index selects one set; both directory entries of the set (each with tag, V, D bits) are compared (=) with the address tag in parallel, and the results are ORed to produce Cache Hit. The matching way supplies the data.]
L15

18 Block Replacement
Direct mapped: no choice is required.
Set-associative: replacement strategies
First-In-First-Out (FIFO): simple to implement
Least Recently Used (LRU): more complex, but based on (temporal) locality, hence higher hit rates
Random
L15
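As an illustration of what the LRU bookkeeping has to do (a minimal software sketch, not how the hardware actually implements it), assume each of the 4 ways in a set carries an age field, initialized to the permutation 0..3, with 0 meaning most recently used:

    #define WAYS 4
    typedef struct { int age[WAYS]; } LRUState;   /* per-set LRU bookkeeping */

    /* Mark 'way' as most recently used. */
    void touch(LRUState *s, int way) {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[way])
                s->age[w]++;              /* everything more recent than 'way' ages by one */
        s->age[way] = 0;                  /* 'way' is now the most recently used */
    }

    /* On a miss, replace the way with the largest age (the least recently used). */
    int victim(const LRUState *s) {
        int v = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[v]) v = w;
        return v;
    }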

19 Block Replacement…
Hardware must keep track of LRU information, e.g., a few LRU bits (the L field, 4 bits in the diagram) stored with each directory entry alongside the tag, V, and D bits.
Separate valid bits for each word (or sub-block) of a cache block can speed up access to the required word on a cache miss.
L15

20 Write Policies
When is main memory updated on a write hit?
Write through: writes are performed both in the cache and in main memory
+ Cache and memory copies are kept consistent
-- Multiple writes to the same location/block cause higher memory traffic
-- Writes must wait for the longer (memory write) time
Solution: use a write buffer to hold these write requests and allow the processor to proceed immediately
L15

21 Write Policies…
Write back: writes are performed only in the cache; modified blocks are written back to memory on replacement
o Needs a dirty bit with each cache block
+ Writes are faster than with write through
+ Reduced traffic to memory
-- Cache and main memory copies are not always the same
-- Higher miss penalty due to the write-back time
L15

22 Write Policies…
What happens on a write miss?
Write-allocate: allocate a block in the cache and load the block from memory into the cache
Write-no-allocate: write directly to main memory
Write allocate/no-allocate is orthogonal to the write-through/write-back policy; common pairings:
Write-allocate with write-back
Write-no-allocate with write-through: a good choice when data is mostly read and rarely written
L15

23 What Drives Computer Architecture?
[Chart (1980-2000, log-scale performance): CPU performance grows ~60%/yr (2X every 1.5 years), labelled "Moore's Law"; DRAM performance grows ~9%/yr (2X every 10 years). The processor-memory performance gap grows about 50% per year.]
L15

24 Memory Hierarchy
CPU → Level 1 Cache → Level 2 Cache (→ L3, L4, … caches) → Main (Primary) Memory → Secondary Memory
L15

25 Cache and Programming
Objective: learn how to assess cache-related performance issues for important parts of our programs.
We will look at several example programs, considering only the data cache (assuming separate instruction and data caches).
Data cache configuration: direct-mapped 16 KB write-back cache with 32B block size (Tag: 18b, Index: 9b, Offset: 5b).
L15

26 Example 1: Vector Sum Reduction
double A[2048]; sum = 0.0;
for (i = 0; i < 2048; i++)
    sum = sum + A[i];
To do the analysis, we must view the program close to machine code form (to see the loads and stores):
Loop: FLOAD F0, 0(R1)
      FADD  F2, F0, F2
      ADDI  R1, R1, 8
      BLE   R1, R3, Loop
L15

27 Example 1: Vector Sum Reduction
To do the analysis, observe that the loop index i, sum, and &A[i] are kept in registers and are not loaded/stored inside the loop. Only A[i] is loaded from memory. Hence, we will consider only the accesses to array elements.
L15

28 Example 1: Reference Sequence
load A[0], load A[1], load A[2], …, load A[2047]
Assume the base address of A (i.e., the address of A[0]) is 0xA000; its cache index bits give an index value of 256.
Size of an array element (double) = 8B, so 4 consecutive array elements fit into each cache block (block size is 32B).
A[0] – A[3] have index 256, A[4] – A[7] have index 257, and so on.
L15

29 Example 1: Cache Misses and Hits
A[0] (0xA000, index 256): miss (cold start); A[1] (0xA008), A[2] (0xA010), A[3] (0xA018): hits.
A[4] (0xA020, index 257): miss (cold start); A[5] – A[7]: hits.
… the pattern repeats up to A[2044] (0xDFE0, index 255): miss; A[2045] – A[2047]: hits.
The hit ratio of our loop is 75% -- there are 1536 hits out of 2048 memory accesses. This is entirely due to spatial locality of reference.
Cold start miss: we assume that the cache is initially empty. Also called a compulsory miss.
If the loop were preceded by a loop that accessed all array elements, the hit ratio of our loop would be 100%: 25% due to temporal locality and 75% due to spatial locality.
L15
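To make the counting above concrete, here is a small self-contained C sketch (my own illustration, not from the lecture) that simulates the 16 KB direct-mapped cache with 32B blocks for exactly this access sequence; it reports 512 misses and 1536 hits, i.e., a 75% hit ratio:

    #include <stdio.h>
    #include <stdint.h>

    #define NBLOCKS 512              /* 16 KB / 32 B */
    #define N 2048

    int main(void) {
        uint32_t tag[NBLOCKS];
        int      valid[NBLOCKS] = {0};
        int hits = 0, misses = 0;

        for (int i = 0; i < N; i++) {
            uint32_t addr  = 0xA000 + 8u * i;          /* address of A[i]      */
            uint32_t index = (addr >> 5) & (NBLOCKS - 1);
            uint32_t t     = addr >> 14;
            if (valid[index] && tag[index] == t) hits++;
            else { misses++; valid[index] = 1; tag[index] = t; }
        }
        printf("hits=%d misses=%d hit ratio=%.1f%%\n",
               hits, misses, 100.0 * hits / N);        /* 1536, 512, 75.0% */
        return 0;
    }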

30 Example 1 with double A[4096]
Why should it make a difference? Consider the case where our loop is preceded by another loop that accesses all array elements in order.
The entire array no longer fits in the cache -- cache size: 16KB, array size: 32KB. After execution of the previous loop, only the second half of the array will be in the cache.
Analysis: our loop therefore sees the same misses as before (a miss on the first element of each block).
These are called capacity misses, as they would not be misses if the cache had been big enough.
L15

31 Example 2: Vector Dot Product
double A[2048], B[2048], sum = 0.0;
for (i = 0; i < 2048; i++)
    sum = sum + A[i] * B[i];
Reference sequence: load A[0], load B[0], load A[1], load B[1], …
Again, the size of an array element is 8B, so 4 consecutive array elements fit into each cache block.
Assume the base addresses of A and B are 0xA000 and 0xE000.
L15

32 Example 2: Cache Hits and Misses
Conflict miss: a miss due to conflicts in cache block requirements between memory accesses of the same program.
A[0] (0xA000, index 256): miss (cold start); B[0] (0xE000, index 256): miss (conflict, evicts A's block); A[1] (0xA008): miss (conflict); B[1] (0xE008): miss (conflict); … and so on for every access, …, B[1023] (0xFFF8, index 511), …
Hit ratio for our program: 0%.
Source of the problem: the elements of arrays A and B that are accessed in order have the same cache index. The hit ratio would be better if the base address of B were such that these cache indices differ.
L15

33 Example 2 with Padding
Assume that the compiler assigns addresses as variables are encountered in declarations. To shift the base address of B enough to make the cache index of B[0] different from that of A[0]:
double A[2052], B[2048];
The base address of B is now 0xE020, so the cache index of B[0] is 257; B[0] and A[0] no longer conflict for the same cache block (the base address of A is still 0xA000, cache index 256).
The hit ratio of our loop would then be 75%.
L15
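Checking the arithmetic behind the padding (a worked example using the numbers above): A now occupies 2052 x 8 = 16416 = 0x4020 bytes, so B starts at 0xA000 + 0x4020 = 0xE020. The cache index of B[0] is (0xE020 >> 5) & 0x1FF = 257, while that of A[0] is (0xA000 >> 5) & 0x1FF = 256; A[i] and B[i] now always map to adjacent cache blocks instead of the same one, so the only misses left are the cold/spatial ones (75% hit ratio).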

34 Example 2 with Array Merging
What if we re-declare the arrays as
struct { double A, B; } array[2048];
for (i = 0; i < 2048; i++)
    sum += array[i].A * array[i].B;
Hit ratio: 75%
L15

35 Example 3: DAXPY
Double precision Y = aX + Y, where X and Y are vectors and a is a scalar.
double X[2048], Y[2048], a;
for (i = 0; i < 2048; i++)
    Y[i] = a * X[i] + Y[i];
Reference sequence: load X[0], load Y[0], store Y[0], load X[1], load Y[1], store Y[1], …
Hits and misses: assuming that the base addresses of X and Y don't conflict in the cache, the hit ratio is 83.3% (per block of 4 elements: 2 misses out of 12 accesses).
L17

36 Example 4: 2-d Matrix Sum
double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
    for (i = 0; i < 1024; i++)
        B[i][j] = A[i][j] + B[i][j];
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[1,0], load B[1,0], store B[1,0], …
Question: In what order are the elements of a multidimensional array stored in memory?
L17

37 Storage of Multi-dimensional Arrays
Row major order: for a 2-dimensional array, the elements of the first row are followed by those of the 2nd row, the 3rd row, and so on. This is what is used in C.
Column major order: a 2-dimensional array is stored column by column in memory. Used in FORTRAN.
L17
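A worked example (not on the slide) of what row major means for addresses: for double A[1024][1024] in C, &A[i][j] = &A[0][0] + 8 x (1024 x i + j), so A[i][j] and A[i][j+1] are 8 bytes apart (usually in the same cache block), while A[i][j] and A[i+1][j] are 8 KB apart (always in different blocks). This is why the traversal order in the loop matters.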

38 Example 4: 2-d Matrix Sum
double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
    for (i = 0; i < 1024; i++)
        B[i][j] = A[i][j] + B[i][j];
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[1,0], load B[1,0], store B[1,0], …
L17

39 Example 4: Hits and Misses
The reference order and the storage order for the arrays are not the same, so our loop shows no spatial locality.
Assume that padding has been done to eliminate conflict misses due to the base addresses.
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[1,0], load B[1,0], store B[1,0], …
Per array element: miss (cold) for A, miss (cold) for B, hit for the store to B. Hit ratio: 33.3%
Question: Will A[0,1] still be in the cache when it is required?
L17

40 Example 4 with Loop Interchange
double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        B[i][j] = A[i][j] + B[i][j];
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[0,1], load B[0,1], store B[0,1], …
Hit ratio: 83.3%
L17

41 Is Loop Interchange Always Safe?
for (j = 1; j < 2048; j++)
    for (i = 1; i < 2048; i++)
        A[i][j] = A[i+1][j-1] + A[i][j-1];
Original order (j outer): A[1,1] = A[2,0]+A[1,0]; A[2,1] = A[3,0]+A[2,0]; … ; A[1,2] = A[2,1]+A[1,1]; …
Interchanged order (i outer): A[1,1] = A[2,0]+A[1,0]; A[1,2] = A[2,1]+A[1,1]; … ; A[2,1] = A[3,0]+A[2,0]; …
In the original order A[2,1] is written before A[1,2] reads it; after interchange A[1,2] reads the old value of A[2,1]. The interchange changes the result, so it is not safe here.
L17

42 Example 5: Matrix Multiplication
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            X[i][j] += Y[i][k] * Z[k][j];   /* dot-product inner loop */
What is the reference sequence of accesses to X[i][j], Y[i][k], Z[k][j]?
Lec18

43 Example 5: Matrix Multiplication
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            X[i][j] += Y[i][k] * Z[k][j];
Reference sequence (dot-product inner loop):
(i=0, j=0): Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], …, X[0,0]
(i=0, j=1): Y[0,0], Z[0,1], Y[0,1], Z[1,1], Y[0,2], Z[2,1], …, X[0,1]
…
(i=1, j=0): Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], …, X[1,0]
…
Lec18

44 With Loop Interchanging
The 3 loops can be interchanged in any way. Example: interchange the i and k loops:
double X[N][N], Y[N][N], Z[N][N];
for (k = 0; k < N; k++)
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            X[i][j] += Y[i][k] * Z[k][j];
For the inner loop, Z[k][j] can be loaded into a register once for each (k, j), reducing the number of memory references.
Lec18

45 Let’s try some Loop Unrolling Instead
Unroll the k loop (the original inner loop was: for (k=0; k<N; k++) X[i][j] += Y[i][k] * Z[k][j];):
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k += 2)
            X[i][j] += Y[i][k]*Z[k][j] + Y[i][k+1]*Z[k+1][j];
Does this exploit spatial locality for array Z?
Lec18

46 Let’s try some Loop Unrolling Instead
Unroll the j loop instead (original inner loop: for (k=0; k<N; k++) X[i][j] += Y[i][k] * Z[k][j];):
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; k++) {
            X[i][j]   += Y[i][k]*Z[k][j];
            X[i][j+1] += Y[i][k]*Z[k][j+1];
        }
Exploits spatial locality for array Z
Lec18

47 Let’s try some Loop Unrolling Instead
Unroll both the j loop and the k loop:
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; k += 2) {
            X[i][j]   += Y[i][k]*Z[k][j]   + Y[i][k+1]*Z[k+1][j];
            X[i][j+1] += Y[i][k]*Z[k][j+1] + Y[i][k+1]*Z[k+1][j+1];
        }
Exploits spatial locality for array Z and temporal locality for array Y. Taken further, this idea becomes Blocking or Tiling.
Lec18

48 Blocking/Tiling
Idea: since the problem is with accesses to array Z, make full use of the elements of Z while they are in the cache.
[Diagram: X = Y x Z computed on 2x2 blocks:
X[0,0] = Y[0,0]xZ[0,0] + Y[0,1]xZ[1,0]
X[1,0] = Y[1,0]xZ[0,0] + Y[1,1]xZ[1,0] ]
Lec18

49 Blocked Matrix Multiplication
for (J = 0; J < N; J += B)
    for (K = 0; K < N; K += B)
        for (i = 0; i < N; i++)
            for (j = J; j < min(J+B, N); j++) {
                for (k = K, r = 0; k < min(K+B, N); k++)
                    r += Y[i][k] * Z[k][j];
                X[i][j] += r;
            }
Lec18
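A rough way to choose the block size B (a back-of-the-envelope sketch assuming the 16 KB data cache used in the earlier examples): the loop nest reuses a BxB tile of Z across all values of i, so we want roughly 8 x B x B bytes of Z, plus a B-element strip each of Y and X, to fit in the cache. B = 32 gives an 8 KB tile of Z plus two 256-byte strips, comfortably within 16 KB; real choices would also account for associativity and other data.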

50 Some Homework
1. Read N9, N10
2. Implementing Matrix Multiplication
Objective: the best programs for multiplying 1024x1024 double matrices on any 2 different machines that you normally use.
Techniques: loop interchange, blocking, etc.
Criterion: execution time
Report: program and execution times
Lec18

51 Plan for Remaining 13 Lectures
Reality check
Question 1: Are real caches built to work on virtual addresses or physical addresses?
Question 2: Do modern processors use pipelining of the kind that we studied?
Timing, profiling, file systems
Parallel architecture and programming
Evaluating a computer system
Lec18

52 Reality Check
Question 1: Are real caches built to work on virtual addresses or physical addresses?
Question 2: What about multiple levels in caches?
Question 3: Do modern processors use pipelining of the kind that we studied?
Lec18

53 Virtual Memory System
To support memory management when multiple processes are running concurrently: page based, segment based, ...
Ensures protection across processes.
Address space: the range of memory addresses a process can address (includes program (text), data, heap, and stack); a 32-bit address gives a 4 GB address space.
With VM, the address generated by the processor is a virtual address.

54 Page-Based Virtual Memory
A process' address space is divided into a number of pages (of fixed size). A page is the basic unit of transfer between secondary storage and main memory. Different processes share the physical memory, so virtual addresses must be translated to physical addresses.

55 Virtual Pages to Physical Frame Mapping
[Diagram: the virtual pages of Process 1 … Process k are mapped to physical frames scattered through Main Memory.]

56 Page Mapping Info (Page Table)
A page can be mapped to any frame in main memory. Where is the mapping stored and accessed? In the Page Table. Each process has its own page table! Address translation: virtual-to-physical address translation.

57 Address Translation
The virtual address (issued by the processor) is translated to a physical address (used to access memory).
[Diagram: the virtual address is split into Virtual Page No. (18 bits) and Offset (14 bits). The virtual page number indexes the Page Table; each entry holds the physical page/frame number plus protection (Pr), dirty (D), and valid (V) bits. The physical address is the Physical Frame No. concatenated with the unchanged 14-bit offset.]
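A minimal C sketch of this translation (an illustration only; the single-level page_table array here is a hypothetical stand-in for the real per-process structure, using the 18-bit VPN / 14-bit offset split from the slide):

    #include <stdint.h>

    #define OFFSET_BITS 14                     /* 16 KB pages                */
    #define PT_ENTRIES  (1u << 18)             /* one entry per virtual page */

    typedef struct { uint32_t frame; int valid; } PTE;
    static PTE page_table[PT_ENTRIES];         /* hypothetical single-level page table */

    /* Translate a 32-bit virtual address; sets *fault on a page fault. */
    uint32_t translate(uint32_t vaddr, int *fault) {
        uint32_t vpn    = vaddr >> OFFSET_BITS;
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
        if (!page_table[vpn].valid) { *fault = 1; return 0; }    /* page fault: OS takes over */
        *fault = 0;
        return (page_table[vpn].frame << OFFSET_BITS) | offset;  /* frame no. + unchanged offset */
    }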

58 Memory Hierarchy: Secondary to Main Memory
Analogous to main memory vs. cache.
When the required virtual page is in main memory: page hit.
When the required virtual page is not in main memory: page fault.
The page fault penalty is very high (~10's of msecs) as it involves a disk (secondary storage) access, so the page fault ratio should be very low (10^-4 to 10^-5).
Page faults are handled by the OS.

59 Page Placement
A virtual page can be placed anywhere in physical memory (fully associative). The Page Table keeps track of the page mapping; there is a separate page table for each process.
The page table size is quite large! Assume a 32-bit address space and 16KB page size:
# of entries in the page table = 2^32 / 2^14 = 2^18 = 256K
Page table size = 256K * 4B = 1 MB, i.e., the page table itself occupies 64 pages!
The page table itself may therefore be paged (multi-level page tables)!

60 Page Identification
Use the virtual page number to index into the page table. Accessing the page table causes one extra memory access!
[Diagram: Virtual Page No. (18 bits) + Offset (14 bits) → page table entry (Phy. Page #, Pr, V, D) → Physical Frame No. (18 bits) + Offset (14 bits).]

61 Page Replacement Write Policies
Page replacement can use more sophisticated policies than in Cache. Least Recently Used Second-chance algorithm Recency vs. frequency Write Policies Write-back Write-allocate

62 Translation Look-Aside Buffer
Accessing the page table causes one extra memory access! To reduce translation time, use a translation look-aside buffer (TLB), which caches recent address translations.
TLB organization is similar to cache organization (direct mapped, set-, or fully-associative). The TLB is small (the next slide assumes 128 entries).
TLBs are important for fast translation.

63 Translation using TLB
Assume a 128-entry, 4-way set-associative TLB (32 sets, hence 5 index bits).
[Diagram: the Virtual Page No. (18 bits) is split into a TLB tag (13 bits) and a TLB index (5 bits); the 14-bit offset passes through unchanged. Each TLB entry holds a tag plus Phy. Page #, Pr, V, D bits; the four ways of the indexed set are compared (=) with the tag. On a TLB Hit, the Physical Frame No. is obtained without accessing the page table.]
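The lookup implied by this breakdown, as a small illustrative C sketch (128 entries / 4 ways = 32 sets, so the low 5 bits of the VPN select the set and the remaining 13 bits form the tag; the entry layout is an assumption, not the real hardware):

    #include <stdint.h>

    #define TLB_WAYS 4
    #define TLB_SETS 32                                /* 128 entries / 4 ways */

    typedef struct { uint32_t tag, frame; int valid; } TLBEntry;
    static TLBEntry tlb[TLB_SETS][TLB_WAYS];

    /* Returns 1 and fills *frame on a TLB hit; on a miss the page table must be walked. */
    int tlb_lookup(uint32_t vpn, uint32_t *frame) {
        uint32_t set = vpn & (TLB_SETS - 1);           /* low 5 bits of the VPN */
        uint32_t tag = vpn >> 5;                       /* remaining 13 bits     */
        for (int w = 0; w < TLB_WAYS; w++)
            if (tlb[set][w].valid && tlb[set][w].tag == tag) {
                *frame = tlb[set][w].frame;
                return 1;
            }
        return 0;
    }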

64 Q1: Caches and Address Translation
Physically addressed cache: Virtual Address → MMU → Physical Address → Cache (→ to main memory on a cache miss).
Virtually addressed cache: Virtual Address → Cache; the MMU translates to a Physical Address only on a cache miss (to access main memory).
Lec18

65 Which is less preferable?
Physically addressed cache: hit time is higher (cache access happens after translation).
Virtually addressed cache:
Data/instructions of different processes with the same virtual address can be in the cache at the same time … so either flush the cache on a context switch, or include a process id as part of each cache directory entry.
Synonyms: virtual addresses that translate to the same physical address; more than one copy in the cache can lead to a data consistency problem.
Lec18

66 Another possibility: Overlapped operation
Index into the cache directory using the virtual address, while the MMU translates the virtual address to the physical address in parallel; the tag comparison then uses the physical address.
This is a virtually indexed, physically tagged cache.
Lec18

67 Addresses and Caches
`Physically addressed cache': physically indexed, physically tagged
`Virtually addressed cache': virtually indexed, virtually tagged
Overlapped cache indexing and translation: virtually indexed, physically tagged
(Physically indexed, virtually tagged?)
Lec18

68 Physically Indexed, Physically Tagged Cache
Assume a 16KB page size and a 64KB direct-mapped cache with 32B blocks.
[Diagram: the Virtual Address (Virtual Page No. 18 bits | Page Offset 14 bits) goes through the MMU to form the Physical Address (Physical Page No. 18 bits | Page Offset 14 bits). The physical address is then used as Cache Tag 16 bits | C-Index 11 bits | C-offset 5 bits to access the cache and compare tags (=).]
Lec18

69 Virtually Indexed, Virtually Tagged Cache
[Diagram: the Virtual Address is used directly for the cache access (VPN 18 bits as the tag, C-Index 11 bits, C-offset 5 bits); the tag comparison (=) on virtual address bits produces Hit/Miss. The MMU (VPN 18 bits → PPN 18 bits, Page Offset 14 bits) is needed only on a miss, to form the Physical Address.]
Lec18

70 Virtually Indexed, Physically Tagged Cache
[Diagram: the cache is indexed with virtual address bits (C-Index 11 bits | C-offset 5 bits) while the MMU translates the VPN (18 bits) in parallel; the stored tag is then compared (=) against the Cache Tag (16 bits) taken from the Physical Address (with its 14-bit page offset).]
Lec18

71 Multi-Level Caches
A small L1 cache gives a low hit time, and hence a faster CPU cycle time. A large L2 cache reduces the L1 cache miss penalty. The L2 cache is typically set-associative, to reduce the L2-cache miss ratio.
Typically, L1 is direct mapped with separate I and D caches, while L2 is unified and set-associative. L1 and L2 are on-chip; L3 is also moving on-chip.

72 Multi-Level Caches
[Diagram: CPU + MMU → split L1 I-Cache and L1 D-Cache (access time 2 - 4 ns) → L2 Unified Cache (access time 16 - 30 ns) → Memory (access time 100 ns).]

73 Cache Performance
One-level cache:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Two-level caches:
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
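A worked example with made-up numbers (not from the lecture): suppose Hit Time_L1 = 1 cycle, Miss Rate_L1 = 5%, Hit Time_L2 = 10 cycles, Miss Rate_L2 = 20%, and Miss Penalty_L2 = 100 cycles. Then Miss Penalty_L1 = 10 + 0.2 x 100 = 30 cycles, and AMAT = 1 + 0.05 x 30 = 2.5 cycles.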

74 Translation using TLB
Assume a 128-entry, 4-way set-associative TLB: the Virtual Page No. (18 bits) splits into a 13-bit TLB tag and a 5-bit TLB index; the 14-bit offset passes through unchanged. Each entry holds a tag plus Phy. Page #, Pr, V, D bits; a TLB hit yields the Physical Frame No. without a page-table access.

75 Putting it Together: Alpha 21264
48-bit virtual addresses and 44-bit physical addresses.
64KB 2-way associative L1 I-Cache with 64-byte blocks (512 sets).
The L1 I-Cache is virtually indexed and tagged (address translation is required only on a miss).
An 8-bit ASID for each process avoids flushing the cache on a context switch.

76 Alpha 21264
8KB page size (13-bit page offset).
128-entry fully associative TLB.
8MB direct-mapped unified L2 cache, 64B block size.
Critical word (16B) first; prefetch the next 64B into the instruction prefetcher.

77 21264 Data Cache
The L1 data cache uses a virtual address index but a physical address tag, so address translation proceeds along with the cache access.
64KB 2-way associative L1 data cache with write back.

78 Q2: High Performance Pipelined Processors
Pipelining overlaps the execution of consecutive instructions, so processor performance improves.
Current processors use more aggressive techniques for more performance. Some exploit Instruction Level Parallelism: often, many consecutive instructions are independent of each other and can be executed in parallel (at the same time).
Lec18

79 Instruction Level Parallelism Processors
Challenge: identifying which instructions are independent.
Approach 1: build processor hardware to analyze and keep track of dependences. Superscalar processors: Pentium 4, RS6000, …
Approach 2: the compiler does the analysis and packs suitable instructions together for parallel execution by the processor. VLIW (very long instruction word) processors: Intel Itanium.
Lec18

80 ILP Processors (contd.)
[Diagram: pipeline stage diagrams (IF ID EX MEM WB) comparing pipelined, superscalar, and VLIW/EPIC execution.]

81 Multicores Multiple cores in a single die
Early efforts utilized multiple cores for multiple programs Throughput oriented rather than speedup-oriented! Can they be used by Parallel Programs?

82 Assignment #2
1. Learn about the loop unrolling that gcc can do for you. We have unrolled the DAXPY loop 2 times to perform the computation of 2 elements in each loop iteration. Study the effects of increasing the degree of loop unrolling.
DAXPY loop:
double a, X[16384], Y[16384], Z[16384];
for (i = 0; i < 16384; i++)
    Z[i] = a * X[i] + Y[i];
2. Understand the static instruction scheduling performed by the compiler in the above code, with and without the optimization flags.
3. Do Problem 5.18 (page ) in H&P, Computer Architecture book, Ed. 4.
4. Implement matrix multiplication of 4096x4096 double matrices on any 2 different machines that you normally use. Apply loop interchange, blocking, etc. to reduce the execution time. (Due: Oct. 14, 2010)

