Chapter 4 Memory Design: SOC and Board-Based Systems

Computer System Design: System-on-Chip, by M. Flynn & W. Luk, Wiley, 2011 (copyright 2011)

Cache and Memory
cache
  performance
  cache partitioning
  multi-level cache
memory
  off-die memory designs

Outline for memory design

Area comparison of memory tech.

System environments and memory

Performance factors
Factors:
  physical word size: processor to cache
  block / line size: cache to memory
  cache hit time: cache size, organization
  cache miss time: memory and bus
  virtual-to-real translation time
  number of processor requests per cycle

Design target miss rates
beyond 1 MB, doubling the cache size halves the miss rate

System effects limit hit rate
the operating system affects the miss ratio: about a 20% increase
so does multiprogramming (M)
miss rates may not be affected by increased cache size
Q = number of instructions between task switches

System Effects
Cold-Start: short transactions are created frequently and run quickly to completion
Warm-Start: long processes are executed in time slices

Some common cache types

Multi-level caches: mostly on die
useful for matching processor to memory
generally at least 2-level
for microprocessors: L1 at the frequency of the pipeline and L2 at slower latency
often use 3-level
size limited by access time and improved cycle times

Cache partitioning: scaling effect on cache access time
access time to a cache is approximately
  access time (ns) = (0.35 + 3.8 f + (0.006 + 0.025 f) C) x (1 + 0.3 (1 - 1/A))
where
  f is the feature size in microns
  C is the cache capacity in KB
  A is the associativity, e.g. direct mapped: A = 1
for example, at f = 0.1 micron, A = 1 and C = 32 (KB), the access time is 1.00 ns
problem with small feature size: cache access time, not cache size
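A minimal Python sketch of this access-time model follows. The numeric coefficients (0.35, 3.8, 0.006, 0.025, 0.3) are an assumption: the transcript does not preserve them, and they are the values usually quoted for this model. With them, the slide's example (f = 0.1, A = 1, C = 32) comes out at about 1.00 ns. The function name is hypothetical.

```python
def cache_access_time_ns(f_um: float, capacity_kb: float, assoc: int) -> float:
    """Approximate cache access time in ns.

    f_um:        feature size in microns
    capacity_kb: cache capacity in KB
    assoc:       associativity (1 = direct mapped)
    The coefficients below are assumed, not taken from the transcript.
    """
    size_term = 0.35 + 3.8 * f_um + (0.006 + 0.025 * f_um) * capacity_kb
    assoc_term = 1 + 0.3 * (1 - 1 / assoc)
    return size_term * assoc_term


# The slide's example point: f = 0.1 micron, 32 KB, direct mapped.
print(round(cache_access_time_ns(0.1, 32, 1), 2))
```

Under these assumed coefficients the fixed term 0.35 does not scale with f, which is one way to read the slide's remark that access time, rather than size, becomes the problem at small feature sizes.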

Minimum cache access time
1 array; larger sizes use multiple arrays (interleaving)
L1: usually less than 64 KB
L2: usually less than 512 KB (interleaved from smaller arrays)
L3: multiple 256 KB arrays

Analysis: multi-level cache miss rate
L2 cache analysis by statistical inclusion
  if the L2 cache is more than 4 x the size of the L1 cache, assume statistically that the contents of L1 lie in L2
relevant L2 miss rates
  local miss rate: no. L2 misses / no. L2 references
  global miss rate: no. misses / no. processor references
  solo miss rate: no. misses without L1 / no. processor references
  inclusion => solo miss rate = global miss rate
miss penalty calculation
  L1 miss rate x (miss in L1, hit in L2 penalty)
  plus L2 miss rate x (miss in L1, miss in L2 penalty - L1 to L2 penalty)
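The difference between the local and global L2 miss rates can be seen in a small Python sketch. It assumes that the L2 is referenced only on L1 misses, so the number of L2 references equals the number of L1 misses. The reference counts are hypothetical and the function name is illustrative.

```python
def l2_miss_rates(proc_refs: int, l1_misses: int, l2_misses: int):
    """Return (local, global) L2 miss rates.

    Assumes every L1 miss becomes exactly one L2 reference.
    """
    local_rate = l2_misses / l1_misses    # misses per L2 reference
    global_rate = l2_misses / proc_refs   # misses per processor reference
    return local_rate, global_rate


# Hypothetical counts: 1000 processor references, 40 L1 misses, 10 L2 misses.
local_rate, global_rate = l2_miss_rates(1000, 40, 10)
print(local_rate, global_rate)
```

A global rate of 1% therefore corresponds to a much higher local rate, because the L2 only sees the references that the L1 failed to satisfy.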

Multi-level cache example
miss rates: L1 4%, L2 1%
delays:
  miss in L1, hit in L2: 2 cycles
  miss in L1, miss in L2: 15 cycles
assume one reference/instruction
L1 delay is 1 ref/instr x 0.04 misses/ref x 2 cycles/miss = 0.08 cpi
L2 delay is 1 ref/instr x 0.01 misses/ref x (15 - 2) cycles/miss = 0.13 cpi
total effect of the 2-level system is 0.08 + 0.13 = 0.21 cpi
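The arithmetic of this example can be checked with a short Python sketch that adds the L1 and L2 delay terms. The function and parameter names are illustrative only.

```python
def miss_delay_cpi(refs_per_instr, l1_miss_rate, l2_miss_rate,
                   l1_miss_l2_hit_cycles, l1_miss_l2_miss_cycles):
    """Return (L1 delay, L2 delay, total) in cycles per instruction."""
    l1_delay = refs_per_instr * l1_miss_rate * l1_miss_l2_hit_cycles
    # The L2 term charges only the extra cycles beyond the L1-to-L2 penalty.
    extra_cycles = l1_miss_l2_miss_cycles - l1_miss_l2_hit_cycles
    l2_delay = refs_per_instr * l2_miss_rate * extra_cycles
    return l1_delay, l2_delay, l1_delay + l2_delay


l1_delay, l2_delay, total = miss_delay_cpi(1, 0.04, 0.01, 2, 15)
print(f"L1 {l1_delay:.2f} cpi, L2 {l2_delay:.2f} cpi, total {total:.2f} cpi")
```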

Memory design
logical inclusion
embedded RAM
off-die: DRAM
basic memory model
Strecker’s model

Physical memory system

Hierarchy of caches
Name  Description        Size        Access      Transfer size
L0    Registers          <256 words  <1 cycle    word
L1    Core local         <64K        <4 cycle    line
L2    On chip            <64M        <30 cycle
L3    DRAM on chip       <1G         <60 cycle   >= line
M0    Off-chip cache
M1    Local main memory  <16G        <150 cycle
M2    Cluster memory

Hierarchy of caches
working set: how much memory an “iteration” requires
  if it fits in a level, then that level will be the worst case
  if it does not, hit rate typically determines performance
double the cache level size, halve the miss rate: a good rule of thumb
if 90% hit rate and a miss costs 10x the access time, performance is about 50%
  and that’s for 1 core
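The 90% hit rate figure can be reproduced with a one-line average access time calculation. This is a sketch under the assumption that a hit costs 1 time unit and a miss costs 10 units in total. The function name is hypothetical.

```python
def relative_performance(hit_rate: float, miss_cost_ratio: float) -> float:
    """Performance relative to an all-hits memory (1.0 = no slowdown).

    A hit costs 1 unit; a miss is assumed to cost miss_cost_ratio units.
    """
    avg_access = hit_rate * 1.0 + (1.0 - hit_rate) * miss_cost_ratio
    return 1.0 / avg_access


# 90% hits, misses 10x slower: average access is 1.9 units.
print(round(relative_performance(0.90, 10), 2))
```

The result is slightly above one half, consistent with the "performance 50%" rule of thumb on the slide.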

Logical inclusion
multiprocessors with L1 and L2 caches
important: finding that the L2 cache does not have a line must be sufficient to determine that the L1 cache does not contain it
need to ensure all the contents of L1 are always in L2
this property: logical inclusion

Logical inclusion techniques
passive
  control cache size, organization, policies
  no. L2 sets >= no. L1 sets
  L2 set size >= L1 set size
  compatible replacement algorithms
  but: highly restrictive and difficult to guarantee
active
  whenever a line is replaced or invalidated in the L2, ensure it is not present in L1 or it is evicted from L1
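The active technique can be illustrated with a toy model. The Python sketch below assumes fully associative L1 and L2 caches with LRU replacement, which is a simplification chosen for brevity and not an organization described in the slides. The class and method names are hypothetical.

```python
from collections import OrderedDict


class InclusiveCaches:
    """Toy fully associative L1/L2 pair with LRU replacement."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1 = OrderedDict()  # line address -> None, least recent first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line: int) -> None:
        if line in self.l1:
            self.l1.move_to_end(line)      # L1 hit: the L2 never sees it
            return
        if line in self.l2:
            self.l2.move_to_end(line)      # L1 miss, L2 hit
        else:
            if len(self.l2) >= self.l2_lines:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)  # active control: evict from L1 too
            self.l2[line] = None
        if len(self.l1) >= self.l1_lines:
            self.l1.popitem(last=False)    # an L1 eviction leaves L2 untouched
        self.l1[line] = None

    def inclusive(self) -> bool:
        """True if every line held in L1 is also held in L2."""
        return all(line in self.l2 for line in self.l1)


caches = InclusiveCaches(l1_lines=2, l2_lines=4)
for line in [0, 1, 0, 2, 0, 3, 0, 4, 0, 5]:
    caches.access(line)
print(caches.inclusive())
```

In this trace line 0 keeps hitting in L1, so the L2 sees it as its oldest line and eventually replaces it. Without the back-invalidation step, L1 would then hold a line that L2 does not, which is exactly the situation logical inclusion forbids.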

Memory system design outline
memory chip technology
  on-die or off-die
  static versus dynamic: SRAM versus DRAM
access protocol: talking to memory
  synchronous vs asynchronous DRAMs
simple memory performance model
  Strecker’s model for memory banks

Why BIG memory?

Memory
many times, computation is limited by memory, not by processor organization or cycle time
memory is characterized by 3 parameters
  size
  access time: latency
  cycle time: bandwidth

Embedded RAM

Embedded RAM density (1)

Embedded RAM density (2)

Embedded RAM cycle time

Embedded RAM error rates

Off-die Memory Module
module contains the DRAM chips that make up the physical memory word
if the DRAM is organized as 2^n words x b bits and the memory has p bits / physical word, then the module has p/b DRAM chips
total memory size is then 2^n words x p bits
parity or Error-Correction Code (ECC) is generally required for error detection and availability
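The module arithmetic can be written as a small Python helper. The chip organization in the usage lines (2^26 words x 8 bits, with 64-bit and 72-bit physical words) is a hypothetical example and not a part taken from the slides.

```python
def module_organization(n_addr_bits: int, chip_width_b: int, word_width_p: int):
    """Return (number of DRAM chips, total module size in bits).

    Each chip holds 2**n_addr_bits words of chip_width_b bits;
    the physical memory word is word_width_p bits wide.
    """
    if word_width_p % chip_width_b != 0:
        raise ValueError("physical word width must be a multiple of chip width")
    chips = word_width_p // chip_width_b
    total_bits = (2 ** n_addr_bits) * word_width_p
    return chips, total_bits


print(module_organization(26, 8, 64))  # 64 data bits per physical word
print(module_organization(26, 8, 72))  # 64 data bits plus 8 check bits
```

Widening the physical word from 64 to 72 bits to carry check bits adds one more 8-bit chip to the module in this example.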

Simple asynchronous DRAM array
DRAM cell
  capacitor: stores charge for the 0/1 state
  transistor: switches the capacitor to the bit line
  charge decays => refresh required
DRAM array
  stores 2^n bits in a square array
  2^(n/2) row lines connect to data lines
  2^(n/2) column bit lines connect to sense amplifiers

DRAM basics
row read is destructive
sequence
  1. read row into SRAM from dynamic memory (>1000 bits)
  2. select word (<64 bits)
  3. write word into row (when writing)
  4. repeat until done with row
  5. write back row into dynamic memory
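The sequence above can be sketched as a toy Python model of one DRAM bank. The method names activate and precharge follow common DRAM command naming; the class itself and its sizes are hypothetical, and timing and refresh are not modeled.

```python
class DramBank:
    """Toy DRAM bank: destructive row read into a row buffer, then write-back."""

    def __init__(self, rows: int, words_per_row: int):
        self.array = [[0] * words_per_row for _ in range(rows)]
        self.row_buffer = None
        self.open_row = None

    def activate(self, row: int) -> None:
        # Step 1: move the whole row into the SRAM-like row buffer.
        # The read is destructive, so the array row no longer holds valid data.
        self.row_buffer = self.array[row]
        self.array[row] = [None] * len(self.row_buffer)
        self.open_row = row

    def read(self, col: int) -> int:
        return self.row_buffer[col]        # step 2: select a word

    def write(self, col: int, value: int) -> None:
        self.row_buffer[col] = value       # step 3: update the word in the buffer

    def precharge(self) -> None:
        # Step 5: write the (possibly modified) row back into the array.
        self.array[self.open_row] = self.row_buffer
        self.row_buffer = None
        self.open_row = None


bank = DramBank(rows=4, words_per_row=8)
bank.activate(2)
bank.write(3, 7)
print(bank.read(3))
bank.precharge()
print(bank.array[2])
```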

DRAM timing
row and column addresses are muxed
row and column strobes are used for timing

Increase DRAM bandwidth
burst mode
  aka page mode, nibble mode, fast page mode
synchronous DRAM (SDRAM)
DDR SDRAM
  DDR1
  DDR2
  DDR3

DDR SDRAM (Double Data Rate Synchronous DRAM)

Burst mode
save the most recently accessed row (“page”)
  only need column address + CAS to access within the page
most DDR SDRAMs: multiple rows can be open
  address counter in each row for sequential accesses
  only need CAS (DRAM) or bus clock (SDRAM) for sequential accesses
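The benefit of keeping a row open can be illustrated with a small cycle count. The Python sketch below assumes a single open row and two hypothetical timing parameters, t_row (cycles to open a row) and t_cas (cycles for a column access). The values are illustrative and not taken from any data sheet, and precharge time is ignored.

```python
def access_cycles(trace, t_row: int = 3, t_cas: int = 2) -> int:
    """Total cycles for a trace of (row, column) accesses with one open row."""
    open_row = None
    total = 0
    for row, _col in trace:
        if row != open_row:
            total += t_row   # page miss: the new row must be opened first
            open_row = row
        total += t_cas       # column access within the open page
    return total


same_page = [(5, c) for c in range(4)]         # four accesses to one row
alternating = [(0, 0), (1, 0), (0, 1), (1, 1)]  # every access changes row
print(access_cycles(same_page), access_cycles(alternating))
```

With these assumed timings the four same-page accesses pay the row cost once, while the alternating trace pays it on every access.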

Configuration parameters
Parameters for typical DRAM chips used in a 64-bit module

DRAM timing

Physical memory system

Basic memory model
assume that n processors each make 1 request per Tc to one of m memories
B(m,n): number of successes
Tc: memory cycle time
one processor making n requests per Tc behaves as n processors making 1 request per Tc

Achieved vs. offered bandwidth
offered request rate: the rate at which the processor(s) would make requests if memory had unlimited bandwidth and no contention

Basic terms
B = B(m,n) or B(m): number of requests that succeed each Tc (= average number of busy modules)
B: bandwidth normalized to Tc
Ts: more general term for service time; Tc = Ts
BW: achieved bandwidth in requests serviced per second
BW = B / Ts = B(m,n) / Ts

Modeling + evaluation methodology
relevant physical parameters for memory
  word size
  module size
  number of modules
  cycle time Tc (= Ts)
find the offered bandwidth
  number of requests / Ts
find the bottleneck
  performance is limited by the most restrictive service point

Strecker’s model: compute B(m,n)
model description
  each processor generates 1 reference per cycle
  requests are randomly/uniformly distributed over modules
  any busy module serves 1 request
  all unserviced requests are dropped each cycle
  assume there are no queues
B(m,n) = m[1 - (1 - 1/m)^n]
relative performance: Prel = B(m,n) / n

Deriving Strecker’s model
Prob[given processor does not reference module] = (1 - 1/m)
Prob[no processor references module] = P[idle] = (1 - 1/m)^n
Prob[module busy] = 1 - (1 - 1/m)^n
average number of busy modules is B(m,n)
B(m,n) = m[1 - (1 - 1/m)^n]
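The derivation can be checked numerically. The Python sketch below evaluates the closed form and compares it with a simple Monte Carlo simulation of the model's assumptions (uniform random requests, no queues, one request served per busy module). The function names, seed and cycle count are arbitrary choices for illustration.

```python
import random


def strecker_bandwidth(m: int, n: float) -> float:
    """Closed form: expected number of busy modules per cycle."""
    return m * (1 - (1 - 1 / m) ** n)


def simulate_bandwidth(m: int, n: int, cycles: int = 100_000, seed: int = 1) -> float:
    """Average number of distinct modules hit by n uniform requests per cycle."""
    rng = random.Random(seed)
    busy = 0
    for _ in range(cycles):
        # Requests to the same module collapse to one success; the rest are dropped.
        busy += len({rng.randrange(m) for _ in range(n)})
    return busy / cycles


print(strecker_bandwidth(4, 4))
print(simulate_bandwidth(4, 4))
```

The simulated average should land close to the closed-form value, since counting distinct modules per cycle is the same as counting busy modules.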

Example 1
2 dual-core processor dice share memory; Ts = 24 ns
each die has 2 processors sharing a 4 MB L2
miss rate is 0.001 misses/reference
each processor makes 3 references/cycle at 4 GHz
2 x 2 x 3 x 0.001 = 0.012 refs/cycle
Ts = 4 x 24 = 96 cycles
n = 0.012 x 96 = 1.152 processor requests / Ts
if m = 4, B(m,n) = B(4, 1.152) = 4[1 - (1 - 1/4)^1.152] = 1.13
relative performance = B/n = 1.13/1.152 = 0.98
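The offered request rate n in this example is a chain of unit conversions, sketched below in Python. The miss rate of 0.001 misses per reference is inferred from the stated product of 0.012 refs/cycle, since the transcript does not show the value itself. The function and parameter names are hypothetical.

```python
def offered_requests_per_ts(dice: int, cores_per_die: int, refs_per_cycle: float,
                            miss_rate: float, clock_ghz: float, ts_ns: float) -> float:
    """Memory requests offered per service time Ts by all cores together."""
    misses_per_cycle = dice * cores_per_die * refs_per_cycle * miss_rate
    ts_cycles = clock_ghz * ts_ns  # GHz x ns = processor cycles per Ts
    return misses_per_cycle * ts_cycles


n = offered_requests_per_ts(dice=2, cores_per_die=2, refs_per_cycle=3,
                            miss_rate=0.001, clock_ghz=4, ts_ns=24)
print(round(n, 3))
```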

Example 2
8-way interleaved associative data cache
processor issues 2 LD/ST per cycle
data references per cycle for each = 0.6
n = 2 x 0.6 = 1.2; m = 8
B(m,n) = B(8, 1.2) = 1.18
relative performance = B/n = 1.18/1.2 = 0.98
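To see how the degree of interleaving affects this example, the Python sketch below evaluates B(m,n) = m[1 - (1 - 1/m)^n] at the offered rate n = 1.2 for several values of m. Only m = 8 appears in the example; the other bank counts are hypothetical design points added for comparison.

```python
def strecker_bandwidth(m: int, n: float) -> float:
    """Expected number of requests served per cycle for m banks and n requests."""
    return m * (1 - (1 - 1 / m) ** n)


n = 2 * 0.6  # two LD/ST slots, 0.6 data references per cycle each
for m in (1, 2, 4, 8, 16):
    b = strecker_bandwidth(m, n)
    print(f"m={m:2d}  B={b:.3f}  relative performance={b / n:.3f}")
```

At m = 8 the result agrees with the figures in the example to within rounding, and the sweep suggests that most of the gain from interleaving is already captured by that point.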

Summary
cache
  performance, cache partitioning, multi-level cache
memory chip technology
  on-die or off-die
  static versus dynamic: SRAM versus DRAM
access protocol: talking to memory
  synchronous vs asynchronous DRAMs
simple memory performance model
  Strecker’s model for memory banks
