1 The many-core architecture

2 The System
- One clock
- Scheduler (ideal): distributes tasks to the cores according to a task map
- Cores: 256 simple RISC cores, no caches; perform tasks given by the Scheduler and report back to the Scheduler when done
- Memory banks: addresses interleaved among the banks
- Processor-to-memory network: propagates read/write commands from cores to memory; bufferless
- Collision on more than one read/write request to the same bank at the same time (no collision for two reads from the same address); the network returns NACK to the cores (one succeeds, the others fail), and a core retries after a NACK
- Access time: fixed (the same to all addresses) in the base system; variable in this research
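The collision rule above can be sketched as a per-cycle arbitration step. This is a minimal illustration, not the actual simulator: the request format and the pick-the-first-requester arbitration policy are assumptions (the slide only fixes that one request succeeds and the rest get NACK, and that multiple reads of the same address do not collide).

```python
from collections import defaultdict

def arbitrate(requests):
    """One network cycle.  `requests` is a list of (core, op, bank, addr).
    At most one request per bank succeeds, except that any number of
    reads of the *same* address in the same bank all succeed.
    Returns (acks, nacks) as sets of core ids."""
    by_bank = defaultdict(list)
    for core, op, bank, addr in requests:
        by_bank[bank].append((core, op, addr))
    acks, nacks = set(), set()
    for reqs in by_bank.values():
        ops = {op for _, op, _ in reqs}
        addrs = {addr for _, _, addr in reqs}
        if len(reqs) == 1 or (ops == {"read"} and len(addrs) == 1):
            acks.update(core for core, _, _ in reqs)  # no collision
        else:
            # Single winner; which one wins is an assumption here.
            acks.add(reqs[0][0])
            nacks.update(core for core, _, _ in reqs[1:])
    return acks, nacks
```

A core that appears in `nacks` would simply reissue its request on the next cycle.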

3 The Memory Network

4 The Memory Network - Collision

5 The Memory Banks - Interleaved (example: 64 banks, 4-byte words)
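The slide's 64-bank, 4-byte example implies the usual interleaving scheme: consecutive words land in consecutive banks. A minimal sketch, assuming the low-order word-address bits select the bank (the exact bit assignment is not stated on the slide):

```python
NUM_BANKS = 64   # from the slide's example
WORD_BYTES = 4

def bank_of(byte_addr):
    """Bank holding this address: consecutive words map to
    consecutive banks, wrapping around every NUM_BANKS words."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS

def offset_in_bank(byte_addr):
    """Word offset within the selected bank."""
    return (byte_addr // WORD_BYTES) // NUM_BANKS
```

Under this mapping, a stream of sequential word accesses spreads evenly over all 64 banks, which is what keeps collisions low for uniform traffic.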

6 Research Question: Non-Equidistant Memory
- The base system [Bayer Ginosar 91] uses equidistant memory
  - The clock cycle time accommodates a processor cycle plus access to the farthest memory bank (slow clock, "Freq 1")
  - Access to memory takes 2 cycles (one cycle to memory and one back)
- But the cores can work faster:
  - Higher clock frequency means faster processors and higher performance
  - Some memory accesses become shorter

7 Memory Access in the Circular Layout Model
- Frequency increased by 2 ("Freq 2"):
  - Near memory (< radius/2) accessed in 1 cycle; only 1/4 of the memory area
  - Far memory (> radius/2) accessed in 2 cycles
  - Average cycles per memory access; average time per memory access relative to the slow frequency (was 2)
- Frequency increased by N ("Freq N"):
  - Average cycles per memory access; average time per memory access relative to the slow frequency
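The averages referred to above (the formulas themselves did not survive transcription) can be reconstructed under two modelling assumptions that match the Freq-2 case on the slide: banks spread uniformly over a disk, and one-way latency of ceil(N*r) fast cycles for a bank at radius fraction r. This is a sketch of that reconstruction, not the authors' formula:

```python
def avg_access_time(N):
    """Average round-trip memory access time at clock frequency N,
    in units of the slow (Freq-1) clock cycle.
    Assumes banks uniform over a disk and one-way latency ceil(N*r)
    fast cycles at radius fraction r (modelling assumptions)."""
    # Fraction of the disk whose one-way latency is exactly k cycles:
    # (k/N)**2 - ((k-1)/N)**2 = (2k - 1) / N**2
    one_way = sum(k * (2 * k - 1) for k in range(1, N + 1)) / N**2
    return 2 * one_way / N  # round trip, rescaled to slow-clock cycles
```

For N = 2 this gives a one-way average of 1*(1/4) + 2*(3/4) = 1.75 fast cycles, i.e. 1.75 slow-clock cycles round trip versus 2 at Freq 1, consistent with the 1/4-area split described on the slide; as N grows the round-trip average approaches 4/3 slow cycles.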

8 Higher Frequency → Shorter Access Time

9 Memory Access in the Rectangular Layout Model: the more banks, the fewer collisions

10 Memory Access Time
The closer the memory bank to the requesting core, the fewer cycles an access takes. Round-trip access times, in slow-clock cycles:

Farthest bank:
- Freq 1: 2 (1 cycle one-way)
- Freq 2: 2 (2 cycles one-way)
- Freq 4: 2 (4 cycles one-way)
- Freq 8: 2 (8 cycles one-way)

A nearby bank:
- Freq 1: 2 (1 cycle one-way)
- Freq 2: 1 (1 cycle one-way)
- Freq 4: 1 (2 cycles one-way)
- Freq 8: 2/3 (3 cycles one-way)

11 Memory Access Time Matrix, Freq 4 (round trip)

12 Memory Access Time Matrix, Freq 8 (round trip)

13 Tested Parameters
- Cores: fixed at 256
- Frequency: 1, 2, 4, 8; results are compared to Freq 1
- Memory banks: 128, 256, 512 (the more banks, the fewer collisions?)

14 Synthetic Program Task Map
Three variants share the same block diagram but vary in the number of duplications and the distribution of memory addresses:
- Serial: most cores access the same memory address, giving a high rate of collisions
- Normal: uniform distribution of memory addresses
- Parallel: many more duplications; cores are busier, with less idle time
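The three address distributions could be generated along these lines. This is purely illustrative: the hot address, the 90% hit rate for the serial variant, and the address-space size are all assumptions not given on the slide.

```python
import random

def gen_addresses(variant, n, addr_space=2**20, seed=0):
    """Toy address generators for the three synthetic variants:
    'serial'   -> most accesses hit one hot address (many collisions)
    'normal'   -> addresses uniform over memory
    'parallel' -> uniform addresses too; the real variant differs by
                  issuing many more task duplications (here only via n)
    All numeric parameters are illustrative assumptions."""
    rng = random.Random(seed)
    hot = 0x1000  # arbitrary hot address for the serial variant
    if variant == "serial":
        return [hot if rng.random() < 0.9 else rng.randrange(addr_space)
                for _ in range(n)]
    return [rng.randrange(addr_space) for _ in range(n)]
```

Feeding the serial stream through the bank-interleaving map concentrates almost all traffic on a single bank, which is exactly the high-collision behaviour the serial variant is meant to provoke.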

15 Actual Test Programs
Three programs: JPEG, Linear Solver, and Mandelbrot fractals.
Each was executed by a software simulator on a single core to generate traces; the traces were then processed by the many-core architecture simulator.

16 Results

17 JPEG (charts: memory index and processor index vs. time, for Freq 1, 2, 4, 8 and 512, 256, 128 memory banks)

18 (charts: serial, parallel, and typical variants; JPEG, 1 frame; Linear Solver)

19 Decomposed Contributions
Three factors affect speedup:
- Processors executing faster (freq = 1, 2, 4, 8)
- Shorter network latency: far banks take the same long time, but nearer banks are reachable at shorter latencies
- Memories allowing faster access (freq = 1, 2, 4, 8)
The three contributions are separated in two ways: by modified simulations, and by re-computing (manipulating) the results.

20 Contribution of the Processors
- Simulation: processors at freq = 1, 2, 4, 8; single-cycle (equidistant) network at freq = 1; memories at freq = 1
- Computing: use the Freq-1 data for everything, dividing processor busy times by 1, 2, 4, 8
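The re-computing approach above can be sketched as follows. The split of total run time into "busy" and "idle" components, and treating everything non-busy as frequency-invariant, are simplifying assumptions for illustration:

```python
def processor_only_speedup(busy, idle, freq):
    """Estimate the processors-only contribution to speedup from a
    Freq-1 run: shrink only the processor busy time by the frequency
    factor and leave all other time (waits, idle) untouched.
    The busy/idle decomposition is an assumed simplification."""
    base = busy + idle          # Freq-1 total time
    scaled = busy / freq + idle # recomputed total at the higher freq
    return base / scaled
```

For example, a run that is 80% busy yields a processors-only speedup of 2.5 at freq = 4, well short of the 4x frequency gain, since idle and wait time are not sped up.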

21 Contribution of the Network
- Simulation: processors at freq = 1, 2, 4, 8; multi-cycle network; memories at freq = 1 (this does not make sense: it cancels the network effect)
- Computing: compare single-cycle and multi-cycle runs

22 Contribution of the Memories
- Simulation: processors at freq = 1, 2, 4, 8; single-cycle, slow network (freq = 1); memories at freq = 1, 2, 4, 8, modelled by multi-port memories (1, 2, 4, 8 ports per cycle)

23 Contributions
NE = wait time / (wait + collision)

24 Conclusions

25 Cores' Temporal Activity
- Higher frequency → cores executing versions of the same task finish at different times, thanks to path-latency diversity; this gives finer-grained core activity
- Lower frequency → cores executing versions of the same task finish closer together; many cores become free at once (coarser-grained activity), producing a worse, bursty load on the scheduler
- Seen in both the CPU-activity and temporal-activity graphs
- More banks → fewer accesses per bank → fewer collisions

26 Collisions
- Collisions decrease with higher frequency and with more banks, affecting both speed-up and wait time
- Higher frequency and more banks → higher diversity of path latency → fewer collisions
- Higher frequency → collisions incur a lower wait-time penalty

27 Speed-Up
- Frequency is the dominant factor, mostly due to faster cores, fewer collisions, and shorter mean memory-access cycles.
- Within the same frequency, a larger number of banks is better, due to the lower collision rate; this can be seen in the Normal and Parallel cases.
- In the Serial case we do not see such a dependency, because of the many accesses to a single memory address, which is physically located differently on systems with different bank counts.
- In a highly collided program, speed-up at fast frequencies is larger than for the parallel program, because of core-to-bank path-latency diversity (low frequencies pay a larger wait-time penalty).

28 Relative Wait Time
- Relative wait time decreases with frequency because of path-latency diversity.
- The number of banks has hardly any impact for the Serial program, because of its high collision count.
- In the Normal and Parallel programs, the difference between bank counts is significant at low frequencies because of the higher collision count; at higher frequencies the bank-count factor becomes less significant, since path-latency diversity leaves fewer collisions.

