The many-core architecture

The System
- One clock.
- Scheduler (ideal): distributes tasks to the Cores according to a task map.
- Cores: 256 simple RISC cores, no caches; they perform tasks given by the Scheduler and report back to the Scheduler when done.
- Memory Banks: addresses are interleaved among the banks.
- Processor-to-Memory Network:
  - Propagates read/write commands from Cores to Memory.
  - Bufferless.
  - A collision occurs when >1 read/write requests target the same bank at the same time (no collision for 2 reads from the same address).
  - On a collision the network returns NACK to the cores (one succeeds, the others fail); a core retries after a NACK.
- Access time: fixed (the same for all addresses) in the base system; variable in this research.
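A minimal sketch of the collision rule described above (a hypothetical Python model, not the authors' simulator): requests hitting the same bank in the same cycle are NACKed except one, unless they are all reads of the same address; NACKed cores would retry in the next cycle.

```python
import random
from collections import defaultdict

def arbitrate(requests):
    """requests: list of (core_id, op, bank, addr) issued in one cycle.
    Returns (granted, nacked) core-id lists, following the bufferless
    collision rule sketched on the slide."""
    by_bank = defaultdict(list)
    for req in requests:
        by_bank[req[2]].append(req)

    granted, nacked = [], []
    for bank, reqs in by_bank.items():
        same_addr_read = all(op == "R" for _, op, _, _ in reqs) and \
                         len({addr for _, _, _, addr in reqs}) == 1
        if len(reqs) == 1 or same_addr_read:
            granted += [core for core, _, _, _ in reqs]   # no collision
        else:
            winner = random.choice(reqs)                  # one succeeds
            granted.append(winner[0])
            nacked += [core for core, _, _, _ in reqs if core != winner[0]]
    return granted, nacked

# Example: cores 0 and 1 write to bank 3 (collision, one NACK);
# cores 2 and 3 read the same address of bank 7 (no collision).
reqs = [(0, "W", 3, 0x100), (1, "W", 3, 0x104), (2, "R", 7, 0x200), (3, "R", 7, 0x200)]
print(arbitrate(reqs))
```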

The Memory Network

The Memory Network - Collision

The Memory Banks - Interleaved (64 banks, 4B-word example)
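For the interleaving above, a plausible mapping (an illustrative assumption; the slide only states "64 banks, 4B") is word-level interleaving, where consecutive 4-byte words fall in consecutive banks:

```python
NUM_BANKS = 64   # example from the slide
WORD_SIZE = 4    # bytes

def bank_of(address):
    """Word-interleaved mapping: consecutive 4B words go to consecutive banks."""
    word = address // WORD_SIZE
    return word % NUM_BANKS

def offset_in_bank(address):
    word = address // WORD_SIZE
    return word // NUM_BANKS

# Consecutive words spread across banks 0..63, then wrap around.
for addr in (0x000, 0x004, 0x008, 0x0FC, 0x100):
    print(hex(addr), "-> bank", bank_of(addr), "offset", offset_in_bank(addr))
```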

Research Question: Non-equidistant memory
- The base system [Bayer Ginosar 91] uses equidistant memory.
- The clock cycle time must accommodate both a processor cycle and an access to the farthest memory bank.
- Slow clock (Freq 1): a memory access takes 2 cycles (one cycle to memory and one back).
- But the Cores can work faster: a higher clock frequency means faster processors and higher performance, and some memory accesses become shorter.

Mem Access in the Circular Layout Model
- Frequency increased by 2 (Freq 2):
  - Access to near memory (< radius/2) takes 1 cycle, but covers only ¼ of the memory area.
  - Access to far memory (> radius/2) takes 2 cycles.
  - Average cycles per memory access.
  - Average time per memory access relative to the slow frequency (was 2).
- Frequency increased by N (Freq N):
  - Average cycles per memory access.
  - Average time per memory access relative to the slow frequency.
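The averages themselves are not reproduced in the transcript; the sketch below recomputes them under one plausible reading (an assumption, not taken from the slides): memory uniformly fills a disc around the requesting core, and at Freq N the one-way latency is k cycles for banks in the ring between (k-1)/N and k/N of the radius.

```python
def avg_access_time(freq):
    """Average one-way cycles and average round-trip time (in slow-clock units)
    for the circular model, assuming uniform accesses over the disc area and
    one-way latency of k fast cycles for the k-th ring."""
    one_way = sum(k * (2 * k - 1) for k in range(1, freq + 1)) / freq**2
    round_trip_slow_units = 2 * one_way / freq   # convert fast cycles to slow cycles
    return one_way, round_trip_slow_units

for f in (1, 2, 4, 8):
    print(f, avg_access_time(f))
# Freq 1 -> 2.0 slow cycles per access (the base system);
# Freq 2 -> 1.75; the average tends toward 4/3 as the frequency grows.
```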

Frequency ++ → Access time --

Mem Access in the Rectangular Layout Model
- The more banks, the fewer collisions.

Memory Access Time
The closer the memory bank is to the requesting core, the fewer cycles the access takes.
Access time for a far bank:
- Freq 1: 2 (1 cycle one-way)
- Freq 2: 2 (2 cycles one-way)
- Freq 4: 2 (4 cycles one-way)
- Freq 8: 2 (8 cycles one-way)
Access time for a near bank:
- Freq 1: 2 (1 cycle one-way)
- Freq 2: 1 (1 cycle one-way)
- Freq 4: 1 (2 cycles one-way)
- Freq 8: 2/3 (3 cycles one-way)
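A small latency model consistent with the far-bank column above (an assumption, not stated on the slides): at Freq N the farthest bank takes N cycles one-way and a nearer bank proportionally fewer, rounded up to whole fast cycles; the round-trip time is then expressed in slow-clock (Freq 1) cycles.

```python
import math

def one_way_cycles(freq, dist, max_dist):
    """Assumed model: the farthest bank takes `freq` cycles one-way;
    nearer banks take proportionally fewer, rounded up to whole fast cycles."""
    return max(1, math.ceil(freq * dist / max_dist))

def access_time_slow_units(freq, dist, max_dist):
    """Round-trip access time expressed in slow-clock (Freq 1) cycles."""
    return 2 * one_way_cycles(freq, dist, max_dist) / freq

# A far bank always costs 2 slow cycles, matching the far-bank column;
# a bank at ~0.3 of the maximum distance reproduces the 1, 1, 2, 3
# one-way cycle counts of the near-bank example.
for f in (1, 2, 4, 8):
    print(f, access_time_slow_units(f, 1.0, 1.0), one_way_cycles(f, 0.3, 1.0))
```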

Memory Access Time Matrix - Freq 4 (roundtrip)

Memory Access Time Matrix - Freq 8 (roundtrip)

Tested Parameters
- Cores: fixed at 256.
- Frequency: 1, 2, 4, 8; results are compared to Freq 1.
- Memory Banks: 128, 256, 512 (the more banks, the fewer collisions?).
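A minimal sketch of the implied parameter sweep (hypothetical driver code; `simulate` is a stand-in for the many-core simulator, which is not part of this transcript):

```python
from itertools import product

FREQUENCIES = (1, 2, 4, 8)
BANK_COUNTS = (128, 256, 512)
NUM_CORES = 256  # fixed in all experiments

def simulate(freq, banks, cores=NUM_CORES):
    """Placeholder for the many-core architecture simulator;
    would return the program runtime in slow-clock cycles."""
    raise NotImplementedError

def sweep():
    baseline = {b: simulate(1, b) for b in BANK_COUNTS}       # Freq 1 reference runs
    speedups = {}
    for freq, banks in product(FREQUENCIES, BANK_COUNTS):
        speedups[(freq, banks)] = baseline[banks] / simulate(freq, banks)
    return speedups
```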

Synthetic Program Task Map
Three variants share the same block diagram and vary in the number of duplications and in the distribution of memory addresses:
- Serial: most cores access the same memory address; high rate of collisions.
- Normal: uniform distribution of memory addresses.
- Parallel: many more duplications; cores are busier, with less idle time.
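An illustrative reading of the three variants as address generators (a sketch under assumed distributions, not the authors' task maps):

```python
import random

MEM_SIZE = 1 << 20   # assumed address-space size, in words

def gen_addresses(variant, n_accesses, hot_fraction=0.9):
    """Illustrative address streams for the three synthetic variants."""
    hot = random.randrange(MEM_SIZE)          # the single shared address
    addrs = []
    for _ in range(n_accesses):
        if variant == "serial":
            # most cores hit the same address -> many collisions
            addrs.append(hot if random.random() < hot_fraction
                         else random.randrange(MEM_SIZE))
        else:
            # 'normal' uses uniform addresses; 'parallel' differs mainly in
            # issuing more task duplications, i.e. more accesses overall
            addrs.append(random.randrange(MEM_SIZE))
    return addrs
```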

Actual test programs
- Three programs: JPEG, Linear Solver, Mandelbrot fractals.
- Each was executed by a SW simulator on a single core, generating traces.
- The traces were then processed by the many-core architecture simulator.

Results

[Figure: JPEG memory-access traces (memory index and processor index over time) for Freq 1, 2, 4, 8 with 512, 256, 128 memory banks]

[Figure: results for the serial, parallel, and typical synthetic programs, JPEG (1 frame), and the Linear Solver]

Decomposed Contributions
Three factors affect the speedup:
- Processors executing faster (freq = 1, 2, 4, 8).
- Shorter network latency: far blocks take the same long time, but nearer blocks are reachable at shorter latencies.
- Memories allowing faster access (freq = 1, 2, 4, 8).
The three contributions need to be separated:
- By modified simulations.
- By re-computing (manipulating) the results.

Contribution of processors
- Simulation: processors at freq = 1, 2, 4, 8; network single-cycle (equidistant) at freq = 1; memories at freq = 1.
- Computing: use the Freq = 1 data for everything, dividing the processor busy times by 1, 2, 4, 8.
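A sketch of that recomputation (the split of each core's time into "busy" and "other" is an assumed structure, not taken from the slides): only the busy portion is scaled by the frequency, while the remaining time from the Freq 1 run is kept as measured.

```python
def processor_contribution(busy_cycles, other_cycles, freq):
    """Re-compute execution time from the Freq 1 run: only the processors'
    busy time is divided by the frequency; wait/collision time is unchanged."""
    return busy_cycles / freq + other_cycles

def speedup_from_processors(busy_cycles, other_cycles, freq):
    base = busy_cycles + other_cycles                 # Freq 1 execution time
    return base / processor_contribution(busy_cycles, other_cycles, freq)

# Example: a core busy for 700 cycles and waiting for 300 cycles at Freq 1.
for f in (1, 2, 4, 8):
    print(f, round(speedup_from_processors(700, 300, f), 2))
```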

Contribution of network
- Simulation: processors at freq = 1, 2, 4, 8; network multi-cycle; memories at freq = 1. This does not make sense: it cancels the network effect.
- Computing: compare the single-cycle and multi-cycle runs.

Contribution of memories
- Simulation: processors at freq = 1, 2, 4, 8; network single-cycle, slow (freq = 1); memories at freq = 1, 2, 4, 8.
- The faster memories are modeled by multi-port memories (1, 2, 4, 8 ports per cycle).
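One way to model those multi-port memories (a sketch, assuming a bank simply serves up to its port count per cycle and NACKs the rest):

```python
from collections import defaultdict

def arbitrate_multiport(requests, ports):
    """Multi-port bank model: up to `ports` requests per bank are served each
    cycle; the remaining requesters are NACKed (sketch of the 1/2/4/8-port case)."""
    by_bank = defaultdict(list)
    for core, bank in requests:
        by_bank[bank].append(core)

    served, nacked = [], []
    for bank, cores in by_bank.items():
        served += cores[:ports]
        nacked += cores[ports:]
    return served, nacked

# Four cores hitting bank 0 in the same cycle:
reqs = [(0, 0), (1, 0), (2, 0), (3, 0)]
for p in (1, 2, 4, 8):
    print(p, arbitrate_multiport(reqs, p))
```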

Contributions
NE = wait time / (wait + collision)

Conclusions

Cores temporal activity
- Higher frequency → cores executing versions of the same task finish at different times, thanks to path-latency diversity; this is a finer granularity of core activity.
- Lower frequency → cores executing versions of the same task finish closer together, so many cores become free at once; this coarser-granularity activity is seen in both the CPU activity and the temporal activity graphs, and puts a worse (bursty) load on the scheduler.
- More banks → fewer accesses per bank, hence fewer collisions.

Collisions
- Collisions decrease with higher frequency and with more banks; this affects both speed-up and wait time.
- Higher frequency and more banks → higher diversity of path latency → fewer collisions.
- Higher frequency → collisions incur a lower wait-time penalty.

Speed-Up
- Frequency is the dominant factor, mostly due to faster cores, fewer collisions, and shorter mean memory-access cycles.
- Within the same frequency, a larger number of banks is better due to the lower collision rate; this can be seen in the Normal and Parallel cases.
- In the Serial case we do not see such a dependency, because many accesses go to a single memory address, which is physically located differently on systems with different bank counts.
- In a highly colliding program, the speed-up at fast frequencies is larger than for the parallel program because of the core-to-bank path-latency diversity (low frequencies pay a larger wait-time penalty).

Relative Wait Time
- The relative wait time decreases with frequency because of path-latency diversity.
- The number of banks has hardly any impact for the Serial program, because of the high number of collisions.
- In the Normal and Parallel programs, the difference between bank counts is significant at low frequencies because of the higher number of collisions; at higher frequencies the bank-count factor becomes less significant, because path-latency diversity leads to fewer collisions.