1 Many-cores: Supercomputer-on-chip. How many? And how? (how not to?) Ran Ginosar, Technion, Mar 2010

2 Disclosure
- I am co-inventor / co-founder of Plurality
- Based on 30 years of (on/off) research; MSc of Dr. Nimrod Bayer, ~1990
- However, research is forward-looking
- But takes advantage of Plurality tools and data

3 Many-cores
CMP / multi-core is “more of the same”:
- Several high-end, complex, powerful processors
- Each processor manages itself
- Each processor can execute the OS
- Good for many unrelated tasks (e.g. Windows)
- Reasonable on 2–8 processors, then it breaks
Many-cores: 100 – 1,000 – 10,000 cores
- Useful for heavy compute-bound tasks
- So far (50 years) many disasters
- But there is light at the end of the tunnel

4 Agenda
- Review 4 cases
- Analyze
- How NOT to make a many-core

5 Many many-core contenders
Ambric, Aspex Semiconductor, ATI GPGPU, BrightScale, ClearSpeed Technologies, Coherent Logix, Inc., CPU Technology, Inc., Element CXI, Elixent/Panasonic, IBM Cell, IMEC, Intel Larrabee, Intellasys, IP Flex, MathStar, Motorola Labs, NEC, Nvidia GPGPU, PACT XPP, Picochip, Plurality, Rapport Inc., Recore, Silicon Hive, Stream Processors Inc., Tabula, Tilera
(many are dead / dying / will die / should die)

6 PACT XPP
- German company, since 1999
- Martin Vorbach, an ex-user of Transputers
(image: 42x Transputer mesh, 1980s)

7 PACT XPP (96 elements)

8 PACT XPP die photo

9 PACT: static mapping, circuit-switched reconfigured NoC

10 PACT ALU-PAE

11 PACT static task mapping (and a debug tool for that)

12 PACT analysis
- Fine granularity computing
- Heterogeneous processors
- Static mapping → complex programming
- Circuit-switched NoC → static reconfigurations → complex programming
- Limited parallelism
- Doesn’t scale easily

13 UK company (name shown as logo)
- Inspired by Transputers (1980s), David May
(image: 42x Transputer mesh, 1980s)

14 322x 16-bit LIW RISC

15-23 (image-only slides, no transcript text)

24 Static task mapping → compile

25 Analysis
- MIMD, fine granularity, homogeneous cores
- Static mapping → complex programming
- Circuit-switched NoC → static reconfigurations → complex programming
- Doesn’t scale easily
- Can we create / debug / understand static mapping on 10K cores?

26 USA company (name shown as logo)
- Based on RAW research @ MIT (A. Agarwal)
- Heavy DARPA funding, university IP
- Classic homogeneous MIMD on mesh NoC
- “Upgraded” Transputers with “powerful” uniprocessor features
- Caches, complex communications
- The “tiles era”

27 Tiles
- Powerful processor
- High freq: ~1 GHz
- High power (0.5W)
- 5-mesh NoC: P-M / P-P / P-IO
- 2.5 levels of cache: L1 + L2; can fetch from the L2 of others
- Variable access time: 1 / 7 / 70 cycles

28 Caches Kill Performance
- Cache is great for a single processor: exploits locality (in time and space)
- Locality only happens locally on many-cores; other (shared) data are buried elsewhere
- Caches help speed up parallel (local) phases
- Amdahl [1967]: the challenge is NOT the parallel phases
- Need to program many looong tasks on the same data. Hard! That’s the software gap in parallel programming
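To make the Amdahl point concrete, a minimal sketch (illustrative numbers and code added here, not taken from the slides): the serial fraction, not the parallel phases, bounds the speedup.

    /* Hedged illustration: Amdahl's law, speedup(N) = 1 / (s + (1 - s)/N)
       for serial fraction s and N cores. Numbers below are examples only. */
    #include <stdio.h>

    static double amdahl_speedup(double s, int n_cores) {
        return 1.0 / (s + (1.0 - s) / n_cores);
    }

    int main(void) {
        /* Even 256 cores give less than 19x if 5% of the work stays serial. */
        printf("%.1f\n", amdahl_speedup(0.05, 256));   /* prints ~18.6 */
        return 0;
    }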

29 Array
- 36-64 processors, MIMD / SIMD
- Total 5+ MB memory, in distributed caches
- High power: ~27W
(die photo)

30 Allows statics
- Pre-programmed streams span multiple processors
- Static mapping

31 Co-mapping: code, memory, routing

32 Static mapping debugger

33 Analysis
- Achieves good performance
- Bad on power
- Hard to scale
- Hard to program

34 Plurality: Israel, Technion research (since the 1980s)

35 Architecture: Part I
- “Anti-local” addressing by interleaving: MANY banks / ports, negligible conflicts, fine granularity
- NO PRIVATE MEMORY: tightly coupled memory, equi-distant (1 cycle each way)
- Fast combinational NoC
(diagram: processors connected through a P-to-M resolving NoC to shared memory and external memory)

36 Architecture: Part II
- Low-latency parallel scheduling enables fine granularity
- Scheduler connected to the processors by a P-to-S scheduling NoC
- (Part I, repeated: “anti-local” interleaved addressing, MANY banks / ports, negligible conflicts, NO PRIVATE MEMORY, tightly coupled equi-distant memory, fast combinational NoC)
(diagram: processors, P-to-M resolving NoC, shared memory, external memory, scheduler, P-to-S scheduling NoC)
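One way to picture the “anti-local” interleaving, as a minimal sketch (the bank count and the exact mapping here are assumptions for illustration, not Plurality’s actual design): consecutive word addresses land in different banks, so many cores accessing shared data rarely collide on the same bank.

    /* Hedged sketch: word-interleaved mapping of a shared address space
       across NUM_BANKS memory banks. NUM_BANKS is an assumed value.      */
    #include <stdint.h>

    #define NUM_BANKS 256   /* assumption, for illustration only */

    typedef struct {
        uint32_t bank;      /* which bank services the access  */
        uint32_t offset;    /* word offset inside that bank    */
    } bank_addr_t;

    static bank_addr_t interleave(uint32_t word_addr) {
        bank_addr_t a;
        a.bank   = word_addr % NUM_BANKS;   /* consecutive words hit different banks */
        a.offset = word_addr / NUM_BANKS;
        return a;
    }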

37 Floorplan (figure)

38 Other possible floorplans?
- NoC to instruction memory
- NoC to data memory

39 Programming model
- Compile into a task-dependency-graph = ‘task map’ plus task codes
- Task maps loaded into the scheduler
- Tasks loaded into memory
Task template:
    regular / duplicable task xxx( dependencies ) join/fork
    {
        … INSTANCE …
        …
    }
(diagram: processors, P-to-M resolving NoC, shared memory, external memory, scheduler, P-to-S scheduling NoC)
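For a concrete picture of what a task map could look like as data, a minimal sketch (this representation is my assumption, not Plurality’s actual format): a graph of task descriptors, each carrying its kind, instance count, code, and dependencies, which the scheduler consumes at run time.

    /* Hedged sketch of a task-map node; field names and layout are assumed. */
    typedef enum { TASK_REGULAR, TASK_DUPLICABLE } task_kind_t;

    typedef struct task {
        task_kind_t   kind;
        int           instances;    /* 1 for a regular task, N for a duplicable one */
        void        (*code)(int);   /* task body; receives its INSTANCE number      */
        struct task **deps;         /* tasks that must complete before this one     */
        int           n_deps;
    } task_t;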

40 Fine Grain Parallelization
Convert (independent) loop iterations

    for ( i=0; i<10000; i++ ) {
        a[i] = b[i]*c[i];
    }

into parallel tasks

    duplicable task XX(…) 10000
    {
        ii = INSTANCE;
        a[ii] = b[ii]*c[ii];
    }

All tasks, or any subset, can be executed in parallel
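For intuition about why low-overhead scheduling is the enabler here, a software analogy (an illustration added to the transcript, not Plurality’s hardware mechanism): dispatching 10,000 tiny instances in software means every core repeatedly claims an instance number from a shared counter, and that per-instance overhead easily dwarfs a single multiply; the hardware scheduler’s job is to make this dispatch essentially free.

    /* Hedged software analogy of instance dispatch, using a C11 atomic counter. */
    #include <stdatomic.h>

    #define N_INSTANCES 10000

    static atomic_int next_instance;                      /* shared instance counter */
    static float a[N_INSTANCES], b[N_INSTANCES], c[N_INSTANCES];

    void worker(void)                     /* imagine every free core running this */
    {
        for (;;) {
            int ii = atomic_fetch_add(&next_instance, 1); /* claim one instance    */
            if (ii >= N_INSTANCES)
                break;
            a[ii] = b[ii] * c[ii];        /* body of the duplicable task           */
        }
    }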

41 Task map example (2D FFT)
(graph legend: duplicable task, conditional task, join / fork task)

42 Another task map (linear solver)

43 Linear solver: simulation snapshots

44 Architectural Benefits
- Shared, uniform (equi-distant) memory: no worry which core does what; no advantage to any core because it already holds the data
- Many-bank memory + fast P-to-M NoC: low latency, no bottleneck accessing shared memory
- Fast scheduling of tasks to free cores (many at once): enables fine-grain data parallelism, impossible in other architectures due to task scheduling overhead and data locality
- Any core can do any task equally well on short notice: scales automatically
- Programming model: intuitive to programmers, easy for an automatic parallelizing compiler

45 Target design (no silicon yet)
- 256 cores
- 500 MHz for 2 MB, slower for 20 MB
- Access time: 2 cycles (+)
- 3 Watts
- Designed to be attractive to programmers (simple), scalable, and to fight Amdahl’s rule

46 Analysis

47 Two (analytic) approaches to many-cores
1) How many cores can fit into a fixed-size VLSI chip?
2) One core is fixed size; how many cores can be integrated?
The following analysis employs approach (1).

48 The VLSI-aware many-core (crude) analysis

49 The VLSI-aware many-core (crude) analysis
(chart over N = 64, 256, 1K, 4K cores)
- power ∝ 1/√N
- perf ∝ √N
- freq ∝ 1/√N
- perf / power ∝ N
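Reading those proportionalities concretely (the arithmetic below is added here; it simply applies the stated scalings to the N values on the slide):

    from N = 64 to N = 256 cores (4x the cores):
        freq         ∝ 1/√N   →  halves
        power        ∝ 1/√N   →  halves
        perf         ∝ √N     →  doubles
        perf / power ∝ N      →  improves 4x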

50 A modified analysis

51 The modified analysis
(chart over N = 64, 256, 1K, 4K cores)
- power ∝ √N
- perf ∝ √N
- freq ∝ 1/√N
- perf / power ∝ const
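Under the modified scalings the same 4x step in core count reads differently (again, arithmetic added here, applied to the slide’s proportionalities):

    from N = 64 to N = 256 cores (4x the cores):
        freq         ∝ 1/√N   →  halves
        power        ∝ √N     →  doubles
        perf         ∝ √N     →  doubles
        perf / power ∝ const  →  unchanged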

52 Things we shouldn’t do in many-cores
- No processor-sensitive code
- No heterogeneous processors
- No speculation:
  - No speculative execution
  - No speculative storage (aka cache)
  - No speculative latency (aka packet-switched or circuit-switched NoC)
- No bottlenecks:
  - No scheduling bottleneck (aka OS)
  - No issue bottlenecks (aka multithreading)
  - No memory bottlenecks (aka local storage)
  - No programming bottlenecks: no multithreading / GPGPU / SIMD / static mappings / heterogeneous processors / …
- No statics:
  - No static task mapping
  - No static communication patterns

53 Conclusions
- Powerful processors are inefficient
- Principles of high-end CPUs are damaging: speculative anything, cache, locality, hierarchy
- Complexity harms (when exposed): hard to program, doesn’t scale
- Hacking (static anything) is hacking: hard to program, doesn’t scale
- Keep it simple, stupid [Pythagoras, 520 BC]

54 Research issues
- Scalability
  - Memory and P-M NoC: async / multicycle; add local memories? how to reduce power there?
  - Scheduler and P-S NoC: active / distributed; schedule a large number of tasks together?
  - Shared memory as a first-level cache to external memory
  - Special instruction memories
- Programming model
  - New PRAM, a special case of functional?
  - Hierarchical task map? Other task maps? Dynamic task map?
  - Algorithm / program mapping to tasks
  - Combine HPC and GP models?
- Power delivery

55 (image-only closing slide)

