1 ParCo'13 September 2013 Atomic computing - a different perspective on massively parallel problems Andrew Brown, Rob Mills, Jeff Reeve, Kier Dugan University.


1 ParCo'13, September 2013. Atomic computing - a different perspective on massively parallel problems. Andrew Brown, Rob Mills, Jeff Reeve, Kier Dugan (University of Southampton, UK; adb@ecs.soton.ac.uk); Steve Furber (University of Manchester, UK; sbf@cs.manchester.ac.uk)

2 Outline: Machine architecture; Programming model; Atomic computing (finite difference time marching, neural simulation, pointwise computing); Where next?

3 Machine architecture: triangular mesh of nodes, connected as a 256 x 256 toroid.

4 One SpiNNaker node: 6 bi-directional comms links; core farm (1 Monitor); System... NoC; RAM; watchdogs; off-die SDRAM.
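A minimal Python sketch of how six links per node yield a triangular mesh wrapped into a toroid. The slides state only "6 bi-directional comms links" and a 256 x 256 toroid; the link offsets (E, NE, N, W, SW, S) and the function name `neighbours` are assumptions made for illustration.

```python
# Sketch only: the link directions are an assumed convention, not taken from the slides.
SIZE = 256  # nodes per side of the toroid (from the slides)

# Six offsets that turn a square grid into a triangular mesh: E, NE, N, W, SW, S
LINK_OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 0), (-1, -1), (0, -1)]

def neighbours(x, y, size=SIZE):
    """Return the six neighbours of node (x, y), wrapping around at the edges."""
    return [((x + dx) % size, (y + dy) % size) for dx, dy in LINK_OFFSETS]

print(neighbours(0, 0))
```

Because the offset set is closed under negation, every link is bidirectional: if B is a neighbour of A, then A is a neighbour of B.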

5 Figure: 128 Mbyte DDR SDRAM.

6 Physical construction. 48-node board: 48 x 18 cores/node = 864 cores. Final machine: 256 x 256 nodes x 18 cores/node = 1,179,648 cores.
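The core counts quoted on this slide can be checked with a few lines of Python. The constant names are illustrative; the numbers are the ones given in the slide.

```python
CORES_PER_NODE = 18    # from the slide
NODES_PER_BOARD = 48   # from the slide
MESH_SIDE = 256        # final machine is a 256 x 256 mesh of nodes

board_cores = NODES_PER_BOARD * CORES_PER_NODE
machine_nodes = MESH_SIDE * MESH_SIDE
machine_cores = machine_nodes * CORES_PER_NODE

print(board_cores, machine_nodes, machine_cores)
```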

7 SpiNNaker machines. 10^3 machine: 864 cores, 1 PCB, 75 W. 10^4 machine: 10,368 cores, 1 rack, 900 W (12 PCBs: largest configuration possible for operation without aircon). 10^5 machine: 103,680 cores, 1 cabinet, 9 kW. 10^6 machine: 1M cores, 10 cabinets, 90 kW (largest configuration possible for operation with forced air, no water cooling).

8 Outline: Machine architecture; Programming model; Atomic computing (finite difference time marching, neural simulation, pointwise computing); Where next?

9 Programming model: a conventional multi-processor program. The problem is represented as a network of programs with a certain behaviour, embodied as data structures and algorithms in code; compile, link; binary files are loaded into instruction memory on an MPI farm (or similar) connected by Myrinet (or similar). Messages are addressed at runtime from arbitrary process to arbitrary process. The interface presented to the application is a homogeneous set of processes of arbitrary size; process can talk to process by messages under application software control.

10 ...on SpiNNaker. The problem (Circuit under Simulation, CuS) is defined as a graph, torn into two components: the CuS topology, embodied as hardware route tables in the nodes; and the circuit device behaviour, embodied as software event handlers running on cores.

11 SpiNNaker execution. The problem is represented as a network of nodes with a certain behaviour, and is split into two parts. The behaviour of each node is embodied as an interrupt handler in code; compile, link; binary files are loaded into core instruction memory. The abstract problem topology is loaded into firmware routing tables. Messages launched at runtime take a path defined by the firmware router. The code says "send message" but has no control over where the output message goes. It knows the original source of the message that woke up the handler but not the path by which it was delivered.
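The split between topology (route tables) and behaviour (handlers) can be mimicked in a few lines of Python. This is a toy sketch under stated assumptions, not the real router or API: `ROUTE_TABLE`, `handler`, `run` and the device names are all hypothetical. The point it illustrates is that the handler calls `send` without naming a destination, and learns the source of an incoming message but nothing about its path.

```python
from collections import deque

# Problem topology: kept apart from the handler code, keyed by source device.
# All names here are hypothetical; this mimics the idea, not the real router.
ROUTE_TABLE = {"a": ["b", "c"], "b": ["c"], "c": []}

received = []  # log of (destination, source, payload) for inspection

def handler(me, source, payload, send):
    """Device behaviour: knows who sent the message, not how it travelled."""
    received.append((me, source, payload))
    if payload > 0:
        send(me, payload - 1)  # "send message": no destination is given

def run(first_source, payload):
    queue = deque()

    def send(source, value):
        # The fabric, not the handler, decides where the packet goes.
        for dest in ROUTE_TABLE[source]:
            queue.append((dest, source, value))

    send(first_source, payload)
    while queue:
        dest, source, value = queue.popleft()
        handler(dest, source, value, send)

run("a", 1)
print(received)
```

Changing the problem graph means editing `ROUTE_TABLE` only; the handler code is untouched.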

12 Event-driven software. Packet arrival: initiate DMA. DMA of connection data complete: process inputs, insert device delay, generate outputs? Real time: timer interrupt. Figure: three interrupt handlers labelled Priority 1, Priority 2 and Priority 3, plus a sleeping state: packet received interrupt, fetch_connection_data(); DMA completion interrupt, update_stimulus(); timer millisecond interrupt, update_device_state(); sleeping event, goto_sleep().

13 Managing interrupts. Figure labels: stack (resides in DTCM), stack controller, priority request queue (resides in DTCM), interrupt request arrives, interrupt handler executing, (un)mask interrupt instruction, queueable requests, non-queueable requests. Queueable requests are pulled off the top of the priority queue in order unless a non-queueable request jumps in and pushes the stack.
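A minimal Python sketch of the queueable half of this scheme: requests are posted to a priority queue and pulled off in priority order, with arrival order preserved within a priority. The event names and the priority numbers in `PRIORITY` are assumptions for illustration, and the pre-emption by non-queueable requests (the stack push) is deliberately not modelled.

```python
import heapq
from itertools import count

# Assumed priorities for illustration (lower number = more urgent).
PRIORITY = {"packet_received": 1, "dma_complete": 2, "timer_tick": 3}

class RequestQueue:
    """Priority request queue for queueable interrupt requests."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps arrival order within a priority

    def post(self, event):
        heapq.heappush(self._heap, (PRIORITY[event], next(self._seq), event))

    def run(self):
        handled = []
        while self._heap:
            _, _, event = heapq.heappop(self._heap)
            handled.append(event)  # a real handler would run here
        return handled  # queue empty: the core would go back to sleep

q = RequestQueue()
for e in ["timer_tick", "dma_complete", "packet_received", "dma_complete"]:
    q.post(e)
order = q.run()
print(order)
```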

14 Design intention. Packet propagation fast: 0.1 us/node hop. Software handlers fast: 200 MHz ARM9, BUT code is small, so most of the time most cores are idle and most packet queues are lightly loaded. What's it cost? Figure: wallclock send-receive time vs message size (bytes), comparing the Southampton Iridis cluster (800 nodes, 4 cores/node; core-core inter node and core-core intra node) with SpiNNaker (0.1 us/node hop). Values shown on the plot: 3 us, 12 us, 4 Gbyte/s, 1.8 Gbyte/s, 0.1 us, 0.03 Gbyte/s.

15 Outline: Machine architecture; Programming model; Atomic computing (finite difference time marching, neural simulation, pointwise computing); Where next?

16 Atomic computing: anything that can be transformed into a large number of simple processes with asynchronous short-range communication. Particle and field: one process/field point, one process/particle; field moves the particle, particle bends the field. Finite difference: one process/element. Discrete simulation: one process/device. Continuous simulation: one process/node, one process/connection. Neural simulation: one processor/10^3 neurons. Ray tracing: one process/pixel.

17 Mapping problem graph to compute mesh. Multiple devices per core. Connection topology (circuit) embodied in node route tables. Device states stored locally in relevant cores. Generic: independent of problem domain. Figure: problem graph (circuit) mapped onto nodes, cores and connections.

18 Discrete simulation, conventional (single core). Future events are inserted into a central time-ordered queue as they are computed; the next event is popped from the queue head. Figure: source S drives gates g1, g2, g3, g4 with delays δ = 1, δ = 8, δ = 2. Queue snapshots show entries g1 at time 1, g2 at time 2, g3 at time 4 and g4 at time 9, as simulation time advances t:=1, t:=2, t:=4. g2 and g4 are inserted in any order, but the queue is ordered in time. Timeline axis: 1 to 11, one row per gate g1 to g4.
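The central time-ordered queue can be sketched with Python's `heapq`. The netlist in `FANOUT` is a hypothetical reading of the slide's queue entries (g1 at time 1, g2 at time 2, g3 at time 4, g4 at time 9) and delays (1, 8, 2); it is an assumption, not the circuit from the talk.

```python
import heapq

# Hypothetical netlist consistent with the slide's queue entries:
# driver -> [(driven gate, delay)]
FANOUT = {"g1": [("g2", 1), ("g4", 8)], "g2": [("g3", 2)], "g3": [], "g4": []}

def simulate(first_gate, start_time):
    queue = [(start_time, first_gate)]   # central time-ordered event queue
    trace = []
    while queue:
        t, gate = heapq.heappop(queue)   # next event popped from the queue head
        trace.append((t, gate))
        for target, delay in FANOUT[gate]:
            # future events are inserted as they are computed
            heapq.heappush(queue, (t + delay, target))
    return trace

trace = simulate("g1", 1)
print(trace)
```

The events for g2 and g4 are pushed in netlist order, yet g4 (time 9) is popped last: insertion order does not matter because the heap keeps the queue ordered in time.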

19 Discrete simulation, distributed. Overhead: a complex choreography of synchronisation signals and anti-events to maintain causality. Inter-core messages are conventionally expensive, but cheap on SpiNNaker.

20 Discrete simulation: dynamic load balancing. Simulation of a simulation. Iridis: 800 nodes, 4 cores/node. Dynamic re-mapping of CuS devices to physical cores during simulation.

21 Finite difference time marching (conventional). At each time point, in every spatial dimension, for each grid point: { ... }. The body {...} may be simple, but there's a lot of it.

22 Finite differences (SpiNNaker). One handler/mesh point. Computation data driven. Solution trajectories non-deterministic. Steady state valid. Convergence?
void ihr() {
  Recv(val, port);                // Handler awoken by arrival of changed neighbour state
  ghost[port] = val;              // Stencil updated: the value WILL BE different
  oldtemp = mytemp;               // Store current state
  mytemp = fn(ghost);             // Compute new state
  if (oldtemp == mytemp) return;  // If nothing has changed, keep quiet
  Send(mytemp);                   // Tell neighbours I've changed
}
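The handler can be exercised off-line with a small Python simulation of a 1D rod held at fixed end temperatures. This is a sketch under several assumptions that are not in the slides: the grid size, the boundary values, `fn` taken as the mean of the two ghost values, a tolerance in place of exact float equality, and a per-link mailbox that keeps only the newest undelivered value so the number of simulated messages stays small.

```python
import random

N = 8                       # interior mesh points (hypothetical size)
LEFT, RIGHT = 0.0, 100.0    # fixed boundary temperatures (hypothetical)
TOL = 1e-10                 # floats rarely match exactly, so use a tolerance

temp = [0.0] * N
ghost = [[0.0, 0.0] for _ in range(N)]  # ghost[i] = [left value, right value]
mailbox = {}                # (dest, port) -> newest undelivered value (assumed coalescing)

def send(i, value):
    """Broadcast point i's new state to whichever neighbours exist."""
    if i > 0:
        mailbox[(i - 1, 1)] = value     # arrives on the left neighbour's right port
    if i < N - 1:
        mailbox[(i + 1, 0)] = value     # arrives on the right neighbour's left port

def ihr(i, port, value):
    """Handler woken by a changed neighbour value."""
    ghost[i][port] = value                        # stencil updated
    old = temp[i]                                 # store current state
    temp[i] = 0.5 * (ghost[i][0] + ghost[i][1])   # compute new state
    if abs(temp[i] - old) <= TOL:
        return                                    # nothing changed: keep quiet
    send(i, temp[i])                              # tell neighbours I've changed

rng = random.Random(0)
mailbox[(0, 0)] = LEFT          # the boundaries inject their fixed values
mailbox[(N - 1, 1)] = RIGHT
while mailbox:
    # Deliver in arbitrary order: the trajectory varies, the steady state does not.
    key = rng.choice(sorted(mailbox))
    value = mailbox.pop(key)
    ihr(key[0], key[1], value)

expected = [LEFT + (RIGHT - LEFT) * (i + 1) / (N + 1) for i in range(N)]
print([round(t, 6) for t in temp])
```

The loop ends when no messages remain in flight, which is the data-driven notion of convergence: the system simply goes quiet. Changing the seed alters the order of updates but not the linear steady-state profile in `expected`.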

23 Finite differences. Canonical 2D square grid: diagonal temperature profile vs iteration (figure).

24 Solution times (figure).

25 Reliable computing on unreliable computers. Finite difference grid sites mapped to a faulty core: the algorithm 'self-heals' around the unresponsive core.

26 Neural simulation. SpiNNaker maps a user-defined graph to the machine topology. 10^6 (million core) machine: 10^9 devices on 10^6 cores, i.e. 1000 neurons per processor. It's just another discrete system?

27 Neural simulation (SpiNNaker). Devices (neurons) are represented by a differential equation, integrated in real time, with integration timestep << equation time constants. Therefore: solution correct and technique stable.
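A minimal Python sketch of why a timestep much smaller than the time constant gives an accurate, stable integration. The slides do not name a neuron model; the leaky integrator dv/dt = (drive - v)/tau, forward Euler, and the values of `TAU` and `DRIVE` are assumptions for illustration. The 1 ms step mirrors the millisecond timer interrupt mentioned in the event-driven software slide.

```python
import math

TAU = 0.020    # time constant in seconds (hypothetical value)
DT = 0.001     # 1 ms step, matching the millisecond timer tick
DRIVE = 1.0    # constant input in arbitrary units (hypothetical)

def euler_trace(steps, dt=DT, tau=TAU, drive=DRIVE):
    """Forward Euler on dv/dt = (drive - v) / tau, starting from v = 0."""
    v, trace = 0.0, []
    for _ in range(steps):
        v += dt * (drive - v) / tau
        trace.append(v)
    return trace

steps = 100
numeric = euler_trace(steps)
exact = [DRIVE * (1.0 - math.exp(-(k + 1) * DT / TAU)) for k in range(steps)]
worst = max(abs(a - b) for a, b in zip(numeric, exact))
print(worst)
```

With dt/tau = 0.05 each step multiplies the remaining error by 0.95, so the numerical trace stays bounded and tracks the exact exponential closely. Forward Euler on this equation only becomes unstable once dt exceeds 2 tau.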

28 Neural simulation (SpiNNaker). Figure: inputs s1, s2, ..., sn are summed (Σs) under a clock. Individual message frequencies < real-time clock. Superposition of all inputs: exact timing = fn(neuron:core), i.e. independent of CuS (bad), BUT message latency << CuS time constants (so it doesn't matter). Change of neuron state is derived locally and stored until the next (real) timestep; it is broadcast (or not) at the next (real) timestep.

29 Brian validation. Apparent phase lag in timer ticks between simulations because of the reporting latency out of SpiNNaker.

30 Large simulations (SpiNNaker). Simulation of large neural aggregates in real time. Large: design intention is 10^9 neurons on a 10^6 core machine.

31 Pointwise computing. Matrix operations: 1 core/matrix element. Complexity: trade off N operations for N cores.

32 LU decomposition. Figure: three 4 x 4 grids of cores (i,j); each entry is labelled with the time step (in parentheses) at which it is evaluated. Annotation: message contains l41, evaluated after time step 2. First grid (factorisation): u11 (1), u12 (1), u13 (1), u14 (1), l21 (2), l31 (2), l41 (2), l42 (2), u22 (3), u23 (3), l32 (4), u24 (4), u33 (5), u34 (5), l43 (6). Second grid (terms in y): y1 (1), l21 y1 (2), l31 y1 (2), l41 y1 (2), y2 (3), l32 y2 (4), l42 y2 (4), y3 (5), l43 y3 (6). Third grid (terms in x): x4 (1), u14 x4 (2), u24 x4 (2), u34 x4 (2), x3 (3), u13 x3 (4), u23 x3 (4), x2 (5), u12 x2 (6).
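A short Python sketch of the step structure this slide suggests: in a Doolittle factorisation (no pivoting), a whole row of U can be evaluated in one step and a whole column of L in the next, since the entries within a step do not depend on each other. The function name `lu_pointwise`, the `step_of` bookkeeping and the test matrix are illustrative assumptions. The step numbers it produces agree with labels visible on the slide such as u11 (1), l21 (2), u22 (3) and l32 (4), though not every label in the transcript fits this pattern.

```python
def lu_pointwise(a):
    """Doolittle LU without pivoting.

    Returns (lower, upper, step_of), where step_of records the step at which
    each entry is fixed if every matrix element had its own core.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    step_of = {}
    step = 0
    for k in range(n):
        step += 1                          # row k of U: entries are independent
        for j in range(k, n):
            acc = sum(lower[k][m] * upper[m][j] for m in range(k))
            upper[k][j] = a[k][j] - acc
            step_of[("u", k + 1, j + 1)] = step
        if k == n - 1:
            break
        step += 1                          # column k of L: entries are independent
        for i in range(k + 1, n):
            acc = sum(lower[i][m] * upper[m][k] for m in range(k))
            lower[i][k] = (a[i][k] - acc) / upper[k][k]
            step_of[("l", i + 1, k + 1)] = step
    return lower, upper, step_of

A = [[4, 3, 2], [8, 8, 5], [4, 7, 9]]      # small example chosen to factor exactly
L, U, step_of = lu_pointwise(A)
print(L)
print(U)
```

Run sequentially this still costs O(n^3) operations; the point of the sketch is that only 2n - 1 steps are needed when each inner loop is spread across cores.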

33 Conjugate gradient. At every search trajectory inflection, an O(n) vector.matrix product needs to be computed.
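A compact, pure-Python conjugate gradient sketch shows where that product sits: there is exactly one matrix-vector product per iteration, and each row's dot product is independent of the others. The helper names and the 2 x 2 test system are illustrative assumptions, not material from the talk.

```python
def matvec(a, x):
    # Each row's dot product is independent, so it could run on its own core.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(a, b, tol=1e-12, max_iter=100):
    """Solve a x = b for a symmetric positive definite matrix a."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - a x, with x = 0
    p = list(r)              # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs <= tol:
            break
        ap = matvec(a, p)    # the one matrix-vector product per iteration
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)
```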

34 Life... the Universe, and everything

35 Outline: Machine architecture; Programming model; Atomic computing (finite difference time marching, neural simulation, pointwise computing); Where next?

36 Where next? Neural simulation: robotics; modelling of auditory and visual systems; cognitive disorders. Physics applications: computational fluid dynamics; thermal modelling; plasmas; inverse field problems; computational chemistry.

