Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.


1 Comparison of Cell and POWER5 Architectures for a Flocking Algorithm: A Performance and Usability Study
CS267 Final Project
Jonathan Ellithorpe, Mark Howison, Daniel Killebrew

2 Agent-Based Model of Flocking
Flocking agents follow simple rules:
- Don't crowd other agents.
- Align your velocity with your neighbors' average velocity.
- Move toward the center of gravity of your neighbors.
- Move stochastically.
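The four rules can be sketched as one acceleration update per agent. This is a minimal illustration, not the project's code: the dict-based agent layout, the weights (`w_sep`, `w_align`, `w_cohere`), `crowd_radius`, and the uniform noise term are all assumptions chosen for readability.

```python
import random


def flocking_acceleration(agent, neighbors, crowd_radius=1.0,
                          w_sep=1.0, w_align=0.5, w_cohere=0.1,
                          noise=0.0, rng=random):
    """Return (ax, ay) for one agent; agents are dicts with x, y, vx, vy.

    All weights and the noise model are hypothetical, for illustration only.
    """
    ax = ay = 0.0
    if neighbors:
        n = len(neighbors)
        # Rule 1: don't crowd. Push away from neighbors closer than crowd_radius.
        for other in neighbors:
            dx = agent["x"] - other["x"]
            dy = agent["y"] - other["y"]
            d2 = dx * dx + dy * dy
            if 0.0 < d2 < crowd_radius * crowd_radius:
                ax += w_sep * dx / d2
                ay += w_sep * dy / d2
        # Rule 2: steer toward the neighbors' average velocity.
        avg_vx = sum(o["vx"] for o in neighbors) / n
        avg_vy = sum(o["vy"] for o in neighbors) / n
        ax += w_align * (avg_vx - agent["vx"])
        ay += w_align * (avg_vy - agent["vy"])
        # Rule 3: move toward the neighbors' center of gravity.
        cx = sum(o["x"] for o in neighbors) / n
        cy = sum(o["y"] for o in neighbors) / n
        ax += w_cohere * (cx - agent["x"])
        ay += w_cohere * (cy - agent["y"])
    # Rule 4: move stochastically (uniform jitter scaled by `noise`).
    ax += noise * rng.uniform(-1.0, 1.0)
    ay += noise * rng.uniform(-1.0, 1.0)
    return ax, ay
```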

3 Serial Implementation: The Grid
- Spatial decomposition into a moving grid that follows the agents’ center of gravity
- Performs better than the naïve implementation during flock formation
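A moving grid of this kind can be sketched by re-centering the grid on the agents' centroid before binning. The function name, the clamping of stragglers to the border cells, and the parameters `cell_size` and `grid_dim` are assumptions made for this example.

```python
def bin_agents(positions, cell_size, grid_dim):
    """Bin (x, y) positions into a grid_dim x grid_dim grid centered on their centroid.

    Returns the grid origin and a dict mapping (row, col) to agent indices.
    """
    if not positions:
        return (0.0, 0.0), {}
    n = len(positions)
    # The grid follows the flock: its center is the agents' center of gravity.
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    half = cell_size * grid_dim / 2.0
    origin_x, origin_y = cx - half, cy - half
    cells = {}
    for idx, (x, y) in enumerate(positions):
        col = int((x - origin_x) // cell_size)
        row = int((y - origin_y) // cell_size)
        # Assumption: agents outside the grid are clamped to the border cells.
        col = min(max(col, 0), grid_dim - 1)
        row = min(max(row, 0), grid_dim - 1)
        cells.setdefault((row, col), []).append(idx)
    return (origin_x, origin_y), cells
```

Because the origin is recomputed from the centroid, translating the whole flock leaves the cell assignment unchanged, so only agents in the same or adjacent cells need to be checked for interaction.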

4 OpenMP on POWER5: Spatial Layout
- By default, the parallel for construct distributes the grid by rows
- A Hilbert curve provides better load balancing
Figure: Hilbert curve layouts for 8x8, 16x16, and 32x32 grid sizes.
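One way to read the Hilbert layout: number the grid cells along the curve, then give each thread a contiguous run of that numbering, so every thread owns a compact patch rather than a set of full rows. The sketch below uses the standard index-to-coordinate conversion for a Hilbert curve; the function names and the even split into chunks are assumptions for illustration, not the project's OpenMP code.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """List all grid cells of an n x n grid in Hilbert-curve order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def split_among_threads(order, num_threads):
    """Give each thread a contiguous segment of the curve (hypothetical scheme)."""
    total = len(order)
    return [order[i * total // num_threads:(i + 1) * total // num_threads]
            for i in range(num_threads)]
```

Consecutive cells on the curve are always grid neighbors, which is why a contiguous segment stays spatially compact.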

5 OpenMP on POWER5: Performance

6 OpenMP on POWER5: Profile

7 QuadTree
- Two-dimensional dynamic spatial decomposition
- When a square reaches capacity, split it up
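A minimal sketch of the split-on-capacity idea follows. The class name, the choice to split once occupancy exceeds `capacity`, and the `min_size` guard against endless splitting of coincident points are assumptions, not details taken from the slides.

```python
class Quad:
    """A square region that splits into four children when it gets too full."""

    def __init__(self, x, y, size, capacity=4, min_size=1e-6):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.capacity = capacity
        self.min_size = min_size  # assumption: stop splitting below this size
        self.points = []
        self.children = None

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity and self.size / 2.0 >= self.min_size:
            self._split()

    def _child_for(self, px, py):
        half = self.size / 2.0
        ix = 1 if px >= self.x + half else 0
        iy = 1 if py >= self.y + half else 0
        return self.children[iy * 2 + ix]

    def _split(self):
        half = self.size / 2.0
        self.children = [Quad(self.x + ix * half, self.y + iy * half, half,
                              self.capacity, self.min_size)
                         for iy in (0, 1) for ix in (0, 1)]
        # Push the stored points down into the new children.
        pts, self.points = self.points, []
        for px, py in pts:
            self._child_for(px, py).insert(px, py)

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)
```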

8 QuadTree balancing
- Unbalanced code still has some speedup because the total simulation space is divided among more processors
- Mass flock movement requires balancing the quadtree among threads by reassigning areas of the simulation space

9 QuadTree optimizations
- Can adjust the maximum number of occupants before splitting a cell, as well as the minimum number before recombining a cell
- A lower max prevents spurious inter-boid computation
- A higher min prevents checking more quads for interaction than necessary
- A min and max that are too close cause too much quad splitting and recombining
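The cost of a min and max that sit too close together can be seen by counting restructuring events for a single region whose occupancy hovers near the thresholds. The function and the occupancy trace below are hypothetical and only model the threshold logic, not the tree itself.

```python
def count_restructures(occupancy_trace, max_occupants, min_occupants):
    """Count split and recombine events for one region over time.

    The region splits when occupancy exceeds max_occupants and recombines
    when occupancy drops below min_occupants (assumed threshold semantics).
    """
    is_split = False
    events = 0
    for occupants in occupancy_trace:
        if not is_split and occupants > max_occupants:
            is_split = True
            events += 1
        elif is_split and occupants < min_occupants:
            is_split = False
            events += 1
    return events


# Occupancy oscillating around 8 agents (made-up data).
trace = [7, 9, 7, 9, 7, 9]
thrashing = count_restructures(trace, max_occupants=8, min_occupants=8)
stable = count_restructures(trace, max_occupants=8, min_occupants=3)
```

With the thresholds equal, the region restructures on almost every step; widening the gap leaves a single split.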

10 Cell Broadband Engine Architecture
- Developed by Sony, Toshiba, and IBM
- 8 SPEs, 1 PPE
- PS3 has 7 SPEs (annoying)
- High-bandwidth interconnect (205 GB/s peak)

11 Hardware support for Communication
SPEs to PPE, or PPE to SPEs
- SPE Mailboxes (32-bit messages)
  - 4 inbound, 2 outbound (total)
  - Can use mailboxes to talk SPE-SPE, but must set up memory mapping
- DMA Transfers
  - Must be 16B aligned
  - Transfer from main memory to local store
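These two constraints shape how work is handed to an SPE: a mailbox message has to fit in 32 bits, and a DMA buffer has to start on a 16-byte boundary. The helpers below only model that arithmetic in Python; they are not Cell SDK calls, and the 8-bit command / 24-bit count message layout is a made-up example.

```python
DMA_ALIGNMENT = 16  # bytes, per the alignment requirement on the slide


def is_dma_aligned(address):
    """True if an address sits on a 16-byte boundary."""
    return address % DMA_ALIGNMENT == 0


def pad_to_dma_size(num_bytes):
    """Round a buffer size up to the next multiple of 16 bytes."""
    return (num_bytes + DMA_ALIGNMENT - 1) & ~(DMA_ALIGNMENT - 1)


def pack_mailbox_message(command, count):
    """Pack a command and an agent count into one 32-bit word.

    Hypothetical layout: top 8 bits are the command, low 24 bits the count.
    """
    if not 0 <= command < (1 << 8):
        raise ValueError("command must fit in 8 bits")
    if not 0 <= count < (1 << 24):
        raise ValueError("count must fit in 24 bits")
    return (command << 24) | count


def unpack_mailbox_message(word):
    """Inverse of pack_mailbox_message."""
    return word >> 24, word & 0xFFFFFF
```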

12 Flocking on SPEs, first go
- Used the Function-Offload parallel programming model
- Shipped off the call to interact_fish() to 4 SPEs (must use pthreads to do this)
- Each SPE gets pointers to data in main memory
- DMA in the data, calculate ax and ay, write back
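The offload pattern (hand each worker a slice, copy the data in, compute, write the result back) can be mimicked with a thread pool. This is an analogy only: the worker function, the list copy standing in for the DMA transfer, and the simple pull-toward-neighbors interaction are assumptions, not the project's interact_fish().

```python
from concurrent.futures import ThreadPoolExecutor


def interact_slice(positions, start, end, radius):
    """Compute (ax, ay) for agents in [start, end) against all agents."""
    local = list(positions)  # stand-in for the DMA transfer into local store
    out = []
    for i in range(start, end):
        xi, yi = local[i]
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(local):
            if j == i:
                continue
            dx, dy = xj - xi, yj - yi
            # Placeholder interaction: pull toward neighbors within radius.
            if dx * dx + dy * dy <= radius * radius:
                ax += dx
                ay += dy
        out.append((ax, ay))
    return start, out


def offload_interactions(positions, num_workers=4, radius=1.0):
    """Split the agents into num_workers slices and gather the results."""
    n = len(positions)
    accel = [None] * n
    bounds = [(w * n // num_workers, (w + 1) * n // num_workers)
              for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(interact_slice, positions, s, e, radius)
                   for s, e in bounds]
        for future in futures:
            start, out = future.result()
            accel[start:start + len(out)] = out  # the "write back" step
    return accel
```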

13 Performance

14 Had 5 more goes at it
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model

15 Performance
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model
Still a lot of performance-enhancing options:
1) SIMDization of code: need SOA, not AOS
2) Reducing branch penalty on SPE: branch hint statements
3) Minimize agents transfer
4) QuadTree on SPEs
5) SPE->SPE communication
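The SOA-versus-AOS point is about memory layout: a SIMD unit wants four consecutive x values side by side, not one agent's x, y, vx, vy. The sketch below converts an array of structures into a structure of arrays and steps it in groups of four to mirror a 4-wide single-precision loop. The tuple layout, function names, and the fixed `dt` argument are assumptions for illustration.

```python
from array import array


def aos_to_soa(agents):
    """Convert a list of (x, y, vx, vy) tuples into four contiguous float arrays."""
    return {
        "x": array("f", (a[0] for a in agents)),
        "y": array("f", (a[1] for a in agents)),
        "vx": array("f", (a[2] for a in agents)),
        "vy": array("f", (a[3] for a in agents)),
    }


def advance_soa(soa, dt, lanes=4):
    """Advance positions in chunks of `lanes` elements, like a 4-wide SIMD loop."""
    n = len(soa["x"])
    for start in range(0, n, lanes):
        end = min(start + lanes, n)
        # On SIMD hardware each of these inner loops would be one vector operation.
        for i in range(start, end):
            soa["x"][i] += soa["vx"][i] * dt
        for i in range(start, end):
            soa["y"][i] += soa["vy"][i] * dt
    return soa
```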

16 Arch Nemesis: Mailbox Waiting

17 Defeating Mailbox Waiting

18 Lastly, Usability
- 256KB LS = BAD
- Mostly low-level “generic” C functions
- Weird context swapping
- Programmer intimate w/ hardware :(
- High memory bandwidth
- Code overlay (demand paging)
- Virtual Caches
- SPEs can run different code
- Programmer intimate w/ hardware :)

19 Questions?
Mark Howison, Jonathan Ellithorpe, Daniel Killebrew

