Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.

Comparison of Cell and POWER5 Architectures for a Flocking Algorithm: A Performance and Usability Study
CS267 Final Project
Jonathan Ellithorpe, Mark Howison, Daniel Killebrew

Agent-Based Model of Flocking
Flocking agents follow simple rules:
- Don't crowd other agents.
- Align your velocity with your neighbors' average velocity.
- Move toward the center of gravity of your neighbors.
- Move stochastically.
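The four rules can be sketched as one acceleration update per agent. This is a minimal illustration, not the project's code: the function name `flocking_acceleration`, the dict-based agent layout, and all weights and the crowding radius are assumptions chosen for readability.

```python
import random


def flocking_acceleration(agent, neighbors, rng, crowd_radius=1.0,
                          w_avoid=1.0, w_align=0.5, w_center=0.1, w_noise=0.05):
    """Return (ax, ay) for one agent. Agents are dicts with x, y, vx, vy.

    All weights and crowd_radius are hypothetical tuning parameters.
    """
    ax = ay = 0.0
    if neighbors:
        n = len(neighbors)
        # Rule 1: don't crowd. Push away from neighbors inside crowd_radius.
        for other in neighbors:
            dx = agent["x"] - other["x"]
            dy = agent["y"] - other["y"]
            dist2 = dx * dx + dy * dy
            if 0.0 < dist2 < crowd_radius ** 2:
                ax += w_avoid * dx / dist2
                ay += w_avoid * dy / dist2
        # Rule 2: steer toward the neighbors' average velocity.
        avg_vx = sum(o["vx"] for o in neighbors) / n
        avg_vy = sum(o["vy"] for o in neighbors) / n
        ax += w_align * (avg_vx - agent["vx"])
        ay += w_align * (avg_vy - agent["vy"])
        # Rule 3: move toward the neighbors' center of gravity.
        cx = sum(o["x"] for o in neighbors) / n
        cy = sum(o["y"] for o in neighbors) / n
        ax += w_center * (cx - agent["x"])
        ay += w_center * (cy - agent["y"])
    # Rule 4: move stochastically (small uniform noise).
    ax += w_noise * rng.uniform(-1.0, 1.0)
    ay += w_noise * rng.uniform(-1.0, 1.0)
    return ax, ay
```

Passing a seeded `random.Random` as `rng` keeps runs reproducible, which helps when comparing serial and parallel versions.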

Serial Implementation: The Grid
- Spatial decomposition into a moving grid that follows the agents’ center of gravity
- Performs better than the naïve implementation during flock formation
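A moving grid of this kind might be built as follows. This is a sketch under stated assumptions: the name `build_moving_grid`, the square grid shape, and the choice to clamp stragglers into border cells are illustrative and not taken from the project.

```python
def build_moving_grid(positions, cell_size, cells_per_side):
    """Bin agents into a square grid centered on their center of gravity.

    positions: list of (x, y) tuples. Returns (origin, grid), where grid
    maps (row, col) to a list of agent indices.
    """
    n = len(positions)
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    # Re-center the grid on the center of gravity at every step.
    half = cell_size * cells_per_side / 2.0
    origin_x, origin_y = cx - half, cy - half
    grid = {}
    for idx, (x, y) in enumerate(positions):
        col = int((x - origin_x) // cell_size)
        row = int((y - origin_y) // cell_size)
        # Assumption: agents outside the grid are clamped into border cells.
        col = min(max(col, 0), cells_per_side - 1)
        row = min(max(row, 0), cells_per_side - 1)
        grid.setdefault((row, col), []).append(idx)
    return (origin_x, origin_y), grid
```

With agents binned this way, each agent only needs to be compared against agents in its own and adjacent cells rather than against every other agent.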

OpenMP on POWER5: Spatial Layout
- The parallel for construct defaults to rows
- A Hilbert curve provides better load-balancing
(Figure: Hilbert curve layouts for 8x8, 16x16, and 32x32 grid sizes.)
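One way to obtain a Hilbert ordering of grid cells is the standard bit-manipulation mapping from (x, y) to a distance along the curve. The sketch below assumes a power-of-two grid side; `hilbert_index` and `partition_cells` are hypothetical names, and handing each thread a contiguous run of the curve is an assumption about how the layout would be used.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n x n grid (n a power of two) to its
    distance along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_cells(n, num_threads):
    """Split the cells, in Hilbert order, into contiguous chunks per thread."""
    cells = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda c: hilbert_index(n, c[0], c[1]))
    chunk = -(-len(cells) // num_threads)  # ceiling division
    return [cells[i:i + chunk] for i in range(0, len(cells), chunk)]
```

Because consecutive cells on the curve are always spatial neighbors, each chunk is a compact region rather than a long row, so a dense flock is more likely to be spread over several threads.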

OpenMP on POWER5: Performance

OpenMP on POWER5: Profile

QuadTree
- Two-dimensional dynamic spatial decomposition
- When a square reaches capacity, split it up
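The split-on-capacity behavior can be sketched with a small quadtree class. The class name `Quad`, the rule "split when the count exceeds capacity", and the `min_size` guard (which stops coincident points from splitting forever) are assumptions for illustration.

```python
class Quad:
    """A square region that splits into four children when over capacity."""

    def __init__(self, x, y, size, capacity, min_size=1e-6):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.capacity = capacity
        self.min_size = min_size  # assumption: never split below this size
        self.points = []
        self.children = None

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _child_for(self, px, py):
        half = self.size / 2.0
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[2 * row + col]

    def _split(self):
        half = self.size / 2.0
        self.children = [
            Quad(self.x + c * half, self.y + r * half, half,
                 self.capacity, self.min_size)
            for r in (0, 1) for c in (0, 1)
        ]
        # Push the stored points down into the new children.
        points, self.points = self.points, []
        for px, py in points:
            self._child_for(px, py).insert(px, py)

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)
```

Dense regions end up with small leaves and sparse regions with large ones, which is what makes the decomposition dynamic.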

QuadTree balancing
- Unbalanced code still has some speedup because the total simulation space is divided among more processors
- Mass flock movement requires balancing the quadtree among threads by reassigning areas of the simulation space

QuadTree optimizations
- Can adjust the maximum number of occupants before splitting a cell, as well as the minimum number before recombining a cell
- A lower max prevents spurious inter-boid computation
- A higher minimum prevents checking more quads for interaction than necessary
- Min and max that are too close cause too much quad splitting/recombining
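The cost of thresholds that sit too close together can be seen by counting restructuring events for a single region whose population fluctuates. This is a toy model: `count_restructures` and the occupancy series are hypothetical and only illustrate the hysteresis idea.

```python
def count_restructures(occupancy_series, max_occupants, min_occupants):
    """Count split and recombine events for one region over time."""
    is_split = False
    events = 0
    for occupancy in occupancy_series:
        if not is_split and occupancy > max_occupants:
            is_split = True   # too crowded: split the cell
            events += 1
        elif is_split and occupancy < min_occupants:
            is_split = False  # too sparse: recombine the children
            events += 1
    return events


# A population hovering around 8 agents.
series = [7, 9, 7, 9, 7, 9]
thrash = count_restructures(series, max_occupants=8, min_occupants=8)
stable = count_restructures(series, max_occupants=8, min_occupants=4)
```

With both thresholds at 8 the region restructures on almost every step, while a gap between the thresholds lets it split once and stay split.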

Cell Broadband Engine Architecture
- Developed by Sony, Toshiba, IBM
- 8 SPEs, 1 PPE
- PS3 has 7 SPEs (annoying)
- High bandwidth interconnect (205GB/s peak)

Hardware support for Communication
SPEs to PPE, or PPE to SPEs:
- SPE Mailboxes (32-bit messages)
  - 4 inbound, 2 outbound (total)
  - Can use mailboxes to talk SPE-SPE, but must set up memory mapping
- DMA Transfers
  - Must be 16B aligned
  - Transfer from main memory to local store
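The 16-byte rule mostly shows up as padding arithmetic when laying out agent data for transfer. The sketch below is a simplified check, assuming both the address and the size of a transfer must be multiples of 16 bytes; the helper names are hypothetical, and this is host-side arithmetic, not SPE code.

```python
DMA_ALIGNMENT = 16  # bytes, per the 16B alignment requirement


def align_up(n, alignment=DMA_ALIGNMENT):
    """Round n up to the next multiple of alignment (a power of two)."""
    return (n + alignment - 1) & ~(alignment - 1)


def is_dma_friendly(address, size, alignment=DMA_ALIGNMENT):
    """Simplified check: address and size are both multiples of alignment."""
    return address % alignment == 0 and size % alignment == 0


# Example: a record of six 4-byte floats is 24 bytes and needs padding.
record_bytes = 6 * 4
padded_bytes = align_up(record_bytes)
```

Padding each record to a multiple of 16 bytes keeps every record in an array aligned, provided the array itself starts on a 16-byte boundary.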

Flocking on SPEs, first go
- Used Function-Offload parallel programming model
- Shipped off call to interact_fish() to 4 SPEs (must use pthreads to do this)
- Each gets pointers to data in main memory
- DMA in the data, calculate ax, ay, write back

Performance

Had 5 more goes at it
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model

Performance
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model
Still a lot of performance enhancing options
1) SIMDization of code: need SOA, not AOS
2) Reducing branch penalty on SPE: branch hint statements
3) Minimize agents transfer
4) QuadTree on SPEs
5) SPE->SPE communication
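The SOA-versus-AOS point can be illustrated with a layout conversion. In the sketch below, an array of structures (one tuple per agent) becomes a structure of arrays (one contiguous float array per field), so that four consecutive single-precision values of the same field would fill one 128-bit SIMD register. The function names and the (x, y, vx, vy) field order are assumptions, and the inner loop only mimics what a single vector instruction would do.

```python
from array import array

FIELDS = ("x", "y", "vx", "vy")  # assumed per-agent field order


def aos_to_soa(agents):
    """Convert a list of (x, y, vx, vy) tuples into one float array per field."""
    return {name: array("f", (a[i] for a in agents))
            for i, name in enumerate(FIELDS)}


def advance_x_in_lanes(soa, dt, lanes=4):
    """Update x += vx * dt in groups of `lanes` contiguous elements.

    Each group stands in for one SIMD multiply-add over 4 floats.
    """
    xs, vxs = soa["x"], soa["vx"]
    for start in range(0, len(xs), lanes):
        stop = min(start + lanes, len(xs))
        for i in range(start, stop):
            xs[i] += vxs[i] * dt
```

In the AOS layout the four x values needed for one vector operation are interleaved with the other fields and would have to be gathered first, which is why SIMDization calls for SOA.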

Arch Nemesis: Mailbox Waiting

Defeating Mailbox Waiting

Lastly, Usability
- 256KB LS = BAD
- Mostly low level “generic” C functions
- Weird context swapping
- Programmer intimate w/ hardware :(
- High memory bandwidth
- Code overlay (demand paging)
- Virtual Caches
- SPEs can run different code
- Programmer intimate w/ hardware :)

Questions? Mark Howison Jonathan Ellithorpe Daniel Killebrew
