
1 Physics in Parallel: Simulation on 7th Generation Hardware. David Wu, Pseudo Interactive.

2 Why are we here? The 7th generation is approaching; we are no longer next gen. We are all scrambling to adapt to the new stuff so that we can stay on the bleeding edge, push the envelope, and take things to the next level.

3 What’s Next Gen? Multiple processors: not entirely new, but more than before. Parallelism: not entirely new, but more than before. Physics: not entirely new, but more than before.

4 Take-Away: so much to cover: general principles, useful concepts, techniques, tips, bad jokes. The goal is to save you time during the transition to... Next Gen.

5 Format for the presentation: every year we discover new ways to communicate information.

6 Patterns: a description of a recurrent problem and of the core of possible solutions. Difficult to write. Too pretentious. Invites criticism.

7 Gems: valuable bits of information. Too 6th Gen.

8 Blog: free form; continuity not required; subjective/opinionated is okay; arbitrary tangents are okay; a catchy title need not match the article; no quality bar. This sounds 7th Gen to me.

9 Disclaimer: my information sources range from press releases, patents, and other blogs on the net to random probabilistic guesses. Much of the information is probably wrong.

10 Multi-threaded programming: I participated in some in-depth discussions on this topic. After weeks of debate, the conclusion was: “Multi-threaded programming is hard.”

11 What is 7th Gen Hardware? Fast. Many parallel processors. Very high peak FLOPS. In-order execution.

12 What is 7th Gen Hardware? High memory latency. Not enough bandwidth. Moderate clock-speed improvements. Not enough memory. CPU-GPU convergence.

13 Hardware usually sucks. Is multi-processor revolutionary? It is kind of here already: hyper-threading, dual processors, the Sega Saturn. Not entirely new, but more than before.

14 Hardware usually sucks. Hardware advances require years of preparatory hype: 3D accelerators, online, SIMD. “Not with a bang but with a whimper.”

15 Hardware usually sucks. The big problem with hardware advances is software. We don’t like to do things that are hard; if there is a big enough payoff, we do it. This time there is a big enough payoff.

16 Types of Parallelism: task parallelism (render + physics); data parallelism (collision detection on two objects at a time); instruction parallelism (multiple elements in a vector). Use all three.

17 Techniques: pipeline, work crew, forking.

18 Pipeline – Task Parallelism: subdivide the problem into discrete tasks, then solve the tasks in parallel, spreading them across multiple processors.

19 Pipeline – Task Parallelism. [Diagram: three threads as pipeline stages. Step 1: thread 0 runs collision detection on frame 3, thread 1 runs logic/AI on frame 2, thread 2 runs integration on frame 1. Step 2: each thread advances one frame: thread 0 to frame 4, thread 1 to frame 3, thread 2 to frame 2.]

20 Pipeline: similar to CPU/GPU parallelism. [Diagram: the CPU works on frame 3 while the GPU renders frame 2; next step, the CPU works on frame 4 while the GPU renders frame 3.]

21 Pipeline: notes. Dependencies are explicit. Communication is explicit, e.g. through a FIFO; this avoids deadlock issues and most race conditions. Load balancing is not great. Does not reduce latency vs. the single-threaded case. A minimal FIFO sketch follows.
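
As a concrete illustration of explicit FIFO communication between pipeline stages, here is a minimal single-producer/single-consumer ring buffer. This is a sketch in modern C++ for clarity, not code from the talk; the capacity and element type are illustrative.

    #include <atomic>
    #include <cstddef>

    // One stage pushes completed work, the next stage pops it; no locks are
    // needed because exactly one thread writes head and exactly one writes tail.
    template <typename T, std::size_t N>
    class SpscFifo {
        T buffer[N];
        std::atomic<std::size_t> head{0};   // next slot to write (producer only)
        std::atomic<std::size_t> tail{0};   // next slot to read (consumer only)
    public:
        bool push(const T& item) {
            std::size_t h = head.load(std::memory_order_relaxed);
            if ((h + 1) % N == tail.load(std::memory_order_acquire))
                return false;               // full: producer does something else
            buffer[h] = item;
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
        bool pop(T& item) {
            std::size_t t = tail.load(std::memory_order_relaxed);
            if (t == head.load(std::memory_order_acquire))
                return false;               // empty: consumer waits or yields
            item = buffer[t];
            tail.store((t + 1) % N, std::memory_order_release);
            return true;
        }
    };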

22 Pipeline: notes. Feedback between tasks is difficult, so it is best for open-loop tasks: secondary dynamics (e.g. a pony tail), effects. Suitable for specialized hardware, because task requirements are cleanly divided.

23 Pipeline: notes. Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design. Adds bandwidth overhead and memory-use overhead to SMP systems that would otherwise communicate via the cache.

24 Work Crew: component-wise division of the system: collision detection, integration, rendering, AI/logic, audio, IO, particle system, fluid simulation.

25 Work Crew – Task Parallelism: similar to pipeline but without explicit ordering. Dependencies are handled on a case-by-case basis; e.g. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization. Components without interdependencies can run asynchronously, e.g. kinematics and AI.

26 Work Crew: suitable for some external processes such as IO, gamepad, sound, sockets. Suitable for decoupled systems: particle simulations that do not affect game play, fluid dynamics, visual damage simulation, cloth simulation.

27 Work Crew: scalability is limited by the number of discrete tasks. Load balancing is limited by the asymmetric nature of the components and their requirements. Higher risk of deadlocks. Higher risk of race conditions.

28 Work Crew: may require double buffering of some data to avoid race conditions (see the sketch below). Poor data coherency, good code coherency.
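
A minimal sketch of the double-buffering idea, with illustrative names: writers fill the back buffer for the next frame while readers see the stable front buffer, and the buffers are flipped at a frame sync point.

    struct PonyTailState { float pos[3]; float vel[3]; };  // illustrative payload

    class DoubleBuffer {
        PonyTailState state[2];
        int front = 0;                      // index readers use this frame
    public:
        const PonyTailState& read() const { return state[front]; }      // any thread
        PonyTailState&       write()      { return state[front ^ 1]; }  // owning thread
        void flip()                       { front ^= 1; }  // at the frame boundary only
    };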

29 Forking – Data Parallelism: perform the same task on multiple objects in parallel. The thread “forks” into multiple threads across multiple processors. All threads repeatedly grab pending objects indiscriminately and execute the task on them. When finished, the threads combine back into the original thread.

30 Forking. [Diagram: a thread forks into threads 0, 1, and 2, which grab objects B, C, and A respectively, then combine back into the original thread.]

31 Forking: task assignment can often be done using simple interlocked primitives, e.g. int i = InterlockedIncrement(&nextTodo); OpenMP adds compiler support for this via pragmas. A sketch of both follows.
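
A minimal sketch of both approaches, assuming Win32’s InterlockedIncrement; Object, ProcessObject, and the counter layout are illustrative names, not the engine’s actual API.

    #include <windows.h>

    struct Object;                      // hypothetical work item
    void ProcessObject(Object& obj);    // hypothetical per-object task

    volatile LONG nextTodo = -1;        // shared counter; first increment returns 0

    void WorkerThread(Object** objects, LONG count) {
        for (;;) {
            LONG i = InterlockedIncrement(&nextTodo);  // atomically claim index i
            if (i >= count)
                break;                                 // every object claimed
            ProcessObject(*objects[i]);
        }
    }

    // The OpenMP form of the same fork, via pragmas:
    //   #pragma omp parallel for schedule(dynamic)
    //   for (int i = 0; i < count; ++i)
    //       ProcessObject(*objects[i]);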

32 Forking is externally synchronous: external callers don’t have to worry about being thread safe; thread-safety requirements are limited to the scope of the code within the forked section. This is a big deal. Good for isolated engine components and middleware.

33 Forking – Example: AI runs in thread 0. The AI calls RayQuery() for a line-of-sight check. RayQuery forks into 6 threads, computes the ray query, and then returns the result through thread 0. The AI, running in thread 0, uses the result.

34 Forking: minimizes latency for a given task. Good data and code coherency. Potentially high synchronization overhead, depending on the coupling. Highly scalable if you have many tasks with few dependencies. Ideal for collision detection.

35 Forking – Batches: reduces inter-thread communication; reduces the potential for load balancing; improves instruction-level parallelism. [Diagram: the thread forks; thread 0 takes objects 0..10, thread 1 takes objects 11..20, thread 2 takes objects 21..30; the threads then combine.]

36 Our Approach: 1) collision detection: forked. 2) AI/logic: single-threaded; 2a) engine calls: forked; 2b) damage effects: contractor queue (all extra threads). 3) integration: forked. 4) rendering: forked/pipeline. Audio: whatever.

37 Multithreaded programming is Hard: solutions that directly expose multiple threads to leaf code are a bad idea. Sequential, single-threaded, synchronous code is the fastest to write and debug. In order to meet schedules, most leaf code will stay this way.

38 Notes on Collision detection: all collision prims are stored in a global search tree, a bounding k-DOP tree with 8 children per node. The most common case is when 0 or 1 children need to be traversed; 8 children results in fewer branches and allows better prefetching.

39 Collision detection: each moving object is a “task”. Each object is independently queried vs. all other objects in the tree. Results are output to a global list of contacts and collisions. To avoid duplicates, moving-object-vs-moving-object collisions are only processed if the active moving object’s memory address is <= the other moving object’s (see the sketch below).
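
A minimal sketch of that duplicate-avoidance test; the types and EmitContact are illustrative names, not the actual engine API.

    struct Prim { bool isMoving; };                 // node in the global tree
    void EmitContact(Prim* active, Prim* other);    // hypothetical output op

    // Called when the active moving object's query overlaps another prim.
    void OnOverlap(Prim* active, Prim* other) {
        // A moving-vs-moving pair is found twice, once per object's query;
        // only the query whose object has the lower-or-equal address reports it.
        if (other->isMoving && active > other)
            return;
        EmitContact(active, other);
    }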

40 Collision detection: threads pop objects off of the todo list one by one using interlocked access until they are all processed. Each query takes O(lg N) time. There is very little data contention: output operations are rare and quick, and task allocation uses InterlockedIncrement. On 2 CPUs with many objects I see an 80% performance increase; hopefully it scales to many CPUs.

41 Collision detection: we try to keep collision code and data in the cache as much as possible. We try to finish collision detection as soon as possible because there are dependencies on it. All threads attack the problem at once.

42 Notes on Integration: the process that steps objects forward in time, in a manner consistent with all contacts and constraints.

43 Integration: each batch of coupled objects is a task, and each batch is solved independently. Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.

44 Integration: when a dynamic object does not interact with other dynamic objects, its batch contains only that object. When dynamic objects interact, they are coupled: their solutions are dependent on each other and they must be solved together.

45 Integration: in some cases, objects can be artificially decoupled. E.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A. Then A can first be solved independently, and the resulting dynamics can be fed into the solution for B. This creates an ordering dependency: A must be solved before B.

46 Integration: when objects are moved they must be updated in the global collision tree. These transactions need to be atomic, which is accomplished with locks/critical sections; ditto for the VSD tree. Task allocation is slightly more complex due to the dependencies. Despite all this we see a 75% performance increase on 2 CPUs with many objects.

47 Integration: we use a discrete Newton solver, which works okay with our task discretization, i.e. one thread per batch. If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations.

48 Transactions: with fine-grained data parallelism, we require many lightweight atomic transactions. For this we use interlocked primitives, critical sections, or spin locks.

49 Transactions: whenever possible, interlocked primitives are used; interlocked primitives are simple atomic transactions on single words. If the transaction is short, a spin lock is used; otherwise a critical section is used. A spin lock is like a critical section, except that it spins rather than sleeps when blocking.

50 CPUs are difficult: there are some processor-specific nuances to consider when writing your own locks. Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. Otherwise the processor might preload old data that changes right before the lock is released.

51 CPUs are difficult: due to out-of-order writes, a store fence or lwsync needs to happen before releasing the lock. Otherwise the unlock might become visible to other threads before the data update does, and another thread might claim the lock and then fetch stale data from its cache, all before the real data arrives.

52 Lock Example. Acquire() looks like:

    while (_InterlockedCompareExchange(&isLocked, 1, 0) != 0) {
        PauseWhileLocked();
    }
    __isync();   // load fence: reads inside the lock must not see pre-lock data

Release() looks like:

    __lwsync();  // store fence: make the data writes visible before the unlock is
    isLocked = 0;

53 CPUs are difficult: on hyper-threaded systems it is important that PauseWhileLocked() puts the thread to sleep so that the other thread(s) can use the complete core. It is also important that you don’t constantly bang on memory while trying to take the lock. If you are going to hold locks for a fair bit of time, a critical section is usually a better choice, as it switches to another thread rather than spinning. A sketch follows.
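
A minimal sketch of what PauseWhileLocked() might do, assuming Win32 on a hyper-threaded x86 core; the spin threshold is an illustrative guess, not a tuned value. The plain read avoids interlocked memory traffic while the lock is held.

    #include <windows.h>

    extern volatile LONG isLocked;      // the lock word from the example above

    void PauseWhileLocked() {
        for (int spin = 0; spin < 64; ++spin) {
            if (isLocked == 0)
                return;                 // looks free: caller retries the interlocked exchange
            YieldProcessor();           // pause hint: frees execution units for the sibling thread
        }
        SwitchToThread();               // held a while: give the core to another software thread
    }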

54 Instruction Parallelism is Good: most relevant processors are pipelined; multiple execution units run in parallel; no out-of-order execution; high execution latency. Most have SIMD intrinsics.

55 Code Scheduling is Good: instruction-level parallelism requires appropriate code scheduling. “Compiler hand-holding” is often necessary to give the compiler more freedom to schedule: loop unrolling; using temporaries rather than member vars or globals; inline functions; __restrict. See the sketch below.
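
A minimal sketch of that hand-holding, with illustrative names: __restrict promises the pointers never alias, and the local temporary keeps the accumulator in a register, so the compiler is free to unroll and schedule the loops.

    void ScaleAll(float* __restrict dst, const float* __restrict src,
                  float s, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * s;        // no aliasing checks: loads/stores reorder freely
    }

    float SumAll(const float* __restrict src, int n) {
        float acc = 0.0f;               // temporary, not a member var or global
        for (int i = 0; i < n; ++i)
            acc += src[i];
        return acc;                     // one store at the end, not one per iteration
    }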

56 Branches are Bad: data-dependent branches are frequent in collision detection. Branches on floating-point results are often very slow, partly due to the long floating-point pipelines. Whenever possible use instructions like fsel, vsel, min, max, etc. to eliminate branches.

57 Branches, e.g. rather than:

    if ((a > b) || (c > d)) { ... }

use:

    if (max(a - b, c - d) > 0) { ... }

58 Branches and GPUs: on earlier GPU hardware, HLSL will emulate all conditionals using predicated instructions. Similar techniques are often beneficial on CPUs:

    if (a >= b) c = d;

could be written as

    c = fsel(a - b, d, c);
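
fsel is the PowerPC floating-point select, fsel(x, a, b) = (x >= 0) ? a : b. A portable sketch for CPUs without the intrinsic; compilers commonly turn this pattern into a conditional move rather than a branch:

    inline float fsel(float x, float a, float b) {
        return (x >= 0.0f) ? a : b;     // select on sign, no data-dependent control flow needed
    }
    // So "if (a >= b) c = d;" becomes "c = fsel(a - b, d, c);".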

59 Hyper Threading? What: one core, more than one simultaneous thread on more than one hardware context, sharing the same execution units and cache. Why: execution units are often idle; extra threads can make better utilization of them.

60 Hyper Threading? Why are execution units idle? Pipeline latency: e.g. if a multiply-add pipe has a throughput of 1 per cycle and a latency of 7 cycles, a 4-element dot product takes 1+6+6+6 = 19 cycles.

61 Pipeline latency is Bad: most of the time, only one stage of the madd pipeline is active and the others are idle. If 4 threads are all doing dot products at once, the time taken is 1+1+1+1+3+1+1+1+3+1+1+1+3+1+1+1 = 22 cycles, roughly 3.4x the single-threaded throughput. Somewhat redundant with out-of-order execution and loop unrolling (see the sketch below).
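
The loop-unrolling alternative mentioned above, as a minimal sketch with illustrative names: four independent accumulators keep four madds in flight within a single thread, much like the four hyper-threads do.

    float Dot(const float* __restrict a, const float* __restrict b, int n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // four independent dependency chains
        int i = 0;
        for (; i + 4 <= n; i += 4) {           // the four madds below are independent
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)
            s0 += a[i] * b[i];                 // leftover elements
        return (s0 + s1) + (s2 + s3);
    }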

62 Memory is Slow: cache misses. When one thread blocks on a cache miss, the other threads can continue running while the cache line is being filled.

63 Branches are Bad (still): data-dependent branches do not mix well with deep pipelines. If the result at the end of the pipeline is needed to determine what to fetch next at the beginning of the pipeline, you get a big bubble. This can be filled by the other threads.

64 Data Locality is Good: it is worth mentioning that cores with multiple threads share L1 caches, so it is usually best to have all threads of a core working on the same code and data set.

65 Hyper Threading And Cell: Cell’s design sidesteps the motivation for hyper-threading in a variety of ways: no cache, stream data in and out ahead of time; lots of registers for loop unrolling; a memory architecture that encourages stream processing, which is conducive to loop unrolling; a complex programming model that makes scheduling optimizations seem relatively easy.

66 Cache Is important on SMP: shared L2 cache, some L1 cache sharing. Big caches are one of the distinct advantages of CPUs vs. GPUs and some consoles. Physics algorithms for collision detection, integration, and constraint resolution require repeated accesses to individual structures, which is ideal for caches.

67 Cache Is important on SMP: shared caches also make inter-thread and inter-core communication less expensive. These points motivate forked models and data parallelism.

68 Memory Latency is a big deal: latency is a real problem; to design a high-performance system we need to treat it as a first-order concern. In this generation we are looking at ~500-cycle penalties on an L2 cache miss and ~50 cycles on an L1 cache miss. The L2 cache is shared between cores; the L1 cache is shared between threads.

69 Memory Latency is a big deal: this also motivates the use of data parallelism. An extreme form of data parallelism is seen in stream processing.

70 GPUs are Fast: GPUs are an effective demonstration of parallelism. The overall system is a pipeline: vertex shader -> triangle setup -> rasterization -> pixel shader -> frame buffer. Within each stage, the model is forked, with many simultaneous threads.

71 Physics on GPUs: there has been a fair bit of research on the topic of using GPUs for physical simulations. A review article is here: http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch29.pdf

72 Eulerian vs. Lagrangian in a Nutshell: Eulerian approaches discretize degrees of freedom (DOF) in space; Lagrangian approaches don’t.

73 Lagrangian Example: each particle has DOF.

74 Eulerian Example: each cell has DOF. A sketch of both follows.
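
A minimal sketch of the two discretizations, with illustrative types: Lagrangian DOF travel with the material, while Eulerian DOF sit at fixed points in space.

    struct Particle { float pos[3]; float vel[3]; };  // Lagrangian: DOF per particle
    Particle particles[10000];                        // carried wherever the material goes

    struct Cell { float vel[3]; float density; };     // Eulerian: DOF per fixed grid cell
    Cell grid[64][64][64];                            // the material flows through the cells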

75 Eulerian vs. Lagrangian: Eulerian algorithms are effective for dense, highly coupled interactive systems: solving Navier-Stokes, computational fluid dynamics; water, smoke, fire.

76 Eulerian vs. Lagrangian: Eulerian interactions are not qualitatively data dependent, so they can run in parallel without feedback. Lagrangian interactions often are data dependent, e.g. collisions.

77 Eulerian vs. Lagrangian: Physically-Based Simulation on Graphics Hardware: http://developer.nvidia.com/docs/IO/8230/GDC2003_PhysSimOnGPUs.pdf

78 Particles: a good example of a Lagrangian technique implemented on stream processors is UberFlow, which implements a fully featured particle system on a GPU. Data-dependent control is difficult; UberFlow uses a data-independent sorting method for collision detection.

79 Particles: http://www.ati.com/developer/Eurographics/Kipfer04_UberFlow_eghw.pdf

80 Solvers and Parallelism: conjugate gradient, Jacobi iterations, Gauss-Seidel, red-black Gauss-Seidel, multigrid. A red-black sketch follows.
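
Red-black Gauss-Seidel is the entry on that list that parallelizes most directly: checkerboard-color the cells so every red cell depends only on black neighbors and vice versa, then fork each half-sweep across threads. A minimal sketch for a Poisson-style 2D stencil, with illustrative names (the h^2 factor is folded into f):

    const int N = 256;
    float u[N][N], f[N][N];              // solution grid and right-hand side

    void HalfSweep(int color) {          // color 0 = red cells, 1 = black cells
        #pragma omp parallel for         // cells of one color are independent
        for (int i = 1; i < N - 1; ++i)
            for (int j = 2 - ((i + color) % 2); j < N - 1; j += 2)
                u[i][j] = 0.25f * (u[i - 1][j] + u[i + 1][j] +
                                   u[i][j - 1] + u[i][j + 1] - f[i][j]);
    }

    void Iterate() { HalfSweep(0); HalfSweep(1); }  // one full Gauss-Seidel pass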

81 Solvers and Parallelism: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid: http://www.multires.caltech.edu/pubs/GPUSim.pdf

82 Programming Languages are not Good: C++, Java, and C# are not ideal for fine-grained parallelism. What’s next: HLSL? Functional languages? Haskell? OpenMP?

