Insomniac’s SPU Best Practices Hard-won lessons we’re applying to a 3rd generation PS3 title Mike Acton Eric Christensen GDC 08
Introduction What will be covered... –Understanding SPU programming –Designing systems for the SPUs –SPU optimization tips –Breaking the 256K barrier –Looking to the future
Introduction Isn't it harder to program for the SPUs? –No. –Classical optimization techniques still apply – perhaps even more so than on other architectures. e.g. In-order processing means predictable pipeline. Means easier to optimize. –Both at instruction level and multi-processing level.
Introduction Multi-processing is not new –Trouble with the SPUs usually is just trouble with multi-core. –You can't wish multi-core programming away. It's part of the job.
Introduction But isn't programming for the SPUs different? –The SPU is not a magical beast only tamed by wizards. –It's just a CPU –Get your feet wet. Code something. Highly Recommend Linux on the PS3!
Introduction Seriously though. It's not the same, right? –Not the same if you've been sucked into one of the three big lies of software development...
Introduction The “software as a platform" lie. The "domain-model design" lie. The "code design is more important than data design" lie... The real difficulty is unlearning these mindsets.
Introduction But what's changed? –Old model Big semi truck. Stuff everything in. Then stuff some more. Then put some stuff up front. Then drive away. –New model Fleet of Ford GTs taking off every five minutes. Each one only fits so much. Bucket brigade. Damn they're fast!
Introduction But what about special code management? –Yes, you need to upload the code. So what? Something needs to load the code on every CPU.
Introduction But what about DMA'ing data? –Yes, you need to use a DMA controller to move around the data. Not really different from calling memcpy
SPU DMA vs. PPU memcpy

SPU DMA from main ram to local store:
  wrch $ch16, ls_addr
  wrch $ch18, main_addr
  wrch $ch19, size
  wrch $ch20, dma_tag
  il   $2, MFC_GET_CMD
  wrch $ch21, $2

PPU memcpy from far ram to near ram:
  mr $3, near_addr
  mr $4, far_addr
  mr $5, size
  bl memcpy

SPU DMA from local store to main ram:
  wrch $ch16, ls_addr
  wrch $ch18, main_addr
  wrch $ch19, size
  wrch $ch20, dma_tag
  il   $2, MFC_PUT_CMD
  wrch $ch21, $2

PPU memcpy from near ram to far ram:
  mr $4, near_addr
  mr $3, far_addr
  mr $5, size
  bl memcpy

Conclusion: If you can call memcpy, you can DMA data.
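The same point can be made in portable C. Below is a sketch of a hypothetical dma_get/dma_put/dma_wait API (names invented for illustration, not a real library); each call has the same shape as the MFC channel writes above, but here the transfers are modeled synchronously with memcpy.

```c
#include <string.h>
#include <stdint.h>

/* Illustrative stand-ins for the SPU MFC commands shown above.
 * On real hardware these would write the MFC channels and complete
 * asynchronously; here they are synchronous memcpy wrappers. */
static void dma_get(void *ls_addr, const void *main_addr, size_t size, int tag)
{
    (void)tag;                      /* real code: wrch channels, MFC_GET_CMD */
    memcpy(ls_addr, main_addr, size);
}

static void dma_put(const void *ls_addr, void *main_addr, size_t size, int tag)
{
    (void)tag;                      /* real code: wrch channels, MFC_PUT_CMD */
    memcpy(main_addr, ls_addr, size);
}

static void dma_wait(int tag)
{
    (void)tag;                      /* real code: write tag mask, read ch24  */
}
```

The calling code looks just like a memcpy-based version, except that useful work can be scheduled between the kick and the wait.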
Introduction But what about DMA'ing data? –But with more control about how and when it's sent, retrieved.
SPU Synchronization

(Sync) Wait for DMA to complete:
  il   $2, 1
  shl  $2, $2, dma_tag
  wrch $ch22, $2
  il   $3, MFC_TAG_UPDATE_ALL
  wrch $ch23, $3
  rdch $2, $ch24
Do other productive work while the DMA is happening...

Fence: transfer after previous transfers with the same tag.
  PUTF  – transfer previous before this PUT
  PUTLF – transfer previous before this PUT LIST
  GETF  – transfer previous before this GET
  GETLF – transfer previous before this GET LIST

Barrier: transfer after previous and before next transfers with the same tag.
  PUTB  – fixed order with respect to this PUT
  PUTLB – fixed order with respect to this PUT LIST
  GETB  – fixed order with respect to this GET
  GETLB – fixed order with respect to this GET LIST

Lock line reservation:
  GETLLAR – gets locked line. (PPU: lwarx, ldarx)
  PUTLLC  – puts locked line. (PPU: stwcx, stdcx)
Introduction Bottom line: SPUs are like most CPUs –Basics are pretty much the same. –Good data design decisions and smart code choices see benefits on any platform –Good DMA patterns also mean cache coherency. Better on every platform –Bad choices may work on some, but not others. –Xbox 360, PC, Wii, DS, PSP, whatever.
Introduction And that's what we're talking about today. –Trying to apply smart choices to these particular CPUs for our games. That's what console development is. –What mistakes we've made along the way. –What's worked best.
Understanding the SPUs Rule 1: The SPU is not a co-processor! –Don't think of SPUs as hiding time “behind” a main PPU loop
Understanding the SPUs What “clicked” with some Insomniacs about the SPUs: –“Everything is local” –“Think streams of data” –“Forget conventional OOP” –“Everything is a quadword” –“si intrinsics make things clearer” –“Local memory is really, really fast”
Designing for the SPUs The ultimate goal: Get everything on the SPUs. –Leave the PPU for shuffling stuff around. Complex systems can go on the SPUs –Not just streaming systems –Used for any kind of task –But you do need to consider some things...
Designing for the SPUs Data comes first. –Goal is minimum energy for transformation. –What is energy usage? CPU time. Memory read/write time. Stall time. Input → Transform() → Output
Designing for the SPUs Design the transformation pipeline back to front. –Start with your destination data and work backward. –Changes are inevitable. This way you pay less for them. –An example...
Example: the glass pipeline, front to back vs. back to front.

Front to back (where we started): Simulate Glass → Generate Crack Geometry → igTriangulate.
–Had a really nice looking simulation, but would find out soon that this stage was worthless.
–Then wrote igTriangulate. Oops: the only possible output didn't support the "glamorous" crack rendering.
–Realized that the level of detail from the simulation wasn't necessary, considering that the granularity restrictions (memory, CPU) could not support it.
–Even worse, the inputs being provided to the triangulation library weren't adequate. Needed more information about retaining surface features.
–The rendering part of the pipeline didn't completely support the outputs of the triangulation library.

Back to front (what worked): Render → igTriangulate → Simulate Glass.
–Rendered dynamic geometry using fake mesh data.
–Faked inputs to triangulation and output transformed data to the render stage.
–Wrote the simulation to provide useful (and expected) results to the triangulation library.

Could have avoided rewriting the simulation if the design process had been done in the correct order. Good looking results were arrived at with a much smaller processing and memory impact. The full simulation turned out to be unnecessary since its outputs weren't realistic considering the restrictions of the final stage. Proof that "code as you design" can be disastrous. Working from back to front forces you to think about your pipeline in advance. It's easier to fix problems that live in front of final code. Wildly scattered fixes and data format changes will only end in sorrow.
Designing for the SPUs The data the SPUs will transform is the canonical data. i.e. Store the data in the best format for the case that takes the most resources.
Designing for the SPUs Minimize synchronization –Start with the smallest synchronization method possible.
Designing for the SPUs Simplest method is usually lock-free single reader, single writer queue.
PPU Ordered Write: Write Data → lwsync → Increment Index. SPU Ordered Write: Write Data → Increment Index (with Fence).
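A minimal sketch of that single-reader/single-writer queue in portable C11, assuming a power-of-two ring. The ordering rule from the slide (write the data, then publish the index behind lwsync on the PPU, or a fenced DMA on the SPU) maps to release/acquire here. Names and sizes are illustrative, not Insomniac's actual code.

```c
#include <stdatomic.h>

#define QSIZE 8   /* power of two, illustrative */

typedef struct {
    int items[QSIZE];
    _Atomic unsigned head;   /* advanced only by the single consumer */
    _Atomic unsigned tail;   /* advanced only by the single producer */
} spsc_queue;

static int spsc_push(spsc_queue *q, int v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return 0;            /* full */
    q->items[t % QSIZE] = v;                 /* 1: write the data        */
    atomic_store_explicit(&q->tail, t + 1,   /* 2: publish the index     */
                          memory_order_release);
    return 1;
}

static int spsc_pop(spsc_queue *q, int *out)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return 0;                    /* empty */
    *out = q->items[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}
```

Because each index has exactly one writer, no compare-and-swap is needed; the release store on the index is the whole synchronization protocol.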
Designing for the SPUs Fairly straightforward to load balance –For constant time transforms, just divide into multiple queues –For other transforms, use heuristic to decide times and a single entry queue to distribute to multiple queues.
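The heuristic case above can be sketched as a greedy balancer: estimate a cost per transform, then append each one to whichever queue currently has the least total estimated time. The function names and queue count are invented for illustration.

```c
#define NUM_QUEUES 4   /* illustrative: one queue per SPU */

/* Return the index of the least-loaded queue. */
static int pick_queue(const float load[NUM_QUEUES])
{
    int best = 0;
    for (int i = 1; i < NUM_QUEUES; ++i)
        if (load[i] < load[best]) best = i;
    return best;
}

/* Assign each of n transforms (with estimated costs) to a queue,
 * recording the choice in assignment[]. */
static void balance(const float *cost, int n, int *assignment)
{
    float load[NUM_QUEUES] = {0};
    for (int i = 0; i < n; ++i) {
        int q = pick_queue(load);
        assignment[i] = q;
        load[q] += cost[i];
    }
}
```

For constant-time transforms this degenerates to round-robin, which is the "just divide into multiple queues" case from the slide.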
Designing for the SPUs Then work your way up. –Is there a pre-existing sync point that will work? (e.g. vsync) –Can you split your data into need-to-sync and don't-care?
[Diagram] Effects pipelines compared. Resistance: Fall of Man used immediate effect updates only. Resistance 2 adds immediate and deferred effect updates with reduced sync points. (The two pipelines are broken down on the next two slides.)
Resistance: Fall of Man (immediate effect updates only). [Diagram legend: PPU time that cannot be overlapped; PPU time spent on effect system; PPU time overlapping effects' SPU time.]
PPU: Update Game Objects → Run Immediate Effect Updates → Sync Immediate Effect Updates (likely to stall here, due to limited window in which to update all effects) → Generate Push Buffer To Render Frame → Generate Push Buffer To Render Effects → Finish Push Buffer Setup → Finish Frame Update & Start Rendering.
SPU: Immediate Update. Each effect is a separate SPU job, with effect updates running on all available SPUs (four).
–The number of effects that could render was limited by available PPU time to generate their PBs.
–No effects can be updated till all game objects have updated, so attachments do not lag.
–Visibility and LOD culling done on PPU before creating jobs.
Resistance 2 (immediate & deferred effect updates + reduced sync points). [Diagram legend as on the previous slide.]
PPU: Sync Immediate Updates For Last Frame → Sync Deferred Updates → Update Game Objects → Run Effects System Manager → Post Update Game Objects → Generate Push Buffer To Render Frame → Finish Push Buffer Setup → Finish Frame Update & Start Rendering.
SPU: System Manager → Run Deferred Effect Update/Render (Deferred Update & Render) → Sync Effect System Manager → Run Immediate Effect Update/Render (Immediate Update & Render; can run past end of PPU frame due to reduced sync points).
–Initial PB allocations done on PPU. Single SPU job for each SPU (anywhere from one to three).
–Huge amount of previously unused SPU processing time available.
–Deferred effects are one frame behind, so effects attached to moving objects usually should not be deferred.
–SPU manager handles all visibility and LOD culling previously done on the PPU. Generates lists of instances for update jobs to process.
–Doing the initial PB alloc on the PPU eliminates the need to sync SPU updates before generating the full PB.
–Immediate updates are allowed to run till the beginning of the next frame, as they do not need to sync to finish generating this frame's PB.
–Smaller window available to update immediate effects, so only effects attached to moving objects should be immediate.
Designing for the SPUs Write “optimizable” code. –Often “optimized” code can wait a bit. –Simple, self-contained loops Over as many iterations as possible No branches
Designing for the SPUs Transitioning from "legacy" systems... –We're not immune to design problems –Schedule, manpower, education, and experience all play a part.
Designing for the SPUs Example from RCF... –FastPathFollowers C++ class –And its derived classes –Running on the PPU –Typical Update() method Derived from a root class of all “updatable” types
Designing for the SPUs Where did this go wrong? What rules were broken? –Used domain-model design –Code “design” over data design –No advantage of scale –No synchronization design –No cache consideration
Designing for the SPUs Result: –Typical performance issues –Cache misses –Unnecessary transformations –Didn't scale well –Problems after a few hundred objects updating
Designing for the SPUs Step 1: Group the data together –“Where there's one, there's more than one.” –Before the update() loop was called, –Intercepted all FastPathFollowers and derived classes and removed them from the update list. –Then kept in a separate array.
Designing for the SPUs Step 1: Group the data together –Created new function, UpdateFastPathFollowers() –Used the new list of same type of data –Generic Update() no longer used –(Ignored derived class behaviors here.)
Designing for the SPUs Step 2: Organize Inputs and Outputs –Define what's read, what's write. –Inputs: Position, Time, State, Results of queries, Paths –Outputs: Position, State, Queries, Animation –Read inputs. Transform to Outputs. Nothing more complex than that.
Designing for the SPUs Step 3: Reduce Synchronization Points –Collected all outputs together –Collected any external function calls together into a command buffer Separate Query and Query-Result Effectively a Queue between systems –Reduced from many sync points per “object” to one sync point for the system
Designing for the SPUs Before Pattern: –Loop Objects Read Input 0 Update 0 Write Output Read Input 1 Update 1 Call External Function Block (Sync)
Designing for the SPUs After Pattern (Simplified) –Loop Objects Read Input 0, 1 Update 0, 1 Write Output, Function to Queue –Block (Sync) –Empty (Execute) Queue
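The "write function calls to a queue, execute after the sync" part of the pattern above can be sketched as a command buffer. Commands here are just an opcode plus one argument; a real system would record a function id and a parameter block. All names are invented for illustration.

```c
enum { CMD_PLAY_SOUND, CMD_SPAWN_EFFECT };   /* illustrative opcodes */

typedef struct { int op; int arg; } command;

typedef struct { command cmds[64]; int count; } cmd_queue;

/* Recorded during the update loop (e.g. on the SPU) instead of
 * calling the external function directly. */
static void queue_call(cmd_queue *q, int op, int arg)
{
    q->cmds[q->count].op  = op;
    q->cmds[q->count].arg = arg;
    q->count++;
}

/* Runs after the single sync point; returns how many commands ran. */
static int execute_queue(cmd_queue *q)
{
    int executed = 0;
    for (int i = 0; i < q->count; ++i)
        executed++;                  /* dispatch on cmds[i].op here */
    q->count = 0;
    return executed;
}
```

The payoff is exactly the slide's: many per-object sync points collapse into one per-system sync, and the external (possibly non-re-entrant) calls all happen on one side of it.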
Designing for the SPUs Next: Added derived-class functionality Similarly simplified derived-class Update() functions into functions with clear inputs and outputs. Added functions to deferred queue as any other function. Advantage: Can limit derived functionality based on count, LOD, etc.
Designing for the SPUs Step 4: Move to PPU thread –Now system update has no external dependencies –Now system update has no conflicting data areas (with other systems) –Now system update does not call non-re-entrant functions –Simply put in another thread
Designing for the SPUs Step 4: Move to PPU thread –Add literal sync between system update and queue execution –Sync can be removed because only single reader and single writer to data Queue can be emptied while being filled without collision See: R&D page on multi-threaded optimization
Designing for the SPUs Step 5: Move to SPU –Now completely independent thread –Can be run anytime –Prototype for new SPU system AsyncMobyUpdate Using SPU Shaders
Designing for the SPUs Transitioning from “SPU as coprocessor” model. Example: igPhysics from Resistance to now...
Resistance: Fall of Man Physics Pipeline.
PPU execution: Environment Pre-Update (Resolve Anim+IK) → Environment Update → Collision Update (start coll jobs while building) → Sync Collision Jobs and Process Contact Points → Package Rigid Body Pools (start SPU jobs while packing) → Simulate → Sync Sim Jobs and Process Rigid Body Data (*Blocked!) → Post Update (Transform Anim Joints).
SPU collision jobs (note: one job per object: box, ragdoll, etc.): Collide Prims (generate contacts): AABB tests, triangle intersection, sphere, capsule, etc.; pack contact points.
SPU simulation jobs: Associate rigid bodies through constraints; unpack constraints; generate Jacobian data; solve constraints; pack rigid body data.
*The only time hidden between the start and stop of jobs is the packing of job data. The only other savings come from merely running the jobs on the SPU.
Resistance 2 Physics Pipeline.
PPU work: Environment Update → Triangle Cache Update → Object Cache Update → Start Physics Jobs → Sync Physics Jobs → Update Rigid Bodies.
SPU execution:
–Upload Object Cache.
–Collide Triangles: upload tri-cache, upload RB prims, upload intersect funcs, intersection tests.
–Collide Primitives: upload CO prims, upload intersect funcs, intersection tests.
–Build Simulation Pools: upload physics joints, sort joint types.
–Simulate Pools: upload solver code; for each iteration, per joint type: upload Jacobian generation code, calculate Jacobian data, solve constraints, integrate.
–Post Update: for each physics object: upload anim joints, transform anim joints using rigid body data, send update to PPU.
Optimizing for SPUs Instruction-level optimizations are similar to any other platform –i.e. Look at the instruction set and write code that takes advantage of it.
Optimizing for SPUs Memory transfer optimizations are similar to any other platform –i.e. Organize data for line-length and coherency. Separate read and write buffers wherever possible. –DMA is exactly like cache pre-fetch
Optimizing for SPUs Local memory optimizations are similar to any other platform –i.e. Have a fixed-size buffer, split it into smaller buffers for input, output, temporary data and code. –Organizing 256K is essentially the same process as organizing 256M
Optimizing for SPUs Memory layout –Memory is dedicated to your code. –Memory is local to your code. –Design so you know what will read and write to the memory i.e. DMAs from PPU, other SPUs, etc. –Generally fairly straightforward. –Remember you can use an offline tool to layout your memory if you want.
Optimizing for SPUs Memory layout –But never, ever try to use a dynamic memory allocator. Malloc for dedicated 256K would be ridiculous. OK. Malloc in a console game would be ridiculous.
Optimizing for SPUs Memory layout –Rules of thumb: Organize everything into blocks of 16b. –SPU Reads/Writes only 16b Group same fields together – No “single object” data – Similar to most SIMD. – Similar to GPUs.
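"Group same fields together" is the structure-of-arrays layout: each 16-byte read then feeds four lanes of the same field instead of one object's mixed fields. A sketch with invented types (N and the field names are illustrative):

```c
#define N 8   /* illustrative object count */

/* One object per struct: the "single object" layout to avoid. */
typedef struct { float x, y, z, w; } object_aos;

/* One array per field: same fields grouped together. */
typedef struct {
    float x[N], y[N], z[N], w[N];
} objects_soa;

static void aos_to_soa(const object_aos *in, objects_soa *out, int n)
{
    for (int i = 0; i < n; ++i) {
        out->x[i] = in[i].x;   /* each output array is a contiguous   */
        out->y[i] = in[i].y;   /* run of one field, so a quadword     */
        out->z[i] = in[i].z;   /* load gets four x's, four y's, etc.  */
        out->w[i] = in[i].w;
    }
}
```

As the slide says, this is the same transformation most SIMD targets and GPUs reward; nothing here is SPU-specific.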
Optimizing for SPUs Memory transfer –Usually pretty straightforward –Rules of thumb: Keep everything 128b aligned – Nothing different. Same rule as the PPU. (Cache-line is 128b) Transfer as much data as possible together. Transform together. – Nothing different. Same rule as the PPU. (For cache coherency)
Optimizing for SPUs Memory transfer –Let's dig in to these “rules of thumb” a bit... –Shared alignment between main ram and SPU local memory is going to be faster. (So pick an alignment and stick with it.) –Transfer is done in 128b blocks, so alignment isn't strictly necessary (but no worries about above if it is)
Optimizing for SPUs Number of transfers doesn't really matter (re: biggest transfers possible) but... –You want to transfer 128b blocks, not scattered ones. –You want to minimize synchronization (sync on fewer DMA tags) –You have fewer places to worry about alignment. –You want to minimize scatter/gather. Especially considering TLB misses.
Optimizing for SPUs Memory transfer –Rules of thumb: If scattered reads and writes are necessary, use a DMA list (not individual DMAs) –Advantage over PPU. PPU can't do out-of-order, grouped memory transfer. –Keeps predictability of in-order execution with performance of out-of-order memory transfer.
Optimizing for SPUs Speaking of out-of-order transfers... –Use a DMA fence to dictate order –Reads and writes are interleaved. If you need max transfer performance, issue them separately.
Optimizing for SPUs Memory transfer –Double, Triple buffer optimization –(Show fence example)
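A double-buffer sketch in portable C: kick the transfer for chunk i+1, process chunk i, then swap. On the SPU the kicked transfer would be an MFC get (with a fence if ordering against a prior transfer on the same tag matters); here dma_get is a synchronous stand-in, and all names are illustrative.

```c
#include <string.h>

#define CHUNK 4   /* elements per buffer, illustrative */

/* Synchronous stand-in for an async MFC get. */
static void dma_get(int *dst, const int *src, int n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Sum a stream of nchunks * CHUNK ints using two local buffers. */
static long process_stream(const int *src, int nchunks)
{
    int bufs[2][CHUNK];
    long sum = 0;
    dma_get(bufs[0], src, CHUNK);                    /* prime buffer 0   */
    for (int i = 0; i < nchunks; ++i) {
        int cur = i & 1;
        if (i + 1 < nchunks)                         /* kick next chunk  */
            dma_get(bufs[cur ^ 1], src + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; ++j)              /* work overlaps    */
            sum += bufs[cur][j];                     /* the transfer     */
    }
    return sum;
}
```

Triple buffering adds a third slot so an outgoing put can also be in flight while one buffer fills and one is processed.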
Optimizing for SPUs Code level optimization –Rules of thumb: Know the instruction set Use si intrinsics (or asm) Stick with native types – Clue: There's only one (qword)
Optimizing for SPUs Code level optimization –Rules of thumb: –Code branch free Not just for branch performance. Branch free scalar transforms to SIMD extremely well. –There is a hitch. No SIMD loads or stores. This drives data design decisions.
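"Branch free scalar transforms to SIMD extremely well" because a compare-and-select maps directly to the SPU's ceq/selb pair. Two scalar idioms, sketched in C (assuming the usual arithmetic right shift on signed ints, which is implementation-defined but universal on these targets):

```c
#include <stdint.h>

/* Branch-free select: build an all-ones/all-zeros mask from the
 * comparison and blend with it (the scalar shape of ceq + selb). */
static int32_t select_ge(int32_t a, int32_t b, int32_t if_ge, int32_t if_lt)
{
    int32_t mask = -(int32_t)(a >= b);        /* 0xFFFFFFFF or 0 */
    return (if_ge & mask) | (if_lt & ~mask);
}

/* max(v, 0) without a branch: v >> 31 is all ones iff v is negative. */
static int32_t clamp0(int32_t v)
{
    return v & ~(v >> 31);
}
```

Once a loop body is written this way, vectorizing it is mostly a matter of widening the types; the control flow is already gone.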
Optimizing for SPUs Code level optimization –Examples...
Optimizing for SPUs Example 1: Vector-Matrix Multiply
Vector-Matrix Multiplication: Standard Approach

Multiplying a vector (x,y,z,w) by a 4x4 matrix:

  (x' y' z' w') = (x y z w) * (m00 m01 m02 m03)
                              (m10 m11 m12 m13)
                              (m20 m21 m22 m23)
                              (m30 m31 m32 m33)

The result is obtained by multiplying the x by the first row of the matrix, y by the second, etc. and accumulating these products. This observation leads to the standard method: broadcast each of the x, y, z and w across all 4 components, then perform 4 multiply-add type instructions. Abbreviated versions are possible in the special cases of w=0 and w=1, which occur frequently. All 3 versions are shown below. It's a simple matter to extend this approach to the product of two 4x4 matrices. Note that the w=0 and w=1 cases come into play here when our matrices have (0,0,0,1)^T in the rightmost column.

The general case:
  shufb xxxx, xyzw, xyzw, shuf_AAAA
  shufb yyyy, xyzw, xyzw, shuf_BBBB
  shufb zzzz, xyzw, xyzw, shuf_CCCC
  shufb wwww, xyzw, xyzw, shuf_DDDD
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
  fma   result, wwww, m3, result

Case w=0:
  shufb xxxx, xyz0, xyz0, shuf_AAAA
  shufb yyyy, xyz0, xyz0, shuf_BBBB
  shufb zzzz, xyz0, xyz0, shuf_CCCC
  fm    result, xxxx, m0
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result

Case w=1:
  shufb xxxx, xyz1, xyz1, shuf_AAAA
  shufb yyyy, xyz1, xyz1, shuf_BBBB
  shufb zzzz, xyz1, xyz1, shuf_CCCC
  fma   result, xxxx, m0, m3
  fma   result, yyyy, m1, result
  fma   result, zzzz, m2, result
Vector-Matrix Multiplication Faster Alternatives In the simple case where we only wish to transform a single vector, or multiply a single pair of matrices, the standard approach that was shown would be most appropriate. But frequently we’ll have a collection of vectors or matrices which we wish to multiply by the same matrix, in which case we may be prepared to make sacrifices for the sake of reducing the instruction count.
Vector-Matrix Multiplication: Alternative 1

By simply preswizzling the matrix, we can reduce the number of shuffles needed.

The general case: preswizzle the matrix as

  (m00 m11 m22 m33)
  (m10 m21 m32 m03)
  (m20 m31 m02 m13)
  (m30 m01 m12 m23)

then transform a vector using the sequence:

  rotqbyi yzwx, xyzw, 4
  rotqbyi zwxy, xyzw, 8
  rotqbyi wxyz, xyzw, 12
  fm      result, xyzw, m0_
  fma     result, yzwx, m1_, result
  fma     result, zwxy, m2_, result
  fma     result, wxyz, m3_, result

Case w=0, with (0,0,0,1)^T in the rightmost matrix column: preswizzle the matrix as

  (m00, m11, m22, 0)
  (m10, m21, m02, 0)
  (m20, m01, m12, 0)

This can be done efficiently using selb:

  fsmbi mask_0F00, 0x0F00
  fsmbi mask_00F0, 0x00F0
  selb  m0_, m0, m1, mask_0F00
  selb  m1_, m1, m2, mask_0F00
  selb  m2_, m2, m0, mask_0F00
  selb  m0_, m0_, m2, mask_00F0
  selb  m1_, m1_, m0, mask_00F0
  selb  m2_, m2_, m1, mask_00F0

The vector multiply then only requires 5 instructions:

  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fm    result, xyz0, m0_
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

Case w=1, with (0,0,0,1)^T in the rightmost matrix column: use the same preswizzle as the w=0 case, leaving row 3 unchanged. Again 5 instructions suffice:

  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fma   result, xyz0, m0_, m3
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result
Vector-Matrix Multiplication: Alternative 2

If we're dealing with the general case, we can reduce the instruction count further still. Using the preswizzle:

  (m02, m13, m20, m31)
  (m12, m23, m30, m01)
  (m00, m11, m22, m33)
  (m10, m21, m32, m03)

we can carry out the vector multiply in just 6 instructions:

  rotqbyi yzwx, xyzw, 4
  fm      temp, xyzw, m0_
  fma     temp, yzwx, m1_, temp
  rotqbyi result, temp, 8
  fma     result, xyzw, m2_, result
  fma     result, yzwx, m3_, result

This approach yields no additional benefits for the w=0 and w=1 cases, however.

Conclusion:
–Single vector/matrix times a single matrix: use the Standard Approach.
–Many vectors/matrices times a single matrix: use Alternative 1.
–Many general vectors/matrices (i.e. anything in w) times a single matrix in a pipelined loop: use Alternative 2.
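A scalar C model of the standard approach and of Alternative 1's preswizzle (diagonal-shifted rows plus rotated copies of the vector, matching the rotqbyi/fm/fma sequence), to show the two agree. Row-vector convention, m[row][col]; this is a sketch for checking the math, not SPU code.

```c
typedef struct { float v[4]; } vec4;

/* result = v * M, accumulating v[r] times row r of the matrix. */
static vec4 mul_standard(vec4 a, const float m[4][4])
{
    vec4 r;
    for (int c = 0; c < 4; ++c)
        r.v[c] = a.v[0]*m[0][c] + a.v[1]*m[1][c]
               + a.v[2]*m[2][c] + a.v[3]*m[3][c];
    return r;
}

/* Alternative 1: preswizzled row r holds p[r][c] = m[(r+c)%4][c]
 * (row 0 is the diagonal, as in the slide), and the vector operand
 * for step r is the vector rotated left by r lanes. */
static vec4 mul_preswizzled(vec4 a, const float m[4][4])
{
    float p[4][4];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            p[r][c] = m[(r + c) % 4][c];
    vec4 out;
    for (int c = 0; c < 4; ++c) {
        out.v[c] = 0.0f;
        for (int r = 0; r < 4; ++r)          /* lane (c+r)%4 = rotqbyi */
            out.v[c] += a.v[(c + r) % 4] * p[r][c];
    }
    return out;
}
```

Term by term, step r of the preswizzled version contributes a[(c+r)%4] * m[(r+c)%4][c] to lane c, which is exactly one term of the standard dot product, just visited in a rotated order.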
Optimizing for SPUs Example 2: Matrix Transpose
Matrix Transposition: Standard Approach

A general 4x4 matrix can be transposed in 8 shuffles as follows:

  (x0, y0, z0, w0)    (x0, x1, x2, x3)
  (x1, y1, z1, w1) -> (y0, y1, y2, y3)
  (x2, y2, z2, w2)    (z0, z1, z2, z3)
  (x3, y3, z3, w3)    (w0, w1, w2, w3)

  shufb t0, a0, a2, shuf_AaBb // t0 = (x0, x2, y0, y2)
  shufb t1, a1, a3, shuf_AaBb // t1 = (x1, x3, y1, y3)
  shufb t2, a0, a2, shuf_CcDd // t2 = (z0, z2, w0, w2)
  shufb t3, a1, a3, shuf_CcDd // t3 = (z1, z3, w1, w3)
  shufb b0, t0, t1, shuf_AaBb // b0 = (x0, x1, x2, x3)
  shufb b1, t0, t1, shuf_CcDd // b1 = (y0, y1, y2, y3)
  shufb b2, t2, t3, shuf_AaBb // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CcDd // b3 = (w0, w1, w2, w3)

Many variations are possible by changing the particular shuffles used, but they all end up doing the same thing in the same amount of work. The version shown above is a good choice because it only requires two constants.
Matrix Transposition: Faster 4x4

By using a different set of shuffles, a couple of the shuffles can then be replaced by select-bytes, which has lower latency:

  shufb t0, a0, a1, shuf_AaCc   // t0 = (x0, x1, z0, z1)
  shufb t1, a2, a3, shuf_CcAa   // t1 = (z2, z3, x2, x3)
  shufb t2, a0, a1, shuf_BbDd   // t2 = (y0, y1, w0, w1)
  shufb t3, a2, a3, shuf_DdBb   // t3 = (w2, w3, y2, y3)
  shufb b2, t0, t1, shuf_CDab   // b2 = (z0, z1, z2, z3)
  shufb b3, t2, t3, shuf_CDab   // b3 = (w0, w1, w2, w3)
  selb  b0, t0, t1, mask_00FF   // b0 = (x0, x1, x2, x3)
  selb  b1, t2, t3, mask_00FF   // b1 = (y0, y1, y2, y3)

This version is quicker by 1 cycle, at the expense of requiring more constants.
Matrix Transposition: 3x4 -> 4x3

Here is an example that uses only 6 shuffles:

  (x0, y0, z0, w0)    (x0, x1, x2, 0)
  (x1, y1, z1, w1) -> (y0, y1, y2, 0)
  (x2, y2, z2, w2)    (z0, z1, z2, 0)
                      (w0, w1, w2, 0)

  shufb t0, a0, a1, shuf_AaBb // t0 = (x0, x1, y0, y1)
  shufb t1, a0, a1, shuf_CcDd // t1 = (z0, z1, w0, w1)
  shufb b0, t0, a2, shuf_ABa0 // b0 = (x0, x1, x2, 0)
  shufb b1, t0, a2, shuf_CDb0 // b1 = (y0, y1, y2, 0)
  shufb b2, t1, a2, shuf_ABc0 // b2 = (z0, z1, z2, 0)
  shufb b3, t1, a2, shuf_CDd0 // b3 = (w0, w1, w2, 0)

Note that care must be taken if the destination matrix is the same as the source. In this case the last 2 lines of code must be swapped to avoid prematurely overwriting a2.
Matrix Transposition: 3x3 (reduced latency)

If we seek the lowest latency, this example is 2 cycles quicker than the last example, at the expense of an extra instruction and an extra constant:

  (x0, y0, z0, w0)    (x0, x1, x2, 0)
  (x1, y1, z1, w1) -> (y0, y1, y2, 0)
  (x2, y2, z2, w2)    (z0, z1, z2, 0)

  shufb t0, a1, a2, shuf_0Aa0 // t0 = ( 0, x1, x2, 0)
  shufb t1, a2, a0, shuf_b0B0 // t1 = (y0, 0, y2, 0)
  shufb t2, a0, a1, shuf_Cc00 // t2 = (z0, z1, 0, 0)
  selb  b0, a0, t0, mask_0FFF // b0 = (x0, x1, x2, 0)
  selb  b1, a1, t1, mask_F0FF // b1 = (y0, y1, y2, 0)
  selb  b2, a2, t2, mask_FF0F // b2 = (z0, z1, z2, 0)

Hybrid versions are also possible, which may be of use when trying to balance even vs. odd counts.
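A scalar C model of the 8-shuffle 4x4 transpose: shuf_AaBb interleaves the first halves of two quadwords and shuf_CcDd the second halves, exactly as the shufb constants in the standard approach do. This is a checking aid, not SPU code; aliasing of source and destination is not handled.

```c
/* d = interleave of the first two lanes of s1 and s2. */
static void shuf_AaBb(const float *s1, const float *s2, float *d)
{ d[0] = s1[0]; d[1] = s2[0]; d[2] = s1[1]; d[3] = s2[1]; }

/* d = interleave of the last two lanes of s1 and s2. */
static void shuf_CcDd(const float *s1, const float *s2, float *d)
{ d[0] = s1[2]; d[1] = s2[2]; d[2] = s1[3]; d[3] = s2[3]; }

static void transpose4(const float a[4][4], float b[4][4])
{
    float t0[4], t1[4], t2[4], t3[4];
    shuf_AaBb(a[0], a[2], t0);   /* (x0, x2, y0, y2) */
    shuf_AaBb(a[1], a[3], t1);   /* (x1, x3, y1, y3) */
    shuf_CcDd(a[0], a[2], t2);   /* (z0, z2, w0, w2) */
    shuf_CcDd(a[1], a[3], t3);   /* (z1, z3, w1, w3) */
    shuf_AaBb(t0, t1, b[0]);     /* (x0, x1, x2, x3) */
    shuf_CcDd(t0, t1, b[1]);     /* (y0, y1, y2, y3) */
    shuf_AaBb(t2, t3, b[2]);     /* (z0, z1, z2, z3) */
    shuf_CcDd(t2, t3, b[3]);     /* (w0, w1, w2, w3) */
}
```

Tracing any output lane back through the two interleave stages gives b[r][c] = a[c][r], which is the transpose.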
Optimizing for SPUs Example 3: 8 bit palette lookup –Flip the problem around –Instead of looking up index for each byte... –Loop through the palette and compare each quadword of indices and mask any matching results
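A scalar model of that flipped lookup: for each palette entry, compare it against every index and merge the color where it matches. On the SPU the inner loop would be a quadword compare (ceqb) plus select (selb) over 16 indices at once; here the same compare-and-mask is done one index at a time. The output buffer is assumed zero-initialized.

```c
#include <stdint.h>

static void palette_lookup(const uint8_t *idx, int n,
                           const uint32_t *palette, int pal_count,
                           uint32_t *out)
{
    for (int p = 0; p < pal_count; ++p)      /* loop the palette...     */
        for (int i = 0; i < n; ++i) {
            /* all-ones mask where the index matches (ceqb analogue) */
            uint32_t mask = (uint32_t)-(idx[i] == p);
            /* merge the matching color (selb analogue) */
            out[i] = (out[i] & ~mask) | (palette[p] & mask);
        }
}
```

The win is that the data-dependent loads of a table lookup become a fixed, branch-free sweep whose cost depends only on the palette size, not on the index values.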
Optimizing for SPUs When is it better to use asm? –When you know facts the compiler cannot (and can take advantage of them) –i.e. almost always.
Optimizing for SPUs When is asm really worth it? –Case-by-case. Time, experience, performance, practice. Doesn't it make the code unmaintainable? –Not much different from using intrinsics. –Especially if you use macro-asm tools. –e.g. for register coloring - that's really the tedious part of editing asm.
Optimizing for SPUs Writing asm rules-of-thumb: –Minimize instruction count –Minimize trace latency –(Instruction count takes precedence) –Balance even/odd instruction pipelines –Minimize memory accesses Can block DMA or instruction fetch
The 256K Barrier The solution is simple: –Upload more code when you need it. –Upload more data when you need it. Data is managed by traditional means –i.e. Double, triple fixed-buffers, etc. Code is just data. –Can we manage code the same way we manage data?
SPU Shaders SPU Shaders are: –Fragments of code used in existing systems (Physics, Animation, Effects, AI, etc.) –Code is loaded at location pre-determined by system. –Custom (Data/Interface) for each system. –An expansion of an existing system (e.g. Pipelined stages) –Custom modifications of system data. –Way of delivering feedback to other systems outside the scope of the current system.
SPU Shaders SPU Shaders are NOT: –Generic, general purpose system. –A system of any kind, actually. –Globally scheduled.
SPU Shaders Why is it called a “shader”? –Shares important similarities to GPU shaders: Native code fragments. Part of a larger system. In-context execution. Independently optimizable. –Most important: Concept is approachable.
SPU Shaders “Don't try to solve everyone's problems” –Solutions that try to solve all problems tend to cause more problems than they solve.
SPU Shaders Easy to Implement –Pick stage(s) in system kernel to inject shaders. –Define available inputs and outputs. –Collect common functions. –Compile shaders as data. –Sort instance data based on shader type(s) –Load shader on-demand based on data select. –Call shaders.
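The steps above can be sketched with a function pointer standing in for the loaded fragment. On the SPU the "load" would be a DMA of the fragment into a fixed code buffer; here it is just taking the address of a function. All type and function names are invented for illustration, not Insomniac's interface.

```c
/* Illustrative instance and common-context types. */
typedef struct { float data[4]; } instance;
typedef struct { int dma_tags[2]; } common_ctx;   /* common funcs, tags */

/* The fixed shader entry signature the system defines (Rule 5). */
typedef void (*shader_entry)(instance *batch, int count, common_ctx *common);

/* An example fragment: transforms a batch of instances in place. */
static void scale_shader(instance *batch, int count, common_ctx *common)
{
    (void)common;
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < 4; ++j)
            batch[i].data[j] *= 2.0f;
}

/* The system kernel stage: instances already sorted by shader type,
 * fragment already "loaded"; calling it is just an indirect call. */
static void run_stage(shader_entry shader, instance *batch, int count,
                      common_ctx *common)
{
    shader(batch, count, common);
}
```

Giving the fragment a whole batch at once is Rule 6 in miniature: the fragment gets as many instances as possible, so its inner loop is worth optimizing.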
SPU Shaders What data is being transformed? –What are the inputs? –What are the outputs? –What can be modified?
SPU Shaders Collect the common functions... –Always loaded by the system –e.g. DMA wrapper functions, debugging functions, common transformation functions
SPU Shaders System Shader Configuration... –System knows where the fragments are. –System knows when to call the fragments. –System doesn't know what the fragments do. –Fragments are in main RAM. –Fragments don't need to be fixed.
SPU Shaders System Shader Configuration. Manage fragment memory: –Simplest method: Double buffer, on-demand, fixed maximum size, by-index from array,...
SPU Shaders Create the shader code... “Code is just data” –No special distinguishing feature on the SPUs. Overlays or additional jobs are too complex and heavyweight. –Just want load and execute. –No special system needed.
SPU Shaders Create the shader code.. –Method 1: Shader as PPU header. Compile shader as normal, to obj file. Dump obj file using spu-objdump. Convert dump to header using script. This is what we started with.
SPU Shaders Create the shader code.. –Method 2: Use elf file. Requires extra compile step, but more debugger friendly. This is what we're doing now. –Other methods too, use whatever works for you.
SPU Shaders Calling the shader... Nothing could be easier. – ShaderEntry* shader = (addr of fragment); – shader( data, common );
SPU Shaders Debugging Shaders... –Fragments are small –Fragments have well defined inputs and outputs. –Ideal for unit tests in separate framework. –Test on PS3/Linux box. Alternatives: –Debug on PPU (intrinsics are portable) –Temporarily link in shader.
SPU Shaders Runtime debugging: –Is a problem with the first method. –Using the full elf, have debugging info –Now works transparently in our debugger.
SPU Shaders Rule 1: Don't Manage Data for Shaders –Just give shaders a buffer and fixed size. –Shaders should depend on that size, so leave room for system changes. –Best size depends on system. (Maybe 4K, maybe 32K) –Don't read or write from/to shader buffer.
SPU Shaders Shader data: –System-specific: Multiple lists of instances to modify or transform. Context data. –Shader-internal (“local”): EA passed by system. Fixed buffer. –Shader-shared (“global”): EA passed by system.
SPU Shaders Rule 2: Don't Manage DMA for Shaders –Give fixed number of DMA tags to shader (grab them in the entry function and pass down). Avoid: GetDmaTagFromParentSystem() –Give DMA functions to shaders. To allow system to run with any job manager, or none. –Don't use shader tags for other purposes
SPU Shaders Rule 3: Enforce fixed maximum size for Shader code. –So the system can be maintained. Rule 4: Shaders are always called in a clear, well defined context. –i.e. Part of a larger system.
SPU Shaders Rule 5: Fixed parameter list for shaders, per-system (or sub-system) –Don't want to re-compile all shaders. –Don't want to manage dynamic parameter lists. Rule 6: Shaders should be given as many instances as possible. –More optimizable.
SPU Shaders Rule 7: Don't break the rules. –You'll end up with a new job manager. –You'll end up with a big headache.
SPU Shaders Where are we using these? –Physics, Effects, Animation, Some AI Update Also experimenting with pre-vertex shaders on the SPUs And experimenting with giving some of that control to the artists (Directly generating code from a tool...)
Conclusion It's not that complicated. Good data and good design works well on the SPUs (and will work well anywhere) –Sometimes you can get away with bad design and bad data on other platforms –...for now. Bad design will not survive this generation. Lots of opportunities for optimization.
Credits This was based on the hard work and dedication of the Insomniac Tech Team. You guys are awesome.