Mass Market Applications of Massively Parallel Computing Chas. Boyd
2 Silicon fabrication processes promise an embarrassment of processor cores arriving in the next few years. While multi-core machines can scale to small single-digit factors, academic researchers and the new GPGPU community have identified the data-parallel programming model as a way for applications to scale to 1000s of cores. To date, most of the applications evaluated have been in the technical and scientific arenas, but what about more mass-market applications for data-parallel programming? What features of data-parallel processors are most important to such applications? And how might such processors and their host systems change in order to better target such applications in the future?
3 Outline Projections of future hardware The client computing space Mass-market parallel applications Common application characteristics Interesting processor features
4 The Physics of Silicon The way processors get faster has fundamentally changed No more free performance gains due to clock rate and Instruction-Level Parallelism Yet gates-per-die continues to grow Possibly faster now that clock rate isn’t an issue Estimate: doubling every 2-2.5 years New area means more cores and caches In-order core counts may grow faster than Out-of-Order core counts do
9 A Surplus of Cores ‘More cores than we know what to do with’ Literally Servers scale with transaction counts Technical Computing history of dealing with parallel workloads What are the opportunities for the PC client? Are there mass market applications that are parallelizable?
10 Requirements of Mass Market Space Fairly easy to program and maintain Cannot break on future hardware or operating systems Transparent back-compatibility, fwd compatibility Mass market customers hate regressions! Consumer software must operate for decades Must get faster automatically Why we are here
11 AMD Term: Personal Stream Computing Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.
12 Data-Parallel Processing Key technique, how do we apply it to consumers? What takes lots of data? Media, pixels, audio samples Video, imaging, audio Games
13 Video Decode, encode, transcode Motion Estimation, DCT, Quantization Effects Anything you would want to do to an image Scaling, sepia, DVE effects (transitions) Indexing Search/Recognition (convolutions)
14 Imaging Demosaicing Extract colors with knowledge of sensor layout Segmentation Identify areas of image to process Cleanup Color correction, noise removal, etc. Indexing Identify areas for tagging
15 User Interaction with Media Client applications can/should be interactive Mass market wants full automation ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images Automating media processing requires analysis Recognition, segmentation, image understanding Is this image outdoors or inside? Is this image right-side up? Does it contain faces? Are their eyes red?
16 Imaging Markets In some sense, the broader the market, the more sophisticated the algorithm required Although pro-sumers care more about performance, and they are the ones that write the reviews
Game Rendering Well established at this point, but new techniques keep being discovered Rendering different terms at different spatial scales E.g. Irradiance can be computed more sparsely than exit radiance enabling large increases in the number of incident light sources considered Spherical harmonic coefficient manipulations
22 Game Imaging Post processing Reduction (histogram or single average value) Exposure estimation based on log average luminance Exposure correction Oversaturation extraction Large blurs (proportional to screen size) Depth of field Motion blur
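The exposure-estimation step listed on this slide is a reduction over all pixels. A minimal sketch in Python follows, assuming Rec. 709 luma weights, a small epsilon to avoid log(0), and a mid-grey key of 0.18; none of these details appear in the slides, and the function names are hypothetical.

```python
import math

def luminance(r, g, b):
    # Rec. 709 luma weights (an assumption; the slides name no formula).
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def log_average_luminance(pixels, eps=1e-4):
    # Reduction: each pixel contributes one log term. Up to rounding, the
    # terms can be summed in any order, so the work splits across cores.
    total = sum(math.log(eps + luminance(r, g, b)) for r, g, b in pixels)
    return math.exp(total / len(pixels))

def exposure_scale(pixels, key=0.18):
    # Scale factor that maps the log-average luminance to the chosen key.
    return key / log_average_luminance(pixels)
```

Exposure correction would then multiply every pixel by the returned scale, which is itself a per-pixel (data-parallel) operation.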
26 Game Physics Particles (non-interacting) Particles (interacting) Rigid bodies Deformable bodies Etc.
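The non-interacting particle case is the simplest of these to parallelize, because each particle's new state depends only on its own old state. A minimal one-axis sketch, assuming constant gravity and a semi-implicit Euler step (both are illustrative choices, not taken from the slides):

```python
def step_particle(particle, dt, gravity=-9.8):
    # particle is (position, velocity) along one axis.
    # Semi-implicit Euler: update velocity first, then position.
    pos, vel = particle
    vel = vel + gravity * dt
    pos = pos + vel * dt
    return (pos, vel)

def step_all(particles, dt):
    # A pure map: no particle reads another particle's state, so every
    # element could be processed by a different core.
    return [step_particle(p, dt) for p in particles]
```

Interacting particles break this independence, since each update must gather the state of its neighbours.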
27 Game Processor Evolution [Diagram; labels: Vertex Shader, Pixel Shader, Animation, AI, Texture Creation, Mesh Modeling, Physics; groupings: Content Creation Process, Game Stack; axes: Offline / Real Time, CPU / GPU]
28 Common Properties of Mass Apps Results of client computations are displayed at interactive rates Fundamental requirement of client systems Tight coupling with graphics is optimal Physical proximity to renderer is beneficial Smaller data types are key
29 Support for Image Data Types Pixels, texels, motion vectors, etc. Image data more important than float32s Data declines in size as importance drops Bytes, words, fp11, fp16, single, double Bytes may be declining in importance Hardware support for formatting is useful Clock cycles required by shift/or/mul, etc. cost too much power
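To illustrate the size differences between these data types, here is a small sketch using Python's standard `struct` module. It assumes a four-channel pixel and treats "word" as 16 bits; fp11 is left out because `struct` has no code for it. The helper names are hypothetical.

```python
import struct

# struct format codes: B = unsigned byte, H = 16-bit word,
# e = IEEE fp16, f = float32, d = float64.
FORMATS = {"byte": "B", "word": "H", "fp16": "e", "single": "f", "double": "d"}

def bytes_per_pixel(fmt_name, channels=4):
    # "<" selects little-endian layout with no padding.
    return struct.calcsize("<" + FORMATS[fmt_name] * channels)

def pack_rgba8(r, g, b, a):
    # Clamp four floats to [0, 1] and quantise them into one 32-bit pixel.
    # This is the formatting work the slide suggests doing in hardware.
    return struct.pack(
        "<4B", *(round(max(0.0, min(1.0, c)) * 255) for c in (r, g, b, a))
    )
```

Under these assumptions an RGBA pixel takes 4 bytes as bytes, 8 as fp16, and 16 as float32, so the choice of type directly scales the i/o traffic.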
30 I/O Considerations Most computations other than 3-D rendering are i/o bound on GPUs Their arithmetic intensity is lower than what GPUs are designed for E.g. convolutions Support for efficient data types is very important
32 GPU Arithmetic Intensity Projection 2-3 more process doublings before new memory technologies will help much Stacked die?, 2k wide bus?, optical? Estimate at least 4x increase in the number of compute instructions per read operation Arithmetic intensities reach 64?? This is fine for 3-D rendering Other data-parallel apps need more i/o
33 I/O Patterns Solutions will have a variety of mechanisms to help with worsening i/o constraints Data re-use (at cache size scales) is relatively rare in media applications Read-write use of memory is rare Read-write caches are less critical Streaming data behavior is sufficient Read contention and write contention are the issue, not read-after-write scenarios
34 Interesting Techniques Shared registers Possibly interesting to help with i/o bandwidth Reducing on-chip bandwidth may help power/heat Scatter Can be useful in scenarios that don’t thrash output subsystem Can reduce pressure on gather input system
35 Convolution Key element of almost all image and video processing operations Scaling, glows, blurs, search, segmentation Algorithm has very low arithmetic intensity 1 MAD per sample Also has huge re-use (order of kernel size) Shared registers should reduce memory reads by a factor of kernel size, raising arithmetic intensity accordingly
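The re-use argument can be made concrete by counting memory reads. The sketch below is an illustrative model, not a description of any real hardware: a sliding window stands in for shared registers, and the two hypothetical functions return the outputs together with read and MAD counts for a 1-D convolution.

```python
def convolve_naive(samples, kernel):
    # Every tap re-reads its sample from "memory": 1 MAD per read.
    k = len(kernel)
    out, reads, mads = [], 0, 0
    for i in range(len(samples) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += samples[i + j] * kernel[j]
            reads += 1
            mads += 1
        out.append(acc)
    return out, reads, mads

def convolve_shared(samples, kernel):
    # A sliding window models shared registers: each sample is read from
    # "memory" once and then re-used by up to k neighbouring outputs.
    k = len(kernel)
    window, out, reads, mads = [], [], 0, 0
    for s in samples:
        window.append(s)
        reads += 1
        if len(window) > k:
            window.pop(0)
        if len(window) == k:
            acc = 0.0
            for j in range(k):
                acc += window[j] * kernel[j]
                mads += 1
            out.append(acc)
    return out, reads, mads
```

Both versions perform the same MADs and produce the same outputs; only the read count changes, approaching a factor of the kernel size for long inputs.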
36 Processor Core Types Heterogeneous Many-core In-Order vs. Out-of-Order Distinction arose from targeting 2 different workload design points Software can select ideal core type for each algorithm (workload design point) Since not all cores can be powered anyway Hardware can make trade-offs on: Power, area, performance growth rate
Workload Differences General Processing: small batches, frequent branches, many data inter-dependencies, scalar ops, vector ops Media Processing: large batches, few branches, few data inter-dependencies, scalar ops, vector ops
39 Lesson from GPGPU Research Many important tasks have data-parallel implementations Typically requires a new algorithm May be just as maintainable Definitely more scalable with core count
40 APIs Must Hide Implementations Implementation attributes must be hidden from apps to enable scaling over time Number of cores operating Number of registers available Number of i/o ports Sizes of caches Thread scheduling policies Otherwise, these cannot change, and performance will not grow
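One way to read this requirement is that the caller supplies only a kernel and its data, never a core count or scheduling policy. A minimal sketch using Python's standard `concurrent.futures` follows; the name `parallel_map` is hypothetical and this is not the API the talk has in mind.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_map(kernel, inputs):
    # The worker count is chosen here, inside the implementation. Callers
    # cannot see or depend on it, so it is free to change on new hardware.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order regardless of which
        # worker ran which element or when.
        return list(pool.map(kernel, inputs))
```

For example, `parallel_map(lambda x: x * x, range(5))` gives the same result on a machine with one core or with many.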
41 Order of Thread Execution Shared registers and scatter share a pitfall: It may be possible to write code that is dependent on the order of thread execution This violates scaling requirement The order of thread execution may vary from run-to-run (each frame) Will certainly vary between implementations Cross-vendor and within vendor product lines All such code is considered incorrect
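The pitfall can be shown with a toy model in which `order` stands for the sequence in which threads happen to run. The function names and the sequential simulation are illustrative assumptions; real hardware would use an atomic add for the second variant.

```python
def scatter_overwrite(order, keys, values, size):
    # Order-dependent: when two threads write the same slot, whichever
    # runs last wins, so the result changes with the schedule.
    out = [0] * size
    for t in order:
        out[keys[t]] = values[t]
    return out

def scatter_add(order, keys, values, size):
    # Order-independent: integer addition is commutative and associative,
    # so every schedule produces the same totals.
    out = [0] * size
    for t in order:
        out[keys[t]] += values[t]
    return out
```

With keys `[0, 1, 0, 1]` and values `[1, 2, 3, 4]`, running the threads forwards and backwards gives different results for the overwrite version but identical results for the additive one.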
42 System Design Goals Enable massively parallel implementations Efficient scaling to 1000s of cores No blocking/waiting No constraints on order of thread execution No read-after-write hazards Enable future compatibility New hardware releases, new operating systems
43 Other Computing Paradigms CPU-originated: Lock-based, Lockless Message Passing Transactional Memory May not scale well to 1000s of cores GPU Paradigms CUDA, CtM May not scale well over time
44 High Level APIs Microsoft Accelerator Google PeakStream RapidMind Acceleware Stream processing Brook, Sequoia
45 Additional GPU Features Linear Filtering 1-D, 2-D, 3-D floating point array indices Image and video data benefit Triangle Interpolators Address calculations take many clocks Blenders Atomic reduction ops reduce ordering concerns 4-vector operations Vector data, syntactic convenience
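Linear filtering with a floating-point array index can be sketched in software as follows. This 1-D version assumes samples sit at integer coordinates and that out-of-range indices clamp to the edges; real GPU samplers have their own texel-centre and addressing conventions, which the slides do not specify.

```python
import math

def sample_linear_1d(data, x):
    # Clamp the fractional index to the valid range of the array.
    x = max(0.0, min(float(len(data) - 1), x))
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(data) - 1)
    t = x - i0
    # Blend the two nearest samples by the fractional part of the index.
    return data[i0] * (1.0 - t) + data[i1] * t
```

In software this costs a floor, a clamp, two reads, and a blend per sample, which is the kind of per-access overhead that dedicated filtering hardware removes.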
46 Processor Opportunities Client computing performance can improve Client space is a large un-tapped opportunity for parallel processing Hardware changes required are minimal and fairly obvious Fast display, efficient i/o, scalable over time
47 Notes Conversely, the only client applications that are massively parallel enough to scale to this number of cores are those that are data-parallel and therefore find in-order cores sufficient for that additional work Then power constraints dictate in-order cores are the ones to be used