Mass Market Applications of Massively Parallel Computing Chas. Boyd.

Similar presentations

Presentation on theme: "Mass Market Applications of Massively Parallel Computing Chas. Boyd."— Presentation transcript:

1 Mass Market Applications of Massively Parallel Computing Chas. Boyd

2 Silicon fabrication processes promise an embarrassment of processor cores arriving in the next few years. While multi-core machines can scale to small single-digit factors, academic researchers and the new GPGPU community have identified the data-parallel programming model as a way for applications to scale to 1000s of cores. To date, most of the applications evaluated have been in the technical and scientific arenas, but what about more mass-market applications for data-parallel programming? What features of data-parallel processors are most important to such applications? And how might such processors and their host systems change in order to better target such applications in the future?

3 Outline
- Projections of future hardware
- The client computing space
- Mass-market parallel applications
- Common application characteristics
- Interesting processor features

4 The Physics of Silicon
- The way processors get faster has fundamentally changed
  - No more free performance gains due to clock rate and instruction-level parallelism
- Yet gates-per-die continues to grow
  - Possibly faster now that clock rate isn't an issue
  - Estimate: doubling every years
- New area means more cores and caches
- In-order core counts may grow faster than out-of-order core counts do

5 (image only)

6 The Old Story

7 (image only)

8 (image only)

9 A Surplus of Cores
- 'More cores than we know what to do with', literally
- Servers scale with transaction counts
- Technical computing has a history of dealing with parallel workloads
- What are the opportunities for the PC client?
- Are there mass-market applications that are parallelizable?

10 Requirements of the Mass Market Space
- Fairly easy to program and maintain
- Cannot break on future hardware or operating systems
  - Transparent backward compatibility, forward compatibility
  - Mass-market customers hate regressions!
  - Consumer software must operate for decades
- Must get faster automatically
  - Why we are here

11 AMD Term: Personal Stream Computing
- Actually nothing like 'stream processing' as used by Stanford's Brook, etc.

12 Data-Parallel Processing
- Key technique; how do we apply it to consumers?
- What takes lots of data? Media: pixels, audio samples
  - Video, imaging, audio
  - Games

13 Video
- Decode, encode, transcode
  - Motion estimation, DCT, quantization
- Effects
  - Anything you would want to do to an image
  - Scaling, sepia, DVE effects (transitions)
- Indexing
  - Search/recognition: convolutions

14 Imaging
- Demosaicing
  - Extract colors with knowledge of sensor layout
- Segmentation
  - Identify areas of image to process
- Cleanup
  - Color correction, noise removal, etc.
- Indexing
  - Identify areas for tagging

15 User Interaction with Media
- Client applications can/should be interactive
  - Mass market wants full automation
  - 'Pro-sumer' wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images
- Automating media processing requires analysis
  - Recognition, segmentation, image understanding
  - Is this image outdoors or inside? Is this image right-side up?
  - Does it contain faces? Are their eyes red?

16 Imaging Markets
- In some sense, the broader the market, the more sophisticated the algorithm required
- Although pro-sumers care more about performance, and they are the ones who write the reviews

17 FFT Performance

18 Game Applications of Massive Parallelism
- Rendering
- Imaging
- Physics
- IK
- AI

19 Ultima Underworld 1993

20 Dark Messiah

21 Game Rendering
- Well established at this point, but new techniques keep being discovered
- Rendering different terms at different spatial scales
  - E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered
- Spherical harmonic coefficient manipulations

22 Game Imaging
- Post processing
- Reduction (histogram or single average value)
  - Exposure estimation based on log-average luminance
- Exposure correction
- Oversaturation extraction
- Large blurs (proportional to screen size)
  - Depth of field
  - Motion blur
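The log-average luminance reduction mentioned above can be sketched in Python as a geometric mean over pixel luminances; the mid-grey `key` value of 0.18 is a common convention assumed here, not something the slide specifies:

```python
import math

def log_average_luminance(lums, eps=1e-4):
    # Reduction pass: geometric mean of pixel luminances.
    # eps avoids log(0) on pure-black pixels.
    total = sum(math.log(eps + l) for l in lums)
    return math.exp(total / len(lums))

def exposure_scale(lums, key=0.18):
    # Scale factor mapping the scene's log-average luminance
    # onto an assumed mid-grey 'key' value.
    return key / log_average_luminance(lums)
```

On a GPU, the sum inside `log_average_luminance` would be the parallel reduction step the slide lists.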

23 Half-Life 2

24 Half-Life 2

25 Half-Life 2

26 Game Physics
- Particles: non-interacting
- Particles: interacting
- Rigid bodies
- Deformable bodies
- Etc.

27 Game Processor Evolution (diagram)
- Workload labels: vertex shader, pixel shader, animation, AI, texture creation, mesh modeling, physics
- Axes: content creation process / game stack; offline / real time; CPU / GPU

28 Common Properties of Mass-Market Apps
- Results of client computations are displayed at interactive rates
  - Fundamental requirement of client systems
- Tight coupling with graphics is optimal
  - Physical proximity to renderer is beneficial
- Smaller data types are key

29 Support for Image Data Types
- Pixels, texels, motion vectors, etc.
  - Image data more important than float32s
- Data declines in size as importance drops
  - Bytes, words, fp11, fp16, single, double
  - Bytes may be declining in importance
- Hardware support for formatting is useful
  - Clock cycles spent on shift/or/mul, etc. cost too much power
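As a rough illustration of the formatting cost, unpacking a packed 32-bit RGBA8 pixel into normalized floats in software takes a shift, a mask, and a multiply per channel; the byte layout below (R in the high byte) is an assumption for illustration:

```python
def unpack_rgba8(pixel):
    # Software unpack of a packed 32-bit RGBA8 value (R in the high
    # byte, an assumed layout). Each channel costs a shift, a mask,
    # and a multiply -- the kind of per-pixel work the slide suggests
    # belongs in fixed-function format-conversion hardware.
    r = ((pixel >> 24) & 0xFF) / 255.0
    g = ((pixel >> 16) & 0xFF) / 255.0
    b = ((pixel >> 8) & 0xFF) / 255.0
    a = (pixel & 0xFF) / 255.0
    return (r, g, b, a)
```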

30 I/O Considerations
- Unlike 3-D rendering, most GPU computations are I/O bound
- Arithmetic intensity is lower than in rendering workloads
  - Convolutions
- Support for efficient data types is very important

31 GPU Arithmetic Intensity Projection

32 GPU Arithmetic Intensity Projection
- 2-3 more process doublings before new memory technologies will help much
  - Stacked die? 2k-wide bus? Optical?
- Estimate at least a 4x increase in the number of compute instructions per read operation
  - Arithmetic intensities reach 64?
- This is fine for 3-D rendering
- Other data-parallel apps need more I/O
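The projection above reduces to simple arithmetic; the starting intensity of 16 ops/read used here is inferred from the slide's "4x increase ... reach 64", not stated directly:

```python
def projected_intensity(current, process_doublings, bandwidth_growth=1.0):
    # Compute throughput roughly doubles per process generation while
    # memory bandwidth stays nearly flat, so arithmetic intensity
    # (compute instructions per read) grows ~2x per doubling.
    return current * (2 ** process_doublings) / bandwidth_growth

# Assumed starting point of 16 ops/read; two doublings gives 64.
projected = projected_intensity(16, 2)
```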

33 I/O Patterns
- Solutions will have a variety of mechanisms to help with worsening I/O constraints
- Data re-use (at cache-size scales) is relatively rare in media applications
- Read-write use of memory is rare
  - Read-write caches are less critical
  - Streaming data behavior is sufficient
- Read contention and write contention are the issue, not read-after-write scenarios

34 Interesting Techniques
- Shared registers
  - Possibly interesting to help with I/O bandwidth
  - Reducing on-chip bandwidth may help power/heat
- Scatter
  - Can be useful in scenarios that don't thrash the output subsystem
  - Can reduce pressure on the gather input system

35 Convolution
- Key element of almost all image and video processing operations
  - Scaling, glows, blurs, search, segmentation
- Algorithm has very low arithmetic intensity: 1 MAD per sample
- Also has huge re-use (on the order of the kernel size)
  - Shared registers should cut off-chip reads by a factor of the kernel size
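A minimal 1-D convolution sketch makes both claims concrete: one multiply-add per sample-tap pair, and each input sample re-read roughly kernel-size times, which is the re-use shared registers could keep on-chip:

```python
def convolve1d(signal, kernel):
    # Naive 1-D convolution: 1 MAD per (output sample, tap) pair.
    # Each input sample is read ~len(kernel) times from 'memory',
    # the re-use the slide says shared registers can capture.
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]  # one multiply-add
        out.append(acc)
    return out
```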

36 Processor Core Types
- Heterogeneous many-core: in-order vs. out-of-order
- Distinction arose from targeting two different workload design points
- Software can select the ideal core type for each algorithm (workload design point)
  - Since not all cores can be powered anyway
- Hardware can make trade-offs on power, area, and performance growth rate

37 Workloads (chart: code branchiness plotted against local vs. streaming memory access, with CPUs and GPUs as the two regions)

38 Workload Differences
- General processing: small batches, frequent branches, many data inter-dependencies, scalar ops
- Media processing: large batches, few branches, few data inter-dependencies, vector ops

39 Lesson from GPGPU Research
- Many important tasks have data-parallel implementations
- Typically requires a new algorithm
  - May be just as maintainable
  - Definitely more scalable with core count

40 APIs Must Hide Implementations
- Implementation attributes must be hidden from apps to enable scaling over time:
  - Number of cores operating
  - Number of registers available
  - Number of I/O ports
  - Sizes of caches
  - Thread scheduling policies
- Otherwise these cannot change, and performance will not grow

41 Order of Thread Execution
- Shared registers and scatter share a pitfall: it may be possible to write code that depends on the order of thread execution
- This violates the scaling requirement
- The order of thread execution may vary from run to run (each frame)
  - Will certainly vary between implementations, cross-vendor and within vendor product lines
- All such code is considered incorrect
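A toy simulation (all names hypothetical) shows why such code is incorrect: two "threads" scattering to the same location produce different results under different execution orders, so the program's output changes from run to run:

```python
def run_scatter(exec_order, writes, buf_size):
    # Simulate a scatter pass: 'thread' t performs the write
    # writes[t] = (index, value). Last writer wins, so the result
    # depends on execution order -- the dependence the slide forbids.
    buf = [0] * buf_size
    for t in exec_order:
        idx, val = writes[t]
        buf[idx] = val
    return buf

writes = [(0, 1), (0, 2)]            # two threads target the same element
first = run_scatter([0, 1], writes, 1)
second = run_scatter([1, 0], writes, 1)
```

`first` and `second` differ, which is exactly the cross-run divergence a scalable data-parallel API must rule out.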

42 System Design Goals
- Enable massively parallel implementations
  - Efficient scaling to 1000s of cores
  - No blocking/waiting
  - No constraints on order of thread execution
  - No read-after-write hazards
- Enable future compatibility
  - New hardware releases, new operating systems

43 Other Computing Paradigms
- CPU-originated: lock-based, lockless, message passing, transactional memory
  - May not scale well to 1000s of cores
- GPU paradigms: CUDA, CTM
  - May not scale well over time

44 High-Level APIs
- Microsoft Accelerator
- Google Peakstream
- RapidMind
- Acceleware
- Stream processing: Brook, Sequoia

45 Additional GPU Features
- Linear filtering
  - 1-D, 2-D, 3-D floating-point array indices
  - Image and video data benefit
- Triangle interpolators
  - Address calculations take many clocks
- Blenders
  - Atomic reduction ops reduce ordering concerns
- 4-vector operations
  - Vector data, syntactic convenience
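The 1-D case of linear filtering with a floating-point index can be sketched as follows, as a software stand-in for what the texture unit does in fixed function:

```python
def sample_linear_1d(data, x):
    # Linear filtering at floating-point index x: blend the two
    # nearest samples by the fractional part of x, clamping at the
    # edge (one of several possible border policies, assumed here).
    i = int(x)
    t = x - i
    if i + 1 >= len(data):
        return float(data[-1])
    return data[i] * (1.0 - t) + data[i + 1] * t
```

The 2-D and 3-D cases repeat the same blend per axis, which is why doing it in shader code costs many instructions per fetch.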

46 Processor Opportunities
- Client computing performance can improve
- The client space is a large untapped opportunity for parallel processing
- Hardware changes required are minimal and fairly obvious
  - Fast display, efficient I/O, scalable over time

47 Notes
- Conversely, the only client applications that are massively parallel enough to scale to this number of cores are those that are data-parallel, and therefore find in-order cores sufficient for that additional work
- Power constraints then dictate that in-order cores are the ones to be used
