
1 Data Visualization And Mining Using The GPU Sudipto Guha (Univ. of Pennsylvania) Shankar Krishnan (AT&T Labs - Research) Suresh Venkatasubramanian (AT&T Labs - Research)

2 What you will see in this tutorial… The GPU is fast. The GPU is programmable. The GPU can be used for interactive visualization. How do we abstract the GPU? What makes a GPU program efficient? How various data mining primitives are implemented. Using the GPU for non-standard data visualization.

3 What you will NOT see… Detailed programming tricks. Hacks to improve performance. Mechanics of GPU programming. BUT… we will show you where to find all of this: a plethora of resources is now available on the web, as well as code, toolkits, examples, and more…

4 Schedule 1st Hour: 1:30pm – 1:50pm: Introduction to GPUs (Shankar); 1:50pm – 2:30pm: Examples of GPUs in Data Analysis (Shankar). 2nd Hour: 2:30pm – 2:50pm: Stream Algorithms on the GPU (Sudipto); 2:50pm – 3:20pm: Data mining primitives (Sudipto); 3:20pm – 3:30pm: Questions and short break. 3rd Hour: 3:30pm – 4:00pm: GPU Case Studies (Shankar); 4:00pm – 4:20pm: High-level software support (Shankar); 4:20pm – 4:30pm: Wrap-up and Questions.

5 Animusic Demo (courtesy ATI)

6 But I don’t do graphics ! Why should I care about the GPU ?

7 Two Converging Trends in Computing… The accelerated development of graphics cards: GPUs are improving faster than CPUs, and they are cheap and ubiquitous. An increasing need for streaming computations: the original motivation came from dealing with large data sets, but streaming is also interesting for multimedia applications, image processing, visualization, etc.

8 What is a Stream? An ordered list of data items. Each data item has the same type, like a tuple or record. The length of the stream is potentially very large. Examples: data records in database applications; vertex information in computer graphics; points, lines, etc. in computational geometry.

9 Streaming Model Input is presented as a sequence. The algorithm works in passes: it is allowed one sequential scan over the input per pass and is not permitted to move backwards mid-scan. The workspace is typically o(n), with arbitrary computation allowed. Algorithm efficiency is measured by the size of the workspace and the computation time.
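To make the pass/workspace accounting concrete, here is a minimal sketch (not from the slides) of a one-pass streaming computation in Python: a single sequential scan with O(1) workspace.

```python
def one_pass_stats(stream):
    """Single sequential scan, constant workspace: count, mean and max."""
    count, total, largest = 0, 0.0, float("-inf")
    for x in stream:          # one pass: each item is seen exactly once, in order
        count += 1
        total += x
        largest = max(largest, x)
    return count, total / count, largest

# Example: summary statistics of a stream without storing it.
print(one_pass_stats(iter([3, 1, 4, 1, 5, 9, 2, 6])))
```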

10 Streaming: From Data Driven to Performance Driven The primary motivation is computing over transient data (data driven): data over a network, sensor data, router data, etc. Next, computing over large, disk-resident data that are expensive to access (data and performance driven). Finally, streaming purely to improve algorithm performance. How does streaming help performance?

11 Von Neumann Bottleneck (Diagram: a CPU containing a control unit, ALU and cache, connected to memory over a bus that carries both instructions and data.)

12 Von Neumann Bottleneck The memory bottleneck: CPU processing is faster than memory bandwidth, and the discrepancy is getting worse. Large caches and sophisticated prefetching strategies alleviate the bottleneck to some extent, but caches occupy large portions of real estate in modern chip design.

13 Trends In Hardware

14 Cache Real Estate Die photograph of the Intel/HP IA-64 processor (Itanium 2 chip); most of the die area is occupied by the L3 cache arrays.

15 Stream Architectures All input is in the form of streams. A stream is processed by a specialized computational unit called a kernel. The kernel performs the same operations on each stream element. (Diagram: a data stream flowing through a kernel.)

16 Stream Architectures Items are processed in a FIFO fashion. Benefits: reduced memory latency and cache requirements, simplified control flow, data-level parallelism, and greater computational efficiency. Examples: CHEOPS [Rixner et al. ’98] and Imagine [Kapasi et al. ’02], targeted at high-performance media applications.

17 GPU: A Streaming Pipelined Architecture Inputs are presented in streaming fashion: processed data items pass to the next phase and do not return. Data-level parallelism. Limited local storage: data items essentially carry their own state. Pipelining: each item is processed identically. Not quite general purpose yet, but getting there.

18 GPU Performance From ‘Stream Programming Environments’, Hanrahan, 2004.

19 The Graphics Pipeline Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Input: Geometric model: Description of all object, surface and light source geometry and transformations Lighting model: Computational description of object and light properties, interaction (reflection, scattering etc.) Synthetic viewpoint (or Camera location): Eye position and viewing frustum Raster viewport: Pixel grid on to which image is mapped Output: Colors/Intensities suitable for display (For example, 24-bit RGB value at each pixel)

20 The Graphics Pipeline Primitives are processed in a series of stages Each stage forwards its result on to the next stage The pipeline can be implemented in different ways Optimizations & additional programmability are available at some stages Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

21 Modeling Transformations 3D models defined in their own coordinate system (object space) Modeling transforms orient the models within a common coordinate frame (world space) Object space → World space Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

22 Illumination (Shading) (Lighting) Vertices lit (shaded) according to material properties, surface properties (normal) and light sources Local lighting model (Diffuse, Ambient, Phong, etc.) Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

23 Viewing Transformation Maps world space to eye space Viewing position is transformed to origin & direction is oriented along some axis (usually z) Eye space World space Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display

24 Clipping Transform to Normalized Device Coordinates (NDC) Portions of the object outside the view volume (view frustum) are removed Eye space → NDC Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

25 Projection The objects are projected to the 2D image plane (screen space) NDC → Screen Space Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

26 Scan Conversion (Rasterization) Rasterizes objects into pixels Interpolate values as we go (color, depth, etc.) Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

27 Visibility / Display Each pixel remembers the closest object (depth buffer) Almost every step in the graphics pipeline involves a change of coordinate system. Transformations are central to understanding 3D computer graphics. Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

28 Coordinate Systems in the Pipeline Object space World space Eye Space / Camera Space Clip Space (NDC) Screen Space Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

29 Programmable Graphics Pipeline (Diagram: the 3D application or game issues 3D API commands (OpenGL or Direct3D), which cross the CPU-GPU boundary as a GPU command and data stream. The GPU front end consumes the vertex index stream; pre-transformed vertices go to the programmable vertex processor, which outputs transformed vertices; primitive assembly produces assembled primitives; rasterization and interpolation turn these, together with the pixel location stream, into pre-transformed fragments for the programmable fragment processor; the transformed fragments pass through raster operations as pixel updates into the frame buffer.) Courtesy: Cg Book [Fernando and Kilgard]

30 Increase in Expressive Power 1996: simple if-then tests via depth and stencil testing. 1998: more complex arithmetic and lookup operations. 2001: limited programmability in the pipeline via specialized assembly constructs. 2002: full programmability, but only straight-line programs. 2004: true conditionals and loops. 2006 (?): a general-purpose streaming processor?

31 Recall Hanrahan, 2004.

32 GPU speed is increasing at cubed-Moore’s Law. This is a consequence of the data-parallel streaming aspects of the GPU. GPUs are cheap! Put enough together, and you can get a supercomputer. NYT May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000: National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch. So can we use the GPU for general-purpose computing? GPU = fast co-processor?

33 In Summary The GPU is fast, and its performance trend, compared to the CPU, is even more favorable. But is it being used in non-graphics applications?

34 Wealth of applications: Voronoi diagrams, data analysis, motion planning, geometric optimization, physical simulation, linear solvers, sorting and searching, force-field simulation, particle systems, molecular dynamics, graph drawing, audio, video and image processing, database queries, range queries… and graphics too!

35 Computation & Visualization The success stories involving the GPU revolve around the merger of computation and visualization Linear system solvers used for real-time physical simulation Voronoi diagrams allow us to perform shape analysis n-body computations form the basis for graph drawing and molecular modelling.

36 For large data, visualization = analysis (Diagram: traditionally, interactive data analysis pairs separate viz tools and analysis tools; the GPU combines both in one.)

37 Example 1: Voronoi Diagrams Hoff et al., SIGGRAPH 99

38 Example 2: Change Detection using Histograms

39 Other examples Physical system simulation: fluid flow visualization for graphics and scientific computing, one of the most well-studied and successful uses of the GPU. Image processing and analysis: a very hot area of research in the GPU world, with numerous packages (OpenVIDIA, Apple dev kit) for processing images on the GPU, and cameras mounted on robots with real-time scene processing.

40 Other examples SCOUT: a GPU-based system for expressing general data visualization queries. It provides a high-level data processing language; data is processed through ‘programs’, and the user can interact with the output data. SCOUT demonstrates the effectiveness of the GPU for data visualization and the need for general GPU primitives for processing data, but its computation and visualization are decoupled. There are many more examples!

41 Central Theme of This Tutorial The GPU is fast, and visualization is built in. The GPU can be programmed at a high level. Two of the most challenging aspects of computation and visualization of large data are interactivity and dynamic data; the GPU enables both.

42 An Extended Example

43 Reverse Nearest Neighbours An instance consists of clients and servers. Each server “serves” the clients for which it is the closest server. What is the load on a server? It is the number of clients that consider this server to be their nearest neighbour (among all servers). Hence the term “reverse nearest neighbour”: for each server, find its reverse nearest neighbours. Note: both servers and clients can move or be inserted or deleted.

44 What do we know? First proposed by Korn and Muthukrishnan (2000) for the 1-D, static case. Easy if the servers are fixed (compute the Voronoi diagram of the servers and locate the clients in the diagram). In general, hard to do with moving clients and servers in two dimensions.

45 Algorithmic strategy For each of the N clients, iterate over each of the M servers and find the closest one. For each of the M servers, iterate over each of the N clients that consider it closest, and count them. The apparent “complexity” is M * N… or is it?

46 Looking more closely… For each of the N clients, iterate over each of the M servers and find the closest one. For each of the M servers, iterate over each of the N clients that consider it closest, and count them. Each of these loops can be performed in parallel.

47 Looking more closely… For each of the N clients, iterate over each of the M servers and find the closest one. For each of the M servers, iterate over each of the N clients that consider it closest, and count them. With the loops parallelized, the complexity is N + M.
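As a concrete CPU reference for the two loops above, here is a hedged Python sketch of brute-force RNN counting; on the GPU each loop becomes a data-parallel pass over clients and servers respectively (the names and structure are illustrative, not the tutorial's actual code).

```python
import math

def rnn_loads(clients, servers):
    """Brute-force reverse nearest neighbours: loads[j] = number of clients
    whose nearest server (among all servers) is server j."""
    loads = [0] * len(servers)
    for (cx, cy) in clients:                      # loop 1: per-client nearest server
        nearest = min(range(len(servers)),
                      key=lambda j: math.hypot(cx - servers[j][0], cy - servers[j][1]))
        loads[nearest] += 1                       # loop 2 (counting) folded in here
    return loads

clients = [(0.1, 0.2), (0.9, 0.8), (0.85, 0.9)]
servers = [(0.0, 0.0), (1.0, 1.0)]
print(rnn_loads(clients, servers))                # -> [1, 2]
```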

48 Demo

49 Demo: Change distance function

50 RNN as a special case of EM In expectation-maximization, each iteration first determines which points are to be associated with which cluster centers (the “nearest neighbour” step), and then computes a new cluster center based on the points in each cluster (the “counting” step). We can implement expectation-maximization on the GPU! More later.

51 What does this example show us? GPU is fast GPU is programmable GPU can deal with dynamic data GPU can be used for interactive visualization However, care is needed to get the best performance

52 Coming up next… How do we abstract the GPU? What makes a GPU program efficient? How various data mining primitives are implemented. Using the GPU for non-standard data visualization.

53 PART II. Stream algorithms using the GPU

54 Programmable Graphics Pipeline (Diagram, repeated from slide 29: the 3D application or game issues 3D API commands (OpenGL or Direct3D), which cross the CPU-GPU boundary as a GPU command and data stream; the GPU front end, programmable vertex processor, primitive assembly, rasterization and interpolation, programmable fragment processor, raster operations and frame buffer process the vertex and fragment streams in turn.) Courtesy: Cg Book [Fernando and Kilgard]

55 Think parallel First cut: each pixel is a processor. Not really: there are multiple pipes, each pipe performs a piece of the whole, and the mapping is not necessarily known and can even be load dependent, but the programming model presents it that way. Order of completion is not guaranteed, i.e., we cannot assume the computation proceeds along “scan” lines. Interaction between pixels is restricted: we cannot write and read the same location.

56 Think simple A SIMD machine: single instruction, with the same program applied at every pixel (recall data parallelism). Program size is bounded. The ability to keep state is limited, and the cost is in changing state: a pass-based model. We cannot read and write the same memory location in a single pass.

57 Think streams The GPU works as a set of transducers or kernels: algorithms that take an input, consume it and output a transformed result before looking at the next input. Kernels can be set up in networks.

58 Example: Dense Matrix Multiplication Larsen/McAllester (2002), Hall, Carr & Hart (2003). Given matrices X, Y, compute Z = XY: For s = 1 to n; For t = 1 to n; Z[s,t] = 0; For m = 1 to n Do Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor; EndFor; EndFor.

59 Data Structures For s = 1 to n; For t = 1 to n; Z[s,t] = 0; For m = 1 to n Do Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor. Data structures: arrays X, Y, Z. How are arrays implemented on GPUs? As texture memory (used for shading and bump maps). Fast texture lookups are built into the pipeline, and textures are the primary storage mechanism for general-purpose programming.

60 Control Flow Original loop nest: For s = 1 to n; For t = 1 to n; Z[s,t] = 0; For m = 1 to n; Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor. Reordered: first initialize, For s = 1 to n; For t = 1 to n; Z[s,t] ← 0; then accumulate, For m = 1 to n; For s = 1 to n; For t = 1 to n; Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor; EndFor.

61 From Loops to Pixels For m = 1 to n; For s = 1 to n; For t = 1 to n; Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor; EndFor becomes: For m = 1 to n; For pixel (s,t); Z[s,t] ← Z[s,t] + X[s,m]*Y[m,t]; EndFor; EndFor. Operations for all pixels are performed in “parallel”.

62 Is that it? SIMD… For m = 1 to n; For s = 1 to n; For t = 1 to n: R1.r ← lookup(s,m,X); R2.r ← lookup(m,t,Y); R3.r ← lookup(s,t,Z); buffer ← R3.r + R1.r * R2.r; EndFor; save buffer into Z; EndFor. Each pixel (x,y) gets the parameter m: R1.r ← lookup(myxcoord,m,X); R2.r ← lookup(m,myycoord,Y); R3.r ← lookup(myxcoord,myycoord,Z); mybuffer ← R3.r + R1.r * R2.r.
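The n-pass structure can be mimicked on the CPU with NumPy: each pass m performs the same update at every "pixel" (s,t) in parallel, which is exactly a rank-1 update. This is a sketch of the pass structure only, not the actual GPU code.

```python
import numpy as np

def matmul_by_passes(X, Y):
    """Z = X @ Y computed as n data-parallel 'passes'; pass m updates every
    output cell (s,t) by X[s,m] * Y[m,t], mirroring the per-pixel GPU kernel."""
    n = X.shape[0]
    Z = np.zeros((n, n))
    for m in range(n):                      # one rendering pass per value of m
        Z += np.outer(X[:, m], Y[m, :])     # all (s,t) cells updated "in parallel"
    return Z

X, Y = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(matmul_by_passes(X, Y), X @ Y)
```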

63 Analysing theoretically O(n) passes to multiply two n × n matrices, using O(n^2)-size textures. Your favourite parallel algorithm for matrix multiplication can be mapped onto the GPU. How about Strassen? The number of passes is the same, but the total work decreases; the number of passes can be reduced to O(n^0.9) using special packing ideas [GKV04].

64 In practice… Simpler implementations win out. The fastest GPU implementation uses n matrix-vector operations and packs data into 4-vectors. It is better than naïve code using three for loops, but not better than optimized CPU code, e.g., in linear algebra packages. Trendlines show that the running time of the GPU code grows linearly. Packing and additions, e.g., in Strassen-type algorithms, are problematic due to data movement and cache effects.

65 Summing up A SIMD machine. Pass-based computation: pipelined operation; the cost is in changing state; total work matters. Revisiting RNN.

66 Recall the two implementations M*N distances need to be evaluated. First idea: M+N passes using M+N processors in parallel. Second idea: an M x N grid to compute the distances in 1 pass. Can we find the min in 1 pass? How do we aggregate? (Diagram: an M x N grid.)

67 GPU hardware modelling The GPU is a parallel processor, where each pixel can be considered to be a single processor (almost) The GPU is a streaming processor, where each pixel processes a stream of data We can build high-level algorithms (EM) from low level primitives (distance calculations, counting)

68 Operation costs When evaluating the cost of an operation on the GPU, standard operation counts are not applicable. The basic unit of cost is the “rendering pass”, consisting of a stream of data being processed by a single program. The virtual parallelism of the GPU means that we “pretend” that all streams can be processed simultaneously. Assign a cost of 1 to a pass: this is analogous to external memory models where each disk access costs 1 unit. There are many caveats; an extreme example follows.

69 One-pass median finding A well-studied problem in streaming, with a log n pass lower bound. Consider repeating the data O(log n) times to get a sequence of size O(n log n). At any point the algorithm maintains an upper bound (UB) and a lower bound (LB). 1. The algorithm reads the next n numbers and picks a random element u between the LB and UB. 2. It uses the next n numbers to find the rank of u. 3. It sets LB = u or UB = u depending on the counts. 4. Repeat: Quickfind! Why does the example not generalize? Read is not the same as Write; total work vs. number of passes.
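A hedged Python sketch of the scheme just described, simulating the O(log n) "passes" over the repeated data by rescanning a list; LB/UB are the current bounds and u is the random pivot (a simulation for intuition, not a streaming implementation).

```python
import random

def median_by_passes(data):
    """Randomized median selection by repeated scans (simulated passes):
    each round, one scan picks a random pivot u between the current bounds,
    a second scan computes its rank, and the bounds are tightened (Quickselect-style)."""
    n = len(data)
    target = (n + 1) // 2                                  # rank of the lower median
    lb, ub = float("-inf"), float("inf")
    while True:
        candidates = [x for x in data if lb < x <= ub]     # scan: remaining candidates
        if len(set(candidates)) == 1:
            return candidates[0]
        u = random.choice(candidates)                      # scan: pick a random pivot
        rank = sum(1 for x in data if x <= u)              # scan: rank of the pivot
        if rank >= target:
            ub = u                                         # the median is at most u
        else:
            lb = u                                         # the median is strictly above u

print(median_by_passes([9, 1, 8, 2, 7, 3, 6, 4, 5]))       # -> 5
```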

70 PART II.5. Data Mining Primitives

71 I: Clustering Problem definition: given N points x_j, find K centers c_i minimizing the k-means cost Σ_j min_i ||x_j − c_i||². Hall & Hart ’03: a (K+N)-pass KMEANS algorithm. What we will see: log N passes per iteration, and EM with “soft assignments”.

72 K-Means 2D version Basic Algorithm: Recall Voronoi scenario

73 Why GPU ? Motion/updates ! When does the cross over happen ?

74 K-Means on the GPU In each pass, given a set of K colors, we can assign a “color” to each point and keep track of the “mass” and “mean” of each color. Issues: higher dimensions; extension to more general algorithms.

75 The RNN-type approach Compute the K*N distances in 1 pass. Compute the min in logarithmically many passes. Aggregate in logarithmically many passes. Normalize and compute the new centers. High dimensions? Replace the 1 pass by O(D) passes. EM? Instead of the min (which is a 0/1 assignment), compute “shares”. Same complexity! (Diagram: a K x N grid.)
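A hedged NumPy sketch of one such iteration, with the "shares" computed as soft assignments (a Gaussian-style weighting with a fixed parameter beta, which is an assumption of this sketch); the K x N distance grid, the share computation and the weighted averaging each correspond to the data-parallel steps described above.

```python
import numpy as np

def em_iteration(points, centers, beta=10.0):
    """One soft-assignment (EM-style) k-means iteration.
    points: (N, D) array, centers: (K, D) array, beta: inverse 'temperature'."""
    # K x N grid of squared distances (the "1 pass" / O(D) passes step)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    # "shares": soft assignment of each point across the K centers
    w = np.exp(-beta * d2)
    shares = w / w.sum(axis=0, keepdims=True)            # each column sums to 1
    # aggregate mass and weighted sums per center, then normalize (new centers)
    mass = shares.sum(axis=1)                            # (K,)
    new_centers = (shares @ points) / mass[:, None]      # (K, D)
    return new_centers

pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
ctrs = np.array([[0.0, 0.1], [1.0, 0.9]])
print(em_iteration(pts, ctrs))
```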

76 Demo: Expectation Maximization

77 II: Wavelet Transforms Image analysis. (Diagram: one level of a 2D wavelet decomposition splits the image into four quadrants A, B, C, D.)

78 Wavelets: Recurse!
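For concreteness, a small NumPy sketch of one level of the 2D Haar transform (the A, B, C, D quadrants of the previous slide correspond to the coarse and detail sub-bands); recursing on the coarse quadrant gives the full DWT. The normalization and boundary handling here are simplified assumptions.

```python
import numpy as np

def haar2d_level(img):
    """One level of the 2D Haar wavelet transform (image sides must be even).
    Returns the four sub-bands: LL (averages), LH, HL, HH (details)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0          # coarse approximation
    lh = (a - b + c - d) / 4.0          # horizontal detail
    hl = (a + b - c - d) / 4.0          # vertical detail
    hh = (a - b - c + d) / 4.0          # diagonal detail
    return ll, lh, hl, hh

img = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar2d_level(img)
ll2, _, _, _ = haar2d_level(ll)          # recurse on the coarse quadrant
print(ll, ll2, sep="\n")
```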

79 DWT Demo Wang, Wong, Heng & Leung http://www.cse.cuhk.edu.hk/~ttwong/demo/dwtgpu/dwtgpu.html

80 DWT Demo Wang, Wong, Heng & Leung http://www.cse.cuhk.edu.hk/~ttwong/demo/dwtgpu/dwtgpu.html

81 III: FFTs What does the above remind you of ?

82 FFT Demo Moreland and Angel http://www.cs.unm.edu/~kmorel/documents/fftgpu/

83 IV: Sorting Networks! Can be used for quantiles and computing histograms. GPUSort: [Govindaraju et al. 2005]. Basically “network computations”: each pixel performs only a small local piece.
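As an illustration of a sorting network's data-parallel structure, here is a bitonic sort sketch in Python (not GPUSort itself): the inner loop over i is one "pass" in which every position does only a local compare-exchange with a fixed partner, which is what maps to per-pixel work on the GPU.

```python
def bitonic_sort(a):
    """In-place bitonic sorting network; len(a) must be a power of two.
    Each (k, j) stage is one data-parallel pass of local compare-exchanges."""
    n = len(a)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):                    # this loop is what runs per pixel
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 0, 5, 6, 1, 4, 2]))     # -> [0, 1, 2, 3, 4, 5, 6, 7]
```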

84 In summary The big idea: think parallel, think simple, think streams. SIMD. Pass-based computing. The tradeoff of passes versus total cost. “Ordered” access.

85 Next up We saw “number crunching” algorithms. What are the applications in mining that are “combinatorial” in nature and yet visual? And of course, resources and how-tos.

86 Part III: Other GPU Case Studies and High-level Software Support for the GPU

87 What have we seen so far Computation on the GPU. How to program it. The cost model: what is cheap and what is expensive. Number-based problems, e.g., nearest neighbor (Voronoi), clustering, wavelets, sorting.

88 What you will see … Examples with different kinds of input data 1. Computation of Depth Contours 2. Graph Drawing Languages and tools to program the GPU Final wrap-up

89 Depth Contours

90 (Figure: an example point set with location depths labeled (3, 5, 1, 7, 2, 6); the highlighted point has location depth = 1.)

91 Depth Contours The k-contour is the set of all points having location depth ≥ k. We wish to compute the set of all depth contours. Every n-point set has an n/3-contour. The Tukey median is the deepest contour.
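To make "location depth" concrete: the location depth of a query point is the minimum number of data points in any closed halfplane bounded by a line through the query point. Below is a hedged brute-force approximation that samples halfplane directions; the GPU algorithm on the following slides computes depths exactly via duality.

```python
import math

def location_depth(q, points, n_dirs=720):
    """Approximate location (Tukey) depth of q: the minimum, over sampled
    directions u, of the number of data points in the closed halfplane
    {p : <p - q, u> >= 0}.  Brute force, for intuition only."""
    best = len(points)
    for d in range(n_dirs):
        theta = 2.0 * math.pi * d / n_dirs
        ux, uy = math.cos(theta), math.sin(theta)
        count = sum(1 for (x, y) in points
                    if (x - q[0]) * ux + (y - q[1]) * uy >= 0)
        best = min(best, count)
    return best

pts = [(0, 0), (2, 0), (1, 2), (1, 1), (3, 3)]
print(location_depth((1, 1), pts))   # a central point has higher depth
print(location_depth((3, 3), pts))   # a hull vertex has depth 1
```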

92 Motivation Special case of general non-parametric data analysis tool “visualize the location, spread, correlation, skewness, and tails of the data” [Tukey et al 1999] Hypothesis testing, robust statistics

93 Point-Hyperplane Duality A mapping from R^2 to R^2. Points get mapped to lines and vice versa: the point (a,b) maps to the line y = -ax + b, and the line y = ax + b maps to the point (a,b). Incidence properties are preserved.
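A tiny Python check of the stated duality and its incidence-preservation property; the function names are illustrative only.

```python
def dual_of_point(a, b):
    """The point (a, b) maps to the line y = -a*x + b, returned as (slope, intercept)."""
    return (-a, b)

def dual_of_line(m, c):
    """The line y = m*x + c maps to the point (m, c)."""
    return (m, c)

def on_line(point, line):
    x, y = point
    m, c = line
    return abs(y - (m * x + c)) < 1e-9

# Incidence preservation: (a, b) lies on y = m*x + c  iff  the dual point of the
# line lies on the dual line of the point.
a, b, m = 2.0, 7.0, 1.5
c = b - m * a                        # choose c so that (a, b) lies on y = m*x + c
assert on_line((a, b), (m, c))
assert on_line(dual_of_line(m, c), dual_of_point(a, b))
print("incidence preserved")
```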

94 Contours ↔ k-levels The k-contour is obtained from the convex hulls of the k-level and (n-k)-level in the dual.

95 Main Algorithm Draw a primal line for each dual pixel that lies on a dual line. Record (at each primal pixel) only the line whose dual pixel has least depth. Recall nearest neighbor (Voronoi) and clustering: we are using the geometry heavily.

96 Demo: Depth Contours

97 Dynamic Depth Contours Smooth but random movement of points. Compute incremental depth changes using duals. In the worst case, quadratically many updates might be performed; in practice, the number of updates is small. Real-time speeds can be achieved for the first time.

98 Dynamic Depth Contours

99 Graph Drawing

100 Graph A set of vertices; a Boolean relation between pairs of vertices defines the edges. Given a graph, lay out the vertices and edges in the plane based on some aesthetic criteria.

101 Abstract Graph Example: 9 vertices, 12 edges. Adjacency lists: v1: v2, v4; v2: v1, v3, v5; v3: v2, v6; v4: v1, v5, v7; v5: v2, v4, v6, v8; v6: v3, v5, v9; v7: v4, v8; v8: v5, v7, v9; v9: v6, v8.

102 After Graph Layout (grid3.graph) The relationships can be visualized much better. (Figure: the nine vertices laid out on a 3 x 3 grid.)

103 Force-Directed Layouts Start with a random initial placement of vertices. Define an energy field on the plane using forces between vertices (both attractive and repulsive) and evolve the system based on it. Local energy minimization defines the layout. This is essentially solving an N-body problem.

104 GPU Algorithm Naïve approach: at each iteration, compute the pairwise repulsive forces and the attractive forces along edges; the force on each vertex is the sum of the individual forces; the resultant force defines the vertex displacement; displace the vertices and iterate. Complexity: O(n) passes per iteration, the bottleneck being the summation of the individual forces. Optimized approach: use a logarithmic parallel summation algorithm, giving O(log n) passes per iteration.
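A hedged NumPy sketch of one naïve iteration (spring-embedder style, with illustrative force constants); each line is a data-parallel operation over all vertices or vertex pairs, and the per-vertex force sums are exactly the reductions discussed on the next slide.

```python
import numpy as np

def layout_iteration(pos, edges, k_rep=0.01, k_att=0.1, step=0.05):
    """One naïve force-directed iteration.
    pos: (n, 2) vertex positions, edges: list of (i, j) pairs."""
    n = pos.shape[0]
    diff = pos[:, None, :] - pos[None, :, :]              # (n, n, 2) pairwise offsets
    dist2 = (diff ** 2).sum(axis=2) + np.eye(n)           # avoid divide-by-zero on diagonal
    repulse = k_rep * (diff / dist2[:, :, None]).sum(axis=1)   # sum of pairwise repulsions
    attract = np.zeros_like(pos)
    for i, j in edges:                                    # spring force along each edge
        d = pos[j] - pos[i]
        attract[i] += k_att * d
        attract[j] -= k_att * d
    return pos + step * (repulse + attract)               # displace and return

# The 9-vertex, 12-edge grid graph from the earlier slide.
pos = np.random.rand(9, 2)
edges = [(0,1),(1,2),(0,3),(1,4),(2,5),(3,4),(4,5),(3,6),(4,7),(5,8),(6,7),(7,8)]
for _ in range(100):
    pos = layout_iteration(pos, edges)
print(pos)
```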

105 Parallel Summation Algorithm (Figure: the eight-element array 5, -3, 8, -7, 4, 1, 11, -6 is summed in three rounds, pairwise-combining elements at offsets 4, 2 and 1; after log n rounds one cell holds the total, 13.) O(log n) steps.
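The figure's offset-halving scheme, as a short Python sketch; each offset step is one data-parallel pass in which every cell combines with the cell 'offset' positions away, and after log n passes cell 0 holds the total.

```python
def parallel_sum(values):
    """Logarithmic reduction: len(values) must be a power of two.
    Each offset step is one pass; all additions within a step are independent."""
    a = list(values)
    offset = len(a) // 2
    while offset >= 1:
        for i in range(offset):          # these additions run in parallel on the GPU
            a[i] = a[i] + a[i + offset]
        offset //= 2
    return a[0]

print(parallel_sum([5, -3, 8, -7, 4, 1, 11, -6]))   # -> 13, as in the figure
```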

106 Demo: Graph Drawing

107 Software tools

108 GPGPU Languages Why do we want them? To make programming GPUs easier: no need to know OpenGL, DirectX, or ATI/NV extensions; common operations are simplified; focus on the algorithm, not on the implementation. Sh: University of Waterloo, http://libsh.sourceforge.net. Brook: Stanford University, http://brook.sourceforge.net.

109 Brook

110 Brook: Streams Streams: collections of records requiring similar computation (particle positions, voxels, FEM cells, …), e.g. float3 positions ; float3 velocityfield ; Similar to arrays, but index operations such as position[i] are disallowed; reads and writes go through stream operators: streamRead (positions, p_ptr); streamWrite (velocityfield, v_ptr); This encourages data parallelism. Slide Courtesy: Ian Buck

111 Brook: Kernels Kernels: functions applied to streams, similar to a for_all construct. kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a ; float b ; float c ; foo(a,b,c); is equivalent to: for (i=0; i<100; i++) c[i] = a[i]+b[i]; Slide Courtesy: Ian Buck

112 Brook: Kernels Kernels: functions applied to streams, similar to a for_all construct. kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } There are no dependencies between stream elements, which encourages high arithmetic intensity. Slide Courtesy: Ian Buck

113 Brook: Reductions Reductions: compute a single value from a stream. reduce void sum (float a<>, reduce float r<>) { r += a; } float a ; float r; sum(a,r); is equivalent to: r = a[0]; for (int i=1; i<100; i++) r += a[i]; Slide Courtesy: Ian Buck

114 Brook: Reductions Reductions support associative operations only: (a+b)+c = a+(b+c), i.e., order independence. Examples: sum, multiply, max, min, OR, AND, XOR, matrix multiply. Slide Courtesy: Ian Buck

115 Brook: Reductions Multi-dimensional reductions: stream “shape” differences are resolved by the reduce function. reduce void sum (float a<>, reduce float r<>) { r += a; } float a ; float r ; sum(a,r); is equivalent to: for (int i=0; i<5; i++) { r[i] = a[i*4]; for (int j=1; j<4; j++) r[i] += a[i*4 + j]; } Slide Courtesy: Ian Buck

116 Brook: Matrix Vector Multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) { result += a; } float matrix ; float vector ; float tempmv ; float result ; mul(matrix,vector,tempmv); sum(tempmv,result); (Figure: the matrix M multiplied elementwise with the replicated vector V gives a temporary T.) Slide Courtesy: Ian Buck

117 Brook: matrix vector multiply kernel void mul (float a<>, float b<>, out float result<>) { result = a*b; } reduce void sum (float a<>, reduce float result<>) {result += a; } float matrix ; float vector ; float tempmv ; float result ; mul(matrix,vector,tempmv); sum(tempmv,result); RT sum Slide Courtesy: Ian Buck

118 Demo: Matrix Vector Multiplication

119 Demo: Bitonic Sort and Binary Search

120 Debugging Tools

121 Debugger Features Per-pixel register watch, including interpolants, outputs, etc. Breakpoints. A fragment program interpreter: single-step forwards or backwards, and execute modified code on the fly. Slide Courtesy: Tim Purcell

122 Shadesmith http://graphics.stanford.edu/projects/shadesmith A debugger in the spirit of imdebug: simply add a debug statement when binding shaders; a display window with scale, bias and component masking. Advanced features: watch any shader register's contents without recompiling; shader single-stepping (forward and backward) and breakpointing; shader source edit and reload without recompiling. Slide Courtesy: Tim Purcell

123 Demo: Shadesmith

124 GPGPU.org Your first stop for GPGPU information! News: www.GPGPU.org Discussion: www.GPGPU.org/forums Developer resources, sample code, tutorials: www.GPGPU.org/developer And for open source GPGPU software: http://gpgpu.sourceforge.net/

125 Summing up

126 What we saw in this tutorial… The GPU is fast, outperforming the CPU, thanks to high memory bandwidth and parallel processing capabilities. It has been shown to be effective in a variety of applications: interactive visualization, data mining, simulations, streaming operations, and spatial database operations. High-level programming and debugging support exists.

127 Limitations of the GPU It is not designed as a general-purpose processor; it is meant to extract maximum performance for the highly parallel tasks of computer graphics, and it will not suit all applications. Bad for “pointer chasing” tasks (e.g., word processing). No bit shifts or bit-wise logical operations (ruling out cryptography). No double-precision arithmetic yet (limiting large-scale scientific applications). Unusual programming model.

128 Conclusions: Looking forward Increased performance. More features: double-precision arithmetic, bit-wise operations, random bits. Increased programmability and generality, striking the right balance between generality and performance. Other data-parallel processors will emerge, such as the Cell processor by IBM, Sony and Toshiba, better suited to general-purpose stream computing.

129 Questions? Slides can be downloaded soon from http://www.research.att.com/areas/visualization/gpgpu.html

130 Map Simplification

131 Motivation Original US Map: ~425,000 vertices

132 Motivation Simplified US Map: ~1500 vertices

133 Motivation

134 Basic idea A simplified chain is a proper subsequence of the chain. A Short-Cut Segment (SCS) is any line segment between two distinct vertices of the chain. (Figure: a five-vertex chain v1…v5 with a shortcut segment.)

135 Problem Statement Given a map M and a tolerance parameter ε, construct the simplest map M´ that preserves shape. Geometry: each point on the simplified chain is within ε of the original chain and vice versa. Topology: no inter-chain intersections are introduced. A valid simplification preserves both.

136 Proximate Short-Cut Segments Consider an SCS between vertices i and j of the original chain such that the portion of the chain between i and j is within ε of the SCS. This gives geometry preservation and does not affect other simplifications! (Figure: a shortcut from vertex i to vertex j with the in-between vertices inside an ε-tube.)
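A minimal Python sketch of the ε test on the vertices (a simplification: the full condition applies to the entire chain portion, not just its vertices); point_segment_distance and scs_is_proximate are hypothetical helper names.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))                 # clamp to the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def scs_is_proximate(chain, i, j, eps):
    """Accept the shortcut (i, j) if every skipped vertex lies within eps of it."""
    return all(point_segment_distance(chain[k], chain[i], chain[j]) <= eps
               for k in range(i + 1, j))

chain = [(0, 0), (1, 0.2), (2, -0.1), (3, 0)]
print(scs_is_proximate(chain, 0, 3, eps=0.25))   # True
print(scs_is_proximate(chain, 0, 3, eps=0.05))   # False
```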

137 Compliant Short-Cut Segments An SCS that lies completely inside the chain’s Voronoi cell is topology preserving. (Figure: a non-compliant SCS and a compliant SCS.)

138 ε-Fattening

139 But what does this remind you of … Voronoi diagrams ! The same recurrent themes of geometric and visual primitives.

140 The Algorithm Given tolerance ε, use the SCSs that are “not expired”, i.e., the set of valid SCSs. Construct the Voronoi diagram of all chains. Phase I: for each chain, compute the valid SCSs. Phase II: stitch them together (done on the CPU). Short of time to explain the details…

141 Compliant Short-Cut Segments

142 Minimum Link Path

143 Demo: Map Simplification

144 Example 1: Voronoi diagrams Demo Courtesy: Kenny Hoff, UNC

145 Why should you care? If you are a GPU “user”: a fast streaming coprocessor, with many tools (and code) out there to test applications. If you are (or want to be) a GPU “researcher”: many problems remain unsolved, and there are multidisciplinary problems, such as mapping stream problems to the GPU and new interaction paradigms for interfacing with data analysis tools…

146 Common Coordinate Systems Object space: local to each object. World space: common to all objects. Eye space / camera space: derived from the view frustum. Clip space / Normalized Device Coordinates (NDC): [-1,-1,-1] → [1,1,1]. Screen space: indexed according to hardware attributes.

147 Programming Tools The code you saw was written in a language called “Cg”, developed by NVIDIA but supported on (mostly) all cards. Cg is a one-level-above-assembly language. Other higher-level languages, like Brook and Sh, also exist. We cannot yet write a general C program and expect the compiler to identify the GPU-mapped elements: GPU languages are the gatekeepers of the hardware, forcing us to express constructs in certain ways. More on this later…

148 Coming up next… What is the correct way to model the GPU ? What kind of low level primitives are available for doing data mining on the GPU

149 Breaking down the application A standard strategy: 1. Reduce each image to a short signature. 2. Cluster the images. Given a query image (or collection), perform the above and compute the intersection. Common tools of the trade: clustering/nearest neighbour, wavelets/FFT, sorting/quantiles.

150 Two views of the GPU Snake: exploit parallelism; computation-intensive tasks; neighbourhoods can be probed. Cube: more general, a “processor on an object”; extremely useful for dynamic/geometric data.

