Data Visualization And Mining Using The GPU. Sudipto Guha (Univ. of Pennsylvania), Shankar Krishnan (AT&T Labs - Research), Suresh Venkatasubramanian (AT&T Labs - Research)

What you will see in this tutorial… The GPU is fast. The GPU is programmable. The GPU can be used for interactive visualization. How do we abstract the GPU? What makes a GPU program efficient? How various data mining primitives are implemented. Using the GPU for non-standard data visualization.

What you will NOT see… Detailed programming tricks. Hacks to improve performance. Mechanics of GPU programming. BUT… we will show you where to find all of this: a plethora of resources is now available on the web, as well as code, toolkits, examples, and more…

Schedule. 1st Hour: 1:30pm – 1:50pm: Introduction to GPUs (Shankar); 1:50pm – 2:30pm: Examples of GPUs in Data Analysis (Shankar). 2nd Hour: 2:30pm – 2:50pm: Stream Algorithms on the GPU (Sudipto); 2:50pm – 3:20pm: Data mining primitives (Sudipto); 3:20pm – 3:30pm: Questions and Short break. 3rd Hour: 3:30pm – 4:00pm: GPU Case Studies (Shankar); 4:00pm – 4:20pm: High-level software support (Shankar); 4:20pm – 4:30pm: Wrap-up and Questions.

Animusic Demo (courtesy ATI)

But I don’t do graphics ! Why should I care about the GPU ?

Two Converging Trends in Computing … The accelerated development of graphics cards  developing faster than CPUs  GPUs are cheap and ubiquitous Increasing need for streaming computations  original motivation from dealing with large data sets  also interesting for multimedia applications, image processing, visualization etc.

What is a Stream? An ordered list of data items Each data item has the same type  like a tuple or record Length of stream is potentially very large Examples  data records in database applications  vertex information in computer graphics  points, lines etc. in computational geometry

Streaming Model Input presented as a sequence Algorithm works in passes  allowed one sequential scan over input  not permitted to move backwards mid-scan Workspace  typically o(n)  arbitrary computation allowed Algorithm efficiency  size of workspace and computation time
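The model on this slide can be made concrete with a minimal Python sketch (not from the tutorial): a single sequential scan, no backtracking, and O(1) workspace — here, a running mean.

```python
def streaming_mean(stream):
    """Compute the mean in one sequential pass with O(1) workspace:
    only two counters of state, and no backtracking over the input."""
    count, total = 0, 0.0
    for x in stream:          # one forward scan over the stream
        count += 1
        total += x
    return total / count
```

The workspace is independent of the stream length, which is the defining constraint of the streaming model described above.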

Streaming: Data driven to Performance driven Primary motivation is computing over transient data (data driven)  data over a network, sensor data, router data etc. Computing over large, disk-resident data which are expensive to access (data and performance driven) To improve algorithm performance How does streaming help performance?

Von Neumann Bottleneck. [figure: CPU (control unit, ALU, cache) exchanging instructions and data with memory over a bus]

Von Neumann Bottleneck. Memory bottleneck: CPU processing is faster than memory bandwidth, and the discrepancy is getting worse. Large caches and sophisticated prefetching strategies alleviate the bottleneck to some extent. Caches occupy large portions of real estate in modern chip design.

Trends In Hardware

Cache Real Estate. Die photograph of the Intel/HP IA-64 processor (Itanium 2 chip); the L3 cache arrays dominate the die area.

Stream Architectures. All input in the form of streams. The stream is processed by a specialized computational unit called a kernel. The kernel performs the same operations on each stream element. [figure: data stream flowing through a kernel]

Stream Architectures. Items processed in a FIFO fashion. Reduced memory latency and cache requirements. Simplified control flow. Data-level parallelism. Greater computational efficiency. Examples: CHEOPS [Rixner et al. '98] and Imagine [Kapasi et al. '02], high-performance media applications.

GPU: A Streaming Pipelined Architecture. Inputs presented in streaming fashion: processed data items pass to the next phase and do not return. Data-level parallelism. Limited local storage: data items essentially carry their own state. Pipelining: each item is processed identically. Not quite general purpose yet, but getting there.

GPU Performance. From 'Stream Programming Environments' – Hanrahan, 2004.

The Graphics Pipeline Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Input: Geometric model: Description of all object, surface and light source geometry and transformations Lighting model: Computational description of object and light properties, interaction (reflection, scattering etc.) Synthetic viewpoint (or Camera location): Eye position and viewing frustum Raster viewport: Pixel grid on to which image is mapped Output: Colors/Intensities suitable for display (For example, 24-bit RGB value at each pixel)

The Graphics Pipeline Primitives are processed in a series of stages Each stage forwards its result on to the next stage The pipeline can be implemented in different ways Optimizations & additional programmability are available at some stages Modeling Transformations Illumination (Shading) Viewing Transformation Clipping Projection (to Screen Space) Scan Conversion (Rasterization) Visibility / Display Slide Courtesy: Durand and Cutler

Modeling Transformations. 3D models defined in their own coordinate system (object space). Modeling transforms orient the models within a common coordinate frame (world space). Object space → World space. Slide Courtesy: Durand and Cutler

Illumination (Shading) (Lighting). Vertices are lit (shaded) according to material properties, surface properties (normal) and light sources. Local lighting model (Diffuse, Ambient, Phong, etc.). Slide Courtesy: Durand and Cutler

Viewing Transformation. Maps world space to eye space. The viewing position is transformed to the origin and the viewing direction is oriented along some axis (usually z). World space → Eye space.

Clipping. Transform to Normalized Device Coordinates (NDC). Portions of the object outside the view volume (view frustum) are removed. Eye space → NDC. Slide Courtesy: Durand and Cutler

Projection. The objects are projected to the 2D image plane (screen space). NDC → Screen space. Slide Courtesy: Durand and Cutler

Scan Conversion (Rasterization). Rasterizes objects into pixels. Interpolates values as we go (color, depth, etc.). Slide Courtesy: Durand and Cutler

Visibility / Display. Each pixel remembers the closest object (depth buffer). Almost every step in the graphics pipeline involves a change of coordinate system; transformations are central to understanding 3D computer graphics. Slide Courtesy: Durand and Cutler

Coordinate Systems in the Pipeline. Object space → World space → Eye space / Camera space → Clip space (NDC) → Screen space. Slide Courtesy: Durand and Cutler

Programmable Graphics Pipeline Vertex Index Stream 3D API Commands Assembled Primitives Pixel Updates Pixel Location Stream Programmable Fragment Processor Programmable Fragment Processor Transformed Vertices Programmable Vertex Processor Programmable Vertex Processor GPU Front End GPU Front End Primitive Assembly Primitive Assembly Frame Buffer Frame Buffer Raster Operations Rasterization and Interpolation 3D API: OpenGL or Direct3D 3D API: OpenGL or Direct3D 3D Application Or Game 3D Application Or Game Pre-transformed Vertices Pre-transformed Fragments Transformed Fragments GPU Command & Data Stream CPU-GPU Boundary Courtesy: Cg Book [Fernando and Kilgard]

Increase in Expressive Power. 1996: simple if-then tests via depth and stencil testing. 1998: more complex arithmetic and lookup operations. 2001: limited programmability in the pipeline via specialized assembly constructs. 2002: full programmability, but only straight-line programs. 2004: true conditionals and loops. 2006(?): general-purpose streaming processor?

Recall Hanrahan, 2004.

GPU = Fast co-processor? GPU speed is increasing at cubed-Moore's Law, a consequence of the data-parallel streaming aspects of the GPU. GPUs are cheap! Put enough together, and you can get a supercomputer. NYT May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000: National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch. So can we use the GPU for general-purpose computing?

In Summary. The GPU is fast. The trend, compared to the CPU, is already better. But is it being used in non-graphics applications?

Wealth of applications: Voronoi diagrams, data analysis, motion planning, geometric optimization, physical simulation, linear solvers, sorting and searching, force-field simulation, particle systems, molecular dynamics, graph drawing, audio, video and image processing, database queries, range queries… and graphics too!!

Computation & Visualization The success stories involving the GPU revolve around the merger of computation and visualization Linear system solvers used for real-time physical simulation Voronoi diagrams allow us to perform shape analysis n-body computations form the basis for graph drawing and molecular modelling.

For large data, visualization = analysis. Traditionally, interactive data analysis couples separate visualization tools and analysis tools; the GPU combines both in one.

Example 1: Voronoi Diagrams. Hoff et al., SIGGRAPH 99.

Example 2: Change Detection using Histograms

Other examples Physical system simulation:  Fluid flow visualization for graphics and scientific computing  One of the most well-studied and successful uses of the GPU. Image processing and analysis  A very hot area of research in the GPU world  Numerous packages (openvidia, apple dev kit) for processing images on the GPU  Cameras mounted on robots with real-time scene processing

Other examples. SCOUT: a GPU-based system for expressing general data visualization queries. Provides a high-level data processing language; data is processed through 'programs', and the user can interact with the output data. SCOUT demonstrates the effectiveness of the GPU for data visualization and the need for general GPU primitives for processing data. But the computation and visualization are decoupled. There are many more examples!!

Central Theme of This Tutorial GPU is fast, and viz is built-in GPU can be programmed at high level Some of the most challenging aspects of computation and visualization of large data  interactivity  dynamic data GPU enables both

An Extended Example

Reverse Nearest Neighbours An instance consists of clients and servers Each server “serves” those clients that it is closest to. What is the load on a server ? It is the number of clients that consider this server to be their nearest neighbour (among all servers) Hence, the term “reverse nearest neighbour”: for each server, find its reverse nearest neighbours. Note: both servers and clients can move or be inserted or deleted.
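A brute-force Python sketch of the problem just defined (hypothetical helper, 2-D points): each client is assigned to its nearest server, and a server's load is the size of its reverse-nearest-neighbour set.

```python
def rnn_loads(clients, servers):
    """For each server, count its reverse nearest neighbours:
    the clients whose closest server it is."""
    loads = [0] * len(servers)
    for cx, cy in clients:
        # find this client's nearest server (squared distances suffice)
        best = min(range(len(servers)),
                   key=lambda j: (servers[j][0] - cx) ** 2
                               + (servers[j][1] - cy) ** 2)
        loads[best] += 1
    return loads
```

This is the sequential baseline; the slides that follow examine how its two loops map onto the GPU.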

What do we know ? First proposed by Korn and Muthukrishnan (2000)  1-D, static Easy if servers are fixed (compute Voronoi diagram of servers, and locate clients in the diagram) In general, hard to do with moving clients and servers in two dimensions.

Algorithmic strategy. For each of N clients: iterate over each of M servers, find the closest one. For each of M servers: iterate over each of N clients that consider it closest, and count them. Apparent "complexity" is M * N… or is it?

Looking more closely… For each of N clients  iterate over each of M servers, find closest one For each of M servers  iterate over each of N clients that consider it closest, and count them. Each of these loops can be performed in parallel

Looking more closely… For each of N clients  iterate over each of M servers, find closest one For each of M servers  iterate over each of N clients that consider it closest, and count them. Complexity is N + M

Demo

Demo: Change distance function

RNN as special case of EM In expectation-maximization, in each iteration  first determine which points are to be associated with which cluster centers (the “nearest neighbour” step)  Then compute a new cluster center based on the points in a cluster (the “counting” step) We can implement expectation-maximization on the GPU !  More later

What does this example show us? GPU is fast GPU is programmable GPU can deal with dynamic data GPU can be used for interactive visualization However, care is needed to get the best performance

Coming up next… How do we abstract the GPU? What makes a GPU program efficient? How various data mining primitives are implemented. Using the GPU for non-standard data visualization.

PART II. Stream algorithms using the GPU

Programmable Graphics Pipeline Vertex Index Stream 3D API Commands Assembled Primitives Pixel Updates Pixel Location Stream Programmable Fragment Processor Programmable Fragment Processor Transformed Vertices Programmable Vertex Processor Programmable Vertex Processor GPU Front End GPU Front End Primitive Assembly Primitive Assembly Frame Buffer Frame Buffer Raster Operations Rasterization and Interpolation 3D API: OpenGL or Direct3D 3D API: OpenGL or Direct3D 3D Application Or Game 3D Application Or Game Pre-transformed Vertices Pre-transformed Fragments Transformed Fragments GPU Command & Data Stream CPU-GPU Boundary Courtesy: Cg Book [Fernando and Kilgard]

Think parallel First cut: Each Pixel is a processor.  Not really – multiple pipes Each of the pipes perform bits of the whole The mapping is not necessarily known Can even be load dependent But the drivers think that way.  Order of completion is not guaranteed i.e. cannot say that the computation proceeds along “scan” lines  Restricted interaction between pixels We cannot write and read the same location

Think simple SIMD machine  Single instruction  Same program is applied at every pixel  Recall data parallelism Program size is bounded The ability to keep state is limited  Cost is in changing state: a pass-based model Cannot read and write from same memory location in a single pass

Think streams. Work as transducers or kernels: algorithms that take an input, consume it, and output a transformation before looking at the next input. Can set up networks of kernels.

Example: Dense Matrix Multiplication. Larsen/McAllester (2002), Hall, Carr & Hart (2003). Given matrices X, Y, compute Z = XY:
For s = 1 to n
  For t = 1 to n
    Z[s,t] = 0
    For m = 1 to n
      Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
    EndFor
  EndFor
EndFor

Data Structures.
For s = 1 to n
  For t = 1 to n
    Z[s,t] = 0
    For m = 1 to n
      Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
Data structures: arrays X, Y, Z. How are arrays implemented on GPUs? As texture memory (used for shading and bump maps). Fast texture lookups are built into the pipeline. Textures are the primary storage mechanism for general-purpose programming.

Control Flow. The original loop nest
For s = 1 to n
  For t = 1 to n
    Z[s,t] = 0
    For m = 1 to n
      Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
is split into an interchanged update loop
For m = 1 to n
  For s = 1 to n
    For t = 1 to n
      Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
and a separate initialization
For s = 1 to n
  For t = 1 to n
    Z[s,t] ← 0

From Loops to Pixels.
For m = 1 to n
  For s = 1 to n
    For t = 1 to n
      Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
becomes
For m = 1 to n
  For each pixel (s,t)
    Z[s,t] ← Z[s,t] + X[s,m] * Y[m,t]
Operations for all pixels are performed in "parallel".

Is that it? SIMD…
For m = 1 to n
  For s = 1 to n
    For t = 1 to n
      R1.r ← lookup(s, m, X)
      R2.r ← lookup(m, t, Y)
      R3.r ← lookup(s, t, Z)
      buffer ← R3.r + R1.r * R2.r
  Save buffer into Z
Each pixel (x,y) gets the parameter m:
R1.r ← lookup(myxcoord, m, X)
R2.r ← lookup(m, myycoord, Y)
R3.r ← lookup(myxcoord, myycoord, Z)
mybuffer ← R3.r + R1.r * R2.r
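The pass-based computation above can be simulated in plain Python (a sketch: the outer loop plays the role of the n rendering passes, and the two inner loops stand in for the pixels that the GPU updates in parallel):

```python
def gpu_matmul_passes(X, Y):
    """n-pass matrix multiply: pass m adds the rank-1 update
    X[s][m] * Y[m][t] at every 'pixel' (s, t)."""
    n = len(X)
    Z = [[0.0] * n for _ in range(n)]
    for m in range(n):            # one rendering pass per value of m
        for s in range(n):        # on the GPU, all (s, t) run in parallel
            for t in range(n):
                Z[s][t] += X[s][m] * Y[m][t]
    return Z
```

Each pass touches every output cell once, which is why the pass count, not the total operation count, becomes the natural cost measure.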

Analysing theoretically. O(n) passes to multiply two n × n matrices. Uses O(n²)-size textures. Your favourite parallel algorithm for matrix multiplication can be mapped onto the GPU. How about Strassen? The number of passes is the same; the total work decreases. Can reduce the number of passes to O(n^0.9) using special packing ideas [GKV04].

In practice… Simpler implementations win out. The fastest GPU implementation uses n matrix-vector operations, and packs data into 4-vectors. It is better than naïve code using three For loops, but not better than optimized CPU code, e.g., in linear algebra packages. Trendlines show that the running time of GPU code grows linearly. Packing and additions, e.g. in Strassen-type algorithms, are problematic due to data movement and cache effects.

Summing up SIMD machine Pass based computation  Pipelined operation  Cost is in changing state  Total work Revisiting RNN

Recall the two implementations. M*N distances need to be evaluated. First idea: M+N passes using M+N processors in parallel. Second idea: an M × N grid to compute the distances in 1 pass. Find the min in 1 pass? How to aggregate?

GPU hardware modelling The GPU is a parallel processor, where each pixel can be considered to be a single processor (almost) The GPU is a streaming processor, where each pixel processes a stream of data We can build high-level algorithms (EM) from low level primitives (distance calculations, counting)

Operation costs When evaluating the cost of an operation on the GPU, standard operation counts are not applicable  The basic unit of cost is the “rendering pass”, consisting of a stream of data being processed by a single program  Virtual parallelism of GPU means that we “pretend” that all streams can be processed simultaneously  Assign a cost of 1 to a pass: this is analogous to external memory models where each disk access is 1 unit Many caveats An extreme example

One-pass median finding. A well-studied problem in streaming, with a log n pass lower bound. Consider repeating the data O(log n) times to get a sequence of size O(n log n). At any point the algorithm has an upper bound (UB) and a lower bound (LB) on the median. 1. The algorithm reads the next n numbers and picks a random element u between LB and UB. 2. The algorithm uses the next n numbers to find the rank of u. 3. It sets LB=u or UB=u depending on the counts. 4. Repeat! Quickfind! Why does the example not generalize? Read is not the same as Write. Total work vs. number of passes.
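The four steps above can be sketched in Python as follows (a sketch assuming distinct values and odd n; each loop iteration consumes two copies of the data, one to pick the pivot u and one to compute its rank):

```python
import random

def stream_median(data, seed=0):
    """Randomized 'quickfind' median over repeated copies of the data."""
    rng = random.Random(seed)
    target = (len(data) - 1) // 2             # rank of the median
    lb, ub = float("-inf"), float("inf")
    while True:
        # one copy of the stream: pick a random element strictly between LB and UB
        u = rng.choice([x for x in data if lb < x < ub])
        # next copy of the stream: find the rank of u
        rank = sum(1 for x in data if x < u)
        if rank == target:
            return u
        if rank < target:                     # u lies below the median
            lb = u
        else:                                 # u lies above the median
            ub = u
```

Each round shrinks the candidate interval, so O(log n) copies of the data suffice in expectation, exactly as in quickselect.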

PART II.5. Data Mining Primitives

I: Clustering. Problem definition: given N points p_j, find K centers c_i minimizing the k-means objective Σ_j min_i ||p_j − c_i||². Hall & Hart '03: a (K+N)-pass KMEANS algorithm. What we will see: log N passes per iteration; EM and "soft assignments".

K-Means 2D version Basic Algorithm: Recall Voronoi scenario

Why GPU? Motion/updates! When does the crossover happen?

K-Means in GPU In each pass, given a set of K colors we can assign a “color” to each point and keep track of “mass” and “mean” of each color. Issues:  Higher D  Extension to more general algorithms

The RNN-type approach. K*N distances, 1 pass. Compute min, logarithmic passes. Aggregate, logarithmic passes. Normalize and compute new centers. High D? Replace 1 by O(D). EM? Instead of Min (which is 0/1) compute "shares". Same complexity!
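A NumPy sketch of one such soft-assignment ("shares") iteration. The parameter beta is a hypothetical softness knob, not from the tutorial; as beta grows, the shares approach the hard 0/1 min and the update degenerates to k-means.

```python
import numpy as np

def em_step(points, centers, beta=10.0):
    """One soft-assignment update: each point splits its unit mass among
    centers by distance, then centers move to their weighted means."""
    # K x N grid of squared distances (the '1 pass' distance computation)
    d2 = ((points[None, :, :] - centers[:, None, :]) ** 2).sum(-1)
    # soft 'shares' instead of a hard min (stabilized per point)
    w = np.exp(-beta * (d2 - d2.min(axis=0)))
    w = w / w.sum(axis=0, keepdims=True)
    # aggregation passes: weighted mass and mean for each center
    return (w[:, :, None] * points[None, :, :]).sum(1) / w.sum(1)[:, None]
```

On the GPU the K x N distance grid is one pass and each aggregation is logarithmically many, matching the complexity claimed on the slide.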

Demo: Expectation Maximization

II: Wavelet Transforms. Image analysis. [figure: one transform level splits the image into quadrants A, B, C, D]

Wavelets: Recurse!
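One recursion level of the transform can be sketched in NumPy (a Haar-style sketch, not the tutorial's code): the quadrant of averages A is what the next level recurses on, while B, C, D hold the detail coefficients.

```python
import numpy as np

def haar2d_level(img):
    """One level of the 2D Haar DWT on an even-sized image."""
    a = img[0::2, 0::2]          # 2x2 blocks: top-left samples
    b = img[0::2, 1::2]          # top-right
    c = img[1::2, 0::2]          # bottom-left
    d = img[1::2, 1::2]          # bottom-right
    A = (a + b + c + d) / 4      # averages: recurse on this quadrant
    B = (a - b + c - d) / 4      # horizontal detail
    C = (a + b - c - d) / 4      # vertical detail
    D = (a - b - c + d) / 4      # diagonal detail
    return A, B, C, D
```

Each output pixel depends only on a fixed 2x2 block of inputs, so a whole level is a single data-parallel pass on the GPU.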

DWT Demo Wang, Wong, Heng & Leung

DWT Demo Wang, Wong, Heng & Leung

III: FFTs What does the above remind you of ?

FFT Demo Moreland and Angel

IV: Sorting Networks! Can be used for quantiles, computing histograms. GPUSort: [Govindaraju et al. 2005]. Basically "network computations": each pixel performs some local piece only.
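A sequential Python sketch of the classic bitonic sorting network (the kind of network GPUSort maps onto the GPU; this is the textbook formulation, not GPUSort's code). Every iteration of the inner loop over i is an independent compare-exchange, so each (k, j) stage is one data-parallel pass.

```python
def bitonic_sort(a):
    """In-place bitonic sort; assumes len(a) is a power of two."""
    n = len(a)
    k = 2
    while k <= n:                 # stage size
        j = k // 2
        while j > 0:              # one pass per (k, j) pair
            for i in range(n):    # independent compare-exchanges
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```

The pass count is O(log² n) regardless of the data, which is exactly the obliviousness that makes sorting networks GPU-friendly.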

In summary The big idea  Think parallel.  Think simple.  Think streams. SIMD Pass based computing Tradeoff of pass versus total cost “Ordered” access

Next up. We saw "number crunching" algorithms. What are the applications in mining which are "combinatorial" in nature and yet visual? And of course, resources and how-tos.

Part III: Other GPU Case Studies and High-level Software Support for the GPU

What have we seen so far Computation on the GPU How to program Cost model  What is cheap and expensive Number based problems, e.g.,  nearest neighbor (Voronoi), clustering, wavelets, sorting

What you will see … Examples with different kinds of input data 1. Computation of Depth Contours 2. Graph Drawing Languages and tools to program the GPU Final wrap-up

Depth Contours

Location depth = 1

Depth Contours k-contour: set of all points having location depth ≥ k We wish to compute the set of all depth contours Every n-point set has an n/3-contour The Tukey median is the deepest contour.
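For reference, a brute-force Python sketch of location depth (sampling halfplane directions; an approximation for intuition, not the duality-based GPU algorithm of this tutorial):

```python
import math

def location_depth(p, pts, ndirs=360):
    """Minimum number of points in a closed halfplane through p,
    minimized over ndirs sampled boundary directions."""
    best = len(pts)
    for k in range(ndirs):
        th = math.pi * k / ndirs
        nx, ny = math.cos(th), math.sin(th)
        dots = [nx * (x - p[0]) + ny * (y - p[1]) for x, y in pts]
        pos = sum(1 for d in dots if d >= 0)   # one closed halfplane
        neg = sum(1 for d in dots if d <= 0)   # the opposite one
        best = min(best, pos, neg)
    return best
```

A point of depth k lies on or inside the k-contour, so this predicate is what the contour computation has to answer for every pixel.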

Motivation Special case of general non-parametric data analysis tool “visualize the location, spread, correlation, skewness, and tails of the data” [Tukey et al 1999] Hypothesis testing, robust statistics

Point-Hyperplane Duality. Mapping from R² → R². Points get mapped to lines and vice-versa: the point (a,b) → the line y = -ax + b; the line y = ax + b → the point (a,b). Incidence properties are preserved.
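A tiny Python check of the incidence claim (representing a line y = m*x + b by the pair (m, b); helper names are illustrative):

```python
def dual_of_point(a, b):
    """Point (a, b) maps to the line y = -a*x + b."""
    return (-a, b)

def dual_of_line(m, b):
    """Line y = m*x + b maps to the point (m, b)."""
    return (m, b)

def incident(px, py, m, b):
    """Does point (px, py) lie on line y = m*x + b?"""
    return abs(py - (m * px + b)) < 1e-9
```

For example, the point (2, 3) lies on y = 0.5x + 2; in the dual, the point (0.5, 2) lies on the line y = -2x + 3, the same incidence mirrored.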

Contours ↔ k-levels. k-contour: convex hulls of the k- and (n−k)-levels in the dual.

Main Algorithm Draw primal line for each dual pixel that lies on a dual line Record only (at each primal pixel) the line whose dual pixel has least depth Recall nearest neighbor (Voronoi), clustering – we are using the geometry heavily

Demo: Depth Contours

Dynamic Depth Contours Smooth, but random movement of points Compute incremental depth changes using duals Worst-case quadratic updates might be performed In practice, small number of updates  Can achieve real-time speeds for first time

Dynamic Depth Contours

Graph Drawing

Graph: a set of vertices; a Boolean relation between pairs of vertices defines edges. Given a graph, lay out the vertices and edges in the plane based on some aesthetics.

Abstract Graph Example. 9 vertices, 12 edges. Edges:
v1: v2, v4
v2: v1, v3, v5
v3: v2, v6
v4: v1, v5, v7
v5: v2, v4, v6, v8
v6: v3, v5, v9
v7: v4, v8
v8: v5, v7, v9
v9: v6, v8

After Graph Layout (grid3.graph): we can visualize the relationships much better. [figure: v1–v9 laid out as a 3×3 grid]

Force-Directed Layouts. Start with a random initial placement of vertices. Define an energy field on the plane using forces between vertices (both attractive and repulsive). Evolve the system based on the energy field. Local energy minimization defines the layout. Essentially solving an N-body problem.

GPU Algorithm. Naïve approach: at each iteration, compute pairwise repulsive forces; compute attractive forces along edges; the force on each vertex is the sum of the individual forces; the resultant force defines the vertex displacement; displace vertices and iterate. Complexity – O(n) per iteration. Bottleneck – summing the individual forces. Optimized approach: use a logarithmic parallel summation algorithm. Complexity – O(log n) per iteration.
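A runnable sketch of the naïve approach in Python (Fruchterman–Reingold-style forces; k, step, the iteration count, and the displacement clamp are hypothetical tuning choices, not from the tutorial):

```python
import math

def force_layout(pos, edges, iters=300, k=1.0, step=0.02):
    """Naive force-directed layout: O(n^2) pairwise repulsion per
    iteration plus spring attraction along edges."""
    n = len(pos)
    pos = [list(p) for p in pos]
    for _ in range(iters):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):                    # repulsion: all pairs
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d2 = dx * dx + dy * dy + 1e-9
                disp[i][0] += k * k * dx / d2
                disp[i][1] += k * k * dy / d2
        for i, j in edges:                    # attraction along edges
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) + 1e-9
            f = d * d / k                     # spring force magnitude
            disp[i][0] -= f * dx / d; disp[i][1] -= f * dy / d
            disp[j][0] += f * dx / d; disp[j][1] += f * dy / d
        for i in range(n):                    # clamped Euler step
            pos[i][0] += max(-0.1, min(0.1, step * disp[i][0]))
            pos[i][1] += max(-0.1, min(0.1, step * disp[i][1]))
    return pos
```

Summing the per-vertex force contributions is exactly the bottleneck the slide identifies; on the GPU that sum is replaced by the logarithmic parallel reduction.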

Parallel Summation Algorithm. [figure: reduction with offset = 4, then offset = 2, then offset = 1] O(log n) passes.
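The figure's offset-halving reduction in Python (a sketch assuming a power-of-two length; each while-iteration corresponds to one data-parallel pass):

```python
def parallel_sum(values):
    """Tree reduction: log2(n) passes, halving the active range each time."""
    a = list(values)
    offset = len(a) // 2
    while offset >= 1:
        # one pass: these additions are mutually independent
        for i in range(offset):
            a[i] += a[i + offset]
        offset //= 2
    return a[0]
```

Because every pass's additions touch disjoint pairs, the GPU can execute each pass as a single fragment-program invocation over the active range.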

Demo: Graph Drawing

Software tools

GPGPU Languages. Why do we want them? Make programming GPUs easier! Don't need to know OpenGL, DirectX, or ATI/NV extensions. Simplify common operations. Focus on the algorithm, not on the implementation. Sh (University of Waterloo). Brook (Stanford University).

Brook

Brook: Streams. streams: collections of records requiring similar computation (particle positions, voxels, FEM cells, …)
float3 positions<200>;
float3 velocityfield<100,100,100>;
Similar to arrays, but… index operations disallowed: position[i]. Read/write stream operators:
streamRead(positions, p_ptr);
streamWrite(velocityfield, v_ptr);
Encourage data parallelism. Slide Courtesy: Ian Buck

Brook: Kernels. kernels: functions applied to streams, similar to a for_all construct.
kernel void foo (float a<>, float b<>, out float result<>) {
  result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a, b, c);
Equivalent CPU loop:
for (i=0; i<100; i++) c[i] = a[i]+b[i];
Slide Courtesy: Ian Buck

Brook: Kernels kernels  functions applied to streams similar to for_all construct kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; }  no dependencies between stream elements  encourage high arithmetic intensity Slide Courtesy: Ian Buck

Brook: Reductions. reductions: compute a single value from a stream.
reduce void sum (float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a, r);
Equivalent CPU loop:
r = a[0];
for (int i=1; i<100; i++) r += a[i];
Slide Courtesy: Ian Buck

Brook: Reductions reductions  associative operations only (a+b)+c = a+(b+c) Order independence sum, multiply, max, min, OR, AND, XOR matrix multiply Slide Courtesy: Ian Buck

Brook: Reductions. multi-dimension reductions: stream "shape" differences are resolved by the reduce function.
reduce void sum (float a<>, reduce float r<>) {
  r += a;
}
float a<20>;
float r<5>;
sum(a, r);
Equivalent CPU loop:
for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++) r[i] += a[i*4 + j];
}
Slide Courtesy: Ian Buck

Brook: Matrix Vector Multiply

kernel void mul (float a<>, float b<>, out float result<>) {
  result = a * b;
}

reduce void sum (float a<>, reduce float result<>) {
  result += a;
}

float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;
mul(matrix, vector, tempmv);
sum(tempmv, result);

[figure: M times the replicated V gives T]
Slide Courtesy: Ian Buck

Brook: Matrix Vector Multiply

kernel void mul (float a<>, float b<>, out float result<>) {
  result = a * b;
}

reduce void sum (float a<>, reduce float result<>) {
  result += a;
}

float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;
mul(matrix, vector, tempmv);
sum(tempmv, result);

[figure: tempmv reduced row-wise to result in the render target]
Slide Courtesy: Ian Buck
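The two-stage pattern on these slides, an elementwise kernel (mul) followed by a row-wise reduction (sum), can be emulated with plain lists; this is a hedged sketch with illustrative small shapes, not the Brook runtime:

```python
def mul(matrix, vector):
    """Elementwise kernel: Brook replicates the <1,N> vector across all
    rows of the <M,N> matrix, then multiplies element by element."""
    return [[m * v for m, v in zip(row, vector)] for row in matrix]

def row_sum(tempmv):
    """Reduction: collapse each row of the product stream to one value."""
    return [sum(row) for row in tempmv]

matrix = [[1, 2], [3, 4], [5, 6]]   # matrix<3,2>
vector = [10, 100]                  # vector<1,2>
result = row_sum(mul(matrix, vector))
```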

Demo: Matrix Vector Multiplication

Demo: Bitonic Sort and Binary Search
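Bitonic sort maps well to the GPU because its compare-exchange pattern is fixed in advance, independent of the data; every stage can be one rendering pass. A hedged CPU sketch of the network (not the demo's fragment-program implementation):

```python
def bitonic_sort(data):
    """Sort a power-of-two-length list with the bitonic sorting network."""
    a = list(data)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:               # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:            # compare-exchange distance within a merge
            for i in range(n):  # on the GPU, this loop runs in parallel
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The total work is O(n log² n) compare-exchanges, but only O(log² n) data-independent passes, which is the figure that matters for a multi-pass GPU implementation.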

Debugging Tools

Debugger Features
- Per-pixel register watch
  - including interpolants, outputs, etc.
- Breakpoints
- Fragment program interpreter
  - single step forwards or backwards
  - execute modified code on the fly
Slide Courtesy: Tim Purcell

Shadesmith
- Debugger in the spirit of imdebug
  - simply add a debug statement when binding shaders
  - display window with scale, bias, component masking
- Advanced features
  - watch any shader register's contents without recompiling
  - shader single stepping (forward and backward), breakpoints
  - shader source edit and reload without recompiling
Slide Courtesy: Tim Purcell

Demo: Shadesmith

GPGPU.org
Your first stop for GPGPU information!
- News:
- Discussion:
- Developer resources, sample code, tutorials:
- And for open source GPGPU software:

Summing up

What we saw in this tutorial ...
- The GPU is fast, outperforming the CPU
  - high memory bandwidth
  - parallel processing capabilities
- Shown to be effective in a variety of applications
  - interactive visualization applications
  - data mining, simulations, streaming operations
  - spatial database operations
- High-level programming and debugging support

Limitations of the GPU
- Not designed as a general-purpose processor
  - meant to extract maximum performance from the highly parallel tasks of computer graphics
- Will not be suited to all applications
  - bad for "pointer chasing" tasks (e.g., word processing)
  - no bit shifts or bit-wise logical operations (rules out most cryptography)
  - no double-precision arithmetic yet (rules out large-scale scientific applications)
  - unusual programming model

Conclusions: Looking Forward
- Increased performance
- More features
  - double-precision arithmetic
  - bit-wise operations
  - random bits
- Increased programmability and generality
  - strike the right balance between generality and performance
- Other data-parallel processors will emerge
  - Cell processor by IBM, Sony, and Toshiba
  - better suited to general-purpose stream computing

Questions? Slides can be downloaded soon from

Map Simplification

Motivation Original US Map: ~425,000 vertices

Motivation Simplified US Map: ~1500 vertices

Motivation

Basic idea
- Simplified chain: a proper subsequence of the chain
- Short-Cut Segment (SCS): any line segment between two distinct vertices
[figure: chain v1 v2 v3 v4 v5 with a shortcut segment]

Problem Statement
Given a map M and tolerance parameter ε, construct the simplest map M′ that preserves shape
- Geometry: each point on the simplified chain is within ε of the original chain, and vice versa
- Topology: prevent inter-chain intersections
Valid simplification: preserves both

Proximate Short-Cut Segments
Consider an SCS between vertices i and j of the original chain
- The portion of the chain between vertices i and j is within ε of the SCS
- Geometry preservation: does not affect other simplifications!
[figure: vertices i, i+1, ..., j inside an ε-band around the SCS]
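The proximity condition on this slide reduces to point-to-segment distance tests. A hedged sketch (function names are illustrative, and checking the skipped vertices is only the core of the full sub-chain test described on the slide):

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:                      # degenerate segment
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter to stay on the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def scs_within_eps(chain, i, j, eps):
    """True if every vertex skipped by the shortcut (i, j) lies within eps of it."""
    return all(point_segment_dist(chain[k], chain[i], chain[j]) <= eps
               for k in range(i + 1, j))
```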

Compliant Short-Cut Segments
An SCS that lies completely inside the chain's Voronoi cell
- topology preserving
[figure: non-compliant vs. compliant SCS]

ε-Fattening

But what does this remind you of ...
Voronoi diagrams! The same recurrent themes of geometric and visual primitives.

The Algorithm
Given tolerance ε, use SCSs that are "not expired"
- the set of valid SCSs
- Construct the Voronoi diagram for all chains
- Phase I: for each chain, compute the valid SCSs
- Phase II: stitch them together (done on the CPU)
Short of time to explain the details ...

Compliant Short-Cut Segments

Minimum Link Path

Demo: Map Simplification

Example 1: Voronoi diagrams Demo Courtesy: Kenny Hoff, UNC

Why should you care?
If you are a GPU "user"
- a fast streaming coprocessor
- many tools (and code) out there to test applications
If you are (or want to be) a GPU "researcher"
- many problems remain unsolved
- multidisciplinary problems
  - mapping stream problems to the GPU
  - new interaction paradigms for interfacing with data analysis tools ...

Common Coordinate Systems
- Object space: local to each object
- World space: common to all objects
- Eye space / camera space: derived from the view frustum
- Clip space / Normalized Device Coordinates (NDC): [-1,-1,-1] → [1,1,1]
- Screen space: indexed according to hardware attributes
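The last step of that chain, mapping NDC to screen space, can be sketched in a few lines; this is a hedged illustration for a hypothetical width x height viewport (the function name and the downward-growing y convention are illustrative assumptions, not part of the tutorial):

```python
def ndc_to_screen(x_ndc, y_ndc, width, height):
    """Map NDC coordinates in [-1, 1] to screen coordinates."""
    sx = (x_ndc + 1.0) * 0.5 * width
    sy = (1.0 - (y_ndc + 1.0) * 0.5) * height   # screen y grows downward
    return sx, sy
```

So NDC (0, 0) lands at the viewport center, and (-1, 1), the upper-left corner of NDC space, lands at screen pixel (0, 0).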

Programming Tools
The code you saw was written in a language called Cg, developed by NVIDIA but supported (mostly) on all cards. Cg is a one-level-higher-than-assembly language. Other higher-level languages, such as Brook and Sh, also exist. We cannot yet write a general C program and expect the compiler to identify GPU-mapped elements: GPU languages are the gatekeepers of the hardware; they force us to express constructs in certain ways. More on this later ...

Coming up next ...
- What is the correct way to model the GPU?
- What kind of low-level primitives are available for doing data mining on the GPU?

Breaking down the application
A standard strategy:
1. Reduce each image to a short signature.
2. Cluster the images.
Given an image (or a collection), perform the above and compute the intersection.
Common tools of the trade:
- clustering / nearest neighbour
- wavelets / FFT
- sorting / quantiles

Two views of the GPU
- Snake: exploit parallelism
  - computation-intensive tasks
  - neighbourhoods can be probed
- Cube: more general, a "processor on an object"
  - extremely useful for dynamic/geometric data