Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Stanford University DARPA Site Visit, UNC.


Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Stanford University DARPA Site Visit, UNC May 6th, 2004

Motivation

GPUs are faster than CPUs
GPUs are getting faster, faster
Why?
– Massive parallelism (1000s of ALUs)
– Choreographed communication
– Efficiently utilize VLSI resources [DIS/PCA mantra]
Programmable GPUs = stream processors
Many streaming applications beyond graphics
Buy a desktop supercomputer for $50!
Revolutionize computing?

Recent Performance Trends
[performance-trend chart not reproduced in the transcript]


CPU vs GPU

Intel 3 GHz Pentium 4
– 12 GFLOPS peak performance (via SSE2)
– 5.96 GB/sec peak memory bandwidth
– 44 GB/sec peak bandwidth from 8K L1 data cache
NVIDIA GeForce 6800
– 45 GFLOPS peak performance
– 36 GB/sec peak memory bandwidth
– Texture cache bandwidth and size undisclosed
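The peak numbers on this slide imply how many math operations a kernel must perform per float fetched before it becomes compute-bound rather than bandwidth-bound. A minimal Python sketch of that back-of-the-envelope calculation (the function name and the 4-bytes-per-float assumption are illustrative, not from the slides):

```python
def ops_per_float(peak_gflops, bandwidth_gb_s, bytes_per_float=4):
    """Math operations a chip must perform per float fetched to stay
    compute-bound instead of bandwidth-bound."""
    floats_per_sec = bandwidth_gb_s / bytes_per_float  # Gfloats/sec
    return peak_gflops / floats_per_sec

# GeForce 6800, using the slide's figures: 45 GFLOPS peak, 36 GB/sec
# memory bandwidth -> about 5 math ops per float fetched from memory.
print(ops_per_float(45, 36))
```

By this measure the Pentium 4 (12 GFLOPS, 44 GB/sec from L1) needs only about one math op per cached float, which is part of why cache-friendly CPU codes like Atlas compete so well.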

Deliverables

Develop a version of PCA Brook for GPUs
– Programmer need not know GL
Versions
– New ATI (R420) and NVIDIA (NV40) hardware
– Linux and Windows
– DX and OpenGL
Release as open source [V1.0 Dec 2003]
Support OneSAF LOS, collision-detection, and route-planning algorithms

Research Issues

Brook semantics
– E.g. variable-length streams: vout
– …
Compilation techniques
– Virtualization of the GPU
– Splitting kernels (MRDS)
Explore the streaming application space
– Scientific computing: RT, MD, BLAS, FFT, …
– Machine learning: HMM, linear mod., Bayes, …

Brook Update Ian Buck


Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan

Dense Matrix-Matrix Multiplication

Atlas on the Intel P4 wins!
[benchmark chart not reproduced in the transcript]

CPU vs GPU

Intel 3 GHz Pentium 4
– 12 GFLOPS peak performance (via SSE2)
– 5.96 GB/sec peak memory bandwidth
– 44 GB/sec peak bandwidth from 8K L1 data cache
NVIDIA GeForce 6800
– 43 GFLOPS peak performance
– 36 GB/sec peak memory bandwidth
– Texture cache bandwidth and size undisclosed
Why is graphics hardware so slow?

Why is Graphics Hardware so Slow?

Microbenchmark (MAD). The slide tabulated GFLOPS, cache bandwidth, and sequential-read bandwidth for two NVIDIA parts and the ATI 9800XT and X800; the numeric values were lost in the transcript.
NVIDIA: 8% compute efficiency, 82% of cache bandwidth. Arithmetic intensity: 12 math operations per float fetched from cache.
ATI: 18% of peak performance, 99% of peak cache bandwidth. Arithmetic intensity: 8-to-1 math-to-cache-fetch ratio.

Why is Graphics Hardware so Slow?

Matrix-matrix multiplication is bandwidth-limited on the GPU.
– Memory blocking to increase cache utilization does not help
– Architectural problem, not a programming-model problem
PCA stream-processing architectures (Imagine) will do much better!
The slide tabulated matrix-matrix multiplication GFLOPS and bandwidth for two NVIDIA parts, the ATI 9800XT, the ATI X800 (~12 GFLOPS, ~30 GB/sec), and the Pentium 4; the remaining values were lost in the transcript.
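The "memory blocking" the slide says does not help on the GPU is the standard cache-blocking (tiling) trick from CPU dense linear algebra, the kind of optimization Atlas applies. A minimal pure-Python sketch of a blocked matrix-matrix multiply, purely to illustrate the technique (function name and block size are illustrative):

```python
def blocked_matmul(A, B, bs=2):
    """Multiply square matrices A and B using bs-by-bs tiles, so each
    tile of A, B, and C can stay resident in fast memory while it is
    reused. On CPUs this raises cache utilization; the slide's point
    is that the GPU texture cache is too small for it to help there."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # Accumulate the (ii, jj) tile of C from one tile pair.
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Each tile of A and B is reused bs times once loaded, so the ratio of math to memory traffic grows with the block size, provided the cache can hold the tiles.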

Variable Output Shaders Daniel Horn, Ian Buck, Pat Hanrahan

Motivation: Enabling Algorithms

Not all algorithms map to the 1-in/1-out semantics of GPUs.
Other classes of algorithms require data filtering (1-in, 0-out) and amplification (1-in, n-out).
vout corresponds to the conditional write on Imagine.
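The filtering and amplification semantics can be shown with a minimal Python sketch; the function names are illustrative, and this runs on a CPU, not the GPU:

```python
def stream_filter(stream, keep):
    """Data filtering (1-in, 0-or-1-out): each input element either
    passes through or produces no output at all."""
    return [x for x in stream if keep(x)]

def stream_amplify(stream, expand):
    """Data amplification (1-in, n-out): each input element may emit
    any number of output elements."""
    out = []
    for x in stream:
        out.extend(expand(x))
    return out
```

A fixed-function GPU kernel writes exactly one output per input, which is why expressing either of these patterns requires the sentinel-and-compaction machinery described on the next slide.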

Algorithms

– Ray tracing terrains
– Marching cubes
– Adaptive subdivision surfaces
– Collision detection [OBB]
– Graph traversal
– …

Implementation on GPU

Push output (write a sentinel if there is no push).
Options to consolidate sentinels:
– Sort, O(n (log n)^2): sort the sentinels to the end, then truncate
– Scan/Search, O(n log n): perform a running sum, then search for the gather location
– Scan/Scatter, O(n log n): perform a running sum, then scatter to the destination
– Constant-time hardware implementation
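The Scan/Scatter option can be sketched in Python: an exclusive running (prefix) sum over the valid-output flags gives each surviving element its destination index, and a scatter then compacts the stream, squeezing out the sentinels. This is a CPU illustration of the idea with an invented function name, not the GPU implementation:

```python
def compact_scan_scatter(stream, valid):
    """Compact `stream`, keeping only elements whose `valid` flag is
    True (the others stand in for sentinel outputs)."""
    # Exclusive prefix sum over the flags: dest[i] is the number of
    # valid elements before position i, i.e. the scatter destination.
    dest = []
    total = 0
    for v in valid:
        dest.append(total)
        total += 1 if v else 0
    # Scatter each surviving element to its destination slot.
    out = [None] * total
    for i, (x, v) in enumerate(zip(stream, valid)):
        if v:
            out[dest[i]] = x
    return out
```

On hardware without scatter, the Scan/Search variant on the slide inverts this: each output slot binary-searches the prefix sums to find which input element to gather, at the same O(n log n) cost.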

Timing and Bandwidth Numbers
[results chart not reproduced in the transcript]

Future Work

Brook: semantics, compiling, virtualization
– Support new GPU features (branching, FB ops, …)
– Predication
Integration with the graphics pipeline
– Documented path to texture for rendering
– Access to other GPU features, e.g. occlusion culling
Interactive simulation; new algorithms
– Collision detection and line-of-sight calculations
– Merge ray tracer with the UNC/SAIC algorithm
– Machine learning: HMM, GLM, K-means, …
– Protein folding (StreamMD) and docking
– Virtual surgery

Distributed Brook

– Stream- and thread-level parallelism
– UPC distributed-memory semantics
– PCI-Express system for fast readback

GPU Cluster [DOE]

16-node cluster; each node is 3U, half depth.
Cluster totals:
– 32 × 2.4 GHz P4 Xeons
– 16 GB DDR
– 1.2 TB disk
– Infiniband 4X interconnect
Each node:
– Dual 2.4 GHz P4 Xeons
– Intel E7505 chipset
– 1 GB DDR
– ATI Radeon 9800 Pro, 256 MB
– GigE
– 80 GB IDE disk

Questions?

Fly-fishing fly images from The English Fly Fishing Shop.