Jeremy Meredith Lawrence Livermore National Laboratory UCRL-PRES-206819 This work was performed under the auspices of the U.S. Department of Energy by.

Presentation transcript:

1 The GAIA Project: Evaluation of GPU-Based Programming Environments for Knowledge Discovery
Jeremy Meredith, Lawrence Livermore National Laboratory
David Bremer, Lawrence Flath, John Johnson, Holger Jones, Sheila Vaidya, Randall Frank*
UCRL-PRES-206819
This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

2 Motivation
- Trends in the graphics marketplace
  - Inherent parallelism of graphics tasks
  - Performance increasing faster than for CPUs
  - Move to programmable hardware
  - Effects of mass markets
- Not expected to end anytime soon…
  - Today: 40GF, 2GB/s I/O, 30GB/s memory
  - 2006: 100GF, 8GB/s I/O, 60GB/s memory
  - 2007: 1TF…

3 The NV40 and the Sony Playstation 3
- Are graphics trends a glimpse of the future?
- The nVidia NV40 Architecture
  - 256MB+ RAM
  - 128 32-bit IEEE FP, 400MHz
  - 220M transistors, 110W of power
- The PlayStation3 (patent application)
  - Core component is a cell
    - 1 “PowerPC” CPU + 8 APUs (“vectorial” processors)
    - 4GHz, 128K RAM, 256GFLOP/cell
  - Multiple cells (Phone, PDA, PS3, …)
    - Four cell architecture (1TFLOP)
    - Central 64MB memory
- Keys
  - Streaming data models
  - Cache-driven/cache-oblivious computing
[Images: nVidia NV30, nVidia NV40]

4 Data representations for GPUs
- Programmable FP SIMD engines, GF today, 1TF by ’06
- Where can they be exploited?
  - Many advantages for the data pipeline
  - Data/algorithmic design challenges
  - Possible applicability for simulation
  - Many current research projects on scientific computing, databases, audio processing
- Current projects
  - Programmable rendering pipeline
    - Multi-variate, interactive
    - Increased graphics precision
  - Image composition pipeline
  - Implementation of physics based rendering
    - Simulated radiography, diffraction computation
  - Large image geo-registration
    - 100x performance improvement over CPU
[Diagram labels: Texture RAM, Vertex Program, Volume A, Volume B, GPU, Fragment Program]

5 Specific Project Goals
- Investigate use of COTS technologies for computation
  - “Non-traditional” applications
    - Image and speech
    - String, statistical, graph…
  - Mechanisms necessary for exploitation
    - Data infrastructure (e.g. cache coherent streaming…)
    - Software abstractions
  - Delineate some boundary conditions on their use
    - Evaluation vs CPU based solutions
    - Parameter-space investigation

6 Data Infrastructure
- Forms the basis of a comparative framework
  - Supports both GPU and CPU algorithmic implementations
  - Targets multiple platforms
  - Provides data abstraction
    - “Tile-based” streaming
    - Cache coherency control
    - CPU to GPU to CPU glue layer
  - Utilizes higher-level languages for algorithms
    - Cg, Brook, GLSL, etc.
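To make the “tile-based” streaming idea concrete, here is a minimal sketch of how a large image might be cut into tiles that are handed to a GPU or CPU implementation one at a time. The slides do not describe the actual GAIA data layer, so the function name `iter_tiles`, the tile size, and the optional `halo` border are all hypothetical choices for illustration.

```python
def iter_tiles(width, height, tile, halo=0):
    """Yield (x0, y0, x1, y1) boxes that cover a width x height image.

    Each box is a tile of at most tile x tile pixels, grown by `halo`
    pixels on every side and clamped to the image, so that a neighborhood
    operation such as convolution can be computed independently per tile.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (max(x - halo, 0),
                   max(y - halo, 0),
                   min(x + tile + halo, width),
                   min(y + tile + halo, height))


# Example: a 5 x 4 image cut into 2 x 2 tiles (edge tiles are smaller).
boxes = list(iter_tiles(5, 4, tile=2))
```

A halo equal to half the kernel width would let each tile be convolved without reading from its neighbors, at the cost of transferring the overlapping border pixels twice.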

7 Image Processing Applications
- Common attributes
  - Large, streaming imagery on a single gfx card
  - Parallel 1D and 2D applications
  - Multi-spectral (four, possibly temporal channels)
- Discrete convolution
  - Arbitrary kernels
- Correlation
  - Separate threshold, search, and detection phase included
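For reference, a software baseline for discrete convolution with an arbitrary kernel, followed by a separate threshold pass, could look like the sketch below. This is a plain CPU illustration, not the project's Cg or Brook code; the names `convolve2d` and `detect` and the zero-padding at the image border are assumptions.

```python
def convolve2d(image, kernel):
    """Direct 2D convolution of one channel; pixels outside the image count as zero."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    cy, cx = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # Convolution flips the kernel: sample at (y - dy, x - dx).
                    sy = y - (j - cy)
                    sx = x - (i - cx)
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += image[sy][sx] * kernel[j][i]
            out[y][x] = acc
    return out


def detect(response, threshold):
    """Separate threshold pass: return (row, col) of every value at or above threshold."""
    return [(y, x)
            for y, row in enumerate(response)
            for x, v in enumerate(row)
            if v >= threshold]


# Convolving a unit impulse reproduces the kernel itself.
impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = convolve2d(impulse, kernel)
```

On a GPU the two outer loops disappear: each output pixel becomes one fragment, and the inner loops become texture fetches in the fragment program.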

8 String Processing Applications
- Representation and bandwidth characteristics
- String comparison
  - “Bulk” comparison operations individual outputs
- String sorting
  - Based on string comparison
  - Batched sort based on radix algorithms
- String searching
  - “Wildcard” pattern matching
  - Sort-based element search
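As one possible reading of “batched sort based on radix algorithms”, the sketch below sorts a batch of strings with a least-significant-digit radix sort, one stable bucket pass per character position. It is a CPU illustration only; the function name `radix_sort_strings` is hypothetical, and it assumes every character has a code point below 256.

```python
def radix_sort_strings(strings):
    """LSD radix sort of a batch of strings (character codes must be < 256).

    A string that is a prefix of another sorts before it, matching the
    usual lexicographic order.
    """
    if not strings:
        return []
    width = max(len(s) for s in strings)
    items = list(strings)
    # Walk character positions from last to first; each pass is a stable bucket sort.
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(257)]  # bucket 0 means "past the end of the string"
        for s in items:
            key = ord(s[pos]) + 1 if pos < len(s) else 0
            buckets[key].append(s)
        items = [s for bucket in buckets for s in bucket]
    return items


words = ["gpu", "cpu", "cg", "brook", "c", ""]
ordered = radix_sort_strings(words)
```

Every pass touches all strings in the batch at the same character position, which is the kind of regular, data-parallel access pattern that suits streaming hardware.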

9 Other Application Targets
- Image transforms
  - FFT, Wavelet
  - Many application domains
- Statistical functions on images
  - Moments, regression (general linear model)
  - Hypothesis/model driven image processing, texture characterization, etc.
  - Hidden Markov Models
- Graph search
  - Structured (fully connected) or unstructured graphs, detect and return lowest cost path
  - Many application domains

10 System Targets
- Constrained system targets based on resource limits
- Hardware targets
  - nVidia: NV3x, NV4x, NV5x
    - Focus on NV4x due to new branching capabilities
    - Dual CPU IA32 platform
    - PCI-Express (PCIe) enhanced readback and async bandwidth
  - BG/L and Merrimac
- OS targets
  - Primarily Linux, some Windows due to driver issues
- Language targets
  - nVidia Cg, Brook

11 Convolution Timing Results
- All timings count download, render, and readback
- First render pass is excluded from the count
- Overhead to load shader can be substantial
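The timing rule above (count download, render, and readback, but exclude the first pass) can be sketched as a small harness. The three callables are hypothetical stand-ins for the real transfer and render steps, and skipping the entire first pass rather than only its render step is an assumption.

```python
import time


def time_passes(download, render, readback, passes=5):
    """Return average seconds per pass for download + render + readback.

    One untimed warm-up pass runs first so that one-time costs, such as
    loading the shader, do not distort the measurement.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    # Warm-up pass (not timed).
    download()
    render()
    readback()
    start = time.perf_counter()
    for _ in range(passes):
        download()
        render()
        readback()
    return (time.perf_counter() - start) / passes
```

Averaging over several passes also smooths out timer resolution and scheduling noise for very small kernels.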

12 Convolution Timing Results
- Software vs. two-texture hardware implementation
- At all but the smallest kernel sizes, GPUs are much faster

13 Convolution Timing Results
- Software vs. two-texture hardware implementation
- 32-bit textures use more memory bandwidth

14 Convolution Timing Results
- Two-texture vs. procedural hardware implementations
- Two-texture implementation requires more memory bandwidth

15 Double Precision
- Port of David Bailey’s single-double Fortran library* to NVidia’s Cg language
- Can emulate double precision
- Uses two single-precision floats
- The high-order float is an estimate of the double; the low-order float is the error of that estimate
- Resulting precision is almost double
- The exponent remains at single range
* available at http://crd.lbl.gov/~dhbailey/mpdist
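The two-float representation can be illustrated with a short sketch of addition. This is not the Cg port or Bailey's Fortran source: it is a Python emulation in which the helper `f32` rounds every intermediate result to single precision through the standard `struct` module, and the names `two_sum`, `split`, and `ds_add` are chosen here for illustration.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Error-free addition: return (s, e) where s is the rounded sum and s + e == a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def split(x):
    """Represent a double as (hi, lo): hi is the single estimate, lo the error of that estimate."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def ds_add(a, b):
    """Add two (hi, lo) pairs using only single-precision operations."""
    s, e = two_sum(a[0], b[0])
    e = f32(e + f32(a[1] + b[1]))
    # Renormalize so that hi again carries the leading bits of the result.
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


# 1 + 2**-30 is not representable in single precision, but the pair keeps it.
x = split(1.0 + 2**-30)
y = split(2**-30)
total = ds_add(x, y)
```

In this example plain single precision would return 1.0 for the sum, while the pair `total` retains the small term in its low-order float. Because both halves are ordinary singles, the representable exponent range is unchanged, which matches the slide's remark.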

16 Double Precision Results
[Chart: One Convolution Pass, Single vs Double Precision; 32-bit; Texture Size]
- Convolution with single and emulated-double arithmetic
- Double precision only 1.5x slower than single precision at the same texture depth

17 Future Plans
- Obtain results for a variety of algorithms including strings, HMMs, and FFTs
- Include performance and accuracy
- Extend to new architectures as available (e.g. Merrimac)
- Explore other high-level languages (e.g. Brook implementations and other streaming languages)
- Launch a benchmarking web site: