Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.

Slides:



Advertisements
Similar presentations
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
An Integrated Reduction Technique for a Double Precision Accumulator Krishna Nagar, Yan Zhang, Jason Bakos Dept. of Computer Science and Engineering University.
Lecture 6: Multicore Systems
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
A Sparse Matrix Personality for the Convey HC-1 Dept. of Computer Science and Engineering University of South Carolina Krishna K Nagar, Jason D. Bakos.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing Dr. Jason D. Bakos.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
Seven Minute Madness: Special-Purpose Parallel Architectures Dr. Jason D. Bakos.
FPGA vs. GPU for Sparse Matrix Vector Multiply Yan Zhang, Yasser H. Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos Dept. of Computer Science and.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data Jason D. Bakos Panormitis E. Elenis Jijun Tang Dept. of Computer Science and Engineering.
The PTX GPU Assembly Simulator and Interpreter N.M. Stiffler Zheming Jin Ibrahim Savran.
Seven Minute Madness: Reconfigurable Computing Dr. Jason D. Bakos.
Seven Minute Madness: Reconfigurable Computing Dr. Jason D. Bakos.
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
Slide 1/8 Performance Debugging for Highly Parallel Accelerator Architectures Saurabh Bagchi ECE & CS, Purdue University Joint work with: Tsungtai Yeh,
Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.
GPU Programming with CUDA – Accelerated Architectures Mike Griffiths
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
ECEn 191 – New Student Seminar - Session 9: Microprocessors, Digital Design Microprocessors and Digital Design ECEn 191 New Student Seminar.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
PERFORMANCE ANALYSIS cont. End-to-End Speedup  Execution time includes communication costs between FPGA and host machine  FPGA consistently outperforms.
CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
NVIDIA Tesla GPU Zhuting Xue EE126. GPU Graphics Processing Unit The "brain" of graphics, which determines the quality of performance of the graphics.
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
GPU Architecture and Programming
Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.
GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.
Trends in the Infrastructure of Computing
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
ECEn 191 – New Student Seminar - Session 6 Digital Logic Digital Logic ECEn 191 New Student Seminar.
Program Optimizations and Recent Trends in Heterogeneous Parallel Computing Dušan Gajić, University of Niš Program Optimizations and Recent Trends in Heterogeneous.
Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
My Coordinates Office EM G.27 contact time:
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling control-flow.
Philipp Gysel ECE Department University of California, Davis
Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014.
Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility SciDAC LQCD Software The Department of Energy (DOE) Office of Science.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
Why Parallel/Distributed Computing Sushil K. Prasad
Architecture & Organization 1
Computer Architecture and Organization
CSCE 190: Computing in the Modern World Dr. Jason D. Bakos
Architecture & Organization 1
BIC 10503: COMPUTER ARCHITECTURE
Multicore and GPU Programming
6- General Purpose GPU Programming
Multicore and GPU Programming
Presentation transcript:

Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos

CPU Design Heterogeneous Computing 2 ALU Control L1 cache L2 cache Control L1 cache L2 cache L3 cache CPU Core 1Core 2 FOR i = 0 to 1000 C[i] = A[i] + B[i] Copy part of A onto CPU Copy part of B onto CPU ADD Copy part of C into Mem time Copy part of A onto CPU Copy part of B onto CPU ADD Copy part of C into Mem …

Co-Processor Design Heterogeneous Computing 3 ALU cache GPU Core 1 ALU cache ALU cache ALU cache ALU cache ALU cache ALU cache ALU cache ALU Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 FOR i = 0 to 1000 C[i] = A[i] + B[i]

Heterogeneous Computing Heterogeneous Computing 4 initialization 0.5% of run time “hot” loop 99% of run time clean up 0.5% of run time 49% of code 2% of code co-processor Kernel speedup Application speedup Execution time hours hours hours hours hours Combine CPUs and coprocs Example: –Application requires a week of CPU time –Offload computation consumes 99% of execution time

Heterogeneous Computing 5 Heterogeneous Computing CPU X58 Host Memory Co- processor QPIPCIe On board Memory add-in cardhost General purpose CPU + special-purpose processors in one system ~25 GB/s ~8 GB/s (x16) ????? ~150 GB/s for GeForce 260

NVIDIA GT200 GPU Architecture Heterogeneous Computing 6 30 “streaming multiprocesors” Simple cores: –In-order execution, no: branch prediction, spec. execution, multiple issue, context switches –No cache (just 16K programmer- managed on-chip memory) –One active instruction at a time Can execute 32 threads in parallel per SM if no branch divergence

IBM Cell/B.E. Architecture Heterogeneous Computing 7 1 PowerPC, 8 small processors Programmer must manually manage 256K memory and threads invocation on each SPE Each SPE includes a vector unit like the one on current Intel processors –128 bits wide

Field Programmable Gate Arrays Heterogeneous Computing 8

Research Problems How well do certain types of (scientific) programs run on heterogeneous machines? What benefit should we expect? How do you modify an arbitrary program to run on a heterogeneous machine? What is the best way to design coprocessors? How can we make them faster and easier to program? What is the best way to design heterogeneous machines? How do we interconnect the CPU and coprocessors? Heterogeneous Computing 9

Heterogeneous Computing with FPGAs Annapolis Micro Systems WILDSTAR 2 PRO GiDEL PROCSTAR III Heterogeneous Computing 10

Heterogeneous Computing with FPGAs Convey HC-1 Heterogeneous Computing 11

Heterogeneous Computing with GPUs NVIDIA Tesla S1070 Heterogeneous Computing 12

Programming GPUs Heterogeneous Computing 13 GPU code: kernel function vector_multiply (vector v1, vector v2, vector v3) { /* declare arrays in on-chip memory */ on_chip_vector temp1,temp2,temp3; /* copy inputs from off-chip to on-chip in parallel */ temp1[my_thread_number]=v1[iteration count x number_of_threads + my_thread_number]; temp2[my_thread_number]=v2[iteration count x number_of_threads + my_thread_number]; /* perform parallel multiply */ temp3[my_thread_number]=temp1[my_thread_number]* temp2[my_thread_number]; /* copy outputs from on-chip to off-chip in parallel */ v3[iteration count x number_of_threads + my_thread_number]= temp3[my_thread_number]; } CPU code: copy v1 from cpu memory to gpu memory copy v2 from cpu memory to gpu memory invoke vector_multiply on GPU with 100 threads copy v3 from gpu memory to cpu memory

Heterogeneous Computing 14 Programming FPGAs

Our Projects FPGA-based co-processors for computational biology Heterogeneous Computing 15 1.Tiffany M. Mintz, Jason D. Bakos, "A Cluster-on-a-Chip Architecture for High-Throughput Phylogeny Search," IEEE Trans. on Parallel and Distributed Systems, to appear. 2.Stephanie Zierke, Jason D. Bakos, "FPGA Acceleration of Bayesian Phylogenetic Inference," BMC Bioinformatics, BMC Bioinformatics 2010, 11: Jason D. Bakos, Panormitis E. Elenis, "A Special-Purpose Architecture for Solving the Breakpoint Median Problem," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No. 12, Dec Jason D. Bakos, Panormitis E. Elenis, Jijun Tang, "FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data," 7th IEEE International Symposium on Bioinformatics & Bioengineering (BIBE'07), Boston, MA, Oct , Jason D. Bakos, “FPGA Acceleration of Gene Rearrangement Analysis,” 15th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'07), April 23-25, 2007.

Our Projects FPGA-based co-processors for linear algebra Heterogeneous Computing 16 1.Krishna.K. Nagar, Jason D. Bakos, "A High-Performance Double Precision Accumulator," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, Yan Zhang, Yasser Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos, "FPGA vs. GPU for Sparse Matrix Vector Multiply," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, Krishna K. Nagar, Yan Zhang, Jason D. Bakos, "An Integrated Reduction Technique for a Double Precision Accumulator," Proc. Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'09), held in conjunction with Supercomputing 2009 (SC'09), Nov. 15, Jason D. Bakos, Krishna K. Nagar, "Exploiting Matrix Symmetry to Improve FPGA-Accelerated Conjugate Gradient," 17th Annual IEEE International Symposium on Field Programmable Custom Computing Machines (FCCM'09), April 5-8, 2009.

Our Projects Heterogeneous Computing 17 Multi-FPGA System Architectures 1.Jason D. Bakos, Charles L. Cathey, E. Allen Michalski, "Predictive Load Balancing for Interconnected FPGAs," 16th International Conference on Field Programmable Logic and Applications (FPL'06), Madrid, Spain, August 28-30, Charles L. Cathey, Jason D. Bakos, Duncan A. Buell, "A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism," 14th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 24-26, 2006.

Most Recent Projects Task Partitioning for Heterogeneous Computing Heterogeneous Computing 18 GPU and FPGA Acceleration of Data Mining GPU Acceleration of Logic Synthesis 1.Ibrahim Savran, Jason D. Bakos, "GPU Acceleration of Near- Minimal Logic Minimization," 2010 Symposium on Application Accelerators in High Performance Computing (SAAHPC'10), July 13-15, 2010.

Contact Information Jason D. Bakos –Office: 3A52 – – Heterogeneous and Reconfigurable Computing (HeRC) Lab: –Lab: 3D15 – Heterogeneous Computing 19

Our Group Heterogeneous and Reconfigurable Computing Group Zheming Jin Tiffany Mintz Krishna Nagar Jason BakosYan Zhang Heterogeneous Computing 20