Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? Wasim Shaikh Date: 10/29/2015.

Multiprocessor Era Why multiprocessors? Performance gains through parallel processing. Recall Moore's law: improving process technology supports more transistors per chip, which gives us many cores, but the performance of each individual core stays roughly the same.

Why this study? Energy efficiency and off-chip bandwidth limits mean we need a better design for managing multiple cores. Candidate solutions: multiple cores of equal strength, custom logic, GPGPU SIMD engines, and field-programmable gate arrays (FPGAs).

Chip Models

Prior work: the model of Hill and Marty. M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, vol. 41, no. 7, pp. 33-38, July 2008. Conventional cores -> serial sections of code; unconventional cores -> parallel sections of code. This study extends that model to unconventional cores (U-cores), targeting the less obvious relationship between power and performance in U-core multiprocessors.

Focus of the study: modelling unconventional cores (U-cores) and identifying important trends in U-core design. Initial observations: custom logic -> very efficient but costly; GPGPU -> promising due to SIMD vector operations; FPGA -> great flexibility at the cost of area and power.

What they used for modelling: a cost model that includes a power budget, built from a power model for each BCE (base core equivalent) and a power model for the sequential core. Power-seq(perf) = perf^α, following E. Grochowski and M. Annavaram, "Energy per Instruction Trends in Intel Microprocessors," where α was estimated empirically. Combined with Pollack's Law, perf = sqrt(r), this gives Power-seq(perf) = sqrt(r)^α.
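Putting the pieces on this slide together (a sketch in the slide's own notation, where r is the number of BCEs fused into the sequential core and α is the empirically fitted exponent):

```latex
P_{\text{seq}}(\mathit{perf}) = \mathit{perf}^{\alpha},
\qquad
\mathit{perf}(r) = \sqrt{r} \;\;\text{(Pollack's Law)}
\;\;\Longrightarrow\;\;
P_{\text{seq}}(r) = \left(\sqrt{r}\right)^{\alpha} = r^{\alpha/2}.
```

So the sequential core's power grows polynomially in its area budget r, which is what makes large sequential cores expensive under a fixed power budget.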

Assumptions of the model: clock frequency does not increase; parallel sections are perfectly parallelizable; serial sections are perfectly serial; there is no overhead for synchronizing memories; and the power-hungry sequential processor can be turned off completely, with no static power consumption.

New Speedup
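The speedup equation for this slide did not survive the transcript; a sketch consistent with the Hill-Marty formulation extended with a U-core of relative performance μ (assuming an area budget of n BCEs, r of which form the sequential core, with the remaining n - r devoted to the U-core running the parallel fraction f) would be:

```latex
\text{Speedup}(f, n, r, \mu) =
\frac{1}{\dfrac{1-f}{\sqrt{r}} + \dfrac{f}{\mu\,(n-r)}}
```

With μ = 1 this reduces to the asymmetric-chip speedup of Hill and Marty.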

Cost function for bandwidth: defined in terms of BCE compulsory bandwidth. Compulsory bandwidth is the working bandwidth a BCE demands when the entire kernel resides in on-chip memory; it scales linearly with performance.

Modelling U-cores for power and bandwidth. Two new parameters, μ and φ: μ is a U-core's performance relative to a BCE, and φ is its power relative to a BCE (bandwidth demand follows from the compulsory-bandwidth model above). Together they can characterize any point in the U-core design space: a U-core with μ > 1 and φ = 1 is an accelerator; similarly, μ = 1 but φ < 1 gives the same performance at lower power.
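As an illustration only (the exact equations are on figures lost from the transcript), here is a minimal sketch of the Hill-Marty-style speedup for a chip whose parallel section runs on a U-core of relative performance mu; the function name and parameterization are my own:

```python
def u_core_speedup(f, n, r, mu):
    """Speedup of a kernel with parallel fraction f on a chip with an
    n-BCE area budget: r BCEs form one sequential core (Pollack's Law,
    perf = sqrt(r)); the remaining n - r BCEs are spent on a U-core
    that runs the parallel section mu times faster than a BCE would.
    With mu = 1 this reduces to the Hill-Marty asymmetric-chip model."""
    serial_time = (1.0 - f) / (r ** 0.5)   # serial part on the big core
    parallel_time = f / (mu * (n - r))     # parallel part on the U-core
    return 1.0 / (serial_time + parallel_time)
```

For example, with f = 0.9, n = 64, r = 4 and a conventional core array (mu = 1), the model predicts roughly a 15x speedup; raising mu shrinks only the parallel term, so the serial fraction quickly becomes the bottleneck, as Amdahl's Law dictates.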

Calibration methodology. To calibrate μ and φ, the following devices were used: Core i7-960, a 4-way multicore; GTX285 and GTX480, programmable NVIDIA GPUs; R5870, a comparably capable GPU from Advanced Micro Devices; Virtex-6 LX760, an FPGA from Xilinx; and a 65 nm commercial synthesis flow for custom logic. Workloads: matrix-matrix multiplication (MMM), with high arithmetic intensity and simple memory behavior; fast Fourier transform (FFT), with complex dataflow and memory requirements; and Black-Scholes (BS), with a rich mixture of arithmetic operators.

Results:

On an equal-area basis: a 3.4x performance improvement at 0.7x power relative to a BCE.

Reevaluating U-cores. The ITRS roadmap poses major challenges. Three questions need to be answered: Is it good to go with heterogeneous U-cores under these bandwidth and power limitations? Is custom logic always the best choice? Would the conclusions change if the first-order motive were energy efficiency rather than performance?

Useful links. Hill and his team have developed a Java-based online tool that lets you change the cost-function parameters of these models and regenerate the resulting speedups. Let's take a look at this tool.

Thank You

Recent work in the domain:
S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An Energy-Efficient Malleable Hardware Accelerator for Data-Intensive Applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 6, June 2015.
R. Polig, K. Atasu, L. Chiticariu, C. Hagleitner, H. P. Hofstee, F. R. Reiss, H. Zhu, and E. Sitaridi, "Giving Text Analytics a Boost," IEEE Micro, vol. 34, no. 4, July-Aug.
S. Nilakantan, S. Battle, and M. Hempstead, "Metrics for Early-Stage Modeling of Many-Accelerator Architectures," IEEE Computer Architecture Letters, vol. 12, no. 1, January-June 2013.
Total citations to date: 45.

Backup Slides – Varying f for FFT workload