Multi-Core Development Kyle Anderson

Overview: History, Pollack’s Law, Moore’s Law, CPU, GPU, OpenCL, CUDA, Parallelism

History The first 4-bit microprocessor (the Intel 4004, 1971) executed tens of thousands of instructions per second with just 2,300 transistors. 8-bit microprocessors followed with roughly 4,500 transistors; one such chip powered the Altair 8800. The first 32-bit microprocessors arrived later with far higher transistor counts.

History The first Pentium processor (1993) ran at clock speeds in the tens of MHz. The Pentium 4 (2000) broke into the GHz range with 42,000,000 transistors, and later models approached 4 GHz. The Core 2 Duo (2006) pushed transistor counts into the hundreds of millions.

History

Pollack’s Law Processor performance grows roughly with the square root of the die area (complexity) devoted to a core.
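As a rough rule of thumb (this worked form is an addition, not on the original slide):

\[ \text{Performance} \propto \sqrt{\text{Area}} \qquad\Rightarrow\qquad \frac{P(2A)}{P(A)} \approx \sqrt{2} \approx 1.4 \]

Doubling the area spent on one core therefore buys only about 40% more single-thread performance, while spending that same area on a second core can nearly double throughput on parallel work, which is the economic argument for multi-core designs.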

Pollack’s Law

Moore’s Law “The number of transistors incorporated in a chip will approximately double every 24 months.” – Gordon Moore, Intel co-founder. Made possible by ever smaller transistors.
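Written as a formula (a sketch added here, with N_0 the transistor count in a baseline year and t the number of years since then):

\[ N(t) \approx N_0 \cdot 2^{t/2} \]

For example, over a 10-year span the expected growth factor is 2^5 = 32.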

Moore’s Law

CPU Designed for sequential execution. A small number of fully featured cores, currently around 16 at most. Hyper-threading exposes extra hardware threads per core. Low latency per operation.

GPU Higher latency per operation, but thousands of simple cores. Each core handles simple calculations, so performance comes from massive parallelism. Widely used for research and other general-purpose computation.

OpenCL Targets a multitude of devices (CPUs, GPUs, and other accelerators). Kernels are compiled at run time, which ensures they use the most up-to-date features of the device. Work-items execute in lock-step (SIMD fashion) on the device.

OpenCL Data Structures: Host, Device, Compute Units, Work-Group, Work-Item, Command Queue, Kernel, Context
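A minimal host-side sketch of how these objects fit together (an illustration, not from the slides; the vadd kernel, the array size of 1024, and the work-group size of 64 are arbitrary choices, and error checking is omitted):

#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    /* Host: this C program.  Device: the GPU selected below. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Context ties devices, memory objects and programs together. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* Command queue: the host submits work to the device through it. */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Kernel: compiled from OpenCL C source at run time. */
    const char *src =
        "__kernel void vadd(__global const float *a, __global const float *b,"
        "                   __global float *c) {"
        "    int i = get_global_id(0);  /* this work-item's index */"
        "    c[i] = a[i] + b[i];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vadd", NULL);

    /* Buffers live in the device's global memory. */
    float a[1024], b[1024], c[1024];
    for (int i = 0; i < 1024; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);

    /* 1024 work-items split into work-groups of 64; work-groups are
       scheduled onto the device's compute units. */
    size_t global = 1024, local = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);  /* expect 3.0 */
    return 0;
}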

OpenCL Types of Memory: Global, Constant, Local, Private
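A short OpenCL C kernel sketch (an illustration, not from the slides) showing where the four address-space qualifiers appear; the per-work-group reduction is just an example computation:

/* Each work-group sums its slice of `in` into one element of `out`. */
__kernel void block_sum(__global const float *in,   /* global: visible to every work-item  */
                        __constant float *scale,    /* constant: read-only, often cached   */
                        __local float *tmp,         /* local: shared within one work-group */
                        __global float *out)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    float x = in[gid] * scale[0];   /* x, lid and gid live in private memory,
                                       separate for every work-item */
    tmp[lid] = x;
    barrier(CLK_LOCAL_MEM_FENCE);   /* wait until the whole work-group has written tmp */

    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); i++)
            sum += tmp[i];
        out[get_group_id(0)] = sum;
    }
}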

OpenCL

OpenCL Example

CUDA NVIDIA’s proprietary API for its GPUs. Stands for “Compute Unified Device Architecture.” Kernels are compiled directly to native code for the GPU. Used by Adobe, Autodesk, National Instruments, Microsoft and Wolfram Mathematica. Often faster than OpenCL because code is compiled directly for the hardware and the toolchain focuses on a single vendor’s architecture.

CUDA Indexing
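The indexing figure is not reproduced here; the scheme it illustrates is the standard CUDA pattern sketched below (the kernel name and parameters are illustrative):

// Each thread derives one global index from its block and thread coordinates.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // guard: the grid may overshoot n
        data[i] = data[i] * factor;
}

// Launch: enough blocks of 256 threads to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);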

CUDA Example
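The example slide is an image; below is a minimal sketch consistent with the function-call slide that follows (assuming the classic one-thread-per-block vector add, with N chosen arbitrarily):

#include <cstdio>
#define N 512

// With a launch of <<<N, 1>>> each block holds a single thread,
// so blockIdx.x alone identifies the element this thread adds.
__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * 2; }

    // Device allocations that the cudaMemcpy calls below fill in.
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("c[2] = %d\n", c[2]);  // expect 6
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}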

CUDA Function Call
cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
add<<<N,1>>>( dev_a, dev_b, dev_c );

Types of Parallelism SISD, SIMD, MISD, MIMD; instruction parallelism, task parallelism, data parallelism

SISD Stands for “Single Instruction, Single Data.” One core executes a single instruction stream on a single data stream; it does not use multiple cores.

SIMD Stands for “Single Instruction, Multiple Data Streams.” One instruction stream is applied to multiple data streams concurrently.
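A minimal CPU-side illustration of SIMD using the x86 SSE intrinsics (an example added here, not taken from the slides):

#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* pack 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction adds all 4 lanes at once */
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}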

MISD Stands for “Multiple Instruction, Single Data.” Several instruction streams operate on the same data; rarely used in practice and risky because the instructions can conflict over that shared data.

MIMD Stands for “Multiple Instruction, Multiple Data.” Each processor runs its own instruction stream (sequentially within that core) on its own data, so many independent streams execute at once; this is the model of multi-core CPUs and clusters.

Instruction Parallelism Runs mutually independent instructions at the same time; MIMD and MISD machines often exploit it. The “instructions” here are individual operations. It is not expressed by the programmer: the hardware (out-of-order execution) and the compiler (instruction scheduling) find it automatically.
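A tiny C illustration (added here, not from the slides): the first function's multiplies are independent, so hardware and the compiler can overlap them, while the second forms a dependent chain with little instruction-level parallelism.

/* Independent operations: these can issue in parallel in the pipeline. */
int ilp_friendly(int a, int b, int c, int d)
{
    int x = a * b;  /* no data dependences  */
    int y = c * d;  /* between these three  */
    int z = a * d;  /* multiplies           */
    return x + y + z;
}

/* Dependent chain: each multiply needs the previous result. */
int ilp_hostile(int a, int b, int c, int d)
{
    int x = a * b;
    x = x * c;
    x = x * d;
    return x;
}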

Task Parallelism Divides the program’s main tasks or control flows among workers and runs them as multiple threads concurrently.
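A minimal sketch of task parallelism with POSIX threads (an illustration added here): two different tasks, each with its own control flow, run concurrently.

#include <pthread.h>
#include <stdio.h>

/* Task 1: sum the integers 1..1,000,000. */
void *sum_task(void *arg)
{
    long sum = 0;
    for (long i = 1; i <= 1000000; i++) sum += i;
    printf("sum task done: %ld\n", sum);
    return NULL;
}

/* Task 2: count multiples of 7 in the same range. */
void *count_task(void *arg)
{
    long count = 0;
    for (long i = 1; i <= 1000000; i++) if (i % 7 == 0) count++;
    printf("count task done: %ld\n", count);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, NULL);    /* both tasks start here...    */
    pthread_create(&t2, NULL, count_task, NULL);  /* ...and run at the same time */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}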

Data Parallelism Used by SIMD and MIMD. The same list of instructions works concurrently on several data sets, for example on different slices of one array.
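A minimal data-parallel sketch using an OpenMP loop (an illustration added here): the same loop body is applied to different elements of the array by different threads.

#include <stdio.h>

#define COUNT 100000

int main(void)
{
    static float data[COUNT];
    for (int i = 0; i < COUNT; i++) data[i] = (float)i;

    /* The same instructions (the loop body) run concurrently on
       different chunks of the array, one chunk per thread. */
    #pragma omp parallel for
    for (int i = 0; i < COUNT; i++)
        data[i] = data[i] * 2.0f + 1.0f;

    printf("data[10] = %g\n", data[10]);  /* expect 21 */
    return 0;
}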