© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012 ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.


ECE408 / CS483 Applied Parallel Programming
Lecture 20: GPU as Part of the PC Architecture
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign

Objective
To understand the major factors that dictate performance when using the GPU as a compute co-processor for the CPU:
– The speeds and feeds of the traditional CPU world
– The speeds and feeds when employing a GPU
– To form a solid knowledge base for performance programming on modern GPUs

Review: Typical Structure of a CUDA Program
Global variable declarations
Function prototypes
– __global__ void kernelOne(…)
main()
– allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
– transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
– execution configuration setup
– kernel call – kernelOne<<<execution configuration>>>(args…);
– transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
– (repeat the transfer–launch–transfer sequence as needed)
– optional: compare against golden (host-computed) solution
Kernel – void kernelOne(type args, …)
– variable declarations – __local__, __shared__
– automatic variables transparently assigned to registers or local memory
– __syncthreads()…
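As a concrete illustration of that outline, here is a minimal sketch; the kernel body, sizes, and names (kernelOne, d_data, h_data) are placeholders, not code from the lecture.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernelOne(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n) d_data[i] *= 2.0f;                   /* placeholder computation */
}

int main(void) {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL;
    cudaMalloc(&d_data, bytes);                                 /* allocate device memory */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* host -> device */

    kernelOne<<<(n + 255) / 256, 256>>>(d_data, n);             /* execution configuration + launch */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* device -> host */
    cudaFree(d_data);

    printf("h_data[1] = %f (expect 2.0)\n", h_data[1]);         /* compare against host result */
    return 0;
}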

Bandwidth – the Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance.
– Especially true for massively parallel systems processing massive amounts of data
– Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
– Ultimately, performance falls back to what the “speeds and feeds” dictate

Classic PC Architecture
Northbridge connects the three components that must communicate at high speed:
– CPU, DRAM, video
– Video also needs first-class access to DRAM
– Earlier NVIDIA cards connected through AGP, with transfers up to 2 GB/s
Southbridge serves as a concentrator for slower I/O devices.
[Figure: CPU and core-logic chipset diagram]

(Original) PCI Bus Specification
Connected to the Southbridge
– Originally 33 MHz, 32 bits wide, 132 MB/s peak transfer rate
– More recently 66 MHz, 64 bits, 528 MB/s peak
– Upstream bandwidth remained slow for devices (~256 MB/s peak)
– Shared bus with arbitration
The winner of arbitration becomes bus master and can connect to the CPU or DRAM through the Southbridge and Northbridge.
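The peak figures above follow directly from clock rate times bus width; a quick sanity check in plain C (not lecture code):

#include <stdio.h>

/* Peak parallel-bus bandwidth = transfers/s x bytes per transfer.
   33e6 x 4 bytes = 132 MB/s; 66e6 x 8 bytes = 528 MB/s. */
int main(void) {
    printf("PCI 33 MHz x 32-bit: %.0f MB/s peak\n", 33e6 * 4 / 1e6);
    printf("PCI 66 MHz x 64-bit: %.0f MB/s peak\n", 66e6 * 8 / 1e6);
    return 0;
}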

PCI as Memory-Mapped I/O
PCI device registers are mapped into the CPU’s physical address space
– Accessed through loads/stores (kernel mode)
Addresses are assigned to the PCI devices at boot time
– All devices listen for their addresses
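To make “accessed through loads/stores” concrete, here is a hedged C sketch: base is assumed to already point at a mapped PCI BAR region (obtaining that mapping is OS-specific), and the register offsets are hypothetical.

#include <stdint.h>

/* Device registers behave like ordinary memory once the BAR region is
   mapped; 'volatile' stops the compiler from caching or reordering the
   accesses, so each load/store really reaches the device. */
uint32_t read_device_status(volatile uint32_t *base) {
    return base[0];      /* a load reads a device register */
}

void write_device_control(volatile uint32_t *base, uint32_t value) {
    base[1] = value;     /* a store writes a device register */
}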

PCI Express (PCIe)
Switched, point-to-point connection
– Each card has a dedicated “link” to the central switch; no bus arbitration
– Packet-switched messages form virtual channels
– Prioritized packets for QoS, e.g., real-time video streaming

PCIe 2 Links and Lanes
Each link consists of one or more lanes
– Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/s in one direction)
Upstream and downstream are now simultaneous and symmetric
– Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc.
– Each data byte is 8b/10b-encoded into 10 bits with an equal number of 1s and 0s; net data rate 2 Gb/s per lane each way
– Thus the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), 4 GB/s (x16), each way
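A small sketch of the arithmetic behind that last bullet (plain C, a check rather than lecture code): raw lane rate times 8b/10b efficiency, scaled by lane count.

#include <stdio.h>

int main(void) {
    const double raw_gbps = 2.5;         /* raw signalling rate per lane, one direction */
    const double coded = 8.0 / 10.0;     /* 8b/10b: 8 data bits per 10 line bits */
    const int widths[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; ++i) {
        /* Gb/s -> MB/s: multiply by 1000, divide by 8 bits per byte */
        double mb_s = raw_gbps * coded * widths[i] * 1000.0 / 8.0;
        printf("x%-2d: %6.0f MB/s each way\n", widths[i], mb_s);
    }
    return 0;
}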

8b/10b Encoding
Goal is to maintain DC balance while providing sufficient state transitions for clock recovery
– The difference between the numbers of 1s and 0s in a 20-bit stream should be ≤ 2
– There should be no more than 5 consecutive 1s or 0s in any stream
Find 256 good patterns among the 1024 total 10-bit patterns to encode each 8-bit data value
– A 20% overhead
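A brute-force sketch of those constraints in C: it enumerates all 1024 10-bit patterns and counts the ones with disparity ≤ 2 and no run longer than 5. Real 8b/10b builds its table from 5b/6b and 3b/4b sub-blocks with running disparity, so this illustrates the constraints rather than reproducing the actual code table.

#include <stdio.h>

int main(void) {
    int good = 0;
    for (int p = 0; p < 1024; ++p) {              /* every 10-bit pattern */
        int ones = 0, run = 1, max_run = 1;
        for (int b = 0; b < 10; ++b) ones += (p >> b) & 1;
        for (int b = 1; b < 10; ++b) {            /* longest run of equal bits */
            if (((p >> b) & 1) == ((p >> (b - 1)) & 1)) {
                if (++run > max_run) max_run = run;
            } else {
                run = 1;
            }
        }
        int disparity = 2 * ones - 10;            /* #1s minus #0s */
        if (disparity >= -2 && disparity <= 2 && max_run <= 5) ++good;
    }
    printf("%d of 1024 patterns satisfy both constraints\n", good);
    return 0;
}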

PCIe PC Architecture
PCIe forms the interconnect backbone
– Northbridge and Southbridge are both PCIe switches
– Some Southbridge designs have a built-in PCI-to-PCIe bridge to allow old PCI cards
– Some PCIe I/O cards are PCI cards with a PCI-to-PCIe bridge
Source: Jon Stokes, “PCI Express: An Overview” – s/paedia/hardware/pcie.ars

GeForce 7800 GTX Board Details
– 256 MB / 256-bit DDR3, 600 MHz (8 pieces of 8M×32)
– 16x PCI-Express
– SLI connector
– DVI × 2
– sVideo TV out
– Single-slot cooling

HyperTransport™ Feeds and Speeds
Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe
HyperTransport™ 1.0 Specification
– 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)
HyperTransport™ 2.0 Specification
– Added PCIe mapping
– 1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)
HyperTransport™ 3.0 Specification
– 2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
– Added AC coupling to extend HyperTransport™ to long-distance, system-to-system interconnect
Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”
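The aggregate figures are consistent with double-data-rate signalling on 32-bit links: each way = clock × 2 (DDR) × 4 bytes, aggregate = 2 × each way. A quick check in C (the 32-bit width is an inference from the numbers, not stated on the slide):

#include <stdio.h>

int main(void) {
    const double clocks_ghz[] = {0.8, 1.4, 2.6};   /* HT 1.0, 2.0, 3.0 */
    for (int i = 0; i < 3; ++i) {
        double each_way = clocks_ghz[i] * 2 /* DDR */ * 4 /* bytes wide */;
        printf("HT %d.0: %4.1f GB/s each way, %4.1f GB/s aggregate\n",
               i + 1, each_way, each_way * 2);
    }
    return 0;
}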

PCIe 3
A total of 8 gigatransfers per second (GT/s) in each direction
No more 8b/10b encoding; instead, a polynomial transformation (scrambling) at the transmitter and its inverse at the receiver achieve the same effect
So the effective bandwidth is double that of PCIe 2
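Checking that “double” claim (PCIe 3 pairs the scrambling with 128b/130b framing, a detail not on the slide): per-lane data rate ≈ transfer rate × coding efficiency.

#include <stdio.h>

int main(void) {
    /* GT/s x coding efficiency -> Gb/s of data -> MB/s per lane, each way */
    printf("PCIe 1: %6.1f MB/s per lane\n", 2.5 * (8.0 / 10.0)    * 1000 / 8);
    printf("PCIe 2: %6.1f MB/s per lane\n", 5.0 * (8.0 / 10.0)    * 1000 / 8);
    printf("PCIe 3: %6.1f MB/s per lane\n", 8.0 * (128.0 / 130.0) * 1000 / 8);
    return 0;
}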

PCIe Data Transfer Using DMA
DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus
– DMA uses physical addresses for source and destination
– Transfers the number of bytes requested by the OS
– Needs pinned memory
[Figure: CPU, main memory (DRAM), DMA engine, and GPU card (or other I/O card) with global memory]

Pinned Memory
DMA uses physical addresses
– The OS could accidentally page out the data being read or written by a DMA and page another virtual page into the same physical location
– Pinned memory cannot be paged out
– If a source or destination of a cudaMemcpy() in host memory is not pinned, it first needs to be copied to pinned memory: extra overhead
– cudaMemcpy() is much faster when the host source or destination is pinned

Allocate/Free Pinned Memory (a.k.a. Page-Locked Memory)
cudaHostAlloc() – three parameters:
– Address of pointer to the allocated memory
– Size of the allocated memory in bytes
– Option: use cudaHostAllocDefault for now
cudaFreeHost() – one parameter:
– Pointer to the memory to be freed
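Putting the two calls together, a minimal sketch that allocates pinned memory, uses it as a cudaMemcpy() source, and frees it (buffer names and the 1 MB size are illustrative):

#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;   /* 1 MB, illustrative */
    float *h_pinned = NULL, *d_buf = NULL;

    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);  /* page-locked host memory */
    cudaMalloc((void**)&d_buf, bytes);

    /* DMA can read the pinned pages directly -- no staging copy needed */
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);         /* must free with cudaFreeHost(), not free() */
    return 0;
}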

Using Pinned Memory
Use the allocated memory and its pointer the same way as those returned by malloc()
– The only difference is that the allocated memory cannot be paged out by the OS
– cudaMemcpy() should be about 2× faster with pinned memory
– Pinned memory is a limited resource; over-subscribing it can have serious consequences

Important Trends
Knowing yesterday, today, and tomorrow:
– The PC world is becoming flatter
– CPU and GPU are being fused together
– Outsourcing of computation is becoming easier…

ANY MORE QUESTIONS?