Co-Processor Architectures: Fermi vs. Knights Ferry
Roger Goff, Dell Senior Global CERN/LHC Technologist, +1.970.672.1252

Presentation transcript:

Co-Processor Architectures: Fermi vs. Knights Ferry
Roger Goff, Dell Senior Global CERN/LHC Technologist

2 Dell LHC Team
nVidia Fermi Architecture
  Up to 512 cores
  16 streaming multiprocessors, each with [...] GHz
  Parallel DataCache
    64 KB Shmem/L1 cache
    768 KB unified L2 cache
  Six 64-bit memory partitions (384-bit memory interface)
  Up to 6 GB GDDR5 DRAM
  Up to 16 concurrent kernels
  IEEE 754-2008 floating point math
  ECC memory

3 Fermi Streaming Multiprocessor Architecture
  32 cores
    32-bit integer ALU with 64-bit extensions
    Full IEEE 754-2008 32-bit and 64-bit precision
  64 KB shared memory/L1 cache
    Configurable as 16 KB Shmem/48 KB L1 cache or 48 KB Shmem/16 KB L1 cache
  16 load/store units
  Dual warp scheduler (dual instruction issue)
  Four Special Function Units (SFUs) for sine, cosine, reciprocal, and square root operations
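
The figures on the two Fermi slides are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is illustrative only: the variable names are made up here, and the numbers are simply those quoted on the slides.

```python
# Figures quoted on the Fermi slides (variable names are illustrative).
SM_COUNT = 16              # streaming multiprocessors per chip
CORES_PER_SM = 32          # cores per streaming multiprocessor
MEM_PARTITIONS = 6         # memory partitions
PARTITION_WIDTH_BITS = 64  # width of each memory partition
SM_ONCHIP_KB = 64          # configurable shared memory + L1 cache per SM

# Chip-wide totals derived from the per-unit figures.
total_cores = SM_COUNT * CORES_PER_SM
memory_bus_bits = MEM_PARTITIONS * PARTITION_WIDTH_BITS

# The two (shared memory KB, L1 cache KB) splits listed on the slide;
# each one should account for the full 64 KB.
splits = [(16, 48), (48, 16)]
splits_valid = all(shmem + l1 == SM_ONCHIP_KB for shmem, l1 in splits)

print(total_cores)      # 512
print(memory_bus_bits)  # 384
print(splits_valid)     # True
```

The totals match the "up to 512 cores" and "384-bit memory interface" entries on the architecture slide.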

4 Comparison to Previous nVidia GPGPUs

5 Intel MIC Architecture
  Pronounced "Mike"
  Many cores with many threads per core
  Standard IA programming and memory model
Knights Ferry
  Software development platform
  [...] GHz
  4 threads/core, 128 total parallel threads
  32 KB i-cache, 32 KB d-cache
  256 KB coherent L2 cache (8 MB total)
  512-bit vector unit
    16 single-precision FLOPs/clock
    8 double-precision FLOPs/clock
  1-2 GB GDDR5
  Connected to host memory through PCI DMA operations with virtual addressing
  Intel HPC developer tools
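
The Knights Ferry numbers imply a few values the slide does not state outright, such as the core count. The following is a rough sketch of that arithmetic. It assumes that every core has four hardware threads and its own 256 KB slice of L2, and that each vector lane delivers one FLOP per clock; the names are illustrative.

```python
# Figures quoted on the Knights Ferry slide (variable names are illustrative).
THREADS_TOTAL = 128      # total parallel threads
THREADS_PER_CORE = 4     # hardware threads per core
L2_PER_CORE_KB = 256     # coherent L2 cache per core (assumed per-core)
VECTOR_WIDTH_BITS = 512  # vector unit width

# Core count implied by the thread figures.
cores = THREADS_TOTAL // THREADS_PER_CORE

# Aggregate L2 size, to compare against the "8 MB total" on the slide.
l2_total_mb = cores * L2_PER_CORE_KB / 1024

# Vector lanes per core, assuming one FLOP per lane per clock.
sp_lanes = VECTOR_WIDTH_BITS // 32  # 32-bit single-precision lanes
dp_lanes = VECTOR_WIDTH_BITS // 64  # 64-bit double-precision lanes

print(cores)        # 32
print(l2_total_mb)  # 8.0
print(sp_lanes)     # 16
print(dp_lanes)     # 8
```

Under these assumptions the lane counts line up with the 16 single-precision and 8 double-precision FLOPs/clock on the slide, and the L2 total agrees with the quoted 8 MB.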

6 MIC Programming Environment
  Inherently supports OpenMP.
  Virtual memory environment extends back to host memory.
  Intel Parallel Studio and Cluster Studio support MIC.
  Optimizing performance will take almost as much effort as for CUDA and OpenCL environments.

7 Knights Corner
  First production MIC co-processor
  Second half of 2012
  Knowns:
    50+ cores
    22 nm manufacturing process
  Unknowns:
    Core frequency
    Size of GDDR5 memory on board
    ECC support

8 Co-processor Comparison

9 Co-processor Adoption
  Commercial adoption:
    Oil & gas/seismic data processing
    Financial services
    Ray tracing
    Molecular dynamics
    Commercial applications: MATLAB, ANSYS
  Barriers to adoption:
    Lack of parallel programming skills
    Immature software development environment and standards
      CUDA vs. OpenCL vs. OpenMP
      Waiting for the compiler or libraries to abstract the accelerator
    Uncertainty of benefit vs. effort
      Amdahl's law is still the law!
      Maximum speedup = 1 / ((1 - P) + P / N), for a parallelizable fraction P run on N processors
    Huge investment in current codes
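
To make the "benefit vs. effort" point concrete, here is a minimal sketch of Amdahl's law in its standard textbook form. The function names and the sample inputs are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed-size workload.

    parallel_fraction: share of the runtime that can run in parallel (0..1).
    processors: number of processing elements applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical code that is 95% parallel, on a 512-core co-processor.
    print(amdahl_speedup(0.95, 512))
    print(amdahl_limit(0.95))
```

For a code that is 95% parallel, the speedup on 512 cores stays below the ceiling of roughly 20 set by the remaining 5% of serial work, which is why the serial portion of existing codes matters as much as the core count.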

10 AMD "New Era of Processor Performance"

11 Final Thoughts
  1. Co-processors are here to stay, but their architectures will continue to evolve.
  2. Programming tools will get easier to use and will further integrate co-processing technology.
  3. Further abstraction of the underlying co-processor hardware is necessary to achieve broad adoption.
  4. Processors from Intel and AMD will integrate co-processors before the end of the decade.
  5. Preparing applications for extreme parallelism will enable users to get the most out of future systems.

Thank you!
Roger Goff, Dell Senior Global CERN/LHC Technologist