“Processors” issues for LQCD
André Seznec, IRISA/INRIA, January 2009

Slide 2: Technology trend for the general-purpose HPC « processor »
- Up to the early 90’s: multi-chip vector processors
  - Major cost: the memory system
  - Strided vectors
  - Scatter-gather for sparse processing
- From the mid 90’s: use of the “killer micros”, with shared or distributed memory
  - More cost-effective peak performance
  - Effective when the memory hierarchy is leveraged
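As an aside for readers less familiar with vector machines, the sketch below (invented function and array names, illustrative only) shows the two access patterns the slide names: a constant-stride sweep and an indexed scatter-gather sweep. On a vector supercomputer each loop would map onto a single strided-load or gather instruction; on a cache-based micro it is just an ordinary loop.

```
// Illustrative only: the two memory-access patterns the slide refers to.
void strided_sum(const double *a, int n, int stride, double *out) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i * stride];          // strided vector access (constant stride)
    *out = s;
}

void gather_sum(const double *a, const int *index, int n, double *out) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[index[i]];            // scatter-gather access (sparse, indexed)
    *out = s;
}
```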

Slide 3: HPC has very limited impact on processor development
- For 15 years, HPC has not driven « processor » development:
  - Niche market
  - Use (and maybe adapt) off-the-shelf components
- High-end microprocessors (Alpha, Power, HP-PA, Itanium), and more and more x86
  - Need to exploit the memory hierarchy
- Now:
  - GPUs (massively threaded vector processors): a specialized form of vector processing
  - Cell: hand-managed local memory

Slide 4: For QCD
- Vector supercomputers were not cost-effective:
  - Too expensive
  - Limited performance: 1 flop per word
- A “build your own machine” tradition in the QCD community:
  - ApeMille, ApeNext
  - Exploit the particularities of the algorithm: complex arithmetic, small matrices
  - VLIW architecture, no cache
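To make “complex arithmetic, small matrices” concrete: the dominant inner kernel of lattice QCD multiplies 3x3 complex (SU(3)) matrices by 3-component complex vectors. The sketch below is a minimal illustration of that arithmetic pattern, not code from ApeMille/ApeNext.

```
#include <complex>

using cplx = std::complex<double>;

// Minimal sketch of the core LQCD arithmetic: 3x3 complex matrix times
// 3-component complex vector.
void su3_mat_vec(const cplx U[3][3], const cplx in[3], cplx out[3]) {
    for (int i = 0; i < 3; ++i) {
        cplx acc(0.0, 0.0);
        for (int j = 0; j < 3; ++j)
            acc += U[i][j] * in[j];   // complex multiply-accumulate
        out[i] = acc;
    }
}
// Each call performs 66 real floating-point operations (36 multiplies,
// 30 additions) on 24 input words: only a few flops per word moved,
// which is why the talk keeps coming back to memory bandwidth.
```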

Slide 5: No one can afford to design a high-performance chip
- Use, or at best adapt, off-the-shelf components
- A new biodiversity of high-performance floating-point engines is available right now

Slide 6: Intel Terascale prototype: 80 processors, 1.81 teraflops, 265 watts (just a prototype!)

Slide 7: The many-core era
- 4-8 general-purpose cores now, many more in the coming years:
  - Technologically feasible
  - Economic viability?
    - Parallel general-purpose applications?
    - Will the end user accept to pay for 1000 cores when applications exhibit only a 10x speed-up?!
    - Main memory bandwidth will not scale!
- Which architecture for the many-cores?
  - Until 2009, homogeneous multicores for general purpose, heterogeneous for embedded / special purpose (e.g. Cell, GPU)

Slide 8: Direction of (single-chip) architecture: betting on the success of parallelism
- Rule of thumb: 1 complex 4-way superscalar core ≈ 16 simple RISC cores (in silicon area)
- If (future) applications are intrinsically parallel: as many simple cores as possible
  - SSC: Sea of Simple Cores
- If (future) applications are only moderately parallel: a few complex, state-of-the-art superscalar cores
  - FCC: Few Complex Cores

Slide 9: SSC: Sea of Simple Cores, e.g. Intel Larrabee

Slide 10: FCC: Few Complex Cores, e.g. Intel Nehalem
[Block diagram: several 4-way out-of-order superscalar cores sharing an L3 cache]

Slide 11: Homogeneous vs heterogeneous
- Homogeneous: just replicate the same core
  - Extension of “conventional” multiprocessors
- Heterogeneous:
  - À la Cell? (master processor + slave processors) x N
  - À la SoC? Specialized (poorly programmable) coprocessors: unlikely for HPC
  - Same ISA but different microarchitectures? Unlikely in the short term

Slide 12: Hardware accelerators?
- SIMD extensions:
  - Seem to be accepted; shift the burden onto application developers and compilers
  - 512-bit SIMD instructions on Larrabee
    - The general trend, and not such a mess in hardware :-)
- Reconfigurable datapaths:
  - Popular when you have a well-defined, intrinsically parallel application; programmability?
- Real vector extensions (strides and scatter-gather):
  - Would be a good move for HPC
  - Are there mainstream applications that would benefit?
  - Not very useful for QCD
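A hedged illustration of what the slide’s SIMD extensions buy: a unit-stride loop like the one below maps directly onto wide SIMD lanes (a 512-bit vector register holds 16 single-precision values, so 16 iterations can retire per instruction). Function and array names are invented; the strided and indexed patterns sketched after slide 2 are what the “real vector extensions” would be needed for.

```
// Contiguous, unit-stride loop: the SIMD-friendly case. A 512-bit SIMD
// unit (e.g. Larrabee's) can process 16 of these floats per instruction,
// provided the compiler or programmer vectorizes the loop.
void saxpy_contiguous(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
// The same operation through an index table (y[idx[i]] = ...) would need
// hardware gather/scatter, i.e. the "real vector extensions" of the slide.
```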

Slide 13: Reconsider the “on-chip memory / processors” tradeoff
- The uniprocessor credo was: “use the remaining silicon for caches”
- The new question: “an extra processor, or more cache?” (and, recently, local memory, e.g. on Cell)
  - Extra processor = more processing power
    - Increased memory bandwidth demand
    - Increased power consumption, more temperature hot spots
  - More cache or local memory = decreased (main) memory demand

Slide 14: Memory hierarchy organization?

Slide 15: Flat organization?
[Diagram: a flat grid of identical cores (μP), each with a private cache]
- Local or distributed memory or cache?
- Manage cache locality through software or hardware?

Slide 16: Hierarchical organization?
[Diagram: cores with private caches grouped into clusters sharing an L2 cache, with the clusters themselves sharing an L3 cache]

Slide 17: Hierarchical organization?
- Arbitration at all levels
- Coherency at all levels
- Interleaving at all levels
- Bandwidth dimensioning

Slide 18: Hardware multithreading, of course!
- Execute several threads on a single core
  - Pentium 4, Nehalem, Sun Niagara
- Just an extra level of thread parallelism!
  - If you can run 100 processes, you can likely afford 1,000!
- A major means of tolerating memory latency:
  - GPUs feature hundreds of threads
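A minimal CUDA sketch of latency hiding by thread count (names and sizes are illustrative, not tuned code): the kernel is deliberately trivial; the point is the launch, which creates far more threads than there are execution units, so that some threads can compute while others wait on memory.

```
#include <cuda_runtime.h>

// One thread per element: one load and one store each.
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = a * v[i];
}

void launch_scale(float *d_v, float a, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // For n in the millions this creates tens of thousands of threads;
    // the oversubscription is what tolerates DRAM latency.
    scale<<<blocks, threadsPerBlock>>>(d_v, a, n);
}
```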

Slide 19: For HPC and QCD?
- Unprecedented potential performance off-the-shelf; the single-chip teraflops is nearly there:
  - 200 Gflops Cell: 25 GB/s to memory
  - 500 Gflops GPU: 80 GB/s to memory
  - 50 Gflops Nehalem: 25 GB/s to memory
  - Larrabee: 1 Tflops? 50 GB/s?
- Looking further ahead:
  - 1000 cores, 5 GHz, 32 flops/cycle (e.g. 512-bit SSE) = 160 teraflops: integration promises it
  - 4096-bit memory channel at 2 GHz: 1 terabyte/s to memory, but quite optimistic
  - Will they deliver?
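The projection above can be checked with simple arithmetic; the small program below is a back-of-the-envelope calculation using only the slide’s own numbers, not a benchmark or a prediction of mine.

```
#include <cstdio>

int main() {
    double cores = 1000, clock_hz = 5e9, flops_per_cycle = 32;
    double peak_flops = cores * clock_hz * flops_per_cycle;   // 1.6e14 = 160 Tflops

    double channel_bits = 4096, mem_clock_hz = 2e9;
    double bytes_per_s = (channel_bits / 8.0) * mem_clock_hz; // ~1.02e12 = ~1 TB/s

    // Ratio: roughly 156 flops per byte of memory traffic, i.e. far less
    // than one word per flop -- the bandwidth concern of slide 20.
    std::printf("peak = %.0f Tflops, bw = %.0f GB/s, flops/byte = %.0f\n",
                peak_flops / 1e12, bytes_per_s / 1e9, peak_flops / bytes_per_s);
    return 0;
}
```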

Slide 20: HPC and QCD: the « processor » architecture issue (for the user)
- It is the (main) memory, stupid!
  - The « old » vector supercomputers: around 1 word per flop, at per-word granularity
  - The superscalar microprocessors: around 1 word per 10 flops, per 64-byte block
  - GPU, Cell: around 1 word per 25 flops, in large contiguous blocks
  - Tomorrow’s many-cores: even fewer words per flop, at ever larger granularity
- For QCD: need to find new locality
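The “words per flop” figures can be roughly recovered from the peak numbers of slide 19. The snippet below does so assuming 4-byte single-precision words, which is my assumption, not the slide’s; double-precision ratios would be twice as harsh.

```
#include <cstdio>

// Flops of peak compute per 4-byte word of peak memory traffic.
static double flops_per_word(double peak_gflops, double bw_gb_per_s) {
    double words_per_s = bw_gb_per_s * 1e9 / 4.0;   // assumes 4-byte words
    return peak_gflops * 1e9 / words_per_s;
}

int main() {
    std::printf("Nehalem : %.0f flops/word\n", flops_per_word(50.0, 25.0));   // ~8
    std::printf("Cell    : %.0f flops/word\n", flops_per_word(200.0, 25.0));  // ~32
    std::printf("GPU     : %.0f flops/word\n", flops_per_word(500.0, 80.0));  // ~25
    return 0;
}
```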

Slide 21: HPC, QCD and GPUs
- In 2009, GPUs are very cost-effective floating-point engines:
  - High peak performance
  - High memory bandwidth
  - SIMD-like control
  - Double-precision performance? Locality exploitation?
- Cost-effective hardware solutions (in 2009) for massive vector applications:
  - Contiguous vectors of data
  - Limited control
  - Ad hoc programming (CUDA)?
  - Coprocessor model?
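For completeness, here is a minimal sketch of the “coprocessor model” the slide asks about, using only core CUDA runtime calls (cudaMalloc, cudaMemcpy, a kernel launch): the host owns the data, ships contiguous vectors to the GPU, runs a kernel, and copies the result back. Error checking is omitted and all names are illustrative.

```
#include <cuda_runtime.h>

__global__ void vec_add(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

void offload_add(const float *hx, const float *hy, float *hz, int n) {
    size_t bytes = n * sizeof(float);
    float *dx, *dy, *dz;
    cudaMalloc((void **)&dx, bytes);                       // allocate on the GPU
    cudaMalloc((void **)&dy, bytes);
    cudaMalloc((void **)&dz, bytes);

    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);     // ship data over
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    int tpb = 256;
    vec_add<<<(n + tpb - 1) / tpb, tpb>>>(dx, dy, dz, n);  // run on the coprocessor

    cudaMemcpy(hz, dz, bytes, cudaMemcpyDeviceToHost);     // bring the result back
    cudaFree(dx); cudaFree(dy); cudaFree(dz);
}
```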

Slide 22: HPC and contiguous vector parallelism
- Can be exploited by any architecture:
  - GPUs: cost-effective and tolerant of memory latency; « vector » instructions; (-) application portability
  - Cell-like: necessitates explicit data moves; (-) application portability
  - Many-cores (Larrabee) with wide SIMD instructions: software prefetch + « vector » instructions; (+) application portability; (-) cache sharing

Slide 23: Conclusion
- HPC and QCD will have to use off-the-shelf « processors »
- Massive thread parallelism might be available on-chip before 2015:
  - 1000 cores? (if killer applications appear!)
- Contiguous vector parallelism allows huge peak performance in the mid-term:
  - GPUs, SIMD instructions
  - Real vectors (strides, scatter-gather)? Unlikely to appear