“Processors” issues for LQCD
January 2009
André Seznec, IRISA/INRIA
Technology trend for the general-purpose HPC « processor »
Up to the early 90's: multi-chip vector processors
- Major cost: the memory system
  – Strided vectors
  – Scatter-gather for sparse processing
From the mid 90's: the « killer micros », with shared or distributed memory
- More cost-effective peak performance
- Effective when the memory hierarchy is leveraged
HPC has very limited impact on processor development
For 15 years, HPC has not driven « processor » development: it is a niche market
- Use (and maybe adapt) off-the-shelf components: high-end microprocessors (Alpha, Power, HP-PA, Itanium), and more and more x86
  – Need to exploit the memory hierarchy
- Now:
  – GPUs (massively threaded vector processors): a specialized form of vector processing
  – Cell: hand-managed local memory
For QCD
Vector supercomputers were not cost-effective:
- Too expensive
- Limited performance: 1 flop per word
A « build your own machine » tradition in the QCD community: ApeMille, ApeNext
- Exploit the particularities of the algorithm: complex arithmetic, small matrices (see the sketch below)
- VLIW architecture, no cache
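A minimal C sketch of that « complex arithmetic, small matrices » pattern (illustrative only, not from the talk; the function name is ours): the inner operation of lattice QCD kernels applies a 3x3 complex SU(3) link matrix to a 3-component complex color vector.

    #include <complex.h>

    /* Illustrative sketch: 3x3 complex (SU(3)) matrix times a 3-component
       complex color vector -- small dense matrices, all-complex arithmetic. */
    static void su3_mul_vec(const float complex u[3][3],
                            const float complex in[3],
                            float complex out[3])
    {
        for (int i = 0; i < 3; i++) {
            float complex acc = 0.0f;
            for (int j = 0; j < 3; j++)
                acc += u[i][j] * in[j];   /* complex multiply-accumulate */
            out[i] = acc;
        }
    }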
No one can afford designing a high-performance chip
- Use, or at best adapt, off-the-shelf components
- A new bio-diversity of high-performance floating-point engines is available right now
Intel Terascale prototype: 80 processors, 1.81 teraflops, 265 watts (just a prototype!)
The many-core era
4-8 general-purpose cores now
- Technologically feasible
- Economic viability?
  – Parallel general-purpose applications?
  – Will the end user accept to pay for 1000 cores when applications exhibit only a 10x speed-up?
  – Main memory bandwidth will not scale!
Which architecture for the many-cores?
- Until 2009, homogeneous multicores for general purpose, heterogeneous for embedded / special purpose (e.g. Cell, GPU)
Direction of (single-chip) architecture: betting on the success of parallelism
- 1 complex 4-way superscalar core ≈ 16 simple RISC cores
- If (future) applications are intrinsically parallel: as many simple cores as possible
- If (future) applications are only moderately parallel: a few complex state-of-the-art superscalar cores
SSC: Sea of Simple Cores
FCC: Few Complex Cores
SSC: Sea of Simple Cores
E.g. Intel Larrabee
FCC: Few Complex Cores
E.g. Intel Nehalem
[Diagram: several 4-way out-of-order superscalar cores sharing an L3 cache]
Homogeneous vs heterogeneous
Homogeneous: just replicate the same core
- Extension of « conventional » multiprocessors
Heterogeneous:
- A la Cell? (master processor + slave processors) x N
- A la SoC? Specialized (poorly programmable) coprocessors: unlikely for HPC
- Same ISA, but different microarchitectures? Unlikely in the short term
Hardware accelerators?
SIMD extensions:
- Seem to be accepted; shift the burden onto application developers and compilers (see the sketch below)
- 512-bit SIMD instructions on Larrabee
  – The general trend, and not such a mess in hardware :-)
Reconfigurable datapaths:
- Popular when you have a well-defined, intrinsically parallel application
- Programmability?
Real vector extensions (strides and scatter-gather):
- Would be a good move for HPC
- Are there mainstream applications that would benefit?
- Not very useful for QCD
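A minimal sketch of what « shifting the burden to the developer » looks like with today's 128-bit SSE3 extensions (illustrative only; the function name and the interleaved re/im layout are our assumptions): two single-precision complex products per pass through the loop body.

    #include <pmmintrin.h>   /* SSE3 intrinsics */

    /* c[i] = a[i] * b[i] for n single-precision complex numbers (n even),
       stored as interleaved (re, im) pairs; two complex products per iteration. */
    static void cmul_sse3(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < 2 * n; i += 4) {
            __m128 x  = _mm_loadu_ps(a + i);              /* ar0 ai0 ar1 ai1 */
            __m128 y  = _mm_loadu_ps(b + i);              /* br0 bi0 br1 bi1 */
            __m128 yr = _mm_moveldup_ps(y);               /* br0 br0 br1 br1 */
            __m128 yi = _mm_movehdup_ps(y);               /* bi0 bi0 bi1 bi1 */
            __m128 t  = _mm_mul_ps(x, yr);                /* ar*br, ai*br    */
            __m128 xs = _mm_shuffle_ps(x, x, 0xB1);       /* ai0 ar0 ai1 ar1 */
            __m128 u  = _mm_mul_ps(xs, yi);               /* ai*bi, ar*bi    */
            _mm_storeu_ps(c + i, _mm_addsub_ps(t, u));    /* re: t-u, im: t+u */
        }
    }

A 512-bit Larrabee-style SIMD unit would handle eight such complex products at a time, but the data layout and shuffling remain the programmer's or the compiler's problem.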
Reconsider the « on-chip memory / processors » tradeoff
The uniprocessor credo was: « use the remaining silicon for caches »
New issue: an extra processor, or more cache (or, recently, local memory, e.g. Cell)?
- Extra processor = more processing power
  – Increased memory bandwidth demand
  – Increased power consumption, more temperature hot spots
- More cache or local memory = decreased (main) memory demand
Memory hierarchy organization?
Flat organization?
[Diagram: a flat array of cores, each with its private cache]
- Local or distributed memory or cache?
- Manage cache locality through software or hardware?
Hierarchical organization?
[Diagram: cores with private caches grouped under shared L2 caches, themselves sharing an L3 cache]
Hierarchical organization?
- Arbitration at all levels
- Coherency at all levels
- Interleaving at all levels
- Bandwidth dimensioning
Hardware multithreading, of course!
Execute several threads on a single core: Pentium 4, Nehalem, Sun Niagara
- Just an extra level of thread parallelism! If you can use 100 processes, you can likely afford 1000 (see the sketch below)
- A major means to tolerate memory latency: GPUs feature hundreds of threads
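A minimal OpenMP sketch (our own illustration, not from the talk) of treating hardware thread contexts as just more processors:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* On an SMT machine (Pentium 4, Nehalem, Niagara), omp_get_num_procs()
           counts hardware thread contexts, not physical cores.  One software
           thread per context lets the hardware overlap one thread's cache
           misses with another thread's arithmetic. */
        omp_set_num_threads(omp_get_num_procs());

        #pragma omp parallel
        {
            #pragma omp single
            printf("%d threads on %d hardware contexts\n",
                   omp_get_num_threads(), omp_get_num_procs());
            /* ... memory-latency-bound work goes here ... */
        }
        return 0;
    }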
For HPC and QCD?
Unprecedented potential performance off-the-shelf: single-chip teraflops is nearly there
- 200 Gflops Cell: 25 GB/s to memory
- 500 Gflops GPU: 80 GB/s to memory
- 50 Gflops Nehalem: 25 GB/s to memory
- Larrabee: 1 Tflops? 50 GB/s?
Integration promises much more:
- 1000 cores at 5 GHz, 32 flops/cycle (e.g. 512-bit SSE): 160 teraflops
- A 4096-bit memory channel at 2 GHz: 1 terabyte/s to memory, but quite optimistic
Will they deliver?
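Spelled out, the two « integration promises » figures are plain arithmetic on the numbers above:

\[
1000~\text{cores} \times 5~\text{GHz} \times 32~\tfrac{\text{flops}}{\text{cycle}} = 160~\text{Tflops},
\qquad
\frac{4096~\text{bits}}{8~\text{bits/byte}} \times 2~\text{GHz} = 1024~\text{GB/s} \approx 1~\text{TB/s}.
\]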
HPC and QCD: the « processor » architecture issue (for the user)
It is the (main) memory, stupid!
- The « old » vector supercomputers: around 1 word per flop, at per-word granularity
- The superscalar microprocessors: around 1 word per 10 flops, at 64-byte block granularity
- GPU, Cell:
  – Around 1 word per 25 flops, on large contiguous blocks
  – Around 1 word per flop, at large granularity
For QCD: need to find new locality
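These flops-per-word ratios follow from the peak-performance and bandwidth figures on the previous slide; for instance, for the 500 Gflops / 80 GB/s GPU with 4-byte single-precision words:

\[
\frac{500~\text{Gflops}}{80~\text{GB/s} \,/\, 4~\text{B/word}}
= \frac{500~\text{Gflops}}{20~\text{Gwords/s}}
= 25~\text{flops per word}.
\]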
HPC and QCD and GPUs
In 2009, GPUs are very cost-effective floating-point engines:
- High peak performance
- High memory bandwidth
- SIMD-like control
- Double-precision performance?
- Locality exploitation?
Cost-effective hardware solutions (in 2009) for massive vector applications:
- Contiguous vectors of data
- Limited control
- Ad hoc programming (CUDA)?
- Coprocessor model?
HPC and contiguous vector parallelism
Can be exploited by any of these architectures (see the sketch below):
- GPUs: cost-effective and tolerate memory latency; « vector » instructions; (-) application portability
- Cell-like: necessitates explicit data moves; (-) application portability
- Many-cores (Larrabee) with wide SIMD instructions: software prefetch + « vector » instructions; (+) application portability; (-) cache sharing
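A minimal C sketch (our own illustration, assuming GCC's __builtin_prefetch and a hypothetical saxpy_prefetch routine) of the « software prefetch + vector instructions » recipe on a cache-based many-core: a unit-stride loop the compiler can map onto wide SIMD, with explicit prefetch to hide part of the memory latency.

    /* y[i] = a*x[i] + y[i] over contiguous arrays: unit stride, SIMD-friendly. */
    static void saxpy_prefetch(int n, float a,
                               const float * restrict x, float * restrict y)
    {
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(x + i + 64, 0, 0);  /* GCC hint: fetch the read stream ahead */
            __builtin_prefetch(y + i + 64, 1, 0);  /* and the write stream */
            y[i] = a * x[i] + y[i];
        }
    }

On a GPU the same contiguous loop becomes one thread per element; on a Cell-like machine the prefetch is replaced by an explicit DMA of the x and y blocks into the local store.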
Conclusion
- HPC and QCD will have to use off-the-shelf « processors »
- Massive thread parallelism might be available on-chip before 2015: 1000 cores? (if killer applications appear!)
- Contiguous vector parallelism allows huge peak performance in the mid-term: GPUs, SIMD instructions
- Real vectors (strides, scatter-gather)? Unlikely to appear