
1 “Processors” issues for LQCD January 2009 André Seznec IRISA/INRIA

2  Technology trend for general-purpose HPC « processor »
- Up to the early 90's: multi-chip vector processors, ~$10,000,000
  - Major cost: the memory system
    - Strided vectors
    - Scatter-gather for sparse processing (both access patterns are sketched below)
- From the mid 90's: use of the "killer micros", shared or distributed memory
  - More cost-effective peak performance
  - Effective when the memory hierarchy is leveraged
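The two access patterns named above can be made concrete with a small sketch; the function and array names are illustrative only, not taken from the talk.

```c
/* Illustrative sketch: the two access patterns that classic vector
 * supercomputers supported directly in hardware.                    */
#include <stddef.h>

/* Strided access: one column of a row-major matrix, stride = ncols. */
void copy_column(double *dst, const double *a,
                 size_t nrows, size_t ncols, size_t col)
{
    for (size_t i = 0; i < nrows; i++)
        dst[i] = a[i * ncols + col];      /* vector load with stride ncols */
}

/* Scatter-gather: sparse processing, indices drive the memory accesses. */
void gather_add(double *y, const double *x, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += x[idx[i]];                /* gather x[idx[i]] */
}
```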

3  HPC has very limited impact on processor development
- For 15 years, HPC has not driven « processor » development:
  - Niche market
  - Use (and maybe adapt) off-the-shelf components:
    - High-end microprocessors (Alpha, Power, HP-PA, Itanium), and more and more x86; need to exploit the memory hierarchy
    - Now: GPUs (massively threaded vector processors, a specialized form of vector processing) and the Cell (hand-managed local memory)

4  For QCD
- Vector supercomputers were not cost effective:
  - Too expensive
  - Limited performance: 1 flop per word
- "Build your own machine" tradition in the QCD community:
  - ApeMille, ApeNext
  - Exploit the particularities of the algorithm: complex arithmetic, small matrices (see the kernel sketch below)
  - VLIW architecture, no cache
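As a minimal sketch of what "complex arithmetic, small matrices" means for QCD (not from the talk itself): the inner kernel of the lattice Dirac operator applies a 3x3 complex SU(3) link matrix to a 3-component colour vector, roughly 66 flops for 96 bytes of single-precision operands, which is why the flops-per-word balance discussed later matters so much. The function name and data layout are assumptions for illustration.

```c
/* Minimal sketch of the QCD inner kernel: a 3x3 complex (SU(3)) link
 * matrix applied to a 3-component colour vector.
 * ~66 flops for 96 bytes of single-precision matrix/vector data.      */
#include <complex.h>

void su3_mul_vec(float complex y[3],
                 const float complex u[3][3],
                 const float complex x[3])
{
    for (int i = 0; i < 3; i++) {
        float complex acc = 0.0f;
        for (int j = 0; j < 3; j++)
            acc += u[i][j] * x[j];   /* complex multiply-add */
        y[i] = acc;
    }
}
```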

5  No one can afford to design a high-performance chip
- Use or (at best) adapt off-the-shelf components
- A new biodiversity of high-performance floating-point engines is now available

6  Intel Terascale prototype: 80 processors, 1.81 Teraflops, 265 Watts (just a prototype!)

7  The many-core era
- 4-8 general-purpose cores now; 100-1000 in 2015-2020
  - Technologically feasible
  - Economic viability?
    - Parallel general-purpose applications? Will end users accept paying for 1,000 cores when applications exhibit only a 10x speed-up?
    - Main memory bandwidth will not scale!
- Which architecture for the many-cores?
  - Until 2009, homogeneous multicores for general purpose, heterogeneous for embedded / special purpose (e.g. Cell, GPU)

8  Direction of (single-chip) architecture: betting on parallelism success
- 1 complex 4-way superscalar core = 16 simple RISC cores
- If (future) applications are intrinsically parallel: as many simple cores as possible
  - SSC: Sea of Simple Cores
- If (future) applications are only moderately parallel: a few complex state-of-the-art superscalar cores
  - FCC: Few Complex Cores

9  SSC: Sea of Simple Cores (e.g. Intel Larrabee)

10  FCC: Few Complex Cores (e.g. Intel Nehalem)
[diagram: several 4-way out-of-order superscalar cores sharing an L3 cache]

11  Homogeneous vs heterogeneous
- Homogeneous: just replicate the same core
  - Extension of "conventional" multiprocessors
- Heterogeneous:
  - À la Cell? (master processor + slave processors) x N
  - À la SoC? Specialized (poorly programmable) coprocessors: unlikely for HPC
  - Same ISA, but different microarchitectures? Unlikely in the short term

12  Hardware accelerators?
- SIMD extensions:
  - Seem to be accepted; shift the burden onto application developers and compilers
  - 512-bit SIMD instructions on Larrabee
    - General trend, not such a mess in hardware :-)
- Reconfigurable datapaths:
  - Popular when you have a well-defined, intrinsically parallel application
  - Programmability?
- Real vector extensions (strides and scatter-gather):
  - Would be a good move for HPC, but are there mainstream applications benefiting? (see the contrast sketched below)
  - Not very useful for QCD
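A short illustrative contrast (not from the talk) between a loop that current SIMD extensions handle well and one that needs real strided/gather vector support:

```c
/* The first loop maps directly onto packed SIMD instructions
 * (SSE/AVX/Larrabee-style); the second needs gather support,
 * i.e. "real vector" extensions, otherwise it runs with scalar loads. */
void axpy_contiguous(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];            /* unit stride: SIMD-friendly */
}

void axpy_indexed(float *y, const float *x, const int *idx, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[idx[i]];       /* gather: needs scatter-gather hardware */
}
```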

13  Reconsider the "on-chip memory / processors" tradeoff
- The uniprocessor credo was: "use the remaining silicon for caches"
- New issue: "an extra processor or more cache" (more recently, local memory, e.g. Cell)
  - Extra processor = more processing power
    - Increased memory bandwidth demand
    - Increased power consumption, more thermal hot spots
  - More cache or local memory = decreased (main) memory demand

14  Memory hierarchy organization?

15  Flat organization?
[diagram: many cores, each with a private cache, on a flat interconnect]
- Local or distributed memory or cache?
- Manage cache locality through software or hardware?

16  Hierarchical organization?
[diagram: clusters of cores with private caches sharing an L2 cache, and clusters sharing an L3 cache]

17  Hierarchical organization?
- Arbitration at all levels
- Coherency at all levels
- Interleaving at all levels
- Bandwidth dimensioning

18  Hardware multithreading, of course!
- Execute several threads on a single core
  - Pentium 4, Nehalem, Sun Niagara
- Just an extra level of thread parallelism: if you already run 100 processes, you can likely afford 1,000
- A major means of tolerating memory latency:
  - GPUs feature hundreds of threads (see the back-of-the-envelope calculation below)
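A rough back-of-the-envelope calculation, with assumed latency and bandwidth figures that are not from the talk, shows why so much concurrency is needed to keep a memory system busy:

```c
/* Little's law sketch with assumed, illustrative numbers:
 * concurrency needed = bandwidth * latency / request size.
 * With ~25 GB/s of memory bandwidth, ~200 ns latency and 64-byte lines,
 * about 78 memory requests must be in flight at all times, which already
 * calls for on the order of a hundred hardware threads.                  */
#include <stdio.h>

int main(void)
{
    double bandwidth = 25e9;    /* bytes/s, assumed  */
    double latency   = 200e-9;  /* seconds, assumed  */
    double line      = 64.0;    /* bytes per request */

    double in_flight = bandwidth * latency / line;
    printf("requests in flight needed: %.0f\n", in_flight);  /* ~78 */
    return 0;
}
```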

19  For HPC and QCD?
- Unprecedented potential performance off-the-shelf: single-chip teraflops is nearly there
  - 200 Gflops Cell: 25 GB/s to memory
  - 500 Gflops GPU: 80 GB/s to memory
  - 50 Gflops Nehalem: 25 GB/s to memory
  - Larrabee: 1 Tflops? 50 GB/s?
- 2015-2020:
  - 1000 cores, 5 GHz, 32 flops/cycle (e.g. 512-bit SSE) = 160 teraflops: integration promises it (see the arithmetic check below)
  - 4096-bit memory channel at 2 GHz: 1 terabyte/s to memory, but quite optimistic
  - Will they deliver?
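The 2015-2020 projection is straightforward arithmetic; a small check using exactly the slide's figures:

```c
/* Check of the slide's 2015-2020 projection:
 * 1000 cores x 5 GHz x 32 flops/cycle = 160 Tflops peak,
 * 4096-bit channel x 2 GHz = 8192 Gbit/s ~ 1 TB/s to memory. */
#include <stdio.h>

int main(void)
{
    double peak_flops = 1000.0 * 5e9 * 32.0;   /* 1.6e14 = 160 Tflops     */
    double mem_bw     = 4096.0 / 8.0 * 2e9;    /* 1.024e12 B/s ~ 1 TB/s   */
    printf("peak: %.0f Tflops, bandwidth: %.2f TB/s\n",
           peak_flops / 1e12, mem_bw / 1e12);
    return 0;
}
```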

20  HPC and QCD: the « processor » architecture issue (for the user)
- It is the (main) memory, stupid!
  - The « old » vector supercomputers: around 1 word per flop, per-word granularity
  - The superscalar microprocessors: around 1 word per 10 flops, per 64-byte block
  - GPU, Cell: around 1 word per 25 flops, large contiguous blocks
  - 2015-2020: around 1 word per 100-1000 flops, large granularity
- For QCD: need to find new locality
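The flops-per-word figures above follow from machine balance: peak flops divided by the number of words the memory system can deliver per second. A small illustrative check using the Cell figures from slide 19 (the 4-byte word size is an assumption):

```c
/* Illustrative balance calculation (numbers from slide 19):
 * flops per word delivered = peak flops / (bandwidth / word size).
 * A 200 Gflops Cell with 25 GB/s and 4-byte words gives ~32,
 * the same order as the "1 word per 25 flops" quoted above.      */
#include <stdio.h>

int main(void)
{
    double peak_flops = 200e9;   /* flops/s, from slide 19               */
    double bandwidth  = 25e9;    /* bytes/s, from slide 19               */
    double word       = 4.0;     /* bytes, single precision (assumption) */

    printf("flops per word: %.1f\n", peak_flops / (bandwidth / word));
    return 0;
}
```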

21  HPC and QCD and GPUs
- In 2009, GPUs are very cost-effective floating-point engines:
  - High peak performance
  - High memory bandwidth
  - SIMD-like control
  - Double-precision performance? Locality exploitation?
- Cost-effective hardware solutions (in 2009) for massive vector applications:
  - Contiguous vectors of data
  - Limited control
  - Ad hoc programming (CUDA)? Coprocessor model?

22  HPC and contiguous vector parallelism
- Can be exploited by any architecture:
  - GPUs: cost effective and tolerate memory latency; « vector » instructions, but application portability is a concern
  - Cell-like: necessitates explicit data moves; application portability is a concern
  - Many-cores (Larrabee) with wide SIMD instructions: software prefetch plus « vector » instructions, with application portability, but cache sharing is a concern (see the sketch below)
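A sketch of what "software prefetch + « vector » instructions" can look like on a contiguous vector loop; __builtin_prefetch is the GCC/Clang builtin, and the prefetch distance of 64 elements is an illustrative assumption.

```c
/* Sketch of software prefetch on a contiguous, SIMD-friendly loop.
 * The compiler can map the loop body onto wide SIMD instructions;
 * the explicit prefetch hides part of the main memory latency.
 * The prefetch distance (64 elements ahead) is an assumption.      */
void scaled_add(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 64 < n) {                          /* stay within the arrays */
            __builtin_prefetch(&x[i + 64], 0, 0);  /* read, low temporal locality */
            __builtin_prefetch(&y[i + 64], 1, 0);  /* write */
        }
        y[i] += a * x[i];                          /* unit stride, SIMD-friendly */
    }
}
```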

23  Conclusion
- HPC and QCD will have to use off-the-shelf « processors »
- Massive thread parallelism might be available on-chip before 2015: 100-1000 cores? (if killer applications appear!)
- Contiguous vector parallelism allows huge peak performance in the mid-term: GPUs, SIMD instructions
- Real vectors (strides, scatter-gather)? Unlikely to appear

