Manycores in the Future. Rob Schreiber, HP Labs.


1 Manycores in the Future. Rob Schreiber, HP Labs

2 Don't Forget: These views are mine, not necessarily HP's. "Never make forecasts, especially about the future." (Sam Goldwyn)

3 hp labs, 1939

4 HP / HP Labs Today. World's biggest technology company: 2006 sales $91B, #14 in the US. Printing, PCs, servers, software, services. HP Labs has 700 researchers in Palo Alto, Bristol, Haifa, Beijing, Bangalore, Tokyo, and St. Petersburg. It invests in medium- and long-term research that has good potential for return on the investment. New director: Prith Banerjee, dean of the UIC College of Engineering.

5 The Future. It seems clear that: single-thread performance is not getting better; all machines will be parallel; further speedup will come to the extent that we can use the parallel hardware effectively; parallelism has been a huge success in scientific computing; communication bandwidth and energy efficiency are the key limits to improved performance; we should not make the next generation of parallel machines any harder to program than they are now.

6 Moore's Law. Number of transistors per chip is 1.59^(year - 1959). Now the slope is less, but we should see X or more growth (65 nm to sub-10 nm). Classical performance scaling model: performance grows as O(n^3). With feature size scaling of n, you get O(n^2) transistors, and they run O(n) times faster.
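The classical scaling argument on this slide is simple arithmetic, and a short sketch may make it concrete. The function names and the 65 nm to 10 nm example are illustrative assumptions, not taken from the talk.

```python
def transistors_per_chip(year: int, base: float = 1.59, origin: int = 1959) -> float:
    """Growth model quoted on the slide: base ** (year - origin)."""
    return base ** (year - origin)


def classical_scaling(n: float) -> dict:
    """Classical model: shrinking feature size by a factor n gives
    n^2 more transistors, each running n times faster, so n^3 overall."""
    return {"transistors": n ** 2, "speed": n, "performance": n ** 3}


# Hypothetical example (an assumption): shrinking features from 65 nm to 10 nm.
n = 65 / 10
print(classical_scaling(n))
```

Under this model, halving the feature size (n = 2) would give 4 times the transistors and 8 times the performance, which is the O(n^3) expectation the following slides say has not held.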

7 How long will this last? "There's no getting around the fact that we make these things out of atoms." (Gordon Moore)

8 Single core/thread performance. Moore's Law says the number of transistors scales as O(n^2) and speed as O(n), so microprocessor performance should scale as O(n^3). For quite some time, it hasn't.

9 N^3 Era: expansion of data paths from 4 to 32 bits; pipelining, floating point hardware. N^2 Era: large caches, miss rate ~ (cache size)^(-1/2); wide issue, double the IPC with quad issue. N^1 Era: very little benefit from increases in issue width and cache size for many applications; slowdown due to size, long wires.

10 Microprocessor Power. Figure source: Shekhar Borkar, "Low Power Design Challenges for the Decade," Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE.

11 Voltage Scaling. Power is CV^2 f. Lowered voltage has reduced power by (12/1.1)^2 = 119X over 24 years! ITRS projects a minimum voltage of 0.7 V in 2018. Only (1.1/0.7)^2 = 2.5X reduction left in the next 14 years! Conclusion: where GHz is concerned, we are close to the practical limit.
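The two ratios on this slide follow directly from the CV^2 f power model with capacitance and frequency held fixed. A minimal sketch to reproduce them; the function names are illustrative assumptions.

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic switching power P = C * V^2 * f (farads, volts, hertz -> watts)."""
    return c * v ** 2 * f


def voltage_power_ratio(v_old: float, v_new: float) -> float:
    """Power reduction from lowering voltage alone, with C and f unchanged."""
    return (v_old / v_new) ** 2


print(round(voltage_power_ratio(12.0, 1.1)))    # 119, the slide's historical figure
print(round(voltage_power_ratio(1.1, 0.7), 1))  # 2.5, the slide's remaining headroom
```

Because voltage enters squared, the remaining drop from 1.1 V to 0.7 V buys only about 2.5X, which is why the slide concludes that clock rates are near their practical limit.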

12 How Big? The Memory Wall. The Power Wall.

13 Data center thermal management. Modeling datacenters with CFD. Static (design time) and dynamic Smart Cooling.

14 Does it matter, the end of GHz? Word won't go any faster. The problem in commercial computing is to keep up with the enormous volume of data. The problem in scientific computing is to keep up with the enormous volume of data. Throughput is needed. Parallelism works: 491 of the TOP500 have > 256 processors; 512 to 2048 processors is the sweet spot today for scientific machines.

15 Where are we today? Intel Xeon: 2007, 45 nm, 4 cores; 2008, 32 nm, 8 cores; 2010, 22 nm, 16 cores. Intel ships more multicore than unicore chips as of Q4 2006. All these have < 3 GHz clocks. 80 small, low-power cores are possible in 65 nm.

16 The Future, Part I. More than 100 cores, perhaps 1000, will be possible in server-oriented parts optimized for maximum performance per watt. In 10 to 15 years we may be looking at 10 Tflops on a socket.

17 What changes with manycores? Flops are really free. Communication (between cores, with memory) is costly. Memory bandwidths of 5 GB/s today, going up to 20 to 40 GB/s. Flop rates headed towards 1 Tflops per socket. Fixed clock rates mean latency does not get any worse, but the needed bandwidth scales linearly.

18 How Much Bandwidth Is Enough? Scientific and commercial data-centric computing has high BW demands. I/O bandwidth is critical in commercial computing. HPCC Benchmarks (icl.cs.utk.edu/hpcc) show the ratio (bytes/flop) of bandwidth to compute: 0.5 < (bytes/flop) < 2.0 for almost all the machines on the HPCC list. A typical PC has much less bandwidth/flop.
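Combining this bytes/flop range with the 1 Tflops per socket and 20 to 40 GB/s figures from the previous slide shows the size of the gap. The sketch below is only that arithmetic; the function and constant names are assumptions.

```python
GB = 1e9      # bytes
TFLOP = 1e12  # floating point operations


def required_bandwidth(flops_per_second: float, bytes_per_flop: float) -> float:
    """Memory bandwidth (bytes/s) needed to sustain a flop rate at a given balance."""
    return flops_per_second * bytes_per_flop


def machine_balance(bytes_per_second: float, flops_per_second: float) -> float:
    """Balance (bytes/flop) a machine actually provides."""
    return bytes_per_second / flops_per_second


# A 1 Tflops socket at the HPCC range of 0.5 to 2.0 bytes/flop.
low = required_bandwidth(1 * TFLOP, 0.5) / GB   # 500 GB/s
high = required_bandwidth(1 * TFLOP, 2.0) / GB  # 2000 GB/s

# The same socket with the projected 40 GB/s of memory bandwidth.
projected = machine_balance(40 * GB, 1 * TFLOP)  # 0.04 bytes/flop

print(low, high, projected)
```

On these numbers the projected socket would offer roughly a tenth or less of the balance that machines on the HPCC list provide.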

19 How much bandwidth can we get? 1000 pins would provide TB/s bandwidths. But at a minimum of 2 x 10^{-12} J/b, that costs power: 2 x 10^{-12} J/b * 10^{13} b/s = 20 W, and 10 TB/s = 200 W or more.
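The power figures here are energy per bit multiplied by bit rate. A small sketch of that calculation follows. The slide's round numbers are consistent with counting about 10 bits on the wire per byte (10^13 b/s for roughly 1 TB/s); that factor and all names below are assumptions, not something the slide states.

```python
ENERGY_PER_BIT = 2e-12       # joules per bit, the slide's stated minimum
BITS_PER_BYTE_ON_WIRE = 10   # assumption: 8 data bits plus signalling overhead


def io_power_watts(joules_per_bit: float, bits_per_second: float) -> float:
    """Power spent moving data: energy per bit times bit rate."""
    return joules_per_bit * bits_per_second


def io_power_for_bytes(joules_per_bit: float, bytes_per_second: float,
                       bits_per_byte: int = BITS_PER_BYTE_ON_WIRE) -> float:
    """Same calculation starting from a bandwidth in bytes per second."""
    return io_power_watts(joules_per_bit, bytes_per_second * bits_per_byte)


print(io_power_watts(ENERGY_PER_BIT, 1e13))       # about 20 W
print(io_power_for_bytes(ENERGY_PER_BIT, 10e12))  # about 200 W for 10 TB/s
```

With exactly 8 bits per byte the 10 TB/s case comes to about 160 W, so the slide's "200 W or more" reads as a rounded figure that allows for overhead.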

20 Don't Caches Make BW Less Important? Some kernels (dense matrix ops) cache perfectly and need very little memory BW. Unfortunately, handling large meshes and graphs, iterative solution methods, and multigrid do not. Even when cache works, writing the programs is a formidable job: vendor BLAS, self-tuned libraries, multiple levels of blocking, doing more work to save time.
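To illustrate what "blocking" means for a dense kernel, here is a minimal single-level sketch of a blocked matrix multiply next to the naive version. It is pure Python for readability, assumes square matrices stored as lists of lists, and is not how a vendor BLAS is written; real libraries use several blocking levels tuned to the cache hierarchy.

```python
def matmul_naive(a, b):
    """Reference triple loop: streams through all of b for every row of a."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """One level of blocking: work on block x block tiles so each tile of
    a, b and c is reused many times while it is still in cache."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one tile of a by one tile of b into one tile of c.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul_blocked(a, b) == matmul_naive(a, b))  # True
```

Both versions do the same arithmetic; the blocked one only reorders it. The extra loop nesting and edge handling hint at why the slide calls writing such programs a formidable job.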

21 What about communication? On-chip networks: two-dimensional meshes are a natural thing on a chip, but they have been tried and rejected in HPC. Stacked memory: capacity, cooling. Optics (integrated on board and on chip): the energy costs can be low and the bandwidth can be high; more on-chip and off-chip bandwidth at reasonable power? Cost, reliability, manufacturability…

22 The Future, Part II. Without a breakthrough in memory bandwidth, a lot of the potential parallel applications that could use manycore chips won't be able to do so. This will be a serious problem for the industry and its customers.

23 Architectures, Accelerators. 1985 to 2005: the killer micro made all other machines obsolete. The slowdown of single cores appears to open the door to other architectures: FPGAs, GPGPUs, and accelerators. Example: Clearspeed, with 32 SIMD lanes with local memory and block data transfers from main memory under program control, overlapped with computation.
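The idea of program-controlled block transfers overlapped with computation is often called double buffering. The sketch below shows the pattern in general terms using a Python thread as a stand-in for a DMA engine. It is not Clearspeed's programming interface, and every name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(data, start, size):
    """Stand-in for a block transfer from main memory into local memory."""
    return data[start:start + size]


def process_stream(data, block_size, kernel):
    """Apply kernel to each block while the next block is being fetched."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first block before any computation begins.
        pending = pool.submit(load_block, data, 0, block_size)
        for start in range(0, len(data), block_size):
            current = pending.result()  # wait for the in-flight transfer
            nxt = start + block_size
            if nxt < len(data):
                # Kick off the next transfer, then compute on the current block.
                pending = pool.submit(load_block, data, nxt, block_size)
            results.append(kernel(current))
    return results


print(process_stream(list(range(10)), 4, sum))  # [6, 22, 17]
```

The program, not a hardware cache, decides what to fetch and when, which is what lets the transfer time hide behind the computation when the two are well matched.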

24 But if flops are free… Move functions into the chip, onto the cores: NICs, computational kernels, graphics. This makes it tough to sell a machine that accelerates computation.

25 Writing the Programs. There are some new things worth trying: GAS languages for scientific computing; transactions, for more complicated algorithms; there now is a parallel Matlab. Improvements to the architecture can have a big impact on programmability: lower latency across chip than board; higher bandwidth to memory; fast synchronization; use some of the cores to help with communication.

26 I hope it is even more clear that: single-thread performance is not getting better; all machines will be parallel very soon; there are a lot of apps involving enormous datasets that have plenty of parallelism; further throughput will come from using the parallel hardware effectively; communication bandwidth and energy efficiency are the key limits to improved performance; we may not need to make parallel machines any harder to program than they are now.

27 hp labs, 2007

