
1 Exploring the Design Space of Future CMPs
Authors – Jaehyuk Huh, Doug Burger, and Stephen W. Keckler
Presenter – Sushma Myneni

2 Agenda
- Motivation & Goals
- Brief background on multicore / CMPs
- Technical details presented in the paper
- Key results and contributions
- Conclusions
- Drawbacks
- How the paper relates to class and my project
- Q&A

3 Motivation & Goals
Motivation
- The superscalar paradigm is reaching diminishing returns
- Wire delays will limit the area of the chip that is useful for a single conventional processing core
Goals
- Compare area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should use in-order or out-of-order issue, and how big the per-processor on-chip cache should be
Related Work
- Compaq Piranha

4 Brief background on CMPs
Metric to evaluate CMPs: maximize total chip performance, i.e., maximize job throughput
Maximizing job throughput involves comparing:
- Processor organization: out-of-order or smaller in-order issue
- Cache hierarchy: amount of cache memory per processor
- Off-chip bandwidth: finite bandwidth limits the number of cores that can be placed on the chip
- Application characteristics: applications with different memory access patterns require different CMP designs to attain maximum throughput

5 Brief background on CMPs
Chip multiprocessor model
- L1 and L2 cache per processor
- L2 cache directly connected to off-chip DRAM through a set of distributed memory channels
- Shared L2 cache: large cache bandwidth requirements vs. slow global wires

6 Technical Details
Area models
- The model expresses all area in terms of CBE (the unit area for one byte of cache)
- Both in-order and out-of-order issue processors were considered, taking cache sizes into consideration
- Performance per unit area is compared for 2-way in-order (P_IN) and 4-way out-of-order (P_OUT) cores (see the sketch below)
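A minimal sketch of how this CBE bookkeeping might look. The core-area figures and the 4x in-order/out-of-order area ratio below are assumed placeholders for illustration, not numbers taken from the paper:

```python
# Minimal sketch of CBE-based area accounting; all constants are
# assumed placeholders, not figures from the paper.

CBE = 1.0  # unit: the area occupied by one byte of cache

# Hypothetical core areas in CBE units; the 4x ratio between the
# 4-way out-of-order core (P_OUT) and the 2-way in-order core (P_IN)
# is an assumption.
P_IN_AREA_CBE = 32 * 1024
P_OUT_AREA_CBE = 128 * 1024

def config_area_cbe(n_cores: int, core_area_cbe: float, l2_bytes: int) -> float:
    """Total area of a CMP configuration in CBE units: cores plus per-core L2."""
    return n_cores * (core_area_cbe + l2_bytes * CBE)

def perf_per_area(throughput: float, area_cbe: float) -> float:
    """The figure of merit the slide names: performance per unit area."""
    return throughput / area_cbe
```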

7 Technical Details
I/O pin bandwidth
- The number of I/O pins built on a single chip is limited by physical technology and does not scale with transistors
- The number of pins per transistor decreases as technology advances
- I/O pin speeds have not increased at the same rate as processor clock rates
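A back-of-envelope illustration of why this matters: fixed pin bandwidth caps the number of cores a chip can usefully feed. Every number here is an assumed placeholder, not a figure from the paper:

```python
# Back-of-envelope illustration of the pin-bandwidth ceiling.
# All numbers are assumed placeholders.

signal_pins = 1000          # assumed pins usable for memory traffic
gbits_per_pin = 1.0         # assumed per-pin signaling rate (Gb/s)
chip_bw_gbs = signal_pins * gbits_per_pin / 8   # aggregate GB/s off chip

bw_per_core_gbs = 4.0       # assumed miss traffic one core generates (GB/s)

# Cores beyond this count would saturate the pins, so throughput stops
# scaling even if die area would allow more cores.
max_cores_by_bw = int(chip_bw_gbs / bw_per_core_gbs)
print(max_cores_by_bw)      # -> 31
```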

8 Technical Details
Maximizing throughput
- Performance on server workloads can be defined as the aggregate performance of all the cores on the chip
- Given the number of cores N_c and the performance of each core P_i, the peak performance of a server CMP is P_cmp = Σ_{i=1}^{N_c} P_i
- The performance of an individual core in a CMP depends on application characteristics such as available instruction-level parallelism, cache behavior, and communication overhead among threads
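A direct transcription of the slide's formula, with an assumed example value for P_i:

```python
# P_cmp = sum over the N_c cores of each core's performance P_i.

def cmp_peak_performance(per_core_perf):
    """per_core_perf: iterable of P_i values, one per core (N_c entries)."""
    return sum(per_core_perf)

# Example with four identical cores, each delivering P_i = 1.2
# (units are whatever the performance metric is, e.g. IPC):
print(cmp_peak_performance([1.2] * 4))  # ≈ 4.8
```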

9 Technical Details
Application characteristics
- Ten SPEC benchmarks were chosen: mesa, mgrid, equake, gcc, ammp, vpr, parser, perlbmk, art, and mcf
Taxonomy of applications
- Processor-bound: applications whose working set can be captured easily in the L2 cache (mesa, mgrid, equake)
- Cache-sensitive: applications whose performance is limited by L2 cache capacity (gcc, ammp, vpr, parser, and perlbmk)
- Bandwidth-bound: applications whose performance is limited strictly by the rate at which data can be moved between processor and DRAM (art, mcf, and sphinx)
- Applications are not fixed to one class; they move among these three domains as processor, cache, and bandwidth capacities are modulated (see the sketch below)
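A hedged sketch of that last point: the same application lands in a different class depending on the L2 capacity and bandwidth it is given. The thresholds and the test application are assumptions for illustration only:

```python
# Illustrative classifier for the slide's taxonomy; thresholds and
# the example application are assumed, not from the paper.

def classify(working_set_kb, l2_kb, miss_bw_gbs, chip_bw_gbs=6.4):
    if working_set_kb <= l2_kb:
        return "processor-bound"   # working set captured by the L2
    if miss_bw_gbs >= chip_bw_gbs:
        return "bandwidth-bound"   # limited by the processor<->DRAM rate
    return "cache-sensitive"       # limited by L2 capacity

# The same hypothetical application changes class as L2 capacity and
# generated miss traffic are modulated:
print(classify(512, 1024, 1.0))   # processor-bound behind a 1 MB L2
print(classify(512, 256, 1.0))    # cache-sensitive behind a 256 KB L2
print(classify(512, 256, 8.0))    # bandwidth-bound once misses saturate pins
```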

10 Technical Details
Experimental methodology
- Used the SimpleScalar tool set to model both the in-order and out-of-order processors, P_IN and P_OUT

11 Results – Effect of varying L2 cache size

12 Results – Performance Scalability versus channel sharing

13 Maximizing CMP Throughput
- Combine the area analysis and performance simulations to find which CMP configuration will be most area-efficient for future technology
- Fixed chip area: 400 mm²
- Calculate the number of cores and cores per channel that fit in the chip area for different cache sizes (a sizing sketch follows)
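A sketch of this sizing step under stated assumptions: the per-core area and the area cost per KB of cache below are placeholders, not the paper's numbers:

```python
# For a fixed 400 mm^2 die, derive how many cores fit at each
# candidate per-core L2 size. Area costs are assumed placeholders.

CHIP_AREA_MM2 = 400.0
CORE_AREA_MM2 = 10.0     # assumed area of one processing core
MM2_PER_KB = 0.05        # assumed area cost of 1 KB of L2 cache

for l2_kb in (128, 256, 512, 1024, 2048):
    per_core_mm2 = CORE_AREA_MM2 + l2_kb * MM2_PER_KB
    n_cores = int(CHIP_AREA_MM2 // per_core_mm2)
    print(f"L2 = {l2_kb:>4} KB -> {n_cores:>2} cores per chip")
```

Larger per-core caches shrink the core count, which is exactly the trade-off the throughput results explore.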

14 Results – Application type versus Throughput

15 Results – Technology Scaling

16 CMPs for Server Applications
Most commonly used server workloads: OLTP and DSS
- DSS workloads behave like cache-sensitive applications, favoring a large (1 MB / 2 MB) L2 cache
- OLTP workloads behave like bandwidth-bound applications

17 Conclusions
- Transistor counts are projected to increase faster than pin counts, which limits the number of cores that can be used in future technologies
- Out-of-order issue cores are more efficient than in-order issue cores
- Across workloads, the impact of insufficient bandwidth causes the throughput-optimal L2 cache size to grow from 128 KB at 100 nm to 1 MB at 50 nm and 35 nm
- As technology advances, wire delays may become too high to add more cache per processor

18 Drawbacks
- SPEC benchmarks were used, which are not representative of server workloads
- Power consumption was not considered at all while maximizing performance per unit area
- The evaluation assumed signaling speed increases linearly at 1.5 times the processor clock; technology advances in this area may permit a larger number of processors than predicted

19 Paper Related to Class and Project
- Relation to class: We have been studying multi-core architecture since the beginning of this semester. This paper showed how to design a CMP architecture for a given application, taking the current technology into consideration.
- Relation to project: My project studies CMP architecture in relation to Mobile Edge Computing devices.

20 Q & A

