Core Architecture Optimization for Heterogeneous CMPs R. Kumar, D. M. Tullsen, and N.P. Jouppi İlker YILDIRIM

1 Core Architecture Optimization for Heterogeneous CMPs R. Kumar, D. M. Tullsen, and N.P. Jouppi İlker YILDIRIM ilker.yildirim@boun.edu.tr

2 Outline Introduction Related Work Using Workloads for Multi-core Design Monotonicity vs Non-monotonicity Methodology Analysis and Results Conclusion

3 Introduction Multi-core processors are becoming more popular. They offer more flexibility in design, including heterogeneity across cores. But how should CMPs with such heterogeneity be designed?

4 Introduction Design factors: workloads, power and area constraints, level of threading, etc. Which is the best design: a combination of good general-purpose cores, or a combination of specialized cores?

5 Related Work Prior work on heterogeneous CMP design addresses power efficiency and improved processor performance. None of it addresses how to arrive at such a design; it all assumes a given design as a starting point.

6 Using Workloads for Design Best design for what? For a set of applications, with a representative set of workloads. Searching for a single optimal CPU is already expensive; the search space explodes for multiprocessors. Simplifying assumptions: Performance of the CMP = sum of the performance of its cores. Private caches. Only major blocks are configurable. Only a fixed number (4) of cores is considered.

7 Monotonicity vs Non-monotonicity The cores of a CMP possess monotonicity if they can be fully ordered: in terms of performance; in terms of voltage/frequency. Non-monotonicity: there is no full ordering, only a partial ordering. For example, one core is good for memory-intensive jobs, while the other is good for jobs with a large number of instructions.
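
The distinction between a full and a partial ordering can be checked mechanically. The following is a minimal sketch, assuming each core is described by its performance on a few workload classes; the core names, workload classes, and IPC numbers are hypothetical and only illustrate the definition.

```python
from itertools import combinations

def dominates(a, b):
    """True if core a performs at least as well as core b on every workload."""
    return all(a[k] >= b[k] for k in a)

def is_monotonic(cores):
    """Cores are monotonic if every pair is comparable (a full ordering)."""
    return all(dominates(a, b) or dominates(b, a)
               for a, b in combinations(cores.values(), 2))

# Hypothetical per-workload IPC values, for illustration only.
monotonic = {
    "small": {"memory_bound": 0.4, "compute_bound": 0.8},
    "large": {"memory_bound": 0.6, "compute_bound": 1.5},
}
non_monotonic = {
    "big_cache": {"memory_bound": 0.9, "compute_bound": 0.7},
    "wide_issue": {"memory_bound": 0.5, "compute_bound": 1.6},
}

print(is_monotonic(monotonic))      # True: "large" is better everywhere
print(is_monotonic(non_monotonic))  # False: each core wins on one workload
```

In the second set neither core dominates the other, so the two can only be partially ordered.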

8 Methodology Modeling of CPU cores Modeling Power and Area Modeling Performance

9 CPU cores 4-core multiprocessors in 0.10 micron, 1.2 V technology. Private L2 caches. In-order cores (Alpha EV5) and out-of-order cores (MIPS R10000). 480 cores evaluated (96 in-order, 384 out-of-order). Number of possible distinct 4-core combinations: over 2.2 billion.
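
The "over 2.2 billion" figure can be reproduced under one assumption that the slide does not state explicitly: a 4-core design is an unordered selection of 4 cores from the 480 candidates, with repeats allowed. A short sketch of that count:

```python
import math

N_CORE_TYPES = 480  # 96 in-order + 384 out-of-order (from the slide)
N_SLOTS = 4         # cores per CMP

# Assumption: core order does not matter and a core type may be reused.
# The number of multisets of size k drawn from n items is C(n + k - 1, k).
n_designs = math.comb(N_CORE_TYPES + N_SLOTS - 1, N_SLOTS)

print(f"{n_designs:,} distinct 4-core designs")
```

Under this assumption the count comes out at roughly 2.24 billion, which is consistent with the number quoted on the slide.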

11 Power and Area Area budget = sum of the areas of the 4 cores. Only peak activity power is considered. Each core ranges between 4.1 and 16.3 W of power and between 3.3 and 22 mm² of area. Aggregate for 4 cores: 13.2 to 88 mm² of area and 16.4 to 65.2 W of power.
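
As a rough illustration of how such budgets constrain a design, the sketch below sums per-core area and peak power and compares them with a budget. The two cores use the extreme values quoted on the slide; the `Core` type, the `fits_budget` helper, and the 50 mm² / 40 W budget are hypothetical.

```python
from typing import NamedTuple

class Core(NamedTuple):
    name: str
    area_mm2: float
    peak_power_w: float

def fits_budget(cores, area_budget_mm2, power_budget_w):
    """A design fits if summed area and summed peak power stay within budget."""
    total_area = sum(c.area_mm2 for c in cores)
    total_power = sum(c.peak_power_w for c in cores)
    return total_area <= area_budget_mm2 and total_power <= power_budget_w

# Extreme per-core values from the slide; the budget below is made up.
smallest = Core("smallest", 3.3, 4.1)
largest = Core("largest", 22.0, 16.3)

print(fits_budget([smallest] * 4, 50.0, 40.0))           # True
print(fits_budget([largest] * 4, 50.0, 40.0))            # False
print(fits_budget([largest] + [smallest] * 3, 50.0, 40.0))  # True
```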

13 Performance - Workloads SPEC: The Standard Performance Evaluation Corporation is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers (www.spec.org). Workloads are combinations of processor-bound and bandwidth-bound benchmarks, ranging from all different (a,b,c,d) to all same (a,a,a,a), covering a wide range of workloads.

14 Performance - Evaluation 2.2 billion distinct CMPs using 480 distinct cores. Performance of a 4-core CMP = sum of the performance of each core, each with its own private L2 cache. Per-core evaluation cost: #distinct cores x #benchmarks x #cycles = 480 x 10 x 250 million cycles. Metric: weighted speedup, the arithmetic sum of each running thread's IPC divided by its IPC on the simplest core considered.
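
The weighted speedup metric can be written out directly from its definition. This is a minimal sketch; the function name and all IPC values are hypothetical, and only the benchmark names come from the slides.

```python
def weighted_speedup(thread_ipc, baseline_ipc):
    """Sum over threads of (IPC on assigned core) / (IPC on simplest core)."""
    return sum(thread_ipc[t] / baseline_ipc[t] for t in thread_ipc)

# Hypothetical IPC values, for illustration only.
baseline_ipc = {"eon": 0.5, "mesa": 0.5, "deltablue": 0.25, "mcf": 0.25}
thread_ipc = {"eon": 1.0, "mesa": 0.5, "deltablue": 0.5, "mcf": 0.25}

print(weighted_speedup(thread_ipc, baseline_ipc))  # 2 + 1 + 2 + 1 = 6.0
```

A value of 4.0 for four threads would mean the design does no better than four copies of the simplest core.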

15 Analysis and Results Analyzing multi-core processors for a given workload Analyzing multi-core processors for a given budget Quantifying inefficiency due to monotonicity Varying Thread-level Parallelism Efficient Search Techniques

16 Analyzing for a Given Workload All different: eon, mesa, deltablue, mcf. Observe non-monotonicity.

17 [Figure panels: mesa, mcf, deltablue, eon]

18 Analyzing for a given budget Extended analysis in two ways: Any combination of workloads. Different area and power budgets.

19 All-same case: Heterogeneity captures the diversity among different homogeneous workloads. Performance depends on the power budget. Heterogeneity yields specialized cores, whereas homogeneity yields envelope cores.

20 Significant benefit, as long as the power and area budgets are constrained. The diversity required is related to the available budgets: the tighter the constraints, the greater the diversity. Large difference between the best heterogeneous and the best homogeneous CMP designs. The best design is not a composition of copies of the single best-performing core; rather, it is a combination of tuned cores.

21 Quantifying inefficiency due to monotonicity The best non-monotonic design of 2 cores is better than the best monotonic design of 2 cores by 7.5%, and better than the best homogeneous design of 2 cores by 15.4%. The tighter the constraints, the higher the cost of monotonicity.

22 Varying Thread-Level Parallelism Again, heterogeneity has benefits. When there is less competition, performance is better.

23 Again, observe the benefits of heterogeneity. Interestingly, for TLP=1, a design of heterogeneous, tuned cores is better than a huge monolithic core with tiny complementary cores.

24 Efficient Search Techniques A huge search space: 2.2 billion distinct core combinations and thousands of 4-thread workloads. Exhaustive search is not scalable: what if there are more than 10 different applications, or more than 4 cores? A smarter solution: hill climbing, which is likely to get stuck at a local maximum. Its results are still better than those of homogeneous designs.
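
The hill-climbing idea can be sketched as follows. This is not the authors' implementation: the three core types, their (performance, area) values, the area budget, and the scoring function are all hypothetical, chosen so that the search visibly stops at a local maximum.

```python
# Hypothetical core types: name -> (performance, area in mm^2).
CORES = {"tiny": (1.0, 3.3), "mid": (1.8, 10.0), "big": (2.5, 22.0)}
AREA_BUDGET = 50.0  # made-up budget

def score(design):
    """Summed performance, or -inf if the design exceeds the area budget."""
    area = sum(CORES[c][1] for c in design)
    if area > AREA_BUDGET:
        return float("-inf")
    return sum(CORES[c][0] for c in design)

def hill_climb(start, candidates, score_fn):
    """Swap one core at a time, keeping any swap that improves the score."""
    current = list(start)
    best = score_fn(current)
    improved = True
    while improved:
        improved = False
        for slot in range(len(current)):
            for cand in candidates:
                trial = current[:slot] + [cand] + current[slot + 1:]
                s = score_fn(trial)
                if s > best:
                    current, best, improved = trial, s, True
    return current, best

design, design_score = hill_climb(["tiny"] * 4, list(CORES), score)
print(design, design_score)
print(["mid"] * 4, score(["mid"] * 4))
```

On this toy data the climb ends at a design that no single-core swap can improve, yet four "mid" cores score higher, which is the local-maximum behaviour the slide warns about.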

25 Conclusions How to design a good heterogeneous CMP for a given power and area budget and a set of workloads. The best approach is not to combine good general-purpose cores but to combine tuned heterogeneous cores. Such tuning results in non-monotonicity, in the sense that the cores can only be partially ordered. Heterogeneous designs also perform better for homogeneous workloads.

