
1 ISPASS 2011. Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa. Department of Computer Science, University of Virginia.

2 Motivation
The number of cores doubles every 18 months
Expected: performance ∝ number of cores
One of the bottlenecks is shared-resource contention
For multi-threaded workloads, contention is unavoidable
To reduce contention, it is necessary to understand where and how the contention is created

3 Shared Resource Contention in Chip-Multiprocessors
[Diagram: Intel Quad Core Q9550 with cores C0-C3, each with a private L1 cache, shared L2 caches, and a common Front-Side Bus to memory; threads of Application 1 and Application 2 are mapped onto the cores.]

4 Scenario 1: Multi-threaded application with a co-runner
[Diagram: threads of Application 1 and Application 2 spread across cores C0-C3; per-core L1, shared L2, memory.]

5 Scenario 2: Multi-threaded application without a co-runner
[Diagram: threads of a single application spread across cores C0-C3; per-core L1, shared L2, memory.]

6 Shared-Resource Contention
Intra-application contention: contention among threads from the same application (no co-runners)
Inter-application contention: contention among threads from co-running applications

7 Contributions
A general methodology to evaluate a multi-threaded application's performance under intra-application and inter-application contention in the memory-hierarchy shared resources
Characterizing applications facilitates better understanding of the application's resource sensitivity
Thorough performance analysis and characterization of the multi-threaded PARSEC benchmarks

8 Outline
Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

9 Methodology
Designed to measure both intra- and inter-application contention for a targeted shared resource: L1-cache, L2-cache, or Front-Side Bus (FSB)
Each application is run in two configurations:
Baseline: threads do not share the targeted resource
Contention: threads share the targeted resource
The platform provides multiple instances of the targeted resource
Contention is determined by comparing performance across the two configurations, gathered from hardware performance counters
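The baseline/contention idea above can be sketched as a core-placement helper. This is a minimal illustration, not the paper's tooling: the L2-sharing layout below (core pairs (0, 1) and (2, 3) each sharing one L2, matching the quad-core Q9550 platform described later in the deck) and all names are assumptions; on Linux the real layout can be read from /sys/devices/system/cpu/cpu*/cache.

```python
# Illustrative sketch: choose cores for the baseline vs. contention
# configuration when the targeted shared resource is the L2 cache.
# Assumed layout: core pairs (0, 1) and (2, 3) each share an L2 cache.

L2_DOMAINS = [[0, 1], [2, 3]]  # cores sharing an L2 cache (assumed)

def pick_cores(num_threads, share_l2):
    """Return the cores to pin an application's threads to.

    share_l2=False -> baseline: consecutive threads land on different L2s
    share_l2=True  -> contention: threads are packed into the same L2
    """
    if share_l2:
        # Fill one L2 domain completely before moving to the next.
        pool = [c for dom in L2_DOMAINS for c in dom]
    else:
        # Round-robin across L2 domains.
        pool = [dom[i] for i in range(len(L2_DOMAINS[0])) for dom in L2_DOMAINS]
    return pool[:num_threads]

# The chosen cores could then be applied with, e.g.,
# os.sched_setaffinity(pid, set(cores)) on Linux before reading
# hardware performance counters in each configuration.
```

For two threads this yields cores [0, 2] in the baseline configuration (distinct L2s) and [0, 1] in the contention configuration (one shared L2).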

10 Outline
Motivation
Contributions
Methodology
Measuring intra-application contention (see paper)
Measuring inter-application contention
Related Work
Summary

11 Measuring inter-application contention: L1-cache
[Diagrams: baseline configuration, where threads of Application 1 and Application 2 are placed so they do not share the targeted L1 caches; contention configuration, where they are placed so they do. Cores C0-C3, per-core L1, shared L2, memory.]

12 Measuring inter-application contention: L2-cache
[Diagrams: baseline configuration, where the two applications' threads run on cores that do not share an L2 cache; contention configuration, where they run on cores that share an L2 cache. Cores C0-C3, per-core L1, shared L2, memory.]

13 Measuring inter-application contention: FSB
Baseline configuration
[Diagram: Harpertown layout with cores C0, C2, C4, C6 on one package and C1, C3, C5, C7 on the other, each package with its own L1/L2 caches and FSB interface; each application's threads are placed on a separate package.]

14 Measuring inter-application contention: FSB
Contention configuration
[Diagram: same Harpertown layout; threads of Application 1 and Application 2 are placed so that both applications share the Front-Side Bus.]
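The FSB baseline and contention placements from the two slides above can be sketched the same way. The package layout (even-numbered cores on one die, odd-numbered on the other) follows the slide's diagram, but the placement policy and names below are assumptions for illustration only.

```python
# Illustrative sketch of FSB baseline/contention placements for two
# co-running applications on a Harpertown-like layout. Even-numbered
# cores sit on one package, odd-numbered cores on the other (assumed
# from the diagram); the exact policy is an assumption.

PACKAGES = [[0, 2, 4, 6], [1, 3, 5, 7]]

def fsb_placement(threads_per_app, share_fsb):
    """Return (app1_cores, app2_cores) for the two applications."""
    if share_fsb:
        # Contention: interleave both applications across the packages,
        # so their memory traffic shares the front-side bus paths.
        mixed = [c for pair in zip(*PACKAGES) for c in pair]
        return mixed[:threads_per_app], mixed[threads_per_app:2 * threads_per_app]
    # Baseline: each application keeps to its own package, and hence
    # its own FSB interface.
    return PACKAGES[0][:threads_per_app], PACKAGES[1][:threads_per_app]
```

With two threads per application, the baseline gives Application 1 cores [0, 2] and Application 2 cores [1, 3] (separate packages), while the contention configuration spreads both applications across the packages.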

15 PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

16 Experimental platform
Platform 1: Yorkfield (Intel Quad Core Q9550)
32 KB L1-D and L1-I cache
6 MB L2-cache
2 GB Memory
Common FSB
[Diagram: cores C0-C3, each with a private L1 cache and L1 hardware prefetcher; core pairs share an L2 cache and L2 hardware prefetcher; FSB interfaces connect over the Front-Side Bus to the Memory Controller Hub (Northbridge) and memory.]

17 Experimental platform
Platform 2: Harpertown
[Diagram: eight cores in two packages (C0, C2, C4, C6 and C1, C3, C5, C7), each core with a private L1 cache and L1 hardware prefetcher; within each package, core pairs share an L2 cache and L2 hardware prefetcher; both packages connect through FSB interfaces and the Front-Side Bus to the Memory Controller Hub (Northbridge) and memory.]

18 Performance Analysis
Inter-application contention. For the i-th co-runner:

PercentPerformanceDifference_i = (PerformanceBase_i - PerformanceContend_i) * 100 / PerformanceBase_i

Absolute performance difference sum:

APDS = Σ_i | PercentPerformanceDifference_i |
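The two metrics above transcribe directly into code. The function names below are spelled out to match the slide's formulas; they are illustrative, not taken from any released code.

```python
# Per-co-runner percent performance difference and the absolute
# performance difference sum (APDS) from the slide's formulas.

def percent_perf_diff(perf_base, perf_contend):
    """PercentPerformanceDifference_i for one co-runner."""
    return (perf_base - perf_contend) * 100.0 / perf_base

def apds(runs):
    """APDS: sum of absolute percent differences over (base, contend) pairs."""
    return sum(abs(percent_perf_diff(b, c)) for b, c in runs)

# Example: a co-runner slowing from 100 to 90 units and another speeding
# up from 200 to 210 contribute |10| + |-5| = 15 to APDS.
```

Taking the absolute value matters: speedups and slowdowns under contention would otherwise cancel, hiding sensitivity to the shared resource.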

19 Inter-application contention L1-cache – for Streamcluster 19

20 Inter-application L1-cache contention Streamcluster 20

21 Inter-application contention: L1-cache

22 Inter-application contention: L2-cache

23 Inter-application contention: FSB

24 Characterization

Benchmark       L1-cache   L2-cache       FSB
Blackscholes    none       -              -
Bodytrack       inter      -              intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter   -
Facesim         inter      -              intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter      -              intra
Raytrace        none       -              intra
Streamcluster   inter      -              intra
Swaptions       none       -              -
Vips            intra      inter          -
X264            inter      intra, inter   intra

25 Summary
The methodology generalizes contention analysis of multi-threaded applications
New approach to characterize applications
Useful for performance analysis of existing and future architectures and benchmarks
Helpful for creating new workloads with diverse properties
Provides insights for designing improved contention-aware scheduling methods

26 Related Work
Cache contention:
Knauerhase et al., IEEE Micro 2008
Zhuravlev et al., ASPLOS 2010
Xie et al., CMP-MSI 2008
Mars et al., HiPEAC 2011
Characterizing parallel workloads:
Jin et al., NASA Technical Report 2009
PARSEC benchmark suite:
Bienia et al., PACT 2008
Bhadauria et al., IISWC

27 Thank you!

