2013/06/10 Yun-Chung Yang Kandemir, M., Yemliha, T. ; Kultursay, E. Pennsylvania State Univ., University Park, PA, USA Design Automation Conference (DAC),


1 A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications
Kandemir, M.; Yemliha, T.; Kultursay, E., Pennsylvania State Univ., University Park, PA, USA
Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 954-959
2013/06/10, Yun-Chung Yang

2 Outline
- Abstract
- Related Work
- Motivation
- Difference between inter- and intra-application partitioning
- Proposed Method
- Experiment Result
- Conclusion

3 Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, we show that different threads of a multithreaded application can have different cache space requirements, propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, present a comprehensive experimental analysis of the proposed scheme, and show average improvements of 17.1% and 18.6% on the SPECOMP and PARSEC suites, respectively.

4 Related work on resource management:
- Off-chip bandwidth [3, 10, 13]
- Processor cores [6]
- Shared cache, at application granularity [5, 4, 8, 11, 12, 17, 18, 20]
- Intra-application shared cache [16]
- This paper: improves on the cache layer problem

5 Run facesim (PARSEC) and art (SPECOMP) under six schemes and record the average memory access time (AMAT):
- No-partition
- Uniform
- Nonuniform
- Nonuniform-L2
- Nonuniform-L3
- Dynamic
Dynamic, which divides the application's execution into fixed epochs, outperforms the rest.

6 Intra-application and inter-application cache partitioning differ in both objective and implementation.
- Intra-application cache partitioning tries to minimize the latency of the slowest thread. It is handled by a runtime system or dynamic compiler.
- Inter-application cache partitioning tries to optimize workload throughput. It is an OS problem.

7 Dynamic partitioning system: a helper thread whose main responsibility is to partition the cache space allocated to the application so as to maximize its performance. Its components include:
- System Interfacing
- Performance Monitoring
- Performance Modeling

8 Each OS epoch is composed of many application epochs, each of which is divided into five phases:
- Performance Monitoring
- Performance Modeling
- Resource Partitioning
- System Interfacing
- Application Execution

9 Average memory access time (AMAT) is used as the measure of a thread's cache performance. AMAT:
- Is the ratio of the total cycles spent on memory instructions to the total number of instructions
- Depends on the cache partition size
- Takes the different levels of cache into account
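The AMAT metric on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `ThreadCounters` type, its field names, and the `slowest_thread` helper are hypothetical and do not come from the paper, and the denominator follows the slide's wording.

```python
from dataclasses import dataclass


@dataclass
class ThreadCounters:
    """Hypothetical per-thread counters sampled over one application epoch."""
    memory_cycles: int  # total cycles spent on memory instructions
    instructions: int   # instruction count used as the denominator (per the slide)


def amat(c: ThreadCounters) -> float:
    """Average memory access time: memory cycles divided by instruction count."""
    if c.instructions == 0:
        return 0.0  # nothing executed in this epoch; avoid division by zero
    return c.memory_cycles / c.instructions


def slowest_thread(counters: dict) -> str:
    """Return the id of the thread with the highest AMAT."""
    return max(counters, key=lambda tid: amat(counters[tid]))


if __name__ == "__main__":
    sample = {
        "t0": ThreadCounters(memory_cycles=1200, instructions=100),
        "t1": ThreadCounters(memory_cycles=800, instructions=100),
    }
    print(slowest_thread(sample), amat(sample["t0"]))
```

The slowest thread is singled out because, as slide 6 notes, intra-application partitioning aims to reduce the latency of the slowest thread.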

10 The scheme needs to predict the impact of increasing or decreasing the cache space given to a thread.
- Each thread is represented as a 3D plot, with the X and Y axes giving the cache space allocated from L2 and L3, respectively.
- For thread i, the value at each point d(s_L2, s_L3) is used to build the dynamic model for thread i.
- Purpose: predict the performance of a thread.
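One way to picture such a per-thread model is a table of observed points that is interpolated for unseen allocations. The sketch below is an assumption-heavy illustration: the slide does not say how the model is fitted, so inverse-distance weighting is swapped in here as a stand-in, and the `ThreadModel` class and its methods are hypothetical names.

```python
class ThreadModel:
    """Hypothetical per-thread model mapping (s_L2, s_L3) to a performance value."""

    def __init__(self):
        self.points = {}  # (s_l2, s_l3) -> observed value (for example AMAT)

    def observe(self, s_l2: int, s_l3: int, value: float) -> None:
        """Record a measurement taken with the given L2/L3 way allocation."""
        self.points[(s_l2, s_l3)] = value

    def predict(self, s_l2: int, s_l3: int) -> float:
        """Estimate the value at an allocation by inverse-distance weighting.

        This interpolation is an assumption made for illustration, not the
        paper's modeling technique.
        """
        if not self.points:
            raise ValueError("no observations yet")
        if (s_l2, s_l3) in self.points:
            return self.points[(s_l2, s_l3)]
        weighted_sum = 0.0
        weight_total = 0.0
        for (x, y), value in self.points.items():
            dist_sq = (x - s_l2) ** 2 + (y - s_l3) ** 2
            weight = 1.0 / dist_sq  # dist_sq > 0 because exact hits returned above
            weighted_sum += weight * value
            weight_total += weight
        return weighted_sum / weight_total


if __name__ == "__main__":
    model = ThreadModel()
    model.observe(2, 4, 10.0)
    model.observe(4, 4, 6.0)
    print(model.predict(3, 4))
```

Each epoch adds another observed point, so the estimate for nearby allocations can improve as execution proceeds.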

11 For the $i$-th L2 cache, $q_{L2,i}$ denotes the total number of cache ways allocated to this application. These $q_{L2,i}$ ways are shared by $m_{L2,i}$ threads (indexed from 0 to $m_{L2,i}$). The number of ways allocated to the $k$-th thread is denoted $s_{L2,i}(k)$.

12 P[t] denotes the cache resources (the number of ways in L2 and L3).
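To make the partitioning step concrete, here is a minimal sketch for a single cache level. It is not the paper's algorithm: it is a simple greedy heuristic that hands each spare way to the thread currently predicted to be slowest, in the spirit of the stated goal of reducing the slowest thread's latency. The function name, the `predictors` argument, and the `min_ways` parameter are all hypothetical.

```python
def partition_ways(total_ways, predictors, min_ways=1):
    """Greedy sketch: split total_ways among threads to lower the worst latency.

    predictors maps a thread id to a callable ways -> predicted AMAT.
    Returns a dict thread id -> number of ways, summing to total_ways.
    """
    threads = list(predictors)
    if total_ways < min_ways * len(threads):
        raise ValueError("not enough ways to give every thread min_ways")
    alloc = {t: min_ways for t in threads}
    spare = total_ways - min_ways * len(threads)
    for _ in range(spare):
        # Give the next way to the thread with the highest predicted AMAT.
        slowest = max(threads, key=lambda t: predictors[t](alloc[t]))
        alloc[slowest] += 1
    return alloc


if __name__ == "__main__":
    # Toy predictors (assumed shapes): latency falls as a thread gets more ways.
    predictors = {
        "t0": lambda ways: 16.0 / ways,
        "t1": lambda ways: 4.0 / ways,
    }
    print(partition_ways(5, predictors))
```

With these toy predictors the cache-hungry thread ends up with most of the ways, which brings the two predicted latencies close together.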

13 New partition information is delivered to the OS through a system call. A new instruction is added to the ISA, with the fields COID = core ID, CLVL = cache level, CAID = cache ID, and W = 64-bit-wide way allocation.
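The four fields can be pictured as a small record. In the sketch below, the `PartitionRequest` type and the `way_mask` helper are hypothetical, and the interpretation of W as a bitmask (bit j set means way j is assigned) is an assumption, since the slide only states that W is 64 bits wide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRequest:
    """Hypothetical container for the fields named on the slide."""
    coid: int  # core ID
    clvl: int  # cache level (for example 2 or 3)
    caid: int  # cache ID
    w: int     # 64-bit way allocation; bitmask encoding is assumed here


def way_mask(ways) -> int:
    """Build a 64-bit mask with one bit set per assigned way (assumed encoding)."""
    mask = 0
    for way in ways:
        if not 0 <= way < 64:
            raise ValueError(f"way index {way} does not fit in 64 bits")
        mask |= 1 << way
    return mask


if __name__ == "__main__":
    # Assign ways 0-3 of L2 cache 0 to core 1.
    request = PartitionRequest(coid=1, clvl=2, caid=0, w=way_mask(range(4)))
    print(f"{request.w:#018x}")
```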

14 The experimental environment. Comparison with other schemes:
- Average memory access time: the main target of the performance monitoring
- Execution cycles

15 SIMICS and GEMS are used to model the multicore architecture below. SPECOMP and PARSEC applications are run, with 120 million instructions as the application epoch.

16 Run eight schemes and record the average memory access time:
- No-partition
- Uniform: ways divided as evenly as possible among the cores
- Static Best: the best static partition, found through exhaustive search
- Dynamic: the proposed method
- Dynamic-L2: partitions only L2
- Dynamic-L3: partitions only L3
- L2+L3: a separate performance model for each level
- Ideal: the optimal strategy

18 The results show that the scheme balances the data access latency of the different threads: as execution proceeds, they all converge to an AMAT of about 8 cycles.

19 Conclusion
- Intra-application cache partitioning for multithreaded applications.
- A dynamic model that can partition the cache across multiple layers.
- Average improvement of 17.1% on SPECOMP and 18.6% on PARSEC.
My comments
- The paper reminds me of the importance of software and hardware cooperation.
- Threads are a main issue in CMPs.

