An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors. Guilin Chen and Mahmut Kandemir, The Pennsylvania State University, USA.

1 An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors. Guilin Chen and Mahmut Kandemir, The Pennsylvania State University, USA. Speaker: Zong-Cing Lin

2 Outline: Introduction; Related Work; Mathematical Theory; Experimental Results; Conclusion. PAS lab@CSIE,NTU

3 Introduction In chip multiprocessors (CMPs), several cores need to access the off-chip memory system, so they contend for the same buses and pins. This paper discusses and evaluates a new data reuse framework, specifically customized for embedded CMPs executing loop-intensive stencil applications. It distinguishes between intra-processor and inter-processor data reuse.

4 Introduction (cont’d) This paper targets: embedded CMPs; CMPs in which different on-chip processors can share data through an on-chip L2 cache; and the optimization of stencil computations.

5 Related Work Optimization of stencil computations by customized compilers. Issues concerning the CMP memory bandwidth bottleneck.

6 Stencil Computation A common type of computation in embedded array-based application codes. In each iteration of a stencil computation, an array element is updated based on the values of its neighboring elements.
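As a rough illustration of the definition above, the following sketch performs one sweep of a 4-point averaging stencil over a 2D grid. The function name `jacobi_step` and the choice of averaging the four neighbors are assumptions made for this example; the slides do not say which stencils the paper uses.

```python
def jacobi_step(grid):
    """Return a new grid in which every interior element is the
    average of its four neighbors (north, south, west, east).
    Boundary elements are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads see only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i - 1][j] + grid[i + 1][j] +
                         grid[i][j - 1] + grid[i][j + 1]) / 4.0
    return new


# Hypothetical 3x3 input: a cold center surrounded by a warm boundary.
grid = [[4.0, 4.0, 4.0],
        [4.0, 0.0, 4.0],
        [4.0, 4.0, 4.0]]
result = jacobi_step(grid)
```

Note that neighboring iterations read overlapping elements, which is the source of the data reuse the rest of the talk is concerned with.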

7 Data Sharing vs. Data Reuse

8 Some Mathematical Representation f : I → A, f(I) = FI + ζ, where F is an n × l matrix and ζ is an n-dimensional constant vector. Linear loop transformations can be used to optimize a loop nest.
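The reference function above can be tried out on a small case. The sketch below assumes the usual reading of the notation: I is the iteration vector of an l-deep loop nest and f(I) is the index of the n-dimensional array element it touches. The reference A[i + j][j + 1], the helper names, and the choice of loop interchange as the transformation T are all illustrative assumptions, not taken from the paper. It also relies on the standard fact that, after a nonsingular linear transformation I' = T·I, the access matrix becomes F·T⁻¹.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def reference(F, zeta, I):
    """Evaluate f(I) = F*I + zeta for iteration vector I."""
    return [x + z for x, z in zip(mat_vec(F, I), zeta)]


# Hypothetical reference A[i + j][j + 1] in a 2-deep nest over (i, j).
F = [[1, 1],
     [0, 1]]
zeta = [0, 1]

# Loop interchange as an example linear transformation; it is its own inverse.
T = [[0, 1],
     [1, 0]]
T_inv = T

I = [2, 3]                      # original iteration (i, j)
I_new = mat_vec(T, I)           # same iteration in the transformed nest
F_new = mat_mul(F, T_inv)       # access matrix after the transformation

original_index = reference(F, zeta, I)
transformed_index = reference(F_new, zeta, I_new)
```

Both evaluations name the same array element: the transformation changes the order in which iterations run, not which data each iteration touches.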

9 The Algorithm for Solving a Set of Equations. For V and W processor pairs that share data, the complexity of this algorithm is WY

10 Two Important Lemmas Lemma 1: If processor P2 exhibits self-reuse after loop transformation T2, then processor P1 also exhibits self-reuse after loop transformation T1. This keeps the original intra-processor data self-reuse pattern. Lemma 2: If the last column of F has only one non-zero entry and processor P2 preserves group-reuse after loop transformation T2, then processor P1 also preserves group-reuse after loop transformation T1. This keeps the original intra-processor data group-reuse pattern in most cases.

11 Experimental Environment Simulation: Simics tool-set. Private L1 cache. Shared L2 cache.

12 Stencil Applications Used in Experiments

13 Stencil Applications Used in Experiments (cont’d)

14 Savings in the Off-chip Memory References and Execution Cycles

15 Reduction in Execution Cycles with Different Processor Counts

16 Conclusion Minimizing the number of off-chip memory references is very important in embedded chip multiprocessors from both a performance perspective and a power perspective. This paper proposes and evaluates a compiler-based solution for stencil computations that re-organizes the loop iterations assigned to processors in a coordinated fashion so that the reuse distance to shared data is minimized.
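To make the idea of coordinated iteration ordering concrete, here is a toy sketch. It is not the paper's algorithm. It assumes a 1D three-point stencil split into two contiguous blocks, one per processor, with each processor executing one iteration per time step. The helper names and the "largest gap between the two processors' nearest accesses to a shared element" metric are assumptions chosen for illustration.

```python
def access_times(iterations):
    """Map each array index to the time steps at which it is touched.
    Iteration i of the stencil touches a[i-1], a[i], and a[i+1]."""
    times = {}
    for step, i in enumerate(iterations):
        for idx in (i - 1, i, i + 1):
            times.setdefault(idx, []).append(step)
    return times


def max_shared_gap(order_p0, order_p1):
    """For every element touched by both processors, take the smallest
    time gap between their accesses; return the largest such gap."""
    t0, t1 = access_times(order_p0), access_times(order_p1)
    shared = set(t0) & set(t1)
    return max(min(abs(a - b) for a in t0[x] for b in t1[x])
               for x in shared)


N = 16
p0 = list(range(1, N // 2))        # iterations 1..7 on processor 0
p1 = list(range(N // 2, N - 1))    # iterations 8..14 on processor 1

# Both processors sweep upward: P0 reaches the block boundary last,
# while P1 touches it first.
baseline = max_shared_gap(p0, p1)

# Reversing P0's loop makes both processors start at the boundary.
coordinated = max_shared_gap(p0[::-1], p1)
```

In this toy setting the baseline gap is 5 time steps and the coordinated gap is 0, so shared elements are touched by both processors at nearly the same time and are more likely to still be in the shared L2 cache.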

17 Any Questions?

