ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, Pages 408-415 Tianyi Wang, Gang Quan, Shangping.

1 ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, Pages 408-415 Tianyi Wang, Gang Quan, Shangping Ren, Meikang Qiu 曾冠維 2013.09.05

2 • Introduction • Preliminary • Performance evaluation • Experimental results • Conclusions

4 • IC chip performance variation can cause significant discrepancies. • One major problem caused by manufacturing variations is reduced fabrication yield.

5 • Therefore, micro-architecture-level and core-level redundancies are employed to improve the fabrication yield. • According to "Exploiting micro-architectural redundancy for defect tolerance", core-level redundancy achieves better yield.

6 • Another problem caused by manufacturing variations is performance variation.

7 • How can the total schedule length of a task graph be reduced when realizing its nominal design? • By developing a performance metric based on the opportunity cost.

9 • Introduction • Preliminary • Performance evaluation • Experimental results • Conclusions

10 Use the Row Rippling Column Stealing (RRCS) algorithm to replace faulty cores with redundant cores.

11 • Task graph G = {V, E}, with V = {v1, v2, ..., vk} and E = {e(i, j) = (vi, vj) | task node vi communicates with task node vj}. |vi| represents the execution time of task node vi. • The logical architecture is assumed to consist of r × c cores, indexed by i = 0,...,r − 1 and j = 0,...,c − 1.

12 • The nominal design of application G on the logical architecture is denoted N(G, ·), where the second argument is the logical architecture. • The physical architecture is assumed to consist of m × n cores, indexed by i = 0,...,m − 1 and j = 0,...,n − 1.

13 • Problem: Given an application G, a logical architecture, the nominal design N(G, ·) of G on that logical architecture, and the physical architecture:

14 • Find the mapping M that assigns each logical core (i, j), with i = 0,...,r − 1 and j = 0,...,c − 1, to a physical core (x, y), with 0 ≤ x ≤ m − 1 and 0 ≤ y ≤ n − 1, such that the maximum latency to execute G based on N(G, ·) is minimized.
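The latency in this problem statement can be read as the length of the longest (critical) path through the task graph. The sketch below is a minimal illustration under assumptions that are not in the slides: edges are treated as directed dependencies, communication cost is ignored, and a task running on a core with relative speed s takes its nominal time divided by s. The names `latency`, `exec_time`, `edges` and `speed` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def latency(exec_time, edges, speed=None):
    """Length of the longest path through a task graph.

    exec_time: dict task -> nominal execution time |v_i|
    edges:     iterable of (u, v) pairs; u must finish before v starts
               (assumption: communication edges are directed dependencies)
    speed:     optional dict task -> relative speed of the core running it
               (assumption: actual time = nominal time / speed)
    """
    preds = {v: set() for v in exec_time}
    for u, v in edges:
        preds[v].add(u)

    finish = {}
    # static_order() yields every task after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds[v]), default=0.0)
        factor = 1.0 if speed is None else speed.get(v, 1.0)
        finish[v] = start + exec_time[v] / factor
    return max(finish.values())


# Diamond-shaped graph: a -> {b, c} -> d
times = {"a": 2, "b": 3, "c": 4, "d": 1}
deps = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
nominal = latency(times, deps)
faster_c = latency(times, deps, speed={"c": 2.0})
```

Speeding up only task c moves the critical path to a, b, d, which is why the largest workload is not always the one worth accelerating.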

15 • Introduction • NoC virtualization • Performance evaluation • Experimental results • Conclusions

16 • 1. A simple workload/performance matching heuristic. • 2. Opportunity-cost-based workload/performance mapping. • 3. Logical/physical topology mapping with communication awareness.

17 Time complexity =

18 • While Algorithm 1 is fast and intuitive, it has several issues. • Problem 1: Larger workloads do not necessarily lie on the critical path. • Problem 2: It does not take the location of the workloads into consideration.
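The algorithm listing itself is not part of this transcript, so the following is only a sketch of what a workload/performance matching heuristic could look like: logical cores are sorted by assigned workload, physical cores by speed, and the two sorted lists are paired off. The function name and the dictionary layout are assumptions made for illustration.

```python
def match_by_workload(workload, speed):
    """Pair the heaviest logical core with the fastest physical core, and so on.

    workload: dict logical_core -> total execution time assigned to it
    speed:    dict physical_core -> relative speed (higher is faster)
    Returns a dict logical_core -> physical_core; surplus physical cores
    (for example redundant ones) are simply left unused.
    """
    if len(workload) > len(speed):
        raise ValueError("not enough physical cores")
    logical = sorted(workload, key=workload.get, reverse=True)
    physical = sorted(speed, key=speed.get, reverse=True)
    return dict(zip(logical, physical))


# 3 logical cores mapped onto a 2 x 2 physical grid
workload = {(0, 0): 10, (0, 1): 30, (1, 0): 20}
speed = {(0, 0): 1.0, (0, 1): 0.9, (1, 0): 1.2, (1, 1): 1.1}
assignment = match_by_workload(workload, speed)
```

The sketch looks only at workload totals and speeds, so it shows both problems listed on the slide: it never consults the critical path or the position of a core.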

19 • The opportunity cost is the cost of any activity measured in terms of the value of the next best alternative forgone (that is not chosen). • It is the sacrifice related to the second best choice available to someone, or a group, who has picked among several mutually exclusive choices.

22 • Consider a decision that maps one logical core to one physical core. • The task graph latency under this mapping is 51.67. • Since the latency of the nominal design is 55, the profit of the decision is defined as 55 − 51.67 = 3.33. • Among the remaining alternatives, the best choice gives a latency of 53.18, so its profit is 55 − 53.18 = 1.82.

23 • Definition 1: For a mapping decision, the performance of the decision is its profit minus its opportunity cost. • For the example decision, the performance is 3.33 − 1.82 = 1.51.
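The arithmetic of this example can be checked with a few lines of code. This is a small sketch using the numbers from the slides; the function and parameter names are hypothetical.

```python
def decision_performance(nominal_latency, chosen_latency, best_alternative_latency):
    """Performance of a mapping decision: profit minus opportunity cost."""
    # Profit: how much the chosen mapping shortens the nominal latency.
    profit = nominal_latency - chosen_latency
    # Opportunity cost: the profit of the best alternative that was not chosen.
    opportunity_cost = nominal_latency - best_alternative_latency
    return profit - opportunity_cost


# Values from the example: nominal 55, chosen 51.67, best alternative 53.18
score = decision_performance(55, 51.67, 53.18)
print(round(score, 2))
```

Because the nominal latency cancels out, the score equals the latency of the best alternative minus the latency of the chosen mapping.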

24 • For the example, the four candidate decisions have performance values of 1.51, 0, 1.9, and 0.76. • According to Definition 1, mapping the logical core with the largest workload assignment to the fastest core does not reduce the critical path latency and thus has the lowest performance.

25 • In the worst case, the complexity of the while loop is O(kmn), since m × n different mappings need to be checked, where k is the number of task nodes. • The while loop is executed r × c times. • Therefore, the overall complexity of Algorithm 2 is O(krcmn).
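The loop structure behind this bound can be sketched as follows. This is not the listing of Algorithm 2, which the transcript does not contain. It is a hypothetical greedy loop that visits each logical core once (r × c iterations), tries every free physical core (at most m × n), and calls a caller-supplied `latency_of` function once per trial (assumed to cost O(k)). The score recorded for each decision is its profit minus the profit of the next best alternative.

```python
def greedy_map(logical_cores, physical_cores, latency_of, nominal_latency):
    """Greedy mapping with an (r*c) x (m*n) loop structure.

    latency_of(mapping) -> latency for a partial mapping
                           {logical_core: physical_core} (caller-supplied).
    Returns (mapping, scores); scores[l] = profit - opportunity cost.
    """
    if len(physical_cores) < len(logical_cores):
        raise ValueError("not enough physical cores")
    mapping, scores = {}, {}
    free = list(physical_cores)
    for l in logical_cores:                  # executed r*c times
        best = second = None                 # (profit, physical core)
        for p in free:                       # at most m*n candidates
            mapping[l] = p                   # tentative decision
            profit = nominal_latency - latency_of(mapping)
            if best is None or profit > best[0]:
                best, second = (profit, p), best
            elif second is None or profit > second[0]:
                second = (profit, p)
        # Opportunity cost: profit of the next best alternative (0 if none).
        opportunity_cost = second[0] if second is not None else 0.0
        mapping[l] = best[1]
        scores[l] = best[0] - opportunity_cost
        free.remove(best[1])
    return mapping, scores


# Toy latency model (assumption): slowest core decides, unmapped cores run at speed 1.
work = {"l0": 30.0, "l1": 20.0}
spd = {"p0": 1.0, "p1": 2.0, "p2": 1.5}

def toy_latency(m):
    return max(work[l] / spd[m[l]] if l in m else work[l] for l in work)

mapping, scores = greedy_map(list(work), list(spd), toy_latency, 30.0)
```

In the toy run, two physical cores give the same profit for the first logical core, so its score is zero. The decision there costs nothing relative to the next best alternative.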

26 • Neither Algorithm 1 nor Algorithm 2 takes the communication cost into consideration. • When the communication cost becomes significant, especially on many-core platforms, the quality of the mapping results from Algorithm 1 and Algorithm 2 can be severely compromised. • We propose an iterative algorithm (shown in Algorithm 3) that improves existing mapping results by taking communication into consideration.

28 • When calculating the latency of the task graph, the communication cost can be incorporated into the performance of a mapping decision. • Algorithm 3 iteratively improves the mapping solution until the user-defined improvement threshold (ε) is satisfied.
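A threshold-controlled refinement loop of this kind could look like the sketch below. The move set (swapping the physical cores of two logical cores) is an assumption chosen for illustration and is not taken from the slides; `latency_of` is a caller-supplied function that is expected to include communication cost, and `epsilon` plays the role of the user-defined threshold ε.

```python
from itertools import combinations


def refine(mapping, latency_of, epsilon=1e-3, max_iterations=200):
    """Iteratively improve a mapping {logical_core: physical_core}.

    Each round applies the single best pairwise swap and stops when the
    latency gain of a round drops below epsilon or no swap helps.
    """
    current = dict(mapping)
    current_latency = latency_of(current)
    for _ in range(max_iterations):
        best_latency, best_pair = current_latency, None
        for a, b in combinations(current, 2):
            current[a], current[b] = current[b], current[a]   # try the swap
            trial = latency_of(current)
            current[a], current[b] = current[b], current[a]   # undo it
            if trial < best_latency:
                best_latency, best_pair = trial, (a, b)
        if best_pair is None or current_latency - best_latency < epsilon:
            break
        a, b = best_pair
        current[a], current[b] = current[b], current[a]       # keep best swap
        current_latency = best_latency
    return current, current_latency


# Toy latency model (assumption): the slowest core decides the latency.
work = {"l0": 30.0, "l1": 20.0}
spd = {"p0": 1.0, "p1": 2.0}
start = {"l0": "p0", "l1": "p1"}
improved, improved_latency = refine(start, lambda m: max(work[l] / spd[m[l]] for l in m))
```

Only cores that are already in the mapping are swapped here, so unused physical cores never enter the solution. A fuller version would also consider moves onto free cores.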

29 • Introduction • NoC virtualization • Performance evaluation • Experimental results • Conclusions

30 • TGFF is used to randomly generate task graphs (60 nodes). • The communication cost of each edge and the execution time of each task are randomly generated. • The P&C_OC algorithm is assumed to stop after 200 iterations. • Experiments were run on a Windows XP SP3 platform with an Intel(R) Core(TM)2 Duo CPU @ 2.93 GHz and 3.21 GB of RAM.

31 • SWPM denotes Algorithm 1, • P_Only_OC denotes Algorithm 2, • P&C_OC denotes Algorithm 3. • These are also compared with two previous approaches: the RRCS algorithm and the Hungarian algorithm.

33 • Performance under different communication/execution (C/E) ratios. • The communication cost of each edge is generated within the interval [a, b]. • The execution time of each task node is generated within the interval [c, d]. • C/E ratio =

37 • Introduction • NoC virtualization • Performance evaluation • Experimental results • Conclusions

38 • A framework is introduced to maximize the performance of the nominal design. • The heuristics are based on the concept of opportunity cost. • The proposed approach achieves up to 30% performance improvement, and 15% on average.
