International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.

1 International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai University of Minnesota – Twin Cities

2 Network-on-Chips: scalable and provides high bandwidth, but leads to latency and energy consumption. [Figure: cores (Core) attached to routers (R)]

3 Heterogeneous System: data-parallel cores alongside a superscalar core. Only some routers are fully utilized.

4 DVFS (Dynamic Voltage and Frequency Scaling) for Reducing NoC Energy: Router energy dominates. DVFS reduces router energy but adds delay. Previous work applies DVFS conservatively; more aggressive DVFS is needed.

5 Limitations of Aggressive DVFS: DVFS (Dynamic Voltage and Frequency Scaling) reduces energy, but aggressive DVFS increases latency and reduces throughput, so it works only for limited traffic patterns. [Figure: workloads arranged by latency sensitivity (sensitive/insensitive) and throughput (high/low), with regions labeled "Our Previous Work*", "This Work", and "Low Contention"] * Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED 2011

6 Flexible-Pipeline Routers: the flexible pipeline reduces router pipeline delay. [Figure: router pipeline timing; labels: Frequency = 0.5F, T]

7 Exploiting DVFS Opportunity. [Figure: (a) minimal path routing, path 1 from Src1 to Dest1; (b) non-minimal path routing, path 1' from Src1 to Dest1; routers marked as high, mid, or low utilization]

8 Exploiting DVFS Opportunity (cont.). Dynamic energy: $E_{Dyn} \propto V_{dd}^2$. Static energy: $E_{Sta} \propto V_{dd}$. Clock energy: $E_{Clk} \propto Freq \cdot V_{dd}^2$. [Table: columns Router Speed (High, Mid, Low), DVFS Parameters (Freq in GHz, $V_{dd}$ in V), Normalized Energy] Operating at mid frequency gets the most benefit.
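The three proportionalities on this slide can be combined into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the operating points and the energy breakdown weights are hypothetical placeholders, not values from the talk, and `OperatingPoint` and `normalized_energy` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    freq_ghz: float  # router clock frequency in GHz
    vdd: float       # supply voltage in volts


def normalized_energy(op, ref, w_dyn, w_sta, w_clk):
    """Energy of `op` relative to `ref`.

    Applies E_Dyn ∝ Vdd^2, E_Sta ∝ Vdd, E_Clk ∝ Freq * Vdd^2.
    The weights are the (assumed) share of each component at `ref`
    and must sum to 1.
    """
    if abs(w_dyn + w_sta + w_clk - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    v = op.vdd / ref.vdd            # voltage ratio
    f = op.freq_ghz / ref.freq_ghz  # frequency ratio
    return w_dyn * v**2 + w_sta * v + w_clk * f * v**2


# Hypothetical operating points and breakdown, for illustration only.
HIGH = OperatingPoint(freq_ghz=2.0, vdd=1.0)
MID = OperatingPoint(freq_ghz=1.0, vdd=0.8)
WEIGHTS = dict(w_dyn=0.5, w_sta=0.3, w_clk=0.2)

mid_vs_high = normalized_energy(MID, HIGH, **WEIGHTS)
```

With these made-up numbers, most of the saving comes from the voltage drop, since two of the three components scale with the square of $V_{dd}$.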

9 Exploiting DVFS Opportunity (cont.). [Figure: (a) minimal path routing, path 1 from Src1 to Dest1; (b) non-minimal path routing, path 1' from Src1 to Dest1; routers marked as running at 100%, 50%, or 25% frequency] Factors considered: 1. Performance 2. Dynamic energy 3. Static energy. More benefit with a bigger network.

10 Outline: Introduction; Non-minimal path selection (issue, solution, challenges); Infrastructure (CPU+GPU); Results; Conclusion

11 Non-minimal Path Routing. [Figure: (a) minimal path routing and (b) non-minimal path routing from Src to Dest; routers marked as high, mid, or low utilization]

12 Too Close! [Figure: (a) minimal path routing and (b) non-minimal path routing from Src to Dest; routers marked as high, mid, or low utilization; labels: Performance, Static Energy, Dynamic Energy]

13 Too Aggressive! [Figure: non-minimal path routing from Src1 to Dest1; routers marked as high, mid, or low utilization; labels: Static Energy, Dynamic Energy]

14 Dynamic Network Tuning. Router: Initial State, Utilization Monitor, V/F Scaling. Packet: [Flowchart: Input; decision "Slack == 1"; decision "D_x >= 3 || D_y >= 3"; Y/N branches; outcomes "Min. path port" and "Least busy port"; "Slack = 0"; Output] Busy information propagation. How to determine slack?
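The edges of the packet flowchart on this slide are not recoverable from the transcript, so the following is only one plausible reading, stated as an assumption: a packet takes the least busy port only if it still has slack and its remaining distance in X or Y is at least 3 hops, and its slack is cleared once it detours. The `Packet` class, `select_output_port`, and the busy-metric dictionary are hypothetical names for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    slack: int  # 1 if the packet may still be delayed, 0 otherwise
    dx: int     # remaining hops to the destination in X (absolute value)
    dy: int     # remaining hops to the destination in Y (absolute value)


def select_output_port(packet, min_path_port, port_busy):
    """Pick an output port for `packet`.

    port_busy maps candidate port names to a busy metric (lower is less busy).
    Assumed reading of the flowchart, not confirmed by the source:
    detour only when Slack == 1 and (D_x >= 3 or D_y >= 3).
    """
    far_enough = packet.dx >= 3 or packet.dy >= 3
    if packet.slack == 1 and far_enough:
        packet.slack = 0  # assumption: the detour consumes the packet's slack
        return min(port_busy, key=port_busy.get)
    return min_path_port


busy = {"east": 0.9, "north": 0.2, "south": 0.5}
pkt = Packet(slack=1, dx=4, dy=0)
first_hop = select_output_port(pkt, "east", busy)   # detours to the least busy port
second_hop = select_output_port(pkt, "east", busy)  # slack is spent, so minimal path
```

Under this reading, the distance check addresses the "Too Close!" case and clearing the slack addresses the "Too Aggressive!" case. The sketch ignores deadlock and livelock avoidance, which a real router would also need to handle.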

15 Busy Information Propagation. Busy metrics: buffer utilization, crossbar utilization, router utilization. Propagation: regional congestion awareness [Grot et al. HPCA08]

16 Regional Congestion Awareness: local data collection; propagation to neighboring routers; aggregation of local and non-local data
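The collect, propagate, and aggregate steps can be illustrated along a single row of routers. This is a minimal sketch under stated assumptions: each router blends its own busy metric with the value received from its eastern neighbor using equal weights, and only one direction of one dimension is modeled. The function name and the weighting parameter are hypothetical.

```python
def aggregate_busy(local, w_local=0.5):
    """Aggregate busy metrics along one row of routers, ordered west to east.

    local[i] is router i's locally collected busy metric.
    Returns, for each router, the value it would propagate to its western
    neighbor: a blend of its local metric and the value received from the east.
    The equal weighting (w_local=0.5) is an assumption of this sketch.
    """
    n = len(local)
    aggregated = [0.0] * n
    for i in range(n - 1, -1, -1):
        if i == n - 1:
            # The easternmost router has no non-local data to fold in.
            aggregated[i] = local[i]
        else:
            aggregated[i] = w_local * local[i] + (1.0 - w_local) * aggregated[i + 1]
    return aggregated


row = aggregate_busy([0.2, 0.4, 0.8])
```

Because each value already contains a share of everything further east, distant congestion is still visible to a router but with geometrically decreasing weight.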

17 Slack in Applications. Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time. [Figure: timeline of Thread 0 through Thread n; events: Thread 0 read miss, Thread 0 ready, Thread 0 scheduled] CPU: packets may have slack, but assume no slack. GPU: slack is based on the number of threads.
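A binary slack flag of the kind tested by "Slack == 1" in the routing decision could be derived as follows. This is a hedged sketch, not the method from the talk: the threshold `min_ready_warps` and the function name are hypothetical, and the idea is simply that a GPU core with enough other ready warps can hide the extra delay of a detoured packet, while CPU packets are assumed to have no slack.

```python
def estimate_slack(is_gpu, ready_warps=0, min_ready_warps=4):
    """Return 1 if a packet is assumed to tolerate extra delay, else 0.

    CPU packets: always 0, following the slide's "assume no slack".
    GPU packets: 1 when at least `min_ready_warps` other warps are ready
    to run (a hypothetical threshold chosen for illustration).
    """
    if not is_gpu:
        return 0
    return 1 if ready_warps >= min_ready_warps else 0


cpu_slack = estimate_slack(is_gpu=False, ready_warps=10)
gpu_busy_core = estimate_slack(is_gpu=True, ready_warps=6)
gpu_starved_core = estimate_slack(is_gpu=True, ready_warps=2)
```

A real predictor would presumably need the threshold tuned per workload; a fixed constant is used here only to keep the example short.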

18 Tile-Based Multicore System. Each tile has a router (R) attached to a CPU core, GPU SM, L2 cache, or MC. [Figure: tile layout; tile labels: C, G, L2, M, MEM]

19 Benchmarks. CPU: afi, ammp, art, equake, kmeans, scalparc. GPU: blackscholes, lps, lib, nn, bfs. All 30 CPU+GPU combinations are evaluated. For presentation purposes, the benchmarks are classified as follows. CPU: 1) memory-bound, 2) computation-bound (based on L1 cache miss rate). GPU: 1) latency-tolerant, 2) latency-intolerant (based on slack cycles).

20 Benchmark Categorization: (I) memory-bound CPU + latency-tolerant GPU; (II) computation-bound CPU + latency-tolerant GPU; (III) memory-bound CPU + latency-intolerant GPU; (IV) computation-bound CPU + latency-intolerant GPU. [Figure: categories arranged by latency sensitivity (sensitive/insensitive) and throughput (high/low)]

21 Network Energy Saving. [Figure: results for (I) memory-bound CPU + latency-tolerant GPU, (II) computation-bound CPU + latency-tolerant GPU, (III) memory-bound CPU + latency-intolerant GPU, (IV) computation-bound CPU + latency-intolerant GPU] Energy saving is significant on certain workloads.

22 Performance Impact (CPU). [Figure: results for (I) memory-bound CPU + latency-tolerant GPU, (II) computation-bound CPU + latency-tolerant GPU, (III) memory-bound CPU + latency-intolerant GPU, (IV) computation-bound CPU + latency-intolerant GPU]

23 Performance Impact (GPU). [Figure: results for (I) memory-bound CPU + latency-tolerant GPU, (II) computation-bound CPU + latency-tolerant GPU, (III) memory-bound CPU + latency-intolerant GPU, (IV) computation-bound CPU + latency-intolerant GPU] Performance penalty is minimal compared to DVFS.

24 Conclusion. Non-minimal path NoC: + balances on-chip workloads; + reduces NoC energy. [Figure: workload mix arranged by latency sensitivity (sensitive/insensitive) and throughput (high/low); labels: High throughput, Latency insensitive] Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously.

25 Thank You!

26 Exploiting Slack in GPU

27 Exploiting Slack in GPU: predict slack based on the number of available warps

