A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
Asit K. Mishra, Chita R. Das, Onur Mutlu



Executive summary
Problem: Current-day NoC designs are agnostic to application requirements and are provisioned for the general case or the worst case, yet applications have widely differing demands from the network.
Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications.
Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive.
- Not all applications are equally sensitive to bandwidth and latency
Key idea: Design two NoCs
- Each sub-network is customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
Network design: The BW optimized network has wider links but operates at a lower frequency; the LAT optimized network has narrower links but operates at a higher frequency.
Results: Our proposal is significantly better when compared to an iso-resource monolithic network (5%/3% weighted/instruction throughput improvement and 31% energy reduction).
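The results above are reported as weighted and instruction throughput. The slides do not define these metrics, so the sketch below assumes the common definitions: weighted speedup is the sum over applications of IPC when sharing the chip divided by IPC when running alone, and instruction throughput is the sum of the shared IPCs. The function names and all numbers are hypothetical.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum of each application's shared IPC normalized to its alone IPC (assumed definition)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC value per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def instruction_throughput(ipc_shared: Sequence[float]) -> float:
    """Sum of the IPCs of all applications running together (assumed definition)."""
    return sum(ipc_shared)


def improvement_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical two-application workload, not taken from the slides.
print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
print(instruction_throughput([1.0, 0.5]))
```

Comparing two network designs then amounts to calling `improvement_pct` on the metric values measured for each design.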

Resource requirements of various applications - I
Impact of channel bandwidth on application performance
Channel bandwidth affects network latency, throughput and energy/power.
Increase in channel BW leads to
- Reduction in packet serialization latency
- Increase in router power
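The serialization effect can be illustrated with a short calculation. This is a minimal sketch, assuming a packet is split into link-width chunks (flits) that cross a link one per cycle; the 512-bit packet size is a hypothetical value, not one stated in the slides.

```python
def flits_per_packet(packet_bits: int, link_width_bits: int) -> int:
    """Number of link-width chunks (flits) needed to carry one packet."""
    if packet_bits <= 0 or link_width_bits <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: a partially filled last flit still costs a cycle.
    return (packet_bits + link_width_bits - 1) // link_width_bits


# Hypothetical 512-bit packet (for example a 64-byte cache line, headers ignored).
for width in (64, 128, 256, 512):
    print(f"{width}b link: {flits_per_packet(512, width)} flits")
```

Under this model, widening the link by 8x cuts the flits per packet from 8 to 1, which is the reduction in serialization the slide refers to.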

Simulation settings:
- 8x8 multi-hop packet-based mesh network
- Each node in the network has an OoO processor (2GHz), private L1 cache and a router (2GHz)
- Shared L2, 1MB per core
- 6VC/PC, 2-stage router


1. 18/30 (21/36 total) applications’ performance is agnostic to channel BW (8x BW increase → less than 2x performance increase)
2. 12/30 (15/36 total) applications’ performance scales with increase in channel BW (8x BW increase → at least 2x performance increase)
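The 2x cut-off used in these two observations can be written as a small predicate. This is only a sketch of the stated rule: the function name, the labels, and the performance numbers are hypothetical, and the performance metric is left abstract because the slides do not name it here.

```python
def bandwidth_class(perf_narrow: float, perf_wide: float, threshold: float = 2.0) -> str:
    """Label an application by its speedup when channel bandwidth grows 8x.

    `perf_narrow` and `perf_wide` are the same performance metric measured
    at the narrow and at the 8x wider channel, respectively.
    """
    speedup = perf_wide / perf_narrow
    # "at least 2x" scales with BW; "less than 2x" is treated as agnostic.
    return "scales-with-BW" if speedup >= threshold else "BW-agnostic"


print(bandwidth_class(1.0, 1.3))
print(bandwidth_class(1.0, 2.6))
```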

Resource requirements of various applications - II
Impact of network latency on application performance
Reduction in router latency (by increasing frequency) leads to
- Reduction in packet latency
- Increase in router power consumption

Simulation settings:
- … same as last experiment
- 128b links
- Added dummy stages (2-cycle and 4-cycle) to each router
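To see how the dummy stages change packet latency, a simplified zero-load model helps. The sketch assumes the head flit pays the router pipeline plus one link cycle at every hop and the remaining flits follow back to back. The hop count and flit count are hypothetical, and contention is ignored.

```python
def zero_load_latency(hops: int, router_stages: int, flits: int, link_cycles: int = 1) -> int:
    """Cycles for one packet to cross an idle network (simplified model)."""
    head = hops * (router_stages + link_cycles)  # head flit traverses every hop
    serialization = flits - 1                    # remaining flits follow back to back
    return head + serialization


BASE_STAGES = 2  # 2-stage router from the earlier simulation settings
for dummy in (0, 2, 4):
    stages = BASE_STAGES + dummy
    # hops=5 and flits=4 are illustrative values, not from the slides.
    print(f"{stages}-stage router: {zero_load_latency(5, stages, 4)} cycles")
```

With the 2-cycle and 4-cycle dummy stages the router pipeline grows from 2 to 4 and 6 stages, so the per-router latency spans a 3x range.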


1. 18/30 (21/36 total) applications’ performance is sensitive to network latency (3x latency reduction → at least 25% performance improvement)
2. 12/30 (15/36 total) applications’ performance is marginally sensitive to network latency (3x latency reduction → less than 15% performance improvement)

Application-aware approach to designing multiple NoCs

Based on the observations:
1. Applications can be classified into distinct classes: typically LAT/BW sensitive
2. LAT sensitive applications can benefit from low network latency
3. BW sensitive applications can benefit from high network bandwidth
4. Not all applications are equally sensitive to either LAT or BW
5. Monolithic network cannot optimize both classes simultaneously

Solution: Two NoCs where each (sub)network is optimized for either LAT or BW sensitive applications

Design methodology
[Figure: Logical view of a multicore processor: Processors/L1$, Network, L2$ and Mem. Controllers]

1. Identify LAT/BW sensitive applications
   - Proposes a novel dynamic application classification scheme

2. Design sub-networks based on applications’ demand
   - This network architecture is better than a monolithic iso-resource network


Design: Dynamic classification of applications
[Figure: outstanding network packets over time across the application life cycle, alternating between network episodes and compute episodes]


Network episode: App. has at least one outstanding packet; the processor is likely stalling → low IPC
Compute episode: App. has no outstanding packet → high IPC
Episode length = Number of consecutive cycles during which there are outstanding network packets
Episode height = Avg. number of L1 packets injected during an episode
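The two episode metrics can be computed from a per-cycle trace. The sketch below is one possible reading of the definitions, not the authors' hardware mechanism: it assumes the trace gives, for each cycle, the number of outstanding packets (counted after any injection in that cycle) and the number of packets injected, and it takes an episode's height to be the packets injected during that episode. All names are hypothetical.

```python
from typing import Iterable, List, Tuple

Episode = Tuple[int, int]  # (length in cycles, packets injected)


def measure_episodes(trace: Iterable[Tuple[int, int]]) -> List[Episode]:
    """Split a per-cycle (outstanding, injected) trace into network episodes."""
    episodes: List[Episode] = []
    length = 0
    injected_total = 0
    for outstanding, injected in trace:
        if outstanding > 0:          # still inside a network episode
            length += 1
            injected_total += injected
        elif length > 0:             # a compute episode begins: close the episode
            episodes.append((length, injected_total))
            length = 0
            injected_total = 0
    if length > 0:                   # trace ended in the middle of an episode
        episodes.append((length, injected_total))
    return episodes


def average_length_and_height(episodes: List[Episode]) -> Tuple[float, float]:
    """Average episode length and height over all measured episodes."""
    if not episodes:
        return (0.0, 0.0)
    n = len(episodes)
    avg_length = sum(length for length, _ in episodes) / n
    avg_height = sum(injected for _, injected in episodes) / n
    return (avg_length, avg_height)


trace = [(0, 0), (1, 1), (2, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 0)]
print(average_length_and_height(measure_episodes(trace)))
```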

- Short episode ht.: Low MLP, each request is critical (LAT sensitive)
- Tall episode ht.: High MLP (BW sensitive)
- Short episode len.: Packets are very critical (LAT sensitive)
- Long episode len.: Latency tolerant (could be de-prioritized)
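Binning an application by these two metrics might look like the sketch below. The cut-off values are parameters because the slides do not state them in a usable form, so every threshold passed in here is hypothetical, as are the function names. The hint strings restate the four interpretations listed above.

```python
from typing import List, Tuple


def bucket(value: float, cuts: Tuple[float, float], labels: Tuple[str, str, str]) -> str:
    """Map a value to one of three labels using a (low, high) pair of cut-offs."""
    low, high = cuts
    if value < low:
        return labels[0]
    if value < high:
        return labels[1]
    return labels[2]


def classify_episode(avg_length: float, avg_height: float,
                     length_cuts: Tuple[float, float],
                     height_cuts: Tuple[float, float]) -> Tuple[str, str]:
    """Return (height bin, length bin) for an application."""
    length_bin = bucket(avg_length, length_cuts, ("short", "medium", "long"))
    height_bin = bucket(avg_height, height_cuts, ("short", "medium", "tall"))
    return (height_bin, length_bin)


def hints(height_bin: str, length_bin: str) -> List[str]:
    """Interpretations from the slide for the extreme bins."""
    out = []
    if height_bin == "short":
        out.append("low MLP, each request is critical (LAT sensitive)")
    if height_bin == "tall":
        out.append("high MLP (BW sensitive)")
    if length_bin == "short":
        out.append("packets are very critical (LAT sensitive)")
    if length_bin == "long":
        out.append("latency tolerant (could be de-prioritized)")
    return out


# Hypothetical cut-offs and measurements.
bins = classify_episode(50, 9.0, length_cuts=(100, 1000), height_cuts=(2.0, 8.0))
print(bins, hints(*bins))
```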

Classification and ranking
Classification: LAT/BW (rows: episode height; columns: episode length)

Height | Long          | Medium                                    | Short
Tall   | gems, mcf     | sphinx, lbm, cactus, xalan                | sjeng, tonto
Medium | omnetpp, apsi | ocean, sjbb, sap, bzip, sjas, soplex, tpc | applu, perl, barnes, gromacs, namd, calculix, gcc, povray, h264, gobmk, hmmer, astar
Short  | leslie        | art, libq, milc, swim                     | wrf, deal

Ranking: Sensitivity to LAT/BW (rows: episode height; columns in order: Long, Medium, Short episode length)
High: Rank-4, Rank-2, Rank-1
Medium: Rank-3, Rank-2
Short: Rank-4, Rank-3, Rank-1
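Once each application has a rank, a router can use it to choose among competing packets. The sketch below is a hypothetical arbiter, not the design from the talk: it assumes Rank-1 is the highest priority and breaks ties oldest-first, neither of which the slides state explicitly. The application names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    app: str           # placeholder application name
    rank: int          # 1..4, from the application's ranking (1 assumed highest)
    inject_cycle: int  # used for oldest-first tie-breaking (assumption)


def arbitrate(candidates: List[Packet]) -> Optional[Packet]:
    """Pick the packet with the best rank, preferring the oldest on ties."""
    if not candidates:
        return None
    return min(candidates, key=lambda p: (p.rank, p.inject_cycle))


winner = arbitrate([Packet("a", 4, 0), Packet("b", 1, 9), Packet("c", 1, 3)])
print(winner)
```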

Network design
- 1N-128
- 2N-64x256-ST (Steering)
- 2N-64x256-ST-RK (Steering+Ranking)
- 2N-64x256-ST-RK(FS) (Steering+Ranking and Frequency Scaling)
- 1N-512 (High BW)
- 2N-128x128
- 1N-320 (iso-BW)
- 1N-320(FS) (iso-resource)
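Reading the configuration names as link widths in bits (an assumption, though the 1N-128 baseline matches the 128b links in the simulation settings), steering can be sketched as a lookup from application class to sub-network. The executive summary states that the LAT optimized network has the narrower links, so the 64b network is assigned to LAT sensitive traffic here. The names `SUBNET_WIDTH_BITS`, `steer`, and `total_link_width` are hypothetical.

```python
from typing import Iterable

# Link widths in bits, read off the "2N-64x256" configuration name (assumption).
SUBNET_WIDTH_BITS = {"LAT": 64, "BW": 256}


def steer(app_class: str) -> int:
    """Return the link width of the sub-network a packet is steered to."""
    if app_class not in SUBNET_WIDTH_BITS:
        raise ValueError(f"unknown application class: {app_class}")
    return SUBNET_WIDTH_BITS[app_class]


def total_link_width(widths: Iterable[int]) -> int:
    """Aggregate link width across all sub-networks."""
    return sum(widths)


print(steer("LAT"), steer("BW"))
print(total_link_width(SUBNET_WIDTH_BITS.values()))
```

Under this reading, the two sub-networks together provide 64 + 256 = 320 bits of link width, which would explain why the single-network 1N-320 configuration is labelled iso-BW.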

Analysis
[Figure: Performance (25 WL with 50% BW and 50% LAT); annotations: w. 2%, 5%, +5%, +7%, +18%]
[Figure: Energy (25 WL with 50% BW and 50% LAT); annotations: -47%, -59%]
Best EDP across all designs
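EDP (energy-delay product) combines the two axes of this analysis into a single figure of merit, where lower is better. A minimal sketch of ranking designs by EDP follows; the design names and the energy and delay values are placeholders, not results from the talk.

```python
from typing import Dict, Tuple


def edp(energy: float, delay: float) -> float:
    """Energy-delay product: lower is better."""
    return energy * delay


def best_by_edp(designs: Dict[str, Tuple[float, float]]) -> str:
    """Return the name of the design with the lowest EDP.

    `designs` maps a design name to its (energy, delay) pair, in any
    consistent units (for example, both normalized to a baseline).
    """
    return min(designs, key=lambda name: edp(*designs[name]))


# Placeholder values normalized to a baseline of (1.0, 1.0).
candidates = {"design_x": (1.0, 1.0), "design_y": (0.5, 1.5)}
print(best_by_edp(candidates))
```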

Conclusions
Problem: Current-day NoC designs are agnostic to application requirements and are provisioned for the general case or the worst case, yet applications have widely differing demands from the network.
Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications.
Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive.
- Not all applications are equally sensitive to bandwidth and latency
Key idea: Design two NoCs
- Each sub-network is customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
Network design: The BW optimized network has wider links but operates at a lower frequency; the LAT optimized network has narrower links but operates at a higher frequency.
Results: Our proposal is significantly better when compared to an iso-resource monolithic network (5%/3% weighted/instruction throughput improvement and 31% energy reduction).

Thank you
Q?
Asit Mishra

Backup Slides

Other metrics considered for application classification

Analysis of network episode length and height
[Figure: applications grouped into Short length/height, Medium length/height, and Long length/High height, based on performance scaling sensitivity to bandwidth and frequency; annotations: 0.3M, 10K, 0.4M, 18K]

Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
Why 9 clusters?
[Figure annotation: 13x]

Analysis with varying workload combinations

Comparison to prior works

Dynamic steering of packets

Design: Putting it all together
[Figure: Logical view of a multicore processor: Processors/L1$, Network, L2$ and Mem. Controllers]
- Classify applications based on sensitivity to network BW/LAT
- Use network episode length/height (Episode LEN/HT) to dynamically identify apps
- Design LAT/BW optimized networks
- Prioritization within networks

Summary
- A NoC paradigm based on a top-down approach (application demand/requirement analysis)
- An efficient design paradigm for future heterogeneous multicores
- Providing all these guarantees in one network is hard
- Multiple networks: each customized for one metric
  - Throughput (BW) critical
  - Latency critical (real-time constraints)