Department of Computer Science and Engineering, The Pennsylvania State University. Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das: Addressing End-to-End Memory Access Latency in NoC Based Multicores.

Addressing End-to-End Memory Access Latency in NoC Based Multicores
Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das
Department of Computer Science and Engineering, The Pennsylvania State University

2 Outline
- Introduction and Motivation
- Details of the Proposed Schemes
- Implementation
- Experimental Setup and Results

3 Target System
Tiled multicore architecture
- Mesh NoC
- Shared, banked L2 cache (S-NUCA)
- Memory controllers (MCs)
[Figure: tiled layout; legend: Core, L1, L2 bank, Router, Node, Communication Link, MC]

4 Components of Memory Latency
Many components add to the end-to-end memory access latency
[Figure: path of a request message from the core to MC1 and of the response message back]

5 End-to-End Memory Latency Distribution
- Significant contribution from the network
- Higher contribution for longer latencies
Motivation
- Reduce the contribution from the network
- Make delays more uniform

6 Out-of-Order Execution and MLP
[Figure: instruction-window timeline of Load-A, Load-B, Load-C, Load-D between begin, end, and commit; two loads miss and traverse the network, while the others are L1/L2 hits]

7 Memory Bank Utilization
[Figure: requests R-0, R-1, R-2 queued at an MC serving Bank 0, Bank 1, and Bank 2]

8 Proposed Schemes
Scheme 1
- Identify and expedite "late" memory response messages
  - Reduces the NoC latency component
  - Provides more uniform end-to-end memory latency
Scheme 2
- Identify and expedite memory request messages targeting idle memory banks
  - Improves memory bank utilization
  - Improves memory performance

9 Scheme 1

10 Scheme 1: Calculating Age
Age = "so-far" delay of a message
- 12 bits, part of the 128-bit header flit
- No extra flit needed (assuming 12 bits are available)
Updated locally at each router and MC
- No global clock needed
Frequency taken into account
- DVFS at routers/nodes supported
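The age mechanism on this slide can be modeled in a few lines. This is a hypothetical sketch, not the authors' hardware: the reference frequency, the saturating behavior, and the function names are assumptions; the slide only specifies a 12-bit field updated locally, with frequency taken into account.

```python
# Model of the per-message "age" field: each router/MC adds the local
# residence time, converted to reference-clock ticks so that nodes running
# at different DVFS frequencies accumulate comparable ages without any
# global clock. The 12-bit field saturates instead of wrapping.

AGE_BITS = 12
AGE_MAX = (1 << AGE_BITS) - 1
REF_FREQ_MHZ = 1000  # assumed reference frequency

def update_age(age, local_cycles, local_freq_mhz):
    """Add one node's local residence time to the message age.

    local_cycles: cycles the message spent at this router/MC
    local_freq_mhz: this node's current clock (may differ under DVFS)
    """
    ref_cycles = local_cycles * REF_FREQ_MHZ // local_freq_mhz
    return min(age + ref_cycles, AGE_MAX)

# A message that waits 40 cycles at a router clocked at half the
# reference frequency ages by 80 reference cycles.
age = update_age(0, 40, 500)
```

Because each hop only adds its own locally measured delay, no clock synchronization between routers is required, which matches the "no global clock needed" point above.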

11 Scheme 1: Example
- MC1 receives a request from core-1; R0 is the response message
- MC1 updates the age field, adding the memory queuing/service time
- MC1 uses the age to decide whether the response is "late"; if so, it is marked "high-priority"
- The response is injected into the network as high-priority
[Figure: mesh with MC0-MC3, core-1 with its L1/L2, and response R0 carrying its age field]

12 Scheme 1: Late Threshold Calculation
Each application is treated independently
- Not uniform latency across the whole system
- Uniform latency for each core/application
[Figure: a message's so-far delay (Delay_so-far-avg) compared against a threshold derived from the application's average delay (Delay_avg)]
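The per-application lateness check can be sketched as follows. The slide does not give the exact threshold formula, so this sketch assumes a simple rule: a response is "late" once its so-far delay exceeds a fixed fraction of that application's running-average end-to-end delay. The fraction, the moving-average form, and all names are illustrative assumptions.

```python
# Per-application lateness tracking: averages are kept per app so that
# latency is made uniform per core/application rather than system-wide.

LATE_FRACTION = 0.5  # hypothetical threshold parameter

class LatenessTracker:
    def __init__(self):
        self.avg_delay = {}  # app id -> running average end-to-end delay

    def observe(self, app, delay, alpha=0.1):
        # Exponential moving average over completed accesses of this app.
        prev = self.avg_delay.get(app, delay)
        self.avg_delay[app] = (1 - alpha) * prev + alpha * delay

    def is_late(self, app, delay_so_far):
        # Late if the delay accumulated so far already exceeds this
        # application's threshold (apps with no history are never late).
        threshold = LATE_FRACTION * self.avg_delay.get(app, float("inf"))
        return delay_so_far > threshold
```

A memory-intensive application with a high average delay thus gets a higher lateness threshold than a low-latency one, which is what makes the check per-application rather than global.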

13 Scheme 2: Improving Bank Utilization
Based on the second motivation
- High idleness at memory banks
- Uneven utilization
Improve bank utilization using network prioritization
- Problem: no global information available
- Approach: prioritize at the routers using router-local information
Bank History Table per router
- Number of requests sent to each bank in the last T cycles
- If a message targets an idle bank, prioritize it
- Routes a diverse set of requests to all banks, keeping all banks busy
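A router-local Bank History Table as described above might look like the following sketch. The class and method names, and the sliding-window bookkeeping, are assumptions; the slide only specifies a per-router count of requests sent to each bank over the last T cycles.

```python
from collections import deque

# Router-local Bank History Table: count the requests this router forwarded
# to each bank over a sliding window of the last `window` cycles, and flag a
# request for prioritization when its target bank saw no recent traffic
# (i.e. the bank looks idle from this router's local point of view).

class BankHistoryTable:
    def __init__(self, num_banks, window=200):
        self.window = window
        self.recent = deque()          # (cycle, bank) events inside window
        self.count = [0] * num_banks   # per-bank request count in window

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.recent and now - self.recent[0][0] >= self.window:
            _, bank = self.recent.popleft()
            self.count[bank] -= 1

    def record(self, now, bank):
        self._expire(now)
        self.recent.append((now, bank))
        self.count[bank] += 1

    def should_prioritize(self, now, bank):
        self._expire(now)
        return self.count[bank] == 0   # no recent requests -> likely idle
```

Because each router only sees its own forwarded requests, the idleness estimate is approximate, which is exactly the trade-off the slide's "no global information available" point accepts.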

14 Network Prioritization Implementation
Baseline routing
- 5-stage router pipeline (BW, RC, VA, SA, ST)
- Flit-buffering, virtual-channel (VC), credit-based flow control
- Messages split into several flits
- Wormhole routing
Our method
- Messages to expedite get higher priority in VC and switch arbitration
- Employ pipeline bypassing [Kumar07]: fewer stages, packing the 4 pipeline cycles into 1 (setup + ST)
[Figure: baseline pipeline BW-RC-VA-SA-ST vs. bypassed pipeline setup-ST]
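The arbitration part of the method above can be sketched as a simple selection policy. This is an illustrative model only, not the authors' router logic: it shows high-priority flits winning a single output-port arbitration, with round-robin order as the tie-breaker; it does not model the pipeline stages or the bypass path.

```python
# Priority-aware switch arbitration for one output port: among the flits
# competing for the port, any "high-priority" (expedited) flit wins over
# normal flits; within the chosen class, pick round-robin starting from
# the rotating pointer.

def arbitrate(requests, rr_pointer):
    """requests: list of (input_port, is_high_priority) tuples competing
    for one output port. Returns the winning input port, or None."""
    if not requests:
        return None
    high = [r for r in requests if r[1]]
    pool = high if high else requests
    ports = sorted(p for p, _ in pool)
    # Round-robin: first candidate at or after the pointer, else wrap.
    for p in ports:
        if p >= rr_pointer:
            return p
    return ports[0]
```

Keeping round-robin as the fallback preserves fairness among same-priority messages, so only the small set of expedited messages jumps the queue.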

15 Experimental Setup
Simulator: GEMS (Simics+Opal+Ruby)
Cores: 32 OoO cores, 128-entry instruction window, 64-entry LSQ
L1: 32KB, 64B/line, 3-cycle latency
L2: 512KB per bank, 64B/line, 10-cycle latency, 1 bank/node (32 banks total)
NoC: 4x8 mesh, 5-stage routers, 128-bit flit size, 6-flit buffers, 4 VCs per port, X-Y routing
Memory: DDR-800, bus multiplier = 5, bank busy time = 22 cycles, rank delay = 2 cycles, read-write delay = 3 cycles, memory controller latency = 20 cycles, 16 banks per MC, 4 MCs

16 Experimental Setup
Benchmarks from SPEC CPU2006 [Henning06]
- Applications categorized by memory intensity (L2 MPKI): high vs. low
18 multiprogrammed workloads, 32 applications each
- WL 1-6: Mixed (50% high intensity, 50% low intensity)
- WL 7-12: Memory intensive (100% high intensity)
- WL 13-18: Memory non-intensive (100% low intensity)
- 1-to-1 application-to-core mapping
[Metric definition shown on slide]

17 Experimental Results
[Chart: per-workload speedups; annotated averages of 6%, 11%, 10%, 15%, and 13%]
Higher-intensity benchmarks benefit more from Scheme 1
- More traffic, hence more "late" messages
w-2 and w-9 degrade
- Prioritizing some messages hurts other messages

18 Experimental Results
Cumulative distribution of latencies (8 threads of WL-1)
- 90% point reduced from ~700 cycles to ~600 cycles
Probability density
- Accesses moved from the high-delay region (region 1) to the lower-delay region (region 2); not all can be moved
- The new distribution has fewer accesses with high delays
[Figure: latency CDF and density before and after Scheme-1]

19 Experimental Results
Reduction in the idleness of banks
Dynamic behavior
- Scheme-2 reduces the idleness consistently over time

20 Sensitivity Analysis
Analyzed the sensitivity of the results to system parameters, experimenting with different values of:
- Lateness threshold
- Bank History Table history length
- Number of memory controllers
- Number of cores
- Number of router stages

21 Sensitivity Analysis – “Late” Threshold

22 Sensitivity Analysis – History Length
Scheme 2 history length
- History kept at the routers for the past T cycles; default T=200 cycles
- Shorter history (T=100 cycles): cannot identify idle banks precisely
- Longer history (T=400 cycles): fewer requests prioritized

23 Sensitivity Analysis – Number of MCs
Fewer MCs mean
- More pressure on each MC, hence higher queuing latency
- More late requests: more room for Scheme 1
- Less idleness at banks: less room for Scheme 2
Slightly higher improvements with 2 MCs

24 Sensitivity Analysis – 16 Cores
Scheme-1 + Scheme-2: 8%, 10%, 5% speedup
- About 5% less than with 32 cores
Improvement is proportional to the number of cores
- More cores mean higher network latency, hence more room for our optimizations

25 Sensitivity Analysis – Router Stages
NoC latency depends on the number of router stages
- 5-stage vs. 2-stage routers
- Scheme 1+2 speedup is ~7% on average (for mixed workloads)

26 Summary
Identified
- Some memory accesses suffer long network delays and block the cores
- Bank utilization is low and uneven
Proposed two schemes
1. Network prioritization and pipeline bypassing for "late" memory response messages, to expedite them
2. Network prioritization of memory request messages, to improve bank utilization
Demonstrated
- Scheme 1 achieves 6%, 10%, 11% speedup
- Scheme 1+2 achieves 10%, 13%, 15% speedup

27 Questions?
Addressing End-to-End Memory Access Latency in NoC Based Multicores
Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das
Department of Computer Science and Engineering, The Pennsylvania State University
Thank you for attending this presentation.

28 References
[Kumar07] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," in ISCA, 2007.
[Henning06] J. L. Henning, "SPEC CPU2006 Benchmark Descriptions," SIGARCH Comput. Archit. News, 2006.