Short Circuiting Memory Traffic in Handheld Platforms

Short Circuiting Memory Traffic in Handheld Platforms. Paper presented at MICRO 2014. Review by Ajay Sekar. *Graphs and figures are taken from the paper "Short-Circuiting Memory Traffic in Handheld Platforms".

Problem Statement Many media IPs are being integrated into the SoC. Media and graphics applications generate heavy memory traffic. LPDDR3 memory throughput: 5.3 GB/s (1 channel, 1333 MHz). The memory controller turns out to be the bottleneck. http://blog.imgtec.com/multimedia/the-day-i-designed-my-own-mobile-soc
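The 5.3 GB/s figure can be sanity-checked with simple arithmetic, assuming a single-channel LPDDR3 interface with a 32-bit (4-byte) data bus at 1333 MT/s (the bus width is an assumption; it is not stated on the slide):

```python
# Back-of-the-envelope check of the 5.3 GB/s peak bandwidth quoted above.
transfers_per_sec = 1333e6   # 1333 MT/s
bus_bytes = 4                # assumed 32-bit single channel
peak_bw_gb = transfers_per_sec * bus_bytes / 1e9
print(f"Peak bandwidth: {peak_bw_gb:.2f} GB/s")  # ≈ 5.33 GB/s
```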

Usage Application Characteristics 1. Real-time responsiveness, to maintain quality. 2. Frame transactions involve MBs of data transfer. 3. Data has to be transferred between multiple IPs. This can overwhelm the memory system.

Frames from multiple IPs can stress the memory. Consequences: IPs stall on their memory requests. Energy is wasted, since IPs stay powered on longer. Core requests have to contend with heavy IP memory traffic.

High-Level Solutions Optimize memory transactions between IPs in a flow. Decongest the System Agent. Improve memory latency.

Defining a Flow (1) The device driver issues a request to the IP with a physical address location. (2) The IP sends data read requests, which get queued and prioritized; the response is received. (3) The complete frame is written back to memory. (4) The core invokes IP2. (5) IP2 reads the data written by IP1.
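The baseline flow above can be sketched as follows; each IP hand-off goes through main memory, with a full-frame read and a full-frame write per IP. The IP functions and frame contents are illustrative assumptions, not from the paper:

```python
# A minimal model of the baseline flow: IP1 reads a frame from memory,
# writes a full frame back, and only then is IP2 invoked on that frame.
memory = {"frame0": [1, 2, 3, 4]}
accesses = []

def ip_run(name, src, dst, fn):
    data = memory[src]                 # (2) IP issues read requests
    accesses.append(("read", name))
    memory[dst] = [fn(x) for x in data]
    accesses.append(("write", name))   # (3) complete frame written back

ip_run("IP1", "frame0", "frame1", lambda x: x + 10)  # (1) driver invokes IP1
ip_run("IP2", "frame1", "frame2", lambda x: x * 2)   # (4)-(5) core invokes IP2
print(memory["frame2"], len(accesses))  # every hand-off transits memory
```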

Memory Stalls - Analysis IPs that spend more time on compute see fewer stalls. Camera and graphics apps send huge frames of data, leading to stalls. Stalls lead to frame drops. Demand is far higher than supply.

At 4x frame sizes, IPs have more stalls. Increasing the memory frequency does not help. Problem: queuing latency in the memory controller.

Performance Optimization Is there reuse of data between IPs? Yes! Reuse distances vary with the application: large frames have ~10 ms reuse distance, small frames <1 ms. Can this locality be exploited? Coming up next.
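The reuse distance above can be made concrete with a small sketch. The trace format (a list of (timestamp_ms, address) pairs) is an assumption for illustration; the paper measures reuse between producer and consumer IPs:

```python
# A minimal sketch of measuring temporal reuse distance from a memory trace:
# the time between successive accesses to the same address.
def reuse_distances(trace):
    last_seen = {}
    dists = []
    for t, addr in trace:
        if addr in last_seen:
            dists.append(t - last_seen[addr])
        last_seen[addr] = t
    return dists

# Toy trace: a frame written at t = 0..2 ms and re-read ~10 ms later.
trace = [(0, 0x100), (1, 0x140), (2, 0x180),
         (10, 0x100), (11, 0x140), (12, 0x180)]
print(reuse_distances(trace))  # [10, 10, 10]
```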

SOL 1: Can caches help with data reuse between IPs? Idea - place a shared cache between the memory and the IPs, usable by all IPs. Observation - average hit rates vs. cache size: A2 and A3 have small frames that fit inside a 4 MB cache. A1, A4, and A5 benefit from increasing the cache size, with better hit rates. A8 shows no improvement - its frames are too large to fit even in a 32 MB cache, and its reuse distance is large, so data gets evicted by later frames in the flow or by other flows in the system.

Observation - average cycles consumed by the IPs: SND has small frames, so larger caches only mean longer lookups. DC's frames fit into larger caches (32 MB), giving better performance. AE is latency tolerant (more computation), so cache size does not affect its performance. Conclusions: Bigger caches strain area and power budgets. Large IP-to-IP reuse distance is the main cause of the memory stalls.

SOL 2: Sub-Framing Goal - reduce the IP-to-IP reuse distance. Similar to the "tiling" (blocking) concept used in blocked matrix multiply. http://www.eecs.harvard.edu/~mdw/course/cs61/mediawiki/images/b/bb/Lectures-cacheperf.pdf
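The tiling analogy can be sketched as follows: work on one small block at a time so it stays cache-resident, just as sub-framing keeps one sub-frame hot between IPs. The block size B is an arbitrary illustrative choice:

```python
# A minimal sketch of blocked (tiled) matrix multiply C = A * Bm for
# n x n matrices: the BxB tiles bound the working set, shrinking reuse
# distance the same way sub-framing does for frames.
def blocked_matmul(A, Bm, n, B=2):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                # One tile of C is updated using one tile of A and Bm.
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + B, n)):
                            s += A[i][k] * Bm[k][j]
                        C[i][j] = s
    return C

A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]
print(blocked_matmul(A, I, 2))  # multiplying by identity returns A
```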

Sub-Framing (contd.) Frames are divided into sub-frames, and IPs process data at sub-frame granularity - IP2 is invoked as soon as IP1 has completed a sub-frame. Observation: sub-frame size vs. reuse distance - the reuse distance decreases exponentially as the sub-frame size decreases. Flow buffers and IP-IP short-circuiting both build on the sub-framing concept.

Flow Buffers Private buffers placed between IPs. The buffers hold the reused data and keep the consumer IP from going to memory. Flow buffers require extra hardware. (Figure: a flow IP A -> IP B -> IP C, where only the first IP reads a sub-frame from memory and only the last writes a sub-frame back.) Reduces intermediate memory accesses.
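A minimal sketch of a flow chained through flow buffers, assuming illustrative stand-in IP stages: memory is touched only at the two ends of the flow, while intermediate sub-frames travel through private buffers:

```python
# Model a flow of IPs connected by flow buffers. Only the first IP reads
# sub-frames from memory and only the last writes them back; intermediate
# hand-offs use the private buffer, counted separately from memory traffic.
from collections import deque

def run_flow(memory_frame, stages, subframe_size=4):
    mem_reads = mem_writes = 0
    out = []
    for start in range(0, len(memory_frame), subframe_size):
        sub = memory_frame[start:start + subframe_size]  # first IP reads memory
        mem_reads += 1
        buf = deque([sub])
        for stage in stages:                 # intermediate IPs use the buffer
            buf.append(stage(buf.popleft()))
        out.extend(buf.popleft())            # last IP writes memory
        mem_writes += 1
    return out, mem_reads, mem_writes

frame = list(range(8))
stages = [lambda s: [x + 1 for x in s], lambda s: [x * 2 for x in s]]
result, r, w = run_flow(frame, stages)
print(result, r, w)  # two sub-frame reads, two writes, none in between
```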

IP-IP Short-Circuiting Similar to "store-to-load forwarding": the stores of IP1 still sitting in the memory-controller queues are forwarded directly to the loads of IP2. Works only if the sub-frame reuse distance is small, so the data has not yet drained from the queues.
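The forwarding idea can be sketched with a bounded write queue; the queue capacity, addresses, and drain policy are illustrative assumptions. The bounded capacity models why a short reuse distance is required: once an entry drains to DRAM, it can no longer be forwarded:

```python
# A minimal sketch of store-to-load forwarding at the memory controller,
# the mechanism IP-IP short-circuiting mimics.
from collections import OrderedDict

class WriteQueue:
    def __init__(self, capacity=4):
        self.q = OrderedDict()
        self.capacity = capacity
        self.forwards = self.dram_reads = 0

    def store(self, addr, data):            # producer IP writes a sub-frame
        if len(self.q) >= self.capacity:    # oldest entry drains to DRAM
            self.q.popitem(last=False)
        self.q[addr] = data

    def load(self, addr):                   # consumer IP reads it back
        if addr in self.q:
            self.forwards += 1              # short-circuit: serve from queue
            return self.q[addr]
        self.dram_reads += 1                # reuse distance too large
        return None                         # would fall through to DRAM

wq = WriteQueue(capacity=2)
wq.store(0x100, "sub0")
wq.store(0x140, "sub1")
print(wq.load(0x140))    # "sub1": forwarded straight from the queue
wq.store(0x180, "sub2")  # 0x100 drains to make room
wq.load(0x100)           # too late: must go to DRAM
print(wq.forwards, wq.dram_reads)
```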

Implementation - OS and HW Support An SA-HAL layer is added that maintains the flow information. The SA holds a matrix of consumer and producer IPs in a flow. The SA-HAL logs the producer and consumer of each flow, and the consumer is informed when a sub-frame is completed.

Effect of Sub-Framing with Flow Buffering and Short-Circuiting YouTube shows better hit rates with flow buffers and IP-IP short-circuiting. Hit rates increase with buffer size. The memory bottleneck is reduced.

Performance Gains % of frames completed with baseline, subframe-cache, and subframe-fwd. Applications that benefit from SB-C and SB-F see a 45% reduction in cycles per frame. VD-DC shows a 17% loss - its IPs work in parallel and stress the memory.

Energy Gains Average 40% reduction in active cycles. Memory energy consumption is reduced by 33%, thanks to fewer DRAM accesses.

Conclusion Optimized the System Agent to reduce memory traffic. Exploited locality to reuse data, reducing memory accesses. Average 45% performance gain for media/graphics applications (with exceptions). Energy consumption of both IPs and memory is reduced.

Backup Slides

Flows

Platform Configuration

System Agent The application makes API calls to invoke the hardware. Device drivers transform API calls into IP invocations. Control signals are given to the IP via registers and memory pointers. IPs read and write values to memory using DMA.