SSDM: Smart Stack Data Management for Software Managed Multicores. Jing Lu, Ke Bai, and Aviral Shrivastava, Compiler Microarchitecture Lab (CML), Arizona State University.


SSDM: Smart Stack Data Management for Software Managed Multicores
Jing Lu, Ke Bai, and Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University, USA

Web page: aviral.lab.asu.edu

Memory Scaling Challenge
 In multi-core processors, caches provide the illusion of a large, unified memory
   Bring required data from main memory into the cache
   Make sure that the application gets the latest copy of the data
 Cache coherency protocols do not scale well
   Intel 48-core Single-chip Cloud Computer
   8-core DSP from Texas Instruments (TI 6678)
 Caches consume too much power
   44% of power, and greater than 34% of area
[Figure: the Intel 48-core chip]

SPM-based Multicore Architecture
 Scratchpad Memory (SPM)
   Fast and low-power memory close to the core
   30% less area and power than a direct-mapped cache of the same effective capacity
 SPM-based multicore
   A truly distributed memory architecture on a chip
[Figure: a cache (address decoder, tag array, tag comparators, muxes, data array) compared with an SPM; an SPM-based multicore of execution cores with per-core SPMs, connected by an interconnect bus with a DMA engine]

Need for Data Management
 Programmability
   Restructuring of existing applications
   Transferring data in and out of the local scratchpad memory
   Programmers need to be aware of:
     Local memory availability
     Task requirement at every point of time
 Portability of the application
   Applications are tuned for specific hardware

Original code:

    int global;
    f1() {
      int a, b;
      global = a + b;
      f2();
    }

With explicit SPM management:

    int global;
    f1() {
      int a, b;
      DMA.fetch(global);
      global = a + b;
      DMA.writeback(global);
      DMA.fetch(f2);
      f2();
    }

Manage Stack Data
 The local scratchpad memory is shared between
   Code
   Heap data
   Stack data
   Global data
 Important to have approaches to manage stack data
   64% of all accesses in embedded applications are to stack variables
 Stack management is difficult
   'Liveness' depends on the call path
   Function stack size is known at compilation time, but stack depth is not
[Figure: local memory layout with code, global, stack, and heap regions]

State of the Art: Circular Stack Management
Ke Bai et al., "Stack Data Management for Limited Local Memory (LLM) Multi-core Processors", ASAP.

Function   Frame Size (bytes)
main       28
F1         40
F2         60
F3         54

Stack size = 128 bytes
[Figure: with main, F1, and F2 filling the 128-byte stack region in local memory, calling F3 forces the oldest frames (main and F1) to be evicted to the stack region in global memory; SP and GM_SP are the local and global stack pointers]

Challenge of Stack Data Management
 Do not perform management when it is not absolutely needed
   Fewer DMA calls: the memory latency of a task depends strongly on the number of memory requests
 Perform minimal work each time management is performed
   Transfer stack data at whole-stack-space granularity
   The management library (_sstore and _sload) becomes simpler
 Avoid thrashing
   Place management functions judiciously

Contributions of This Paper
 Formulate the optimization problem of where to insert the management functions so as to minimize the management overhead
   Finding an optimal cutting of a weighted call graph
 A new runtime library with less management complexity
 An effective heuristic (SSDM)
   Takes a weighted call graph as input
   Generates an effective management-function placement scheme
[Figure: tool flow: the compiler takes the .c sources and builds a weighted call graph; SSDM places information about where to perform management; the result is linked with the runtime library (.a) into the executable]
[Figure: example weighted call graph with nodes F0 (32), F1 (128), F2 (32), F3 (20), F4 (32) and two candidate cuts, Cut 0 and Cut 1]

Reduction of Management Overhead
[Figure: experimental results chart]

Improvement of Overall Performance
[Figure: experimental results chart]

Summary
 Scaling the memory hierarchy is becoming more and more challenging as we scale the number of cores in each processor
 One promising solution: scratchpad memory
   Does not have data management implemented in hardware
   Data management needs to be performed in software
 Important to have approaches to manage stack data
   64% of all accesses in embedded applications are to stack variables
 Contributions of this paper:
   1. Problem formulation
   2. New runtime stack data management library
   3. Efficient heuristic for stack data management
 Reduces stack data management overhead by 13X over the state of the art