CML Managing Stack Data on Limited Local Memory Multi-core Processors
Saleel Kudchadker
Compiler Micro-architecture Lab, School of Computing, Informatics and Decision Systems Engineering
30th April 2010

CML A Many-Core Future
Today:
– A few large cores on each chip
– Only option for future scaling is to add more cores
– Still some shared global structures: bus, L2 caches
[Diagram: cores with private L1 caches sharing an L2 cache over a bus]
Tomorrow:
– 100's to 1000's of simpler cores [S. Borkar, Intel, 2007]
– Simple cores are more power and area efficient
– Examples: MIT RAW, Sun UltraSPARC T2, IBM XCell 8i, Tilera TILE64

CML Multi-core Challenges
Power
– Cores are less power hungry, e.g. no speculative execution unit
– Power-efficient memories, hence no caches (caches consume 44% of core power)
Scalability
– Maintaining the illusion of shared memory is difficult
– Cache coherency protocols do not scale to a very large number of cores
– Shared resources cause higher latencies as cores scale
Programming
– As there is no unified memory, programming becomes a challenge
– Low-power, limited-size, software-controlled memory
– Programmer has to perform data management and ensure coherency

CML Limited Local Memory Architecture
– Distributed memory platform, with each core having its own small local memory
– Cores can access only their local memory
– Access to global memory is accomplished through DMA
– Example: IBM Cell BE
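As an illustration of the DMA-based access to global memory (a minimal sketch, not taken from the talk), an SPE-side program might pull a block of main memory into its local store with the MFC intrinsics from the Cell SDK; the buffer name, tag, and size below are assumptions.

    #include <spu_mfcio.h>

    #define DMA_TAG 1

    /* Local-store buffer for the fetched data; DMA needs 16-byte (ideally 128-byte)
       alignment and a size that is a multiple of 16 (hypothetical name and size). */
    static char local_buf[256] __attribute__((aligned(128)));

    /* Copy 'size' bytes from main memory (effective address 'ea') into the local store. */
    void fetch_from_main_memory(unsigned long long ea, unsigned int size)
    {
        mfc_get(local_buf, ea, size, DMA_TAG, 0, 0);  /* start DMA: main memory -> local store */
        mfc_write_tag_mask(1 << DMA_TAG);             /* select the tag group to wait on       */
        mfc_read_tag_status_all();                    /* block until the transfer completes    */
    }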

CML LLM Programming Model
The LLM architecture ensures that the program can execute extremely efficiently if all code and application data fit in the local memory.

Main Core:
    #include
    extern spe_program_handle_t hello_spu;
    int main(void) {
        int speid, status;
        speid = spe_create_thread(&hello_spu);
        spe_wait(speid, &status);
        return 0;
    }

Local Core (the same program is replicated on each of the six local cores):
    int main(speid, argp) {
        printf("Hello world!\n");
        return 0;
    }

CML Managing Data on Limited Local Memory
Why management? To ensure efficient execution within the small local memory.
Stack data challenge:
– The stack depth may not be estimable at compile time
– The stack may be unbounded, as in the case of recursion
– Stack data accounts for 64.29% of all data accesses (MiBench suite)
How do we manage stack data?
[Diagram: local memory partitioned into code, global, heap, and stack regions]

CML Working of a Regular Stack
Call sequence F1 → F2 → F3, stack size = 100 bytes

Function  Frame size (bytes)
F1        50
F2        20
F3        30

[Diagram: F1 (50), F2 (20), and F3 (30) stacked in local memory with SP at the top; all three frames fit in 100 bytes]

CML Not Enough Stack Space
Same call sequence F1 → F2 → F3, but stack size = 70 bytes

Function  Frame size (bytes)
F1        50
F2        20
F3        30

[Diagram: F1 (50) and F2 (20) occupy the local-memory stack; there is no space left for F3 (30)]

CML Related Work
Techniques have been developed to manage data in constant-size memories:
– Code: Janapsatya2006, Egger2006, Angiolini2004, Nguyen2005, Pabalkar2008
– Heap: Francesco2004
– Stack: Udayakumaran2006, Dominguez2005, Kannan2009
Udayakumaran2006 and Dominguez2005 map non-recursive and recursive functions, respectively, onto the scratchpad:
– Both works keep the frequently used portion of the stack in scratchpad memory
– They use profiling to formulate an ILP
The only work that maps the entire stack to SPM is the circular management scheme of Kannan2009:
– Applicable only to extremely embedded systems
Local memories in LLM multi-cores are very similar to scratchpad memories (SPM) in embedded systems.

CML Agenda
– Trend towards limited local memory multi-core architectures
– Background
– Related work
– Circular stack management
– Our approach
– Experimental results
– Conclusion

CML Kannan's Circular Stack Management
Call sequence F1 → F2 → F3, stack size = 70 bytes

Function  Frame size (bytes)
F1        50
F2        20
F3        30

[Diagram: F1 (50) and F2 (20) in the local-memory stack, SP at the top; a main-memory pointer tracks the eviction buffer in main memory]

CML Kannan's Circular Stack Management (continued)
[Diagram: to make room for F3 (30), the oldest frame F1 (50) is evicted over DMA to the main-memory buffer tracked by the main-memory pointer; F2 (20) and F3 (30) now occupy the 70-byte local-memory stack]

CML Circular Stack Management API
fci() – Function check-in: ensures enough space on the stack for the called function, evicting existing frames if needed.
fco() – Function check-out: ensures that the caller's frame is back in the stack when the called function returns.
Only suitable for extremely embedded systems where the application size is known.

Original code:
    F1() { int a, b; F2(); }
    F2() { F3(); }
    F3() { int j = 30; }

Stack-managed code:
    F1() { int a, b; fci(F2); F2(); fco(F1); }
    F2() { fci(F3); F3(); fco(F2); }
    F3() { int j = 30; }
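As a rough sketch of how such check-in/check-out calls could be implemented (this is not the authors' code; the table layout, the dma_evict_frame/dma_fetch_frame helpers, and the constants are assumptions), the core bookkeeping might look like this in C:

    #define STACK_SIZE 70                  /* local-memory bytes reserved for stack frames */
    #define MAX_FRAMES 16                  /* capacity of the stack management table       */

    typedef struct { unsigned size; int evicted; } frame_entry;

    static frame_entry table[MAX_FRAMES];  /* stack management table (local memory)        */
    static int oldest = 0, newest = -1;    /* indices of oldest and newest tracked frames  */
    static unsigned used = 0;              /* stack bytes currently occupied               */

    void dma_evict_frame(int idx);         /* hypothetical: DMA frame idx to main memory   */
    void dma_fetch_frame(int idx);         /* hypothetical: DMA frame idx back to local    */

    /* fci: called before a call; evict old frames until the callee's frame fits. */
    void fci(unsigned callee_size) {
        while (used + callee_size > STACK_SIZE) {
            dma_evict_frame(oldest);
            used -= table[oldest].size;
            table[oldest].evicted = 1;
            oldest++;
        }
        newest++;
        table[newest].size = callee_size;
        table[newest].evicted = 0;
        used += callee_size;
    }

    /* fco: called after the call returns; bring the caller back if it was evicted. */
    void fco(void) {
        used -= table[newest].size;        /* the callee's frame is popped                 */
        newest--;
        if (newest >= 0 && table[newest].evicted) {
            dma_fetch_frame(newest);
            used += table[newest].size;
            table[newest].evicted = 0;
        }
        /* In the slides fci/fco take the function as an argument; here frame sizes and
           indices stand in for that bookkeeping. The sketch ignores table overflow and
           main-memory-buffer overflow, the very limitations discussed on the next slides. */
    }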

CML Limitations of the Previous Technique
– Pointer threat
– Memory overflow
  – Overflow of the main memory buffer
  – Overflow of the stack management table

CML Limitations: Pointer Threat
    F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
    F2(int *a) { fci(F3); F3(a); fco(F2); }
    F3(int *a) { int j = 30; *a = 100; }

With a 100-byte stack, F3 dereferences the pointer and finds the correct "a" in F1's frame.
With a 70-byte stack, F1's frame (which holds "a") has been evicted to main memory, so the pointer refers to stale local-memory contents and F3 writes to the wrong value of "a".
[Diagram: local-memory stacks for both stack sizes, showing F1's frame and variable "a" evicted in the 70-byte case]

CML Limitations: Table Overflow
    j = 5;
    F1() { int a = 5, b; fci(F2); F2(); fco(F1); }
    F2() { fci(F3); F3(); fco(F2); }
    F3() { j--; if (j > 0) { fci(F3); F3(); fco(F3); } }

With TABLE_SIZE = 3, the recursive calls to F3 need a fourth entry:

Stack Management Table (Local Memory)
Entry 1: F2
Entry 2: F3
Entry 3: F3
Entry 4: F3  ← OVERFLOW

CML Limitations: Main Memory Overflow
    j = 5;
    F1() { int a = 5, b; fci(F2); F2(); fco(F1); }
    F2() { fci(F3); F3(); fco(F2); }
    F3() { j--; if (j > 0) { fci(F3); F3(); fco(F3); } }

The static main-memory buffer quickly fills up, since recursion can produce an unbounded stack.
[Diagram: evicted frames overflow the fixed 70-byte main-memory buffer while F2 (20) and F3 (30) occupy the local-memory stack]

CML Our Contribution
Our technique is comprehensive and works for all LLM architectures without much loss of performance. We:
– Dynamically manage the main memory
– Manage the stack management table within a fixed size
– Resolve all pointer references

CML Managing the Main Memory Buffer
– The local processor cannot allocate a buffer in the main memory itself; hence the previous technique uses a STATIC buffer.
– If the buffer is allocated dynamically, the local processor needs the address of the main-memory buffer in order to store evicted frames using DMA. How do we send that buffer address?
– Solution: run a Main Memory Manager thread on the main core!
[Diagram: local processor thread requesting a dynamically allocated buffer from the main memory management thread]

CML Dynamic Management of Main Memory
When fci() finds that frames must be evicted (Need To Evict == TRUE):
– The local program thread requests space from the Main Memory Management thread
– The management thread allocates memory and sends the main-memory buffer address back
– The local thread then evicts frames to main memory over DMA
[Diagram: local program thread and main memory management thread exchanging the buffer address; frames F1 (50), F2 (20), F3 (30) move from local memory to the allocated main-memory buffer]
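To make the request/response protocol concrete, here is a minimal sketch of what the main-core manager loop might look like (not the authors' implementation); recv_request_from_spe and send_address_to_spe are hypothetical stand-ins for the mailbox/DMA communication provided by the Cell SDK.

    #include <stdlib.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for the Cell mailbox/DMA primitives. */
    int  recv_request_from_spe(int *spe_id, uint32_t *bytes_needed); /* blocks for a request */
    void send_address_to_spe(int spe_id, uint64_t buffer_ea);        /* returns the address  */

    /* Main-core manager thread: services buffer-allocation requests from the local cores. */
    void *main_memory_manager(void *arg)
    {
        int spe_id;
        uint32_t bytes_needed;

        for (;;) {
            if (recv_request_from_spe(&spe_id, &bytes_needed) != 0)
                break;                                   /* all local threads finished */

            /* Grow the eviction buffer on demand instead of reserving a static one. */
            void *buf = malloc(bytes_needed);
            send_address_to_spe(spe_id, (uint64_t)(uintptr_t)buf);
            /* A real manager would also track and free these buffers when frames return. */
        }
        return NULL;
    }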

CML Dynamic Management of the Stack Management Table
If FULL:
– Export the table to main memory (DMA)
– Reset the table pointer
If EMPTY:
– Import TABLE_SIZE entries back to local memory
– Set the table pointer to MAX size
The same Main Memory Manager thread can allocate space for evicting the table to main memory.
[Diagram: a full local-memory table (Entry 1: F1, Entry 2: F2) is exported to main memory, after which new entries (Entry 1: F3, Entry 2: F3) reuse the local table]
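A minimal sketch of this fixed-size table management, assuming hypothetical dma_export_table/dma_import_table helpers and a local window of TABLE_SIZE entries:

    #define TABLE_SIZE 8                       /* entries kept in local memory (hypothetical) */

    typedef struct { unsigned size; int evicted; } frame_entry;

    static frame_entry table[TABLE_SIZE];      /* fixed-size window of the management table */
    static int table_ptr = 0;                  /* next free slot in the local window        */

    void dma_export_table(frame_entry *t, int n);  /* hypothetical: copy n entries out      */
    void dma_import_table(frame_entry *t, int n);  /* hypothetical: copy n entries back in  */

    /* Called by fci() before recording a new frame. */
    void table_push(frame_entry e) {
        if (table_ptr == TABLE_SIZE) {         /* FULL: export the window, reset the pointer */
            dma_export_table(table, TABLE_SIZE);
            table_ptr = 0;
        }
        table[table_ptr++] = e;
    }

    /* Called by fco() after a frame is popped. */
    frame_entry table_pop(void) {
        if (table_ptr == 0) {                  /* EMPTY: import the previous window back     */
            dma_import_table(table, TABLE_SIZE);
            table_ptr = TABLE_SIZE;
        }
        return table[--table_ptr];
    }

Pushes beyond the window export the oldest entries; pops past the window bring them back, so the local table never grows beyond TABLE_SIZE entries.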

CML Pointer Resolution
    F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
    F2(int *a) { fci(F3); F3(a); fco(F2); }
    F3(int *a) { int j = 30; a = getVal(a); *a = 100; a = putVal(a); }

getVal() calculates the linear (global) address of the pointer and fetches the pointed-to variable into local memory; putVal() places it back into main memory.
Space for the stack = 70 bytes. In the unmanaged 100-byte stack, "a" sits at displacement 90, so:
Offset = (100 − 0) − 90 = 10
Global address = end of the main-memory stack image − 10
[Diagram: the actual local-memory stack (F2, F3), the main-memory copy holding the evicted F1 frame with "a", and the stack as it would look without any management]
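A possible shape for getVal()/putVal() is sketched below (not the authors' implementation); resolve_stack_address, the DMA helpers, and the temporary slot are hypothetical, and only the int case from the example is handled.

    #include <stdint.h>

    /* Hypothetical lookup: given a stack address the program thinks it is using,
       report where that byte currently lives (local store or main memory). */
    typedef struct { int in_local; void *local_addr; uint64_t global_addr; } location;
    location resolve_stack_address(void *stack_addr);

    void dma_get(void *local, uint64_t global, unsigned size);   /* main memory -> local */
    void dma_put(void *local, uint64_t global, unsigned size);   /* local -> main memory */

    static int      temp_slot;     /* scratch copy of an evicted int (hypothetical) */
    static uint64_t temp_global;   /* where that copy must be written back          */

    /* getVal: make the pointed-to integer addressable in local memory. */
    int *getVal(int *p) {
        location loc = resolve_stack_address(p);
        if (loc.in_local)
            return (int *)loc.local_addr;              /* frame still resident: use it directly */
        dma_get(&temp_slot, loc.global_addr, sizeof(int));  /* fetch the evicted value          */
        temp_global = loc.global_addr;
        return &temp_slot;
    }

    /* putVal: write a fetched value back to its home in main memory. */
    int *putVal(int *p) {
        if (p == &temp_slot)                           /* only needed if getVal made a copy     */
            dma_put(&temp_slot, temp_global, sizeof(int));
        return p;
    }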

CML Agenda
– Trend towards limited local memory multi-core architectures
– Background
– Related work
– Circular stack management
– Our approach
– Experimental results
– Conclusion

CML Experimental Setup
– Sony PlayStation 3 running Fedora Core 9 Linux
– MiBench benchmark suite
– Runtimes measured with spu_decrementer() on the SPE and _mftb() on the PPE, provided with IBM Cell SDK 3.1
– Each benchmark is executed 60 times and the average is taken to abstract away timing variability
– Each Cell BE SPE has 256 KB of local memory

CML Results
We test the effectiveness of our technique by:
1. Enabling unlimited stack depth
2. Comparing runtime at the smallest stack size under our and the previous stack management
3. Wider applicability
4. Scalability over the number of cores

CML 1. Enabling Limitless Stack Depth
We executed a recursive benchmark with:
– No management
– The previous stack management technique
– Our approach
The size of each function frame is 60 bytes.

    int rcount(int n) {
        if (n == 0) return 0;
        return rcount(n - 1) + 1;
    }

CML 1. Enabling Limitless Stack Depth
– Our technique works for arbitrary stack depths, whereas the previous technique works only for limited values of N.
– The previous technique crashes because the stack management table is not managed and grows to occupy a very large space.
– Without any management, the program crashes when no space is left in local memory for the stack.

CML 2. Better Performance in Less Space
– Our technique uses much less local-memory space and still achieves runtimes comparable to the previous technique.
– Our technique resolves pointers and therefore produces correct results.
– The previous technique fails for smaller stack sizes because it cannot resolve pointers once the referenced frames are evicted.

CML 3. Wider Applicability
– When we match the stack space, our technique gives runtimes similar to the previous technique.
– Our technique also keeps working in smaller stack spaces.

CML 4. Scalability
[Graph: performance versus number of cores for our technique]
Runtime increases as the single PPU thread gets flooded with allocation requests.

CML Summary
– LLM architectures are scalable and have a promising future.
– For efficient execution of applications on LLM, data management is needed.
– We propose a comprehensive stack data management technique for LLM architectures that:
  – Manages any arbitrary stack depth
  – Resolves pointers and thus ensures correct results
  – Manages the main memory, thus enabling scaling
– Our API is semi-automatic, consisting of only 4 simple functions.

CML Outcomes
– International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES): "Managing Stack Data on Limited Local Memory (LLM) Multi-core Processors"
– Software release: "LLM Stack Data Manager plug-in", being implemented in GCC for the SPE architecture

CML Thank You! Questions?