EnVM: Virtual Memory Design for New Memory Architectures
Pooja ROY, Manmohan MANOHARAN, Weng Fai WONG
National University of Singapore
ESWEEK (CASES), October 2014

New Memory Architectures
NVMs (STT-RAM, MRAM, etc.)
– Energy efficient
– Higher density
– High write latency (3x slower than reads)
– Low write endurance
Solution → Hybrid memories
[Figure: hybrid memory combining NVM with SRAM/DRAM]
International Conference on Compilers, Architectures and Synthesis of Embedded Systems

Hybrid Caches
SRAM + STT-RAM hybrid design
Data allocation
– Reducing writes to NVM partition
– Redirecting write-intensive data to SRAM partition
Performance impact
– Data movement between partitions is expensive
Energy impact
– High writes to NVM might offset energy savings

Motivation
Different solutions (previous works) for each level of memory
– Not co-operative. Conflicting.
– Not holistic for hybrid memory hierarchy

Motivation
♯ Stack data layout for hybrid L1 cache (Li et al., ISLPED'12)
♯ Reuse distance based data allocation for hybrid L2 cache (Chen et al., LCTES'12)
[Figure: stack holding variables a, b, c, d and x1 to x4, ordered once by write reuse sequence and once by read reuse sequence, with write-intensive and read-intensive regions marked]

Motivation
Different solutions (previous works) for each level of memory
– Not co-operative. Conflicting.
– Not holistic for hybrid memory hierarchy
Hardware solutions → heavy modifications and energy overheads
Software solutions → partial support or profile-based techniques

Our Approach - EnVM
Uses virtual memory to cover every level of the hybrid memory hierarchy
Handles static and dynamic data, no profiling required
Utilizes existing hardware
Advocates migration-less cache design

EnVM

Static Analysis
Abstract interpretation based dataflow analysis
Heuristic estimate of memory access intensity
[Figure: control-flow graph annotated with (variable, read count, write count)]
b1: a = p; b = q
b2: a = a-5
b3: b = b*2
b4: c = p+q; d = p-q
After b1: a → (0,1), p → (1,0), b → (0,1), q → (1,0)
After b2: a → (1,2)
After b3: b → (1,2)
After b4: c → (0,1), d → (0,1), p → (3,0), q → (3,0) (via p → (2,0), q → (2,0))
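The read and write counting on this slide can be sketched in a few lines. This is only a simplified illustration, not the authors' abstract-interpretation analysis: it assumes every basic block contributes once (no loops, no path sensitivity), and the names `BLOCKS` and `count_accesses` are hypothetical.

```python
import re
from collections import defaultdict

# Basic blocks as read off the slide; the grouping into b1..b4 is an assumption.
BLOCKS = {
    "b1": ["a = p", "b = q"],
    "b2": ["a = a - 5"],
    "b3": ["b = b * 2"],
    "b4": ["c = p + q", "d = p - q"],
}

# Matches variable names but not numeric literals such as 5 or 2.
IDENT = re.compile(r"[A-Za-z_]\w*")


def count_accesses(blocks):
    """Return {variable: (read count, write count)} summed over all blocks."""
    counts = defaultdict(lambda: [0, 0])
    for statements in blocks.values():
        for statement in statements:
            target, expr = (part.strip() for part in statement.split("=", 1))
            counts[target][1] += 1            # left-hand side is written once
            for name in IDENT.findall(expr):  # each right-hand-side use is a read
                counts[name][0] += 1
    return {var: tuple(rw) for var, rw in counts.items()}
```

Summing over all four blocks yields the same final tuples as the slide, for example (1,2) for `a` and (3,0) for `p`.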

Static Analysis
Clustering based on unsupervised machine learning algorithm
Classification to 4 classes and then to 2 partitions
Read intensive allocated to STT-RAM partition
Write intensive allocated to SRAM partition
Classes: Low Read – Low Write; Low Read – High Write; High Read – Low Write; High Read – High Write
[Figure: variables c → (0,1), d → (0,1), b → (1,2), a → (1,2), q → (3,0), p → (3,0) grouped into read-intensive and write-intensive clusters and mapped to STT-RAM and SRAM]
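The final step, mapping variables to the two partitions, might look roughly like the sketch below. The slide does not name the clustering algorithm, so a plain comparison of write count against read count is swapped in here as a stand-in; that rule, the tie-breaking toward STT-RAM, and the name `assign_partition` are all assumptions.

```python
def assign_partition(counts):
    """Map each variable to a partition from its (reads, writes) pair.

    Stand-in rule (an assumption, not the clustering from the slides):
    more writes than reads means write intensive, so the variable goes
    to SRAM; everything else goes to STT-RAM.
    """
    placement = {}
    for var, (reads, writes) in counts.items():
        placement[var] = "SRAM" if writes > reads else "STT-RAM"
    return placement


# (read count, write count) tuples taken from the slide's example.
EXAMPLE_COUNTS = {
    "a": (1, 2), "b": (1, 2), "c": (0, 1),
    "d": (0, 1), "p": (3, 0), "q": (3, 0),
}
```

Under this rule the read-only variables `p` and `q` land in STT-RAM and the remaining four in SRAM. A real clustering step would derive the boundary from the data instead of fixing it in advance.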

Memory Access Types
Variables show high read OR write affinity

Dynamic Memory
Hard to analyze
Exposed to programmer
Dynamic memory library support
– Enable dual heap structure
Two distinct system calls (r_malloc, w_malloc)
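A dual heap with two entry points can be modeled as two independent allocators over disjoint address ranges. The sketch below is a toy model only: `r_malloc` and `w_malloc` are the names from the slide, while the bump-allocation strategy, the base addresses, the heap sizes, and the `BumpHeap` class are hypothetical.

```python
class BumpHeap:
    """Toy bump allocator over one contiguous address range (no free)."""

    def __init__(self, base, size):
        self.base = base
        self.limit = base + size   # exclusive upper bound
        self.next_free = base

    def alloc(self, nbytes, align=8):
        # Round the next free address up to the requested alignment.
        start = (self.next_free + align - 1) // align * align
        if nbytes <= 0 or start + nbytes > self.limit:
            return None            # mirrors malloc returning NULL
        self.next_free = start + nbytes
        return start


# Hypothetical layout: one heap per partition, at made-up addresses.
READ_HEAP = BumpHeap(base=0x1000_0000, size=64 * 1024)   # backed by STT-RAM
WRITE_HEAP = BumpHeap(base=0x2000_0000, size=4 * 1024)   # backed by SRAM


def r_malloc(nbytes):
    """Allocate read-intensive data from the STT-RAM heap."""
    return READ_HEAP.alloc(nbytes)


def w_malloc(nbytes):
    """Allocate write-intensive data from the SRAM heap."""
    return WRITE_HEAP.alloc(nbytes)
```

A programmer would call `r_malloc` for a lookup table that is filled once and then only read, and `w_malloc` for a buffer that is updated often. The returned address alone then tells the system which partition holds the data.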

EnVM Layout
[Figure: existing virtual memory layout vs. proposed virtual memory layout]
x86 segment registers do boundary checking
Minimum modification to fit other architectures
Allocating the data from each segment to either STT-RAM or SRAM
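The idea of steering an access by the segment it falls in can be illustrated in software. On x86 the bounds check is done by segment hardware, so this is purely a model; the segment names, address ranges, and the `partition_for` helper are hypothetical and not taken from the slides.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    name: str
    base: int       # inclusive start address
    limit: int      # exclusive end address
    partition: str  # "STT-RAM" or "SRAM"


# Hypothetical two-segment layout for illustration only.
SEGMENTS = [
    Segment("read_data", 0x0000, 0x4000, "STT-RAM"),
    Segment("write_data", 0x4000, 0x5000, "SRAM"),
]


def partition_for(addr, segments=SEGMENTS):
    """Return the partition of the segment that contains addr."""
    for seg in segments:
        if seg.base <= addr < seg.limit:   # the boundary check
            return seg.partition
    raise ValueError(f"address {addr:#x} is outside every segment")
```

Because the partition follows from the address range, no per-access metadata or data migration is needed in this model.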

Evaluation
Comparison
– Hardware only method (HW) on hybrid L1
– Software method based on stack layout (SW1) on hybrid L1
– Software method based on reuse distance (SW2) on hybrid L2
– Our method on hybrid L1 (EnVM)
MARSSx86 Cycle Accurate Simulator
Processor: Unicore, 3 GHz, Commit Width - 4

Memory - Hybrid L1 Design
L1 I-Cache (SRAM): 64K, 64B Line, 3 Cycles
L1 D-Cache (Hybrid): SRAM: 4K, 4-way, 3 Cycles; STT-RAM: 64K, 4-way, Read - 3 Cycles, Write - 10 Cycles
L2 (SRAM): 2M, 8-way, 15 Cycles, 64B Lines

Memory - Hybrid L2 Design
L1 I-Cache (SRAM): 64K, 8-way, 3 Cycles, 64B Line
L1 D-Cache (SRAM): 32K, 8-way, 3 Cycles, 64B Line
L2 (Hybrid): SRAM: 1M, 4-way, 3 Cycles; STT-RAM: 2M, 8-way, Read - 11 Cycles, Write - 30 Cycles

Write Reduction
Normalized to HW
Reduces writes by 47.6% (vs. HW) and 15% (vs. SW1)

Energy Savings
Normalized to pure SRAM configuration
Max. energy reduction 50% for 458.sjeng
Reduces energy by 21% (vs. HW) and 6% (vs. SW1)

Performance Impact
Normalized to pure SRAM configuration
Comparable IPC
Write latency is offset by bigger cache capacities

Summary
Holistic management of process memory to aid hybrid memory hierarchy
Reduces writes - 47.6% (HW) & 15% (SW1)
Reduces energy - 21% (HW) & 6% (SW1)
Minimum hardware modification
No profiling of applications
No migration of data
Improvements
– Dynamic memory management

Thank You