Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.


Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture Fall Prof. Burger

Motivation
What would a realistic HTM system actually support? (primitives/design choices)
Current Transactional Memory proposals make architectural design choices with inadequate information:
–shared counter, linked list benchmarks
–focus on user mode: avoids OS issues

HTM + OS: are you nuts?
Large concurrent program with complex data access patterns
Complex code: simplify programming model
Many apps spend a lot of time in kernel
Diverse synchronization primitives
–spinlocks, semaphores, per-CPU variables, RCU, seqlocks, completions, mutexes

Our HTM System
Basic primitives:
–xbegin, xend
OS-specific primitives:
–xpush, xpop
–stack management: interrupts on x86 re-use stack
Configurable Hardware Parameters
–Conflict detection granularity
–Commit & abort penalties
–Overflow costs
Configurable contention management
–Conflict resolution policies: which tx restarts?
–Backoff policies: how long to wait before restart
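
The backoff policies named above can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the slides give no formulas, so the base delay, the cap, and the function names are all hypothetical.

```python
import random

# Hypothetical constants: the slides do not state actual delay values.
BASE_CYCLES = 64
CAP_CYCLES = 65536


def backoff_bound(policy: str, restarts: int) -> int:
    """Upper bound (in cycles) on the wait before retry number `restarts`."""
    if restarts < 1:
        return 0
    if policy == "linear":
        bound = BASE_CYCLES * restarts                 # grows by a fixed step
    elif policy == "exponential":
        bound = BASE_CYCLES * 2 ** (restarts - 1)      # doubles on each restart
    else:
        raise ValueError(f"unknown backoff policy: {policy}")
    return min(bound, CAP_CYCLES)


def randomized_wait(policy: str, restarts: int, rng=random) -> int:
    """Pick a random wait in [0, bound] so restarted transactions desynchronize."""
    return rng.randint(0, backoff_bound(policy, restarts))
```

Under these assumed constants, the exponential bound overtakes the linear one after the second restart, which is one way a policy can end up waiting longer than the conflict requires.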

An Issue Unique to an OS: Using transactions in interrupt handlers
Memory locations: 0x10 0x20 0x30 0x40
system_call() {
    XBEGIN
    modify 0x10
    XEND
}
intr_handler() {
    XPUSH
    XBEGIN
    modify 0x30
    XEND
    XPOP
}
An interrupt arrives while TX #1 { 0x10 } is active. Design options:
–No tx in interrupts: TX #1 { 0x10 }
–Interrupts abort active tx: TX #1 { 0x10 }, TX #2 { 0x30 }
–Nest the transactions: TX #1 { 0x10, 0x30 }
–Multiple active transactions: TX #1 { 0x10 }, TX #2 { 0x30 }
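
The "multiple active transactions" option can be modelled in a few lines. The sketch below assumes that xpush suspends the running transaction and xpop resumes it, which is how the handler on this slide uses them; the class and its fields are hypothetical and say nothing about the actual hardware design.

```python
class CpuTxState:
    """Toy per-CPU transaction state (illustrative only)."""

    def __init__(self):
        self.active = None    # write set of the running transaction, or None
        self.saved = []       # transactions suspended by xpush
        self.committed = []   # write sets, in commit order

    def xbegin(self):
        if self.active is not None:
            raise RuntimeError("a transaction is already active; xpush first")
        self.active = set()

    def modify(self, addr):
        # Only transactional writes are tracked in this model.
        if self.active is not None:
            self.active.add(addr)

    def xend(self):
        if self.active is None:
            raise RuntimeError("xend without an active transaction")
        self.committed.append(frozenset(self.active))
        self.active = None

    def xpush(self):
        # Suspend whatever is running (possibly nothing) for the handler.
        self.saved.append(self.active)
        self.active = None

    def xpop(self):
        if self.active is not None:
            raise RuntimeError("xpop while a transaction is still active")
        self.active = self.saved.pop()


def system_call_with_interrupt():
    """Replay the slide: TX #1 writes 0x10, is interrupted, TX #2 writes 0x30."""
    cpu = CpuTxState()
    cpu.xbegin()
    cpu.modify(0x10)
    cpu.xpush()          # interrupt arrives: suspend TX #1
    cpu.xbegin()
    cpu.modify(0x30)
    cpu.xend()           # TX #2 commits independently
    cpu.xpop()           # resume TX #1
    cpu.xend()
    return cpu.committed
```

In this model the two write sets stay separate, so the handler's transaction commits without growing TX #1, unlike the nesting option where TX #1 ends up holding both 0x10 and 0x30.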

Converting Linux to TxLinux
TxLinux based on kernel
Converted “core” primitives to use transactions
–spin-locks, RCU primitives, r/w locks
–critical sections become transactions
Converted high-traffic subsystems
–memory allocators, FS directory cache, address-to-page mapping data structures, mapping of files into address spaces, IP routing, and socket locking
Modified interrupt-handling code to use primitives in our HTM model (xpush, xpop)

HTM Implementation
Implemented HTM model as x86 extensions
Simulation environment
–Simics machine simulator
–transactional L1 cache (variable: 4k-32k)
–4MB L2; 1GB RAM
–1 cycle/instruction, 16 cycle/L1 miss, 200 cycle/L2 miss
–4 & 8 processors
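
The timing parameters above amount to a simple additive cost model. The sketch below is a worked example under one assumption the slide does not state: each miss is charged its listed penalty once, with L1 misses that hit in L2 counted separately from L2 misses. The function and parameter names are hypothetical.

```python
CYCLES_PER_INSTRUCTION = 1
L1_MISS_CYCLES = 16
L2_MISS_CYCLES = 200


def estimated_cycles(instructions: int, l1_misses_hit_l2: int, l2_misses: int) -> int:
    """Estimate run time in cycles from instruction and miss counts."""
    return (instructions * CYCLES_PER_INSTRUCTION
            + l1_misses_hit_l2 * L1_MISS_CYCLES
            + l2_misses * L2_MISS_CYCLES)


# 1000 instructions, 10 L1 misses served by L2, 2 misses to memory:
# 1000 + 10*16 + 2*200 = 1560 cycles
example = estimated_cycles(1000, 10, 2)
```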

Experimental Setup
Benchmarks
–micro: kernalloc, Counter, directory cache “punisher”
–macro: pmake, netcat, MAB, configure, find
Measurements
–Execution time
–Transaction statistics: created/restarted/overflowed, working sets, footprint
–Cache statistics (e.g. miss rate)
Variables
–Contention management (conflict/backoff policies)
–Transactional cache size
–Commit, abort, overflow penalties
–Conflict granularity (byte vs. word vs. cache line)
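
To make the conflict-granularity variable concrete, here is a small sketch of how the same pair of accesses can conflict at one granularity and not at another. The 4-byte word and 64-byte cache line are assumed sizes, and the helper names are hypothetical.

```python
# Assumed granule sizes in bytes; the slides do not give them.
GRANULE_BYTES = {"byte": 1, "word": 4, "cache_line": 64}


def granules(addr: int, size: int, granularity: str) -> set:
    """Set of granule indices touched by an access of `size` bytes at `addr`."""
    g = GRANULE_BYTES[granularity]
    return set(range(addr // g, (addr + size - 1) // g + 1))


def conflicts(a, b, granularity: str) -> bool:
    """a and b are (addr, size, is_write) accesses from different transactions."""
    if not (a[2] or b[2]):
        return False  # two reads never conflict
    return bool(granules(a[0], a[1], granularity) & granules(b[0], b[1], granularity))


# Two writes to different words of the same cache line.
w1 = (0x1000, 4, True)
w2 = (0x1008, 4, True)
results = {g: conflicts(w1, w2, g) for g in GRANULE_BYTES}
```

Here `results` reports a conflict only at cache-line granularity: a false conflict, since the two writes never touch the same bytes.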

TxLinux Results (4 processors)
Performance change minimal, lots of transactions
Unique Transaction restarts were < 0.07%
Data cache miss rates do not change appreciably
Transactions Created: 105,972 | 425,888 | 475,860 | 1,810,602 | 1,408,610 | 243,934

Contention Management Matters! linear back off policy, 4 processors

Conclusions
TxLinux is cooler than, and has comparable performance to, Linux
Cache line granularity is good enough
16KB Transactional cache covers the vast majority of transactions
Best contention management policy is workload dependent. Exponential back off is too conservative

Backup Slides

Contention Management Restart Rates

Conflict Granularity & Backoff Policy

Stack Management Issue
Treating the Stack as a shared resource
–Checkpoint
–Partition

Tx’l Memory Allocator Investigation
Examine Tx complexity/performance trade-off
The “slab” is the default Kernel memory allocator
–Highly tuned for performance
–Avoids contention/locks, uses per-CPU structures
–About 3,880 lines of code
The “slob” is a drop-in replacement
–Designed for minimal bookkeeping memory overhead
–Uses two coarse-grained locks (386 lines)
The “slob-opt” is “slob” with modifications
–Removed “obvious” transaction bottlenecks
–Only a couple of dozen lines of code changed

Tx’l Memory Allocator Results (4 proc)
Execution time (in seconds) and unique restarts
Columns: Kernalloc | Pmake | MAB | configure | Find
slab: …% | 0.04% | 0.07% | 0.04% | 0%
slob: …% | 19.72% | 5.93% | 0.71% | …
slob-optimized: …% | 0.45% | 8.48% | 1.42% | 0.12%

Transactional Memory Issues
Hardware vs. Software
–Different interfaces
–strong (HW) vs. weak (SW) atomicity
Will transactions make programming easier?
Transactions for blocking primitives?
Using transactions for security?