Published by Ralph Ailey
Modified about 1 year ago

Windows Kernel Internals: Cache Manager
David B. Probert, Ph.D.
Windows Kernel Development, Microsoft Corporation
© Microsoft Corporation

What is the Cache Manager?
A set of kernel-mode routines and asynchronous worker routines that form the interface between filesystems and the memory manager for Windows NT

Cache Manager Functionality
Access methods for pages of file data on opened files
Automatic asynchronous read ahead
Automatic asynchronous write behind (lazy write)
Supports “Fast I/O” – IRP bypass

Who Uses the Cache Manager?
Disk file systems (NTFS, FAT, CDFS, UDFS)
Windows File Server(s)
Windows Redirector
Registry (as of Windows XP)

What Can Be Cached?
User data streams
File system metadata
  – directories
  – transaction logs
  – NTFS MFT
  – synthetic structures – FAT’s Virtual Volume File
… anything that can be represented as a stream of bytes

Virtual Block vs Logical Block Cache
The more traditional approach is a logical block cache (SmartDrive):
  – File+FileOffset translated by the file system into one or more partition offsets
  – Each partition offset translated by the cache manager to a cache address
In a virtual block cache:
  – File+FileOffset translated by the cache manager to a cache address, its memory mapping!

Advantages of Virtual Block Cache integrated with VM
Single memory manager
  – the actual “cache” manager!
Data cache is dynamically sized – just another working set
Cache coherency with user mapped files is free

How Does It Work?
Mapped stream model integrated with memory management
Cached streams are mapped with fixed-size views (256KB)
Pages are faulted into memory via MM
Pages may be modified in memory and written back
MM manages global memory policy

Cache Addresses
MM allocated kernel VA range (512MB +)
  – common: 0xc1000000 – 0xe0000000
Visible in all kernel-mode contexts
Member of the System Cache working set (includes paged pool and code)
  – this is what Task Manager shows you!
  – not just the “file” cache, though it does frequently dominate
Competes for physical memory
“Owned” by Cache Manager

Datastructure Layout
File Object == Handle (U or K), not one per file
Section Object Pointers and FS File Context are the same for all file objects for the same stream
[Diagram labels: Kernel; Handle; File Object; Filesystem File Context; FS Handle Context (2); Section Object Pointers; Data Section (Mm); Image Section (Mm); Shared Cache Map (Cc); Private Cache Map (Cc)]

Datastructures
File Object
  – FsContext – per physical stream context
  – FsContext2 – per user handle stream context; not all streams have handle context (metadata)
  – SectionObjectPointers – the point of “single instancing”
      DataSection – exists if the stream has had a mapped section created (for use by Cc or user)
      SharedCacheMap – exists if the stream has been set up for the cache manager
      ImageSection – exists for executables
  – PrivateCacheMap – per handle Cc context (readahead) that also serves as the reference from this file object to the shared cache map

Single Instancing & Metadata
Although filesystems represent metadata as streams, they are not exported to user mode
Directories require a level of indirection to escape single instancing exposing the data
Filesystems create a second internal “stream” fileobject
  – user’s fileobject has NULL members in its Section Object Pointers
  – stream fileobjects have no FsContext2 (user handle context)
All metadata streams are built like this (MFTs, FATs, etc.)
FsContext2 == NULL plays an important role in how Cc treats these streams, which we’ll discuss later.

View Management
A Shared Cache Map has an array of View Access Control Block (VACB) pointers which record the base cache address of each view
  – promoted to a sparse form for files > 32MB
Access interfaces map File+FileOffset to a cache address
Taking a view miss results in a new mapping, possibly unmapping an unreferenced view in another file (views are recycled LRU)
Since a view is fixed size, mapping across a view is impossible – Cc returns one address
Fixed size means no fragmentation …

View Mapping
[Diagram: a VACB array indexed by file offset (0-256KB, 256KB-512KB, 512KB-768KB), with entries pointing at cache addresses such as c1000000 and cf0c0000]
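
The lookup described on the View Management slide can be sketched as a toy model. This is only an illustration, not Cc's actual data structure: the class and method names are hypothetical, the dictionary stands in for the VACB pointer array, and the pairing of view indexes with the two addresses from the diagram is an assumption.

```python
VIEW_SIZE = 256 * 1024  # fixed view size given in the slides


class SharedCacheMapModel:
    """Hypothetical stand-in for a Shared Cache Map: one slot per 256KB view."""

    def __init__(self):
        # view index -> base cache address; a dict plays the role of the
        # (possibly sparse) VACB pointer array
        self.vacbs = {}

    def map_view(self, view_index, base_address):
        self.vacbs[view_index] = base_address

    def lookup(self, file_offset, length):
        """Return (cache_address, usable_length), or None on a view miss."""
        view_index, within = divmod(file_offset, VIEW_SIZE)
        base = self.vacbs.get(view_index)
        if base is None:
            return None  # view miss: a real cache would map a new view here
        # A mapping never crosses a view boundary, so clamp the length.
        usable = min(length, VIEW_SIZE - within)
        return base + within, usable


scm = SharedCacheMapModel()
scm.map_view(0, 0xC1000000)  # assumed pairing of view 0 with this address
scm.map_view(2, 0xCF0C0000)  # assumed pairing of view 2 with this address
```

A request that runs past the end of a view comes back shortened, which is why a caller copying a large range has to loop view by view.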

Interface Summary
File objects start out unadorned
CcInitializeCacheMap to initiate caching via Cc on a file object
  – set up the Shared/Private Cache Map & Mm if necessary
Access methods (Copy, Mdl, Mapping/Pinning)
Maintenance Functions
CcUninitializeCacheMap to terminate caching on a file object
  – tear down S/P Cache Maps
  – Mm lives on. Its data section is the cache!

The Cache Manager Doesn’t Stand Alone
Cc is an extension of either Mm or the FS depending how you look at it
Cc is intimately tied into the filesystem model
Understanding Cc means we have to take a slight detour to mention some concepts filesystem folks think are interesting. Raise your hand if you’re a filesystem person :-)

The Big Block Diagram
[Diagram boxes: Cache Manager; Memory Manager; Filesystem; Storage Drivers; Disk. Arrow labels: Fast IO Read/Write; IRP-based Read/Write; Page Fault; Cache Access, Flush, Purge; Noncached IO; Cached IO]

The Slight Filesystem Digression
Three basic types of IO on NT: cached, noncached and “paging”
Paging IO is simply IO generated by Mm – flushing or faulting
  – the data section implies the file is big enough
  – can never extend a file
A filesystem will re-enter itself on the same callstack as Mm dispatches cache pagefaults
This makes things exciting! (ERESOURCEs)

The Three File Sizes
FileSize – how big the file looks to the user
  – 1 byte, 102 bytes, 1040592 bytes
AllocationSize – how much backing store is allocated on the volume
  – multiple of cluster size, which is 2^n * sector size
  – ... a more practical definition shortly
ValidDataLength – how much of the file has been written by the user in cache, zeros seen beyond (some OSes use sparse allocation)
ValidDataLength <= FileSize <= AllocationSize
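
As a worked example of the size relationship above, here is a small sketch using the three FileSize values from the slide. The function names, the 512-byte sector, and the choice of n = 3 (4KB clusters) are assumptions for illustration; a real filesystem may allocate more than this minimum.

```python
def round_up(value, multiple):
    """Round value up to the next multiple (integer arithmetic only)."""
    return -(-value // multiple) * multiple


def minimum_allocation_size(file_size, sector_size=512, n=3):
    """Smallest AllocationSize that covers file_size.

    Cluster size is 2^n * sector size; sector_size=512 and n=3 are
    assumed values, giving 4096-byte clusters.
    """
    cluster = (1 << n) * sector_size
    return round_up(file_size, cluster)


def sizes_are_consistent(valid_data_length, file_size, allocation_size):
    """Check ValidDataLength <= FileSize <= AllocationSize."""
    return 0 <= valid_data_length <= file_size <= allocation_size


for size in (1, 102, 1040592):
    print(size, minimum_allocation_size(size))
```

With 4KB clusters, the 1-byte and 102-byte files each need one cluster (4096 bytes), and the 1040592-byte file needs 255 clusters (1044480 bytes).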

Valid Data Length
The Win32 model expects full allocation of files (STATUS_DISK_FULL is uncool)
Writing zeroes is expensive, but users tend to write files front to back
Windows FS keep track of this as a high-water mark
If the user reads beyond VDL, we may be able to get clever and not bother the filesystem at all.
If a user writes beyond VDL, zeroing of a “hole” may be required
Never SetEndOfFile and write at the end if you can help it!
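
The high-water-mark behaviour can be sketched with a toy in-memory file. This is a model of the idea only, not filesystem code: the class name, its fields, and the `bytes_zeroed` counter are hypothetical.

```python
class VdlFileModel:
    """Hypothetical file with a fixed FileSize and a ValidDataLength mark."""

    def __init__(self, file_size):
        self.file_size = file_size
        self.vdl = 0
        self.data = bytearray()  # holds only the bytes below VDL
        self.bytes_zeroed = 0    # how much explicit hole zeroing was needed

    def read(self, offset, length):
        end = min(offset + length, self.file_size)
        if offset >= end:
            return b""
        stored = bytes(self.data[offset:min(end, self.vdl)])
        # Anything beyond VDL reads as zeros without touching storage.
        return stored + b"\x00" * (end - offset - len(stored))

    def write(self, offset, payload):
        end = offset + len(payload)
        if end > self.file_size:
            raise ValueError("this sketch does not extend FileSize")
        if offset > self.vdl:
            hole = offset - self.vdl  # gap between old VDL and the write
            self.data.extend(b"\x00" * hole)
            self.bytes_zeroed += hole
        self.data[offset:end] = payload
        self.vdl = max(self.vdl, end)


f = VdlFileModel(file_size=100)
f.write(0, b"abc")   # front-to-back write: no zeroing needed
f.write(10, b"z")    # write beyond VDL: 7 bytes of hole get zeroed
```

Writing at the very end of a freshly extended file makes the hole as large as the file itself, which is the cost the last bullet warns about.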

How to get data into the cache

Fast IO – Who Needs an FS?
Fast IO paths short circuit the IO to a common FsRtl routine or filesystem-provided call
This is just memory mapped IO, synchronizing with the FS for …
Extending FileSize up to AllocationSize!
  – VDL zeroing means the cache data is already good
  – Hint set in fileobject so FS will update directory
Extending ValidDataLength up to FileSize

Regular Cached IO
Filesystems also implement a cached path
Basically the same logic as the Fast IO path (or vice versa, depending)
Reuses the same Cc functions
Why not use Fast IO all the time?
  – file locks
  – oplocks
  – extending files (and so forth)

Copy Method
Used for user cached IO, both fast and IRP based
CcCopyRead maps and copies a mapped cache byte range into a buffer
CcCopyWrite copies a buffer into a mapped cache byte range and marks the range for writing
“Fast” versions of each are really the same code, but only taking 32-bit file offsets
  – NT used to run on 386s! :-)

Mdl (DMA) Method
Used by network transport layers
CcMdlRead returns an Mdl describing the specified byte range
CcMdlReadComplete frees the Mdl
CcPrepareMdlWrite returns an Mdl describing the specified byte range (may contain “smart” zeros with respect to VDL)
CcMdlWriteComplete frees the Mdl and marks the range for writing

Pagefault Cluster Hints
Taking a pagefault can result in Mm opportunistically bringing surrounding pages in (up to 7 or 15, depending)
Since Cc takes pagefaults on streams, but knows a lot about which pages are useful, Mm provides a hinting mechanism in the TLS
  – MmSetPageFaultReadAhead()
Not exposed to usermode …

Readahead
CcScheduleReadAhead detects patterns on a handle and schedules readahead into the next suspected ranges
  – Regular motion, backwards and forwards, with gaps
  – Private Cache Map contains the per-handle info
  – Called by CcCopyRead and CcMdlRead
Readahead granularity (64KB) controls the scheduling trigger points and length
  – Small IOs – don’t want readahead every 4KB
  – Large IOs – ya get what ya need (up to 8MB, thanks to Jim Gray)
CcPerformReadAhead maps and touch-faults pages in a Cc worker thread, will use the new Mm prefetch APIs in a future release
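
One way to picture per-handle pattern detection is the sketch below. The slides do not give the real heuristics, so everything here is an assumption: the class name, the three-read history, the constant-stride test, and the rule of triggering only when the predicted next read lands in a different 64KB granule.

```python
READ_AHEAD_GRANULARITY = 64 * 1024  # granularity given in the slides


class PrivateCacheMapModel:
    """Hypothetical per-handle readahead state (not Cc's real layout)."""

    def __init__(self):
        self.history = []  # offsets of the three most recent reads

    def note_read(self, offset, length):
        """Record a read; return (start, length) to prefetch, or None."""
        g = READ_AHEAD_GRANULARITY
        self.history = (self.history + [offset])[-3:]
        if len(self.history) < 3:
            return None
        a, b, c = self.history
        stride = c - b
        if stride == 0 or b - a != stride:
            return None  # no regular forward or backward motion yet
        predicted = c + stride
        if predicted < 0:
            return None  # backward motion ran off the start of the file
        start = (predicted // g) * g
        if start == (c // g) * g:
            return None  # next read stays in the current granule
        # Prefetch whole granules, at least one.
        prefetch_len = -(-max(length, 1) // g) * g
        return start, prefetch_len


pcm = PrivateCacheMapModel()
results = [pcm.note_read(off, 4096) for off in range(0, 17 * 4096, 4096)]
```

For seventeen sequential 4KB reads, the model schedules exactly one prefetch (of the second 64KB granule, when the read at offset 60KB arrives), which matches the "don't want readahead every 4KB" bullet.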

Unmap Behind
Recall how views are managed (misses)
On view miss, Cc will unmap two views behind the current (missed) view before mapping
Unmapped valid pages go to the standby list in LRU order and can be soft-faulted. In practice, this is where much of the actual cache is as of Windows 2000.
Unmap behind logic is default due to large file read/write operations causing huge swings in working set.
Mm’s working set trim falls down at the speed a disk can produce pages, Cc must help.

Cache Hints
Cache hints affect both read ahead and unmap behind
Two flags specifiable at Win32 CreateFile()
FILE_FLAG_SEQUENTIAL_SCAN
  – doubles readahead unit on handle, unmaps to the front of the standby list (MRU order) if all handles are SEQUENTIAL
FILE_FLAG_RANDOM_ACCESS
  – turns off readahead on handle, turns off unmap behind logic if any handle is RANDOM
Unfortunately, there is no way to split the effect
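
The per-handle and per-stream effects of the two flags can be summarised in a small model. The flag values are the ones defined in the Windows SDK headers; the function names, the returned dictionary, and the assumption that the base readahead unit equals the 64KB granularity are illustrative only.

```python
FILE_FLAG_RANDOM_ACCESS = 0x10000000    # value from the Windows SDK headers
FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000  # value from the Windows SDK headers
BASE_READ_AHEAD = 64 * 1024             # assumed equal to the 64KB granularity


def read_ahead_unit(handle_flags):
    """Per-handle effect: RANDOM turns readahead off, SEQUENTIAL doubles it."""
    if handle_flags & FILE_FLAG_RANDOM_ACCESS:
        return 0
    if handle_flags & FILE_FLAG_SEQUENTIAL_SCAN:
        return 2 * BASE_READ_AHEAD
    return BASE_READ_AHEAD


def stream_policy(all_handle_flags):
    """Per-stream effect, decided across every open handle on the stream."""
    flags = list(all_handle_flags)
    any_random = any(f & FILE_FLAG_RANDOM_ACCESS for f in flags)
    all_sequential = bool(flags) and all(
        f & FILE_FLAG_SEQUENTIAL_SCAN for f in flags
    )
    return {
        "unmap_behind": not any_random,
        # front of the standby list means these pages are reused first
        "standby_insert": "front" if all_sequential else "back",
    }
```

The model makes the last bullet concrete: one flag drives both the readahead unit and the unmap-behind behaviour, so a caller cannot ask for one without the other.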

Write Throttling
Avoids out of memory problems by delaying writes to the cache
  – Filling memory faster than writeback speed is not useful, we may as well run into it sooner
Throttle limit is twofold
  – CcDirtyPageThreshold – dynamic, but ~1500 on all current machines (small, but see above)
  – MmAvailablePages & pagefile page backlog
CcCanIWrite sees if write is ok, optionally blocking, also serving as the restart test
CcDeferWrite sets up for callback when write should be allowed (async case)
!defwrites debugger extension triages and shows the state of the throttle
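
The check-then-defer pattern can be sketched as follows. This is a simplified model, not the kernel routines: the class and method names are hypothetical, the pagefile backlog is ignored, and `min_available_pages=256` is an invented floor. Only the ~1500 page threshold comes from the slide.

```python
from collections import deque


class WriteThrottleModel:
    """Hypothetical model of the twofold throttle limit."""

    def __init__(self, dirty_page_threshold=1500, min_available_pages=256):
        self.dirty_page_threshold = dirty_page_threshold  # ~1500 per the slide
        self.min_available_pages = min_available_pages    # assumed value
        self.dirty_pages = 0
        self.deferred = deque()  # (pages, callback) waiting for room

    def can_i_write(self, pages, available_pages):
        """Both limits must hold; also used as the restart test."""
        return (self.dirty_pages + pages <= self.dirty_page_threshold
                and available_pages >= self.min_available_pages)

    def write(self, pages):
        self.dirty_pages += pages

    def defer_write(self, pages, callback):
        """Async case: remember the write and call back when it is allowed."""
        self.deferred.append((pages, callback))

    def pages_cleaned(self, pages, available_pages):
        """Writeback finished: lower the dirty count, restart deferred writes."""
        self.dirty_pages = max(0, self.dirty_pages - pages)
        while self.deferred and self.can_i_write(self.deferred[0][0],
                                                 available_pages):
            queued_pages, callback = self.deferred.popleft()
            self.write(queued_pages)
            callback()


throttle = WriteThrottleModel()
fired = []
throttle.write(1000)
if not throttle.can_i_write(600, available_pages=10000):
    throttle.defer_write(600, lambda: fired.append("restarted"))
throttle.pages_cleaned(200, available_pages=10000)
```

After 200 pages are cleaned, the deferred 600-page write fits under the threshold and its callback runs.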

Writing Cached Data
There are three basic sets of threads involved, only one of which is Cc’s
  – Mm’s modified page writer: the paging file
  – Mm’s mapped page writer: almost anything else
  – Cc’s lazy writer pool, executing in the kernel critical work queue: writes data produced through Cc interfaces

The Lazy Writer
The name is misleading; it’s really delayed
All files with dirty data have been queued onto CcDirtySharedCacheMapList
Work queueing – CcLazyWriteScan()
  – Once per second, queues work to arrive at writing 1/8th of dirty data given current dirty and production rates
  – Fairness considerations are interesting
  – CcLazyWriterCursor rotated around the list, pointing at the next file to operate on (fairness)
  – 16th pass rule for user and metadata streams
Work issuing – CcWriteBehind()
  – Uses a special mode of CcFlushCache() which flushes front to back (HotSpots – fairness again)
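
A single scan pass can be sketched as a round-robin walk with a page budget. This simplification is an assumption throughout: it uses a flat one-eighth target, ignores production rates and the 16th pass rule, and the function name and list layout are hypothetical.

```python
def lazy_write_scan(dirty_files, cursor):
    """One simplified scan pass.

    dirty_files is a list of [name, dirty_pages] entries (mutated in place);
    cursor is the index of the next file to consider. Returns the queued
    (name, pages) work items and the new cursor position.
    """
    total = sum(pages for _, pages in dirty_files)
    if total == 0:
        return [], cursor
    target = max(1, total // 8)  # aim at 1/8th of the dirty data per pass
    queued = []
    count = len(dirty_files)
    visited = 0
    while target > 0 and visited < count:
        entry = dirty_files[cursor % count]
        cursor = (cursor + 1) % count  # rotate the cursor for fairness
        visited += 1
        take = min(entry[1], target)
        if take:
            queued.append((entry[0], take))
            entry[1] -= take
            target -= take
    return queued, cursor


files = [["a", 40], ["b", 8], ["c", 32]]
first, cursor = lazy_write_scan(files, 0)
second, cursor = lazy_write_scan(files, cursor)
third, cursor = lazy_write_scan(files, cursor)
```

Because the cursor keeps rotating, three passes touch three different files even though file "a" holds half of the dirty pages.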

Valid Data Length Calls
Cache Manager knows the highest offset successfully written to disk – via the lazy writer
File system is informed by a special FileEndOfFileInformation call after each write which extends/maintains VDL
FS which persist VDL to disk (NTFS) push that down here
FS use it as a hint to update directory entries (recall Fast IO extension, one among several)
CcFlushCache() flushing front to back is important so we move VDL on disk as soon as possible.

Letting the Filesystem Into The Cache
Two distinct access interfaces
  – Map – given File+FileOffset, return a cache address
  – Pin – same, but acquires synchronization – this is a range lock on the stream
Lazy writer acquires synchronization, allowing it to serialize metadata production with metadata writing
Pinning also allows setting of a log sequence number (LSN) on the update, for transactional FS
  – FS receives an LSN callback from the lazy writer prior to range flush

Remember FsContext2?
Synchronization on Pin interfaces requires that Cc be the writer of the data
Mm provides a method to turn off the mapped page writer for a stream, MmDisableModifiedWriteOfSection()
  – confusing name, I know (modified writer is not involved)
Serves as the trigger for Cc to perform synchronization on write

Mapping/Pinning Method
CcMapData to map byte range for read access
CcPinRead to map byte range for read/write access
CcPreparePinWrite
CcPinMappedData
CcSetDirtyPinnedData
CcUnpinData

BCBs and Lies Thereof
Mapping and Pinning interfaces return opaque Buffer Control Block (BCB) pointers
Unpin receives BCBs to indicate regions
BCBs for Map interfaces are usually VACB pointers
BCBs for Pin interfaces are pointers to a real BCB structure in Cc, which references a VACB for the cache address

Basic Maintenance Functions
CcSetFileSizes
  – used by FS to tell Cc/Mm when it changes file sizes
  – updates VDL goal for the callback
  – extends data sections, purges on truncate
CcFlushCache
CcPurgeCacheSection
CcZeroData
  – used during VDL zeroing for MDL hack
... and a few others, of course

Cc Grab Bag
Every component has them, let’s take a short tour of some of the more interesting ones
This is the R-rated portion of today’s presentation :-)

We Can’t Use That 16GB
It takes on the order of 2-3KB pool (minimum) to cache a single stream once the handle has been closed
  – Cc structures are torn down at handle close
  – File object, Filesystem contexts (FCB, CCB, MCB), Mm section structures, PTEs, etc. remain, and add up.
If the cache is dominated by small files, pool limits dominate the ability to cache. Mm must trim files from the cache if pool fills.
This dramatically limits the effective cache size on large machines in reasonable scenarios
Motivates thinking about a large-machine cache at the volume level (block cache!), much to our chagrin – lower overhead per page of data
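
The pool limit is easy to see with a little arithmetic. In the sketch below only the 2-3KB per-stream overhead comes from the slide; the 256MB pool budget and the 8KB average file size are hypothetical numbers chosen for illustration, and the function names are invented.

```python
def max_cached_streams(pool_bytes, per_stream_overhead_bytes):
    """How many closed-handle streams fit before pool runs out."""
    return pool_bytes // per_stream_overhead_bytes


def cacheable_data_bytes(pool_bytes, per_stream_overhead_bytes, avg_file_bytes):
    """Upper bound on cached file data when every stream is a small file."""
    streams = max_cached_streams(pool_bytes, per_stream_overhead_bytes)
    return streams * avg_file_bytes


POOL = 256 * 2**20   # hypothetical pool budget: 256MB
OVERHEAD = 3 * 1024  # upper end of the 2-3KB figure from the slide
AVG_FILE = 8 * 1024  # hypothetical average file size: 8KB

streams = max_cached_streams(POOL, OVERHEAD)
data = cacheable_data_bytes(POOL, OVERHEAD, AVG_FILE)
print(streams, data)
```

Under these assumed numbers about 87 thousand streams fit, holding roughly 683MB of file data, so most of a 16GB machine could not be used to cache such a workload.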

Single Instancing is Security Context Insensitive
What is single instancing and why does it matter?
Recall the datastructures – Mm and Cc need a fileobject to reference. They choose the first fileobject seen for the file.
Terminal Server creates an unanticipated case – multiple security contexts – a lot more often
If multiple security contexts reference a file, page faults and flushes still only occur in the context of that first fileobject
User logs out or SMB connection backing user’s fileobject is torn down – oops!

Single Instancing is Security Context Insensitive (continued)
Windows XP SMB RDR must break single instancing down to the security context level to solve this.
This means that each TS session will have a unique data section, and thus cache, for each file that otherwise could be shared between sessions. Non optimal.
An optimal solution will require a lot of work that isn’t well understood yet.

Mapping Is An Expense
Requires Mm to pick up spinlocks
  – easy case – MP scale problem
Cc MDL functions have to map part of the cache to build MDLs (and then unmap)
Mm may provide an API which does not need a virtual address
  – recall comment about readahead and mapping

Views Are Large
Mm provides views at a fixed size, 256KB
Many files are smaller, some are larger
If we had a pool of views at 64KB, we may avoid view misses under heavy load
Mediating factors
  – Once misses start, they’ll happen as fast with small as large
  – Can’t be used for large files or mapping cost will skyrocket
  – As a result, benefit depends on file size mix
  – Overallocation of small views would eat into VA for large views
An area to be investigated

Flushes Are Synchronous
Mm only has a synchronous flush API
An asynchronous paging IO flush would require FS rework as well
There are only so many critical worker threads, and so many workers Cc can schedule
As a result, effective bandwidth of the mapped and lazy writers is limited

Cache Manager Summary
Virtual block cache for files, not logical block cache for disks
Memory manager is the ACTUAL cache manager
Cache Manager context integrated into FileObjects
Cache Manager manages views on files in kernel virtual address space
I/O has special fast path for cached accesses
The Lazy Writer periodically flushes dirty data to disk
Filesystems need two interfaces to Cc: map and pin

Discussion