Linux Review (2nd Half) COMS W4118 Spring 2008.

Linux Review (2nd Half) COMS W4118 Spring 2008

Interrupts in Linux COMS W4118 Spring 2008

Interrupts Forcibly change normal flow of control
Similar to context switch (but lighter weight) Hardware saves some context on stack; Includes interrupted instruction if restart needed Enters kernel at a specific point; kernel then figures out which interrupt handler should run Execution resumes with special “iret” instruction Many different types of interrupts

Types of Interrupts Asynchronous Synchronous (also called exceptions)
From external source, such as I/O device Not related to instruction being executed Synchronous (also called exceptions) Processor-detected exceptions: Faults — correctable; offending instruction is retried Traps — often for debugging; instruction is not retried Aborts — major error (hardware failure) Programmed exceptions: Requests for kernel intervention (software intr/syscalls)

CPU’s ‘fetch-execute’ cycle
User Program Fetch instruction at IP ld add st mul sub bne jmp … Decode the fetched instruction Save context IP Get INTR ID Execute the decoded instruction Lookup ISR Advance IP to next instruction Execute ISR IRQ? yes IRET no

Faults Instruction would be illegal to execute Examples:
Writing to a memory segment marked ‘read-only’ Reading from an unavailable memory segment (on disk) Executing a ‘privileged’ instruction Detected before incrementing the IP The causes of ‘faults’ can often be ‘fixed’ If a ‘problem’ can be remedied, then the CPU can just resume its execution-cycle

Traps A CPU might have been programmed to automatically switch control to a ‘debugger’ program after it has executed an instruction That type of situation is known as a ‘trap’ It is activated after incrementing the IP

Error Exceptions Most error exceptions — divide by zero, invalid operation, illegal memory reference, etc. — translate directly into signals This isn’t a coincidence. . . The kernel’s job is fairly simple: send the appropriate signal to the current process force_sig(sig_number, current); That will probably kill the process, but that’s not the concern of the exception handler One important exception: page fault An exception can (infrequently) happen in the kernel die(); // kernel oops

Programmable Interval-Timer
Interrupt Hardware Legacy PC Design (for single-proc systems) IRQs Slave PIC (8259) Master PIC (8259) x86 CPU Ethernet INTR SCSI Disk Real-Time Clock Keyboard Controller Programmable Interval-Timer I/O devices have (unique or shared) Interrupt Request Lines (IRQs) IRQs are mapped by special hardware to interrupt vectors, and passed to the CPU This hardware is called a Programmable Interrupt Controller (PIC)

The `Interrupt Controller’
Responsible for telling the CPU when a specific external device wishes to ‘interrupt’ Needs to tell the CPU which one among several devices is the one needing service PIC translates IRQ to vector Raises interrupt to CPU Vector available in register Waits for ack from CPU Interrupts can have varying priorities PIC also needs to prioritize multiple requests Possible to “mask” (disable) interrupts at PIC or CPU Early systems cascaded two 8 input chips (8259A)

APIC, IO-APIC, LAPIC Advanced PIC (APIC) for SMP systems
Used in all modern systems Interrupts “routed” to CPU over system bus IPI: inter-processor interrupt Local APIC (LAPIC) versus “frontend” IO-APIC Devices connect to front-end IO-APIC IO-APIC communicates (over bus) with Local APIC Interrupt routing Allows broadcast or selective routing of interrupts Ability to distribute interrupt handling load Routes to lowest priority process Special register: Task Priority Register (TPR) Arbitrates (round-robin) if equal priority

Assigning IRQs to Devices
IRQ assignment is hardware-dependent Sometimes it’s hardwired, sometimes it’s set physically, sometimes it’s programmable PCI bus usually assigns IRQs at boot Some IRQs are fixed by the architecture IRQ0: Interval timer IRQ2: Cascade pin for 8259A Linux device drivers request IRQs when the device is opened Note: especially useful for dynamically-loaded drivers, such as for USB or PCMCIA devices Two devices that aren’t used at the same time can share an IRQ, even if the hardware doesn’t support simultaneous sharing

Assigning Vectors to IRQs
Vector: index (0-255) into interrupt descriptor table Vectors usually IRQ# + 32 Below 32 reserved for non-maskable intr & exceptions Maskable interrupts can be assigned as needed Vector 128 used for syscall Vectors used for IPI

Interrupt Descriptor Table
The ‘entry-point’ to the interrupt-handler is located via the Interrupt Descriptor Table (IDT) IDT: “gate descriptors” Segment selector + offset for handler Descriptor Privilege Level (DPL) Gates (slightly different ways of entering kernel) Task gate: includes TSS to transfer to (not used by Linux) Interrupt gate: disables further interrupts Trap gate: further interrupts still allowed

Interrupt Masking Two different types: global and per-IRQ
Global — delays all interrupts Selective — individual IRQs can be masked selectively Selective masking is usually what’s needed — interference most common from two interrupts of the same type

Putting It All Together
Memory Bus IRQs PIC CPU idtr IDT INTR vector N handler Mask points 255

Dispatching Interrupts
Each interrupt has to be handled by a special device- or trap-specific routine Interrupt Descriptor Table (IDT) has gate descriptors for each interrupt vector Hardware locates the proper gate descriptor for this interrupt vector, and locates the new context A new stack pointer, program counter, CPU and memory state, etc., are loaded Global interrupt mask set The old program counter, stack pointer, CPU and memory state, etc., are saved on the new stack The specific handler is invoked

Handling Nested Interrupts
As soon as possible, unmask the global interrupt As soon as reasonable, re-enable interrupts from that IRQ But that isn’t always a great idea, since it could cause re-entry to the same handler IRQ-specific mask is not enabled during interrupt-handling

Nested Execution Interrupts can be interrupted
By different interrupts; handlers need not be reentrant No notion of priority in Linux Small portions execute with interrupts disabled Interrupts remain pending until acked by CPU Exceptions can be interrupted By interrupts (devices needing service) Exceptions can nest two levels deep Exceptions indicate coding error Exception code (kernel code) shouldn’t have bugs Page fault is possible (trying to touch user data)

Interrupt Handling Philosophy
Do as little as possible in the interrupt handler Defer non-critical actions till later Structure: top and bottom halves Top-half: do minimum work and return (ISR) Bottom-half: deferred processing (softirqs, tasklets) Again — want to do as little as possible with IRQ interrupts masked No process context available

No Process Context Interrupts (as opposed to exceptions) are not associated with particular instructions They’re also not associated with a given process The currently-running process, at the time of the interrupt, as no relationship whatsoever to that interrupt Interrupt handlers cannot refer to current Interrupt handlers cannot sleep!

Interrupt Stack When an interrupt occurs, what stack is used?
Exceptions: The kernel stack of the current process, whatever it is, is used (There’s always some process running — the “idle” process, if nothing else) Interrupts: hard IRQ stack (1 per processor) SoftIRQs: soft IRQ stack (1 per processor) These stacks are configured in the IDT and TSS at boot time by the kernel

First-Level Interrupt Handler
Often in assembler Perform minimal, common functions: save registers, unmask other interrupts Eventually, undoes that: restores registers, returns to previous context Most important: call proper second-level interrupt handler (C program)

Finding the Proper Handler
On modern hardware, multiple I/O devices can share a single IRQ and hence interrupt vector First differentiator is the interrupt vector Multiple interrupt service routines (ISR) can be associated with a vector Each device’s ISR for that IRQ is called; the determination of whether or not that device has interrupted is device-dependent

Deferrable Work We don’t want to do too much in regular interrupt handlers: Interrupts are masked We don’t want the kernel stack to grow too much Instead, interrupt handlers schedule work to be performed later Three deferred work mechanisms: softirqs, tasklets, and work queues Tasklets are built on top of softirqs For all of these, requests are queued

Softirqs Statically allocated: specified at kernel compile time
Limited number: Priority Type 0 High-priority tasklets 1 Timer interrupts 2 Network transmission 3 Network reception 4 SCSI disks 5 Regular tasklets

Running Softirqs Run at various points by the kernel
System calls, exceptions Most important: after handling IRQs and after timer interrupts Softirq routines can be executed simultaneously on multiple CPUs: Code must be re-entrant Code must do its own locking as needed Hardware interrupts always enabled when softirqs are running.

Rescheduling Softirqs
A softirq routine can reschedule itself This could starve user-level processes Softirq scheduler only runs a limited number of requests at a time The rest are executed by a kernel thread, ksoftirqd, which competes with user processes for CPU time

Tasklets Similar to softirqs Created and destroyed dynamically
Individual tasklets are locked during execution; no problem about re-entrancy, and no need for locking by the code Only one instance of tasklet can run, even with multiple CPUs Preferred mechanism for most deferred activity

Work Queues Always run by kernel threads
Softirqs and tasklets run in an interrupt context; work queues have a process context Because they have a process context, they can sleep However, they’re kernel-only; there is no user mode associated with it

Synchronization in Linux
COMS W4118 Spring 2008

Linux Synch Primitives
Memory barriers avoids compiler, cpu instruction re-ordering Atomic operations memory bus lock, read-modify-write ops Interrupt/softirq disabling/enabling Local, global Spin locks general, read/write, big reader Semaphores general, read/write

Choosing Synch Primitives
Avoid synch if possible! (clever instruction ordering) Example: inserting in linked list (needs barrier still) Use atomics or rw spinlocks if possible Use semaphores if you need to sleep Can’t sleep in interrupt context Don’t sleep holding a spinlock! Complicated matrix of choices for protecting data structures accessed by deferred functions

Architectural Dependence
The implementation of the synchronization primitives is extremely architecture dependent. This is because only the hardware can guarantee atomicity of an operation. Each architecture must provide a mechanism for doing an operation that can examine and modify a storage location atomically. Some architectures do not guarantee atomicity, but inform whether the operation attempted was atomic.

Barriers: Definition Barriers are used to prevent a processor and/or the compiler from reordering instruction execution and memory modification. Barriers are instructions to hardware and/or compiler to complete all pending accesses before issuing any more read memory barrier – acts on read requests write memory barrier – acts on write requests Intel – certain instructions act as barriers: lock, iret, control regs rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory") add 0 to top of stack with lock prefix wmb – Intel never re-orders writes, just for compiler

Atomic Operations Many instructions not atomic in hardware (smp)
Read-modify-write instructions: inc, test-and-set, swap unaligned memory access Compiler may not generate atomic code even i++ is not necessarily atomic! If the data that must be protected is a single word, atomic operations can be used. These functions examine and modify the word atomically. The atomic data type is atomic_t. Intel implementation lock prefix byte 0xf0 – locks memory bus

Serializing with Interrupts
Basic primitive in original UNIX Doesn’t protect against other CPUs Intel: “interrupts enabled bit” cli to clear (disable), sti to set (enable) Enabling is often wrong; need to restore local_irq_save() local_irq_restore()

Interrupt Operations Services used to serialize with interrupts are:
local_irq_disable - disables interrupts on the current CPU local_irq_enable - enable interrupts on the current CPU local_save_flags - return the interrupt state of the processor local_restore_flags - restore the interrupt state of the processor Dealing with the full interrupt state of the system is officially discouraged. Locks should be used.

Spin Locks A spin lock is a data structure (spinlock_t ) that is used to synchronize access to critical sections. Only one thread can be holding a spin lock at any moment. All other threads trying to get the lock will “spin” (loop while checking the lock status). Spin locks should not be held for long periods because waiting tasks on other CPUs are spinning, and thus wasting CPU execution time.

__raw_spin_lock_string
1: lock; decb %0 # atomically decrement jns 3f # if clear sign bit jump forward to 3 2: rep; nop # wait cmpb $0, %0 # spin – compare to 0 jle 2b # go back to 2 if <= 0 (locked) jmp 1b # unlocked; go back to 1 to try again 3: # we have acquired the lock … From linux/include/asm-i386/spinlock.h spin_unlock merely writes 1 into the lock field.

Spin Lock Operations Functions used to work with spin locks:
spin_lock_init – initialize a spin lock before using it for the first time spin_lock – acquire a spin lock, spin waiting if it is not available spin_unlock – release a spin lock spin_unlock_wait – spin waiting for spin lock to become available, but don't acquire it spin_trylock – acquire a spin lock if it is currently free, otherwise return error spin_is_locked – return spin lock state

Spin Locks & Interrupts
The spin lock services also provide interfaces that serialize with interrupts (on the current processor): spin_lock_irq - acquire spin lock and disable interrupts spin_unlock_irq - release spin lock and reenable spin_lock_irqsave - acquire spin lock, save interrupt state, and disable spin_unlock_irqrestore - release spin lock and restore interrupt state

RW Spin Locks & Interrupts
The read/write lock services also provide interfaces that serialize with interrupts (on the current processor): read_lock_irq - acquire lock for read and disable interrupts read_unlock_irq - release read lock and reenable read_lock_irqsave - acquire lock for read, save interrupt state, and disable read_unlock_irqrestore - release read lock and restore interrupt state Corresponding functions for write exist as well (e.g., write_lock_irqsave).

The Big Reader Lock Reader optimized RW spinlock
RW spinlock suffers cache contention On lock and unlock because of write to rwlock_t Per-CPU, cache-aligned lock arrays One for reader portion, another for writer portion To read: set bit in reader array, spin on writer Acquire when writer lock free; very fast! To write: set bit and scan ALL reader bits Acquire when reader bits all free; very slow!

Semaphores A semaphore is a data structure that is used to synchronize access to critical sections or other resources. A semaphore allows a fixed number of tasks (generally one for critical sections) to "hold" the semaphore at one time. Any more tasks requesting to hold the semaphore are blocked (put to sleep). A semaphore can be used for serialization only in code that is allowed to block.

wait_queue_head_t wait
Semaphore Structure Struct semaphore count (atomic_t): > 0: free; = 0: in use, no waiters; < 0: in use, waiters wait: wait queue sleepers: 0 (none), 1 (some), occasionally 2 implementation requires lower-level synch atomic updates, spinlock, interrupt disabling struct semaphore atomic_t count int sleepers wait_queue_head_t wait lock prev next

Semaphores optimized assembly code for normal case (down())
C code for slower “contended” case (__down()) up() is easy atomically increment; wake_up() if necessary uncontended down() is easy atomically decrement; continue contended down() is really complex! basically increment sleepers and sleep loop because of potentially concurrent ups/downs still in down() path when lock is acquired

RW Semaphores A rw_semaphore is a semaphore that allows either one writer or any number of readers (but not both at the same time) to hold it. Any writer requesting to hold the rw_semaphore is blocked when there are readers holding it. A rw_semaphore can be used for serialization only in code that is allowed to block. Both types of semaphores are the only synchronization objects that should be held when blocking. Writers will not starve: once a writer arrives, readers queue behind it Increases concurrency; introduced in 2.4

Mutexes A mutex is a data structure that is also used to synchronize access to critical sections or other resources, introduced in Why? (Documentation/mutex-design.txt) simpler (lighter weight) tighter code slightly faster, better scalability no fastpath tradeoffs debug support – strict checking of adhering to semantics

Linux Virtual File System (VFS) and Ext2
COMS W4118 Spring 2008

Linux File System Model
basically UNIX file semantics File systems are mounted at various points Files identified by device inode numbers VFS layer just dispatches to fs-specific functions libc read() -> sys_read() what type of filesystem does this file belong to? call filesystem (fs) specific read function maintained in open file object (file) example: file->f_op->read(…) similar to device abstraction model in UNIX

VFS System Calls fundamental UNIX abstractions
files (everything is a file) ex: /dev/ttyS0 – device as a file ex: /proc/123 – process as a file processes users lots of syscalls related to files! (~100) most dispatch to filesystem-specific calls some require no filesystem action example: lseek(pos) – change position in file others have default VFS implementations

VFS System Calls (cont.)
filesystem ops – mounting, info, flushing, chroot, pivot_root directory ops – chdir, getcwd, link, unlink, rename, symlink file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup fcntl, creat, inode ops – stat, permissions, chmod, chown memory mapping files – mmap, munmap, madvise, mlock wait for input – poll, select flushing – sync, fsync, msync, fdatasync file locking – flock

Big Four Data Structures
struct file information about an open file includes current position (file pointer) struct dentry information about a directory entry includes name + inode# struct inode unique descriptor of a file or directory contains permissions, timestamps, block map (data) inode#: integer (unique per mounted filesystem) struct superblock descriptor of a mounted filesystem

Two More Data Structures
struct file_system_type name of file system pointer to implementing module including how to read a superblock On module load, you call register_file_system and pass a pointer to this structure struct vfsmount Represents a mounted instance of a particular file system One super block can be mounted in two places, with different covering sub mounts Thus lookup requires parent dentry and a vfsmount

Data Structure Relationships
task_struct inode cache inode inode inode open file object fds open file object superblock dentry open file object disk dentry cache

Sharing Data Structures
calling dup() – shares open file objects example: 2>&1 opening the same file twice – shares dentries opening same file via different hard links – shares inodes mounting same filesystem on different dirs – shares superblocks

Superblock important fields mounted filesystem descriptor
usually first block on disk (after boot block) copied into (similar) memory structure on mount distinction: disk superblock vs memory superblock dirty bit (s_dirt), copied to disk frequently important fields s_dev, s_bdev – device, device-driver s_blocksize, s_maxbytes, s_type s_flags, s_magic, s_count, s_root, s_dquot s_dirty – dirty inodes for this filesystem s_op – superblock operations u – filesystem specific data

Inode inode # - unique integer (per-mounted filesystem)
"index" node – unique file or directory descriptor meta-data: permissions, owner, timestamps, size, link count data: pointers to disk blocks containing actual data data pointers are "indices" into file contents (hence "inode") inode # - unique integer (per-mounted filesystem) what about names and paths? high-level fluff on top of a "flat-filesystem" implemented by directory files (directories) directory contents: name + inode

File Links UNIX link semantics
hard links – multiple dir entries with same inode # equal status; first is not "real" entry file deleted when link count goes to 0 restrictions can't hard link to directories (avoids cycles) or across filesystems soft (symbolic) links – little files with pathnames just aliases for another pathname no restrictions, cycles possible, dangling links possible

(Open) File Object struct file (usual variable name - filp)
association between file and process no disk representation created for each open (multiple possible, even same file) most important info: file pointer file descriptor (small ints) index into array of pointers to open file objects file object states unused (memory cache + root reserve (10)) get_empty_filp() inuse (per-superblock lists) system-wide max on open file objects (~8K) /proc/sys/fs/file-max

Dentry abstraction of directory entry directory api: dentry iterators
ex: line from ls -l either files (hard links) or soft links or subdirectories every dentry has a parent dentry (except root) sibling dentries – other entries in the same directory directory api: dentry iterators posix: opendir(), readdir(), scandir(), seekdir(), rewinddir() syscall: getdents() why an abstraction? UNIX: directories are really files with directory "records" MSDOS, etc.: directory is just a big table on disk (FAT) no such thing as subdirectories! just fields in table (file->parentdir), (dir->parentdir)

Dentry Cache very important cache for filesystem performance
every file access causes multiple dentry accesses! example: /tmp/foo dentries for "/", "/tmp", "/tmp/foo" (path components) dentry cache "controls" inode cache inodes released only when dentry is released dentry cache accessed via hash table hash(dir, filename) -> dentry

Dentry Cache (continued)
dentry states free (not valid; maintained by slab cache) in-use (associated with valid open inode) unused (valid but not being used; LRU list) negative (file that does not exist) dentry ops just a few, mostly default actions ex: d_compare(dir, name1, name2) case-insensitive for MSDOS

Process-related Files
current->fs (fs_struct) root (for chroot jails) pwd umask (default file permissions) current->files (files_struct) fd[] (file descriptor array – pointers to file objects) 0, 1, 2 – stdin, stdout, stderr originally 32, growable to 1,024 (RLIMIT_NOFILE) complex structure for growing … see book close_on_exec memory (bitmap) open files normally inherited across exec

Filesystem Types Linux must "know about" filesystem before mount
multiple (mounted) instances of each type possible special (virtual) filesystems (like /proc) structuring technique to touch kernel data examples: /proc, /dev (devfs) sockfs, pipefs, tmpfs, rootfs, shmfs associated with fictitious block device (major# 0) minor# distinguishes special filesystem types

Registering a Filesystem Type
must register before mount static (compile-time) or dynamic (modules) register_filesystem() / unregister_filesystem adds file_system_type object to linked-list file_systems (head; kernel global variable) file_systems_lock (rw spinlock to protect list) file_system_type descriptor name, flags, pointer to implementing module list of superblocks (mounted instances) read_super() – pointer to method for reading superblock most important thing! filesystem specific

Data Structure Relationships (2)
open file object dentry f_dentry d_subdirs d_inode d_parent i_sb d_sb i_dentries open file object dentry dentry dentry inode inode inode ┴ open file object superblock

Ext2 “Standard” Linux File System Uses FFS like layout
Previously it was the most commonly used Serves as a basis for Ext3 which adds journaling Uses FFS like layout Each file system is composed of identical block groups Allocation is designed to improve locality Inodes contain pointers (32 bits) to blocks Direct, Indirect, Double Indirect, Triple Indirect Maximum file size: 4.1TB (4K Blocks) Maximum file system size: 16TB (4K Blocks)

Ext2 Disk Layout Files in the same directory are stored in the same block group Files in different directories are spread among the block groups Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved

Block Addressing in Ext2
Twelve “direct” blocks Data Block Inode Data Block Data Block BLKSIZE/4 Indirect Blocks Data Block Data Block Data Block Data Block Indirect Blocks (BLKSIZE/4)2 Data Block Data Block Double Indirect Indirect Blocks Data Block Data Block Data Block Data Block (BLKSIZE/4)3 Data Block Indirect Blocks Data Block Triple Indirect Double Indirect Data Block Indirect Blocks Data Block Data Block Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved

Ext2 Directory Structure
(a) A Linux directory with three files. (b) The same directory after the file voluminous has been removed. Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved

Linux Review (2nd Half) COMS W4118 Spring 2008.

Similar presentations

Presentation on theme: "Linux Review (2nd Half) COMS W4118 Spring 2008."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Linux Review (2nd Half) COMS W4118 Spring 2008.

Similar presentations

Presentation on theme: "Linux Review (2nd Half) COMS W4118 Spring 2008."— Presentation transcript:

Similar presentations

About project

Feedback