Kernel Synchronization

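Prof. 羅習五, Graduate Institute of Computer Science and Information Engineering, National Chung Cheng University. A small portion of the content is adapted from Prof. 薛智文.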

Chapter 5: Kernel Synchronization
  Kernel Control Paths
  When Synchronization Is Not Necessary
  Synchronization Primitives
  Synchronizing Accesses to Kernel Data Structures
  Examples of Race Condition Prevention

Kernel
  You can think of the kernel as a server that answers requests. A request can come either from a process running on a CPU or from an external device issuing an interrupt request.
  Top halves
  Bottom halves

Kernel Control Paths
  Kernel Control Path (KCP): a sequence of instructions executed by the kernel to handle an interrupt or exception of some kind.
  Each kernel request is handled by a different KCP.
  System call request (software interrupt): system_call -> ret_from_sys_call
  Figure labels: System call, Top halves, Bottom halves

Kernel Requests
  A process executing in User Mode causes an exception (e.g., a division by zero).
  A process executing in Kernel Mode causes a Page Fault exception.
  An external device sends a signal to a programmable interrupt controller (PIC), and the corresponding interrupt is enabled.
  A running process raises an interprocessor interrupt (IPI).

Kernel Control Paths
  The CPU interleaves KCPs when:
    A process switch occurs (the process relinquishes control of the CPU, e.g., sleep/wait).
    An interrupt occurs.
    A deferrable function is executed.
  Interleaving improves the throughput of the PIC and of device controllers.

A fully preemptable kernel
  Nonpreemptive kernel vs. preemptive kernel?
    Nonpreemptive kernel: Linux kernels up to 2.4
    Preemptive kernel: Linux kernel 2.6
  preempt_count* (kernel 2.6 and later)
    The value is greater than 0 when:
      The kernel is executing an ISR.
      The deferrable functions are disabled.
      Kernel preemption has been explicitly disabled.
  *: This field is in the thread_info descriptor.
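The preempt_count rule can be sketched in user-space C. This is only an illustration of the counting idea, not kernel code: the names preempt_count_sketch, preempt_disable_sketch, preempt_enable_sketch and preemptible_sketch are hypothetical stand-ins, and a real kernel keeps one counter per thread in thread_info.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the preempt_count field of thread_info. */
int preempt_count_sketch;

/* Each disable nests: the counter records how many reasons forbid preemption. */
void preempt_disable_sketch(void) { preempt_count_sketch++; }

/* Each enable removes one reason. */
void preempt_enable_sketch(void) { preempt_count_sketch--; }

/* Preemption is allowed only when no reason is left, i.e. the counter is 0. */
bool preemptible_sketch(void) { return preempt_count_sketch == 0; }
```

Because the counter nests, a code path that disables preemption twice stays nonpreemptable until both disables have been undone.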

When Sync. Is Not Necessary
  Simplifying assumptions:
    Interrupt handlers and tasklets need not be coded as reentrant functions.
    Interrupt handlers, softirqs, and tasklets are nonpreemptable and nonblocking.
    Per-CPU variables accessed only by softirqs and tasklets do not require sync.
    A data structure accessed by only one kind of tasklet does not require sync.

Synchronization Primitives
  Per-CPU variables: one element per each CPU in the system
  Atomic operations: memory bus lock, read-modify-write (rmw) ops
  Memory barriers: avoid compiler and CPU instruction re-ordering
  Spin locks: only on SMP systems; keep them short! (general, read/write, big reader)
  Semaphores: general, read/write
  Local interrupt disabling
  Local softirq disabling
  Read-copy-update (RCU)

Synchronization Primitives
  Technique                    Description                                               Scope
  Atomic operation             Atomic read-modify-write instruction to a counter         All CPUs
  Memory barrier               Avoid instruction re-ordering                             Local CPU
  Spin lock                    Lock with busy wait                                       All CPUs
  Semaphore                    Lock with blocking wait                                   All CPUs
  Local interrupt disabling    Forbid interrupt handling on a single CPU                 Local CPU
  Local softirq disabling      Forbid deferrable function handling on a single CPU       Local CPU
  Global interrupt disabling   Forbid interrupt and softirq handling on all CPUs         All CPUs

Atomic Operations
  Many instructions are not atomic in hardware (MP):
    rmw instructions: inc, test-and-set, swap
    unaligned memory access
    rep instructions
  The compiler may not generate atomic code: even i++ is not necessarily atomic! (i = i + 1)
  Linux: atomic_ macros
    atomic_t: 24-bit atomic counters
  Intel implementation (atomic, for MP):
    lock prefix byte 0xf0, which locks the memory bus
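The point that i++ is really i = i + 1 can be made concrete with a small user-space sketch. It uses C11 <stdatomic.h> rather than the kernel's atomic_ macros, and the function names racy_increment and atomic_increment are made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* i = i + 1 spelled out: load, add, store are three separate steps,
 * so another CPU can modify *i between the load and the store. */
int racy_increment(int *i)
{
    int tmp = *i;   /* load  */
    tmp = tmp + 1;  /* add   */
    *i = tmp;       /* store */
    return tmp;
}

/* One indivisible read-modify-write; on x86 compilers typically emit a
 * lock-prefixed instruction for this. Returns the new value. */
int atomic_increment(atomic_int *v)
{
    return atomic_fetch_add(v, 1) + 1;
}
```

Both functions give the same answer when only one thread runs; they differ only when several CPUs update the same counter at the same time.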

Atomic operations in Linux
  Function                     Description
  atomic_read(v)               Return *v
  atomic_set(v, i)             Set *v to i
  atomic_add(i, v)             Add i to *v
  atomic_sub(i, v)             Subtract i from *v
  atomic_sub_and_test(i, v)    Subtract i from *v and return 1 if the result is zero; 0 otherwise
  atomic_inc(v)                Add 1 to *v
  atomic_dec(v)                Subtract 1 from *v
  atomic_dec_and_test(v)       Subtract 1 from *v and return 1 if the result is zero; 0 otherwise
  atomic_inc_and_test(v)       Add 1 to *v and return 1 if the result is zero; 0 otherwise
  atomic_add_negative(i, v)    Add i to *v and return 1 if the result is negative; 0 otherwise

Atomic bit handling functions in Linux
  Function                           Description
  test_bit(nr, addr)                 Return the value of the nr-th bit of *addr
  set_bit(nr, addr)                  Set the nr-th bit of *addr
  clear_bit(nr, addr)                Clear the nr-th bit of *addr
  change_bit(nr, addr)               Invert the nr-th bit of *addr
  test_and_set_bit(nr, addr)         Set the nr-th bit of *addr and return its old value
  test_and_clear_bit(nr, addr)       Clear the nr-th bit of *addr and return its old value
  test_and_change_bit(nr, addr)      Invert the nr-th bit of *addr and return its old value
  atomic_clear_mask(mask, addr)      Clear all bits of addr specified by mask
  atomic_set_mask(mask, addr)        Set all bits of addr specified by mask
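As a rough user-space analogue of two entries in the table, the sketch below builds "test and set" and "test and clear" from C11 atomic fetch-or and fetch-and. The _sketch names are hypothetical, the word is an atomic_ulong rather than a kernel bitmap, and nr is assumed to be smaller than the number of bits in an unsigned long.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set bit nr of *addr atomically and report whether it was already set. */
bool test_and_set_bit_sketch(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Clear bit nr of *addr atomically and report whether it was set before. */
bool test_and_clear_bit_sketch(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_and(addr, ~mask) & mask) != 0;
}
```

Returning the old value is what makes these usable as a simple "who got there first" test: only one caller sees the bit go from 0 to 1.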

Memory Barriers
  Compilers and hardware re-order memory accesses as an optimization.
    True on SMP and even on UP systems!
  Memory barrier: an instruction to the hardware/compiler to complete all pending accesses before issuing more.
    read memory barrier: acts on read requests
    write memory barrier: acts on write requests
  Linux macros:
    for UP and MP: mb(), rmb(), wmb()
    for MP only: smp_mb(), smp_rmb(), smp_wmb()

Memory barriers in Linux
  Macro         Description
  mb( )         Memory barrier for MP and UP
  rmb( )        Read memory barrier for MP and UP
  wmb( )        Write memory barrier for MP and UP
  smp_mb( )     Memory barrier for MP only
  smp_rmb( )    Read memory barrier for MP only
  smp_wmb( )    Write memory barrier for MP only
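A common use of a write barrier paired with a read barrier is publishing data behind a flag. The sketch below is a user-space approximation using C11 fences, where a release fence plays roughly the role of wmb() and an acquire fence roughly the role of rmb(). The names payload, ready, publish and try_consume are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>

int payload;        /* ordinary data, written before the flag */
atomic_int ready;   /* flag that tells the consumer the data is valid */

/* Producer: write the data, then a barrier, then raise the flag. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);   /* roughly wmb() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and fills *out if the data has been published,
 * 0 if the flag is not yet raised. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);   /* roughly rmb() */
    *out = payload;
    return 1;
}
```

Without the two fences, the flag could become visible before the data, or the data could be read before the flag is checked.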

Peterson's Solution
  Two-process solution.
  Assume that the LOAD and STORE instructions are atomic; that is, they cannot be interrupted.
  The two processes share two variables:
    int turn;
    Boolean flag[2]
  The variable turn indicates whose turn it is to enter the critical section.
  The flag array is used to indicate if a process is ready to enter the critical section. flag[i] = true implies that process Pi is ready!

Algorithm for Process Pi
    while (true) {
        flag[i] = TRUE;
        turn = j;
        while (flag[j] && turn == j);
        /* CRITICAL SECTION */
        flag[i] = FALSE;
        /* REMAINDER SECTION */
    }
  (Does this still work if memory accesses are re-ordered?)

Task_i and Task_j with the two stores re-ordered (turn is written before flag)
  State shown in the figure: flag[i] = FALSE, turn = i

  Task_i:
    while (true) {
        turn = j;
        flag[i] = TRUE;
        while (flag[j] && turn == j);
        /* CRITICAL SECTION */
        flag[i] = FALSE;
        /* REMAINDER SECTION */
    }

  Task_j:
    while (true) {
        turn = i;
        flag[j] = TRUE;
        while (flag[i] && turn == i);
        /* CRITICAL SECTION */
        flag[j] = FALSE;
        /* REMAINDER SECTION */
    }

Peterson's Solution (with a memory barrier)
    while (true) {
        flag[i] = TRUE;
        mb( );
        turn = j;
        while (flag[j] && turn == j);
        /* CRITICAL SECTION */
        flag[i] = FALSE;
        /* REMAINDER SECTION */
    }
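For experimenting outside the kernel, Peterson's algorithm can be written with C11 atomics. The sketch below uses sequentially consistent loads and stores, which order all four accesses and are therefore stronger than the single mb( ) shown on the slide. The function names peterson_lock and peterson_unlock are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];   /* flag[i] is true when process i wants to enter */
atomic_int turn;       /* which process must wait if both want to enter */

/* Entry protocol for process i (i is 0 or 1). Default atomic_store and
 * atomic_load are sequentially consistent, so they are not re-ordered. */
void peterson_lock(int i)
{
    int j = 1 - i;
    atomic_store(&flag[i], true);
    atomic_store(&turn, j);
    while (atomic_load(&flag[j]) && atomic_load(&turn) == j)
        ;   /* busy wait */
}

/* Exit protocol for process i. */
void peterson_unlock(int i)
{
    atomic_store(&flag[i], false);
}
```

A caller wraps its critical section as peterson_lock(i); ...; peterson_unlock(i); with a fixed i per thread.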

Spin Lock
  A special kind of lock designed to work in a multiprocessor environment.
    Spin lock
    R/W spin lock
    Sequential lock
  Useless in a uniprocessor environment (?)

Spin lock functions
  Function               Description
  spin_lock_init( )      Set the spin lock to 1 (unlocked)
  spin_lock( )           Cycle until spin lock becomes 1 (unlocked), then set it to 0 (locked)
  spin_unlock( )         Set the spin lock to 1 (unlocked)
  spin_unlock_wait( )    Wait until the spin lock becomes 1 (unlocked)
  spin_is_locked( )      Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise
  spin_trylock( )        Set the spin lock to 0 (locked), and return 1 if the lock is obtained; 0 otherwise

Spin lock functions
  spin_lock(slp):
    1: lock; decb slp
       jns 3f
    2: cmpb $0,slp
       pause
       jle 2b
       jmp 1b
    3:

  spin_unlock(slp):
       lock; movb $1, slp
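The decrement-and-test loop above can be mirrored in portable C11. This is a user-space sketch of the same idea (1 means unlocked, 0 or less means locked), not the kernel's implementation. The type spinlock_sketch_t and the _sketch function names are hypothetical, and an int is used instead of a byte.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int slock; } spinlock_sketch_t;

void spin_lock_init_sketch(spinlock_sketch_t *l)
{
    atomic_init(&l->slock, 1);               /* 1 = unlocked */
}

void spin_lock_sketch(spinlock_sketch_t *l)
{
    for (;;) {
        /* "lock; decb" then "jns 3f": a non-negative result means we own it. */
        if (atomic_fetch_sub(&l->slock, 1) - 1 >= 0)
            return;
        /* "cmpb $0 / jle 2b": spin with plain reads until it looks free. */
        while (atomic_load(&l->slock) <= 0)
            ;
    }
}

/* Returns 1 if the lock was obtained, 0 otherwise; never spins. */
int spin_trylock_sketch(spinlock_sketch_t *l)
{
    int expected = 1;
    return atomic_compare_exchange_strong(&l->slock, &expected, 0);
}

void spin_unlock_sketch(spinlock_sketch_t *l)
{
    atomic_store(&l->slock, 1);              /* "movb $1" */
}
```

Spinning on a plain read in the inner loop, as the assembly does, avoids hammering the bus with locked operations while the lock is held.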

Read/Write Spin Locks
  Lock word layout (figure): an unlocked flag and the number of readers; a writer takes the whole word.
  State        Value
  Initial      0x01000000
  One read     0x00ffffff
  Two read     0x00fffffe

Read Spin Lock
  read_lock(rwlp):
       movl $rwlp,%eax
       lock; subl $1,(%eax)
       jns 1f
       call __read_lock_failed
    1:

  __read_lock_failed:
       lock; incl (%eax)
    1: cmpl $1,(%eax)
       js 1b
       lock; decl (%eax)
       js __read_lock_failed
       ret

  read_unlock(rwlp):
       lock; incl rwlp

Write Spin Lock
  write_lock(rwlp):
       movl $rwlp,%eax
       lock; subl $0x ,(%eax)
       jz 1f
       call __write_lock_failed
    1:

  __write_lock_failed:
       lock; addl $0x ,(%eax)
    1: cmpl $0x ,(%eax)
       jne 1b
       lock; subl $0x ,(%eax)
       jnz __write_lock_failed
       ret

  write_unlock(rwlp):
       lock; addl $0x ,rwlp
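The read and write paths can be sketched together in C11. The constant RW_LOCK_BIAS_SKETCH is an assumption taken from the initial value 0x01000000 on the read/write spin lock slide; the type and function names are hypothetical, and this is a user-space illustration rather than the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>

#define RW_LOCK_BIAS_SKETCH 0x01000000   /* assumed: the "initial" value */

typedef struct { atomic_int lock; } rwlock_sketch_t;

void rwlock_init_sketch(rwlock_sketch_t *rw)
{
    atomic_init(&rw->lock, RW_LOCK_BIAS_SKETCH);
}

/* Reader: subtract 1; a negative result means a writer holds the lock. */
void read_lock_sketch(rwlock_sketch_t *rw)
{
    for (;;) {
        if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
            return;
        atomic_fetch_add(&rw->lock, 1);          /* undo, then wait */
        while (atomic_load(&rw->lock) < 1)
            ;
    }
}

void read_unlock_sketch(rwlock_sketch_t *rw)
{
    atomic_fetch_add(&rw->lock, 1);
}

/* Writer: subtract the whole bias; only a result of 0 means nobody else is in. */
void write_lock_sketch(rwlock_sketch_t *rw)
{
    for (;;) {
        if (atomic_fetch_sub(&rw->lock, RW_LOCK_BIAS_SKETCH)
                - RW_LOCK_BIAS_SKETCH == 0)
            return;
        atomic_fetch_add(&rw->lock, RW_LOCK_BIAS_SKETCH);   /* undo, then wait */
        while (atomic_load(&rw->lock) != RW_LOCK_BIAS_SKETCH)
            ;
    }
}

void write_unlock_sketch(rwlock_sketch_t *rw)
{
    atomic_fetch_add(&rw->lock, RW_LOCK_BIAS_SKETCH);
}
```

With two readers inside, the word reads 0x00fffffe, matching the "Two read" state on the slide.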

Seqlock (sequential lock)
  A seqlock is a locking mechanism in Linux that supports fast writes of shared variables.
  seqlock := sequence number + lock
    The lock supports synchronization between two writers.
    The counter indicates consistency to readers.

Seqlock (sequential lock)
  The writer increments the sequence number twice: after acquiring the lock and before releasing it. The number is therefore odd while a write is in progress.
  Readers read the sequence number before and after reading the shared data, and retry if the number was odd or has changed.
    do {
        while (((old_seq_num = seq_num) % 2) != 0);
        // READER: critical section
    } while (old_seq_num != seq_num);
  Seqlock was first applied to updating the system time counter.
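A complete reader and writer pair might look like the following user-space sketch. It assumes C11 atomics with default (sequentially consistent) ordering, a tiny test-and-set flag as the writer lock, and a hypothetical two-field structure seqlock_sketch_t; none of these names come from the kernel.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_flag  wlock;   /* serializes writers */
    atomic_uint  seq;     /* odd while a write is in progress */
    atomic_int   a, b;    /* the protected data */
} seqlock_sketch_t;

/* Hypothetical shared instance, e.g. a two-part time value. */
seqlock_sketch_t clock_sketch = { ATOMIC_FLAG_INIT, 0, 0, 0 };

void seq_write(seqlock_sketch_t *s, int a, int b)
{
    while (atomic_flag_test_and_set(&s->wlock))
        ;                                /* spin: one writer at a time */
    atomic_fetch_add(&s->seq, 1);        /* now odd: readers will retry */
    atomic_store(&s->a, a);
    atomic_store(&s->b, b);
    atomic_fetch_add(&s->seq, 1);        /* even again: data is consistent */
    atomic_flag_clear(&s->wlock);
}

void seq_read(seqlock_sketch_t *s, int *a, int *b)
{
    unsigned int old_seq;
    do {
        while ((old_seq = atomic_load(&s->seq)) % 2 != 0)
            ;                            /* a write is in progress */
        *a = atomic_load(&s->a);
        *b = atomic_load(&s->b);
    } while (atomic_load(&s->seq) != old_seq);   /* changed: retry */
}
```

Readers never write shared state, so they cannot delay a writer; the price is that a reader may have to repeat its work.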

MONITOR & MWAIT (x86, for thread synchronization)
MONITOR defines an address range used to monitor write-back stores. MWAIT is used to indicate that the software thread is waiting for a write-back store to the address range defined by the MONITOR instruction.

Read-copy-update (RCU)
  It allows extremely low overhead, wait-free reads.
  RCU updates can be expensive, because they must leave the old versions of the data structure in place to accommodate pre-existing readers.
    These old versions are reclaimed after all pre-existing readers finish their accesses.
  RCU is a new addition in Linux 2.6; it is used in the networking layer and in the virtual file system (VFS).
  Reference: Paul E. McKenney: Read-copy-update (RCU), IPDPS 2006 Best Paper

RCU update sequence (figures)
  1. Reader: takes a local copy of PTR (Local_PTR) and reads the data through it. RCU allows extremely low overhead, wait-free reads.
  2. Writer: kmalloc + copy + update of a new version of the data, reached through New_PTR. RCU updates can be very expensive.
  3. Writer: PTR = New_PTR (an atomic operation). Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it.
  4. Writer: wait for all previous readers to complete their RCU read-side critical sections.
  5. Writer or GC: kfree(old_ptr). The "GC" can safely reclaim the data (the old version).
  6. Only the new data remains, referenced by PTR.

RCU timeline (figure): reader between "Lock scheduler" and "Unlock scheduler"; writer; GC; CTX_SW
  Lock_scheduler := preempt_count++
  Unlock_scheduler := preempt_count--
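The copy, publish, wait, reclaim sequence can be imitated in user space. The sketch below is a toy: it counts active readers to decide when the old version is unreachable, whereas the kernel relies on preempt_count and context switches as described above. It also assumes a single writer. All names (config_t, ptr, readers, rcu_sketch_init, reader_sum, writer_set_a) are invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct { int a, b; } config_t;

_Atomic(config_t *) ptr;   /* PTR in the figures */
atomic_int readers;        /* toy stand-in for "readers still running" */

int rcu_sketch_init(int a, int b)
{
    config_t *c = malloc(sizeof *c);
    if (!c)
        return -1;
    c->a = a;
    c->b = b;
    atomic_store(&ptr, c);
    return 0;
}

/* Reader: enter the read side, take Local_PTR, use it, leave. */
int reader_sum(void)
{
    atomic_fetch_add(&readers, 1);
    config_t *local = atomic_load(&ptr);
    int sum = local->a + local->b;
    atomic_fetch_sub(&readers, 1);
    return sum;
}

/* Writer (single writer assumed): copy, update the copy, publish, wait, free. */
int writer_set_a(int a)
{
    config_t *fresh = malloc(sizeof *fresh);
    if (!fresh)
        return -1;
    *fresh = *atomic_load(&ptr);                  /* copy */
    fresh->a = a;                                 /* update the copy */
    config_t *old = atomic_exchange(&ptr, fresh); /* PTR = New_PTR */
    while (atomic_load(&readers) != 0)
        ;                                         /* toy grace period */
    free(old);                                    /* reclaim old version */
    return 0;
}
```

A reader that loaded the old pointer raised the counter first, so the writer cannot see zero until that reader has finished; readers that arrive later can only see the new version.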

Semaphores
  Kernel semaphores
    Used by kernel control paths.
    Can be acquired only by functions that are allowed to sleep; interrupt handlers and deferrable functions cannot use them.
  System V IPC semaphores
    Used by User Mode processes.

Semaphores
  struct semaphore
    count (atomic_t): >0 free; 0 in use, no waiters; <0 in use, waiters
    wait: wait queue
    sleepers: 0 (none), 1 (some), occasionally 2
  The implementation requires lower-level synch!
    atomic updates, spinlock, interrupt disabling
  Optimized assembly code for the normal case (down())
  C code for the slower "contended" case (__down())

Semaphores
  up:
       movl $sem,%ecx
       lock; incl (%ecx)
       jg 1f
       pushl %eax
       pushl %edx
       pushl %ecx
       call __up
       popl %ecx
       popl %edx
       popl %eax
    1:

  down:
       movl $sem,%ecx
       lock; decl (%ecx);
       jns 1f
       pushl %eax
       pushl %edx
       pushl %ecx
       call __down
       popl %ecx
       popl %edx
       popl %eax
    1:
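The two fast paths above reduce to one atomic update followed by a sign test. The C11 sketch below models only that decision and leaves out the slow paths (__down and __up, which sleep and wake). The type sem_sketch_t and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int count; } sem_sketch_t;

void sem_init_sketch(sem_sketch_t *s, int n)
{
    atomic_init(&s->count, n);
}

/* "lock; decl" then "jns 1f".
 * true:  semaphore acquired on the fast path.
 * false: count went negative, the caller would have to sleep in __down. */
bool down_fastpath(sem_sketch_t *s)
{
    return atomic_fetch_sub(&s->count, 1) - 1 >= 0;
}

/* "lock; incl" then "jg 1f".
 * true:  count is positive, nobody is waiting.
 * false: count is still <= 0, the caller would have to wake a sleeper in __up. */
bool up_fastpath(sem_sketch_t *s)
{
    return atomic_fetch_add(&s->count, 1) + 1 > 0;
}
```

In the uncontended case each operation costs a single locked instruction, which is why the slide calls this the optimized normal case.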

__down (figure): WaitingQ.ins, WaitingQ.del

Read/Write Semaphores
  New feature of Linux 2.4
  FIFO
  Complex implementation, similar to regular semaphores
  Operations:
    down_read(), down_write()
    up_read(), up_write()

Read/Write Semaphores
  The first process is always awoken.
    If it is a writer, the other processes in the wait queue continue to sleep.
    If it is a reader, any other reader following the first process is also woken up and gets the lock.
    However, readers that have been queued after a writer continue to sleep.
  Example wait queue (head first): R R R W R W R R
    Only the first three readers are woken up.
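The wake-up rule can be expressed as a small function over the wait queue. In this sketch the queue is modelled as a string of 'R' and 'W' characters in FIFO order; that representation and the name rwsem_wake_count are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many sleepers at the head of the queue are woken up.
 * queue: FIFO wait queue as a string, 'R' = reader, 'W' = writer. */
size_t rwsem_wake_count(const char *queue)
{
    size_t n = 0;

    if (queue[0] == '\0')
        return 0;           /* nobody is waiting */
    if (queue[0] == 'W')
        return 1;           /* a writer at the head is woken alone */
    while (queue[n] == 'R')
        n++;                /* wake every reader up to the first writer */
    return n;
}
```

For the queue R R R W R W R R this yields 3: the readers behind the first writer keep sleeping, which preserves FIFO fairness for writers.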

Completions
  The current implementation of up( ) and down( ) allows them to execute concurrently on the same semaphore.
  If the process that returns from down( ) destroys the semaphore right away, up( ) might attempt to access a data structure that no longer exists.
  Completions avoid this problem:
    up( ) -> complete( )
    down( ) -> wait_for_completion( )

Completions (figure): timelines of two kernel control paths, 1 and 2, showing create_sem, down, up, and del_sem.
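A user-space analogue of a completion can be built from a pthread mutex and condition variable. In this sketch the completer changes the state and signals while holding the lock, so the waiter cannot return, and possibly destroy the object, until the completer has let go of it. The struct completion_sketch and the _sketch function names are hypothetical and are not the kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct completion_sketch {
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned int    done;   /* number of completions not yet consumed */
};

void init_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->wait, NULL);
    c->done = 0;
}

/* Counterpart of up(): everything happens under the lock. */
void complete_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    c->done++;
    pthread_cond_signal(&c->wait);
    pthread_mutex_unlock(&c->lock);
}

/* Counterpart of down(): returns only after re-taking the lock. */
void wait_for_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->done == 0)
        pthread_cond_wait(&c->wait, &c->lock);
    c->done--;
    pthread_mutex_unlock(&c->lock);
}
```

If the event has already been signalled, the waiter returns immediately; otherwise it sleeps until complete_sketch() runs in another thread.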

Local Interrupt Disabling
  Local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs.
  In multiprocessor systems, local interrupt disabling is often coupled with spin locks.
  Spin locks
    only on SMP systems; keep them short!
    general, read/write, big reader

Global Interrupt Disabling
A typical scenario consists of a driver that needs to reset the hardware device. Global interrupt disabling significantly lowers the system concurrency level. An interrupt service routine should never execute the cli( ) macro.

__global_cli()
  wait for top and bottom halves to complete
  disable local interrupts
  grab spinlock
  disable all interrupts

Disabling Deferrable Functions
  Disabling interrupts also disables deferred functions.
  It is possible to disable deferred functions without disabling all interrupts.
  Ops (macros):
    local_bh_disable()
    local_bh_enable()

Choosing Synch Primitives
  Avoid synch if possible! (clever instruction ordering)
    Example: inserting in a linked list (still needs a barrier)
    Example: task migration
  Use atomics or rw spinlocks if possible.
  Use semaphores if you need to sleep.
  Complicated structures accessed by deferred functions
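The linked-list example can be sketched as follows: the new node is fully initialized first, and only then made reachable, with a barrier between the two steps. This user-space version uses a C11 release store where kernel code would use a write barrier, and it assumes that only one writer inserts at a time. The names node, head, list_push and list_sum are invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int          value;
    struct node *next;
};

_Atomic(struct node *) head;   /* readers traverse from here without a lock */

/* Insert n at the front. Single writer assumed. */
void list_push(struct node *n, int value)
{
    n->value = value;
    n->next  = atomic_load_explicit(&head, memory_order_relaxed);
    /* Release ordering: the two stores above become visible before the
     * node itself does, so no reader can see a half-built node. */
    atomic_store_explicit(&head, n, memory_order_release);
}

/* Lock-free reader: sums all values currently reachable from head. */
int list_sum(void)
{
    int sum = 0;
    struct node *p = atomic_load_explicit(&head, memory_order_acquire);
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```

The ordering of the two writes is the "clever instruction ordering"; the barrier is what forces the hardware and the compiler to keep that order.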

Example Race Conditions
  Reference counters for sharing structs
    get/put functions
    deallocate when 0
  Memory map semaphore
  Slab cache list semaphore
  Inode semaphore
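The get/put pattern for reference counters can be sketched with a C11 atomic counter. The put side mirrors the idea behind atomic_dec_and_test(): whoever drops the count to zero frees the object. The names shared_obj, obj_alloc, obj_get, obj_put and freed_count are hypothetical, and freed_count exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct shared_obj {
    atomic_int refcount;
    int        payload;
};

int freed_count;   /* demonstration only: how many objects were freed */

/* A new object starts with one reference, owned by the caller. */
struct shared_obj *obj_alloc(void)
{
    struct shared_obj *o = malloc(sizeof *o);
    if (o) {
        atomic_init(&o->refcount, 1);
        o->payload = 0;
    }
    return o;
}

/* get: take an additional reference. */
void obj_get(struct shared_obj *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

/* put: drop a reference; the caller that removes the last one deallocates. */
void obj_put(struct shared_obj *o)
{
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {   /* count reached 0 */
        free(o);
        freed_count++;
    }
}
```

Because the decrement and the zero test are one atomic step, two CPUs calling obj_put() at the same time cannot both conclude that they released the last reference.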
