
1 Cache & SpinLocks Udi & Haim

2 Agenda
Caching background
Why do we need caching?
Caching in modern desktop
Cache writing
Cache coherence
Cache & spinlocks

3 Agenda
Concurrent systems
Synchronization types:
Spinlock
Semaphore
Mutex
Seqlocks
RCU
Spinlock in the Linux kernel
Caching and locking

4 Cache

5 Why caching?
Accessing main memory is expensive, and it is becoming the PC performance bottleneck.
(Diagram: slower CPU vs. faster CPU.)

6 Caching in modern desktop
What is caching? “A computer memory with very short access time used for storage of frequently used instructions or data” – webster.com
A modern desktop has at least three caches:
TLB – translation lookaside buffer
I-Cache – instruction cache
D-Cache – data cache

7 Caching in modern desktop
Locality:
Temporal locality
Spatial locality
Cache coloring
Replacement policies: LRU, MRU
Direct-mapped cache
Cache performance = the proportion of accesses that result in a cache hit

8 Cache writing
There are two basic writing approaches:
Write-through – write is done synchronously both to the cache and to the backing store.
Write-back (or write-behind) – initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.

9 Cache writing
Two approaches for situations of write misses:
No-write allocate (aka write around) – the missed-write location is not loaded into the cache; the data is written directly to the backing store. In this approach, only reads are cached.
Write allocate (aka fetch on write) – the missed-write location is loaded into the cache, followed by a write-hit operation. In this approach, write misses behave like read misses.

10 Cache coherence Coherence defines the behavior of reads and writes to the same memory location.

11 Cache coherence
Cache coherence is obtained if the following conditions are met:
A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes to X by another processor occurring between the write and the read made by P, must always return the value written by P. This condition relates to program-order preservation, and it must hold even on uniprocessor architectures.
A read made by a processor P1 to location X that follows a write by another processor P2 to X must return the value written by P2 if no other writes to X by any processor occur between the two accesses. This condition defines the concept of a coherent view of memory: if processors can still read the old value after the write made by P2, the memory is incoherent.
Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in that order, from any two processors, the processors can never read location X as B and then later read it as A. Location X must be seen with values A and B in that order.

12 Cache coherence
Cache coherence mechanisms:
Directory-based
Snooping (bus-based)
And many more…

13 Cache coherence Directory-based
In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.

14 Cache coherence
Snooping (bus-based)
Snooping is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. There are two implementations of the snooping protocol:
Write-update – when a local cache block is updated, the new data block is broadcast to all caches containing a copy of the block, updating them.
Write-invalidate – when a local cache block is updated, all remote copies of that block in other caches are invalidated.

15 Cache coherence
Coherence protocol example: write-invalidate snooping protocol for write-through caches.
Writes invalidate all other cached copies.

16 Cache coherence
Write-invalidate snooping protocol for write-back caches:
When a block is first loaded into the cache it is marked "valid".
On a read miss in the local cache, the read request is broadcast on the bus. If another cache holds that address in the "dirty" state, it changes the state to "valid" and sends the copy to the requesting node.
The "valid" state means that the cache line is current.
When writing a block in state "valid", its state is changed to "dirty" and a broadcast is sent to all cache controllers to invalidate their copies.

17 Cache coherence - MESI MESI Modified Exclusive Shared Invalid

18 Cache coherence - MESI
Every cache line is marked with one of the four following states:
Modified – the cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive – the cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared – indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid – indicates that this cache line is invalid (unused).
To summarize, MESI is an extension of the MSI protocol: it splits the clean case into a line that exists only in this cache (Exclusive) and a line that may also exist in other caches (Shared).

19 Cache coherence - MESI For any given pair of caches, the permitted states of a given cache line are as follows: The Exclusive state is an opportunistic optimization: If the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
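As a reference (standard MESI background, not taken from the slide itself), the pairwise compatibility matrix looks like this – "yes" means the two states may be held simultaneously for the same cache line by two different caches:

        M    E    S    I
   M    no   no   no   yes
   E    no   no   no   yes
   S    no   no   yes  yes
   I    yes  yes  yes  yes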

20 Cache coherence

21 Cache What is done by the OS and what is done by the hardware?
In the Intel x86 family, caching is implemented in hardware; all the OS needs to (and can) do is change its configuration through a register interface called the control registers.
The control registers come in 7 groups: CR0, CR1, CR2, CR3, CR4, plus two more, EFER and CR8 (added to support x86-64).
Our main interest in this presentation is caching, but bear in mind that this interface contains every architectural parameter you can set on Intel processors.
CR0.CD (bit 30) – globally enables/disables the memory cache.
CR0.NW (bit 29) – globally enables/disables write-back caching (versus write-through).
Flushing of TLB entries can be done in the Linux KVM/VMX code using an API called vpid_sync_context; the implementation uses vpid_sync_vcpu_single or vpid_sync_vcpu_global for a single CPU or all CPUs.
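A minimal sketch (assuming an x86 Linux kernel context; the header that declares read_cr0() varies between kernel versions) that reads CR0 and prints the two cache-control bits described above:

#include <linux/kernel.h>
#include <asm/special_insns.h>    /* read_cr0() on recent kernels */

static void dump_cr0_cache_bits(void)
{
        unsigned long cr0 = read_cr0();

        pr_info("CR0.CD (bit 30, cache disable)     = %lu\n", (cr0 >> 30) & 1);
        pr_info("CR0.NW (bit 29, not write-through) = %lu\n", (cr0 >> 29) & 1);
}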

22 Caching & Spinlock

23 Caching and spin lock
A basic test-and-set spinlock in x86 assembly:

spin_lock:
        mov     eax, 1
        xchg    eax, [locked]   ; atomically swap EAX with the lock word
        test    eax, eax        ; was the lock already held (old value non-zero)?
        jnz     spin_lock       ; yes - keep spinning
        ret

spin_unlock:
        mov     eax, 0
        xchg    eax, [locked]   ; store 0 back into [locked] to release
        ret

30 Caching and spin lock
The other CPU's action: it executes the same spin_lock code, spinning with xchg on the same lock word.


37 Caching and spin lock
An improved version (test-and-test-and-set): spin with plain reads on the cached copy, and only attempt the atomic xchg once the lock looks free.

spin_lock:
        mov     eax, [locked]   ; plain read - spins in the local cache
        test    eax, eax
        jnz     spin_lock       ; still held, keep reading
        mov     eax, 1
        xchg    eax, [locked]   ; looks free - try to take it atomically
        test    eax, eax
        jnz     spin_lock       ; someone else won the race, back to spinning
        ret

spin_unlock:
        mov     eax, 0
        xchg    eax, [locked]
        ret

38 Caching and ticket lock
struct spinlock_t {
        int current_ticket;     /* ticket currently being served */
        int next_ticket;        /* next ticket to hand out       */
};

void spin_lock(spinlock_t *lock)
{
        int t = atomic_inc(lock->next_ticket);  /* take a ticket (returns the old value) */
        while (t != lock->current_ticket)
                ;                               /* spin until our number is called */
}

void spin_unlock(spinlock_t *lock)
{
        lock->current_ticket++;                 /* serve the next ticket holder */
}


60 Interrupt

61 Interrupt An interrupt is simply a signal that the hardware can send when it wants the processor's attention. A driver need only register a handler for its device's interrupts and handle them properly when they arrive.

62 Interrupt Cont’
Register API:
int request_irq(unsigned int irq,
                irqreturn_t (*handler)(int, void *, struct pt_regs *),
                unsigned long flags,
                const char *dev_name,
                void *dev_id);
Unregister API:
void free_irq(unsigned int irq, void *dev_id);

63 Interrupt Cont’
unsigned int irq – the interrupt number being requested.
irqreturn_t (*handler)(int, void *, struct pt_regs *) – pointer to the handler function being installed.

64 Interrupt Cont’
unsigned long flags – a bit mask of options related to the interrupt:
SA_INTERRUPT – when set, this indicates a "fast" interrupt handler. Fast handlers are executed with interrupts disabled on the current processor.
SA_SHIRQ – this bit signals that the interrupt can be shared between devices.
const char *dev_name – the string passed to request_irq; it is used in /proc/interrupts.

65 Interrupt Cont’
void *dev_id – pointer used for shared interrupt lines. It is a unique identifier used when the interrupt line is freed, and it may also be used by the driver to point to its own private data area.
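A minimal registration sketch for the request_irq()/free_irq() API above, in the 2.6-era style used on these slides; short_interrupt, short_devp, and the IRQ number are assumed names/values:

#include <linux/interrupt.h>

static irqreturn_t short_interrupt(int irq, void *dev_id, struct pt_regs *regs);
static int short_irq = 5;          /* assumed IRQ number            */
static void *short_devp;           /* driver private data as dev_id */

int short_setup_irq(void)
{
        int ret = request_irq(short_irq, short_interrupt,
                              SA_INTERRUPT, "short", short_devp);
        if (ret)
                printk(KERN_INFO "short: can't get assigned irq %d\n", short_irq);
        return ret;
}

void short_release_irq(void)
{
        free_irq(short_irq, short_devp);   /* dev_id must match the one given to request_irq */
}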

66 Top & Bottom Half
How do we perform lengthy tasks within a handler? By splitting the interrupt handler into two halves:
The so-called top half is the routine that actually responds to the interrupt.
The bottom half is a routine that is scheduled by the top half to be executed later, at a safer time.
All interrupts are enabled during execution of the bottom half.

67 Top & Bottom Half Cont’ Two different mechanisms may be used to implement bottom-half processing:
Tasklet – fast and must be atomic (runs in software-interrupt context)
Workqueue – higher latency, but allowed to sleep

68 Tasklet
Fast & atomic (cannot sleep).
Guaranteed to run on the same CPU as the function that first scheduled them.
An interrupt handler can be sure that a tasklet does not begin executing before the handler has completed.

69 Tasklet Cont’
Another interrupt can certainly be delivered while the tasklet is running, so locking between the tasklet and the interrupt handler may still be required.
Tasklets may be scheduled to run multiple times, but tasklet scheduling is not cumulative: the tasklet runs only once, even if it is requested repeatedly before it is launched.

70 Tasklet Cont’ No tasklet ever runs in parallel with itself, since tasklets run only once, but a tasklet can run in parallel with other tasklets on SMP systems, so locking between tasklets may be required.

71 Tasklet Example
void short_do_tasklet(unsigned long);
DECLARE_TASKLET(short_tasklet, short_do_tasklet, 0);

irqreturn_t short_tl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* Handle the fast-path IRQ work here */
        tasklet_schedule(&short_tasklet);   /* defer the rest to the tasklet */
        return IRQ_HANDLED;
}

72 Workqueue
Higher latency, but allowed to sleep.
Invokes a function at some future time in the context of a special worker process.
Because a workqueue function runs in process context, it can sleep if need be. You cannot, however, copy data into user space from a workqueue process: the workqueue process does not have access to any other process's address space.

73 Workqueue Example
static struct work_struct short_wq;

/* this line is in short_init() */
INIT_WORK(&short_wq, (void (*)(void *)) short_do_tasklet, NULL);

irqreturn_t short_wq_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* Handle the fast-path IRQ work here */
        schedule_work(&short_wq);   /* defer the rest to the workqueue */
        return IRQ_HANDLED;
}

74 Locks

75 Concurrent Systems Concurrency - what happens when the system tries to do more than one thing at once! In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. The computations may be executing on: 1. Multiple cores in the same chip. 2. Preemptively time-shared threads on the same processor. 3. Executed on physically separated processors. (wiki)

76 Concurrent Systems Management
The management of concurrency is one of the core problems in operating systems programming and the resulting outcome can be indeterminate

77 Concurrent Systems Management
Concurrency faults can lead to:
Race condition – uncontrolled access to shared data.
Starvation – a process is perpetually denied necessary resources. Without those resources, the program can never finish its task.
Deadlock – a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

78 Race Condition
Example of a race condition – leads to a memory leak!

Lock;
if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
                goto __cleanup;
}
UnLock;

79 Deadlock
Example of a deadlock – returning with the lock still held (instead of goto __cleanup) leads to deadlock!

Lock;
if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos])
                return -1;      /* bug: returns without releasing the lock */
}
UnLock;

80 Solution Avoid shared resources whenever possible
In many situations it is possible to design data structures that do not require locking, e.g. by using per-thread or per-CPU data and disabling interrupts Problem - such sharing is often required, hardware resources are, by their nature, shared, and software resources also must often be available to more than one thread

81 Solution Cont’
Mutual exclusion – making sure that only one thread of execution can manipulate a shared resource at any time.
Not all critical sections are the same, so the kernel provides different primitives for different needs:
Process context – can sleep
Interrupt context – cannot sleep

82 Spinlock A spinlock is a mutual exclusion device that can have only two values: “locked” and “unlocked”. If the kernel control path finds the spinlock “unlocked”, it acquires the lock and continues its execution.

83 Spinlock Cont’ If the kernel control path finds the lock “locked”, it “spins”, repeatedly executing a tight instruction loop until the lock is released. Spinlocks may be used in code that cannot sleep, such as interrupt handlers.

84 Spinlock Scenario #1
Driver acquires a spinlock.
Driver loses the processor because:
the driver calls a function which puts the process to sleep (e.g. copy_from_user), or
kernel preemption kicks in – a higher-priority process pushes the driver code aside.

85 Spinlock Scenario #1 Result:
Driver holds a spinlock which will not be freed in the near future.
In the best case, if another thread tries to acquire the lock, it will spin for a long time.
In the worst case, deadlock can occur.

86 Spinlock Scenario #1 Conclusion
⇒ Code holding a spinlock must be atomic and cannot go to sleep (and sometimes cannot even handle interrupts).
⇒ Preemption is disabled on a processor which holds a spinlock.

87 Spinlock Scenario #2
Driver acquires a spinlock.
Asynchronously, the device issues an interrupt.
The interrupt handler of the device tries to acquire the spinlock.

88 Spinlock Scenario #2 Result:
what happens if the interrupt routine executes in the same processor as the code that took out the lock originally? Deadlock !!!

89 Spinlock Scenario #2 Conclusion
⇒ In this case, acquiring the spinlock must also disable interrupts.
⇒ Spinlock critical sections must be as short as possible (the longer you hold a lock, the longer another processor may have to spin waiting for you to release it).

90 Spinlock Scenario #2 Conclusion Cont’
⇒ Long lock-hold times also keep the current processor from scheduling, meaning that a higher-priority process—which really should be able to get the CPU—may have to wait.

91 Spinlock API
Initialize lock API’s:
spinlock_t lock = SPIN_LOCK_UNLOCKED;
spin_lock_init(spinlock_t *lock);

92 Spinlock API
Lock API’s, possibly also disabling interrupts:
void spin_lock(spinlock_t *lock);
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
Interrupts can execute in nested fashion; the previous interrupt state is stored in flags (safe).

93 Spinlock API
void spin_lock_irq(spinlock_t *lock);
Use it only if you are absolutely sure nothing else has already disabled interrupts on your processor, i.e. you are sure you should re-enable interrupts when you release your spinlock.
void spin_lock_bh(spinlock_t *lock);
Disables software interrupts (bottom halves) before taking the lock.

94 Spinlock API Cont’
Try-lock API’s:
int spin_trylock(spinlock_t *lock);
int spin_trylock_bh(spinlock_t *lock);
Non-spinning versions of the corresponding lock functions: they return nonzero on success and 0 if the lock is already held, instead of spinning.

95 Spinlock API Cont’
Unlock API’s:
void spin_unlock(spinlock_t *lock);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);
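A minimal usage sketch of the API above; my_lock and the critical section are placeholders:

static spinlock_t my_lock;
unsigned long flags;

spin_lock_init(&my_lock);

spin_lock_irqsave(&my_lock, flags);     /* disables local interrupts, saves prior state */
/* ... touch data that is shared with an interrupt handler ... */
spin_unlock_irqrestore(&my_lock, flags);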

96 Semaphore
A single integer value combined with a pair of functions:
one which acquires the semaphore
one which releases the semaphore
If the value of the semaphore is greater than zero:
the value is decremented by one
the process continues

97 Semaphore Cont’
Otherwise, the process goes to sleep until another process releases the semaphore (incrementing the semaphore value by one) and, if necessary, wakes up processes that are waiting.

98 Semaphore Struct
The semaphore struct can be found in include/linux/semaphore.h:

struct semaphore {
        raw_spinlock_t   lock;
        unsigned int     count;
        struct list_head wait_list;
};

99 Semaphore Struct Cont’
The ->lock (spinlock) controls access to the other members of the semaphore The ->count variable represents how many more tasks can acquire this semaphore. If it's zero, there may be tasks waiting on the wait_list The ->wait_list is a list of tasks waiting for the semaphore (FIFO)

100 Semaphore Lock Procedure
acquire spinlock
if (count > 0)
        count--
else
        insert calling task at the tail of wait_list
        set wakeup flag to 0
        repeat:
                release spinlock
                put task to sleep
                acquire spinlock
                if (wakeup flag == 1)
                        exit repeat section
release spinlock

101 Semaphore Unlock Procedure
acquire spinlock
if (wait_list is empty)
        count++
else
        node = head of wait_list
        remove node from wait_list
        set node's wakeup flag to 1
        wake up the process
release spinlock
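A simplified C sketch of the two procedures above. This is a sketch only, not the exact kernel source (that lives in kernel/locking/semaphore.c); the semaphore_waiter struct and the helper calls are a simplified rendition, and kernel headers are assumed:

struct semaphore_waiter {
        struct list_head    list;
        struct task_struct *task;
        bool                up;     /* the "wakeup flag" from the slide */
};

void down(struct semaphore *sem)
{
        unsigned long flags;
        struct semaphore_waiter waiter;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (sem->count > 0) {
                sem->count--;                               /* fast path: semaphore was free */
        } else {
                waiter.task = current;                      /* slow path: queue up and sleep */
                waiter.up = false;
                list_add_tail(&waiter.list, &sem->wait_list);
                for (;;) {
                        __set_current_state(TASK_UNINTERRUPTIBLE);
                        raw_spin_unlock_irqrestore(&sem->lock, flags);
                        schedule();                         /* sleep until woken by up() */
                        raw_spin_lock_irqsave(&sem->lock, flags);
                        if (waiter.up)                      /* our wakeup flag was set */
                                break;
                }
        }
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}

void up(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (list_empty(&sem->wait_list)) {
                sem->count++;                               /* nobody waiting */
        } else {
                struct semaphore_waiter *waiter =
                        list_first_entry(&sem->wait_list,
                                         struct semaphore_waiter, list);
                list_del(&waiter->list);                    /* FIFO: take the head */
                waiter->up = true;                          /* set its wakeup flag  */
                wake_up_process(waiter->task);
        }
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}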

102 Semaphore API’s
Create a semaphore with an initial counter value of val:
void sema_init(struct semaphore *sem, int val);
#define DEFINE_SEMAPHORE(name)
Lock a semaphore:
void down(struct semaphore *sem);

103 Semaphore API’s
Interruptible lock (the sleeping process can be interrupted by a signal):
int down_interruptible(struct semaphore *sem);
Try lock (never sleeps):
int down_trylock(struct semaphore *sem);
Unlock a semaphore:
void up(struct semaphore *sem);
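A minimal usage sketch of the calls above; my_sem is an assumed name:

static struct semaphore my_sem;

sema_init(&my_sem, 1);                  /* count of 1 => binary, mutex-like semaphore */

if (down_interruptible(&my_sem))
        return -ERESTARTSYS;            /* a signal arrived before we got the semaphore */
/* ... critical section (may sleep) ... */
up(&my_sem);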

104 RW Semaphore Semaphores perform mutual exclusion for all callers
Many tasks break down into two distinct types of work: Readers Writers

105 RW Semaphore Allow multiple concurrent readers
Optimize performance An RW semaphore allows either one writer or an unlimited number of readers to hold the semaphore

106 RW Semaphore cont’
Since multiple readers may hold the lock at once, a writer may keep waiting for the lock while new reader threads are still able to acquire it – write starvation.

107 RW Semaphore API
RW semaphore struct:

struct rw_semaphore {
        long             count;
        raw_spinlock_t   wait_lock;
        struct list_head wait_list;
};

108 RW Semaphore API
Initialize a RW semaphore:
init_rwsem(struct rw_semaphore *sem);
Obtaining and releasing read access to a reader/writer semaphore:
void down_read(struct rw_semaphore *sem);
int down_read_trylock(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);

109 RW Semaphore API Cont’
Obtaining and releasing write access to a reader/writer semaphore:
void down_write(struct rw_semaphore *sem);
int down_write_trylock(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);
void downgrade_write(struct rw_semaphore *sem);


111 Mutex
Similar to a semaphore. Mutex struct:

struct mutex {
        /* 1: unlocked, 0: locked, negative: locked, possible waiters */
        atomic_t            count;
        spinlock_t          wait_lock;
        struct list_head    wait_list;
        struct task_struct *owner;
};

112 Mutex Vs Semaphore
Only one task can hold the mutex at a time (like a binary semaphore).
Only the owner of the mutex can unlock the mutex.
Recursive locking is not permitted.
Improvement: try to spin for acquisition when there are no pending waiters and the lock owner is currently running on a (different) CPU (it is likely to release the lock soon).

113 Seqlocks
In read-write locks:
readers must wait until the writer has finished
the writer must wait until all readers have finished
Seqlocks give a much higher priority to writers: a writer is allowed to proceed even when readers are active.

114 Seqlocks Cont’
The seqlock struct can be found in include/linux/seqlock.h:

typedef struct {
        seqcount_t seqcount;
        spinlock_t lock;
} seqlock_t;

115 Seqlocks Read Access
Read access works by:
obtaining an (unsigned) integer sequence value on entry into the critical section
doing the reading operations
comparing the current sequence number with the one obtained on entry
if there is a mismatch, the read access must be retried

116 Seqlocks Write Access
The write lock is implemented with a spinlock, so all the usual spinlock constraints apply.
Writers get much higher priority: a writer is allowed to proceed even when readers are active.
The writer increments the sequence number on entry and again on exit.
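A minimal writer-side sketch using the write API listed on slide 121; the_lock and the protected data are assumed names:

write_seqlock(&the_lock);     /* takes the internal spinlock, bumps the sequence (now odd)   */
/* ... update the protected data ... */
write_sequnlock(&the_lock);   /* bumps the sequence again (now even), releases the spinlock  */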

117 Seqlocks Read Example
A typical read-side code example looks like:

unsigned int seq;
do {
        seq = read_seqbegin(&the_lock);
        /* Do what you need to do */
} while (read_seqretry(&the_lock, seq));

118 Seqlocks Summary
Pros:
the writer never waits (unless another writer is active)
free access for readers
Cons:
a reader may sometimes be forced to read the same data several times until it gets a valid copy
seqlocks generally cannot be used to protect data structures involving pointers, because the reader may be following a pointer that is invalid while the writer is changing the data structure

119 Seqlocks API
Initialize seqlocks:
seqlock_t lock = SEQLOCK_UNLOCKED;
seqlock_init(seqlock_t *lock);
Obtaining read access:
unsigned int read_seqbegin(seqlock_t *lock);
unsigned int read_seqbegin_irqsave(seqlock_t *lock, unsigned long flags);

120 Seqlocks API Cont’
Checking (and retrying) read access:
int read_seqretry(seqlock_t *lock, unsigned int seq);
int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq, unsigned long flags);

121 Seqlocks API Cont’
Obtaining write access:
void write_seqlock(seqlock_t *lock);
void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);
void write_seqlock_irq(seqlock_t *lock);
void write_seqlock_bh(seqlock_t *lock);
int write_tryseqlock(seqlock_t *lock);

122 Seqlocks API Cont’
Releasing write access:
void write_sequnlock(seqlock_t *lock);
void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);
void write_sequnlock_irq(seqlock_t *lock);
void write_sequnlock_bh(seqlock_t *lock);

123 RCU – Read Copy Update An improvement for seqlocks
RCU allows many readers and many writers to proceed concurrently (an improvement over seqlocks, which allow only one writer to proceed) optimized for situations where reads are common and writes are rare

124 RCU Constraints
Constraints:
resources being protected should be accessed via pointers
all references to those resources must be held only by atomic code (a process cannot sleep inside a critical region protected by RCU)

125 RCU How?
On the reader side, code using an RCU-protected data structure brackets its access with rcu_read_lock()/rcu_read_unlock(), which disable/enable preemption:

struct my_stuff *stuff;

rcu_read_lock();                   /* disable preemption */
stuff = find_the_stuff(args...);
/* Do what you need to do */
rcu_read_unlock();                 /* enable preemption  */

126 RCU How Cont’?
On the writer side:
allocate a new structure
copy the data from the old one and modify the copy
replace the pointer that is seen by the read code
⇒ At this point, from the readers' perspective, the change is complete: any code entering the critical section sees the new version of the data.
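A hedged sketch of that writer-side sequence; stuff_ptr, my_stuff, and the field being updated are assumed names, and writers are assumed to be serialized by some other lock:

struct my_stuff *old, *new;

new = kmalloc(sizeof(*new), GFP_KERNEL);   /* 1. allocate a new structure            */
old = stuff_ptr;                           /*    writers are serialized, plain read   */
*new = *old;                               /* 2. copy the old contents               */
new->value = new_value;                    /*    ... and modify the copy             */
rcu_assign_pointer(stuff_ptr, new);        /* 3. publish: readers now see the copy   */
synchronize_rcu();                         /* wait until every CPU has scheduled     */
kfree(old);                                /* no reader can still reference 'old'    */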

127 RCU Cleanup The only remaining problem is when to free the old pointer (a reader might still hold a reference to it). Since all code holding references to this data structure must (by the rules) be atomic, we know that once every processor on the system has scheduled at least once, all references must be gone. So RCU registers a callback that waits until all processors have scheduled; that callback is then run to perform the cleanup work.

128 Simple Spinlock
Built using a test-and-set atomic function.
A test-and-set instruction writes to a memory location and returns its old value as a single atomic operation:
int test_and_set(int *lock);

129 Simple Spinlock
#define LOCKED 1

int test_and_set(int *lockPtr)
{
        int oldValue;
        oldValue = SwapAtomic(lockPtr, LOCKED);   /* atomically swap LOCKED into *lockPtr */
        return (oldValue == LOCKED);
}

130 Test and set mutex implementation
Lock:
int lock(int *lockPtr)
{
        while (test_and_set(lockPtr) == LOCKED)
                ;   /* wait a bit */
}

Unlock:
int un_lock(int *lockPtr)
{
        *lockPtr = 0;
}

131 Problems
Grants requests in unpredictable order.
Generates heavy inter-CPU bus traffic (cache-line bouncing).

132 Ticket Spinlocks
A ticket lock works as follows:
Two integer values, both initialized to 0:
queue ticket (the next ticket to hand out)
dequeue ticket (the ticket currently being served)

133 Ticket Spinlocks Cont’
Acquire-lock procedure:
obtain and increment the queue ticket
compare the obtained ticket value (before the increment) with the dequeue ticket's value
if they are the same, the thread is permitted to enter the critical section
otherwise, another thread must already be in the critical section and this thread must busy-wait or yield

134 Ticket Spinlocks Cont’
Release-lock procedure:
increment the dequeue ticket
this permits the next waiting thread to enter the critical section

135 Ticket Spinlock Summary
Grants requests in FIFO order.
Problem: still generates heavy inter-CPU bus traffic (cache-line bouncing).

136 Linux Scalability What is scalability?
Application does N times as much work on N cores as it could on 1 core. Scalability may be limited by Amdahl's Law: Locks, shared data structures, ... Shared hardware (DRAM, NIC, ...)

137 Linux Scalability


142 Test-and-Set Lock
Repeatedly test-and-set a Boolean flag indicating whether the lock is held.
Problem: contention for the flag (read-modify-write instructions are expensive).
Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations).
Variation: test-and-test-and-set – less traffic.

143 Ticket Lock
Two counters: nr_requests and nr_releases.
Lock acquire: fetch-and-increment on the nr_requests counter; wait until the obtained "ticket" is equal to the value of the nr_releases counter.
Lock release: increment the nr_releases counter.

144 Ticket Lock Cont’
Advantage over T&S: polls with read operations only.
BUT – still generates lots of traffic and contention: all threads spin on the same shared location, causing cache-coherence traffic on every successful lock access.

145 The Problem
Busy-waiting techniques are heavily used for synchronization on shared memory.
Busy-waiting synchronization constructs tend to:
have significant impact on network traffic due to cache invalidations
suffer contention, which leads to poor scalability

146 The Problem Cont’ Significant impact on network traffic due to cache invalidations: even in the case where only two CPUs are repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably.

147 The Problem Cont’ Contention leads to poor scalability:
The simple act of spinning for a lock clearly is not going to be good for performance Cache contention would appear to be less of an issue (CPU spinning on a lock will cache its contents in a shared mode) ⇒No cache bouncing should occur until the CPU owning the lock releases it (Releasing the lock and its acquisition by another CPU requires writing to the lock, and that requires exclusive cache access)


149 The Problem Cont’
Contention leads to poor scalability:
Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure requires access to the same cache line that holds the structure's spinlock.
Case 1: the lock is uncontended. That access is not a problem; the CPU owning the lock probably owns the cache line as well.
Case 2: the lock is contended. One or more other CPUs are constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

150 The Source of the Problem
Spinning on remote variables

151 The Proposed Solution Insert delay (backoff)
Minimize access to remote variables - spin on local variables instead

152 Spin Lock With Backoff Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally Cause a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again

153 Spin Lock With Backoff
Pros:
While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress.
The backoff can be made proportional: take the value of our ticket minus the ticket currently being served, multiplied by a static backoff loop count (see the sketch below).
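A minimal sketch of a ticket lock with proportional backoff. It reuses the pseudocode spinlock_t from slide 38 and assumes, as on these slides, that atomic_inc() returns the pre-increment value; BACKOFF_BASE and cpu_relax() are assumed helpers:

#define BACKOFF_BASE 100

void backoff_spin_lock(spinlock_t *lock)
{
        int t = atomic_inc(lock->next_ticket);          /* take a ticket */

        for (;;) {
                int cur = lock->current_ticket;
                if (t == cur)
                        return;                          /* our turn */
                int delay = (t - cur) * BACKOFF_BASE;    /* farther back => wait longer */
                while (delay-- > 0)
                        cpu_relax();                     /* pause without touching the lock */
        }
}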

154 Spin Lock With Backoff Cons
Too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance.
All threads still spin on the same shared location, causing cache-coherence traffic on every successful lock access.

155 Array Lock
#define NUM_OF_PROC 100
#define HAS_LOCK    1
#define MUST_WAIT   0

struct arrLock {
        int slots[NUM_OF_PROC];
        int next_slot;
};

156 Array Lock Cont
#define INIT_ARR_LOCK(name) \
        struct arrLock name; \
        name.slots[0] = HAS_LOCK; \
        /* slots[1 .. NUM_OF_PROC-1] = MUST_WAIT */ \
        name.next_slot = 0;

157 Array Lock Acquire
int arr_lock_lock(struct arrLock *arr_lock_p, int *my_slot)
{
        *my_slot = fetch_and_increment(&arr_lock_p->next_slot);  /* returns the old value   */
        *my_slot %= NUM_OF_PROC;                                 /* index into the array    */
        while (arr_lock_p->slots[*my_slot] == MUST_WAIT)
                ;                                                /* spin on our own slot    */
        arr_lock_p->slots[*my_slot] = MUST_WAIT;                 /* re-arm it for next time */
        return 0;
}

158 Array Lock Release
int arr_lock_unlock(struct arrLock *arr_lock_p, int my_slot)
{
        arr_lock_p->slots[(my_slot + 1) % NUM_OF_PROC] = HAS_LOCK;
        return 0;
}

Each CPU clears the lock for its successor (sets it from MUST_WAIT to HAS_LOCK).
Lock acquire: while (slots[my_place] == MUST_WAIT);
Lock release: slots[(my_place + 1) % NUM_OF_PROC] = HAS_LOCK;

159 Array Lock Cons adjacent data items share a single cache line. A write to one item invalidates that item’s cache line

160 Array Lock Cons How to solve it:
Pad array elements so that distinct elements are mapped to distinct cache lines
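For example, a hedged sketch assuming 64-byte cache lines:

#define CACHE_LINE_SIZE 64

struct padded_slot {
        int  state;                                   /* HAS_LOCK / MUST_WAIT           */
        char pad[CACHE_LINE_SIZE - sizeof(int)];      /* keep each slot on its own line */
};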

161 Array Lock Cons
The ALock is not space-efficient.
We may not know the value of NUM_OF_PROC in advance.

162 Array Lock Pros Spin on local variables, no cache jumps

163 Ticket Lock Improvements
Three counters: nr_requests, and two nr_releases counters.
Each counter lives in a different cache line (padded with zeroes).
Counter initial values:
nr_requests = 1
array of nr_releases: nr_releases[0] = 1, nr_releases[1] = 0

164 Ticket Lock Improvements Algo’
The algorithm:
Lock:
        ticket = fetch_and_increment(nr_requests)        // get a ticket
        while (ticket != nr_releases[(ticket + 1) % 2])  // wait for my turn
                ;
Lock release:
        nr_releases[ticket % 2] += 2                     // increment by 2
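A hedged C sketch of the two-counter ticket lock above; the alignment attribute, fetch_and_increment(), and cpu_relax() are assumed helpers for illustration:

struct ticket2_lock {
        int nr_requests __attribute__((aligned(64)));
        struct { int val; } __attribute__((aligned(64))) nr_releases[2];
};

void ticket2_init(struct ticket2_lock *l)
{
        l->nr_requests        = 1;
        l->nr_releases[0].val = 1;
        l->nr_releases[1].val = 0;
}

int ticket2_lock(struct ticket2_lock *l)
{
        int ticket = fetch_and_increment(&l->nr_requests);      /* get a ticket      */
        while (ticket != l->nr_releases[(ticket + 1) % 2].val)
                cpu_relax();                                     /* wait for my turn  */
        return ticket;                                           /* pass it to unlock */
}

void ticket2_unlock(struct ticket2_lock *l, int ticket)
{
        l->nr_releases[ticket % 2].val += 2;                     /* increment by 2 */
}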

165 Ticket Lock Improvements Summary
Advantage:
Cuts cache misses roughly in half (the reduction is linear in the size of the nr_releases array).
Can be generalized to n release counters.
Disadvantage:
The lock is not space-efficient – each counter lives in a distinct cache line.

166 MCS Lock – List Based Queue Lock
Goals:
Reduce bus traffic on cache-coherent machines (by spinning on local variables)
Space efficiency
Requires an atomic instruction available on some CPUs:
ATOMIC_COMPARE_AND_SWAP: CAS(mem, old, new) – if *mem == old, then set *mem = new and return true

167 MCS Lock – List Based Queue Lock
typedef struct qnode {
        struct qnode *next;
        bool          locked;
} mcs_lock_qnode;

/* A lock is just a pointer to a qnode */
typedef mcs_lock_qnode *mcs_lock;

168 MCS Lock Acquire
acquire(mcs_lock *L, mcs_lock_qnode *I)
{
        I->next = NULL;
        mcs_lock_qnode *predecessor = I;
        ATOMIC_SWAP(predecessor, *L);     /* atomically: *L = I, predecessor = old *L */
        if (predecessor != NULL) {        /* lock was held: queue up behind predecessor */
                I->locked = true;
                predecessor->next = I;
                while (I->locked)
                        ;                 /* spin on our own qnode only */
        }
}

169 MCS Lock Acquire
If unlocked, *L is NULL.
If locked with no waiters, *L is the owner's qnode.
If there are waiters, *L is the tail of the waiter list.

170 MCS Lock Release
release(mcs_lock *L, mcs_lock_qnode *I)
{
        if (!I->next)
                if (ATOMIC_COMPARE_AND_SWAP(*L, I, NULL))
                        return;           /* no waiter and no one racing: lock is freed */
        while (!I->next)
                ;                         /* a new waiter is mid-acquire: wait for it to link in */
        I->next->locked = false;          /* hand the lock to the next waiter */
}

171 MCS Lock Release
If I->next is NULL and *L == I:
no one else is waiting for the lock, so it is OK to set *L = NULL.
If I->next is NULL and *L != I:
another thread is in the middle of acquire; just wait for I->next to become non-NULL.
If I->next is non-NULL:
I->next is the oldest waiter; wake it up with I->next->locked = false.

172 Exclusive Cache Line
The technique below can be used to force alignment of data structures on cache-line boundaries.
Dynamic:

#define ALIGN 64

void *aligned_malloc(int size)
{
        void *mem  = kmalloc(size + ALIGN + sizeof(void *), GFP_KERNEL);
        void **ptr = (void **)(((long)mem + ALIGN + sizeof(void *)) & ~(ALIGN - 1));
        ptr[-1] = mem;          /* stash the original pointer just before the aligned block */
        return ptr;
}

173 Exclusive Cache Line
void aligned_free(void *ptr)
{
        kfree(((void **)ptr)[-1]);   /* free the original kmalloc'ed pointer */
}

Static:
int __attribute__((aligned(64))) lock;

174 Memory pool
Lookaside Caches
Useful when allocating many objects of the same size, over and over, in the kernel.
API:
kmem_cache_t *kmem_cache_create(const char *name, size_t size, size_t offset,
                                unsigned long flags,
                                void (*constructor)(void *, kmem_cache_t *, unsigned long flags),
                                void (*destructor)(void *, kmem_cache_t *, unsigned long flags));
void *kmem_cache_alloc(kmem_cache_t *cache, int flags);
void  kmem_cache_free(kmem_cache_t *cache, const void *obj);
int   kmem_cache_destroy(kmem_cache_t *cache);

flags = SLAB_HWCACHE_ALIGN – this flag requires each data object to be aligned to a cache line.

175 Memory pool
Memory Pools
There are places in the kernel where memory allocations cannot be allowed to fail. A memory pool is really just a form of lookaside cache that tries to always keep a list of free memory around for use in emergencies.
API:
mempool_t *mempool_create(int min_nr,
                          mempool_alloc_t *alloc_fn,
                          mempool_free_t *free_fn,
                          void *pool_data);
typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
typedef void  (mempool_free_t)(void *element, void *pool_data);
void *mempool_alloc(mempool_t *pool, int gfp_mask);
void  mempool_free(void *element, mempool_t *pool);
int   mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
void  mempool_destroy(mempool_t *pool);

176 Memory pool Code example
#define ALIGN 64

const char *cacheName = "slots";

union slot {
        char spaceKeeper[ALIGN];   /* pads each slot to a full cache line */
        int  val;
};

cache = kmem_cache_create(cacheName, sizeof(union slot), 0, SLAB_HWCACHE_ALIGN, ...);
pool  = mempool_create(MY_POOL_MINIMUM, mempool_alloc_slab, mempool_free_slab, cache);
union slot *obj = (union slot *)mempool_alloc(pool, ...);

177 Testing

178 Testing Application The test application is divided into 2 parts:
User mode – a performance test application Kernel mode – a char device

179 Testing Application - kernel
Create a char device driver.
Via ioctl, control the following:
creation of a spinlock of a given type:
ticket lock
array lock
MCS lock
acquire the spinlock (the one which was created)
release the spinlock (the one which was created)

180 Testing Application - User
The test application generates a different type of spinlock on each run.
The test application runs several forks/threads (one for each CPU core).
Each thread runs on a separate (unique) core (sched_setaffinity).

181 Testing Application – User Cont’
Pseudo code:
fopen spinlock device
create a spinlock type
for (i = 0; i < MAX_OF_CORES; i++)
        sched_setaffinity(i)

182 Testing Application – User Cont’
Each thread does the following in a loop of X iterations:
acquire the lock
suspend itself (sched_yield) and let the other threads run
release the lock
Measure the time until all threads have finished.

183 Testing Application – User Cont’
Pseudo code:
fopen spinlock device
create a spinlock type
start_tick = get tick
for (i = 0; i < MAX_OF_CORES; i++)
        run thread(i)
make sure all threads finished working
time = get tick - start_tick

184 Testing Application – User Cont’
Inside each thread:
sched_setaffinity(i)
loop:
        lock acquire
        suspend (sched_yield)
        lock release

185 Thank you

