Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Review of the i386 registers EAX EBX ECX EDX ESI EDI EBP ESP General Registers (32-bits) CS DS ES FS GS SS Segment Registers (16-bits) EIP EFLAGS Program Control and Status Registers (32 bits)

The x86 ‘system’ registers CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7 Control Registers (32-bits) means ‘unimplemented’ DR0 DR1 DR2 DR3 DR4 Debug Registers (32-bits) DR5 DR6 DR7 LDTR TR GDTR IDTR (16-bits) (48-bits)

How often is ‘cmpxchg’ used? $ cat vmlinux.asm | grep cmpxchg c01046de: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c01055d9: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c010b895: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c010b949: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c0129a9f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c0129acf: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c012d377: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c012d41a: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c012d968: f0 0f b1 16 lock cmpxchg %edx,(%esi) c012e568: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e57a: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e58a: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e83f: f0 0f b1 13 lock cmpxchg %edx,(%ebx) c012e931: f0 0f b1 0a lock cmpxchg %ecx,(%edx) c012ea94: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c012ecf4: f0 0f b1 13 lock cmpxchg %edx,(%ebx) c012f08e: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx) c012f163: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c013cb60: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c0148b3c: f0 0f b1 29 lock cmpxchg %ebp,(%ecx) c0150d0f: f0 0f b1 3b lock cmpxchg %edi,(%ebx) c0150d87: f0 0f b1 31 lock cmpxchg %esi,(%ecx) c0199c5e: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c024b06f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c024b2fe: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx) c024b321: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx) c024b34b: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx) c024b960: f0 0f b1 53 18 lock cmpxchg %edx,0x18(%ebx) Here’s the occurrence that we studied in the ‘rtc_cmos_read()’ kernel-function… …plus 28 other times!

Intel’s documentation You can find out what any of the Intel x86 instructions does by consulting the official software developer’s manual, online at: http://www.intel.com/products/processor/manuals/index.htm http://www.intel.com/products/processor/manuals/index.htm Our course-webpage has a link to this site that you can just click (under ‘Resources’) The instruction-set reference is two parts: –Volume 2A: for opcodes A through M –Volume 2B: for opcodes N through Z

Example: ‘cmpxchg’ Operation of the ‘cmpxchg’ instruction is described (on 3 pages) in Volume 2A There’s an English-sentence description, and also a description in ‘pseudo-code’ You probably do not want to print out this complete volume (.pdf) – over 700 pages! (You could order a printed copy from Intel)

Instruction format Intel’s assembly language syntax differs from the GNU/Linux syntax (known as ‘AT&T syntax’ with roots in UNIX history) When AT&T syntax is used, the ‘cmpxchg’ instruction has this layout: [lock] cmpxchg reg, reg/mem optional ‘prefix’ (used for SMP) mnemonic opcode source operand destination operand

‘effects’ and ‘affects’ According to Intel’s manual, the ‘cmpxchg’ instruction also uses two ‘implicit’ operands (i.e., operands not mentioned in the instruction) –The CPU’s accumulator register –The CPU’s EFLAGS register The accumulator-register (EAX) is both a source-operand and a destination-operand The six status-bits in the EFLAGS register will get modified, as a ‘side-effect’ this instruction

‘cmpxchg’ description This instruction compares the accumulator with the destination-operand (so the ZF-bit in EFLAGS gets assigned accordingly) Then: –If (accumulator == destination) { ZF  1; destination  source; } –If (accumulator != destination) { ZF  0; accumulator  destination; }

An instruction-instance In our recent disassembly of Linux’s kernel function ‘rtc_cmos_read()’, this ‘cmpxchg’ instruction-instance was used: lockcmpxchg%edx, cmos_lock prefix opcode source-operand destination-operand Note: Keep in mind that the accumulator %eax will affect what happens! So we need to consider this instruction within it’s surrounding context

The complete function c0105574 : c0105574: 53 push %ebx c0105575: 9c pushf c0105576: 5b pop %ebx c0105577: fa cli c0105578: 64 8b 15 08 20 30 c0 mov %fs:0xc0302008,%edx c010557f: 0f b6 c8 movzbl %al,%ecx c0105582: 42 inc %edx c0105583: c1 e2 08 shl $0x8,%edx c0105586: 09 ca or %ecx,%edx c0105588: a1 3c 99 30 c0 mov 0xc030993c,%eax c010558d: 85 c0 test %eax,%eax c010558f: 75 f7 jne c0105588 c0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c0105598: c0 c0105599: 85 c0 test %eax,%eax c010559b: 75 eb jne c0105588 c010559d: 88 c8 mov %cl,%al c010559f: e6 70 out %al,$0x70 c01055a1: e6 80 out %al,$0x80 c01055a3: e4 71 in $0x71,%al c01055a5: e6 80 out %al,$0x80 c01055a7: c7 05 3c 99 30 c0 00 movl $0x0,0xc030993c c01055ae: 00 00 00 c01055b1: 53 push %ebx c01055b2: 9d popf c01055b3: 0f b6 c0 movzbl %al,%eax c01055b6: 5b pop %ebx c01055b7: c3 ret

The ‘preparation’ steps The instructions that preceed ‘cmpxchg’ will setup register EDX (source operand) and register EAX (the x86 ‘accumulator’) Several instructions are used to set up a value in EDX, and result in this layout: The current processor’s value for ‘per_cpu__cpu_number’ plus 1 EDX: CMOS register’s index 31 8 7 0 this might be zero… …but this part is guaranteed to be non-zero!

The ‘cmos_lock’ variable This global variable is initialized to zero, meaning that access to CMOS memory locations is not currently ‘locked’ If some CPU stores a non-zero value in this variable’s memory-location, it means that access to CMOS memory is ‘locked’ The kernel needs to insure that only one CPU at a time can set this ‘lock’

The ‘most likely’ senario One of the CPUs wishes to access CMOS memory – so it needs to test ‘cmos_lock’ to be sure that access is now ‘unlocked’ (i.e., cmos_lock == 0 is true) The CPU copies the ‘cmos_lock’ variable into the EAX, where it can then be tested using the ‘test %eax, %eax’ instruction A conditional-jump follows the test

The ‘busy-wait’ loop # Here is a ‘busy-wait’ loop, used to wait for the CMOS access to be ‘unlocked’ spin:movcmos_lock, %eax# copy lock-variable to accumulator test%eax, %eax# was CMOS access ‘unlocked’? jnzspin# if it wasn’t, then check it again # A CPU will fall through to here if ‘unlocked’ access was detected, # and that CPU will now attempt to set the ‘lock’ – in other words, it # will try to assign a non-zero value to the ‘cmos_lock’ variable. # But there’s a potential ‘race’ here – the ‘cmos_lock’ might have been # zero when it was copied, but it could have been changed by now… # … and that’s why we need to execute ‘lock cmpxchg’ at this point

Busy-waiting will be brief spin: # see if the lock-variable is clear mov cmos_lock, %eax test %eax, %eax jnzspin # ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock # did another CPU grab it first? test%eax, %eax jnzspin If our CPU wins the ‘race’, the (non-zero) value from source-operand EDX will have been stored into the (previously zero) ‘cmos_lock’ memory-location, but the (previously zero) accumulator EAX will not have been modified; hence our CPU will not jump back, but will fall through and execute the ‘critical section’ of code (just a few instructions), then will promptly clear the ‘cmos_lock’ variable.

The ‘less likely’ case spin: # see if the lock-variable is clear mov cmos_lock, %eax test %eax, %eax jnzspin # ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock # did another CPU grab it first? test%eax, %eax jnzspin If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to some non-zero value after we had fetched our copy of it, then the (now non-zero) value from the ‘cmos_lock’ destination-operand will have been copied into EAX, and so the final conditional-jump shown above will take our CPU back into the spin-loop, where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.

yes Setup nonzero value in EDX EAX  cmos_lock EAX is zero? no EAX is zero? EAX equals cmos_lock ? ZF  1 cmos_lock  EDX ZF  0 EAX  cmos_lock yes no critical section yes cmos_lock  0 start finish flowchart

‘btr’/’bts’ versus ‘cmpxchg’ In an earlier lesson we used the ‘btr’/’bts’ instructions to achieve ‘mutual exclusion’, whereas Linux uses ‘cmpxchg’ to do that We think ‘btr’/’bts’ is easier to understand, so why do you think the Linux developers would prefer to use ‘cmpxchg’ instead?

In-class exercise #1 Was it really necessary to insert a second ‘test %eax, %eax’ following ‘cmpxchg’? Can you design a simple LKM that would verify your answer to that question?

The Intel documentation does not state precisely how other EFLAGS status-bits (besides ZF) are affected by ‘cmpxchg’, only that they reflect the comparison of ‘accumulator’ and ‘destination’ operands Usually the CPU implements comparison- of-operands by performing a subtraction EFLAGS CFCF 1 PFPF 0 AFAF 0 ZFZF SFSF TFTF IFIF DFDF OFOF IOPL NTNT RFRF VMVM ACAC 31 11 10 9 8 7 6 5 4 3 2 1 0

In-class exercise #2 Can you decide what Intel means by “the comparison operation”, by writing suitable code that examines the effect on EFLAGS of ‘cmpxchg opnd1, opnd2’ and these two plausable alternatives: cmp opnd1, opnd2 cmp opnd2, opnd1

Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Similar presentations

Presentation on theme: "Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Similar presentations

Presentation on theme: "Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?"— Presentation transcript:

Similar presentations

About project

Feedback