System Calls.


1 System Calls

2 Linux ABI: System Calls
Everything distills into a system call
Access to /sys, /dev, and /proc turns into read() and write() syscalls
What is a system call?
Special-purpose function call
Elevates privilege
Executes a function in the kernel
But what is a function call?

3 What is a function call?
Special form of jmp
Executes a block of code at a given address
Special instruction: call <fn-address>
Why not just use jmp? (jmp does not record where to resume; call also saves the return address)
What do function calls need?
int foo(int arg1, char * arg2);
Location: foo()
Arguments: arg1, arg2, …
Return code: int
Must be implemented at hardware level
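To make the "special form of jmp" idea concrete, here is a minimal C sketch (not from the slides) that reaches foo() through a pointer holding its address. The body of foo() and the helper call_by_address() are invented for illustration; only the prototype comes from the slide.

```c
#include <string.h>

/* Hypothetical body for the slide's prototype: arg1 plus the length of arg2. */
int foo(int arg1, char *arg2)
{
    return arg1 + (int)strlen(arg2);
}

/* Calls whatever code lives at fn_address. The compiler emits an indirect
 * call: the arguments are placed where the ABI expects them, the return
 * address is saved, and the int result comes back to the caller. */
int call_by_address(int (*fn_address)(int, char *), int arg1, char *arg2)
{
    return fn_address(arg1, arg2);
}
```

Calling call_by_address(foo, 3, s) with s holding "ab" gives the same result as foo(3, s); the location, the arguments, and the return code are the three things the slide lists.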

4 System Calls
Function calls are not that special
Just an abstraction built on top of hardware
System calls are basically function calls, with a few changes:
Privilege elevation
Constrained entry points
Functions can call to any address
System calls must go through “gates”

5 Implementing system calls
System calls are implemented as a single function call: syscall()
read() and write() actually just invoke syscall()
What does syscall() do?
Enters the kernel at a known location
Elevates privilege
Instantiates kernel level environment
Once inside the kernel, an appropriate system call handler is invoked based on the arguments to syscall()
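The claim that read() and write() are thin wrappers can be tried from user space. This is a small sketch assuming Linux with glibc (or a compatible libc) that exposes syscall() and the SYS_* numbers; the wrapper names raw_write() and raw_getpid() are made up here.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_write, SYS_getpid */
#include <sys/types.h>
#include <unistd.h>       /* syscall() */

/* Same effect as write(fd, buf, count), but through the generic entry point. */
long raw_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}

/* Same effect as getpid(). */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both helpers pass a system call number first and the call's own arguments after it, which is the shape syscall(int syscall_num, args…) used later in the slides.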

6 x86 and Linux
Number of different mechanisms for implementing syscall
Legacy: int 0x80 (invokes a single interrupt handler)
32 bit: SYSENTER (special instruction that sets up a preset kernel environment)
64 bit: SYSCALL (64 bit version of SYSENTER)
All jump to a preconfigured execution environment inside kernel space
Either interrupt context or OS defined context
What about arguments?
syscall(int syscall_num, args…)

7 Specific system calls
Each system call has a number assigned to it
Index into a system call table
Function pointers referencing each syscall handler
syscall(int syscall_num, args…)
Sets up kernel environment
Invokes syscall_table[syscall_num](args…);
Returns to user space: resets environment to state before call
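The table lookup can be imitated entirely in user space. The sketch below is a toy model, not kernel code: the two "system calls", their numbers, and toy_syscall() are all invented, and there is no privilege change. It only shows the indexing and the bounds check a dispatcher needs.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long, long, long);

/* Two made-up handlers standing in for real system calls. */
static long sys_add(long a, long b, long c) { (void)c; return a + b; }
static long sys_max(long a, long b, long c) { (void)c; return a > b ? a : b; }

/* Made-up system call numbers; real ones come from kernel headers. */
enum { NR_add = 0, NR_max = 1, NR_syscalls = 2 };

/* The "system call table": one function pointer per number. */
static const syscall_fn syscall_table[NR_syscalls] = {
    [NR_add] = sys_add,
    [NR_max] = sys_max,
};

/* Dispatcher: validate the number, then call through the table. */
long toy_syscall(long syscall_num, long a0, long a1, long a2)
{
    if (syscall_num < 0 || syscall_num >= NR_syscalls)
        return -ENOSYS;            /* unknown system call number */
    return syscall_table[syscall_num](a0, a1, a2);
}
```

The range check matters because the number arrives from untrusted code; indexing the table without it would let a caller jump through arbitrary memory.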

8 man -s 2 write
WRITE(2)          Linux Programmer's Manual          WRITE(2)
NAME
    write - write to a file descriptor
SYNOPSIS
    #include <unistd.h>
    ssize_t write(int fd, const void *buf, size_t count);
DESCRIPTION
    write() writes up to count bytes from the buffer pointed to by buf to the file referred to by the file descriptor fd.
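Because write() only promises "up to count bytes", callers that need everything written usually loop. A minimal sketch of that pattern follows; the helper name write_all() is an assumption, not part of the manual page.

```c
#include <errno.h>
#include <unistd.h>

/* Keeps calling write() until all count bytes are out.
 * Returns 0 on success, or -1 with errno set by the failing write(). */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any byte: retry */
            return -1;
        }
        p += n;                    /* skip the part already written */
        count -= (size_t)n;
    }
    return 0;
}
```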

9
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_write(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);
        fdput_pos(f);
    }
    return ret;
}

10
ssize_t __vfs_write(struct file *file, const char __user *p, size_t count,
                    loff_t *pos)
{
    if (file->f_op->write)
        return file->f_op->write(file, p, count, pos);
    else if (file->f_op->write_iter)
        return new_sync_write(file, p, count, pos);
    else
        return -EINVAL;
}
EXPORT_SYMBOL(__vfs_write);
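The dispatch through file->f_op can be modeled in user space. The following is a simplified sketch with invented toy_* names: it keeps only the .write hook (no write_iter, no position argument) to show how one generic entry point reaches a per-file handler, or fails with -EINVAL when none is registered.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

struct toy_file;

/* Cut-down stand-in for struct file_operations: a single hook. */
struct toy_file_operations {
    ssize_t (*write)(struct toy_file *, const char *, size_t);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    size_t bytes_seen;             /* state owned by the "driver" */
};

/* A made-up driver that only counts the bytes it is given. */
static ssize_t counting_write(struct toy_file *f, const char *p, size_t count)
{
    (void)p;
    f->bytes_seen += count;
    return (ssize_t)count;
}

const struct toy_file_operations counting_fops = { .write = counting_write };
const struct toy_file_operations empty_fops = { .write = NULL };

/* Generic layer: call the hook if the file provides one. */
ssize_t toy_vfs_write(struct toy_file *f, const char *p, size_t count)
{
    if (f->f_op->write)
        return f->f_op->write(f, p, count);
    return -EINVAL;
}
```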

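11
static ssize_t console_write(struct file *filp, const char __user *buf,
                             size_t size, loff_t *offset)
{
    /* Kernel-side buffer; copying into a NULL pointer would fault. */
    char *tmp_buf = kmalloc(size, GFP_KERNEL);

    if (!tmp_buf)
        return -ENOMEM;
    if (copy_from_user(tmp_buf, buf, size)) {
        kfree(tmp_buf);
        return -EFAULT;
    }
    /* The slide omits the step that hands tmp_buf to the console. */
    kfree(tmp_buf);
    return size;
}

static struct file_operations cons_fops = {
    .read  = console_read,
    .write = console_write,
};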

12 int 0x80
Old style system call invocation
Vectors into kernel through IDT
Special interrupt (128) only used for system calls
IDT switches CPU to kernel mode
Changes CS segment to kernel CS segment
Hard coded as __KERNEL_CS
Switches to kernel stack
Interrupt handler inspects register contents for syscall # and arguments
System call index goes in %eax
Syscall handler invoked from syscall table
Like how IRQ handlers are invoked

13 Sysenter
More modern approach to syscall invocation
Allows OS to configure a syscall execution context
Configured via writes to hardware MSRs
Achieves same effect as an IRQ handler, but faster
Configured at boot time on each CPU
SYSENTER_CS_MSR: stores kernel code segment
SYSENTER_EIP_MSR: address of code to handle system calls
SYSENTER_ESP_MSR: kernel mode stack pointer
Application issues sysenter instruction
Instantiates system call context
After system call, control returned to process with sysexit instruction

14 SYSENTER/SYSEXIT SYSENTER operation SYSEXIT operation

15 Syscall
Long mode version of sysenter
Separate set of MSRs for 64 bit mode
Assumes flat memory model (no segments)
Configured at boot time on each CPU
SYSCALL_STAR_MSR: stores code segment information
SYSCALL_LSTAR_MSR: stores 64 bit instruction pointer
SYSCALL_FMASK_MSR: masks for setting rflags values
Application issues syscall instruction
Instantiates system call context
After system call, control returned to process with sysret instruction
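From the application side, "issues syscall instruction" looks roughly like the sketch below. It assumes GCC or Clang inline assembly and the Linux x86-64 convention (number in rax, first three arguments in rdi, rsi, rdx); the helper name syscall3() is made up, and other platforms fall back to libc's syscall().

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Issues a system call with up to three arguments. */
long syscall3(long nr, long a0, long a1, long a2)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;

    /* The syscall instruction saves the return address in rcx and
     * rflags in r11, so both are listed as clobbered. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a0), "S"(a1), "d"(a2)
                      : "rcx", "r11", "memory");
    return ret;                    /* raw kernel result: -errno on failure */
#else
    /* Fallback path: libc convention, -1 with errno set on failure. */
    return syscall(nr, a0, a1, a2);
#endif
}
```

Note the different error conventions of the two paths; libc's own wrappers convert the raw negative value into errno.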

16 SYSCALL/SYSRET SYSCALL Operation SYSRET Operation

17 System call optimizations
System calls can be invoked in multiple ways
Which one should a program use?
Do you need to support all options at compile time?
System calls add overhead
Kernel <-> user mode switches are expensive
Some system calls are pretty simple and don’t modify state
E.g. getpid(), gettimeofday(), etc…
What if we can handle a syscall without invoking the kernel?

18 VDSO
Kernel provided dynamic library for making system calls
Mapped into address space of each process
Links with standard C library
Automatically uses optimal system call mechanism
Also provides optimized user space system calls
System calls executed without invoking kernel mode
__vdso_clock_gettime; __vdso_getcpu; __vdso_gettimeofday; __vdso_time
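A process can check that the vDSO really is mapped into its own address space. The sketch below assumes Linux with getauxval() available; the helper names are invented. The kernel reports the vDSO's load address in the auxiliary vector entry AT_SYSINFO_EHDR, and what sits there is an ordinary ELF image.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */
#include <time.h>

/* Address where the kernel mapped the vDSO, or NULL if there is none. */
const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the mapping starts with the ELF magic bytes. */
int vdso_is_elf(void)
{
    const void *base = vdso_base();

    return base != NULL && memcmp(base, "\177ELF", 4) == 0;
}

/* clock_gettime() is one of the calls libc may serve through the vDSO,
 * in which case no switch to kernel mode is needed. */
long long now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Whether a given clock_gettime() call actually stays in user space depends on the kernel, the clock source, and the libc in use.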

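19 [Figure: /bin/ls calls fread(), which calls read() in /lib/libc.so.6 (Libc.so); libc reaches the Linux kernel through the VDSO using int 0x80, sysenter, or syscall. Process regions shown: Stack, Libc.so, Heap, Data, Code.]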

20 Kernel Environment
The kernel is a C program
Compiled instructions collected in a single binary
Linked and loaded similarly to a regular program
By the boot loader, not the OS
Kernel executes in its own range of virtual addresses
This range is independent from the ranges used by processes
They do not intersect
Allows kernel and process mappings to coexist in the same virtual address space

21 Memory layout: Traditional Unix (32bit)
Program contents on the bottom
Kernel memory is on top
Dynamic memory is in the middle
Heap grows up
Stack grows down
[Figure, top to bottom: kernel virtual memory (memory invisible to user code); stack; memory mapped region for shared libraries; run-time heap (via malloc), bounded by the “brk” ptr; uninitialized data (.bss); initialized data (.data); program text (.text)]

22 Memory layout: Modern Linux (64bit)
Many more addresses
Kernel is no longer top 1GB
Sparsely mapped in at various addresses
Memory mapped devices
Balancing address use between stack and heap no longer an issue
Heap allocated using mmap() calls
brk can still be used
VDSO region
User executable kernel code
User accessible kernel data
current time
Plus much more…
[Figure labels: kernel physical memory; kernel virtual memory; memory mapped devices; VDSO; stack; memory mapped region for shared libraries; run-time heap (via malloc); uninitialized data (.bss); initialized data (.data); program text (.text)]

23 Memory management
Address space of a process is virtual memory
What the process sees
Virtual memory may or may not be backed by physical memory
Actual byte addressable memory devices on motherboard (DRAM, NVM, etc)
OS manages the mapping of virtual memory to physical memory
Memory grouped together as pages
Typically 4KB of physically contiguous memory
OS allocates pages for each process
OS maps allocated pages into the virtual address space of each process
OS tracks current mapping of all processes
What memory is assigned to whom
OS can change mapping at any time
Move memory around
Move memory to disk (swapping)
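The page granularity is visible from user space. This is a small sketch assuming a POSIX system with anonymous mmap() (Linux in particular); the helper names are made up. It queries the page size, rounds an address down to its page, and asks the OS to map one fresh page.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Page size reported by the OS (typically 4096 on x86). */
size_t page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Address of the first byte of the page containing addr. */
uintptr_t page_base(const void *addr)
{
    return (uintptr_t)addr & ~((uintptr_t)page_size() - 1);
}

/* Asks the OS for one page of virtual memory. Physical backing is
 * typically assigned only when the page is first touched. */
void *map_one_page(void)
{
    void *p = mmap(NULL, page_size(), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Removes the mapping again; returns 0 on success. */
int unmap_one_page(void *p)
{
    return munmap(p, page_size());
}
```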

24 Kernel layout

25 Physical Address Layout
BIOS loads boot loader from startup disk
Boot loader copies kernel to 1MB boundary from root partition
[Figure labels: Linux Kernel; Boot loader]

26 Virtual Address Layouts (32 bit)
3 GB (0xc ) 16 MB (0x )

27 Virtual Address Layout (64 bit): Process
cat /proc/self/maps
c000 r-xp fd: /usr/bin/cat
0060b c000 r--p 0000b000 fd: /usr/bin/cat
0060c d000 rw-p 0000c000 fd: /usr/bin/cat
01a a47000 rw-p : [heap]
3dd dd r-xp fd: /usr/lib64/ld-2.18.so
3dd881f000-3dd r--p 0001f000 fd: /usr/lib64/ld-2.18.so
3dd dd rw-p fd: /usr/lib64/ld-2.18.so
3dd dd rw-p :00 0
3dd8e dd8fb4000 r-xp fd: /usr/lib64/libc-2.18.so
3dd8fb4000-3dd91b p 001b4000 fd: /usr/lib64/libc-2.18.so
3dd91b3000-3dd91b7000 r--p 001b3000 fd: /usr/lib64/libc-2.18.so
3dd91b7000-3dd91b9000 rw-p 001b7000 fd: /usr/lib64/libc-2.18.so
3dd91b9000-3dd91be000 rw-p :00 0
7f3b66ba0000-7f3b6d0c9000 r--p fd: /usr/lib/locale/locale-archive
7f3b6d0c9000-7f3b6d0cc000 rw-p :00 0
7f3b6d0e6000-7f3b6d0e7000 rw-p :00 0
7ffffed ffffed45000 rw-p : [stack]
7ffffedb3000-7ffffedb5000 r--p : [vvar]
7ffffedb5000-7ffffedb7000 r-xp : [vdso]
ffffffffff ffffffffff r-xp : [vsyscall]
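The same listing can be read by a program about itself. A minimal sketch, assuming a Linux system with /proc mounted; find_mapping() is a hypothetical helper that looks for a line containing a given name, such as "[stack]" or "[vdso]", and parses its address range.

```c
#include <stdio.h>
#include <string.h>

/* Searches /proc/self/maps for the first line containing name.
 * Returns 1 and fills *start / *end if found, 0 if not found,
 * -1 if the file cannot be opened. */
int find_mapping(const char *name, unsigned long *start, unsigned long *end)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof line, f)) {
        /* Each line begins with "start-end" in hexadecimal. */
        if (strstr(line, name) &&
            sscanf(line, "%lx-%lx", start, end) == 2)
            found = 1;
    }
    fclose(f);
    return found;
}
```

On the main thread, the address of any local variable should fall inside the range reported for "[stack]".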

28 Virtual Address Layout (64 bit)
====================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
====================================================================================================
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|_______________________________________
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|_______________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|_______________________________________
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|_______________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|_______________________________________
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
Slide annotation: Both are contiguous ranges starting at physical address 0

29 Kernel System.map
0000000001000000 A phys_startup_64   (Boot loader jumps here)
ffffffff T _text
ffffffff T startup_64
ffffffff T secondary_startup_64
ffffffff810001b0 T start_cpu0
ffffffff810b57f0 T vprintk
ffffffff8118e650 T kfree
ffffffff8118f780 T __kmalloc
ffffffff8130a2a0 T memset
ffffffff81309ff0 T memcpy
Slide annotation: Kernel initialization

30 Spectre/Meltdown
Kernel used to share virtual address space with process
Present in each process address space
Only accessible if hardware was in kernel mode
Protected by page table HW
Allowed system calls to be made without switching page tables
Performance optimization (just increase privilege level)
Spectre/Meltdown changed that
Allowed hardware to speculatively access kernel memory
Result of access could be read via cache side channel
Location of access could be controlled by attacker
Mitigations: kernel and processes are no longer mapped into the same page tables
Effect: lots of stuff you read is no longer accurate

31 Linked Lists

32 structs and memory layout
[Figure: three fox structs laid out in memory, each containing list.next and list.prev fields]

33 Linked lists in Linux
[Figure: three fox structs, each embedding a list { .next, .prev } member as its Node, linked to one another through those members]

34 What about types?
list_entry() calculates a pointer to the containing struct
struct list_head fox_list;
struct fox * fox_ptr = list_entry(fox_list.next, struct fox, node);
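The pointer calculation can be reproduced in user space with offsetof(). This is a simplified sketch, not the kernel's definition (the kernel macro adds type checking); the fox payload field tail_length and the helper fox_from_node() are invented for illustration.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Steps back from a pointer to the embedded member to the start of the
 * struct that contains it. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct fox {
    int tail_length;          /* hypothetical payload field */
    struct list_head node;    /* embedded list linkage */
};

/* Recovers the fox that owns a given list node. */
struct fox *fox_from_node(struct list_head *n)
{
    return list_entry(n, struct fox, node);
}
```

Because the linkage lives inside the payload struct, one generic struct list_head serves every type; the type information comes back only at the list_entry() call.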

35 List access methods
struct list_head some_list;
list_add(struct list_head * new_entry, struct list_head * list);
list_del(struct list_head * entry_to_remove);

struct type * ptr;
list_for_each_entry(ptr, &some_list, node) {
    …
}

struct type * ptr, * tmp_ptr;
list_for_each_entry_safe(ptr, tmp_ptr, &some_list, node) {
    list_del(&ptr->node);
    kfree(ptr);
}
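To see why the "safe" traversal exists, here is a user-space sketch of a circular doubly linked list in the same style. The functions mirror the kernel names but are simplified re-implementations, and struct fox with its id field is invented; malloc()/free() stand in for kmalloc()/kfree().

```c
#include <stddef.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty list is a head that points at itself. */
void init_list_head(struct list_head *h) { h->next = h->prev = h; }

/* Inserts new_entry right after head. */
void list_add(struct list_head *new_entry, struct list_head *head)
{
    new_entry->next = head->next;
    new_entry->prev = head;
    head->next->prev = new_entry;
    head->next = new_entry;
}

/* Unlinks entry from whatever list it is on. */
void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
}

struct fox { int id; struct list_head node; };

/* Allocates a fox and links it in; returns 0 on success, -1 on failure. */
int fox_add(struct list_head *head, int id)
{
    struct fox *f = malloc(sizeof *f);

    if (!f)
        return -1;
    f->id = id;
    list_add(&f->node, head);
    return 0;
}

/* Plain traversal: fine as long as nothing is removed on the way. */
int fox_count(const struct list_head *head)
{
    const struct list_head *pos;
    int n = 0;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}

/* "Safe" traversal: remember the next node before freeing this one. */
int fox_free_all(struct list_head *head)
{
    struct list_head *pos = head->next, *tmp;
    int freed = 0;

    while (pos != head) {
        tmp = pos->next;
        list_del(pos);
        free(list_entry(pos, struct fox, node));
        freed++;
        pos = tmp;
    }
    return freed;
}
```

Reading pos->next after the node has been freed would be a use-after-free, which is the problem the extra tmp pointer in list_for_each_entry_safe() avoids.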

