CONTENTS
~ INTRODUCTION to SYSTEM CALLS
~ FUNCTION vs SYSTEM CALL
~ TYPES and EXAMPLES
~ ERRORS and SYSVECTORS
~ HOW DO ALL THESE WORK TOGETHER
~ ADDING A SYSCALL
~ STRACE (with examples)
~ WHAT ABOUT WINDOWS
~ PROBLEMS
~ REFERENCES

INTRODUCTION to SYSTEM CALLS

One of the most renowned features of Unix is the clear distinction between "kernel space" and "user space". (The Linux kernel implementation lets kernel code invoke some of the system calls itself, which breaks this clean distinction.) System calls are a set of "extended instructions" provided by the operating system; they form the interface between a user program and the operating system. We have:
~ A library procedure corresponding to each system call
~ Machine registers to hold the parameters of the system call
~ A trap instruction (a protected procedure call) to start the OS
~ The result, which is returned to the library function after the return-from-trap instruction

Ex: count = read(file, buffer, nbytes);
~ The actual system call read is invoked by the library procedure read
~ The number of bytes actually read is returned in count
~ In case of error, count is set to -1

Superficially, syscalls look like ordinary C functions. Their definitions are given in the C language, regardless of the implementation technique used on any given system to invoke a system call. This differs from many older operating systems, which traditionally defined the kernel entry points in the assembly language of the machine. However, syscalls differ from ordinary functions in two important respects:
~ Argument passing: a syscall may take only arguments of the native word size. Thus, when you pass a single char or short to a syscall, the compiler promotes the argument to a 32-bit word on a 32-bit system.
~ Return value: a syscall returns a signed word. Typically the return value indicates the status of the operation, but it can also be a pointer. A syscall cannot return structs or other aggregates. Likewise, you cannot pass structs by value to a syscall.

In most cases, the bare syscalls are wrapped in ordinary C functions. The library procedure stores the parameters of the system call in a specified place, such as the machine registers, and then issues a trap instruction (a protected procedure call) to start the OS. On Intel CPUs, this is done by means of software interrupt 0x80. Thus the kernel can control all hardware access and prevent user-mode processes from doing anything destructive. When the OS gets control after the trap, it examines the parameters to see if they are valid and, if so, performs the work requested. When it is finished, the OS puts a status code in a register, telling whether it succeeded or failed, and executes a RETURN FROM TRAP instruction to return control to the library procedure. The library procedure then returns to the caller in the usual way, returning the status code as the function value.

It's recommended that you use the syscall wrapper functions instead of calling a bare syscall directly whenever possible. Most syscall wrappers are declared in system headers. The system call numbers for Linux are listed in asm/unistd.h. If new system calls are added, they are usually appended to the end of the list so as to maintain backwards compatibility with existing code.

But how does the kernel know which call is wanted? Traditionally, system calls are identified by number rather than by name. The procedure at the kernel entry point checks the system call number, which tells the kernel what service the process requested. Then it looks at the table of system calls (sys_call_table) to find the address of the kernel function to call. It calls that function and, after it returns, does a few system checks and then returns to the process. If you want to read this code, it is in the source file arch/kernel/entry.S, after the line ENTRY(system_call).

What types of syscalls do we have?
~ The main syscalls used in process management deal with the creation and termination of processes. Ex: if we ask the shell to compile a program, the shell must create a new process that will run the compiler; when the work is done, that process terminates itself.
~ Some syscalls are available to request more memory.
~ Some syscalls convey information to a running process when something goes wrong. Ex: in network data transfer.

FUNCTIONS vs SYSTEM CALL

The kernel:
~ Runs in privileged mode
~ Provides an abstraction of the machine
~ Is responsible for managing resources

Kernel trap:
~ This instruction transfers control to the kernel
~ The kernel takes control

When a user issues a system call, a library usually does some simple bookkeeping and organization before the actual trap to the OS occurs.

EX: open(2) (Linux)

open(filename, flags, mode), for instance open("/etc/passwd", O_RDONLY, 0)

    Register   Contents
    %eax       5 (the system call number of open)
    %ebx       filename (char *)
    %ecx       flags (int)
    %edx       mode (int)

Here's an example system call, open. Each system call has a number associated with it. In Linux (on i386), this number is 5 for open. The C library takes the arguments and puts them in certain registers, then puts 5 in %eax. It then traps to the kernel.

Handling a call:
~ Check the arguments
~ Try to perform the operation
~ Return the corresponding value or an error code

If an error occurs, the specifics of the error are returned to the user via errno.

Here's a more specific example. At the bottom is the system call interface. The C standard library provides a few abstractions over this, such as fopen() or execvp(). Of course, it also provides functions that have nothing to do with system calls, such as strcpy() and strcmp(). On top of the C library is the C++ library. Some of it uses the C library, while some of it uses system calls directly. There are some system calls that aren't abstracted by the libraries, such as kill(), which sends a signal (SIGSTOP, for example) to a process.

From the implementor's point of view, the distinction between a system call and a library function is fundamental. But from the user's point of view, the distinction is not as critical. Consider the memory allocation function malloc as an example. There are many ways to do memory allocation. The Unix system call that handles memory allocation, sbrk(2), is not a general-purpose memory manager. It increases or decreases the address space of the process by a specified number of bytes. How that space is managed is up to the process. Let's have a look now.

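[Figure: two layered diagrams. General case: application code -> C library functions (both inside the user process) -> system calls -> kernel. Memory allocation case: application code -> malloc function (both inside the user process) -> sbrk system call -> kernel.]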

Library Functions

The problem with using low-level system calls directly for input and output is that they can be very inefficient. Why?
~ There's a performance penalty in making a system call. System calls are expensive compared to function calls because Linux has to switch from running your program code to executing its own kernel code and back again. It's a good idea to keep the number of system calls used in a program to a minimum and to get each call to do as much work as possible, for example by reading and writing large amounts of data rather than a single character at a time.
~ The hardware has limitations. Example: tape drives often have a block size, say 10k, to which they can write. If you attempt to write an amount that is not an exact multiple of 10k, the drive will still advance the tape to the next 10k block, leaving gaps on the tape.

To provide a higher-level interface to devices and disk files, all Linux distributions, like UNIX, provide a number of standard libraries. A good example is the standard I/O library, which provides buffered output. You can effectively write data blocks of varying sizes; the library collects them and issues the low-level system calls with full blocks. This dramatically reduces the system call overhead.

To see how the linkage works on the user-level side, let's take a look at the C library call that serves as a wrapper to sys_open. The kernel-side handler is declared as

    asmlinkage long sys_open(const char *name, int flags, int mode);

and the user-level library function is declared in /usr/include/sys/fcntl.h as follows:

    int open(const char *name, int flags, int mode);

TYPES and EXAMPLES

Process management system calls

fork is a good call to start with. The fork call returns 0 in the child and the child's pid in the parent; using the return value we can tell which process is the parent and which is the child. In most cases the child executes different code from the parent, and the parent waits for the child to finish before continuing with its next instruction. To do so, the parent executes the WAITPID system call.

Memory management system calls

We know that the malloc function allocates memory according to its parameters, but in practice most Unix systems have a system call BRK that specifies the size to which the data segment is to be set.

File and directory system calls in Unix

Many syscalls relate to files and file systems.
~ CREAT: creates a file and opens it for writing, regardless of the file's mode. Returns a file descriptor fd used to write the file.
~ READ, WRITE: obviously the most used system calls.
~ LSEEK: takes three parameters:
  1. the file descriptor for the file
  2. a file position
  3. a flag telling whether the file position is relative to the beginning of the file, the current position, or the end of the file
  The value returned is the absolute position in the file after the file pointer was changed.
~ MKDIR, RMDIR: create and remove directories. A directory can be removed only if it is empty.
~ LINK: linking to a file creates a new directory entry that points to the existing file. To break the link we use the UNLINK system call.
~ CHMOD: changes the mode of the file, that is, its protection bits.

I/O system calls

Prior to POSIX, most Unix systems had a call IOCTL that performed a large number of device-specific actions on special files. Over time its functions were divided into several separate calls, with distinct calls for input and output because some modems operate at split speeds.

System call parameters

On i386, the parameters of a system call are transported via registers. The system call number goes into %eax, the first parameter into %ebx, the second into %ecx, the third into %edx, the fourth into %esi, the fifth into %edi, and the sixth into %ebp.

Ancient history

Earlier versions of Linux could handle only four or five system call parameters, and therefore the system calls select() (5 parameters) and mmap() (6 parameters) used to have a single parameter that was a pointer to a parameter block in memory. Since Linux 1.3.0 five parameters are supported (and the earlier select with memory block was renamed old_select), and since Linux 2.3.31 six parameters are supported (and the earlier mmap with memory block was succeeded by the new mmap2).

Slow system calls

We use this term for system calls that can block forever, i.e. calls that may never return. These are mostly calls that wait for external events: read and write on terminal devices (but not read and write on disk devices), wait, pause, and networking calls.

The basic rule is that when a process is blocked in a slow system call, catches a signal, and the signal handler returns, the system call can return an error of EINTR. Some kernels automatically restart some interrupted system calls.

In detail: when a signal arrives while a slow system call is blocked waiting for something, the call is aborted and returns -EINTR, so that the library function returns -1 and sets errno to EINTR. Just before the system call returns, the user program's signal handler is called. This means that a system call can return an error although nothing was wrong. Usually one will want to redo the system call. That can be automated by installing the signal handler using a call to sigaction with the SA_RESTART flag set.

DEMO program

Changes

The world changes and system calls change. Since one must not break old binaries, the semantics associated with any given system call number must remain fully backwards compatible.

System calls are identified by their numbers. The number of the call foo is __NR_foo. For example, the number of _llseek is __NR__llseek, defined as 140 in /usr/include/asm-i386/unistd.h. Different architectures have different numbers. Often, the kernel routine that handles the call foo is called sys_foo. One finds the association between numbers and names in the sys_call_table, for example in arch/i386/kernel/entry.S (kernel versions 2.2.20, 2.4.20, 2.6).

For example, long ago user IDs had 16 bits; today they have 32. __NR_getuid is 24 and __NR_getuid32 is 199; the former belongs to the 16-bit version of the call, the latter to the 32-bit version. Looking at the associated kernel routines, we find that these are sys_getuid16 and sys_getuid, respectively.

DEMO: part of a program

The call

What happens? The assembler for a call with 0 parameters (on i386) is

    #define _syscall0(type,name) \
    type name(void) \
    { \
    long __res; \
    __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name)); \
    __syscall_return(type,__res); \
    }

(extract from include/asm-i386/unistd.h)

ERRORS and SYS VECTORS (exceptions and interrupts in 386)

The 386 recognizes two event classes: exceptions and interrupts. Both cause a forced context switch to a new procedure or task. Two sources of interrupts are recognized by the 386: maskable interrupts and nonmaskable interrupts. Two sources of exceptions are recognized by the 386: processor-detected exceptions and programmed exceptions. Each interrupt or exception has a number, which the 386 literature refers to as the vector.

[Table: 386 interrupt and exception vectors]

The priority of simultaneous interrupts and exceptions is:

[Table: priority of simultaneous interrupts and exceptions]

How Linux Uses Interrupts and Exceptions

Under Linux the execution of a system call is invoked by a software interrupt (a programmed exception in the classification above), caused by the instruction int 0x80. Vector 0x80 (SYSCALL_VECTOR in include/asm-i386/hw_irq.h) is used to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors like the system clock vector.

When a user invokes a system call, the execution flow is as follows:
~ Each call is vectored through a stub in libc. Each call within the libc library is generally a syscallX() macro, where X is the number of parameters used by the actual routine.
~ Each syscall macro expands to an assembly routine which sets up the calling stack frame and calls _system_call() through an interrupt, via the instruction int $0x80.

The macro definitions for the syscallX() macros can be found in /usr/include/linux/unistd.h. No system code for the call is executed until the int $0x80 instruction transfers control to the kernel entry point _system_call(). This entry point is the same for all system calls. It is responsible for saving all registers, checking that a valid system call was invoked, and then transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call() when the system call has completed, but before returning to user space.

The actual code for the system_call entry point can be found in arch/parisc/kernel/syscall.S. The actual code for many of the system calls can be found in /usr/src/linux/kernel/sys.c.

Upon return from the system call, the syscallX() macro code checks for a negative return value and, if there is one, puts a positive copy of the return value in the global variable _errno, so that it can be accessed by code like perror().

For example, the setuid system call is coded as

    _syscall1(int,setuid,uid_t,uid);

which will expand to:

    _setuid:
        subl $4,%esp
        pushl %ebx
        movzwl 12(%esp),%eax
        movl %eax,4(%esp)
        movl $23,%eax           # system call number of setuid
        movl 4(%esp),%ebx       # first argument goes in %ebx
        int $0x80               # trap into the kernel
        movl %eax,%edx
        testl %edx,%edx
        jge L2                  # non-negative result: no error
        negl %edx
        movl %edx,_errno        # store the positive error number
        movl $-1,%eax           # and return -1
        popl %ebx
        addl $4,%esp
        ret
    L2:
        movl %edx,%eax
        popl %ebx
        addl $4,%esp
        ret

How Linux Initializes the System Call Vectors

The startup_32() code found in /arch/i386/kernel/head.S starts everything off by calling setup_idt(). This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries. In protected mode each IDT entry is an 8 byte gate descriptor, for a total of 2048 bytes. When start_kernel() (found in /usr/src/linux/init/main.c) is called, it invokes trap_init() (found in /usr/src/linux/kernel/traps.c). trap_init() initializes the interrupt descriptor table as shown here:

[Table: IDT entries set up by trap_init()]

The return value from a system call is placed in EAX and can have an arbitrary type (of the appropriate size). Errors are indicated by reserving a small range of possible return values and returning an error value from a second enumerated list (see asm/errno.h). The error values are listed as positive numbers, but by convention they are negated before being returned by a system call, and are typically negated again by library code before being handed back to a program.

Error return

Above we said: typically, the kernel returns a negative value to indicate an error. But this would seem to mean that a system call can only return non-negative values. Since the negative error returns are of the form -ESOMETHING, and the error numbers have small positive values, there is only a small negative error range. Thus

    #define __syscall_return(type, res) \
    do { \
            if ((unsigned long)(res) >= (unsigned long)(-125)) { \
                    errno = -(res); \
                    res = -1; \
            } \
            return (type) (res); \
    } while (0)

Here the range [-125,-1] is reserved for errors (the constant 125 is version and architecture dependent) and all other values are OK.

What if a system call wants to return a small negative number that is not an error? The scheduling priority of a process is set by setpriority() and read by getpriority(), and this value ranges from -20 (top priority) to 19 (lowest priority background job). The library routines with these names use these numbers, but the system call getpriority() returns 20 - P instead of P, moving the output interval to positive numbers only.

Another example is ptrace(PTRACE_PEEKDATA, ...), which returns a word read from the traced process, so any value is possible. However, the system call returns this value in the data argument, and glibc does something like

    res = sys_ptrace(request, pid, addr, &data);
    if (res >= 0) {
            errno = 0;
            res = data;
    }
    return res;

so that a user program has to do

    errno = 0;
    res = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (res == -1 && errno != 0)
            /* error */

HOW DO ALL THESE WORK TOGETHER

When a userspace application makes a system call, the arguments are passed via registers and the application executes the 'int 0x80' instruction. This causes a trap into kernel mode, and the processor jumps to the system_call entry point in entry.S. What this does is:
1. Save registers.
2. Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in kernel address space.
3. If the value of %eax is greater than or equal to NR_syscalls (currently 256), fail with an ENOSYS error.
4. If the task is being ptraced (tsk->ptrace & PT_TRACESYS), do special processing. This is to support programs like strace (analogue of SVR4 truss(1)) or debuggers.
5. Call sys_call_table+4*(syscall_number from %eax). This table is initialised in the same file (arch/i386/kernel/entry.S) to point to the individual system call handlers, which under Linux are prefixed with sys_, e.g. sys_open, sys_exit, etc. These C system call handlers will find their arguments on the stack where SAVE_ALL stored them.
6. Enter the 'system call return path'. This is a separate label because it is used not only by int 0x80 but also by lcall7 and lcall27. It is concerned with handling tasklets (including bottom halves), checking if a schedule() is needed (tsk->need_resched != 0), and checking if there are signals pending and, if so, handling them.

Linux supports up to 6 arguments for system calls. They are passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see _syscall6() in asm-i386/unistd.h). The system call number is passed via %eax.

    /* include/asm-i386/hw_irq.h */
    #define SYSCALL_VECTOR 0x80

    /* arch/i386/kernel/traps.c */
    set_system_gate(SYSCALL_VECTOR,&system_call);

    /* arch/i386/kernel/entry.S */
    #define GET_CURRENT(reg) \
            movl $-8192, reg; \
            andl %esp, reg

    #define SAVE_ALL \
            cld; \
            pushl %es; \
            pushl %ds; \
            pushl %eax; \
            pushl %ebp; \
            pushl %edi; \
            pushl %esi; \
            pushl %edx; \
            pushl %ecx; \
            pushl %ebx; \
            movl $(__KERNEL_DS),%edx; \
            movl %edx,%ds; \
            movl %edx,%es;

    #define RESTORE_ALL \
            popl %ebx; \
            popl %ecx; \
            popl %edx; \
            popl %esi; \
            popl %edi; \
            popl %ebp; \
            popl %eax; \
    1:      popl %ds; \
    2:      popl %es; \
            addl $4,%esp; \
    3:      iret;

    ENTRY(system_call)
            pushl %eax                      # save orig_eax
            SAVE_ALL
            GET_CURRENT(%ebx)
            testb $0x02,tsk_ptrace(%ebx)    # PT_TRACESYS
            jne tracesys
            cmpl $(NR_syscalls),%eax
            jae badsys
            call *SYMBOL_NAME(sys_call_table)(,%eax,4)
            movl %eax,EAX(%esp)             # save the return value

    ENTRY(ret_from_sys_call)
            cli                             # need_resched and signals atomic test
            cmpl $0,need_resched(%ebx)
            jne reschedule
            cmpl $0,sigpending(%ebx)
            jne signal_return
            RESTORE_ALL

Upon return we check a few things and, when all is well, restore the registers and execute IRET to return from this INT.

(This was for the i386 architecture. All details differ on other architectures, but the basic idea is the same: store the syscall number and the syscall parameters somewhere the kernel can find them, in registers, on the stack, or in a known place in memory, do something that causes a transfer to kernel code, etc.)

[Figure: the path of a read() call through user space and kernel space]

    User space
        main():          ... push arguments, call _libc_read()
        _libc_read():    load args to regs, EAX=__NR_read, int 0x80
    Kernel space
        system_call() (arch/i386/kernel/entry.S):
                         SAVE_ALL, check limit of EAX, syscall_tab[EAX]()
        sys_read() (fs/read_write.c):
                         file=fget(fd), check file ops, check file locks, file->f_op->read()
        filesystem or network or device code:
                         check destination, retrieve data, copy data, return
        sys_read():      fput(file), return
        system_call():   handle signals, possibly schedule, RESTORE_ALL, iret
    User space
        _libc_read():    check error return
        main():          pop arguments ...

ADDING A SYSTEM CALL

How to Add Your Own System Calls

1. Create a directory under the /usr/src/linux/ directory to hold your code.
2. Put any include files in /usr/include/sys/ and /usr/include/linux/.
3. Add a #define __NR_xx to unistd.h to assign a call number for your system call, where xx, the index, is something descriptive relating to your system call. It will be used to set up the vector through sys_call_table to invoke your code.
4. Add an entry point for your system call to the sys_call_table in entry.S. It should match the index (xx) that you assigned in the previous step. The NR_syscalls variable will be recalculated automatically.
5. Run make from the top level to produce the new kernel incorporating your new code.

At this point, you will have to either add a syscall to your libraries, or use the proper _syscalln() macro in your user program for your programs to access the new system call.

Call Implementation

So now we are ready to code our first system call. I like to put my system call source files in /usr/src/linux/kernel because they are a part of the kernel. For simplicity I decided to call my first system call ever vijju. This is what the file /usr/src/linux/kernel/vijju.c looks like:

    #include <linux/vijju.h>

    asmlinkage int sys_vijju(void)
    {
            return(111);
    }

We can see that there isn't much to it. This system call just returns 111 and doesn't really do anything more. It enters kernel space and executes with permission to do whatever it wants to the system. A couple of things to note. First, we need to use asmlinkage int. Second, even though the system call is named vijju, we must name the function sys_vijju. In a way this makes sense because it is a system call: all of the system calls start with the prefix sys_. Also, we include a header file named after this system call, linux/vijju.h. The linux/ is needed because vijju.h exists in /usr/src/linux/include/linux, not just /usr/src/linux/include.

Adding a Library Function

C is a middle-level language and we don't actually have direct control over the register contents. Recall that system calls are made by putting values into registers. We can use assembly to do this for us. The kernel headers have a macro that we can use to create a wrapper function for us. Having a look at /usr/src/linux/include/linux/vijju.h we see:

    #ifndef __LINUX_VIJJU_H
    #define __LINUX_VIJJU_H

    #include <linux/unistd.h>

    _syscall0(int, vijju)

    #endif

The #ifndef, #define and #endif lines are just there to say "if we have not seen this file yet while compiling, read it; otherwise skip it". The line we are really interested in is _syscall0(int, vijju).

The _syscall part means that this line is to be translated into a system call. The 0 means that this system call takes zero arguments. The first field we encounter is int. This is the return type. Next we see vijju. This is the system call name. Any further arguments to this macro come in pairs. Each pair consists of a type and a name.

Getting a System Call Number

Remember that each system call needs to be referenced by a number passed through the EAX register. Here is how we assign a number to our system call. Open up /usr/src/linux/include/asm-i386/unistd.h. We find a list of #defines that assign numbers to system calls. The first one looks like this:

    #define __NR_exit 1

At the bottom of the list add a line like:

    #define __NR_vijju 253

System Call Table Entry

Have a look in /usr/src/linux/arch/i386/kernel/entry.S. Way down at the end of the file is a long table that starts with the line ENTRY(sys_call_table). The table then consists of a whole bunch of entries like

    .long SYMBOL_NAME(sys_exit)

This table holds a list containing each system call. In fact each line says: use 4 bytes to hold a pointer to the label specified by SYMBOL_NAME. You may notice that this table of pointers could be seen as an array and that the system call number could work as the index into the array. Also notice the counter in comments off to the right every 5 calls.

Go way down to the bottom of the table (about 190 on kernel 2.2.6). We need to add a reference to our own call, so just copy the last line that has the .long SYMBOL_NAME(...) format. Next change the sys_... part of the copy so that it has the name of our new system call (sys_vijju). The new line will look like this:

    .long SYMBOL_NAME(sys_vijju)    /* added by vijju */

Now before we leave this file, look down just a couple of lines and notice these lines:

    /*
     * NOTE!! This doesn't have to be exact - we just have
     * to make sure we have _enough_ of the "sys_ni_syscall"
     * entries. Don't panic if you notice that this hasn't
     * been shrunk every time we add a new system call.
     */
    .rept NR_syscalls-190
            .long SYMBOL_NAME(sys_ni_syscall)
    .endr

What is happening is that the end of the system call table is being padded with references to a safe system call. Just imagine what could happen if this weren't done and we passed a system call number that amounted to an index into uninitialized memory. Who knows what value that entry would point to. So we just change the number of used system calls to reflect our new entry. In this case the line

    .rept NR_syscalls-190

becomes

    .rept NR_syscalls-191

Updating the Makefile

We created a new C file that will need to be compiled and linked into the kernel. The file was /usr/src/linux/kernel/vijju.c, so we need to edit the appropriate makefile (/usr/src/linux/kernel/Makefile). Open the makefile and find the lines that start with O_OBJS =.

    O_OBJS = sched.o dma.o fork.o exec_domain.o panic.o printk.o sys.o module.o exit.o itimer.o info.o time.o softirq.o resource.o sysctl.o acct.o capability.o

This is a list of the files that need to be linked into the kernel when we compile it. We can just add the following line right afterwards, which says to also include our new file. Don't worry that vijju.o doesn't exist yet. It will be created when we compile the kernel.

    O_OBJS += vijju.o

Compiling the kernel

This part I'm assuming you have already done before.

STRACE (know about functions)

Using strace

It will be useful to present a command with which you can learn about and debug system calls. The strace command traces the execution of another program, listing any system calls the program makes and any signals it receives. To watch the system calls and signals in a program, simply invoke strace, followed by the program and its command-line arguments. For example, to watch the system calls that are invoked by the hostname command, use this command:

    % strace hostname

This produces a couple of screens of output. Each line corresponds to a single system call. For each call, the system call's name is listed, followed by its arguments (or abbreviated arguments, if they are very long) and its return value. Where possible, strace conveniently displays symbolic names instead of numerical values for arguments and return values, and it displays the fields of structures passed by a pointer into the system call.

**NOTE that strace does not show ordinary function calls.

In the output from strace hostname, the first line shows the execve system call that invokes the hostname program.

WHAT ABOUT WINDOWS

The Motorola 68000 has two processor modes "built into" the CPU, i.e. it has a flag in a status register that tells the CPU whether it is currently executing in user mode or supervisor mode. Intel x86 CPUs do not have such a flag. Instead, it is the privilege level of the code segment that is currently executing that determines the privilege level of the executing program. Each code segment in an application that runs in protected mode on an x86 CPU is described by an 8 byte data structure called a Segment Descriptor. A segment descriptor contains the start address of the code segment that it describes, the length of the code segment, and the privilege level at which the code in the code segment will execute. Code that executes in a code segment with a privilege level of 3 is said to run in user mode, and code that executes in a code segment with a privilege level of 0 is said to execute in kernel mode. In other words, kernel mode (privilege level 0) and user mode (privilege level 3) are attributes of the code and not of the CPU. Intel calls privilege level 0 "Ring 0" and privilege level 3 "Ring 3". There are two more privilege levels in the x86 CPU that are not used by Windows NT (rings 1 and 2).

Where do the Segment Descriptors reside?

Each code segment that exists in the system is described by a segment descriptor, and there are many code segments in a system, so the segment descriptors must be stored somewhere the CPU can read them in order to accept or deny access to a program that wishes to execute code in a segment. Intel chose to store this information not on the CPU chip itself but in main memory. There are two tables in main memory that store segment descriptors: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). There are also two registers in the CPU that hold the addresses and sizes of these descriptor tables so that the CPU can find the segment descriptors. These registers are the Global Descriptor Table Register (GDTR) and the Local Descriptor Table Register (LDTR). It is the operating system's responsibility to set up these descriptor tables and to load the GDTR and LDTR registers with the addresses of the GDT and LDT respectively. This has to be done very early in the boot process, even before the CPU is switched into protected mode, because without the descriptor tables no memory segments can be accessed in protected mode.

Figure below illustrates the relationship between the GDTR, LDTR, GDT and the LDT

Since there are two segment descriptor tables it is not enough to use an index to uniquely select a segment descriptor. A bit that identifies in which of the two tables the segment descriptor resides is necessary. The index combined with the table indicator bit is called a segment selector. The segment selector format is displayed below.

Interrupt gates

In order to control transitions between code executing at different privilege levels, Windows NT uses a feature of the x86 CPU called an interrupt gate. In real mode, the x86 CPU's interrupt vector table simply contains pointers (4 byte values) to the Interrupt Service Routines that will handle the interrupts. In protected mode, however, the interrupt vector table contains Interrupt Gate Descriptors, which are 8 byte data structures that describe how the interrupt should be handled. The reason for having an Interrupt Gate Descriptor instead of a simple pointer in the interrupt vector table is the requirement that code executing in user mode cannot directly call into kernel mode. By checking the privilege level in the Interrupt Gate Descriptor, the CPU can verify that the calling application is allowed to call the protected code at well-defined locations; this is the reason for the name "Interrupt Gate".

System calls in Windows NT are initiated by executing an "int 2e" instruction. The 'int' instruction causes the CPU to execute a software interrupt: it goes to index 2e (hexadecimal) of the Interrupt Descriptor Table and reads the Interrupt Gate Descriptor at that location. The Interrupt Gate Descriptor contains the Segment Selector of the code segment that holds the Interrupt Service Routine (the ISR), together with the offset of the ISR within that segment. The CPU uses the Segment Selector to index into the GDT or the LDT (depending on the TI bit in the selector). Once it has read the target segment descriptor, the CPU loads the descriptor's information into the CS register, and it loads the EIP register from the offset in the Interrupt Gate Descriptor. At this point the CPU is almost set up to start executing the ISR code in the kernel-mode code segment.
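The 8-byte Interrupt Gate Descriptor and the privilege check can be sketched in C as follows. The field layout is the standard 32-bit x86 interrupt gate layout. The function names and the example values in the usage note are hypothetical, and the check covers only software interrupts raised with an 'int' instruction, where the caller's current privilege level (CPL) must be numerically no greater than the gate's descriptor privilege level (DPL).

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit x86 interrupt gate descriptor (one IDT entry). */
struct interrupt_gate {
    uint16_t offset_low;   /* ISR offset, bits 15..0 */
    uint16_t selector;     /* segment selector of the ISR's code segment */
    uint8_t  reserved;     /* must be zero */
    uint8_t  type_attr;    /* bit 7: present, bits 6..5: DPL, bits 3..0: type */
    uint16_t offset_high;  /* ISR offset, bits 31..16 */
};

static_assert(sizeof(struct interrupt_gate) == 8,
              "an interrupt gate descriptor is 8 bytes");

/* Build a present 32-bit interrupt gate (type 0xE) with the given DPL. */
static struct interrupt_gate make_interrupt_gate(uint32_t offset,
                                                 uint16_t selector,
                                                 unsigned dpl)
{
    struct interrupt_gate g;
    assert(dpl < 4u);
    g.offset_low = (uint16_t)(offset & 0xFFFFu);
    g.selector = selector;
    g.reserved = 0;
    g.type_attr = (uint8_t)(0x80u | (dpl << 5) | 0x0Eu);
    g.offset_high = (uint16_t)(offset >> 16);
    return g;
}

/* Reassemble the 32-bit ISR offset that the CPU loads into EIP. */
static uint32_t gate_offset(const struct interrupt_gate *g)
{
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Descriptor privilege level stored in the gate. */
static unsigned gate_dpl(const struct interrupt_gate *g)
{
    return ((unsigned)g->type_attr >> 5) & 3u;
}

/* May code running at privilege level `cpl` use this gate via 'int n'? */
static int software_int_allowed(const struct interrupt_gate *g, unsigned cpl)
{
    return (g->type_attr & 0x80u) != 0 && cpl <= gate_dpl(g);
}
```

A gate built with DPL 3 can be entered from user mode (CPL 3), which is what a system call gate needs. A gate built with DPL 0 rejects the same caller, so user code cannot raise interrupts reserved for the kernel.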

Figure: the relationship between the Interrupt Descriptor Table entry associated with the 'int 2e' instruction, the Global Descriptor Table entry and the Interrupt Service Routine in the target code segment. Entry 2e of the Interrupt Descriptor Table (IDT) is an 8-byte Interrupt Gate Descriptor holding a Segment Selector and the ISR's offset within the code segment. The selector picks entry x of the Global Descriptor Table (GDT), an 8-byte Code Segment Descriptor holding the code segment's base address and privilege level. That code segment contains the INT 2e ISR.

The CPU switches automatically to the kernel-mode stack

Each privilege level in the x86 protected-mode environment has its own stack. When a call is made to a higher-privileged level through an interrupt gate descriptor as described above, the CPU automatically saves the user-mode program's SS, ESP, EFLAGS, CS and EIP registers on the kernel-mode stack. The Windows NT system service dispatcher function (KiSystemService) needs access to the parameters that the user-mode code pushed onto its stack before it executed 'int 2e'. By convention, the user-mode code must set up the EDX register to contain a pointer to the parameters on the user-mode stack before executing the 'int 2e' instruction. KiSystemService can then simply copy as many arguments as the called system function needs from the user-mode stack to the kernel-mode stack before calling the system function.

What system call are we calling? (same as Linux!)

Since all Windows NT system calls use the same 'int 2e' software interrupt to switch into kernel mode, how does the user-mode code tell the kernel-mode code which system function to execute? The answer is that an index is placed in the EAX register before the 'int 2e' instruction is executed. The kernel-mode ISR looks in the EAX register and calls the specified kernel-mode function if all parameters passed from user mode appear to be correct. The call parameters (for instance those passed to our OpenFile function) are passed to the kernel-mode function by the ISR.

Returning from the system call

Once the system call has completed, the ISR executes an IRET instruction, which makes the CPU restore the running program's original registers. IRET pops all the saved register values from the kernel-mode stack and causes the CPU to continue execution in the user-mode code at the instruction following the 'int 2e'.
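The dispatch step can be modelled in plain C as below. This is a user-space simulation under stated assumptions, not the Windows NT implementation: the table contents, the service functions, the error code and every name are hypothetical. It shows only the idea described above: an index (the value that would be in EAX) selects a table entry, the required number of arguments is copied from the caller's stack into kernel-side storage, and the selected function is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS 4             /* hypothetical upper bound for this sketch */
#define ERR_BAD_SERVICE (-1)   /* hypothetical error code */

typedef int32_t (*service_fn)(const uint32_t *args);

/* One entry per system service: the handler and its argument count. */
struct service_entry {
    service_fn fn;
    unsigned argc;
};

/* Two made-up services standing in for real system functions. */
static int32_t svc_add(const uint32_t *a)    { return (int32_t)(a[0] + a[1]); }
static int32_t svc_negate(const uint32_t *a) { return -(int32_t)a[0]; }

static const struct service_entry service_table[] = {
    { svc_add, 2 },     /* service index 0 */
    { svc_negate, 1 },  /* service index 1 */
};

/* Simulated dispatcher: `eax` is the service index and `user_args`
 * plays the role of the pointer to the caller's stacked parameters. */
static int32_t dispatch(uint32_t eax, const uint32_t *user_args)
{
    uint32_t kernel_args[MAX_ARGS] = { 0 };
    size_t count = sizeof service_table / sizeof service_table[0];
    const struct service_entry *entry;
    unsigned i;

    if (eax >= count)
        return ERR_BAD_SERVICE;        /* reject unknown indexes */

    entry = &service_table[eax];
    assert(entry->argc <= MAX_ARGS);

    /* Copy only as many arguments as this service needs. */
    for (i = 0; i < entry->argc; i++)
        kernel_args[i] = user_args[i];

    return entry->fn(kernel_args);
}
```

Calling `dispatch(0, args)` with `args = {2, 3}` runs the first service and returns 5, while an out-of-range index returns the error code without touching the argument array.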

PROBLEMS (CLICK HERE)

MORE TO LOOK AT IF YOU ARE INTERESTED

http://www.di.uevora.pt/~lmr/syscalls.html
http://burks.brighton.ac.uk/burks/language/ml/ocamlman/manual06.htm
http://www.cs.clemson.edu/~mark/syscall.html
http://www.linux.com/guides/khg/HyperNews/get/khg/135.shtml
http://www.quepublishing.com/articles/article.asp?p=23618

REFERENCES

http://www.linux.it/kerneldocs/ksys/ksys.html
http://www.codeguru.com/Cpp/W-P/system/devicedriverdevelopment/article.php/c8035/
http://en.tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
http://www.superfrink.net/docs/sys_call_howto.html
http://world.std.com/~slanning/asm/syscall_list.html
UNIX NETWORK PROGRAMMING, W. R. STEVENS
