Lecture 2: UNIX Structure

1 Lecture 2: UNIX Structure

2 Layers of a UNIX System User Interface

3 Essential Unix Architecture
Diagram labels: Applications; System Libraries (libc); System Call Interface; I/O Related (File Systems, Networking, Device Drivers); Process Related (Scheduler, Memory Management, IPC); Modules; Architecture-Dependent Code; Hardware
Stephen Tweedie claims “All kernel code executes in a process context (except during startup)”. He also says that it is possible for an interrupt to occur during a context switch. So it is uncommon for there to be no user process mapped. The real problem is not knowing which process is mapped.
Kernel-mode, system-context activity occurs asynchronously and may be entirely unrelated to the current process.

4 Monolithic vs. Micro Kernels
Most Unix kernels are monolithic (e.g. Linux, BSD): each kernel layer is integrated into the whole kernel program and runs in kernel mode on behalf of the current process
Microkernel operating systems (e.g. Minix) demand a small set of functions from the kernel: a few synchronization primitives, a simple scheduler, an IPC mechanism etc.
System processes that run on top of the microkernel implement other OS functions: device drivers, file systems, system call handlers etc.

5 Microkernels vs. Linux Modules
Microkernels:
  System programmers are forced to adopt a modularized approach
  Layers (relatively independent programs) interact through well-defined interfaces
  OS can be ported to other architectures fairly easily
  Claim to be more reliable
    In a monolithic system, a bug in a device driver can easily crash the whole kernel
  Slower than monolithic kernels (explicit message passing between different layers)
Linux modules:
  An object file whose code can be linked to the kernel at runtime
  Unlike the external layers of a microkernel OS, a module does not run as a separate process
  Executed in kernel mode on behalf of the current process, like any other statically linked kernel function
  Can be linked to the running kernel when its functionality is required and unlinked when it is no longer useful (e.g. embedded systems)
  Achieve many of the theoretical advantages of microkernels without introducing performance penalties (no explicit message passing is required)
Various studies have shown that device drivers have 3-7 times as many bugs as the rest of the OS. When combined with the fact that 70% of a typical OS consists of device drivers, it is clear that device drivers are a big source of trouble.

6 FreeBSD machine-independent kernel code
Category                       Code lines   % of all kernel code
Headers                            38 158   4.8%
Initialization                      1 663   0.2%
Kernel facilities                  53 805   6.7%
Common interfaces                  22 191   2.8%
IPC                                10 019   1.3%
Terminal management                 5 798   0.7%
Virtual memory                     24 714   3.1%
Vnode management                   22 764   2.9%
Local file system                  28 067   3.5%
Different file systems (19)        58 753   7.4%
Network File system                22 436
Network communication              46 570   5.8%
IPv4 protocol support              41 220   5.2%

7 FreeBSD machine-independent kernel code
Category                       Code lines   % of all kernel code
IPv6 protocol support              45 527   5.7%
IPsec                              17 956   2.2%
Netgraph                           74 338   9.3%
Cryptography support                7 515   0.9%
GEOM level                         11 563   1.4%
CAM level                          41 805   5.2%
ATA level                          14 192   1.8%
ISA bus                            10 984
PCI bus                            72 366   9.1%
PCCARD bus                          6 916
Linux compatibility subsystem      10 474   1.3%
All                                         86.4%
Kernel without drivers.

8 FreeBSD machine-dependent kernel code
Category                           Code lines   % of all kernel code
Machine dependent headers              16 115   2.0%
ISA bus                                50 882   6.4%
PCI bus                                 2 266   0.3%
Virtual memory                          3 118   0.4%
Different machine dependent code       26 708   3.3%
Assembler procedures                    4 400   0.6%
Linux compatibility subsystem           4 857
All                                             13.6%
An abridged 12,000 lines of the C source code of the kernel, memory manager, and file system of MINIX 1.0 are printed in the book. Prentice-Hall also released MINIX source code and binaries on floppy disk with a reference manual.
Linux kernel 2.6: lines of code vs. MINIX 1 (kernel, memory manager, file system): lines; MINIX 3 (kernel): 6 000

9 Kernel Services
A border between the kernel-level and the user-level code
  Supported by the hardware protection
  Kernel works in an isolated address range
  Impossible to access that address space from the user level
Any interaction between the two levels is possible only via system calls
  Strictly controlled by the kernel
System calls are mostly synchronous for the user-level application
  Kernel might continue some work after returning results to the user level
System calls are mostly implemented by means of hardware exceptions
  Change the CPU working mode and the current virtual memory content
Kernel strictly checks system call arguments before executing the call
  Each argument is copied to the kernel address space to guarantee that it will not be changed during the execution of the system call
  The address space where the result of the system call will be placed has to belong to the process that made the call
To ensure safe protection mechanisms, operating systems must use the hardware protection associated with the CPU privileged mode.
If a system call fails, it returns -1 and sets the global errno variable.
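The error convention in the last line can be seen with a deliberately invalid call. This is a minimal sketch, not part of the lecture; the helper name errno_from_bad_read is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: issues read() on a descriptor that cannot be valid
 * and returns the errno value the kernel reported (0 if read() succeeded). */
int errno_from_bad_read(void)
{
    char buf[16];

    errno = 0;
    ssize_t n = read(-1, buf, sizeof buf);  /* -1 is never an open descriptor */
    if (n == -1)
        return errno;   /* copy errno at once: later calls may overwrite it */
    return 0;
}
```

On a POSIX system the value returned is expected to be EBADF, which strerror() can turn into a readable message.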

10 A few of the more common UNIX utility programs required by POSIX

11 Execution Strategy by Example
Consider the cat utility calling the library function read
  Target system: OpenBSD 3.9
  Target architecture: i386 (80x86)
  Target device: WD100x (ATA)
Library interface
  ./src/bin/cat/cat.c: Line 246 calls the read system function to read a buffer from the file

12 Execution Strategy by Example
./lib/csu/common.h
  The read function is a wrapper around the __syscall function with SYS_read as its argument
System call interface
  ./lib/libc/arch/i386/sys/syscall.S
  The __syscall function is an assembly-code function that triggers the entry into the kernel via a specially crafted interrupt

13 Execution Strategy by Example
./sys/kern/sys_generic.c
  Program (process) now executes in kernel mode.
  sys_read checks arguments, then actually calls out to read from the file.

14 Execution Strategy by Example
dofileread calls fo_read, a function of the Virtual File System interface implemented in ./src/ufs/ufs_readwrite.c, which in turn calls out to read on the underlying device
./sys/dev/ata/wd.c
  The driver calls physio, which fills a buffer from the underlying block device
  In physio, the process is deprioritized as it waits for data to transfer, allowing other processes to gain control of the processor
  Once the buffer has been appropriately filled (or the process is interrupted), control works its way back to the original caller
Block device: disk drive (vs. character devices: peripherals).

15 System Calls
System calls for process control
  fork()
  wait()
  execl(), execlp(), execv(), execvp()
  exit()
  signal(sig, handler)
  kill(pid, sig)
System calls for low-level file I/O
  creat(name, permissions)
  open(name, mode)
  close(fd)
  unlink(name)
  read(fd, buffer, n_to_read)
  write(fd, buffer, n_to_write)
  lseek(fd, offset, whence)
System calls for IPC
  pipe(fildes)
  dup(fd)
Total ~270 system calls in Linux kernel v2.6
Portable Operating System Interface (POSIX)
  ISO/IEC 9945
  IEEE 1003
Single UNIX Specification (SUS)
Linux Standard Base
POSIX vs. OS-specific system calls.
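A minimal sketch of the low-level file I/O calls listed above, in C. The helper name write_then_read and the scratch path used in the usage note are hypothetical; open with O_CREAT stands in for creat.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes msg to a scratch file, seeks back to the
 * start, reads the bytes back into out, then removes the file.
 * Returns the number of bytes read, or -1 on any failure. */
ssize_t write_then_read(const char *path, const char *msg,
                        char *out, size_t outsz)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    size_t len = strlen(msg);
    ssize_t n = -1;

    /* write() advances the file offset, so rewind before reading. */
    if (write(fd, msg, len) == (ssize_t)len && lseek(fd, 0, SEEK_SET) == 0)
        n = read(fd, out, outsz);

    close(fd);
    unlink(path);   /* unlink() takes a path name, not a descriptor */
    return n;
}
```

Calling write_then_read("/tmp/gr_demo_io.txt", "hello", buf, sizeof buf) should return 5 and leave the five bytes of "hello" in buf, assuming /tmp is writable.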

16 System Calls for Process Management
s is an error code
pid is a process ID
residual is the remaining time from the previous alarm

17 System Lifecycle: Ups & Downs
Diagram labels: Power on; Bootloader (LILO/GRUB); Kernel Init (start_kernel()); OS Init (init()); RUN!; Shut down (shutdown); Power off; sleep? (hlt)
BIOS, the ultimate authority of what hardware is installed, finds the configured primary bootable device and loads the initial bootstrap program from the master boot record (MBR)
Boot loader presents boot options and executes the start_kernel function
The start_kernel function initializes all data structures needed by the kernel, enables interrupts, starts the process scheduler and creates another kernel thread, named process 1 (the init process)
The init process, the parent of all user processes, executes scripts (/etc/rc.../, /etc/init/) to set up non-OS services (e.g. daemons) and structures for the user environment, mounts the filesystem, etc.
Booting can be quite complex. Reviewing the details helps to dispel some of the mystery surrounding the process and provides an appreciation for the complex hardware, firmware and software coordination required.
The standard Intel boot manager is LILO.
Kernel initialization is staged. Low-level, device-dependent initialization finally yields to high-level, device-independent initialization. Each category has initialization dependencies: certain things must be done before other things. Each logical Linux subsystem has one or more _init() functions that are called during this process.
Processes 0 and 1 are discussed. Process 1 (init) is responsible for all user-level process creation.
Shutdown is also staged but not as complex as startup. /sbin/shutdown and init cooperate to bring the system down gently.
Linux enthusiasts continue to stretch the bounds. We review some creative thinking about the boot process.
Power management is an increasingly important OS responsibility and part of the system lifecycle.

18 Processes
Diagram labels: kernel (kernel mode), Process 0: kernel bootstrap; /etc/init (user mode), Process 1: creates processes to allow for login; inetd, httpd, lpd; /etc/getty (condition terminal for login), fork, exec; /bin/login (check password); shell (command interpreter)
At start-up time, process 0 is launched in kernel mode (boot).
Process 0 starts process 1 in user mode (manual transfer of control).
Process 1 starts other processes (fork system call).
Each time a command/program is executed, the shell forks a new process for it.
P.S. getty (“get teletype”) is a Unix program running on a host computer that manages physical or virtual terminals (TTYs).

19 Shell
Shell is a user interface for accessing kernel services
A command-line interface (CLI)
  Could also be a GUI, e.g. X Window
Interprets user commands and starts applications

20 Different Shells
Bourne Shell, C Shell, Korn Shell, BASH
Last login: Tue Sep 21 07:58: from
root]#
root]# ps
PID TTY TIME CMD
20879 pts/ :00:00 bash
20905 pts/ :00:00 ps
root]# ls -l
total 64
-rw-r--r-- 1 root root Sep 20 16:11 anaconda-ks.cfg
-rw-r--r-- 1 root root Sep 20 16:11 install.log
-rw-r--r-- 1 root root Sep 20 16:11 install.log.syslog
root]# pwd
/root

21 Environment Variables
On login, various environment variables are set automatically for the user. To view them, run the command env
/]# env
HOSTNAME=unix.mii.lu.lv
TERM=vt100
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=::ffff:
SSH_TTY=/dev/pts/3
USER=root
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:
USERNAME=root
MAIL=/var/spool/mail/root
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
INPUTRC=/etc/inputrc
PWD=/
LANG=en_US.UTF-8
SHLVL=1
HOME=/root
BASH_ENV=/root/.bashrc
LOGNAME=root
SSH_CONNECTION=::ffff: ::ffff:
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/bin/env

22 Processes
Process is an instance of a program in execution
It can switch between the user mode and the kernel mode by means of system calls
Process resources can be divided into:
  User-level resources: CPU general-purpose registers, program counter, CPU state registers, stack registers, process memory segments (text, data, shared libs, stack)
  Kernel-level resources, which are important for the underlying hardware: registers, program counter, stack pointer, scheduling information, system call information, etc.
Process kernel state is divided into two parts:
  Process structure that contains data which always has to be in memory and cannot be swapped out: contains pointers to all other resident structures
  User structure that has to be in memory only during the process execution; otherwise it can be swapped out to the disk
    Dynamically allocated to the process by memory management routines
Multitasking is achieved by context switching
  Because context switches take place very often, minimizing the context switching time is an effective way to achieve better performance

23 Parts of Process Memory Structure
Program code Initialised data Non-initialised data Stack frames of invoked functions arena/heap malloc switches on system call (trap, software interrupt) user-id open files saved register states environment $ size /usr/bin/size = 25777

24 The Big Picture: Another look
Data Stack Text (shared) kernel stack/u area process structure kernel memory

25 Process Structure
A process can be seen as a collection of data structures that fully describe how far the execution of the program has progressed
Every process has a unique identifier, PID
  A mechanism by which the kernel and other processes can refer to each other
Process structure contains:
  PID
  Signal state: waiting signals, signal mask, signal action summary
  Profiling information
  Timers: real-time timers and CPU usage counters
  Different process substructures

26 Process Structure
Different process substructures:
  Process group identification: process group and session it belongs to
  User credentials: the real, effective and saved user and group identification
  Memory management that describes the virtual address space for every process
  File descriptors: an array of pointers to the files, indexed by file descriptors and open file flags
  System call vector: it is possible to run object files, compiled for different UNIX systems, by using different system call vectors for different object files
  Resource accounting: rlimit structure which is used for accounting different system resources
  Statistics: information gathered from working processes; written to the accounting file at the time the process exits; includes process timers and profiling information
  Signal action: an action to be taken when a signal is sent to the process
  Thread structure
1. Kernel assigns each system call type a system call number.
2. Kernel initializes the system call table, mapping each system call number to the function implementing the system call; also called the system call vector.
3. User process sets up the system call number and arguments.
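The three numbered steps can be sketched as a toy dispatch table in C. Everything here is hypothetical and runs in user space: the toy_ names, the two "system calls" and the fixed return value only illustrate how a number is mapped to a handler. A real kernel enters through a trap and validates the arguments as well.

```c
#include <stddef.h>

/* Toy model of a system call vector; all names are illustrative only. */
typedef long (*syscall_fn)(long arg);

/* Step 1: every call type gets a number. */
enum { TOY_SYS_getpid = 0, TOY_SYS_double = 1, TOY_NSYSCALLS };

static long toy_getpid(long unused) { (void)unused; return 42; }
static long toy_double(long x)      { return 2 * x; }

/* Step 2: the table maps each number to its implementing function. */
static const syscall_fn toy_sysent[TOY_NSYSCALLS] = {
    [TOY_SYS_getpid] = toy_getpid,
    [TOY_SYS_double] = toy_double,
};

/* Step 3: the caller supplies a number and an argument; the dispatcher
 * rejects unknown numbers with -1, mirroring the error convention. */
long toy_syscall(long number, long arg)
{
    if (number < 0 || number >= TOY_NSYSCALLS || toy_sysent[number] == NULL)
        return -1;
    return toy_sysent[number](arg);
}
```

Swapping in a different table for a process is the idea behind running binaries built for another UNIX system, as the "System call vector" item describes.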

27 struct proc (FreeBSD: /sys/sys/proc.h)
struct proc {
    LIST_ENTRY(proc) p_list;             /* (d) List of all processes. */
    TAILQ_HEAD(, ksegrp) p_ksegrps;      /* (c)(kg_ksegrp) All KSEGs. */
    TAILQ_HEAD(, thread) p_threads;      /* (j)(td_plist) Threads. (shortcut) */
    TAILQ_HEAD(, thread) p_suspended;    /* (td_runq) Suspended threads. */
    struct ucred *p_ucred;               /* (c) Process owner's identity. */
    struct filedesc *p_fd;               /* (b) Open files. */
    struct filedesc_to_leader *p_fdtol;  /* (b) Tracking node */
    struct pstats *p_stats;              /* (b) Accounting/statistics (CPU). */
    struct plimit *p_limit;              /* (c) Process limits. */
    struct sigacts *p_sigacts;           /* (x) Signal actions, state (CPU). */
    enum {
        PRS_NEW = 0,                     /* In creation */
        PRS_NORMAL,                      /* threads can be run. */
        PRS_ZOMBIE
    } p_state;                           /* (j/c) S* process status. */
    pid_t p_pid;                         /* (b) Process identifier. */

28 struct proc (continued)
    LIST_ENTRY(proc) p_hash;             /* (d) Hash chain. */
    LIST_ENTRY(proc) p_pglist;           /* (g + e) List of processes in pgrp. */
    struct proc *p_pptr;                 /* (c + e) Pointer to parent process. */
    LIST_ENTRY(proc) p_sibling;          /* (e) List of sibling processes. */
    LIST_HEAD(, proc) p_children;        /* (e) Pointer to list of children. */
    struct mtx p_mtx;                    /* (n) Lock for this struct. */
    /* The following fields are all zeroed upon creation in fork. */
#define p_startzero p_oppid
    pid_t p_oppid;                       /* (c + e) Save ppid in ptrace. XXX */
    struct vmspace *p_vmspace;           /* (b) Address space. */
    u_int p_swtime;                      /* (j) Time swapped in or out. */
    struct itimerval p_realtimer;        /* (c) Alarm timer. */
    struct rusage_ext p_rux;             /* (cj) Internal resource usage. */
    struct rusage_ext p_crux;            /* (c) Internal child resource usage. */
    int p_profthreads;                   /* (c) Num threads in addupc_task. */
    int p_maxthrwaits;                   /* (c) Max threads num waiters */
    int p_traceflag;                     /* (o) Kernel trace points. */
    struct vnode *p_tracevp;             /* (c + o) Trace to vnode. */
    struct ucred *p_tracecred;           /* (o) Credentials to trace with. */
    struct vnode *p_textvp;              /* (b) Vnode of executable. */
    sigset_t p_siglist;                  /* (c) Sigs not delivered to a td. */
    char p_lock;                         /* (c) Proclock (prevent swap) count. */
    struct sigiolst p_sigiolst;          /* (c) List of sigio sources. */
    int p_sigparent;                     /* (c) Signal to parent on exit. */
    int p_sig;                           /* (n) For core dump/debugger XXX. */
    u_long p_code;                       /* (n) For core dump/debugger XXX. */
    u_int p_stops;                       /* (c) Stop event bitmask. */
    u_int p_stype;                       /* (c) Stop event type. */

29 struct proc (continued)
    char p_step;                         /* (c) Process is stopped. */
    u_char p_pfsflags;                   /* (c) Procfs flags. */
    struct nlminfo *p_nlminfo;           /* (?) Only used by/for lockd. */
    struct kaioinfo *p_aioinfo;          /* (c) ASYNC I/O info. */
    struct thread *p_singlethread;       /* (c + j) If single threading this is it */
    int p_suspcount;                     /* (c) Num threads in suspended mode. */
    struct thread *p_xthread;            /* (c) Trap thread */
    int p_boundary_count;                /* (c) Num threads at user boundary */
    struct ksegrp *p_procscopegrp;
    /* End area that is zeroed on creation. */
#define p_endzero p_magic
    /* The following fields are all copied upon creation in fork. */
#define p_startcopy p_endzero
    u_int p_magic;                       /* (b) Magic number. */
    char p_comm[MAXCOMLEN + 1];          /* (b) Process name. */
    struct pgrp *p_pgrp;                 /* (c + e) Pointer to process group. */
    struct sysentvec *p_sysent;          /* (b) Syscall dispatch info. */
    struct pargs *p_args;                /* (c) Process arguments. */
    rlim_t p_cpulimit;                   /* (j) Current CPU limit in seconds. */
    signed char p_nice;                  /* (c + j) Process "nice" value. */
    /* End area that is copied on creation. */
#define p_endcopy p_xstat
The magic number tells the kernel how to load the executable image. Shell scripts, however, are not directly executable programs: they require a shell to act as an interpreter. The characters #! /bin/sh are recognized by the kernel, just like the magic numbers for binary programs, to indicate that the rest of the line contains the name of the program which acts as the interpreter for the script.

30 struct proc (continued)
    u_short p_xstat;                     /* (c) Exit status; also stop sig. */
    struct knlist p_klist;               /* (c) Knotes attached to this proc. */
    int p_numthreads;                    /* (j) Number of threads. */
    int p_numksegrps;                    /* (c) Number of ksegrps. */
    struct mdproc p_md;                  /* Any machine-dependent fields. */
    struct callout p_itcallout;          /* (h + c) Interval timer callout. */
    u_short p_acflag;                    /* (c) Accounting flags. */
    struct rusage *p_ru;                 /* (a) Exit information. XXX */
    struct proc *p_peers;                /* (r) */
    struct proc *p_leader;               /* (b) */
    void *p_emuldata;                    /* (c) Emulator state data. */
    struct label *p_label;               /* (*) Proc (not subject) MAC label. */
    struct p_sched *p_sched;             /* (*) Scheduler-specific data. */
};

31 Creating a Child Process

32 Creating a Child Process
A process can be created by using a system call:
  pid_t fork(void);
  pid_t rfork(int flags);
  pid_t vfork(void);
A child process created by fork() is an exact copy of the parent process except for the following:
  The child process has a unique process ID.
  The child process has a different parent process ID (i.e., the process ID of the parent process).
  The child process has its own copy of the parent's descriptors.
    These descriptors reference the same underlying objects, so that, for instance, file pointers in file objects are shared between the child and the parent: an lseek(2) on a descriptor in the child process can affect a subsequent read(2) or write(2) by the parent.
    This descriptor copying is also used by the shell to establish standard input and output for newly created processes as well as to set up pipes.
  The child process' resource utilizations are set to 0; see setrlimit(2).
  All interval timers are cleared; see setitimer(2).
A child process created by rfork() is an exact copy of the parent process except for the following:
  The flags argument to rfork() selects which resources of the invoking process (parent) are shared by the new process (child) or initialized to their default values.
  The resources include the open file descriptor table (which, when shared, permits processes to open and close files for other processes), and open files.
The vfork() system call can be used to create new processes without fully copying the address space of the old process, which is horrendously inefficient in a paged environment.
  It is useful when the purpose of fork(2) would have been to create a new system context for an execve(2).
  The vfork() system call differs from fork(2) in that the child borrows the parent's memory and thread of control until a call to execve(2) or an exit (either by a call to _exit(2) or abnormally).
  The parent process is suspended while the child is using its resources.
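The remark about shared file pointers can be checked with a short experiment in C. This is a sketch only; the helper name offset_seen_by_parent and the scratch path in the usage note are hypothetical.

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child moves the file offset with lseek();
 * the function returns the offset the parent observes afterwards
 * (or -1 on failure). The scratch file is removed before returning. */
off_t offset_seen_by_parent(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    off_t pos = -1;
    if (write(fd, "0123456789", 10) == 10 && lseek(fd, 0, SEEK_SET) == 0) {
        pid_t pid = fork();
        if (pid == 0) {                 /* child: has its own copy of fd */
            lseek(fd, 7, SEEK_SET);     /* ...but the offset is shared */
            _exit(0);
        }
        if (pid > 0) {                  /* parent: wait, then inspect */
            int status;
            if (waitpid(pid, &status, 0) == pid)
                pos = lseek(fd, 0, SEEK_CUR);
        }
    }

    close(fd);
    unlink(path);
    return pos;
}
```

With a writable scratch path such as "/tmp/gr_demo_fork.txt", the parent should see offset 7 even though it never called lseek(fd, 7, ...) itself.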

33 Illustration of Process Control Calls

34 A highly simplified shell
POSIX Shell
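The figure for this slide is not part of the transcript. As a stand-in, the core of such a shell (fork a child, exec the command, wait for it) might look like the following C sketch. The function name run_command and the 127 exit code for a failed exec are assumptions for illustration, not taken from the lecture.

```c
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical core of a very small shell: run one command and return
 * its exit status (or -1 if it could not be started or was killed). */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: replace this process image with the command. */
        execvp(argv[0], argv);
        _exit(127);         /* only reached if execvp() failed */
    }

    /* Parent (the "shell"): block until the child terminates. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real shell wraps this in a loop that prints a prompt, reads a line and splits it into argv; for example, run_command applied to {"sh", "-c", "exit 3", NULL} should return 3 on a system where sh is on the PATH.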

35 Executing the ls Command

36 Threads
Although the parent and child processes may share the program code, they have separate copies of data (stack and heap), so that changes by the child to a memory location are invisible to the parent (and vice versa)
Modern Unix systems support multithreaded applications having many relatively independent execution flows sharing a large portion of the data structures
A process can be composed of several user threads, each of which represents an execution flow of the process
A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler
A program counter is a register in a computer processor that contains the address of the instruction being executed at the current time. As each instruction gets fetched, the program counter advances to the address of the next instruction.

37 The principal POSIX thread calls.
POSIX Threads
Multithreaded applications are written using standard sets of library functions called pthread (POSIX thread) libraries
A mutex is a program object that allows multiple threads to share the same resource but not simultaneously
A condition variable is a container of threads that are waiting on a certain condition
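A minimal sketch of the mutex description, using the pthread library in C. The names worker and run_counter_threads and the thread and iteration counts are hypothetical.

```c
#include <pthread.h>

enum { NTHREADS = 4, NITERS = 10000 };

static long counter;    /* the shared resource */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; the mutex ensures that
 * only one thread at a time performs the read-modify-write. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts the threads, joins them, and returns the final counter value. */
long run_counter_threads(void)
{
    pthread_t tids[NTHREADS];
    int started = 0;

    counter = 0;
    while (started < NTHREADS &&
           pthread_create(&tids[started], NULL, worker, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

If all four threads start, the result is 40000; without the lock, increments could be lost. On many systems the program has to be built with the -pthread compiler option.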

38 The UNIX scheduler is based on a multilevel queue structure

39 UNIX Scheduler
Process states:
  NEW
  NORMAL (RUNNABLE, SLEEPING, STOPPED)
  ZOMBIE
Kernel uses 2 queues to hold processes in different states: zombieproc and allproc
In most cases threads are organized in 2 queues: runnable queue and waiting queue
  Threads which are ready to run go to the runnable queue, while threads which are waiting for something are placed in the waiting queue
  Queues are organized based on process and thread priority values
  Waiting queue is hashed on the event ID in order to make the search operation faster
A process exits either by using the exit() call or by receiving a signal
  Either way, the exit status is delivered to the parent process by the wait4() system call
A zombie is a process that has completed execution (via the exit system call) but still has an entry in the process table. This occurs for child processes, where the entry is still needed to allow the parent process to read its child's exit status. Once the exit status is read via the wait system call, the zombie's entry is removed from the process table.
STOPPED means that the process has received a stop signal (e.g. SIGTSTP from Ctrl+Z), and won't do anything much until it receives a CONT signal.

40 POSIX Signals
Signals serve two main purposes:
  To make a process aware that a specific event has occurred
  To cause a process to execute a signal handler function included in its code
A number of system calls allow programmers to send signals and to determine how their processes respond to the signals they receive
Signals may be sent at any time to a process whose state is usually unpredictable
Signals sent to a process that is not currently executing must be saved by the kernel
The only information given to the process is usually a number identifying the signal.

41 Disk layout in classical UNIX systems
UNIX File System

42 Inodes
Index nodes
Directory entry: the main file table (filename, inode), in memory
Inode entries: in memory or on disk (or both)
  Contain permissions, ownership, timestamps etc.
  Point to blocks of data on the disk (physical addresses)
Several directory entries (hard links) may refer to the same inode and thus the same data, e.g. from directories of different users
Deleting files: the directory entry is unlinked and the inode's link count is decremented; the data stays on disk until the count reaches zero and no process has the file open
  If one “deletes” a file while another link still points to the same inode, the data is not removed
Inode object ≠ file object (the file object describes how a process interacts with an opened file)
  Because several processes may access the same file concurrently, the file pointer must be kept in the file object rather than the inode object
All information needed by the filesystem to handle a file is included in a data structure called an inode.
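How two names can share one inode is visible through link(2) and stat(2). A minimal C sketch follows; the helper name link_count_after_hard_link and the scratch paths in the usage note are hypothetical, and the filesystem is assumed to support hard links.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a file, gives it a second name with
 * link(), and returns the link count stored in the inode if both names
 * resolve to the same inode; returns -1 otherwise. Cleans up after itself. */
int link_count_after_hard_link(const char *path, const char *alias)
{
    struct stat a, b;
    int result = -1;

    /* O_EXCL: refuse to touch a file that already exists. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    if (link(path, alias) == 0) {
        if (stat(path, &a) == 0 && stat(alias, &b) == 0 &&
            a.st_dev == b.st_dev && a.st_ino == b.st_ino)
            result = (int)a.st_nlink;
        unlink(alias);      /* drops one name; the inode survives */
    }
    unlink(path);           /* last name gone: the data can be freed */
    return result;
}
```

Called with two unused paths in the same directory, for example "/tmp/gr_demo_a" and "/tmp/gr_demo_b", the function is expected to return 2.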

43 Disk vs. Filesystem
Diagram labels: /, bin, etc, users, tmp, usr, hollid2, scully
The entire hierarchy (tree) can actually include many disk drives
  Some directories can be on other computers
Absolute vs. relative paths
Symbolic links: a filesystem construct (unlike shortcuts)
  Unix itself processes the symbolic link, resolving it to the real object transparently

44 Top Directory Structure
/bin The bin directory is where all the executable binaries were kept in early Unix. Over time, as more and more executables were added to Unix, it became quite unmanageable to keep all the executables in one place and the bin directory split into multiple parts (/bin, /sbin, /usr/bin)
/dev Device files (screen, keyboard, hard disks etc.)
/etc Unix designates the etc directory as the storage place for all the administrative files and information.
/lib If programs want to include certain features, they can reference just the shared copy of that utility in the Unix library rather than having a new unique copy.
/lost+found When files are recovered after any sort of problem or failure, they are placed in the lost+found directory if the kernel cannot ascertain the proper location in the system.
/mnt The mnt directory is an empty directory reserved for mounting removable filesystems like hard disks, removable cartridge drives, and so on.
/tmp The tmp directory contains temporary files created by Unix system programs. You can remove any temporary file that does not belong to a running program.
/usr The usr directory consists of several subdirectories that contain additional Unix commands and data files.
/home Default location of user home directories.
/var Log files, spools (mail queue)

45 Fedora Linux Directories
/]# ls -l
total 237
drwxr-xr-x 2 root root Sep 20 17:19 bin
drwxr-xr-x 4 root root Sep 20 16:04 boot
drwxr-xr-x 23 root root Sep 20 16:13 dev
drwxr-xr-x 41 root root Sep 20 17:19 etc
drwxr-xr-x 2 root root Mar home
drwxr-xr-x 2 root root Mar initrd
drwxr-xr-x 9 root root Sep 20 17:19 lib
drwx root root Sep 20 19:00 lost+found
drwxr-xr-x 2 root root Apr 14 20:39 misc
drwxr-xr-x 5 root root Sep 20 16:13 mnt
drwxr-xr-x 2 root root Mar opt
dr-xr-xr-x 50 root root Sep 20 19:12 proc
drwxr-x root root Sep 20 17:06 root
drwxr-xr-x 2 root root Sep 20 17:19 sbin
drwxr-xr-x 2 root root Mar selinux
drwxr-xr-x 8 root root Sep 20 19:12 sys
drwxrwxrwt 2 root root Sep 20 17:28 tmp
drwxr-xr-x 14 root root Sep 20 16:03 usr
drwxr-xr-x 18 root root Sep 20 16:10 var
/]#

46 Virtual File System
First included in SunOS (1986)
Allows disks or partitions hosting file formats used by other systems (incl. Windows) to be mounted transparently
  The root directory (/) is contained in the root filesystem
  Other filesystems can be mounted on subdirectories of the root filesystem
Provides a common interface to several kinds of filesystems
  There is a kernel field or function to support operations provided by each supported filesystem
  The kernel substitutes the actual functions (read, write, etc.) that support the native filesystem

47 System Calls for File Management
s is an error code
fd is a file descriptor
position is a file offset
...

48 Fields returned by the stat system call

49 System Calls for Directory Management
s is an error code
dir identifies a directory stream
dirent is a directory entry

50 System Calls for File Protection
s is an error code
uid and gid are the UID and GID, respectively

51 Some examples of file protection modes
Security in UNIX

52 passwd, shadow, group
unix etc # ls -l passwd shadow group
-rw-r--r-- 1 root root 705 Sep 23 15:36 group
-rw-r--r-- 1 root root 1895 Sep 24 18:20 passwd
-rw root root 634 Sep 24 18:22 shadow
unix etc #

unix root # more /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/bin/false
daemon:x:2:2:daemon:/sbin:/bin/false
adm:x:3:4:adm:/var/adm:/bin/false
lp:x:4:7:lp:/var/spool/lpd:/bin/false
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
...
guest:x:405:100:guest:/dev/null:/dev/null
nobody:x:65534:65534:nobody:/:/bin/false
girtsf:x:1000:100::/home/girtsf:/bin/bash
dima:x:1001:100::/home/dima:/bin/bash
guntis:x:1002:100::/home/guntis:/bin/bash
students:x:1003:100::/home/students:/bin/bash
unix root #

unix root # more /etc/group
root::0:root
bin::1:root,bin,daemon
daemon::2:root,bin,daemon
sys::3:root,bin,adm
adm::4:root,adm,daemon
tty::5:girtsf
disk::6:root,adm
lp::7:lp
mem::8:
kmem::9:
wheel::10:root,girtsf
floppy::11:root
mail::12:mail
...
users::100:games,girtsf
nofiles:x:200:
qmail:x:201:
postfix:x:207:
postdrop:x:208:
smmsp:x:209:smmsp
slocate::245:
portage::250:portage
utmp:x:406:
nogroup::65533:
nobody::65534:
unix root #

unix root # more /etc/shadow
root:$1$VlYbWsrd$GUs2cptio.rKlGHgAMBzr.:12684:0:::::
halt:*:9797:0:::::
...
guest:*:9797:0:::::
nobody:*:9797:0:::::
girtsf:$1$u6UEWKT2$w5K28n2iAB2wNWtyPLycP1:12684:0:99999:7:::
dima:$1$BQCdIBdV$xzzlj4s8XT6L9cLAmcoV50:12684:0:99999:7:::
guntis:$1$fiJF/0BT$Py9JiQQL6icajjQVyMZ7//:12684:0:99999:7:::
students:$1$wueon8yh$nLpUpNOKr8yTYaEnEK6OJ1:12685:0:99999:7:::
unix root #

53 Special Filesystems
There are a few special types of filesystems that play an important role in the internal design of the kernel
While network and disk-based filesystems enable the user to handle information stored outside the kernel, special filesystems may provide an easy way for system programs and administrators to manipulate the data structures of the kernel and to implement special features of the operating system
Examples (name, mount point, description):
  proc    /proc   General access point to kernel data structures
  sysfs   /sys    General access point to system data
  pipefs  none    Pipes (can be treated in the same way as FIFO files)
Some special filesystems do not have mount points

54 The /proc pseudo filesystem
The /proc directory contains virtual files that are windows into the current state of the running kernel. This allows the user to peer into a vast array of information, effectively providing them with the kernel's point-of-view within the system. In addition, the user can use the /proc directory to communicate particular configuration changes to the kernel. /proc directory contains files that are not part of any filesystem associated with your hard disks, CD-ROM, or any other physical storage device connected to your system (except, arguably, your RAM). Rather, these files are part of a virtual filesystem, enabled or disabled in the kernel when it is compiled. The /proc virtual filesystem is a switch in the configuration of the kernel, one that is turned on by default. If, for whatever reason, you would like to completely disable /proc on your system, de-select /proc file system support within the File system configuration section of config, menuconfig, or xconfig when rebuilding your kernel. Alternatively, you can simply comment out the /proc line in /etc/fstab to prevent it from being mounted. The /proc virtual files exhibit some interesting qualities. First, most of them are 0 bytes in size. However, when the file is viewed, it likely contains quite a bit of information. In addition, most of their time and date settings reflect the current time and date, meaning that they are constantly changing. A system administrator can use /proc as an easy method of accessing information about the state of the kernel, the attributes of the machine, the states of individual processes, and more. Most of the files in this directory, such as interrupts, meminfo, mounts, and partitions, provide an up-to-the-moment glimpse of a system's environment.

55 The /proc pseudo filesystem
Another interesting quality of virtual files can be seen when viewing them with the more command, which usually gives your location in the file by displaying the percentage of the document you have seen so far. This percentage usually climbs the further you navigate down a long file. However, when viewing a /proc virtual file, the percentage never changes, always staying at 0%.
Be sure to avoid viewing the kcore file in /proc. This virtual file contains an image of the kernel's memory, and the contents of the file will do strange things to your terminal. You may need to type reset after hitting [Ctrl]-[C] to get back to a proper command line prompt.
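The zero-byte sizes described above can be checked with a short script. This is only a minimal sketch: the helper name reported_and_actual_size is made up for illustration, and the choice of /proc/version as the file to inspect assumes a Linux system with /proc mounted.

```python
import os


def reported_and_actual_size(path):
    """Return (size reported by stat, number of bytes actually read)."""
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual


if __name__ == "__main__":
    path = "/proc/version"  # assumption: Linux with /proc mounted
    if os.path.exists(path):
        reported, actual = reported_and_actual_size(path)
        print(f"{path}: stat reports {reported} bytes, read returned {actual} bytes")
    else:
        print(f"{path} is not available on this system")
```

For a regular file the two numbers agree. For most /proc files the first is expected to be 0 while the second is not, because the content is generated by the kernel at read time.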

56 Top-Level Files in /proc
Most of the files at the top-level of the /proc directory hold key pieces of information about the state of the Linux kernel and your system in general. It is important to remember that the content of the files in the /proc directory and its various sub-directories is entirely dependent on information concerning your system. In other words, do not expect to see the exact same information in the same /proc file on two different machines.

57 Top-Level Files in /proc
/proc/apm This file provides information about the Advanced Power Management (APM) state and options on the system. The apm command uses this information.
/proc/cmdline This file shows the parameters passed to the Linux kernel at the time it is started.
/proc/cpuinfo The contents of this file depend on the type of processor in your system. The output is fairly easy to understand.
/proc/devices This file displays the various character and block devices currently configured for use with the kernel. It does not include modules that are available but not loaded into the kernel. The output includes the major number and name of each device.
/proc/dma This file contains a list of the registered ISA direct memory access (DMA) channels in use.
/proc/execdomains This file lists the execution domains currently supported by the Linux kernel, along with the range of personalities they support. Think of execution domains as a kind of "personality" of a particular operating system. Other binary formats, such as Solaris, UnixWare, and FreeBSD, can be used with Linux. By changing the personality of a task running in Linux, a programmer can change the way the operating system treats particular system calls from a certain binary.
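As a rough illustration of the /proc/devices layout (a "Character devices:" section and a "Block devices:" section, each listing a major number and a name), the sketch below parses such text into a dictionary. The function name parse_devices and the sample text are illustrative assumptions, not output from a real machine.

```python
def parse_devices(text):
    """Parse /proc/devices-style text into {"character": {...}, "block": {...}}.

    Each inner dict maps a major number to the list of names registered for it.
    """
    devices = {"character": {}, "block": {}}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Character devices"):
            section = "character"
        elif line.startswith("Block devices"):
            section = "block"
        elif section is not None:
            parts = line.split(None, 1)
            if len(parts) != 2 or not parts[0].isdigit():
                continue  # skip anything that is not "<major> <name>"
            devices[section].setdefault(int(parts[0]), []).append(parts[1])
    return devices


# Made-up sample in the usual layout of the file.
SAMPLE_DEVICES = """Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
 10 misc

Block devices:
  8 sd
  9 md
"""

if __name__ == "__main__":
    for kind, majors in parse_devices(SAMPLE_DEVICES).items():
        for major, names in sorted(majors.items()):
            print(kind, major, ", ".join(names))
```

On a live system the same function could be fed the contents of /proc/devices instead of the sample string.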

58 Top-Level Files in /proc
/proc/fb This file contains a list of frame buffer devices, with the frame buffer device number and the driver that controls it.
/proc/filesystems This file displays a list of the filesystem types currently supported by the kernel.
/proc/interrupts This file records the number of interrupts per IRQ on the x86 architecture.
/proc/iomem This file shows you the current map of the system's memory for its various devices.
/proc/ioports In a way similar to /proc/iomem, /proc/ioports provides a list of currently registered port regions used for input or output communication with a device.
/proc/isapnp This file lists Plug and Play (PnP) cards in ISA slots on the system. This is most often seen with sound cards but may include any number of devices.
/proc/kcore This file represents the physical memory of the system and is stored in the core file format. Unlike most /proc files, kcore does display a size. This value is given in bytes and is equal to the size of physical memory (RAM) used plus 4KB.
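The /proc/filesystems listing ties back to the special filesystems discussed earlier: a filesystem type that does not need a block device is marked with a leading "nodev" column (a detail of the file format that the slides do not mention). The sketch below separates the two groups; parse_filesystems and the sample text are illustrative assumptions.

```python
def parse_filesystems(text):
    """Return a list of (filesystem name, needs_no_device) pairs."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Lines look like "nodev<TAB>proc" or "<TAB>ext3".
        nodev = len(fields) == 2 and fields[0] == "nodev"
        result.append((fields[-1], nodev))
    return result


# Made-up sample in the layout of /proc/filesystems.
SAMPLE_FILESYSTEMS = "nodev\tsysfs\nnodev\tproc\nnodev\tpipefs\n\text3\n\tvfat\n"

if __name__ == "__main__":
    for name, nodev in parse_filesystems(SAMPLE_FILESYSTEMS):
        kind = "no backing device" if nodev else "disk-based"
        print(f"{name}: {kind}")
```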

59 Top-Level Files in /proc
/proc/kmsg This file is used to hold messages generated by the kernel. These messages are then picked up by other programs, such as klogd.
/proc/ksyms This file holds the kernel exported symbol definitions used by the module tools to dynamically link and bind loadable modules.
/proc/loadavg This file provides a look at the load average, or the utilization of the processor, over time, as well as additional data used by uptime and other commands.
/proc/locks This file displays the files currently locked by the kernel. The content of this file contains kernel internal debugging data and can vary greatly, depending on the use of the system.
/proc/mdstat This file contains the current information for multiple-disk, RAID configurations. If your system does not contain such a configuration, then your mdstat file will look similar to this:
    Personalities :
    read_ahead not set
    unused devices: <none>
/proc/meminfo This is one of the more commonly used /proc files, as it reports plenty of valuable information about the current utilization of RAM on the system.
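To make the "additional data" in /proc/loadavg concrete, here is a minimal parsing sketch. It assumes the usual five-field layout of the file (three load averages, a runnable/total count of scheduling entities, and the most recently assigned PID); the names LoadAvg and parse_loadavg and the sample line are illustrative only.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one_minute: float
    five_minutes: float
    fifteen_minutes: float
    runnable: int   # scheduling entities currently runnable
    total: int      # scheduling entities that exist
    last_pid: int   # PID most recently handed out


def parse_loadavg(text):
    """Parse one line in the format of /proc/loadavg."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


if __name__ == "__main__":
    sample = "0.20 0.18 0.12 1/80 11206\n"  # made-up sample line
    print(parse_loadavg(sample))
```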

60 Top-Level Files in /proc
/proc/misc This file lists miscellaneous drivers registered on the miscellaneous major device, which is number 10.
/proc/modules This file displays a list of all modules that have been loaded by the system. Its contents will vary based on the configuration and use of your system.
/proc/mounts This file provides a quick list of all mounts in use by the system.
/proc/mtrr This file refers to the current Memory Type Range Registers (MTRRs) in use with the system.
/proc/partitions This file provides detailed information on the various partitions currently available to the system.
/proc/pci This file contains a full listing of every PCI device on your system. Depending on the number of PCI devices you have, /proc/pci can get rather long.
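As one example of consuming these files, /proc/mounts can be split into fields much like /etc/fstab: device, mount point, filesystem type, and mount options, followed by two numeric fields. The sketch below keeps the first four. The function name parse_mounts and the sample lines are illustrative assumptions.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into a list of dictionaries."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        mounts.append({
            "device": device,
            "mount_point": mount_point,
            "fs_type": fs_type,
            "options": options.split(","),
        })
    return mounts


# Made-up sample lines in the layout of /proc/mounts.
SAMPLE_MOUNTS = """proc /proc proc rw,nosuid 0 0
/dev/sda1 / ext3 rw 0 0
"""

if __name__ == "__main__":
    for m in parse_mounts(SAMPLE_MOUNTS):
        print(f'{m["device"]} on {m["mount_point"]} type {m["fs_type"]}')
```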

61 Top-Level Files in /proc
/proc/slabinfo This file gives information about memory usage on the slab level. Linux kernels later than 2.2 use slab pools to manage memory above the page level. Commonly used objects have their own slab pools.
/proc/stat This file keeps track of a variety of different statistics about the system since it was last restarted.
/proc/swaps This file measures swap space and its utilization.
/proc/uptime This file contains information about how long the system has been on since its last restart.
/proc/version This file tells you the versions of the Linux kernel and gcc.
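The /proc/uptime file holds two numbers: seconds since boot and seconds spent idle. A small sketch of turning the first into days, hours, minutes, and seconds follows; parse_uptime, split_duration, and the sample values are illustrative assumptions.

```python
def parse_uptime(text):
    """Return (uptime_seconds, idle_seconds) from /proc/uptime-style text."""
    uptime, idle = text.split()[:2]
    return float(uptime), float(idle)


def split_duration(seconds):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    remaining = int(seconds)
    days, remaining = divmod(remaining, 86400)
    hours, remaining = divmod(remaining, 3600)
    minutes, secs = divmod(remaining, 60)
    return days, hours, minutes, secs


if __name__ == "__main__":
    uptime, idle = parse_uptime("350735.47 234388.90\n")  # made-up sample
    days, hours, minutes, secs = split_duration(uptime)
    print(f"up {days} days, {hours:02d}:{minutes:02d}:{secs:02d}")
```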

62 Directories in /proc
Related groups of kernel information are collected into directories and sub-directories within /proc.
Process Directories
The /proc directory contains quite a few directories named with a number. These directories are called process directories, as each is named after a process ID and contains information specific to that process. The owner and group of each process directory are set to the user running the process. When the process terminates, its /proc process directory vanishes. While the process is running, however, a great deal of information specific to that process is contained in the process directory's various files. Each of the process directories contains the following files:
cmdline: Contains the command line arguments that started the process.
cpu: Provides specific information about the utilization of each of the system's CPUs.
cwd: A link to the current working directory for the process.

63 Directories in /proc
environ: Gives a list of the environment variables for the process, as NAME=value entries. By convention, variable names are written in upper-case characters.
exe: A link to the executable of this process.
fd: A directory containing all of the file descriptors for a particular process.
maps: Contains memory maps to the various executables and library files associated with this process.
mem: The memory held by the process.
root: A link to the root directory of the process.
stat: The status of the process.
statm: The status of the memory in use by the process. The seven columns relate to different memory statistics for the process. In the order they are displayed, from left to right, they report:
  1. Total program size, in kilobytes
  2. Size of memory portions, in kilobytes
  3. Number of pages that are shared
  4. Number of pages that are code
  5. Number of pages of data/stack
  6. Number of pages of library
  7. Number of dirty pages

64 Directories in /proc
status: Provides the status of the process in a form that is much more readable than stat or statm.
/proc/self The /proc/self directory is a link to the process directory of the currently running process. This allows a process to look at itself without having to know its process ID. Within a shell environment, a listing of the /proc/self directory produces the same contents as listing the process directory for that process.
/proc/bus This directory contains information specific to the various buses available on the system. For example, on a standard system containing ISA, PCI, and USB buses, current data on each of these buses is available in its directory under /proc/bus. The contents of the sub-directories and files available vary greatly depending on the precise configuration of your system. However, each of the directories for each of the bus types contains at least one directory for each bus of that type.
/proc/driver This directory contains information for specific drivers in use by the kernel. A common file found here is rtc, which provides output from the driver for the system's Real Time Clock (RTC), the device that keeps the time while the system is switched off.
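The cmdline and environ files of a process directory store their entries separated by null bytes rather than spaces or newlines, so they need a little decoding. The sketch below, with the hypothetical helpers split_nul and parse_environ, shows one way to do this and uses /proc/self to inspect the running script itself when /proc is available.

```python
import os


def split_nul(raw):
    """Split NUL-separated bytes (as in /proc/<pid>/cmdline) into strings."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last entry
    if not raw:
        return []
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0")]


def parse_environ(raw):
    """Turn NUL-separated NAME=value entries into a dictionary."""
    env = {}
    for entry in split_nul(raw):
        name, _, value = entry.partition("=")
        env[name] = value
    return env


if __name__ == "__main__":
    path = "/proc/self/cmdline"  # assumption: Linux with /proc mounted
    if os.path.exists(path):
        with open(path, "rb") as f:
            print("this process was started as:", split_nul(f.read()))
```

Replacing self with a numeric process ID reads the same information for another process, subject to file permissions.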

65 Directories in /proc
/proc/fs This directory contains specific filesystem, file handle, inode, dentry and quota information. This information is actually located in /proc/sys/fs.
/proc/ide This directory holds an assorted array of information about IDE devices on the system. Each IDE channel is represented as a separate directory, such as /proc/ide/ide0 and /proc/ide/ide1.
Device Directories
Some of the most useful data can be found in the device directories within the channel directory. Each device, such as a hard drive or CD-ROM, on that channel will have its own directory containing its own collection of information and statistics. The contents of these directories vary according to the type of device connected. Some of the more useful files common to different devices include:
cache: The device's cache.
capacity: The capacity of the device, in 512 byte blocks.
driver: The driver and version used to control the device.
geometry: The physical and logical geometry of the device.
media: The type of device, such as a disk.
model: The model name or number of the device.
settings: A collection of current parameters of the device.

66 Directories in /proc
/proc/irq This directory is used to set IRQ to CPU affinity, which allows you to connect a particular IRQ to only one CPU. Alternatively, you can exclude a CPU from handling any IRQs. Each IRQ has its own directory, allowing each IRQ to be configured differently from any other. The /proc/irq/prof_cpu_mask file is a bitmask that contains the default values for the smp_affinity file in the IRQ directory. The values in smp_affinity specify which CPUs handle that particular IRQ.
/proc/net This directory provides a comprehensive look at various networking parameters and statistics.
/proc/scsi In the same way that the /proc/ide directory only exists if an IDE controller is connected to the system, the /proc/scsi directory is only available if you have a SCSI host adapter.
/proc/sys This directory is special and different from the others in /proc, as it not only provides a lot of information about the system but also allows you to make configuration changes to a running kernel.
Warning: Never attempt to tweak your kernel's settings on a production system using the various files in the /proc/sys directory. Occasionally, changing a setting may render the kernel unstable, requiring a reboot of the system. As this would obviously disrupt any users currently using the system, use a similar development system to try out changes before applying them to any production machines.
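Since smp_affinity is a bitmask, it helps to see how a mask maps to CPU numbers. The sketch below assumes the mask is written in hexadecimal with bit 0 standing for CPU 0, and that on large systems it is split into comma-separated, zero-padded 32-bit groups. The function names mask_to_cpus and cpus_to_mask are illustrative.

```python
def mask_to_cpus(mask):
    """Convert a hexadecimal affinity mask string into a list of CPU numbers."""
    # Assumption: comma-separated groups are zero-padded, so they can be joined.
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def cpus_to_mask(cpus):
    """Build the hexadecimal mask string that selects the given CPUs."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    return format(value, "x")


if __name__ == "__main__":
    print(mask_to_cpus("f"))     # CPUs 0 to 3
    print(cpus_to_mask([0, 2]))  # mask with bits 0 and 2 set
```

A mask of 1 therefore ties an IRQ to CPU 0 only, while clearing a CPU's bit in every IRQ's mask excludes that CPU from interrupt handling.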

67 Directories in /proc
The /proc/sys directory contains several different directories that control different aspects of a running kernel.
/proc/sys/dev This directory provides parameters for particular devices on the system. Most systems have at least two directories, cdrom and raid, but customized kernels can have others, such as parport, which provides the ability to share one parallel port between multiple device drivers.
/proc/sys/fs This directory contains an array of options and information concerning various aspects of the filesystem, including quota, file handle, inode, and dentry information.
/proc/sys/kernel This directory contains a variety of different configuration files that directly affect the operation of the kernel.
/proc/sys/net This directory contains assorted directories of its own concerning various networking topics, including assorted protocols and centers of emphasis. Various configurations at the time of kernel compilation make available different directories here, such as appletalk, ethernet, ipv4, ipx, and ipv6. Within these directories, you can adjust the assorted networking values for that configuration on a running system.

68 Directories in /proc
/proc/sys/vm This directory facilitates the configuration of the Linux kernel's virtual memory (VM) subsystem. The kernel makes extensive and intelligent use of virtual memory, which is backed on disk by what is commonly called swap space.
/proc/sysvipc This directory contains information about System V IPC resources. The files in this directory relate to System V IPC calls for messages (msg), semaphores (sem), and shared memory (shm).
/proc/tty This directory contains information about the available and currently used tty devices on the system. Originally short for teletype device, the term tty now covers any character-based data terminal. In Linux, there are three different kinds of tty devices. Serial devices are used with serial connections, such as over a modem or using a serial cable. Virtual terminals create the common console connection, such as the virtual consoles available when pressing [Alt]-[<F-key>] at the system console. Pseudo terminals create a two-way communication that is used by some higher level applications, such as X11.

69 Using sysctl
Setting kernel parameters in the /proc/sys directory need not be a manual process or one that requires echoing values into a virtual file and hoping they are correct. The sysctl command can make viewing, setting, and automating special kernel settings very easy.
To get a quick overview of all settings configurable in the /proc/sys directory, type the sysctl -a command as root. This will create a large, comprehensive list. This is the same basic information you would see if you viewed each of the files individually. The only difference is how the file location is written: the file /proc/sys/net/ipv4/route/min_delay is written as net.ipv4.route.min_delay, with the directory slashes replaced by dots and the /proc/sys portion assumed.
While quickly setting single values in /proc/sys is helpful during testing, it does not work as well on a production system, as all /proc/sys special settings are lost when the machine is rebooted. To make the settings you want permanent, add them to the /etc/sysctl.conf file.
Even though the /proc filesystem is a great resource to exploit, sometimes it is just missing. The filesystem is not vital to system operation, and there are cases when you choose to leave it out of the kernel image or simply don't mount it. When you build an embedded system, for example, saving a few kilobytes can be an interesting option; if you are very concerned about security, on the other hand, you might decide to hide system information and leave /proc unmounted.
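The naming rule described above (slashes become dots, the /proc/sys prefix is assumed) is easy to express in code. The sketch below is only illustrative: the three helper names are made up, the value 2 is an arbitrary example, and the simple replacement ignores corner cases such as interface names that themselves contain dots.

```python
def sysctl_name_to_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return root + "/" + name.replace(".", "/")


def path_to_sysctl_name(path, root="/proc/sys"):
    """Map a file under /proc/sys back to its dotted sysctl name."""
    prefix = root + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path} is not under {root}")
    return path[len(prefix):].replace("/", ".")


def sysctl_conf_line(name, value):
    """Format one 'name = value' line as used in /etc/sysctl.conf."""
    return f"{name} = {value}"


if __name__ == "__main__":
    print(sysctl_name_to_path("net.ipv4.route.min_delay"))
    print(path_to_sysctl_name("/proc/sys/net/ipv4/route/min_delay"))
    print(sysctl_conf_line("net.ipv4.route.min_delay", 2))
```

A line produced by sysctl_conf_line is the kind of entry that would be added to /etc/sysctl.conf to make a setting survive a reboot.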

70 Using sysctl
The system call interface to kernel tuning, namely sysctl, is an alternative way to peek into configurable parameters and to modify them. An additional advantage of the system call interface is that it is faster, as no fork/exec is involved, nor any directory lookup. Unless you run a very old platform, however, the performance savings are irrelevant.
To use the system call, the header <sys/sysctl.h> must be included. It declares the function as:
int sysctl (int *name, int nlen, void *oldval, size_t *oldlenp, void *newval, size_t newlen);
The arguments of the function have the following meaning:
name points to an array of integers: each of the integer values identifies a sysctl item, either a directory or a leaf node file. The symbolic names for such values are defined in <linux/sysctl.h>.
nlen states how many integers are listed in the array name: to reach a particular entry you need to specify the path through the subdirectories, so you need to tell how long that path is.
oldval is a pointer to a data buffer where the old value of the sysctl item is to be stored. If it is NULL, the system call won't return values to user space.
oldlenp points to a size_t stating the length of the oldval buffer. The system call changes the value to reflect how much data has been written, which can be less than the buffer length.
newval points to a data buffer hosting replacement data: the kernel will read this buffer to change the sysctl entry being acted upon. If it is NULL, the kernel value is not changed.
newlen is the length of newval. The kernel will read no more than newlen bytes from newval.

71 Using sysctl (FreeBSD specific)
The FreeBSD sysctl mechanism is based on the so-called linker set technology[1]. It lets us gather information about a running kernel and configure it to some extent without building a new kernel. All the information is stored inside the kernel and is organized into a Management Information Base (MIB) tree. To access the MIB tree, you use sysctl variables, whose names are managed hierarchically.
Most sysctl variables have ASCII names separated by dots. For example, the read-only sysctl variable kern.ostype contains the type of the kernel. This naming scheme is very similar to filenames, except that dots rather than slashes separate the component names. To list all sysctl variables by their ASCII names, you can issue the following command:
$ sysctl -a
The types of the sysctl variables include node, integer, string, structure and opaque data. A node is like a directory in a filesystem. The kern.ostype variable is a string whose value is "FreeBSD". The sysctl command that you use on a command line only accepts the ASCII name of a sysctl variable. Unlike filenames, wildcard characters like "*" and "?" are not accepted. However, you do not have to specify a full name to display sysctl variables.
All sysctl names are implemented internally as an array of integers. These are called "integer names" here to distinguish them from "ASCII names". You can only use integer names with the system call __sysctl(). A program that only knows the ASCII name of a sysctl variable must use the special integer name {0,3} along with the ASCII name to get the integer name of the sysctl variable. This indirection cannot be avoided.

72 Using sysctl (FreeBSD specific)
The maximum number of integers making up a sysctl name is limited to CTL_MAXNAME (12). The internal name corresponding to kern.ostype is an array of integers with two elements: {CTL_KERN, KERN_OSTYPE} or {1,1}. Note that some sysctl variables only have integer names. For example, {CTL_KERN, KERN_PROF, GPROF_STATE} is the name of the kernel profiling sysctl variable that records whether the kernel is currently being profiled. It has no corresponding ASCII name and therefore cannot be accessed by the sysctl command.
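The relationship between ASCII names and integer names can be pictured with a toy model of the MIB tree. The sketch below is not FreeBSD code: TOY_MIB, name_to_mib, and read_mib are hypothetical, the tree holds a single variable, and only the values stated above (CTL_KERN = 1, KERN_OSTYPE = 1, CTL_MAXNAME = 12, and the string "FreeBSD") are taken from the slides.

```python
CTL_KERN = 1
KERN_OSTYPE = 1
CTL_MAXNAME = 12

# Toy MIB tree: component name -> (integer id, child dict or leaf value).
TOY_MIB = {"kern": (CTL_KERN, {"ostype": (KERN_OSTYPE, "FreeBSD")})}


def name_to_mib(ascii_name, tree=TOY_MIB):
    """Translate a dotted ASCII name into its integer name."""
    mib, node = [], tree
    for component in ascii_name.split("."):
        if not isinstance(node, dict) or component not in node:
            raise KeyError(ascii_name)
        number, node = node[component]
        mib.append(number)
    if len(mib) > CTL_MAXNAME:
        raise ValueError("name longer than CTL_MAXNAME")
    return mib


def read_mib(mib, tree=TOY_MIB):
    """Follow an integer name through the tree and return what it points to."""
    node = tree
    for number in mib:
        if not isinstance(node, dict):
            raise KeyError(tuple(mib))
        for child_number, child in node.values():
            if child_number == number:
                node = child
                break
        else:
            raise KeyError(tuple(mib))
    return node


if __name__ == "__main__":
    mib = name_to_mib("kern.ostype")
    print(mib, read_mib(mib))
```

In this model name_to_mib plays the role of the {0,3} lookup, and read_mib the role of a read through __sysctl() using the integer name.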

73 Resources
M.K. McKusick, G.V. Neville-Neil. Design and Implementation of the FreeBSD Operating System
FreeBSD Documentation project. FreeBSD Handbook
The Official Red Hat Linux Reference Guide
G. Mourani, Open Network Architecture Inc. Securing and Optimizing Linux: The Ultimate Solution
D.P. Bovet, M. Cesati. Understanding the Linux Kernel. O’Reilly, 3rd Ed., 2006
J.T. Giffin, G.S. Kola. Linux Process Control via the File System
InformIT - The /proc File System
Z. Zhang. FreeBSD 4.0 Sysctl Mechanism
S. Davis. sysctl On NetBSD - An Easy Way To Get Process Data
O. Andreasson. Ipsysctl tutorial 1.0.4

