Presentation is loading. Please wait.

Presentation is loading. Please wait.

Lightweight virtual system mechanism Gao feng

Similar presentations

Presentation on theme: "Lightweight virtual system mechanism Gao feng"— Presentation transcript:

1 Lightweight virtual system mechanism Gao feng
LXC(Linux Container) Lightweight virtual system mechanism Gao feng

2 Outline Introduction Comparison Problems Future work Namespace
System API Libvirt LXC Comparison Problems Future work

3 Introduction Container: Operation System Level virtualization method for Linux Guest1 Guest2 Container Management Tools P1 P2 P1 P2 Relies on the kernel namespace to isolate the resources of the system. The Guest shares the same Kernel with Host The Kernel provides the System API/ABI for the userspace Userspace container Management Tools use the API to create namespace and These tools is used to manager the containers too(libvirt-lxc, lxc-tools) API/ABI Kernel Namespace Set 1 Namespace Set 2

4 Introduction Why Container Easy to set up Multi-Tenancy environment
Better Performance Easy to set up Multi-Tenancy environment Kvm Container App Guest-OS Emulator-Lay App Host-OS Host-OS

5 Namespace Namespace isolates the resources of system, currently there are 6 kinds of namespaces in linux kernel. Mount namespace UTS namespace IPC namespace Net namespace Pid namespace User namespace

6 Mount Namespace Each mount namespace has its own filesystem layout.
/proc/<p1>/mounts / /dev/sda1 /home /dev/sda2 /proc/<p2>/mounts / /dev/sda1 /home /dev/sda2 /proc/<p3>/mounts / /dev/sda3 /boot /dev/sda4 Isolates the file system layout, every mount namespace has its own file system layout.The tasks running in each mount namespace doing mount/unmount will not affect the file system layout of the other mount namespace. Mount namespace can provide an private mount points for processes. it’s just like chroot but better than chroot. Chroot can only change the root directory of process. /proc/<pid>/mounts is also virtualized to only show the mount points of mount namespace that the <pid> process belongs to. P1 P2 P3 Mount Namespace1 Mount Namespace2

7 UTS Namespace Every uts namespace has its own uts related information.
ostype: Linux osrelease: version: … hostname: uts1 domainname: uts1 ostype: Linux osrelease: version: … hostname: uts2 domainname: uts2 Unalterable Every uts namespace has its own OS-type,OS-release,version,host name,domain name… the uname() systemcall will return different information in different uts namespaces. Every uts namespace has its own proc files. The processes can only see and change the proc files of uts namespace which these processes belong to. “Alterbale” File(s) can be changed. “Unalterable” can not be changed alterable

8 IPC Namespace IPC namespce isolates the interprocess communication resource(shared memory, semaphore, message queue) P1 P2 P3 P4 IPC namespace isolates the inter-process communication resources. The processes can communicate with each other through IPC only when they are in the same IPC namespace. IPC namespace1 IPC namespace2

9 Net Namespace Net namespace isolates the networking related resources
Net devices: eth0 IP address: /24 Route Firewall rule Sockets Proc sysfs Net Namespace2 Net devices: eth1 IP address: /24 Route Firewall rule Sockets Proc sysfs Every net namespace has its own network devices, ip address, firewall rules, routing tables, sockets and so on The proc and sysfs interface of net namespace has been virtualized too.

10 PID namespace1 (Parent)
PID namespace isolates the Process ID, implemented as a hierarchy. PID namespace1 (Parent) (Level 0) ls /proc P1 P4 pid:1 pid:4 P2 P3 pid:2 pid:3 The Pid namespace isolates the Process ID Pid namespace is a hierarchy comprised of “Parent” and “Child”. The parent pidns can have many children pidns. Each Child pidns only has one parent pidns. In the creation process tasks are created. Each task is allocated one pid in one pid namespace. Furthermore, a task is also allocated to each pid in parent and ancestor pid namespaces. The Parent pid namespace can access this Task through the pid allocated to the parent pid namespace. Every pid namespace has a level Value - n, used for marking the level of each pid namespace. The level of init_pid_ns is 0,it’s children pid namespace’s level is 1,the level of each grandchild is 2. Because the processes in the child pid namespace will allocate a pid to the ancestor pid namespace,the kernel limits the max level to MAX_PID_NS_LEVEL(32) /proc/<pid> has also been virtualized. Please note that in the graphic we also have P2! What this ,means is that each Parent and Child can have its own PID – PID:1 for the Child1 and PID:2 for the Parent P3 is the same as P2, it also has two PID’s. PID:1 for the Child2 and PID:3 for the Parent. The Child Namespace2 has one task that is made up of the proc and the Parent PID Namespace has 4 tasks – Tasks 1, 2, 3, and 4 The Parent PID ns can access the Task P2 through PID2 PID Namespace2 (Child) (Level 1) pid:1 PID Namespace3 (Child) (Level 1) pid:1 ls /proc 1 ls /proc 1

11 User Namespace kuid/kgid: Original uid/gid, Global
uid/gid: user id in user namespace, will be translated to kuid/kgid finally Only parent User NS has rights to set map Before I begin to explain the User Namespace, I would like to explain the KUID and KGID KUID: means the Kernel User ID. Before we have the User Namespace we us the UID. KGID: Is the same as KUID but it is the Kernel Group ID. UID: will only be used in the User Namespace User namespace has two private proc files /proc/pid/uid_map and gid_map. These two files can be used for setting the mapping table of the user namespace. uid_map and gid_map contain multiple lines, The format of each line is “user_uid kernel_uid count”. kuid: kuid: uid_map uid_map uid: 10-14 uid: 0-9 User namespace1 User namespace2

12 User Namespace Create and stat file in User namesapce root root
#touch /file root #stat /file uid_map: User namespace File : “/file” Access: uid (0/root) We Have the User ID Map to which P1 belongs. UID map maps the Root User of the User Namespace to a Normal User. P1 is the Root User of userns1 which wants to create the H-File. When we write the H-file to the disk, the owner of this file will be mapped to the normal user(uid 1000). Finally in the disk, the owner of this file is the Normal user. P2 is in the same User NS. It wants to get information of the H-file. When the userid information is returned to the P2, the owner of this file will be mapped to the root user of user namespace1. Disk /file (kuid:1000)

13 System API/ABI Proc System Call /proc/<pid>/ns/ clone unshare
setns In this Slide I will discuss the System API / ABI interface. The Namespace provides proc files to the Userspace In the System Call,we have three Namespaces related functions available: Clone, Unshare, and SETNS

14 Proc /proc/<pid>/ns/ipc: ipc namespace
/proc/<pid>/ns/mnt: mount namespace /proc/<pid>/ns/net: net namespace /proc/<pid>/ns/pid: pid namespace /proc/<pid>/ns/uts: uts namespace /proc/<pid>/ns/user: user namespace If the proc file of two processes is the same, these two processes must be in the same namespace. In the Proc File we have 6 sub-files; The first sub-file is the IPC which stands for the IPC NS

15 System Call clone CLONE_NEWIPC,CLONE_NEWNET, 6 new flags:
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, …); 6 new flags: CLONE_NEWIPC,CLONE_NEWNET, CLONE_NEWNS,CLONE_NEWPID, CLONE_NEWUTS,CLONE_NEWUSER Within the System Call parameter we have the “Flags”. The Namespace increases the parameter of the Flags of the System Call Clone which then adds 6 new Flags.

16 System Call clone create process2 and IPC namespace2 Mount1 Mount1 P1
clone(,, CLONE_NEWIPC,) IPC1 IPC2 (new created) Each Flag represents one type of Namespace. For example: P1 creates P2 through the System Call Clone with increased parameter Flag – Clone(Underscore _ )NewIPC. Furthermore, it will not only create a Task P2 but it will also create a new IPC NS(IPC2). And the P2 will be “run” in the IPC2. Others1 Others1

17 System Call unshare Namespace extends the system call unshare
int unshare(int flags); Namespace extends the system call unshare too. User space can use unshare to create new namespace and the caller will run in this new created namespace.

18 System Call unshare create net namespace2 Mount1 Mount1 P1 P1 Net1
unshare(CLONE_NEWNET) Net1 Net2 (new created) Others1 Others1

19 System Call setns int setns(int fd, int nstype);
setns is a new added system call for namespace. Process can use setns to set which namespace the process will belong to. @fd: file descriptor of namespace(/proc/<pid>/ns/*) @nstype: type of namespace.

20 System Call setns Change the PID namespace of P2 P1 P1 PID1 PID1 P2 P2
setns(open(/proc/p1/ns/pid,) , 0) PID2 PID2

21 Libvirt LXC Libvirt LXC: userspace container management tool, Implemented as one type of libvirt driver. Manage containers Create namespace Create private filesystem layout for container Create devices for container Resources controller by cgroup LIBVIRT LXC has 5 main functions which work together towards the preparation work for the Container.

22 Comparison The feature that host share the same kernel with guest makes container different from other virtualization method Container KVM performance Great Normal OS support Linux Only No Limit Security Completeness Low

23 /proc/meminfo, cpuinfo…
Problems /proc/meminfo, cpuinfo… Kernel space (relate to cgroup) User space (poor efficiency) New namespace Audit (assign to user namespace?) Syslog (do we really need it?) We currently have no new flags available for API clone system. It’s a problem we will need to resolve when we try to add a new namespace. The meminfo,cpuinfo etc files are not completely virtualized.

24 Problems Bandwidth control Disk quota TC Qdisc Netfilter
On host (How to handle setting nic to container?) On container (user can change it) Netfilter How to control Ingress bandwidth Disk quota Uid/Gid Quota (Many users ) Project Quota (xfs only)

25 Future Work Improve Libvirt LXC Unchanged systemd in Libvirt LXC
Use interface of systemd to set cgroup Libvirt LXC based Docker Audit namespace

26 Thank you! Q&A

Download ppt "Lightweight virtual system mechanism Gao feng"

Similar presentations

Ads by Google