Presentation on theme: "Lightweight virtual system mechanism Gao feng"— Presentation transcript:
1 Lightweight virtual system mechanism Gao feng email@example.com LXC(Linux Container)Lightweight virtual system mechanismGao feng
2 Outline Introduction Comparison Problems Future work Namespace System APILibvirt LXCComparisonProblemsFuture work
3 IntroductionContainer: Operation System Level virtualization method for LinuxGuest1Guest2ContainerManagementToolsP1P2P1P2Relies on the kernel namespace to isolate the resources of the system.The Guest shares the same Kernel with HostThe Kernel provides the System API/ABI for the userspaceUserspace container Management Tools use the API to create namespace andThese tools is used to manager the containers too(libvirt-lxc, lxc-tools)API/ABIKernelNamespace Set 1NamespaceSet 2
4 Introduction Why Container Easy to set up Multi-Tenancy environment Better PerformanceEasy to set up Multi-Tenancy environmentKvmContainerAppGuest-OSEmulator-LayAppHost-OSHost-OS
5 NamespaceNamespace isolates the resources of system, currently there are 6 kinds of namespaces in linux kernel.Mount namespaceUTS namespaceIPC namespaceNet namespacePid namespaceUser namespace
6 Mount Namespace Each mount namespace has its own filesystem layout. /proc/<p1>/mounts/ /dev/sda1/home /dev/sda2/proc/<p2>/mounts/ /dev/sda1/home /dev/sda2/proc/<p3>/mounts/ /dev/sda3/boot /dev/sda4Isolates the file system layout,every mount namespace has its own file system layout.The tasks running in each mount namespace doing mount/unmount will not affect the file system layout of the other mount namespace.Mount namespace can provide an private mount points for processes. it’s just like chroot but better than chroot.Chroot can only change the root directory of process./proc/<pid>/mounts is also virtualized to only show the mount points of mount namespace that the <pid> process belongs to.P1P2P3MountNamespace1MountNamespace2
7 UTS Namespace Every uts namespace has its own uts related information. ostype: Linuxosrelease:version: …hostname: uts1domainname: uts1ostype: Linuxosrelease:version: …hostname: uts2domainname: uts2UnalterableEvery uts namespace has its own OS-type,OS-release,version,host name,domain name… the uname() systemcall will return different information in different uts namespaces.Every uts namespace has its own proc files. The processes can only see and change the proc files of uts namespace which these processes belong to.“Alterbale” File(s) can be changed.“Unalterable” can not be changedalterable
8 IPC NamespaceIPC namespce isolates the interprocess communication resource(shared memory, semaphore, message queue)P1P2P3P4IPC namespace isolates the inter-process communication resources.The processes can communicate with each other through IPC only when they are in the same IPC namespace.IPCnamespace1IPCnamespace2
9 Net Namespace Net namespace isolates the networking related resources Net devices: eth0IP address: /24RouteFirewall ruleSocketsProcsysfs…Net Namespace2Net devices: eth1IP address: /24RouteFirewall ruleSocketsProcsysfs…Every net namespace has its own network devices, ip address, firewall rules, routing tables, sockets and so onThe proc and sysfs interface of net namespace has been virtualized too.
10 PID namespace1 (Parent) PID namespace isolates the Process ID, implemented as a hierarchy.PID namespace1 (Parent)(Level 0)ls /procP1P4pid:1pid:4P2P3pid:2pid:3The Pid namespace isolates the Process IDPid namespace is a hierarchy comprised of “Parent” and “Child”. The parent pidns can have many children pidns. Each Child pidns only has one parent pidns.In the creation process tasks are created. Each task is allocated one pid in one pid namespace. Furthermore, a task is also allocated to each pid in parent and ancestor pid namespaces.The Parent pid namespace can access this Task through the pid allocated to the parent pid namespace.Every pid namespace has a level Value - n, used for marking the level of each pid namespace. The level of init_pid_ns is 0，it’s children pid namespace’s level is 1，the level of each grandchild is 2.Because the processes in the child pid namespace will allocate a pid to the ancestor pid namespace,the kernel limits the max level to MAX_PID_NS_LEVEL(32)/proc/<pid> has also been virtualized.Please note that in the graphic we also have P2!What this ,means is that each Parent and Child can have its own PID – PID:1 for the Child1 and PID:2 for the ParentP3 is the same as P2, it also has two PID’s. PID:1 for the Child2 and PID:3 for the Parent.The Child Namespace2 has one task that is made up of the proc and the Parent PID Namespace has 4 tasks – Tasks 1, 2, 3, and 4The Parent PID ns can access the Task P2 through PID2PID Namespace2 (Child)(Level 1)pid:1PID Namespace3 (Child)(Level 1)pid:1ls /proc1ls /proc1
11 User Namespace kuid/kgid: Original uid/gid, Global uid/gid: user id in user namespace, will be translated to kuid/kgid finallyOnly parent User NS has rights to set mapBefore I begin to explain the User Namespace, I would like to explain the KUID and KGIDKUID: means the Kernel User ID. Before we have the User Namespace we us the UID.KGID: Is the same as KUID but it is the Kernel Group ID.UID: will only be used in the User NamespaceUser namespace has two private proc files /proc/pid/uid_map and gid_map.These two files can be used for setting the mapping table of the user namespace.uid_map and gid_map contain multiple lines, The format of each line is “user_uid kernel_uid count”.kuid:kuid:uid_mapuid_mapuid:10-14uid:0-9User namespace1User namespace2
12 User Namespace Create and stat file in User namesapce root root #touch /fileroot#stat /fileuid_map:UsernamespaceFile : “/file”Access: uid (0/root)We Have the User ID Map to which P1 belongs. UID map maps the Root User of the User Namespace to a Normal User.P1 is the Root User of userns1 which wants to create the H-File. When we write the H-file to the disk, the owner of this file will be mapped to the normal user(uid 1000). Finally in the disk, the owner of this file is the Normal user.P2 is in the same User NS. It wants to get information of the H-file. When the userid information is returned to the P2, the owner of this file will be mapped to the root user of user namespace1.Disk/file (kuid:1000)
13 System API/ABI Proc System Call /proc/<pid>/ns/ clone unshare setnsIn this Slide I will discuss the System API / ABI interface.The Namespace provides proc files to the UserspaceIn the System Call,we have three Namespaces related functions available: Clone, Unshare, and SETNS
14 Proc /proc/<pid>/ns/ipc: ipc namespace /proc/<pid>/ns/mnt: mount namespace/proc/<pid>/ns/net: net namespace/proc/<pid>/ns/pid: pid namespace/proc/<pid>/ns/uts: uts namespace/proc/<pid>/ns/user: user namespaceIf the proc file of two processes is the same, these two processes must be in the same namespace.In the Proc File we have 6 sub-files;The first sub-file is the IPC which stands for the IPC NS
15 System Call clone CLONE_NEWIPC,CLONE_NEWNET, 6 new flags: int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, …);6 new flags:CLONE_NEWIPC,CLONE_NEWNET,CLONE_NEWNS,CLONE_NEWPID,CLONE_NEWUTS,CLONE_NEWUSERWithin the System Call parameter we have the “Flags”. The Namespace increases the parameter of the Flags of the System Call Clone which then adds 6 new Flags.
16 System Call clone create process2 and IPC namespace2 Mount1 Mount1 P1 clone(,, CLONE_NEWIPC,)IPC1IPC2(new created)Each Flag represents one type of Namespace. For example:P1 creates P2 through the System Call Clone with increased parameter Flag – Clone(Underscore _ )NewIPC.Furthermore, it will not only create a Task P2 but it will also create a new IPC NS(IPC2).And the P2 will be “run” in the IPC2.Others1Others1
17 System Call unshare Namespace extends the system call unshare int unshare(int flags);Namespace extends the system call unsharetoo. User space can use unshare to createnew namespace and the caller will run inthis new created namespace.
18 System Call unshare create net namespace2 Mount1 Mount1 P1 P1 Net1 unshare(CLONE_NEWNET)Net1Net2(new created)Others1Others1
19 System Call setns int setns(int fd, int nstype); setns is a new added system call for namespace.Process can use setns to set which namespace theprocess will belong to.@fd: file descriptor of namespace(/proc/<pid>/ns/*)@nstype: type of namespace.
20 System Call setns Change the PID namespace of P2 P1 P1 PID1 PID1 P2 P2 setns(open(/proc/p1/ns/pid,) , 0)PID2PID2
21 Libvirt LXCLibvirt LXC: userspace container management tool, Implemented as one type of libvirt driver.Manage containersCreate namespaceCreate private filesystem layout for containerCreate devices for containerResources controller by cgroupLIBVIRT LXC has 5 main functions which work together towards the preparation work for the Container.
22 ComparisonThe feature that host share the same kernel with guest makes container different from other virtualization methodContainerKVMperformanceGreatNormalOS supportLinux OnlyNo LimitSecurityCompletenessLow
23 /proc/meminfo, cpuinfo… Problems/proc/meminfo, cpuinfo…Kernel space (relate to cgroup)User space (poor efficiency)New namespaceAudit (assign to user namespace?)Syslog (do we really need it?)We currently have no new flags available for API clone system.It’s a problem we will need to resolve when we try to add a new namespace.The meminfo,cpuinfo etc files are not completely virtualized.
24 Problems Bandwidth control Disk quota TC Qdisc Netfilter On host (How to handle setting nic to container?)On container (user can change it)NetfilterHow to control Ingress bandwidthDisk quotaUid/Gid Quota (Many users )Project Quota (xfs only)
25 Future Work Improve Libvirt LXC Unchanged systemd in Libvirt LXC Use interface of systemd to set cgroupLibvirt LXC based DockerAudit namespace