Presentation on theme: "A COMPLETE GUIDE TO OPENMOSIX (updated to 2.4.26)."— Presentation transcript:
A COMPLETE GUIDE TO OPENMOSIX (updated to )
Note :- THIS PRESENTATION IS BEST VIEWED UNDER MICROSOFT POWERPOINT XP BECAUSE OF USE OF ANIMATIONS
Authors Students of IIT Delhi, India Aastha Jain (Btech 3 rd Year ) Shruti Garg (Btech 3 rd Year) Akram Khan (Dual Degree 3 rd Year) Rohan Choudhary (Dual Degree 3 rd Year) This presentation has been made as a part of a reading assignment in the Operating Systems course (2006 Spring) taught by Dr. Subhashis Banerjee
Born early 80s on PDP-11/70. One full PDP and disk-less PDP, therefore process migration idea. First implementation on BSD/pdp as MS.c thesis. VAX 11/780 implementation (different word size, different memory architecture) Motorola / VME bus implementation as Ph.D. thesis in 1993 for under contract from IDF (Israeli Defence Forces) 1994 BSDi version, GNU and Linux since 1997 Contributed dozens of patches to the standard Linux kernel Split Mosix / openMosix November 2001 OPEN MOSIX PROJECT HISTORY
What is openMOSIX (today version ) Linux kernel extension (2.4.26) for clustering Single System Image - like an SMP, for: No need to modify applications Adaptive resource management to dynamic load characteristics (CPU intensive, RAM intensive, I/O etc.) Linear scalability (unlike SMP)
Single System Image Cluster Users can start from any node in the cluster, or sysadmin setups a few nodes as "login" nodes use round-robin DNS: “hpc.qlusters” with many IPs assigned to same name Each process has a Home-Node Migrated processes always seem to run at the home node, e.g., “ps” show all your processes, even if they run elsewhere
A two level technology 1. Information gathering and dissemination Support scalable configurations by probabilistic dissemination algorithms Same overhead for 16 nodes or 2056 nodes 2. Pre-emptive process migration that can migrate any process, anywhere, anytime - transparently Supervised by adaptive algorithms that respond to global resource availability Transparent to applications, no change to user interface
Organization of topics Gossip Algorithms for Information Dissemination Process migration by adaptive resource management algorithms Load Balancing Memory Ushering How Actual migration takes place ? Direct File System Access MigSHM (A Special study)
Gossip Algorithms and Open Mosix
Agenda A short introduction to gossip algorithms Cluster/Grid Information services requirements How good is old information The distributed bulletin board model Implementation
A Problem In an n node system assume that every pair of nodes can communicate directly node i wishes to send a message (rumor, color) to all other nodes. Possible deterministic solutions BROADCAST (only in a broadcast medium) Defining a static tree between the nodes and sending the message along the edges of this tree
A Gossip Style solution Starting with the round in which a rumor is generated each node that holds the rumor selects another node independently and uniformly at random send the rumor to this node The distribution of the rumor is terminated after some fixed number of O( ln n ) rounds At this point all players are informed with high probability
Uniform Gossip Example 1 t
Gossip benefits Robustness to the presence of node failures Messages will continue to propagate due to the random selection of destination F nodes failure results in only O(F) uninformed players Simplicity All nodes run the same algorithm Scalability The number of massages each nodes send (and possibly receive) each round is fixed
Gossip taxonomy Other names are Epidemic algorithms (demers et al) Randomized communication (Karp et al) Propagation can be done by Push – sending the information from the node to the selected node Pull – the other way around Push&Pull both ways We distinguish between 2 conceptual layers A basic gossip algorithm by which nodes choose other nodes for communication A gossip-based protocol Built on top of a gossip algorithm Determine the content of the messages that are sent The way received messages cause nodes to update their internal state
Rumor speeding bounds From a single node to all Time complexity: Message complexity (Karp el al) lower bound to the number of messages:
Spatial Gossip (Kampe at al) New information is most interesting to nodes that are nearby Combines the benefits of Uniform gossip Deterministic flooding The gossip algorithm chooses the nodes according to New information is spread to nodes at distance d with high probability,in :
Aggregating values Gossip can also be used to aggregate a value over all nodes Average, maximum, minimum … In this case the question is how fast the local value in each node converge to the desired value
Cluster/Grid Information services Basic properties of Grid environment Information sources are distributed Individual sources are subject to failure Total number of information providers is large Both the types of information sources and the ways it is used can be varied We cannot in general provide users with accurate information: any information delivered to a user is “old” How useful is old information? (Mitzenmacher) How to build an information service with guaranteed age properties?
Distributed Bulletin board The system Consists of ‘N’ nodes (or clusters) Distributed Nodes are subject to failure Each node maintains a data structure that holds an entry on selected (or all) nodes in the system We refer to this data structure as “The vector” Each vector entry holds: state of the resources (static and dynamic) about the corresponding node age of the information (tune to the local clock) The vector is a distributed bulletin board that serves information requests locally
Algorithm 1- Information dissemination Each time unit Update local information Find all vector entries which are up to age t Choose a random node Send the above entries to that node Upon receiving a message Compute the received entries age Update the entries which the newly received information is fresher A:1B:12C:2D:4E:11 A:1C:2D:4 A:4B:12C:2D:4E:11 B:1C:3E:3
Algorithm 1 : t=2 1 t
Handling inactive nodes The presence of inactive nodes causes problems Age quality of the information deteriorate Number of ARP broadcasts increase linearly Using a fixed size window improves the age quality but the number of ARP broadcasts stay the same
Algorithm 2 Algorithm 2 solves the above 2 issues Works basically the same as algorithm 1 with the following difference when sending a message Calculate l the number of active nodes (from the local vector) Generate a random number between k=0…l If K=0 send the window to all nodes Else send the window only to the active nodes Using Algorithm 2 the maximal expected number of messages to inactive nodes ≤ 1 From all nodes at each round
Algorithm 2 – minimizing messages to inactive nodes 1 t
Algorithm 2 t 2
Supporting Urgent information In previous algorithm information is propagated from all nodes constantly In some cases we wish to send an important message urgently to all such as the detection of a newly dead node In this case the source node give the message high priority 2*log(n) When a node assemble the window it is about to send it takes the entries with the highest priority and only then the younger entries The priority of an entry is decremented every time unit The result is that urgent messages are disseminated in O(log(n)) steps And regular information is disseminated a bit slower
Conclusions Constructed a distributed bulletin board Age properties are guaranteed The administrator can configure it to the desired properties No two nodes have the same view of the system Information requests are served locally Noise level (messages to inactive) is constant Urgent messages are propagated quickly
Load Balancing Load calculation and distribution
The key Idea Convert the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous “cost”. Processes are then assigned to the machine where they have the lowest cost.
The Model Cluster of n machines Machine i has 1) CPU resource of speed – r c (i) 2) Memory resource of size – r m (i) Each process is defined by three parameters: 1) Its arrival time, a(j), 2) The number of CPU seconds it requires, t(j), and 3) The amount of memory it requires, m(j) Assumption- m(j) is known when a job arrives, but t(j) is not. A job must be assigned to a machine immediately upon its arrival, and may or may not be able to move to another machine later.
Let J(t,i). be the set of jobs in machine i at time t. CPU load and the memory load of machine i at time t l c (t,i) = | J(t,i)| l m (t,i) = ∑ j€J(t,i) m(j) When a machine runs out of main memory, it is slowed down by a multiplicative factor of r,due to disk paging. The effective CPU load of machine i at time t, L(t,i), is therefore: l c (t,i) if l m (t,i) < r m (i) l c (t,i)*r otherwise At time t, each job on machine i will receive 1/L(t, i) of the CPU resource. A job's completion time, c(j), therefore satisfies : c(j) ∫ a(j) r c (i) / L(t,i) = t(j)
Machine models 1. Identical Machines. All of the machines are identical,and the speed of a job on a given machine is determined only by the machine's load. 2. Related Machines. The machines are identical except that some of them have different speeds i.e they have different r c values, 3. Unrelated Machines. Many different factors can influence the effective load of the machine and the completion times of jobs running there. Jobs: 1. Permanent Jobs. 2. Temporary Jobs.
Main Goal Our goal is a method for job assignment and/or reassignment that will minimize the average slowdown over all jobs. Slowdown of job (c(j) – a(j) ) / t(j) The algorithm that we’ll describe from here onwards is called as ASSIGN-U algorithm
ASSIGN-U algorithm for single resource Identical and related mach CPU time can be considered as the only resource, hence in case when no reassignments are possible, a simple greedy algorithm performs well (competive ratio is 2-(1/n)) An algo(ALG) is c competetive if for any input seq I ALG(I) <= c*OPT(I) + d Where OPT is any optimal algo and d is constant
Unrelated Mach concept of Marginal Cost Let: a be a constant, 1 < a < 2, L(i,j) be the load of machine i before assigning job j, P(i,j) be the load job j will add to machine i. Assign j to the machine i that minimizes the marginal cost H i (j) = a l(i,j) + p(i,j) – a l(I,j) This algorithm is O(log n) competitive for unrelated machines and permanent jobs and can be extended to temporary jobs, using up to O(log n) reassignments per job with the same competetive ratio.
ASSIGN-u for k-resources Cost = i=1 ∑ k a (load on resource i) where 1
Memory ushering means migrating a processes from a node that nearly exhausted its free memory, to prevent paging. Memory ushering algorithm is executed only when free memory becomes scarce otherwise only load balancing algorithm is used.
Memory ushering algorithm. There are three parts to the memory ushering algorithm- 1. When to migrate a process. 2. Which process to migrate. 3. Choice of Target node. Mosix_mem_daemon logs memory usage for each process for memory ushering.
When to migrate a process. The algorithm is triggered when node’s free memory falls short of the threshold memory. The choice of threshold memory is critical( should be adjusted according to the page allocation rate.)
Choosing the process to migrate. At first stage the process with lowest migration overhead is selected from the processes whose memory requirement is more than the memory overflow value. If there is no such process or we don’t have a target node to accommodate this process then move to stage 2.
Stage 2 In second stage algorithm finds the node with the largest free memory index. It finds the largest process which can fit into this node. Algorithm again moves to stage1.
Choosing the target node Target node is chosen from among the subset of nodes whose free memory indices are available. The node whose free memory index is larger than the requirement of the process which has to be migrated is chosen. The node with lowest load is chosen to avoid repeated migrations. Before start of process migration approval is required from destination site to avoid simultaneous migrations from different nodes.
Preemptive Process Migration
Unique home node Each process has a Unique Home-Node (UHN) where it was created. Processes that migrate to other (remote) nodes use local (in the remote node) resources whenever possible, but interact with the user's environment through the UHN To a user, process seems to be running at UHN (transparency) irrespective of wherever it may have migrated (SSI).
Does the process migrate completely? User Context ( remote) - contains the program code, stack, data, memorymaps, and registers of the process. The remote encapsulates the process when it is running in the user level ; hence migrates easily. System context (deputy) - contains a description of the resources that the process is attached to, and a kernel- stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel ; hence it must remain in the UHN of the process.
On the Local node Contacts remote mig daemon. After successful handshake, calls mig_do_send(): send the mm area, which sends some fields (ex: start, end....) of the process's mm. send vma areas. send pages to populate memory area. send floating point context (fp registers, fp state). send process context (ex: credentials, normal registers, misc...) When this function returns, mark the process DDEPUTY here.
On the remote node When the mig daemon accepts a contact on remote, it starts a new user process which calls mig_do_receive() after a successull handshake. As it doesn't know either how much memory pages or how much vma it is going to receive: loop waiting for data to receive. identify the type of data we are going to receive. pass the data to the right receive function. break the loop on receiving MIG_TASK calling arch_kickstart(), which will jump into a kernel return to userspace path (ret_from_kickstart).
What happens to Deputy Stub ? Must not be killed because we need to complete some task that a remote process can't do (ex: open a file, read from disks). We can't also return in userspace because of lack of userspace memory or free cpu cycles. Before returning into userspace test whether a task is DMIGRATED (migrating),if so call openmosix_pre_usermode(). This function will call deputy_main_loop(), which will loop until the process has been marked DDEPUTYand then wait for either incoming (from remote) or pending (to be sent to remote) messages.
Handling Syscalls 2 kinds – 1) Can be executed like a local syscall eg-which interact with the migrated memory, 2) need to be handled on the home node, like file related ones.
Passing Arguments has a problem ! Arguments which are pointer, are user adresses. On deputy, there's no memory, so all syscalls which refers to at least one memory address, will fail. We need to pack all data associated with pointer on remote, then send it with argument and syscall number. When the deputy will try to copy to or from user memory, deputy_copy_(from|to)_user will be called. The address will be search into all packed data, and if success it will copy data to/from the realaddr into the ucache data.
Fork and Clone Requires creation of a new link between the child on deputy and the child on the remote. When receiving a remote_sys_fork create a socket waiting for communication, then the socket's sockaddr is send with register to deputy with a request for forking. On deputy, after getting the request to fork, connect to the socket sockaddr received, then create the new task. If all goes well, assign the new socket to the new task, then reply to remote's request. After receiving the reply on remote, fork as well,the task and then assign the socket to the task.
Signal delivery Other forms of interaction :- signal delivery and process wakeup events, such as when network data arrives. In a typical scenario, the kernel at the UHN informs the deputy of the event. The deputy checks whether any action needs to be taken, and if so, informs the remote. The remote monitors the communication channel for reports of asynchronous events, like signals, just before resuming user-level execution. (by calling openmosix_pre_usermode().)
Direct File System Access : An Open Mosix special
Direct File System Access As the name implies, DFSA is a provision that allows a migrated process to directly access files in its current location. When combined with an appropriate cluster file system it substantially increases I/O performance and decreases network congestion.
The need for DFSA As we have seen, openMosix is efficient for load balancing CPU-bound processes. However, it became inefficient for running processes with significant amount of I/O or file system operations. For example,
The need for DFSA
So we see that a simple read() syscall requires 4 messages and an I/O intensive process has the potential to congest the entire network, not to mention slow itself down. To overcome this problem, the scheme of direct file system access was devised.
What is DFSA all about? So whenever openMosix encounters a high I/O oriented process, it migrates it to a node (file server) where the necessary files reside. Now on this node DFSA acts as a re- routing switch which intercepts and performs most of the file related syscalls on the node itself!
What is DFSA all about?
Advantages of DFSA Substantial increase of I/O performance Reduction in network congestion Possible to partition large files into several smaller ones and place them on different file servers so that parallel access (through parallel migrated processes) on that file is possible.
Requirements from the supporting file systems Single-node consistency. Time-stamps on files in the same partition must be consistent and non-decreasing regardless from which node the modifications are made.
Single-node consistency The results of any sequence of read/write operations on a file by processes running in a set of nodes must be the same if all those processes were running on one node. NFS does not provide single node consistency when used in multiple nodes.
The Mosix File System At the time DFSA was developed, there was no production level, scalable (without shared hardware) file system available which satisfied the above criteria. So to test the performance of DFSA, the Mosix file system (MFS/oMFS) was developed!
The Mosix File System Features of MFS Provides a unified view of all files on all mounted file systems on all the nodes of a MOSIX cluster as if they were all on a single partition. E.g. if MFS is mounted on /mfs mount-point then the file /mfs/14/usr/tmp/m.c refers to the file /usr/tmp/m.c on node #14 Scalability and single node consistency.
The Mosix File System Client/server model When a process issues an MFS related syscall, the local kernel acts as a client and redirects the request to the appropriate MFS server. Thus each node can be used both as a client and a server.
The Mosix File System
Disadvantages No cache on clients is a major drawback for I/O operations with smaller block sizes. High availability is not supported i.e., failure of a node prevents any access to files which were on that node.
Bringing the process to the file Each process collects statistics on its MFS usage, including usage, rates and amount of data accessed on each node. The list of nodes is continuously updated according to the process’s most recent I/O activities. These statistics are incorporated along with the other collected information like CPU usage etc. and into the process migration policy.
Bringing the process to the file If the supervising algorithm detects that a process’s I/O is increasing a certain threshold then it is marked as a candidate for migration. After weighing the amount of I/O operations with the CPU load-balancing considerations, the process is migrated (if required) to the node on which it does most of its I/O.
MigSHM :- The Indian touch
Shared Memory handling. Earlier version of openmosix did not have provision for migrating processes using shared memory. Migshm patch for openmosix introduced by MAASK team ( 5 girls from Pune) Migshm’s job is to intercept function calls like shmget(), shmat(), shmdt() which handle the shared memory.
Modules in Migshm Migration of shared memory processes- This module enables shared memory processes to be picked up during their execution for migration. It keeps track of information such as identifier of the process, virtual memory address of the shared memory region to which the process is attached. Keeps a track of identifier of the shared memory region,access counts of shared memory region by all processes attached to it.
Consistency module Its task is to maintain consistency among copies of shared memory pages that exist on different nodes. Local copies of modified pages will only be written to the original owner when the lock on that memory segment is released, thus not for every write. After writing to the owner node invalidate message is sent to all the nodes.When process on some other node has to access the modified page it page faults to the owner node.
When a process tries to access invalid page it page faults to the owner node. The function sys_semop is used to acquire or release a semaphore lock.The release of semaphore is used as the event for flushing out dirty pages of shared memory region and sending the invalidate message.
Communication module To enable communication between two processes on different nodes a daemon MigSharedMemD has been implemented. This daemon is responsible for receiving messages that are identified with the help of a header type.
Access logs and migration decision This module keeps a track of number of times shared memory has been accessed by the process. A process is strongly linked if it accesses the shared memory more than all other processes taken together. If a process is strongly linked we migrate the shared memory along with the process.
The function log_access increments count for read and write accesses.It is invoked from mosix_mem_daemon which is memsorter and logs memory usage for each process.
Migration of shared memory Openmosix uses the logical migration i.e. the owner node of the shared memory has the latest copy without creating new entry in the IPC array on the new node. MigShm also implements the owner page fault policy by fetching the page from the new owner node and not from initial owner.
If page was not referenced and is not present on the owner node then it is fetched from the home node. The function consider performs this task. Here shared memory may be migrated with a strongly linked task. It is logically migrated and change in ownership is broadcast.
The openMosix API Kernel 2.4. API and Implementation No new system-calls Everything done through /proc /proc/hpc /proc/hpc/admin Administration /proc/hpc/info Cluster-wide information /proc/hpc/nodes/nnnn/ Per-node information /proc/hpc/remote/pppp/ Remote proc. information
Impact on the kernel MOSIX for the kernel: 80 new files (40,000 lines) 109 modified files (7,000 lines changed/added) About 3,000 lines are load-balancing algorithms openMOSIX for Linux 47 new files (38,500 lines) 126 kernel files modified (5,200 lines changed/added) 48 user-level files (12,000 lines)
People behind openMosix Copyright for openMosix, Moshe Bar Barak and Moshe Bar were co-project managers of Mosix until Nov 2001 Team Members Danny Getz (migration) Avraham Ben Yehudah (MFS and 2.5.x) David Santo Orcero (user-space utilities) Michael Farnbach (extern. Patch matching, ie XFS, JFS etc.) Many others, including help from Ingo Molnar, Alan Cox, Andrea Arcangeli and Rik van Riel
Present and Future of openMosix Current Projects Migrating sockets Network RAM Checkpoint / Restart Queue Manager / Scheduler
Future Plans Inclusion in Linux 2.6 Re-writing MFS Increase developers to 20-30
References Amir Y., Awerbuch B., Barak A., Borgstrom R.S. and Keren A., An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster. IEEE Tran. Parallel and Distributed Systems (11) 7, pp , July 2000.An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster Barak A. and La'adan O., The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems (13) 4-5, pp , March 1998.The MOSIX Multicomputer Operating System for High Performance Cluster Computing Barak A., The MOSIX Organizational Grid - A White Paper, August 2005.The MOSIX Organizational Grid - A White Paper Amar L., Barak A. and Shiloh A., The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems. Cluster Computing (7) 2, pp , April 2004.The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems I. Peer, A. Barak, and L. Amar, “A Gossip-Based Distributed Bulletin Board with Guaranteed Age Properties,”, submitted for publication.A Gossip-Based Distributed Bulletin Board with Guaranteed Age Properties Introduction to openMosix presented by Moshe Bar and Maya, Anu, Asmita, Snehal, Krushna (MAASK) Introduction to openMosix presented by Moshe Bar and Maya, Anu, Asmita, Snehal, Krushna (MAASK) openMosix Internals - Hacker's Guide by Vincent Hanquez MigShm Project: Migration of Shared Memory - by MAASK MigShm Project: Migration of Shared Memory Memory Ushering in a Scalable Computing Cluster (1997) Amnon Barak
For downloading OpenMosix, visit For a Live example of OpenMosix, please login to yogini THANK YOU