Distributed Systems: Principles and Paradigms
Chapter 01: Introduction
Prologue
- Distributed processing vs. distributed system
- DS research in the '80s vs. the '90s
- Centralized vs. distributed systems
- What are some examples of distributed systems?
Distributed System: Definition
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
Two aspects:
(1) Hardware: the computers are autonomous
(2) Software: users think they are dealing with a single system
01 – Introduction/1.1 Definition
Goals of Distributed Systems
- Resource sharing: allow users to access and share resources
- Transparency: hide the fact that processes and resources are physically distributed across multiple computers
- Openness: offer services according to standard rules that describe the syntax and semantics of those services (e.g., interfaces specified in an interface definition language, IDL)
- Scalability
01 – Introduction/1.2 Goals
Transparency
(table of transparency types not included in this export)
Degree of Transparency
Observation: aiming at full transparency may be too much:
- Users may be located on different continents; distribution is apparent and not something you necessarily want to hide
- Completely hiding failures of networks and nodes is (theoretically and practically) impossible:
  - You cannot distinguish a slow computer from a failing one
  - You can never be sure that a server actually performed an operation before a crash
- Full transparency costs performance, which in turn exposes the distribution of the system:
  - Keeping Web caches exactly up to date with the master copy
  - Immediately flushing write operations to disk for fault tolerance
Openness of Distributed Systems
Open distributed system: able to interact with services from other open systems, irrespective of the underlying environment:
- Systems should conform to well-defined interfaces
- Systems should support portability of applications
- Systems should easily interoperate
Achieving openness: at least make the distributed system independent of the heterogeneity of the underlying environment:
- Hardware
- Platforms
- Languages
Policies versus Mechanisms
Implementing openness requires support for different policies specified by applications and users:
- What level of consistency do we require for client-cached data?
- Which operations do we allow downloaded code to perform?
- Which QoS requirements do we adjust in the face of varying bandwidth?
- What level of secrecy do we require for communication?
Ideally, a distributed system provides only mechanisms:
- Allow (dynamic) setting of caching policies, preferably per cachable item
- Support different levels of trust for mobile code
- Provide adjustable QoS parameters per data stream
- Offer different encryption algorithms
Scalability in Distributed Systems
Observation: many developers of modern distributed systems easily use the adjective "scalable" without making clear why their system actually scales.
Scalability has at least three components:
- Number of users and/or processes (size scalability)
- Maximum distance between nodes (geographical scalability)
- Number of administrative domains (administrative scalability)
Most systems account only, to a certain extent, for size scalability. The (non)solution: powerful servers. Today, the challenge lies in geographical and administrative scalability.
Techniques for Scaling
Distribution: partition data and computations across multiple machines:
- Move computations to clients (Java applets)
- Decentralized naming services (DNS)
- Decentralized information systems (WWW)
Replication: make copies of data available at different machines:
- Replicated file servers (mainly for fault tolerance)
- Replicated databases
- Mirrored Web sites
- Large-scale distributed shared-memory systems
Caching: allow client processes to access local copies:
- Web caches (browser/Web proxy)
- File caching (at server and client)
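The caching technique can be sketched in a few lines. This is a minimal illustration, not a real system: `slow_fetch` and `fetch_count` are hypothetical stand-ins for an expensive remote lookup.

```python
# Caching sketch for the techniques above: serve repeated requests from a
# local copy and contact the "remote" service only on a miss.

fetch_count = 0

def slow_fetch(key):
    """Stand-in for an expensive remote operation."""
    global fetch_count
    fetch_count += 1
    return key.upper()

cache = {}

def cached_fetch(key):
    if key not in cache:              # miss: go remote, keep a local copy
        cache[key] = slow_fetch(key)
    return cache[key]                 # hit: no communication at all

first = cached_fetch("page")
second = cached_fetch("page")         # served locally; slow_fetch not called again
```

The second call never leaves the client, which is exactly what makes caching a scaling technique; the price, as the next slide notes, is that the local copy can become inconsistent with the master.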
Scaling: The Problem
Observation: applying scaling techniques is easy, except for one thing: having multiple copies (cached or replicated) leads to inconsistencies, because modifying one copy makes that copy different from the rest.
- Always keeping copies consistent, in a general way, requires global synchronization on each modification.
- Global synchronization makes large-scale solutions practically impossible.
Observation: if we can tolerate inconsistencies, we may reduce the need for global synchronization.
Observation: which inconsistencies can be tolerated is application dependent.
Distributed Systems: Hardware Concepts
- Multiprocessors
- Multicomputers
- Networks of computers
01 – Introduction/1.3 Hardware Concepts
Multiprocessors and Multicomputers
Distinguishing features:
- Private versus shared memory
- Bus versus switched interconnection
Networks of Computers
High degree of node heterogeneity:
- High-performance parallel systems (multiprocessors as well as multicomputers)
- High-end PCs and workstations (servers)
- Simple network computers (offer users only network access)
- Mobile computers (palmtops, laptops)
High degree of network heterogeneity:
- Local-area gigabit networks
- Wireless connections
- Long-haul, high-latency POTS connections
- Wide-area switched megabit connections
Observation: ideally, a distributed system hides these differences.
Distributed Systems: Software Concepts
- Distributed operating system
- Network operating system
- Middleware
01 – Introduction/1.4 Software Concepts
Distributed Operating System
Some characteristics:
- The OS on each computer knows about the other computers
- The OS on different computers is generally the same
- Services are generally (transparently) distributed across computers
Multicomputer Operating System
Harder than a traditional (multiprocessor) OS: because memory is not shared, the emphasis shifts to interprocessor communication by message passing:
- Often no simple global communication:
  - Only bus-based multicomputers provide hardware broadcasting
  - Efficient broadcasting may require network-interface programming techniques
- No simple system-wide synchronization mechanisms
- Virtual (distributed) shared memory requires the OS to maintain a global memory map in software
- Inherently distributed resource management: there is no central point where allocation decisions can be made
In practice, only very few true multicomputer operating systems exist (example: Amoeba).
Network Operating System
Some characteristics:
- Each computer has its own operating system with networking facilities
- Computers work independently (i.e., they may even run different operating systems)
- Services are tied to individual nodes (FTP, telnet, WWW)
- Highly file oriented (basically, processors share only files)
Middleware
Some characteristics:
- The OS on each computer need not know about the other computers
- The OS on different computers need not be the same
- Services are generally (transparently) distributed across computers
Need for Middleware
Motivation: too many networked applications were difficult to integrate:
- Departments run different NOSs
- Integration and interoperability exist only at the level of primitive NOS services
- Need for federated information systems:
  - Combining different databases, but providing a single view to applications
  - Setting up enterprise-wide Internet services, making use of existing information systems
  - Allowing transactions across different databases
  - Allowing extensibility for future services (e.g., mobility, teleworking, collaborative applications)
Constraint: use the existing operating systems, and treat them as the underlying environment.
Middleware Services (1/2)
Communication services: abandon primitive socket-based message passing in favor of:
- Procedure calls across networks
- Remote-object method invocation
- Message-queuing systems
- Advanced communication streams
- Event notification services
Information system services: services that help manage data in a distributed system:
- Large-scale, system-wide naming services
- Advanced directory services (search engines)
- Location services for tracking mobile objects
- Persistent storage facilities
- Data caching and replication
Middleware Services (2/2)
Control services: services giving applications control over when, where, and how they access data:
- Distributed transaction processing
- Code migration
Security services: services for secure processing and communication:
- Authentication and authorization services
- Simple encryption services
- Auditing services
Comparison of DOS, NOS, and Middleware
Dimensions of comparison:
1: Degree of transparency
2: Same operating system on each node?
3: Number of copies of the operating system
4: Basis for communication
5: How are resources managed?
6: Is the system easy to scale?
7: How open is the system?
Client–Server Model
- Basic model
- Application layering
- Client–server architectures
01 – Introduction/1.5 Client–Server Model
Basic Client–Server Model (1/2)
Characteristics:
- There are processes offering services (servers)
- There are processes that use services (clients)
- Clients and servers can be distributed across different machines
- Clients follow a request/reply model for using services
Basic Client–Server Model (2/2)
Servers: generally provide services related to a shared resource:
- Servers for file systems, databases, implementation repositories, etc.
- Servers for shared, linked documents (Web, VoD)
- Servers for shared applications
- Servers for shared distributed objects
Clients: allow remote service access:
- A programming interface transforming the client's local service calls into request/reply messages
- Devices with (relatively simple) digital components (barcode readers, teller machines, hand-held phones)
- Computers providing independent user interfaces for specific services
- Computers providing an integrated user interface for related services (compound documents)
Application Layering (1/2)
Traditional three-layered view:
- User-interface layer: contains units for an application's user interface
- Processing layer: contains the functions of an application, i.e., without specific data
- Data layer: contains the data that a client wants to manipulate through the application components
Observation: this layering is found in many distributed information systems, using traditional database technology and accompanying applications.
Application Layering (2/2)
(figure not included in this export)
Client–Server Architectures
- Single-tiered: dumb terminal/mainframe configuration
- Two-tiered: client/single-server configuration
- Three-tiered: each layer on a separate machine
Traditional two-tiered configurations: (figure not included in this export)
Alternative C/S Architectures
Observation: "multi-tiered" has become a buzzword that fails to capture many modern client–server systems.
Cooperating servers: the service is physically distributed across a collection of servers:
- Traditional multi-tiered architectures
- Replicated file systems
- Network news services
- Large-scale naming systems (DNS, X.500)
- Workflow systems
- Financial brokerage systems
Cooperating clients: the distributed application exists by virtue of client collaboration:
- Teleconferencing, where each client owns a (multimedia) workstation
- Publish/subscribe architectures, in which the roles of client and server are blurred
- Peer-to-peer (P2P) applications
Reading
Read Chapter 1 of the Distributed Systems: Principles and Paradigms book.
Distributed Systems: Principles and Paradigms
Chapter 03: Processes
Threads
- Introduction to threads
- Threads in distributed systems
03 – Processes/3.1 Threads
Introduction to Threads
Basic idea: we build virtual processors in software, on top of physical processors:
- Processor: provides a set of instructions, along with the capability of automatically executing a series of those instructions.
- Thread: a minimal software processor in whose context a series of instructions can be executed. Saving a thread context implies stopping the current execution and saving all the data needed to continue the execution at a later stage.
- Process: a software processor in whose context one or more threads may be executed. Executing a thread means executing a series of instructions in the context of that thread.
Context Switching (1/2)
- Processor context: the minimal collection of values stored in the registers of a processor used for the execution of a series of instructions (e.g., stack pointer, addressing registers, program counter).
- Thread context: the minimal collection of values stored in registers and memory used for the execution of a series of instructions (i.e., processor context plus state).
- Process context: the minimal collection of values stored in registers and memory used for the execution of a thread (i.e., thread context plus memory-management state).
Context Switching (2/2)
- Observation 1: threads share the same address space, so thread context switching can be done entirely independently of the operating system.
- Observation 2: process context switching is generally more expensive, as it involves getting the OS in the loop, i.e., trapping to the kernel.
- Observation 3: creating and destroying threads is much cheaper than creating and destroying processes.
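Observation 1 can be demonstrated directly. In this minimal sketch, four threads update one shared variable; the global `counter` and `worker` function are illustrative names, not part of any library.

```python
import threading

# Threads live in one address space: all four workers increment the same
# global counter. A lock serializes the updates to avoid lost increments.

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:            # protect the shared memory location
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 4000: every thread saw and modified the same memory.
```

If these were four separate processes instead of threads, each would have incremented its own private copy of `counter`, and communicating the totals would have required message passing.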
Threads and Operating Systems (1/2)
Main issue: should an OS kernel provide threads, or should they be implemented as user-level packages?
User-space solution:
- Has nothing to do with the kernel, so all operations can be completely handled within a single process.
- All services provided by the kernel are done on behalf of the process in which a thread resides, so if the kernel decides to block a thread, the entire process will be blocked. This requires messy workarounds.
- In practice, we want to use threads when there are lots of external events: threads block on a per-event basis. If the kernel cannot distinguish threads, how can it support signaling events to them?
Threads and Operating Systems (2/2)
Kernel solution: have the kernel contain the implementation of the thread package. This means that all thread operations are system calls:
- Operations that block a thread are no longer a problem: the kernel schedules another available thread within the same process.
- Handling external events is simple: the kernel (which catches all events) schedules the thread associated with the event.
- The big problem is the loss of efficiency: each thread operation requires a trap to the kernel.
Conclusion: try to mix user-level and kernel-level threads into a single concept.
Solaris Threads (1/2)
Basic idea: introduce a two-level threading approach: lightweight processes (LWPs) that can execute user-level threads.
Solaris Threads (2/2)
- When a user-level thread does a system call, the LWP that is executing that thread blocks. The thread remains bound to the LWP.
- The kernel can simply schedule another LWP that has a runnable thread bound to it. Note that this thread can switch to any other runnable thread currently in user space.
- When a thread calls a blocking user-level operation, we can simply do a context switch to a runnable thread, which is then bound to the same LWP.
- When there are no threads to schedule, an LWP may remain idle, and may even be removed (destroyed) by the kernel.
Threads and Distributed Systems (1/2)
Multi-threaded clients: the main issue is hiding network latency.
Multi-threaded Web client:
- The Web browser scans an incoming HTML page and finds that more files need to be fetched
- Each file is fetched by a separate thread, each doing a (blocking) HTTP request
- As files come in, the browser displays them
Multiple RPCs:
- A client does several RPCs at the same time, each one by a different thread
- It then waits until all results have been returned
- Note: if the RPCs go to different servers, we may see a linear speed-up compared to doing the RPCs one after the other
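The "multiple RPCs" pattern can be sketched as follows. This is a hedged illustration: `fake_rpc` is a hypothetical stand-in for a blocking RPC, with `time.sleep` simulating the network delay that the threads overlap.

```python
import threading
import time

# Issue several "RPCs" from separate threads, then wait for all replies.
# The simulated delays overlap, hiding the latency of each individual call.

def fake_rpc(x, results, i):
    time.sleep(0.05)       # simulated network + server latency
    results[i] = x * x     # the "reply"

args = [1, 2, 3, 4]
results = [None] * len(args)
threads = [threading.Thread(target=fake_rpc, args=(a, results, i))
           for i, a in enumerate(args)]
for t in threads:
    t.start()
for t in threads:          # wait until all results have been returned
    t.join()
# All four replies arrive after roughly one delay, not four.
```

Done sequentially, the four calls would take about four times as long; this is the linear speed-up the slide mentions for RPCs to different servers.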
Threads and Distributed Systems (2/2)
Multi-threaded servers: the main issues are improved performance and better structure.
Improved performance:
- Starting a thread to handle an incoming request is much cheaper than starting a new process
- A single-threaded server cannot simply be scaled to a multiprocessor system
- As with clients: hide network latency by reacting to the next request while the previous one is being replied to
Better structure:
- Most servers have high I/O demands; using simple, well-understood blocking calls simplifies the overall structure
- Multi-threaded programs tend to be smaller and easier to understand, due to the simplified flow of control
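A thread-per-request server can be sketched with Python's standard library. This is a minimal example, assuming nothing beyond the stdlib: `ThreadingTCPServer` dispatches each incoming connection to a fresh thread, which is far cheaper than forking a process per request.

```python
import socket
import socketserver
import threading

# Concurrent (thread-per-request) echo server: each connection is handled
# in its own thread by ThreadingTCPServer.

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(1024)       # read the request
        self.request.sendall(data.upper())   # send the reply

# Port 0 lets the kernel pick a free port.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with socket.create_connection(server.server_address) as sock:
    sock.sendall(b"hello")
    reply = sock.recv(1024)

server.shutdown()
server.server_close()
```

Replacing `ThreadingTCPServer` with plain `TCPServer` yields the iterative variant: one client at a time, with the same handler code.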
Clients
- User interfaces
- Other client-side software
03 – Processes/3.2 Clients
User Interfaces
Essence: a major part of client-side software is focused on (graphical) user interfaces.
Compound documents: make the user interface application-aware to allow inter-application communication:
- Drag-and-drop: move objects to other positions on the screen, possibly invoking interaction with other applications
- In-place editing: integrate several applications at the user-interface level (word processing plus drawing facilities)
Client-Side Software
Essence: often focused on providing distribution transparency:
- Access transparency: client-side stubs for RPCs and RMIs
- Location/migration transparency: let client-side software keep track of the actual location
- Replication transparency: multiple invocations handled by the client stub
- Failure transparency: can often be placed only at the client (we are trying to mask server and communication failures)
Servers
- General server organization
- Object servers
03 – Processes/3.3 Servers
General Organization
Basic model: a server is a process that waits for incoming service requests at a specific transport address. In practice, there is a one-to-one mapping between a port and a service.
Superservers: servers that listen on several ports, i.e., provide several independent services. In practice, when a service request comes in, they start a subprocess to handle it (UNIX inetd).
Iterative vs. concurrent servers: iterative servers can handle only one client at a time, in contrast to concurrent servers.
Out-of-Band Communication
Issue: is it possible to interrupt a server once it has accepted (or is in the process of accepting) a service request?
Solution 1: use a separate port for urgent data (possibly per service request):
- The server has a separate thread (or process) waiting for incoming urgent messages
- When an urgent message comes in, the associated request is put on hold
- Note: this requires that the OS supports high-priority scheduling of specific threads or processes
Solution 2: use the out-of-band communication facilities of the transport layer:
- Example: TCP allows urgent messages to be sent in the same connection
- Urgent messages can be caught using OS signaling techniques
Servers and State (1/2)
Stateless servers: never keep accurate information about the status of a client after having handled a request:
- Don't record whether a file has been opened (simply close it again after access)
- Don't promise to invalidate a client's cache
- Don't keep track of your clients
Consequences:
- Clients and servers are completely independent
- State inconsistencies due to client or server crashes are reduced
- Possible loss of performance because, e.g., a server cannot anticipate client behavior (think of prefetching file blocks)
Question: does connection-oriented communication fit into a stateless design?
Servers and State (2/2)
Stateful servers: keep track of the status of their clients:
- Record that a file has been opened, so that prefetching can be done
- Know which data a client has cached, and allow clients to keep local copies of shared data
Observation: the performance of stateful servers can be extremely high, provided clients are allowed to keep local copies. As it turns out, reliability is not a major problem.
Object Servers (1/2)
Servant: the actual implementation of an object, sometimes containing only method implementations:
- A collection of C functions that act on structs, records, DB tables, etc.
- Java or C++ classes
Skeleton: server-side stub for handling network I/O:
- Unmarshals incoming requests and calls the appropriate servant code
- Marshals results and sends the reply message
- Generated from interface specifications
Object adapter: the "manager" of a set of objects:
- Inspects incoming requests
- Ensures the referenced object is activated (requires the ID of the servant)
- Passes the request to the appropriate skeleton, following a specific activation policy
- Responsible for generating object references
Object Servers (2/2)
Observation: object servers determine how their objects are constructed.
Code Migration
- Approaches to code migration
- Migration and local resources
- Migration in heterogeneous systems
03 – Processes/3.4 Code Migration
Code Migration: Some Context
(figure not included in this export)
Strong and Weak Mobility
Object components:
- Code segment: contains the actual code
- Data segment: contains the state
- Execution state: contains the context of the thread executing the object's code
Weak mobility: move only the code and data segments (and start execution from the beginning) after migration:
- Relatively simple, especially if the code is portable
- Distinguish code shipping (push) from code fetching (pull), e.g., Java applets
Strong mobility: move the component, including its execution state:
- Migration: move the entire object from one machine to the other
- Cloning: simply start a clone, and set it in the same execution state
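Weak mobility can be sketched in miniature: ship a code segment (here, plain source text) and execute it from the beginning at the receiver, with no execution state transferred. The function name `area` and the namespace dict are purely illustrative.

```python
# Weak-mobility sketch: only the code segment moves; the receiver runs
# the migrated code from scratch in its own namespace.

shipped_code = """
def area(w, h):
    return w * h

result = area(3, 4)
"""

target_env = {}                   # the receiving machine's namespace
exec(shipped_code, target_env)    # execute the shipped code from the start
```

Strong mobility would additionally require capturing and restoring the running thread's stack and program counter, which is why it is so much harder to implement.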
Managing Local Resources (1/2)
Problem: an object uses local resources that may or may not be available at the target site.
Resource types:
- Fixed: the resource cannot be migrated, such as local hardware
- Fastened: the resource can, in principle, be migrated, but only at high cost
- Unattached: the resource can easily be moved along with the object (e.g., a cache)
Object-to-resource bindings:
- By identifier: the object requires a specific instance of a resource (e.g., a specific database)
- By value: the object requires the value of a resource (e.g., the set of cache entries)
- By type: the object requires only that a resource of a given type is available (e.g., a color monitor)
Managing Local Resources (2/2)
(table not included in this export)
Migration in Heterogeneous Systems
Main problem: the target machine may not be suitable to execute the migrated code, and the definition of process/thread/processor context is highly dependent on local hardware, the operating system, and the runtime system.
Only solution: make use of an abstract machine that is implemented on different platforms.
Current solutions:
- Interpreted languages running on a virtual machine (Java/JVM; scripting languages)
- Existing languages: allow migration only at specific "transferable" points, such as just before a function call
Software Agents
- What's an agent?
- Agent technology
03 – Processes/3.5 Software agents
What's an Agent?
Definition: an autonomous process capable of reacting to, and initiating changes in, its environment, possibly in collaboration with users and other agents.
- Collaborative agent: collaborates with others in a multi-agent system
- Mobile agent: can move between machines
- Interface agent: assists users at the user-interface level
- Information agent: manages information from physically different sources
Agent Technology
- Management: keeps track of where the agents on this platform are (mapping agent IDs to ports)
- Directory: mapping of agent names and attributes to agent IDs
- ACC (Agent Communication Channel): used to communicate with other platforms
Read Chapter 3!
Distributed Systems: Principles and Paradigms
Chapter 04: Communication
Layered Protocols
- Low-level layers
- Transport layer
- Application layer
- Middleware layer
Remote Procedure Call (RPC)
- Sun RPC
- DCE & DCE RPC
Remote Method Invocation
Message-Oriented Communication
Stream-Oriented Communication
02 – Communication/2.1 Layered Protocols
Basic Networking Model
- Open Systems Interconnection Reference Model (Zimmermann, 1983)
- Also called the ISO OSI or OSI model
- ISO: International Organization for Standardization
Low-Level Layers
- Physical layer: contains the specification and implementation of bits, and their transmission between sender and receiver
- Data link layer: prescribes the transmission of a series of bits into a frame to allow for error and flow control
- Network layer: describes how packets in a network of computers are to be routed
Observation: for many distributed systems, the lowest-level interface is that of the network layer.
Transport Layer
Important: the transport layer provides the actual communication facilities for most distributed systems.
Standard Internet protocols:
- TCP: connection-oriented, reliable, stream-oriented communication
- UDP: unreliable (best-effort) datagram communication
Note: IP multicasting is generally considered a standard available service.
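The UDP half of the contrast can be shown in a few lines: best-effort datagrams, no connection setup, each message self-contained. This sketch stays on the loopback interface, where delivery is reliable in practice even though UDP itself gives no guarantee.

```python
import socket

# UDP sketch: one datagram from sender to receiver, no connection setup.

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))      # port 0: kernel picks a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", addr)         # one self-contained datagram, no stream

data, _ = receiver.recvfrom(1024)
sender.close()
receiver.close()
```

With TCP the same exchange would require `connect`/`accept` first, and the bytes would arrive as a reliable, ordered stream rather than as discrete messages.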
Client–Server TCP
TCP for Transactions (T/TCP): a transport protocol aimed at supporting client–server interaction.
- Normal operation of TCP
- Transactional TCP
Application Layer
Observation: many application protocols are implemented directly on top of transport protocols, doing a lot of application-independent work.
Middleware Layer
Observation: middleware is invented to provide common services and protocols that can be used by many different applications:
- A rich set of communication protocols that allow different applications to communicate
- Marshaling and unmarshaling of data, necessary for integrated systems
- Naming protocols, so that different applications can easily share resources
- Security protocols, to allow different applications to communicate in a secure way
- Scaling mechanisms, such as support for replication and caching
Remote Procedure Call (RPC)
- Basic RPC operation
- Parameter passing
- Variations
02 – Communication/2.2 Remote Procedure Call
Basic RPC Operation
Observations:
- Application developers are familiar with the simple procedure model
- Well-engineered procedures operate in isolation (black box)
- There is no fundamental reason not to execute procedures on separate machines
Conclusion: communication between caller and callee can be hidden by using the procedure-call mechanism.
RPC Implementation (1/2)
Local procedure call:
1: Push the parameter values of the procedure onto a stack
2: Call the procedure
3: Use the stack for local variables
4: Pop the results (in parameters)
Principle: "communication" with a local procedure is handled by copying data to/from the stack (with a few exceptions).
RPC Implementation (2/2)
Steps involved in doing a remote "add" operation. (figure not included in this export)
RPC: Parameter Passing (1/2)
Parameter marshaling: there's more to it than just wrapping parameters into a message:
- Client and server machines may have different data representations (think of byte ordering)
- Wrapping a parameter means transforming a value into a sequence of bytes
- Client and server have to agree on the same encoding:
  - How are basic data values represented (integers, floats, characters)?
  - How are complex data values represented (arrays, unions)?
- Client and server need to properly interpret messages, transforming them into machine-dependent representations
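The byte-ordering point can be made concrete with Python's `struct` module. Network byte order (big-endian, `!` in the format string) is the conventional wire encoding; a receiver that assumed the wrong order would misread the very same bytes.

```python
import struct

# Marshaling sketch: both sides must agree on one encoding.

value = 0x01020304
wire = struct.pack("!I", value)            # marshal: 32-bit int -> 4 bytes,
                                           # big-endian (network byte order)

unmarshaled, = struct.unpack("!I", wire)   # receiver decodes with the same rule

# A receiver that wrongly assumed little-endian order would misread
# the identical byte sequence: the byte-ordering pitfall noted above.
misread, = struct.unpack("<I", wire)
```

Here `unmarshaled` equals the original value, while `misread` comes out as `0x04030201`: same four bytes, different interpretation.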
RPC: Parameter Passing (2/2)
- RPC assumes copy-in/copy-out semantics: while the procedure is executed, nothing can be assumed about the parameter values.
- RPC assumes that all data to be operated on is passed by parameters. This excludes passing references to (global) data.
Conclusion: full access transparency cannot be realized.
Observation: if we introduce a remote-reference mechanism, access transparency can be enhanced:
- A remote reference offers unified access to remote data
- Remote references can be passed as parameters in RPCs
Local RPCs: Doors
Essence: try to use the RPC mechanism as the only mechanism for interprocess communication (IPC). Doors are RPCs implemented for processes on the same machine.
Asynchronous RPCs
Essence: get rid of the strict request/reply behavior and let the client continue without waiting for an answer from the server.
Variation: deferred-synchronous RPC, in which the request and the (later) reply are each delivered by their own asynchronous RPC.
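The client side of a deferred-synchronous call can be sketched with a future: fire the call, keep computing locally, and block only when the reply is actually needed. `slow_call` is a hypothetical stand-in for an RPC to a remote server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Deferred-synchronous sketch: submit() returns immediately with a
# future; result() is the later rendezvous with the server's reply.

def slow_call(x):
    time.sleep(0.05)     # simulated server processing + network delay
    return x + 1

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_call, 41)   # request sent; client not blocked
    local_work = sum(range(10))           # client continues with other work
    answer = future.result()              # block only when the reply is needed
```

A purely asynchronous RPC would simply never call `result()`, accepting that the client gets no return value at all.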
RPC in Practice
Essence: let the developer concentrate only on the client- and server-specific code; let the RPC system (generators and libraries) do the rest.
Client-to-Server Binding (DCE)
Issues: (1) the client must locate the server machine, and (2) then locate the server itself.
Example: DCE uses a separate daemon on each server machine.
Read the Introduction to DCE in the corresponding section.
Remote Object Invocation
- Distributed objects
- Remote method invocation
- Parameter passing
02 – Communication/2.3 Remote Object Invocation
Remote Distributed Objects (1/2)
- Data and operations are encapsulated in an object
- Operations are implemented as methods, and are accessible through interfaces
- An object offers only its interface to clients
- An object server is responsible for a collection of objects
- The client stub (proxy) implements the interface
- The server skeleton handles (un)marshaling and object invocation
Remote Distributed Objects (2/2)
- Compile-time objects: language-level objects, from which proxies and skeletons are automatically generated
- Runtime objects: can be implemented in any language, but require an object adapter that makes the implementation appear as an object
- Transient objects: live only by virtue of a server; if the server exits, so will the object
- Persistent objects: live independently of a server; if a server exits, the object's state and code remain (passively) on disk
Client-to-Object Binding (1/2)
Object reference: having an object reference allows a client to bind to an object:
- The reference denotes the server, the object, and the communication protocol
- The client loads the associated stub code
- The stub is instantiated and initialized for the specific object
Two ways of binding:
- Implicit: invoke methods directly on the referenced object
- Explicit: the client must first explicitly bind to the object before invoking it
Client-to-Object Binding (2/2)
Some remarks:
- The reference may contain a URL pointing to an implementation file
- A (server, object) pair is enough to locate the target object
- We need only a standard protocol for loading and instantiating code
Observation: remote-object references allow us to pass references as parameters. This was difficult with ordinary RPCs.
85
Remote Method Invocation
Basics: (Assume client stub and server skeleton are in place) Client invokes method at stub Stub marshals request and sends it to server Server ensures referenced object is active: Create separate process to hold object Load the object into server process Request is unmarshaled by object’s skeleton, and referenced method is invoked If request contained an object reference, invocation is applied recursively (i.e., server acts as client) Result is marshaled and passed back to client Client stub unmarshals reply and passes result to client application 02 – Communication/2.3 Remote Object Invocation
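The marshal/send/dispatch steps above can be sketched in a few lines of Python. This is a minimal local illustration only: the "transport" is a direct function call standing in for the network, and the `Calculator` object and method names are invented for the example.

```python
import json

class Calculator:
    """Server-side object whose methods are invoked remotely."""
    def add(self, a, b):
        return a + b

class Skeleton:
    """Server skeleton: unmarshals a request and invokes the object."""
    def __init__(self, obj):
        self.obj = obj
    def handle(self, wire_request):
        req = json.loads(wire_request)                  # unmarshal request
        result = getattr(self.obj, req["method"])(*req["args"])
        return json.dumps({"result": result})           # marshal reply

class Stub:
    """Client stub: marshals the call and blocks for the reply."""
    def __init__(self, transport):
        self.transport = transport                      # stands in for the network
    def invoke(self, method, *args):
        wire = json.dumps({"method": method, "args": list(args)})  # marshal
        reply = self.transport(wire)                    # "send", wait for reply
        return json.loads(reply)["result"]              # unmarshal reply

skeleton = Skeleton(Calculator())
stub = Stub(skeleton.handle)
print(stub.invoke("add", 2, 3))   # -> 5
```

Replacing the direct call with a socket send/receive gives the real remote case; the marshaling logic is unchanged.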
86
RMI: Parameter Passing (1/2)
Object reference: Much easier than in the case of RPC: Server can simply bind to the referenced object, and invoke methods Unbind when the referenced object is no longer needed Object-by-value: A client may also pass a complete object as a parameter value: The object has to be marshaled: Marshal its state Marshal its methods, or give a reference to where an implementation can be found Server unmarshals the object. Note that we have now created a copy of the original object. Object-by-value passing tends to introduce nasty problems 02 – Communication/2.3 Remote Object Invocation
87
RMI: Parameter Passing (2/2)
Passing of an object by reference or by value Question: What’s an alternative implementation for a remote-object reference? 02 – Communication/2.3 Remote Object Invocation
88
Message-Oriented Communication
Synchronous versus asynchronous communications Message-Queuing System Message Brokers Example: IBM MQSeries 02 – 26 Communication/2.4 Message-Oriented Communication
89
Synchronous Communication
Some observations: Client/Server computing is generally based on a model of synchronous communication: Client and server have to be active at the time of communication Client issues request and blocks until it receives reply Server essentially waits only for incoming requests, and subsequently processes them Drawbacks of synchronous communication: Client cannot do any other work while waiting for reply Failures have to be dealt with immediately (the client is waiting) In many cases the model is simply not appropriate (mail, news) 02 – 27 Communication/2.4 Message-Oriented Communication
90
Asynchronous Communication: Messaging
Message-oriented middleware: Aims at high-level asynchronous communication: Processes send each other messages, which are queued Sender need not wait for immediate reply, but can do other things Middleware often ensures fault tolerance 02 – 28 Communication/2.4 Message-Oriented Communication
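A minimal Python sketch of the idea: an in-process queue and a receiver thread stand in for the messaging middleware, and the message names are illustrative. The point is that the sender enqueues and immediately moves on; it never waits for the receiver.

```python
import queue
import threading

mq = queue.Queue()      # middleware-managed message queue (in-process stand-in)
delivered = []

def receiver():
    while True:
        msg = mq.get()  # blocks until a message is queued
        if msg is None: # sentinel used here to shut the receiver down
            break
        delivered.append(msg)

t = threading.Thread(target=receiver)
t.start()

# The sender puts messages and continues with other work right away;
# the middleware delivers them whenever the receiver gets to them.
mq.put("order-1")
mq.put("order-2")
mq.put(None)
t.join()
print(delivered)        # ['order-1', 'order-2']
```

Real message-queuing systems add what this sketch lacks: persistence of queued messages, delivery across machines, and fault tolerance.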
91
Persistent vs. Transient Communication
Persistent communication: A message is stored at a communication server as long as it takes to deliver it at the receiver. Transient communication: A message is discarded by a communication server as soon as it cannot be delivered at the next server, or at the receiver. 02 – 29 Communication/2.4 Message-Oriented Communication
92
Messaging Combinations
02 – 30 Communication/2.4 Message-Oriented Communication
93
Message-Oriented Middleware
Essence: Asynchronous persistent communication through support of middleware-level queues. Queues correspond to buffers at communication servers. Canonical example: IBM MQSeries 02 – 31 Communication/2.4 Message-Oriented Communication
94
IBM MQSeries (1/3) Basic concepts:
Application-specific messages are put into, and removed from queues Queues always reside under the regime of a queue manager Processes can put messages only in local queues, or through an RPC mechanism Message transfer: Messages are transferred between queues Message transfer between queues at different processes, requires a channel At each endpoint of channel is a message channel agent Setting up channels using lower-level network communication facilities (e.g., TCP/IP) (Un)wrapping messages from/in transport-level packets Sending/receiving packets 02 – 32 Communication/2.4 Message-Oriented Communication
95
IBM MQSeries (2/3) Channels are inherently unidirectional
MQSeries provides mechanisms to automatically start MCAs when messages arrive, or to have a receiver set up a channel Any network of queue managers can be created; routes are set up manually (system administration) 02 – 33 Communication/2.4 Message-Oriented Communication
96
IBM MQSeries (3/3) Routing: By using logical names, in combination with name resolution to local queues, it is possible to put a message in a remote queue Question: What’s a major problem here? 02 – 34 Communication/2.4 Message-Oriented Communication
97
Message Broker Observation: Message queuing systems assume a common messaging protocol: all applications agree on message format (i.e., structure and data representation) Message broker: Centralized component that takes care of application heterogeneity in a message-queuing system: Transforms incoming messages to target format, possibly using intermediate representation May provide subject-based routing capabilities Acts very much like an application gateway 02 – 35 Communication/2.4 Message-Oriented Communication
98
Stream-Oriented Communication
Support for continuous media Streams in distributed systems Stream management 02 – 36 Communication/2.5 Stream-Oriented Communication
99
Continuous Media Observation: All communication facilities discussed so far are essentially based on a discrete, that is, time-independent exchange of information Continuous media: Characterized by the fact that values are time-dependent: Audio Video Animations Sensor data (temperature, pressure, etc.) Transmission modes: Different timing guarantees with respect to data transfer: Asynchronous: no restrictions with respect to when data is to be delivered Synchronous: define a maximum end-to-end delay for individual data packets Isochronous: define a maximum and minimum end-to-end delay (jitter is bounded) 02 – 37 Communication/2.5 Stream-Oriented Communication
100
Stream (1/2) Definition: A (continuous) data stream is a connection-oriented communication facility that supports isochronous data transmission Some common stream characteristics: Streams are unidirectional. There is generally a single source, and one or more sinks Often, either the sink and/or source is a wrapper around hardware (e.g., camera, CD device, TV monitor, dedicated storage) Stream types: Simple: consists of a single flow of data, e.g., audio or video Complex: multiple data flows, e.g., stereo audio or combination audio/video 02 – 38 Communication/2.5 Stream-Oriented Communication
101
Stream (2/2) Issue: Streams can be set up between two processes at different machines, or directly between two different devices. Combinations are possible as well. 02 – 39 Communication/2.5 Stream-Oriented Communication
102
Streams and QoS Essence: Streams are all about timely delivery of data. How do you specify this Quality of Service (QoS)? Make distinction between specification and implementation of QoS. Flow specification: Use a token-bucket model and express QoS in that model. 02 – 40 Communication/2.5 Stream-Oriented Communication
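The token-bucket model mentioned above can be sketched as follows. Tokens accumulate at a fixed rate up to the bucket capacity, and sending a unit of data consumes a token; the rate and capacity values below are arbitrary illustrations, not from any real flow spec.

```python
class TokenBucket:
    """Token-bucket flow spec: tokens arrive at `rate` per second, the
    bucket holds at most `capacity` tokens, and sending one unit of
    data consumes one token."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now, units=1):
        # Refill according to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False

tb = TokenBucket(rate=2.0, capacity=4)      # 2 tokens/s, burst of 4
print([tb.allow(0.0) for _ in range(5)])    # burst drains the bucket
print(tb.allow(1.0))                        # 1 s later, 2 tokens refilled
```

The (rate, capacity) pair is exactly the kind of QoS specification that must then be mapped onto resource reservations in the underlying network.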
103
Implementing QoS Problem: QoS specifications translate to resource reservations in the underlying communication system. There is no standard way of (1) specifying QoS, (2) describing resources, (3) mapping specs to reservations. Approach: Use the Resource reSerVation Protocol (RSVP) as a first attempt. RSVP is a transport-level protocol. 02 – 41 Communication/2.5 Stream-Oriented Communication
104
Stream Synchronization
Problem: Given a complex stream, how do you keep the different substreams in synch? Example: Think of playing out two channels, that together form stereo sound. Difference should be less than 20–30 µsec! Alternative: multiplex all substreams into a single stream, and demultiplex at the receiver. Synchronization is handled at multiplexing/demultiplexing point (MPEG). 02 – 42 Communication/2.5 Stream-Oriented Communication
105
Principles and Paradigms
Distributed Systems Principles and Paradigms Chapter 05 Naming
106
04 – 1 Naming/4.1 Naming Entities
Names, identifiers, and addresses Name resolution Name space implementation 04 – Naming/4.1 Naming Entities
107
04 – 2 Naming/4.1 Naming Entities
Essence: Names are used to denote entities in a distributed system. To operate on an entity, we need to access it at an access point. Access points are entities that are named by means of an address. Note: A location-independent name for an entity E, is independent from the addresses of the access points offered by E. 04 – Naming/4.1 Naming Entities
108
04 – 3 Naming/4.1 Naming Entities
Identifiers Pure name: A name that has no meaning at all; it is just a random string. Pure names can be used for comparison only. Identifier: A name having the following properties: - P1 Each identifier refers to at most one entity - P2 Each entity is referred to by at most one identifier - P3 An identifier always refers to the same entity (prohibits reusing an identifier) Observation: An identifier need not necessarily be a pure name, i.e., it may have content. Question: Can the content of an identifier ever change? 04 – Naming/4.1 Naming Entities
109
04 – 4 Naming/4.1 Naming Entities
Name Space (1/2) Essence: a graph in which a leaf node represents a (named) entity. A directory node is an entity that refers to other nodes. Note: A directory node contains a (directory) table of (edge label, node identifier) pairs. 04 – Naming/4.1 Naming Entities
110
04 – 5 Naming/4.1 Naming Entities
Name Space (2/2) Observation: We can easily store all kinds of attributes in a node, describing aspects of the entity the node represents: Type of the entity An identifier for that entity Address of the entity’s location Nicknames ... Observation: Directory nodes can also have attributes, besides just storing a directory table with (edge label, node identifier) pairs. 04 – Naming/4.1 Naming Entities
111
04 – 6 Naming/4.1 Naming Entities
Name Resolution Name Resolution - the process of looking up a name Problem: To resolve a name we need a directory (initial) node. How do we actually find that initial node? Closure mechanism: The mechanism to select the implicit context from which to start name resolution: Question: Why are closure mechanisms always implicit? Observation: A closure mechanism may also determine how name resolution should proceed 04 – Naming/4.1 Naming Entities
112
04 – 7 Naming/4.1 Naming Entities
Name Linking (1/2) Hard link: What we have described so far is a path name: a name that is resolved by following a specific path in a naming graph from one node to another. Soft link: Allows a node O to contain a name of another node: First resolve O’s name (leading to O) Read the content of O, yielding a new name Name resolution continues with that new name Observations: The name resolution process determines that we read the content of a node, in particular, the name of the other node that we need to go to. One way or the other, we know where and how to start name resolution given that name 04 – Naming/4.1 Naming Entities
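A toy Python sketch of resolving a path name through a naming graph whose last component is a soft link. The node names follow the /home/steen/keys example used on the next slide; the dictionary layout is invented for illustration, and the sketch handles a link only as the final path component.

```python
# Naming graph: directory tables map edge labels to node ids;
# a node marked "link" stores another (absolute) path name.
nodes = {
    "root": {"table": {"home": "n1", "keys": "n5"}},
    "n1":   {"table": {"steen": "n2"}},
    "n2":   {"table": {"keys": "n3"}},
    "n3":   {"link": "/keys"},        # soft link: content is another name
    "n5":   {"leaf": "key-data"},
}

def resolve(path):
    node = nodes["root"]              # closure mechanism: '/' means "start at root"
    for label in path.strip("/").split("/"):
        node = nodes[node["table"][label]]
        if "link" in node:            # read the node's content...
            return resolve(node["link"])   # ...and restart with the new name
    return node

print(resolve("/home/steen/keys"))    # -> {'leaf': 'key-data'}
```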
113
04 – 8 Naming/4.1 Naming Entities
Name Linking (2/2) Observation: the path name /home/steen/keys, which refers to a node containing the absolute path name /keys, is a symbolic link to node n5. 04 – Naming/4.1 Naming Entities
114
04 – 9 Naming/4.1 Naming Entities
Merging Name Spaces (1/3) Problem: We have different name spaces that we wish to access from any given name space. Solution 1: Introduce a naming scheme by which pathnames of different name spaces are simply concatenated (URLs). 04 – Naming/4.1 Naming Entities
115
04 – 10 Naming/4.1 Naming Entities
Merging Name Spaces (2/3) Solution 2: Introduce nodes that contain the name of a node in a “foreign” name space, along with the information how to select the initial context in that foreign name space. Mount point: (Directory) node in naming graph that refers to other naming graph Mounting point: (Directory) node in other naming graph that is referred to. 04 – Naming/4.1 Naming Entities
116
04 – 11 Naming/4.1 Naming Entities
Merging Name Spaces (3/3) Solution 3: Use only full pathnames, in which the starting context is explicitly identified, and merge by adding a new root node (DCE’s Global Name Space). Note: In principle, you always have to start from the new root 04 – Naming/4.1 Naming Entities
117
04 – 12 Naming/4.1 Naming Entities
Name Space Implementation (1/2) Basic issue: Distribute the name resolution process as well as name space management across multiple machines, by distributing nodes of the naming graph. Consider a hierarchical naming graph and distinguish three levels: Global layer: Consists of the high-level directory nodes. Main aspect is that these directory nodes have to be jointly managed by different administrations Administrational layer: Contains mid-level directory nodes that can be grouped in such a way that each group can be assigned to a separate administration. Managerial layer: Consists of low-level directory nodes within a single administration. Main issue is effectively mapping directory nodes to local name servers. 04 – Naming/4.1 Naming Entities
118
04 – 13 Naming/4.1 Naming Entities
Name Space Implementation (2/2) 04 – Naming/4.1 Naming Entities
119
04 – 14 Naming/4.1 Naming Entities
Iterative Name Resolution resolve(dir, [name1,…, nameK]) is sent to Server0 responsible for dir Server0 resolves resolve(dir, name1) → dir1, returning the identification (address) of Server1, which stores dir1. Client sends resolve(dir1,[name2,…, nameK]) to Server1 etc. 04 – Naming/4.1 Naming Entities
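The iterative scheme can be sketched in a few lines of Python. The server names, labels, and the referral encoding below are all hypothetical; the essential point is that the *client* contacts each next server itself.

```python
# Each name server resolves only the first label of the remaining name
# and returns either the final result or a referral to the next server.
servers = {
    "s0": {"nl": ("referral", "s1")},
    "s1": {"vu": ("referral", "s2")},
    "s2": {"cs": ("result", "address-of-cs")},
}

def iterative_resolve(server, labels):
    while labels:
        kind, value = servers[server][labels[0]]
        if kind == "result":
            return value
        server = value          # client itself contacts the next server
        labels = labels[1:]

print(iterative_resolve("s0", ["nl", "vu", "cs"]))   # address-of-cs
```

In the recursive variant of the next slide, the loop body would instead run inside each server, which forwards the remainder of the name on the client's behalf.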
120
04 – 15 Naming/4.1 Naming Entities
Recursive Name Resolution resolve(dir,[name1,…,nameK]) is sent to Server0 responsible for dir Server0 resolves resolve(dir, name1) → dir1, and sends resolve(dir,[name2,…,nameK]) to Server1, which stores dir1. Server0 waits for the result from Server1, and returns it to the client 04 – Naming/4.1 Naming Entities
121
Caching in Recursive Name Resolution
Also see Figure 4-11 for the comparison between recursive and iterative name resolution with respect to communication costs. 04 – Naming/4.1 Naming Entities
122
04 – 17A Naming/4.1 Naming Entities
Example 1: Internet Domain Name System (DNS) used for looking up IP addresses of hosts and mail servers in Internet comparable to a telephone book (white pages) for looking up phone numbers DNS name space is hierarchically organized as a rooted tree The contents of a node is formed by a collection of resource records Multiple (primary, secondary, etc.) DNS servers are usually deployed for an organization to increase availability nslookup is a utility for querying DNS service 04 – 17A Naming/4.1 Naming Entities
123
04 – 17B Naming/4.1 Naming Entities
DNS Resource Records

Type    Associated entity   Description
SOA     Zone                Holds information on the represented zone
A       Host                Contains an IP address of the host this node represents
MX      Domain              Refers to a mail server to handle mail addressed to this node
SRV     Domain              Refers to a server handling a specific service
NS      Zone                Refers to a name server that implements the represented zone
CNAME   Node                Symbolic link with the primary name of the represented node
PTR     Host                Contains the canonical name of a host
HINFO   Host                Holds information on the host this node represents
TXT     Any kind            Contains any entity-specific information considered useful

Figure: The most important types of resource records forming the contents of nodes in the DNS name space. 04 – 17B Naming/4.1 Naming Entities
124
Sample DNS Records Figure 4-13.
An excerpt from the DNS database for the zone cs.vu.nl. 04 – 17C
125
04 – 18A Naming/4.1 Naming Entities
Example 2: X.500 Directory Service (1) ITU standard for directory services provides directory service based on a description of properties instead of a full name (e.g., yellow pages in telephone book) an X.500 directory entry is comparable to a resource record in DNS Each record is made up of a collection of (attribute, value) pairs The collection of all entries is called Directory Information Base (DIB) Each entry in a DIB can be looked up using a sequence of naming attributes, which forms a globally unique name called Distinguished Name (DN). Each naming attribute is called Relative Distinguished Name (RDN) - e.g., /C=KR/O=POSTECH/OU=Dept. of CSE is analogous to the DNS name cse.postech.ac.kr X.500 also forms a hierarchy of the collection of entries called Directory Information Tree (DIT) 04 – 18A Naming/4.1 Naming Entities
126
X.500 Directory Entry Example
Attribute           Abbr.  Value
Country             C      NL
Locality            L      Amsterdam
Organization        O      Vrije Universiteit
OrganizationalUnit  OU     Math. & Comp. Sc.
CommonName          CN     Main server
Mail_Servers        --
FTP_Server          --
WWW_Server          --

A simple example of an X.500 directory entry using X.500 naming conventions (server address values not shown). 04 – 18B Naming/4.1 Naming Entities
127
A Part of Directory Information Tree
04 – 18C Naming/4.1 Naming Entities
128
Two directory entries having Host_Name as RDN
Attribute           Value (star)        Value (zephyr)
Country             NL                  NL
Locality            Amsterdam           Amsterdam
Organization        Vrije Universiteit  Vrije Universiteit
OrganizationalUnit  Math. & Comp. Sc.   Math. & Comp. Sc.
CommonName          Main server         Main server
Host_Name           star                zephyr
Host_Address

04 – 18D Naming/4.1 Naming Entities
129
04 – 18E Naming/4.1 Naming Entities
Example 2: X.500 Directory Service (2) DIT is usually partitioned and distributed across multiple servers known as Directory Service Agents (DSA) Clients are known as Directory User Agents (DUA) Directory Access Protocol (DAP) is used between DUA and DSA to insert/lookup/modify/delete entries in DSA traditionally implemented using OSI protocols Lightweight Directory Access Protocol (LDAP) implemented on top of TCP parameters of operations are passed as strings becoming a de facto standard for Internet-based directory services for various applications 04 – 18E Naming/4.1 Naming Entities
130
Locating Mobile Entities
Naming versus locating objects Simple solutions Home-based approaches Hierarchical approaches 04 – Naming/4.2 Locating Mobile Entities
131
04 – 20 Naming/4.2 Locating Mobile Entities
Naming & Locating Objects (1/2) Location service: Solely aimed at providing the addresses of the current locations of entities. Assumption: Entities are mobile, so that their current address may change frequently. Naming service: Aimed at providing the content of nodes in a name space, given a (compound) name. Content consists of different (attribute, value) pairs. Assumption: Node contents at the global and administrational levels are relatively stable, for scalability reasons. Observation: If a traditional naming service is used to locate entities, we also have to assume that node contents at the managerial level are stable, as we can use only names as identifiers (think of Web pages). 04 – Naming/4.2 Locating Mobile Entities
132
04 – 21 Naming/4.2 Locating Mobile Entities
Naming & Locating Objects (2/2) Problem: It is not realistic to assume stable node contents down to the local naming level Solution: Decouple naming from locating entities Name: Any name in a traditional naming space Entity ID: A true identifier Address: Provides all information necessary to contact an entity Observation: An entity’s name is now completely independent from its location. Question: What may be a typical address? 04 – Naming/4.2 Locating Mobile Entities
133
Simple Solutions for Locating Entities
Broadcasting: Simply broadcast the ID, requesting the entity to return its current address. Can never scale beyond local-area networks (think of ARP/RARP) Requires all processes to listen to incoming location requests Forwarding pointers: Each time an entity moves, it leaves behind a pointer telling where it has gone to. Dereferencing can be made entirely transparent to clients by simply following the chain of pointers Update a client’s reference as soon as present location has been found Geographical scalability problems: Long chains are not fault tolerant Increased network latency at dereferencing Essential to have separate chain reduction mechanisms 04 – Naming/4.2 Locating Mobile Entities
134
04 – 23 Naming/4.2 Locating Mobile Entities
Home-Based Approaches (1/2) Single-tiered scheme: Let a home keep track of where the entity is: An entity’s home address is registered at a naming service The home registers the foreign address of the entity A client always contacts the home first, and then continues with the foreign location 04 – Naming/4.2 Locating Mobile Entities
135
04 – 24 Naming/4.2 Locating Mobile Entities
Home-Based Approaches (2/2) Two-tiered scheme: Keep track of visiting entities: Check local visitor register first Fall back to home location if local lookup fails Problems with home-based approaches: The home address has to be supported as long as the entity lives. The home address is fixed, which means an unnecessary burden when the entity permanently moves to another location Poor geographical scalability (the entity may be next to the client) Question: How can we solve the “permanent move” problem? 04 – Naming/4.2 Locating Mobile Entities
136
Hierarchical Location Services (HLS)
Basic idea: Build a large-scale search tree for which the underlying network is divided into hierarchical domains. Each domain is represented by a separate directory node. 04 – Naming/4.2 Locating Mobile Entities
137
04 – 26 Naming/4.2 Locating Mobile Entities
HLS: Tree Organization The address of an entity is stored in a leaf node, or in an intermediate node Intermediate nodes contain a pointer to a child if and only if the subtree rooted at the child stores an address of the entity The root knows about all entities 04 – Naming/4.2 Locating Mobile Entities
138
04 – 27 Naming/4.2 Locating Mobile Entities
HLS: Lookup Operation Basic principles: Start lookup at local leaf node If node knows about the entity, follow downward pointer, otherwise go one level up Upward lookup always stops at root 04 – Naming/4.2 Locating Mobile Entities
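The lookup rule can be sketched as follows, assuming a hypothetical three-level tree with one entity E whose address is stored at a leaf; the node names and record encoding are invented for illustration.

```python
# Hierarchical Location Service sketch: each directory node knows its
# parent, and stores per entity either a pointer to the child subtree
# that holds the entity, or the address itself.
parent = {"leafA": "mid1", "leafB": "mid1", "leafC": "mid2",
          "mid1": "root", "mid2": "root"}
records = {
    "root":  {"E": "mid2"},               # root knows about all entities
    "mid2":  {"E": "leafC"},
    "leafC": {"E": ("addr", "10.0.0.7")}, # address stored at the leaf
}

def lookup(start, entity):
    node = start
    while entity not in records.get(node, {}):
        node = parent[node]               # go up; stops at the root at latest
    rec = records[node][entity]
    while not isinstance(rec, tuple):     # follow downward pointers
        node = rec
        rec = records[node][entity]
    return rec[1]

print(lookup("leafA", "E"))   # 10.0.0.7
```

Starting from leafA, the request climbs to the root (neither leafA nor mid1 knows E), then descends root -> mid2 -> leafC, matching the "up until known, then down" principle above.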
139
04 – 28 Naming/4.2 Locating Mobile Entities
HLS: Insert Operation 04 – Naming/4.2 Locating Mobile Entities
140
04 – 29 Naming/4.2 Locating Mobile Entities
HLS: Record Placement Observation: If an entity E moves regularly between leaf domains D1 and D2, it may be more efficient to store E’s contact record at the least common ancestor LCA of dir(D1) and dir(D2) Lookup operations from either D1 or D2 are on average cheaper Update operations (i.e., changing the current address) can be done directly at LCA Note: assuming that E generally stays in dom(LCA), it does make sense to cache a pointer to LCA 04 – Naming/4.2 Locating Mobile Entities
141
04 – 30 Naming/4.2 Locating Mobile Entities
HLS: Scalability Issues Size scalability: Again, we have a problem of overloading higher-level nodes: Only solution is to partition a node into a number of subnodes and evenly assign entities to subnodes Naive partitioning may introduce a node management problem, as a subnode may have to know how its parent and children are partitioned. Geographical scalability: We have to ensure that lookup operations generally proceed monotonically in the direction of where we’ll find an address: If entity E generally resides in California, we should not let a root subnode located in France store E’s contact record. Unfortunately, subnode placement is not that easy, and only a few tentative solutions are known 04 – Naming/4.2 Locating Mobile Entities
142
04 – 31 Naming/4.3 Reclaiming References
Reference counting Reference listing Scalability issues 04 – Naming/4.3 Reclaiming References
143
04 – 32 Naming/4.3 Reclaiming References
Unreferenced Objects: Problem Assumption: Objects may exist only if it is known that they can be contacted: Each object should be named Each object can be located A reference can be resolved to client–object communication Problem: Removing unreferenced objects: How do we know when an object is no longer referenced (think of cyclic references)? Who is responsible for (deciding on) removing an object? 04 – Naming/4.3 Reclaiming References
144
04 – 33 Naming/4.3 Reclaiming References
Reference Counting (1/2) Principle: Each time a client creates (removes) a reference to an object O, a reference counter local to O is incremented (decremented) Problem 1: Dealing with lost (and duplicated) messages: An increment is lost, so that the object may be prematurely removed A decrement is lost, so that the object is never removed An ACK is lost, so that the increment/decrement is resent Solution: Keep track of duplicate requests. 04 – Naming/4.3 Reclaiming References
145
04 – 34 Naming/4.3 Reclaiming References
Reference Counting (2/2) Problem 2: Dealing with duplicated references – client P1 tells client P2 about object O: Client P2 creates a reference to O, but dereferencing (communicating with O) may take a long time If the last reference known to O is removed before P2 talks to O, the object is removed prematurely Solution 1: Ensure that P2 talks to O on time: Let P1 tell O it will pass a reference to P2 Let O contact P2 immediately A reference may never be removed before O has acked that reference to the holder 04 – Naming/4.3 Reclaiming References
146
04 – 35 Naming/4.3 Reclaiming References
Weighted Reference Counting Solution 2: Avoid increment and decrement messages: Let O allow a maximum total weight M over all references When a client P1 creates a reference, grant it M/2 credit When P1 tells P2 about O, it passes half of its own credit to P2 The current credit is passed back to O upon reference deletion 04 – Naming/4.3 Reclaiming References
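A sketch of the credit bookkeeping, with M = 128 chosen arbitrarily. The class and the integer-halving policy are illustrative: the key property shown is that passing a reference needs no message to the object, and the object is unreferenced exactly when all weight has come back.

```python
class WeightedRef:
    """Weighted reference counting sketch: the object starts with total
    weight M; each reference carries part of it."""
    def __init__(self, total=128):
        self.total = total            # weight still held by the object
        self.out = 0                  # weight held by outstanding references

    def create_ref(self):
        w = self.total // 2           # grant half the object's weight
        self.total -= w
        self.out += w
        return w

    @staticmethod
    def pass_ref(weight):
        half = weight // 2            # split credit -- no message to O needed
        return weight - half, half    # (holder keeps, receiver gets)

    def delete_ref(self, weight):
        self.out -= weight
        self.total += weight
        return self.out == 0          # True once the object is unreferenced

obj = WeightedRef(128)
w1 = obj.create_ref()                 # P1 gets 64
w1, w2 = WeightedRef.pass_ref(w1)     # P1 keeps 32, P2 gets 32
print(obj.delete_ref(w1), obj.delete_ref(w2))   # False True
```

The price of avoiding messages is visible in `pass_ref`: credit only halves, so after enough splits a reference runs out of weight to pass on.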
147
04 – 36 Naming/4.3 Reclaiming References
Reference Listing Observation: We can avoid many problems if we can tolerate message loss and duplication Reference listing: Let an object keep a list of its clients: Increment operation is replaced by an (idempotent) insert Decrement operation is replaced by an (idempotent) remove There are still some problems to be solved: Passing references: client B has to be listed at O before last reference at O is removed (or keep a chain of references) Client crashes: we need to remove outdated registrations (e.g., by combining reference listing with leases) 04 – Naming/4.3 Reclaiming References
148
04 – 37 Naming/4.3 Reclaiming References
Leases Observation: If we cannot be exact in the presence of communication failures, we will have to tolerate some mistakes Essential issue: We need to avoid that objects are never reclaimed Solution: Hand out a lease on each new reference: The object promises not to decrement the reference count for a specified time Leases need to be refreshed (by object or client) Observations: Refreshing may fail in the face of message loss Refreshing can tolerate message duplication Does not solve problems related to cyclic references 04 – Naming/4.3 Reclaiming References
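A minimal sketch of lease bookkeeping on the object side. The duration and client names are arbitrary, and time is passed in explicitly rather than read from a real clock; note that registering and refreshing are the same idempotent operation, which is why duplicated messages are harmless.

```python
class Lease:
    """Lease sketch: each reference is kept alive only for `duration`
    time units unless refreshed."""
    def __init__(self, duration):
        self.duration = duration
        self.expires = {}             # client -> expiry time

    def register(self, client, now):
        self.expires[client] = now + self.duration

    refresh = register                # refreshing is the same idempotent op

    def live_clients(self, now):
        # Expired registrations (e.g., crashed clients) are dropped.
        self.expires = {c: t for c, t in self.expires.items() if t > now}
        return set(self.expires)

lease = Lease(duration=10)
lease.register("P1", now=0)
lease.register("P2", now=0)
lease.refresh("P1", now=8)            # P2 never refreshes
print(lease.live_clients(now=15))     # {'P1'}
```

A lost refresh merely shortens P1's lease; a duplicated one re-writes the same expiry time. Exactly as the slide observes, though, leases do nothing about cyclic references.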
149
Distributed Systems Principles and Paradigms
Chapter 06 Synchronization 149
150
Communication & Synchronization
Why do processes communicate in DS? To exchange messages To synchronize processes Why do processes synchronize in DS? To coordinate access of shared resources To order events 150
151
Time, Clocks and Clock Synchronization
Why is time important in DS? E.g. the UNIX make utility (see Fig. 5-1) Clocks (Timers) Physical clocks Logical clocks (introduced by Leslie Lamport) Vector clocks (introduced by Colin Fidge) Clock Synchronization How do we synchronize clocks with real-world time? How do we synchronize clocks with each other? 05 – Distributed Algorithms/5.1 Clock Synchronization 151
152
Physical Clocks (1/3) Problem: Clock skew – clocks gradually get out of sync and give different values Solution: Universal Coordinated Time (UTC): Formerly called GMT (Greenwich Mean Time) Based on the number of transitions per second of the cesium 133 atom (very accurate). At present, the real time is taken as the average of some 50 cesium clocks around the world – International Atomic Time A leap second is introduced from time to time to compensate for the fact that days are getting longer. UTC is broadcast through short-wave radio (with an accuracy of +/- 1 msec) and satellite (Geostationary Environment Operational Satellite, GEOS, with an accuracy of +/- 0.5 msec). Question: Does this solve all our problems? Don’t we now have some global timing mechanism? 05 – Distributed Algorithms/5.1 Clock Synchronization 152
153
Physical Clocks (2/3) Problem: Suppose we have a distributed system with a UTC receiver somewhere in it; we still have to distribute its time to each machine. Basic principle: Every machine has a timer that generates an interrupt H (typically 60) times per second. There is a clock in machine p that ticks on each timer interrupt. Denote the value of that clock by Cp(t), where t is UTC time. Ideally, we have that for each machine p, Cp(t) = t, or, in other words, dC/dt = 1 Theoretically, a timer with H = 60 should generate 216,000 ticks per hour In practice, the relative error of modern timer chips is 10^-5 (or between 215,998 and 216,002 ticks per hour) 05 – Distributed Algorithms/5.1 Clock Synchronization 153
154
Physical Clocks (3/3) Let ρ be the maximum drift rate of a clock (|dC/dt − 1| ≤ ρ). Goal: Never let two clocks in the system differ by more than δ time units => synchronize at least every δ/2ρ seconds. 05 – Distributed Algorithms/5.1 Clock Synchronization 154
155
Clock Synchronization Principles
Principle I: Every machine asks a time server for the accurate time at least once every δ/2ρ seconds (see Fig. 5-5). But you need an accurate measure of round-trip delay, including interrupt handling and processing of incoming messages. Principle II: Let the time server scan all machines periodically, calculate an average, and inform each machine how it should adjust its time relative to its present time. Ok, you’ll probably get every machine in sync. Note: you don’t even need to propagate UTC time (why not?) 05 – Distributed Algorithms/5.1 Clock Synchronization 155
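Principle I's round-trip compensation (Cristian's method) can be sketched as follows. The clock values are simulated: the example assumes the local clock runs 5 seconds fast and each one-way delay is 0.1 s, so the estimate lands on the server's true time.

```python
def adjusted_time(request_fn, local_clock):
    """Ask a time server for the time and compensate for round-trip
    delay: assume the reply arrived RTT/2 after the server read its
    clock (Cristian's method)."""
    t0 = local_clock()                # local time when request is sent
    server_time = request_fn()        # round trip to the time server
    t1 = local_clock()                # local time when reply arrives
    return server_time + (t1 - t0) / 2

# Simulated run: local clock 5 s fast, one-way delay 0.1 s.
local = iter([100.0, 100.2])          # t0, t1 on the (fast) local clock
est = adjusted_time(lambda: 95.1, lambda: next(local))
print(round(est, 3))                  # 95.2 -- true UTC at arrival time
```

The estimate is only as good as the assumption that the two legs of the round trip are symmetric, which is exactly why the slide stresses measuring round-trip delay accurately.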
156
Clock Synchronization Algorithms
The Berkeley Algorithm The time server periodically polls every machine for its time The received times are averaged, and each machine is notified of the amount of time it should adjust Centralized algorithm, see Figure 5-6 Decentralized Algorithm Every machine broadcasts its time at the start of each fixed-length resynchronization interval Averages the values from all other machines (or averages without the highest and lowest values) Network Time Protocol (NTP) the most popular one, used by machines on the Internet uses an algorithm that is a combination of centralized/distributed 05 – Distributed Algorithms/5.2 Logical Clocks 156
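One polling round of the Berkeley algorithm can be sketched as follows; the three clock values (in minutes past some epoch) are hypothetical, not taken from Figure 5-6.

```python
def berkeley_round(server_time, client_times):
    """One round of the Berkeley algorithm: the time daemon polls every
    machine, averages all clock values (its own included), and returns
    the adjustment each machine should apply to its clock."""
    clocks = [server_time] + client_times
    avg = sum(clocks) / len(clocks)
    return [avg - t for t in clocks]     # per-machine adjustment

# Daemon reads 180, clients report 190 and 170; average is 180, so the
# daemon stays put, one client slows down, the other speeds up.
print(berkeley_round(180, [190, 170]))   # [0.0, -10.0, 10.0]
```

Note that the daemon sends each machine a *relative* adjustment rather than the absolute time, so transmission delay on the reply matters less; and since only the average is needed, no UTC source is required.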
157
Network Time Protocol (NTP)
a protocol for synchronizing the clocks of computers over packet- switched, variable-latency data networks (i.e., Internet) NTP uses UDP port 123 as its transport layer. It is designed particularly to resist the effects of variable latency NTPv4 can usually maintain time to within 10 milliseconds (1/100 s) over the public Internet, and can achieve accuracies of 200 microseconds (1/5000 s) or better in local area networks under ideal conditions visit the following URL to understand NTP in more detail 157
158
The Happened-Before Relationship
Problem: We first need to introduce a notion of ordering before we can order anything. The happened-before relation on the set of events in a distributed system is the smallest relation satisfying: If a and b are two events in the same process, and a comes before b, then a → b (a happened before b) If a is the sending of a message, and b is the receipt of that message, then a → b If a → b and b → c, then a → c (transitive relation) Note: if two events, x and y, happen in different processes that do not exchange messages, then they are said to be concurrent. Note: this introduces a partial ordering of events in a system with concurrently operating processes. 05 – Distributed Algorithms/5.2 Logical Clocks 158
159
Logical Clocks (1/2) Problem: How do we maintain a global view on the system’s behavior that is consistent with the happened-before relation? Solution: attach a timestamp C(e) to each event e, satisfying the following properties: P1: If a and b are two events in the same process, and a → b, then we demand that C(a) < C(b) P2: If a corresponds to sending a message m, and b to the receipt of that message, then also C(a) < C(b) Problem: How do we attach a timestamp to an event when there’s no global clock? maintain a consistent set of logical clocks, one per process. 05 – Distributed Algorithms/5.2 Logical Clocks 159
160
Logical Clocks (2/2) Each process Pi maintains a local counter Ci and adjusts this counter according to the following rules: (1) For any two successive events that take place within Pi, Ci is incremented by 1. (2) Each time a message m is sent by process Pi, the message receives a timestamp Tm = Ci. (3) Whenever a message m is received by a process Pj, Pj adjusts its local counter Cj to max(Cj, Tm) + 1. Property P1 is satisfied by (1); Property P2 by (2) and (3). This is called Lamport’s algorithm 05 – Distributed Algorithms/5.2 Logical Clocks 160
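The three rules fit in a few lines of Python. A minimal sketch (class and method names are my own; the standard receive rule Cj := max(Cj, Tm) + 1 is assumed):

```python
class LamportClock:
    """Minimal sketch of a Lamport logical clock for one process."""

    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1                 # rule (1): tick between successive events
        return self.c

    def send(self):
        self.c += 1                 # sending is itself an event
        return self.c               # rule (2): message timestamp Tm = Ci

    def receive(self, tm):
        self.c = max(self.c, tm) + 1    # rule (3): jump past the timestamp
        return self.c

p1, p2 = LamportClock(), LamportClock()
tm = p1.send()          # p1's clock becomes 1; message carries Tm = 1
p2.receive(tm)          # p2's clock becomes max(0, 1) + 1 = 2
```

Note that the receive rule is what makes P2 hold: the receipt is always stamped strictly later than the send, even if the receiver's clock was behind.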
161
Logical Clocks – Example
Fig 5-7. (a) Three processes, each with its own clock. The clocks run at different rates. (b) Lamport’s algorithm corrects the clocks 05 – Distributed Algorithms/5.2 Logical Clocks 161
162
a b c d e f g h i j k l P1 P2 P3 Assign the Lamport’s logical clock values for all the events in the above timing diagram. Assume that each process’s local clock is set to 0 initially. 162
163
(Timing diagram: events a–l across P1, P2, P3, with Lamport clock values 1–6 assigned.) From the above timing diagram, what can you say about the following events? between a and b: a → b between b and f: b → f between e and k: concurrent between c and h: concurrent between k and h: k → h 163
164
Total Ordering with Logical Clocks
Problem: it can still occur that two events happen at the same time. Avoid this by attaching a process number to an event: Pi timestamps event e with Ci(e).i Then: Ci(a).i happened before Cj(b).j if and only if: 1: Ci(a) < Cj(b); or 2: Ci(a) = Cj(b) and i < j 05 – Distributed Algorithms/5.2 Logical Clocks 164
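The tie-breaking rule above can be written as a two-line comparison on (clock value, process id) pairs; a sketch, with the pair layout my own:

```python
def before(ts_a, ts_b):
    """Total order on (Ci(a), i) pairs: compare clocks first,
    break ties with the process number."""
    c_a, i = ts_a
    c_b, j = ts_b
    return c_a < c_b or (c_a == c_b and i < j)

assert before((3, 1), (3, 2))       # equal clocks: lower process id first
assert not before((4, 1), (3, 2))
```

In Python this is exactly tuple comparison, `(c_a, i) < (c_b, j)`; the explicit form just mirrors conditions 1 and 2 on the slide.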
165
Example: Totally-Ordered Multicast (1/2)
Problem: We sometimes need to guarantee that concurrent updates on a replicated database are seen in the same order everywhere: Process P1 adds $100 to an account (initial value: $1000) Process P2 increments account by 1% There are two replicas Outcome: in the absence of proper synchronization, replica #1 will end up with $1111, while replica #2 ends up with $1110. 05 – Distributed Algorithms/5.2 Logical Clocks 165
166
Example: Totally-Ordered Multicast (2/2)
Process Pi sends timestamped message msgi to all others. The message itself is put in a local queue queuei. Any incoming message at Pj is queued in queuej, according to its timestamp. Pj passes a message msgi to its application if: (1) msgi is at the head of queuej (2) for each process Pk, there is a message msgk in queuej with a larger timestamp. Note: We are assuming that communication is reliable and FIFO ordered. 05 – Distributed Algorithms/5.2 Logical Clocks 166
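The delivery test above can be sketched as a predicate over a process's queue. This is a sketch only (process names and the integer timestamps are illustrative; in practice the timestamps would be the totally-ordered (clock, id) pairs):

```python
def can_deliver(queue, processes):
    """queue: list of (timestamp, sender) pairs, sorted by timestamp.
    The head message may be passed to the application iff, for every
    other process, a message with a larger timestamp is already queued."""
    if not queue:
        return False
    head_ts, head_sender = queue[0]
    others = set(processes) - {head_sender}
    for p in others:
        if not any(ts > head_ts and s == p for ts, s in queue[1:]):
            return False                # condition (2) fails for p
    return True                          # conditions (1) and (2) hold

# Head (1, 'P1') is deliverable: P2 and P3 have queued later messages.
print(can_deliver([(1, 'P1'), (2, 'P2'), (3, 'P3')], ['P1', 'P2', 'P3']))
```

Seeing a later-timestamped message from every other process is what guarantees (given reliable FIFO channels) that no message with a smaller timestamp can still arrive.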
167
Fidge’s Logical Clocks
with Lamport’s clocks, one cannot directly compare the timestamps of two events to determine their precedence relationship - if C(a) < C(b) then a b - if C(a) < C(b), it could be a b or a b - e.g., events e and b in the previous example Figure * C(e) = 1 and C(b) = 2 * thus C(e) < C(b) but e b the main problem is that a simple integer clock can not order both events within a process and events in different processes Collin Fidge developed an algorithm that overcomes this problem Fidge’s clock is represented as a vector [c1 , c 2 , …, cn] with an integer clock value for each process (ci contains the clock value of process i) / / / / 167
168
Fidge’s Algorithm The Fidge’s logical clock is maintained as follows:
1: Initially all clock values are set to the smallest value. 2: The local clock value is incremented at least once before each primitive event in a process. 3: The current value of the entire logical clock vector is delivered to the receiver for every outgoing message. 4: Values in the timestamp vectors are never decremented. 5: Upon receiving a message, the receiver sets the value of each entry in its local timestamp vector to the maximum of the two corresponding values in the local vector and in the remote vector received. The element corresponding to the sender is a special case; it is set to one greater than the value received, but only if the local value is not greater than that received. 168
169
ep → fq iff Tep[p] < Tfq[p]
Get r_vector from the received msg sent by process q; if l_vector[q] <= r_vector[q] then l_vector[q] := r_vector[q] + 1; for i := 1 to n do l_vector[i] := max(l_vector[i], r_vector[i]); Timestamps attached to the events are compared as follows: ep → fq iff Tep[p] < Tfq[p] (where ep represents an event e occurring in process p, Tep represents the timestamp vector of the event ep, and the ith element of Tep is denoted by Tep[i].) This means event ep happened before event fq if and only if process q received a direct or indirect message from p and that message was sent after ep had occurred. If ep and fq are in the same process (i.e., p = q), the local elements of their timestamps represent their occurrences in the process. 169
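The receive rule and the comparison can be sketched in Python (0-based process indices and the function names are my own; the own-entry increment is rule 2, applied because the receipt is itself an event):

```python
def fidge_receive(own, l_vector, r_vector, sender):
    """Receiver's vector update on a message from `sender`.
    own / sender are 0-based indices; vectors are equal-length lists."""
    l_vector[own] += 1                              # rule 2: receipt is an event
    if l_vector[sender] <= r_vector[sender]:
        l_vector[sender] = r_vector[sender] + 1     # sender entry: special case
    for i in range(len(l_vector)):
        l_vector[i] = max(l_vector[i], r_vector[i])  # element-wise maximum
    return l_vector

def happened_before(t_e, t_f, p):
    """ep -> fq iff T_ep[p] < T_fq[p], where p is the process of event e."""
    return t_e[p] < t_f[p]

# Event f from the worked example: P2 (local [0,1,0]) receives [2,0,0] from P1.
assert fidge_receive(1, [0, 1, 0], [2, 0, 0], 0) == [3, 2, 0]
```

Unlike Lamport clocks, the comparison here is exact: `happened_before` is false in both directions precisely when the two events are concurrent.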
170
a b c d e f g h i j k l P1 P2 P3 Assign the Lamport’s and Fidge’s logical clock values for all the events in the above timing diagram. Assume that each process’s logical clock is set to 0 initially. 170
171
(Solution timing diagram, showing Lamport values 1–6 and Fidge vectors for events a–l. P1: a=[1,0,0], b=[2,0,0], c=[3,0,0], d=[4,0,0]; P2: e=[0,1,0], f=[3,2,0], g=[3,3,3], h=[3,4,3], i=[5,5,3]; P3: j=[0,0,1], k=[0,0,2], l=[0,0,3].) 171
172
The above diagram shows both Lamport timestamps (an integer value) and Fidge timestamps (a vector of integer values) for each event. Lamport clocks: 2 < 5 since b → h; 3 < 4 but c and g are concurrent. Fidge clocks: f → h since 2 < 4 is true; b → h since 2 < 3 is true; h ↛ a since 4 < 0 is false; c and h are concurrent since (3 < 3) is false and (4 < 0) is false. 172
173
(Timing diagram: events a–o across processes P1, P2, P3, P4.) Assign the Lamport’s and Fidge’s logical clock values for all the events in the above timing diagram. Assume that each process’s logical clock is set to 0 initially. 173
174
From the above timing diagram, what can you say about the following events?
between b and n: between b and o: between m and g: between c and h: between c and l: between j and g: between k and i: between j and h: 174
175
READING Reference: Colin Fidge, “Logical Time in Distributed Computing Systems”, IEEE Computer, Vol. 24, No. 8, August 1991. 175
176
Global State (1/3) Basic Idea: Sometimes you want to collect the current state of a distributed computation, called a distributed snapshot. It consists of all local states and messages in transit. Important: A distributed snapshot should reflect a consistent state: if the snapshot records the receipt of a message, it must also record its sending. 05 – Distributed Algorithms/5.3 Global State 176
177
Global State (2/3) Note: any process P can initiate taking a distributed snapshot P starts by recording its own local state P subsequently sends a marker along each of its outgoing channels When Q receives a marker through channel C, its action depends on whether it had already recorded its local state: – Not yet recorded: it records its local state, and sends the marker along each of its outgoing channels – Already recorded: the marker on C indicates that the channel’s state should be recorded: all messages received on C between the time Q recorded its own state and the arrival of this marker. Q is finished when it has received a marker along each of its incoming channels 05 – Distributed Algorithms/5.3 Global State 177
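The per-process marker handling can be sketched as follows. This is a sketch of the rules above only (class layout and channel naming are my own; sending markers onward is left as a comment because message transport is out of scope here):

```python
class SnapshotProcess:
    """One process's side of the distributed-snapshot (marker) algorithm.
    Incoming channels are identified by name; state is any value."""

    def __init__(self, state, incoming):
        self.state = state
        self.incoming = set(incoming)           # channels still to hear a marker on
        self.recorded_state = None              # local state, once recorded
        self.channel_state = {c: [] for c in incoming}
        self.marker_seen = set()

    def on_message(self, channel, msg):
        # After recording, messages on channels without a marker yet
        # are in transit and belong to that channel's recorded state.
        if self.recorded_state is not None and channel not in self.marker_seen:
            self.channel_state[channel].append(msg)

    def on_marker(self, channel):
        if self.recorded_state is None:
            self.recorded_state = self.state    # first marker: record local state
            # ...and send a marker along each outgoing channel (omitted here)
        self.marker_seen.add(channel)
        return self.marker_seen == self.incoming   # True when Q is finished
```

A usage trace: after the first marker arrives, any message on a still-unmarked channel is captured as channel state; the process finishes once every incoming channel has delivered its marker.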
178
Global State (3/3)
(a) Organization of a process and channels for a distributed snapshot (b) Process Q receives a marker for the first time and records its local state (c) Q records all incoming message (d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel 05 – Distributed Algorithms/5.3 Global State 178
179
Election Algorithms Principle: Many distributed algorithms require that some process acts as a coordinator. The question is how to select this special process dynamically. Note: In many systems the coordinator is chosen by hand (e.g., file servers, DNS servers). This leads to centralized solutions => single point of failure. Question: If a coordinator is chosen dynamically, to what extent can we speak about a centralized or distributed solution? Question: Is a fully distributed solution, i.e., one without a coordinator, always more robust than any centralized/coordinated solution? 05 – Distributed Algorithms/5.4 Election Algorithms 179
180
Election by Bullying (1/2)
Principle: Each process has an associated priority (weight). The process with the highest priority should always be elected as the coordinator. Issue: How do we find the heaviest process? Any process can just start an election by sending an election message to all other processes (assuming you don’t know the weights of the others). If a process Pheavy receives an election message from a lighter process Plight, it sends a take-over message to Plight. Plight is out of the race. If a process doesn’t get a take-over message back, it wins, and sends a victory message to all other processes. 05 – Distributed Algorithms/5.4 Election Algorithms 180
181
Election by Bullying (2/2)
Question: We’re assuming something very important here – what? Assumption: Each process knows the process number of other processes 05 – Distributed Algorithms/5.4 Election Algorithms 181
182
Election in a Ring Principle: Process priority is obtained by organizing processes into a (logical) ring. The process with the highest priority should be elected as coordinator. Any process can start an election by sending an election message to its successor. If a successor is down, the message is passed on to the next successor. If a message is passed on, the sender adds itself to the list. When it gets back to the initiator, everyone has had a chance to make its presence known. The initiator sends a coordinator message around the ring containing a list of all living processes. The one with the highest priority is elected as coordinator. See Figure 5-12. Question: Does it matter if two processes initiate an election? Question: What happens if a process crashes during the election? 05 – Distributed Algorithms/5.4 Election Algorithms 182
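A single election pass can be sketched as one trip around the ring. This is a simplified sketch under strong assumptions: one initiator, a static `alive` map standing in for crash detection, and the highest process id used as the priority:

```python
def ring_election(ring, initiator, alive):
    """ring: process ids in ring order. The election message travels once
    around the ring; dead successors are skipped, live ones add themselves.
    Returns (membership list, elected coordinator)."""
    n = len(ring)
    idx = ring.index(initiator)
    members = []
    pos = idx
    while True:
        if alive[ring[pos]]:
            members.append(ring[pos])       # live process adds itself
        pos = (pos + 1) % n                 # pass to (next live) successor
        if pos == idx:
            break                           # back at the initiator
    return members, max(members)            # highest id wins

members, leader = ring_election([3, 1, 4, 2], 3, {3: True, 1: False, 4: True, 2: True})
```

With ring [3, 1, 4, 2] and process 1 down, the pass collects [3, 4, 2] and elects 4, which the initiator would then announce in the coordinator message.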
183
Mutual Exclusion Problem: A number of processes in a distributed system want exclusive access to some resource. Basic solutions: Via a centralized server. Completely distributed, with no topology imposed. Completely distributed, making use of a (logical) ring. Centralized: Really simple: 05 – Distributed Algorithms/5.5 Mutual Exclusion 183
184
Mutual Exclusion: Ricart & Agrawala
Principle: The same as Lamport except that acknowledgments aren’t sent. Instead, replies (i.e., grants) are sent only when: The receiving process has no interest in the shared resource; or The receiving process is waiting for the resource, but has lower priority (known through comparison of timestamps). In all other cases, reply is deferred (see the algorithm on pg. 267) 05 – Distributed Algorithms/5.5 Mutual Exclusion 184
185
Mutual Exclusion: Token Ring Algorithm
Essence: Organize processes in a logical ring, and let a token be passed between them. The one that holds the token is allowed to enter the critical region (if it wants to) 05 – Distributed Algorithms/5.5 Mutual Exclusion 185
186
Distributed Transactions
The transaction model Classification of transactions Concurrency control 186
187
The Transaction Model (1)
Updating a master tape is fault tolerant. Question: What happens if this computer operation fails? Both tapes are rewound and the job is restarted from the beginning without any harm being done 187
188
The Transaction Model (2)
Primitive Description BEGIN_TRANSACTION Mark the start of a transaction END_TRANSACTION Terminate the transaction and try to commit ABORT_TRANSACTION Kill the transaction and restore the old values READ Read data from a file, a table, or otherwise WRITE Write data to a file, a table, or otherwise Figure 5-18 Example primitives for transactions. 188
189
The Transaction Model (3)
BEGIN_TRANSACTION reserve BOS -> JFK; reserve JFK -> ICN; reserve SEL -> KPO; END_TRANSACTION (a) BEGIN_TRANSACTION reserve BOS -> JFK; reserve JFK -> ICN; reserve SEL -> KPO full => ABORT_TRANSACTION (b) Transaction to reserve three flights commits Transaction aborts when third flight is unavailable 189
190
ACID Properties of Transactions
Atomic To the outside world, the transaction happens indivisibly Consistent The transaction does not violate system invariants Isolated Concurrent transactions do not interfere with each other Durable Once a transaction commits, the changes are permanent 190
191
Nested Transactions Constructed from a number of subtransactions
The top-level transaction may create children that run in parallel with one another to gain performance or simplify programming Each of these children is called a subtransaction and it may also have one or more subtransactions When any transaction or subtransaction starts, it is conceptually given a private copy of all data in the entire system for it to manipulate as it wishes If it aborts, its private space is destroyed If it commits, its private space replaces the parent’s space If the top-level transaction aborts, all the changes made in the subtransactions must be wiped out 191
192
Distributed Transactions
- Transactions involving subtransactions that operate on data that are distributed across multiple machines - Separate distributed algorithms are needed to handle the locking of data and committing the entire transaction 192
193
Implementing Transactions
Private Workspace Gives a private workspace (i.e., all the data it has access to) to a process when it begins a transaction Writeahead Log Files are actually modified in place, but before any block is changed, a record is written to a log telling which transaction is making the change which file and block is being changed what the old and new values are Only after the log has been written successfully is the change made to the file Question: Why is a log needed? for “rollback” if necessary 193
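A write-ahead log of this kind fits in a small sketch (the in-memory dictionary standing in for files, and all names, are illustrative): each write logs the old and new value before touching the store, and rollback replays the log backwards.

```python
class WriteAheadLog:
    """Minimal write-ahead log over a dict: log old/new values before
    modifying in place; replay the log in reverse to roll back."""

    def __init__(self, store):
        self.store = store
        self.log = []

    def write(self, key, new_value):
        self.log.append((key, self.store.get(key), new_value))  # [key = old / new]
        self.store[key] = new_value     # change applied only after logging

    def rollback(self):
        for key, old, _new in reversed(self.log):
            self.store[key] = old       # restore old values, newest first
        self.log.clear()

db = {"x": 0, "y": 0}
wal = WriteAheadLog(db)
wal.write("x", 1)   # log: [x = 0 / 1]
wal.write("y", 2)   # log: [y = 0 / 2]
wal.write("x", 4)   # log: [x = 1 / 4]
wal.rollback()      # db is back to {"x": 0, "y": 0}
```

The three log records mirror the (b)–(d) entries of the slide's worked example; replaying them in reverse is exactly the rollback the question at the end of the slide refers to.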
194
Private Workspace (a) The file index and disk blocks for a three-block file (b) The situation after a transaction has modified block 0 and appended block 3 (c) After committing 194
195
Writeahead Log x = 0; y = 0; BEGIN_TRANSACTION; x = x + 1; y = y + 2; x = y * y; END_TRANSACTION; (a) Log: [x = 0 / 1] (b) [y = 0 / 2] (c) [x = 1 / 4] (d) (a) a transaction (b) – (d) The log before each statement is executed 195
196
Concurrency Control (1)
The goal of concurrency control is to allow multiple transactions to be executed simultaneously Final result should be the same as if all transactions had run sequentially Fig General organization of managers for handling transactions 196
197
Concurrency Control (2)
General organization of managers for handling distributed transactions. 197
198
Serializability (1) BEGIN_TRANSACTION x = 0; x = x + 1; END_TRANSACTION (a) BEGIN_TRANSACTION x = 0; x = x + 2; END_TRANSACTION (b) BEGIN_TRANSACTION x = 0; x = x + 3; END_TRANSACTION (c) (a) – (c) Three transactions T1, T2, and T3 Schedule 1 x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3 Legal Schedule 2 x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; Schedule 3 x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; Illegal (d) (d) Possible schedules Question: Why is Schedule 3 illegal? 198
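One way to see why Schedule 3 is illegal is to run the interleavings: every serial order ends with some transaction's `x = 0; x = x + k`, so a serial result can only be 1, 2, or 3, while Schedule 3 yields 5. A small sketch of that check (modelling each primitive statement as a function of x):

```python
def run(schedule):
    """Run an interleaving: each step maps the current x to a new x."""
    x = 0
    for step in schedule:
        x = step(x)
    return x

# The six primitive steps of T1, T2, T3:
zero = lambda x: 0
add1, add2, add3 = (lambda x: x + 1), (lambda x: x + 2), (lambda x: x + 3)

serial_results = {1, 2, 3}      # whichever transaction runs last sets x = 0, then adds k

schedule1 = [zero, add1, zero, add2, zero, add3]   # serial: legal
schedule3 = [zero, zero, add1, zero, add2, add3]   # interleaved

assert run(schedule1) in serial_results
assert run(schedule3) not in serial_results        # 5: equivalent to no serial order
```

Schedule 1 is trivially legal (it is serial); Schedule 3's final value of 5 matches no serial execution, which is precisely what "illegal" means here.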
199
Serializability (2) Two operations conflict if they operate on the same data and at least one of them is a write operation read-write conflict: exactly one of the operations is a write write-write conflict: both operations are writes Concurrency control algorithms can generally be classified by looking at the way read and write operations are synchronized Using locking Explicitly ordering operations using timestamps 199
200
Fig. 5-26 Two-phase locking
In two-phase locking (2PL), the scheduler first acquires all the locks it needs during the growing (1st) phase, and then releases them during the shrinking (2nd) phase See the rules on pg. 284 Fig Two-phase locking 200
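The two-phase discipline for a single transaction can be sketched as a tiny state machine: the first release ends the growing phase, after which any further acquire is a violation. (A sketch only; a real scheduler would also handle conflicts between transactions, which is omitted here.)

```python
class TwoPhaseLocker:
    """Enforces 2PL for one transaction: all lock acquisitions must
    precede the first release (growing phase, then shrinking phase)."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after first release")
        self.held.add(item)

    def release(self, item):
        self.shrinking = True           # first release ends the growing phase
        self.held.discard(item)

t = TwoPhaseLocker()
t.acquire("a"); t.acquire("b")
t.release("a")
# t.acquire("c") would now raise: the growing phase is over
```

Strict 2PL (next slide) is the special case where `release` is simply not called until the transaction has committed or aborted.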
201
Fig. 5-27 Strict two-phase locking
In strict two-phase locking, the shrinking phase does not take place until the transaction has finished running and has either committed or aborted. Fig Strict two-phase locking 201
202
READING: Read Chapter 5 202
203
Principles and Paradigms
Distributed Systems Principles and Paradigms Chapter 07 Fault Tolerance 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems
204
Introduction Basic concepts Process resilience
Reliable client-server communication Reliable group communication Distributed commit Recovery 07 – Fault Tolerance/
205
07 – 2 Fault Tolerance/7.1 Introduction
Dependability Basics: A component provides services to clients. To provide services, the component may require the services from other components a component may depend on some other component. Specifically: A component C depends on C* if the correctness of C’s behavior depends on the correctness of C*’s behavior. Some properties of dependability: Availability Readiness for usage Reliability Continuity of service delivery Safety Very low probability of catastrophes Maintainability How easily a failed system can be repaired Note: For distributed systems, components can be either processes or channels 07 – Fault Tolerance/7.1 Introduction
206
Terminology Failure: When a component is not living up to its specifications, a failure occurs Error: That part of a component’s state that can lead to a failure Fault: The cause of an error Fault prevention: prevent the occurrence of a fault Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults (i.e., mask the presence of faults) Fault removal: reduce the presence, number, seriousness of faults Fault forecasting: estimate the present number, future incidence, and the consequences of faults 07 – Fault Tolerance/7.1 Introduction
207
Crash failures: A component simply halts, but behaves correctly before halting Omission failures: A component fails to respond Timing failures: The output of a component is correct, but lies outside a specified real-time interval (performance failures: too slow) Response failures: The output of a component is incorrect (but the failure cannot be attributed to another component) Value failure: The wrong value is produced State transition failure: Execution of the component’s service brings it into a wrong state Arbitrary failures: A component may produce arbitrary output and be subject to arbitrary timing failures Observation: Crash failures are the least severe; arbitrary failures are the worst 07 – Fault Tolerance/Failure Models
208
Crash Failures Problem: Clients cannot distinguish between a crashed component and one that is just a bit slow Examples: Consider a server from which a client is expecting output: Is the server perhaps exhibiting timing or omission failures? Is the channel between client and server faulty (crashed, or exhibiting timing or omission failures)? Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts) Fail-safe: The component exhibits arbitrary, but benign failures (they can’t do any harm) 07 – Fault Tolerance/Failure Models
209
Basic issue: Protect yourself against faulty processes by replicating and distributing computations in a group. Flat groups: Good for fault tolerance as information exchange immediately occurs with all group members; however, may impose more overhead as control is completely distributed (hard to implement). Hierarchical groups: All communication goes through a single coordinator: not really fault tolerant or scalable, but relatively easy to implement. 07 – Fault Tolerance/7.2 Process Resilience
210
Groups and Failure Masking (1/3) Terminology: when a group can mask any k concurrent member failures, it is said to be k-fault tolerant (k is called degree of fault tolerance). Problem: how large does a k-fault tolerant group need to be? Assume crash/performance failure semantics: a total of k + 1 members is needed to survive k member failures. Assume arbitrary failure semantics, and group output defined by voting: a total of 2k + 1 members is needed to survive k member failures. Assumption: all members are identical, and process all input in the same order; only then can we be sure that they do exactly the same thing. 07 – Fault Tolerance/7.2 Process Resilience
211
Groups and Failure Masking (2/3) Assumption: Group members are not identical, i.e., we have a distributed computation Problem: Nonfaulty group members should reach agreement on the same value Observation: Assuming arbitrary failure semantics, we need 3k + 1 group members to survive the attacks of k faulty members Note: This is also known as Byzantine failures. Essence: We are trying to reach a majority vote among the group of loyalists, in the presence of k traitors need 2k + 1 loyalists. 07 – Fault Tolerance/7.2 Process Resilience
212
Groups and Failure Masking (3/3) (a) what they send to each other (b) what each one got from the other (c) what each one got in second step 07 – Fault Tolerance/7.2 Process Resilience
213
So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels? Error detection: Framing of packets to allow for bit error detection Use of frame numbering to detect packet loss Error correction: Add so much redundancy that corrupted packets can be automatically corrected Request retransmission of lost packets, or of the last N packets Observation: Most of this work assumes point-to-point communication 07 – Fault Tolerance/Reliable Communication
214
Reliable RPC (1/3) What can go wrong?: 1: Client cannot locate server 2: Client request is lost 3: Server crashes 4: Server response is lost 5: Client crashes [1:] Relatively simple – just report back to client [2:] Just resend message 07 – Fault Tolerance/Reliable Communication
215
Reliable RPC (2/3) [3:] Server crashes are harder as you don’t know what it had already done: Problem: We need to decide on what we expect from the server At-least-once-semantics: The server guarantees it will carry out an operation at least once, no matter what At-most-once-semantics: The server guarantees it will carry out an operation at most once. 07 – Fault Tolerance/Reliable Communication
216
Reliable RPC (3/3) [4:] Detecting lost replies can be hard, because it can also be that the server has crashed. You don’t know whether the server has carried out the operation Solution: None, except that you can try to make your operations idempotent: repeatable without any harm done if it happened to be carried out before. [5:] Problem: The server is doing work and holding resources for nothing (called doing an orphan computation). Orphan is killed (or rolled back) by client when it reboots Broadcast new epoch number when recovering servers kill orphans Require computations to complete in T time units. Old ones are simply removed. Question: What’s the rolling back for? 07 – Fault Tolerance/Reliable Communication
217
Reliable Multicasting (1/2) Basic model: We have a multicast channel c with two (possibly overlapping) groups: The sender group SND(c) of processes that submit messages to channel c The receiver group RCV(c) of processes that can receive messages from channel c Simple reliability: If process P ∈ RCV(c) at the time message m was submitted to c, and P does not leave RCV(c) , m should be delivered to P Atomic multicast: How can we ensure that a message m submitted to channel c is delivered to process P ∈ RCV(c) only if m is delivered to all members of RCV(c) 07 – Fault Tolerance/Reliable Communication
218
Reliable Multicasting (2/2) Observation: If we can stick to a local-area network, reliable multicasting is “easy” Principle: Let the sender log messages submitted to channel c: If P sends message m, m is stored in a history buffer Each receiver acknowledges the receipt of m, or requests retransmission at P when noticing message lost Sender P removes m from history buffer when everyone has acknowledged receipt Question: Why doesn’t this scale? 07 – Fault Tolerance/Reliable Communication
219
Scalable Reliable Multicasting: Feedback Suppression Basic idea: Let a process P suppress its own feedback when it notices another process Q is already asking for a retransmission Assumptions: All receivers listen to a common feedback channel to which feedback messages are submitted Process P schedules its own feedback message randomly, and suppresses it when observing another feedback message Question: Why is the random schedule so important? 07 – Fault Tolerance/Reliable Communication
220
Hierarchical Solutions
Scalable Reliable Multicasting: Hierarchical Solutions Basic solution: Construct a hierarchical feedback channel in which all submitted messages are sent only to the root. Intermediate nodes aggregate feedback messages before passing them on. Question: What’s the main problem with this solution? Observation: Intermediate nodes can easily be used for retransmission purposes 07 – Fault Tolerance/Reliable Communication
221
Atomic Multicast Idea: Formulate reliable multicasting in the presence of process failures in terms of process groups and changes to group membership: Guarantee: A message is delivered only to the nonfaulty members of the current group. All members should agree on the current group membership. Keyword: Virtually synchronous multicast 07 – Fault Tolerance/Reliable Communication
222
Virtual Synchrony (1/2) Essence: We consider views V ⊆ RCV(c) ∪ SND(c) Processes are added or deleted from a view V through view changes to V*; a view change is to be executed locally by each P ∈ V ∩ V* (1) For each consistent state, there is a unique view on which all its members agree. Note: implies that all nonfaulty processes see all view changes in the same order (2) If message m is sent to V before a view change vc to V*, then either all P ∈ V that execute vc receive m, or no processes P ∈ V that execute vc receive m. Note: all nonfaulty members in the same view get to see the same set of multicast messages. (3) A message sent to view V can be delivered only to processes in V, and is discarded by successive views A reliable multicast algorithm satisfying (1)–(3) is virtually synchronous 07 – Fault Tolerance/Reliable Communication
223
Virtual Synchrony (2/2) A sender to a view V need not be member of V If a sender S ∈ V crashes, its multicast message m is flushed before S is removed from V: m will never be delivered after the point that S has left V Note: Messages from S may still be delivered to all, or none, (nonfaulty) processes in V before they all agree on a new view to which S does not belong If a receiver P fails, a message m may be lost but can be recovered as we know exactly what has been received in V. Alternatively, we may decide to deliver m to members in V − {P} Observation: Virtually synchronous behavior can be seen independent from the ordering of message delivery. The only issue is that messages are delivered to an agreed upon group of receivers. 07 – Fault Tolerance/Reliable Communication
224
Virtual Synchrony Implementation (1/3)
The current view is known at each P by means of a delivery list dest[P] If P ∈ dest[Q] then Q ∈ dest[P] Messages received by P are queued in queue[P] If P fails, the group view must change, but not before all messages from P have been flushed Each P attaches a (stepwise increasing) timestamp with each message it sends Assume FIFO-ordered delivery; the highest numbered message from Q that has been received by P is recorded in rcvd[P][Q] The vector rcvd[P][] is sent (as a control message) to all members in dest[P] Each P records rcvd[Q][] in remote[P][Q] 07 – Fault Tolerance/Reliable Communication
225
Virtual Synchrony Implementation (2/3)
Observation: remote[P][Q] shows what P knows about message arrival at Q A message is stable if it has been received by all Q ∈ dest[P] (shown as the min vector) Stable messages can be delivered to the next layer (which may deal with ordering). Note: Causal message delivery comes for free As soon as all messages from the faulty process have been flushed, that process can be removed from the (local) views 07 – Fault Tolerance/Reliable Communication
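The stability test can be sketched as a minimum over the remote vectors. A sketch under illustrative assumptions: `remote_view[Q][S]` stands for P's copy of rcvd[Q][S], the highest-numbered message from sender S that P knows Q has received.

```python
def highest_stable(remote_view, sender, dest):
    """remote_view[Q][S]: highest message number from S known received by Q.
    Messages from `sender` numbered up to the returned value have been
    received by every member of dest, i.e. they are stable and may be
    delivered to the next layer."""
    return min(remote_view[q][sender] for q in dest)

# P knows it has messages 1..5 from itself; it knows Q has 1..3 of them:
view = {"P": {"P": 5, "Q": 2}, "Q": {"P": 3, "Q": 2}}
assert highest_stable(view, "P", ["P", "Q"]) == 3   # messages 1..3 are stable
```

This is the "min vector" of the slide: a message becomes stable exactly when the slowest member of dest[P] is known to have received it.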
226
Virtual Synchrony Implementation (3/3)
Remains: What if a sender P failed and not all its messages made it to the non-faulty members of the current view? Solution: Select a coordinator which has all (unstable) messages from P, and forward those to the other group members. Note: Member failure is assumed to be detected and subsequently multicast to the current view as a view change. That view change will not be carried out before all messages in the current view have been delivered. 07 – Fault Tolerance/Reliable Communication
227
Two-phase commit Three-phase commit Essential issue: Given a computation distributed across a process group, how can we ensure that either all processes commit to the final result, or none of them do (atomicity)? 07 – Fault Tolerance/7.5 Distributed Commit
228
Two-Phase Commit (1/2) Model: The client who initiated the computation acts as coordinator; processes required to commit are the participants Phase 1a: Coordinator sends VOTE_REQUEST to participants (also called a pre-write) Phase 1b: When participant receives VOTE_REQUEST it returns either YES or NO to coordinator. If it sends NO, it aborts its local computation Phase 2a: Coordinator collects all votes; if all are YES, it sends COMMIT to all participants, otherwise it sends ABORT Phase 2b: Each participant waits for COMMIT or ABORT and handles accordingly. 07 – Fault Tolerance/7.5 Distributed Commit
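The coordinator's side of the two phases reduces to collecting votes and broadcasting one decision. A minimal sketch only: participants are modelled as callables returning their vote, and timeouts, crashes, and logging (the subjects of the following slides) are deliberately left out.

```python
def two_phase_commit(participants):
    """Coordinator side of 2PC. Each participant is a callable that,
    given a vote request, returns 'YES' or 'NO'."""
    votes = [p() for p in participants]     # Phase 1a/1b: VOTE_REQUEST, collect votes
    if all(v == "YES" for v in votes):      # Phase 2a: unanimous YES -> COMMIT
        decision = "COMMIT"
    else:
        decision = "ABORT"                  # any NO (or missing vote) -> ABORT
    return decision                         # Phase 2b: sent to every participant

assert two_phase_commit([lambda: "YES", lambda: "YES"]) == "COMMIT"
assert two_phase_commit([lambda: "YES", lambda: "NO"]) == "ABORT"
```

Everything interesting about 2PC lies in what this sketch omits: what each side does when the other crashes between the two phases, which the next slides cover.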
229
Two-Phase Commit (2/2) 07 – Fault Tolerance/7.5 Distributed Commit
230
2PC – Failing Participant Observation: Consider participant crash in one of its states, and the subsequent recovery to that state: Initial state: No problem, as participant was unaware of the protocol Ready state: Participant is waiting to either commit or abort. After recovery, participant needs to know which state transition it should make, so it logs the coordinator’s decision Abort state: Merely make entry into abort state idempotent, e.g., removing the workspace of results Commit state: Also make entry into commit state idempotent, e.g., copying workspace to storage. Observation: When distributed commit is required, having participants use temporary workspaces to keep their results allows for simple recovery in the presence of failures. 07 – Fault Tolerance/7.5 Distributed Commit
2PC – Failing Coordinator Observation: The real problem lies in the fact that the coordinator’s final decision may not be available for some time (or may actually be lost) Alternative: Let a participant P in the ready state time out when it hasn’t received the coordinator’s decision; P tries to find out what other participants know. Question: What if P cannot get the required information? Observation: The essence of the problem is that a recovering participant cannot make a local decision: it is dependent on other (possibly failed) processes 07 – Fault Tolerance/7.5 Distributed Commit
Three-Phase Commit (1/2) Phase 1a: Coordinator sends VOTE_REQUEST to participants Phase 1b: When participant receives VOTE_REQUEST it returns either YES or NO to coordinator. If it sends NO, it aborts its local computation Phase 2a: Coordinator collects all votes; if all are YES, it sends PREPARE to all participants, otherwise it sends ABORT, and halts Phase 2b: Each participant waits for PREPARE, or waits for ABORT after which it halts Phase 3a: (Prepare to commit) Coordinator waits until all participants have ACKed receipt of PREPARE message, and then sends COMMIT to all Phase 3b: (Prepare to commit) Participant waits for COMMIT 07 – Fault Tolerance/7.5 Distributed Commit
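The extra PREPARE round can be sketched in the same style as before; again this is an in-memory simulation (the Participant3 class and its methods are illustrative assumptions), and the failure handling that motivates 3PC is omitted.

```python
# Sketch of 3PC. The pre-commit phase guarantees that no participant
# commits while another could still be in INIT or READY.

class Participant3:
    def __init__(self, will_vote_yes):
        self.will_vote_yes = will_vote_yes
        self.state = "INIT"

    def vote(self):
        self.state = "READY" if self.will_vote_yes else "ABORT"
        return "YES" if self.will_vote_yes else "NO"

    def prepare(self):
        self.state = "PRECOMMIT"   # participant ACKs receipt of PREPARE

    def commit(self):
        self.state = "COMMIT"

    def abort(self):
        self.state = "ABORT"

def three_phase_commit(participants):
    # Phase 1: VOTE_REQUEST and vote collection.
    if not all(p.vote() == "YES" for p in participants):
        for p in participants:
            p.abort()
        return "ABORT"
    # Phase 2: PREPARE; every participant moves to pre-commit and ACKs.
    for p in participants:
        p.prepare()
    # Phase 3: COMMIT once all ACKs have arrived.
    for p in participants:
        p.commit()
    return "COMMIT"
```

Note how adjacent states never differ by more than one transition, which is the property the failure analysis on the next slide relies on.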
Three-Phase Commit (2/2) 07 – Fault Tolerance/7.5 Distributed Commit
3PC – Failing Participant Basic issue: Can P find out what it should do after crashing in the ready or pre-commit state, even if other participants or the coordinator failed? Essence: Coordinator and participants, on their way to commit, never differ by more than one state transition Consequence: If a participant times out in the ready state, it can find out at the coordinator or other participants whether it should abort, or enter the pre-commit state Observation: If a participant already made it to the pre-commit state, it can always safely commit (but may not do so unilaterally, to account for possibly failed processes) Observation: We may need to elect another coordinator to send off the final COMMIT 07 – Fault Tolerance/7.5 Distributed Commit
Introduction Checkpointing Message Logging 07 – Fault Tolerance/Recovery
Recovery: Background Essence: When a failure occurs, we need to bring the system into an error-free state: Forward error recovery: Find a new state from which the system can continue operation Backward error recovery: Bring the system back into a previous error-free state Practice: Use backward error recovery, requiring that we establish recovery points Observation: Recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from where to recover 07 – Fault Tolerance/Recovery
Consistent Recovery State Requirement: Every message that has been received is also shown to have been sent in the state of the sender Recovery line: Assuming processes regularly checkpoint their state, the most recent consistent global checkpoint. Observation: If and only if the system provides reliable communication, should sent messages also be received in a consistent state 07 – Fault Tolerance/Recovery
Cascaded Rollback Observation: If checkpointing is done at the “wrong” instants, the recovery line may lie at system startup time ⇒ cascaded rollback 07 – Fault Tolerance/Recovery
Checkpointing: Stable Storage Principle: Replicate all data on at least two disks, and keep one copy “correct” at all times. After a crash: If both disks are identical: you’re in good shape. If one is bad, but the other is okay (checksums): choose the good one. If both seem okay, but are different: choose the main disk. If neither disk is good: you’re in bad shape. 07 – Fault Tolerance/Recovery
Independent Checkpointing Essence: Each process independently takes checkpoints, with the risk of a cascaded rollback to system startup. Let CP[i](m) denote the mth checkpoint of process Pi and INT[i](m) the interval between CP[i](m - 1) and CP[i](m) When process Pi sends a message in interval INT[i](m), it piggybacks (i,m) When process Pj receives a message in interval INT[j](n), it records the dependency INT[i](m) → INT[j](n) The dependency INT[i](m) → INT[j](n) is saved to stable storage when taking checkpoint CP[j](n) Observation: If process Pi rolls back to CP[i](m - 1), Pj must roll back to CP[j](n - 1). Question: How can Pj find out where to roll back to? 07 – Fault Tolerance/Recovery
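The piggybacking and dependency recording above can be sketched as follows; the Process class is an illustrative assumption, and "stable storage" is simulated by an ordinary in-memory dict.

```python
# Sketch of dependency tracking for independent checkpointing. Each
# outgoing message carries (i, m); each receipt in interval INT[j](n)
# records the dependency INT[i](m) -> INT[j](n).
from collections import defaultdict

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0              # index m of the current interval
        self.deps = defaultdict(set)   # interval n -> {(i, m), ...}

    def checkpoint(self):
        # Taking CP[j](n) "saves" the dependencies of INT[j](n) (here
        # they simply remain in self.deps) and starts a new interval.
        self.interval += 1

    def send(self, payload):
        return (self.pid, self.interval, payload)   # piggyback (i, m)

    def receive(self, msg):
        sender, sender_interval, _ = msg
        self.deps[self.interval].add((sender, sender_interval))
```

If Pi rolls back to CP[i](m - 1), any Pj whose interval n records (i, m) must roll back to CP[j](n - 1); chasing these recorded pairs is exactly how Pj finds out where to roll back to.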
Coordinated Checkpointing Essence: Each process takes a checkpoint after a globally coordinated action Question: What advantages are there to coordinated checkpointing? Simple solution: Use a two-phase blocking protocol: A coordinator multicasts a checkpoint request message When a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint When all checkpoints have been confirmed at the coordinator, the latter broadcasts a checkpoint done message to allow all processes to continue Observation: It is possible to consider only those processes that depend on the recovery of the coordinator, and ignore the rest 07 – Fault Tolerance/Recovery
Message Logging Alternative: Instead of taking an (expensive) checkpoint, try to replay your (communication) behavior from the most recent checkpoint ⇒ store messages in a log Assumption: We assume a piecewise deterministic execution model: The execution of each process can be considered as a sequence of state intervals Each state interval starts with a nondeterministic event (e.g., message receipt) Execution in a state interval is deterministic Conclusion: If we record nondeterministic events (to replay them later), we obtain a deterministic execution model that will allow us to do a complete replay Question: Why is logging only messages not enough? Question: Is logging only nondeterministic events enough? 07 – Fault Tolerance/Recovery
Message Logging and Consistency
Problem: When should we actually log messages? Issue: Avoid orphans: Process Q has just received and subsequently delivered messages m1 and m2 Assume that m2 is never logged. After delivering m1 and m2 , Q sends message m3 to process R Process R receives and subsequently delivers m3 Goal: Devise message logging schemes in which orphans do not occur 07 – Fault Tolerance/Recovery
Message-Logging Schemes (1/2) HDR[m]: The header of message m containing its source, destination, sequence number, and delivery number The header contains all information for resending a message and delivering it in the correct order (assume data is reproduced by the application) A message m is stable if HDR[m] cannot be lost (e.g., because it has been written to stable storage) DEP[m]: The set of processes to which message m has been delivered, as well as any process that has delivered a message causally depending on the delivery of m COPY[m]: The set of processes that have a copy of HDR[m] in their volatile memory If C is a collection of crashed processes, then Q ∉ C is an orphan if there is a message m such that Q ∈ DEP[m] and COPY[m] ⊆ C 07 – Fault Tolerance/Recovery
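The orphan condition translates almost literally into set operations; a small sketch, assuming DEP and COPY are given as dicts mapping a message to a set of process names:

```python
# Orphan detection: given the crashed set C, Q is an orphan if Q is not
# in C, Q is in DEP[m], and COPY[m] is a subset of C for some message m.
def orphans(C, DEP, COPY):
    """C: set of crashed processes; DEP/COPY: dicts message -> set."""
    result = set()
    for m in DEP:
        if COPY[m] <= C:            # every copy of HDR[m] has been lost
            result |= DEP[m] - C    # surviving dependents become orphans
    return result
```

With the slide's scenario (m2 never logged, so only the crashed sender P ever held HDR[m2], while Q and R depend on it), both Q and R come out as orphans.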
Message-Logging Schemes (2/2) Goal: No orphans means that for each message m, DEP[m] ⊆ COPY[m] Pessimistic protocol: for each nonstable message m, there is at most one process dependent on m, that is |DEP[m]| ≤ 1 Consequence: An unstable message in a pessimistic protocol must be made stable before sending a next message Optimistic protocol: for each unstable message m, we ensure that if COPY[m] ⊆ C, then eventually also DEP[m] ⊆ C, where C denotes the set of processes that have been marked as faulty Consequence: To guarantee that DEP[m] ⊆ C, we generally roll back each orphan process Q until Q ∉ DEP[m] 07 – Fault Tolerance/Recovery
Distributed Systems Principles and Paradigms Chapter 08 Security 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems 00 – /
Overview Introduction Secure channels Access control
Security management 08 – Security/
Security: Dependability Revisited
Basics: A component provides services to clients. To provide services, the component may require the services of other components ⇒ a component may depend on some other component. Observation: In distributed systems, security is the combination of availability, integrity, and confidentiality. A dependable distributed system is thus fault tolerant and secure. 08 – Security/8.1 Introduction
Security Threats Subject: Entity capable of issuing a request for a service as provided by objects Channel: The carrier of requests and replies for services offered to subjects Object: Entity providing services to subjects. Channels and objects are subject to security threats: 08 – Security/8.1 Introduction
Security Mechanisms Issue: To protect against security threats, we have a number of security mechanisms at our disposal: Encryption: Transform data into something that an attacker cannot understand (confidentiality). It is also used to check whether something has been modified (integrity). Authentication: Verify the claim that a subject says it is S: verifying the identity of a subject. Authorization: Determining whether a subject is permitted to make use of certain services. Auditing: Trace which subjects accessed what, and in which way. Useful only if it can help catch an attacker. Note: authorization makes sense only if the requesting subject has been authenticated 08 – Security/8.1 Introduction
Security Policies (1/2) Policy: Prescribes how to use mechanisms to protect against attacks. Requires that a model of possible attacks is described (i.e., security architecture). Example: Globus security architecture There are multiple administrative domains Local operations subject to local security policies Global operations require requester to be globally known Interdomain operations require mutual authentication Global authentication replaces local authentication Users can delegate privileges to processes Credentials can be shared between processes in the same domain 08 – Security/8.1 Introduction
Security Policies (2/2) Policy statements lead to the introduction of mechanisms for cross-domain authentication and for making users globally known ⇒ user proxies and resource proxies 08 – Security/8.1 Introduction
Design Issue: Focus of Control Essence: What is our focus when talking about protection: (a) data, (b) invalid operations, (c) unauthorized users Note: We generally need all three, but each requires different mechanisms 08 – Security/8.1 Introduction
Design Issue: Layering of Mechanisms and TCB
Essence: At which logical level are we going to implement security mechanisms? Important: Whether security mechanisms are actually used is related to the trust a user has in those mechanisms. No trust ⇒ implement your own mechanisms. Trusted Computing Base: the set of mechanisms needed to enforce a policy; the smaller, the better. 08 – Security/8.1 Introduction
Cryptography Symmetric system: Use a single key to (1) encrypt the plaintext and (2) decrypt the ciphertext. Requires that sender and receiver share the secret key. Asymmetric system: Use different keys for encryption and decryption, of which one is private, and the other public. Hashing system: Compute a fixed-length digest over data of arbitrary length. There is no decryption; only comparison is possible. 08 – Security/8.1 Introduction
Cryptographic Functions (1/2) Essence: Make the encryption method E public, but let the encryption as a whole be parameterized by a key S (same for decryption) One-way function: Given some output mout of ES, it is (analytically and) computationally infeasible to find min such that ES(min) = mout Weak collision resistance: Given a pair <m, ES(m)>, it is computationally infeasible to find an m* ≠ m such that ES(m*) = ES(m) Strong collision resistance: It is computationally infeasible to find any two different inputs m and m* such that ES(m) = ES(m*) 08 – Security/8.1 Introduction
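A real hash function makes the one-way property concrete. SHA-256 is cheap to compute in the forward direction, no inversion procedure is known, and no two distinct inputs with the same digest have ever been found (strong collision resistance); the message strings here are arbitrary examples.

```python
# One-way-ness and collision resistance illustrated with SHA-256.
import hashlib

digest = hashlib.sha256(b"transfer $100 to Bob").hexdigest()
# Even a small change to the input yields a completely different digest.
digest2 = hashlib.sha256(b"transfer $900 to Bob").hexdigest()

assert digest != digest2
assert len(digest) == 64   # fixed 256-bit output, as 64 hex characters
```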
Cryptographic Functions (2/2) One-way key: Given an encrypted message mout, message min, and encryption function E, it is analytically and computationally infeasible to find a key K such that mout = EK(min) Weak key collision resistance: Given a triplet <m, K, E>, it is computationally infeasible to find a K* ≠ K such that EK*(m) = EK(m) Strong key collision resistance: It is computationally infeasible to find any two different keys K and K* such that for all m: EK(m) = EK*(m) Note: Not all cryptographic functions have keys (such as hash functions) 08 – Security/8.1 Introduction
Authentication Message Integrity and confidentiality Secure group communication 08 – Security/8.2Secure Channels
Goal: Set up a channel allowing for secure communication between two processes: They both know who is on the other side (authenticated). They both know that messages cannot be tampered with (integrity). They both know messages cannot leak away (confidentiality). 08 – Security/8.2Secure Channels
Authentication versus Integrity Note: Authentication and data integrity rely on each other: Consider an active attack by Trudy on the communication from Alice to Bob. Authentication without integrity: Alice’s message is authenticated, and intercepted by Trudy, who tampers with its content, but leaves the authentication part as is. Authentication has become meaningless. Integrity without authentication: Trudy intercepts a message from Alice, replaces its content with her own, and makes Bob believe it still came from Alice. Integrity has become meaningless. Question: What can we say about confidentiality versus authentication and integrity? 08 – Security/8.2 Secure Channels
Authentication: Secret Keys 1: Alice sends ID to Bob 2: Bob sends challenge RB (i.e., a random number) to Alice 3: Alice encrypts RB with shared key KA,B. Now Bob knows he’s talking to Alice 4: Alice sends challenge RA to Bob 5: Bob encrypts RA with KA,B. Now Alice knows she’s talking to Bob Note: We can “improve” the protocol by combining steps 1&4, and 2&3. This, however, costs us correctness. 08 – Security/8.2 Secure Channels
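The challenge-response core of steps 2-5 can be sketched as follows. This is a single-process simulation: HMAC-SHA256 stands in for "encrypt the challenge with K_A,B" (a common keyed-response construction, not what the slide's abstract protocol literally specifies), and the key and challenges are just random bytes.

```python
# Sketch of mutual challenge-response with a shared secret key.
import hashlib
import hmac
import os

K_AB = os.urandom(32)   # shared secret key of Alice and Bob (assumption)

def respond(key, challenge):
    """Keyed response to a challenge; stands in for K_A,B(R)."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

# Steps 2-3: Bob challenges Alice; only a holder of K_AB can answer.
R_B = os.urandom(16)
alice_answer = respond(K_AB, R_B)
assert hmac.compare_digest(alice_answer, respond(K_AB, R_B))

# Steps 4-5: Alice challenges Bob symmetrically.
R_A = os.urandom(16)
bob_answer = respond(K_AB, R_A)
assert hmac.compare_digest(bob_answer, respond(K_AB, R_A))
```

An attacker without K_AB cannot produce a valid response, which is why fresh random challenges authenticate the peer; the next slide shows how carelessly merging the two directions enables a reflection attack.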
Authentication: Secret Keys Reflection Attack
1: Chuck claims he’s Alice, and sends challenge RC 2: Bob returns a challenge RB and the encrypted RC 3: Chuck starts a second session, claiming he is Alice, but uses challenge RB 4: Bob sends back a challenge, plus KA,B (RB) 5: Chuck sends back KA,B (RB) for the first session to prove he is Alice 08 – Security/8.2Secure Channels
Authentication: Public Keys
1: Alice sends a challenge RA to Bob, encrypted with Bob’s public key KB+ 2: Bob decrypts the message, generates a secret key KA,B, proves he’s Bob (by sending RA back), and sends a challenge RB to Alice. Everything’s encrypted with Alice’s public key KA+ . 3: Alice proves she’s Alice by sending back the decrypted challenge, encrypted with generated secret key KA,B Note: KA,B is also known as a session key (we’ll come back to these keys later on). 08 – Security/8.2Secure Channels
Authentication: KDC (1/2) Problem: With N subjects, we need to manage N(N - 1)/2 keys, each subject knowing N - 1 keys Essence: Use a trusted Key Distribution Center that generates keys when necessary. Question: How many keys do we need to manage? 08 – Security/8.2Secure Channels
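The key-count arithmetic is worth making explicit: pairwise keys grow quadratically, while a KDC needs only one shared key per subject.

```python
# With pairwise secret keys, N subjects need N(N-1)/2 keys in total;
# with a KDC, each subject shares a single key with the center: N keys.
def pairwise_keys(n):
    return n * (n - 1) // 2

assert pairwise_keys(4) == 6
assert pairwise_keys(1000) == 499500   # versus only 1000 keys with a KDC
```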
Authentication: KDC (2/2) Inconvenient: We need to ensure that Bob knows about KA,B before Alice gets in touch. Solution: Let Alice do the work: the KDC hands her a ticket that she passes to Bob to set up a secure channel. Note: This is also known as the Needham-Schroeder authentication protocol, and is widely applied (in different forms). 08 – Security/8.2 Secure Channels
Needham-Schroeder: Subtleties Q1: Why does the KDC put Bob into its reply message, and Alice into the ticket? Q2: The ticket sent back to Alice by the KDC is encrypted with Alice’s key. Is this necessary? Security flaw: Suppose Chuck finds out Alice’s key ⇒ he can use that key anytime to impersonate Alice, even if Alice changes her secret key at the KDC. Reasoning: Once Chuck finds out Alice’s key, he can use it to decrypt a (possibly old) ticket for a session with Bob, and convince Bob to talk to him using the old session key. Solution: Have Alice get an encrypted number from Bob first, and put that number in the ticket provided by the KDC ⇒ we’re now ensuring that every session is known at the KDC. 08 – Security/8.2 Secure Channels
Confidentiality (1/2) Secret key: Use a shared secret key to encrypt and decrypt all messages sent between Alice and Bob Public key: If Alice sends a message m to Bob, she encrypts it with Bob’s public key: KB+(m) There are a number of problems with keys: Keys wear out: The more data is encrypted with a single key, the easier it becomes to find that key ⇒ don’t use keys too often Danger of replay: Using the same key for different communication sessions permits old messages to be inserted in the current session ⇒ don’t use the same key for different sessions 08 – Security/8.2 Secure Channels
Confidentiality (2/2) Compromised keys: If a key is compromised, you can never use it again. Really bad if all communication between Alice and Bob is based on the same key over and over again ⇒ don’t use the same key for different things Temporary keys: Untrusted components may play along perhaps just once, but you would never want them to have knowledge about your really good key for all times ⇒ make keys disposable Essence: Don’t use valuable and expensive keys for all communication, but only for authentication purposes. Solution: Introduce a “cheap” session key that is used only during one single conversation or connection (“cheap” also means efficient in encryption and decryption) 08 – Security/8.2 Secure Channels
Digital Signatures Harder requirements: Authentication: Receiver can verify the claimed identity of the sender Nonrepudiation: The sender can later not deny that he/she sent the message Integrity: The message cannot be maliciously altered during, or after receipt Solution: Let a sender sign all transmitted messages, in such a way that (1) the signature can be verified and (2) message and signature are uniquely associated 08 – Security/8.2Secure Channels
Public Key Signatures 1: Alice encrypts her message m with her private key KA– ⇒ m' = KA–(m) 2: She then encrypts m', along with the original message m, with Bob’s public key ⇒ m'' = KB+(m, KA–(m)), and sends m'' to Bob. 3: Bob decrypts the incoming message with his private key KB–. We know for sure that no one else has been able to read m, nor m', during their transmission. 4: Bob decrypts m' with Alice’s public key KA+. Bob now knows the message came from Alice. Question: Is this good enough to guarantee nonrepudiation? 08 – Security/8.2 Secure Channels
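The sign-with-K–, verify-with-K+ mechanics can be made concrete with textbook RSA. The tiny primes below are for illustration only; this is emphatically not a secure scheme (real systems use 2048-bit keys plus padding), and the key values are hand-picked assumptions.

```python
# Toy "textbook RSA" signature: sign with the private exponent d,
# verify by applying the public exponent e. NOT secure; demo only.
p, q = 61, 53
n = p * q        # 3233: the public modulus
e = 17           # public exponent  (plays the role of K_A+)
d = 2753         # private exponent (K_A–); e*d ≡ 1 mod (p-1)(q-1)

def sign(m):
    """m' = K_A–(m): only the holder of d can compute this."""
    return pow(m, d, n)

def verify(sig):
    """Applying K_A+ to m' should recover m."""
    return pow(sig, e, n)

m = 42
assert verify(sign(m)) == m         # Bob recovers m: it came from Alice
assert verify(sign(m) + 1) != m     # a tampered signature fails to verify
```

Since only Alice holds d, a verifying signature ties the message to her, which is also the basis for the nonrepudiation question on the slide.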
Message Digests Basic idea: Don’t mix authentication and secrecy. Instead, it should also be possible to send a message in the clear, but have it signed as well. Solution: take a message digest, and sign that: Recall: Message digests are computed using a hash function, which produces a fixed-length message from arbitrary-length data. 08 – Security/8.2Secure Channels
Secure Group Communication Design issue: How can you share secret information between multiple members without losing everything when one member turns bad? Confidentiality: Follow a simple (hard-to-scale) approach by maintaining a separate secret key between each pair of members. Replication: You also want to provide replication transparency ⇒ apply secret sharing: No process knows the entire secret; it can be revealed only through joint cooperation Assumptions: At most k out of N processes can produce an incorrect answer, and at most c ≤ k processes have been corrupted Note: We are dealing with a k-fault-tolerant process group. 08 – Security/8.2 Secure Channels
Secure Replicated Group (1/2) Let N = 5, c = 2 Each server Si gets to see each request and responds with ri Response is sent along with digest md(ri), and signed with private key Ki–. Signature is denoted as sig (Si ,ri) = Ki– (md(ri)). 08 – Security/8.2Secure Channels
Secure Replicated Group (2/2) Client uses a special decryption function D that computes a single digest d from three signatures: d = D(sig(S,r), sig(S',r'), sig(S'',r'')) If d = md(ri) for some ri, then ri is considered correct Also known as an (m,n)-threshold scheme (with m = c + 1, n = N) 08 – Security/8.2 Secure Channels
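The client-side acceptance rule can be sketched with plain digests instead of the threshold decryption function D (a simplification: D combines signature shares, whereas here each server's digest is checked individually and a response is accepted once more than c matching servers back it).

```python
# Sketch of client-side acceptance in a secure replicated group:
# accept response r only if at least c+1 servers back it with a
# digest that matches md(r).
import hashlib
from collections import Counter

def md(r):
    return hashlib.sha256(r.encode()).hexdigest()

def accept(responses, c):
    """responses: list of (r_i, digest_i) pairs from the N servers.
    Returns the value backed by at least c+1 servers, else None."""
    valid = [r for r, dig in responses if md(r) == dig]
    value, count = Counter(valid).most_common(1)[0] if valid else (None, 0)
    return value if count >= c + 1 else None

# N = 5 servers, c = 2 corrupted: one lies consistently, one garbles
# its digest; the three honest servers still carry the vote.
replies = [("ok", md("ok")), ("ok", md("ok")), ("ok", md("ok")),
           ("bad", md("bad")), ("bad", md("xx"))]
assert accept(replies, c=2) == "ok"
```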
General issues Firewalls Secure mobile code 08 – Security/8.3 Access Control
Authorization versus Authentication
Authentication: Verify the claim that a subject says it is S : verifying the identity of a subject Authorization: Determining whether a subject is permitted certain services from an object Note: authorization makes sense only if the requesting subject has been authenticated 08 – Security/8.3 Access Control
Access Control Matrix Essence: Maintain an access control matrix ACM in which entry ACM[S,O] contains the permissible operations that subject S can perform on object O Implementation (a): Each object maintains an access control list (ACL): ACM[*,O] describing the permissible operations per subject (or group of subjects) Implementation (b): Each subject S has a capability: ACM[S,*] describing the permissible operations per object (or category of objects) 08 – Security/8.3 Access Control
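The matrix and its two implementations map directly onto dictionaries; the subjects, objects, and rights below are arbitrary illustrative values.

```python
# The access control matrix as a sparse dict: (subject, object) -> rights.
ACM = {("alice", "file1"): {"read", "write"},
       ("bob",   "file1"): {"read"}}

# (a) ACL: the per-object column ACM[*, O], stored with the object.
def acl(obj):
    return {s: ops for (s, o), ops in ACM.items() if o == obj}

# (b) Capability list: the per-subject row ACM[S, *], carried by the subject.
def capabilities(subj):
    return {o: ops for (s, o), ops in ACM.items() if s == subj}

assert acl("file1") == {"alice": {"read", "write"}, "bob": {"read"}}
assert capabilities("bob") == {"file1": {"read"}}
```

Both views contain the same information; the design question is who stores and checks it, the object (ACL) or the subject (capability).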
Protection Domains Issue: ACLs or capability lists can be very large. Reduce information by means of protection domains: Set of (object, access rights) pairs Each pair is associated with a protection domain For each incoming request the reference monitor first looks up the appropriate protection domain Common implementation of protection domains: Groups: Users belong to a specific group; each group has associated access rights Roles: Don’t differentiate between users, but only the roles they can play. Your role is determined at login time. Role changes are allowed. 08 – Security/8.3 Access Control
Firewalls Essence: Sometimes it’s better to select service requests at the lowest level: network packets. Packets that do not fit certain requirements are simply removed from the channel Solution: Protect your company by a firewall: it implements access control Question: What do you think would be the biggest breach in firewalls? 08 – Security/8.3 Access Control
Secure Mobile Code Problem: Mobile code is great for balancing communication and computation, but it is hard to implement a general-purpose mechanism that allows different security policies for local-resource access. In addition, we may need to protect the mobile code (e.g., agents) against malicious hosts. 08 – Security/8.3 Access Control
Protecting an Agent Ajanta: Detect that an agent has been tampered with while it was on the move. Most important: append-only logs: Data can only be appended, not removed There is always an associated checksum. Initially, Cinit = K+owner(N), with N a nonce. Adding data X by server S: Cnew = K+owner(Cold, sig(S,X), S) Removing data from the log: K–owner(C) → Cprev, sig(S,X), S, allowing the owner to check the integrity of X 08 – Security/8.3 Access Control
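The append-only property can be sketched with a hash chain. Note the substitution: Ajanta chains checksums using the owner's public key, whereas this sketch uses SHA-256 chaining, which already makes undetected removal or modification of entries infeasible (though, unlike the public-key version, anyone who sees the log can extend it).

```python
# Hash-chain sketch of an append-only log: each checksum commits to the
# entire prefix, so tampering with any entry breaks verification.
import hashlib

def new_log(nonce):
    # C_init derived from a nonce, mirroring C_init = K+owner(N).
    return {"entries": [], "checksum": hashlib.sha256(nonce).hexdigest()}

def append(log, server, data):
    entry = (server, data)
    log["entries"].append(entry)
    # C_new = H(C_old, S, X): chain the new entry onto the old checksum.
    log["checksum"] = hashlib.sha256(
        (log["checksum"] + repr(entry)).encode()).hexdigest()

def verify(log, nonce):
    # Recompute the chain from scratch and compare with the stored checksum.
    check = hashlib.sha256(nonce).hexdigest()
    for entry in log["entries"]:
        check = hashlib.sha256((check + repr(entry)).encode()).hexdigest()
    return check == log["checksum"]

log = new_log(b"nonce")
append(log, "S1", "visited host S1")
assert verify(log, b"nonce")
log["entries"][0] = ("S1", "tampered")   # a malicious host edits the log
assert not verify(log, b"nonce")
```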
Protecting a Host (1/2) Simple solution: Enforce a (very strict) single policy, and implement that by means of a few simple mechanisms Sandbox model: Policy: Remote code is allowed access to only a pre-defined collection of resources and services. Mechanism: Check instructions for illegal memory access and service access Playground model: Same policy, but mechanism is to run code on separate “unprotected” machine. 08 – Security/8.3 Access Control
Protecting a Host (2/2) Observation: We need to be able to distinguish local from remote code before being able to do anything Refinement 1: We need to be able to assign a set of permissions to mobile code before its execution and check operations against those permissions at all times Refinement 2: We need to be able to assign different sets of permissions to different units of mobile code ⇒ authenticate mobile code (e.g., through signatures) Question: What would be a very simple policy to follow (Microsoft’s approach)? 08 – Security/8.3 Access Control
Key establishment and distribution Secure group management Authorization management 08 – Security/8.4 Security Management
Key Establishment: Diffie-Hellman
Observation: We can construct secret keys in a safe way without having to trust a third party (i.e. a KDC): Alice and Bob have to agree on two large numbers, n and g. Both numbers may be public. Alice chooses large number x, and keeps it to herself. Bob does the same, say y. 1: Alice sends (n, g, gx mod n) to Bob 2: Bob sends (gy mod n) to Alice 3: Alice computes KA,B = (gy mod n )x = gxy mod n 4: Bob computes KA,B = (gx mod n )y = gxy mod n 08 – Security/8.4 Security Management
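The four steps run directly with Python's modular exponentiation; the modulus below is a toy 32-bit prime for illustration (real deployments use groups of at least 2048 bits).

```python
# Diffie-Hellman key exchange. n and g are public; x and y never leave
# Alice and Bob respectively, yet both end up with g^(xy) mod n.
import secrets

n, g = 0xFFFFFFFB, 5                 # toy public modulus (prime) and generator
x = secrets.randbelow(n - 2) + 1     # Alice's secret exponent
y = secrets.randbelow(n - 2) + 1     # Bob's secret exponent

A = pow(g, x, n)                     # 1: Alice -> Bob: (n, g, g^x mod n)
B = pow(g, y, n)                     # 2: Bob -> Alice: g^y mod n

K_alice = pow(B, x, n)               # 3: (g^y)^x mod n
K_bob = pow(A, y, n)                 # 4: (g^x)^y mod n
assert K_alice == K_bob              # shared key g^(xy) mod n
```

An eavesdropper sees n, g, g^x, and g^y, but recovering x or y from them is the discrete-logarithm problem, which is what makes the scheme safe without a trusted third party.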
Key Distribution (1/2) Essence: If authentication is based on cryptographic protocols, and we need session keys to establish secure channels, who’s responsible for handing out keys? Secret keys: Alice and Bob will have to get a shared key. They can invent their own and use it for data exchange. Alternatively, they can trust a key distribution center (KDC) and ask it for a key. Public keys: Alice will need Bob’s public key to decrypt (signed) messages from Bob, or to send private messages to Bob. But she’ll have to be sure about actually having Bob’s public key, or she may be in big trouble. Use a trusted certification authority (CA) to hand out public keys. A public key is put in a certificate, signed by a CA. 08 – Security/8.4 Security Management
Key Distribution (2/2) 08 – Security/8.4 Security Management
Secure Group Management (1/2) Structure: Group uses a key pair (KG+, KG–) for communication with nongroup members. There is a separate shared secret key CKG for internal communication. Assume process P wants to join the group and contacts Q. 1: P generates a one-time reply pad RP, and a secret key KP,G . It sends a join request to Q, signed by itself (notation: [JR]P), along with a certificate containing its public key KP+ . 08 – Security/8.4 Security Management
Secure Group Management (2/2) 2: Q authenticates P and checks whether it can be allowed as a member. It returns the group key CKG, encrypted with the one-time pad, as well as the group’s private key, encrypted as CKG(KG–). 3: P authenticates Q and sends back KP,G(N), letting Q know that P has all the necessary keys. Question: Why didn’t we send KP+(CKG) instead of using RP? 08 – Security/8.4 Security Management
Authorization Management Issue: To avoid that each machine needs to know about all users, we use capabilities and attribute certificates to express the access rights that the holder has. In Amoeba, restricted access rights are encoded in a capability, along with data for an integrity check to protect against tampering: 08 – Security/8.4 Security Management
Delegation (1/2) Observation: A subject sometimes wants to delegate its privileges to an object O1, to allow that object to request services from another object O2 Example: A client tells the print server PS to fetch a file F from the file server FS to make a hard copy ⇒ the client delegates its read privileges on F to PS Nonsolution: Simply hand over your attribute certificate to a delegate (which may pass it on to the next one, etc.) Problem: To what extent can the object trust a certificate to have originated at the initiator of the service request, without forcing the initiator to sign every certificate? 08 – Security/8.4 Security Management
Delegation (2/2) Solution: Ensure that delegation proceeds through a secure channel, and let a delegate prove it got the certificate through such a path of channels originating at the initiator. 08 – Security/8.4 Security Management
Putting it all together: SESAME SMIB: Database holding shared secret keys, basic access rights, and so on AS: Authenticates a user, and returns a ticket PAS: Hands out attribute certificates KDS: Generates session keys for authenticated users Security Manager: Handles setting up and communicating over a secure channel PVF: Validates access rights contained in attribute certificates 08 – Security/8.4 Security Management
Distributed Systems Principles and Paradigms Chapter 09 Distributed Object-Based Systems 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems 00 – /
DCOM Globe 09 – Distributed Object-Based Systems/9.1 CORBA
CORBA: Common Object Request Broker Architecture Background: Developed by the Object Management Group (OMG) in response to industrial demands for object-based middleware Currently in version #2.4 with #3 (almost) done CORBA is a specification: different implementations of CORBA exist Very much the work of a committee: there are over 800 members of the OMG and many of them have a say in what CORBA should look like Essence: CORBA provides a simple distributed-object model, with specifications for many supporting services ⇒ it may be here to stay (for a couple of years) 09 – Distributed Object-Based Systems/9.1 CORBA
CORBA Overview (1/2) Object Request Broker (ORB): CORBA’s object broker that connects clients, objects, and services Proxy/Skeleton: Precompiled code that takes care of (un)marshaling invocations and results Dynamic Invocation/Skeleton Interface (DII/DSI): To allow clients to “construct” invocation requests at runtime instead of calling methods at a proxy, and having the server side “reconstruct” those requests into regular method invocations Object adapter: Server-side code that handles incoming invocation requests. 09 – Distributed Object-Based Systems/9.1 CORBA
298
09 – 4 Distributed Object-Based Systems/9.1 CORBA
CORBA Overview (2/2) Interface repository: Database containing interface definitions and which can be queried at runtime Implementation repository: Database containing the implementation (code, and possibly also state) of objects. Effectively: a server that can launch object servers. 09 – Distributed Object-Based Systems/9.1 CORBA
299
09 – 5 Distributed Object-Based Systems/9.1 CORBA
CORBA Object Model Essence: CORBA has a “traditional” remote-object model in which an object residing at an object server is remotely accessible through proxies Observation: All CORBA specifications are given by means of interface descriptions, expressed in an IDL. CORBA follows an interface-based approach to objects: Not the objects, but the interfaces are the really important entities An object may implement one or more interfaces Interface descriptions can be stored in an interface repository, and looked up at runtime Mappings from IDL to specific programming languages are part of the CORBA specification (languages include C, C++, Smalltalk, Cobol, Ada, and Java). 09 – Distributed Object-Based Systems/9.1 CORBA
300
09 – 6 Distributed Object-Based Systems/9.1 CORBA
CORBA Services
Collection: Facilities for grouping objects into lists, queues, sets, etc.
Query: Facilities for querying collections of objects in a declarative manner
Concurrency: Facilities to allow concurrent access to shared objects
Transaction: Flat and nested transactions on method calls over multiple objects
Event: Facilities for asynchronous communication through events
Notification: Advanced facilities for event-based asynchronous communication
Externalization: Facilities for marshaling and unmarshaling of objects
Life cycle: Facilities for creation, deletion, copying, and moving of objects
Licensing: Facilities for attaching a license to an object
Naming: Facilities for systemwide naming of objects
Property: Facilities for associating (attribute, value) pairs with objects
Trading: Facilities to publish and find the services an object has to offer
Persistence: Facilities for persistently storing objects
Relationship: Facilities for expressing relationships between objects
Security: Mechanisms for secure channels, authorization, and auditing
Time: Provides the current time within specified error margins
09 – Distributed Object-Based Systems/9.1 CORBA
301
09 – 7 Distributed Object-Based Systems/9.1 CORBA
Communication Models (1/2) Object invocations: CORBA distinguishes three different forms of direct invocations: Event communication: There are also additional facilities by means of event channels: 09 – Distributed Object-Based Systems/9.1 CORBA
302
09 – 8 Distributed Object-Based Systems/9.1 CORBA
Communication Models (2/2) Messaging facilities: reliable asynchronous and persistent method invocations: 09 – Distributed Object-Based Systems/9.1 CORBA
303
09 – 9 Distributed Object-Based Systems/9.1 CORBA
Processes Most aspects of processes in CORBA have been discussed in previous classes. What remains is the concept of interceptors: Request-level: Allows you to modify invocation semantics (e.g., multicasting) Message-level: Allows you to control message-passing between client and server (e.g., handle reliability and fragmentation) 09 – Distributed Object-Based Systems/9.1 CORBA
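A request-level interceptor can be sketched as a small Python fragment showing how an interceptor might turn one invocation into a multicast to replicas (the names are invented for this sketch; this is not a CORBA API):

```python
class MulticastInterceptor:
    """Request-level interceptor: fans one invocation out to several replicas,
    changing the invocation semantics without the client noticing."""
    def __init__(self, replicas):
        self.replicas = replicas

    def invoke(self, method, *args):
        # Forward the same request to every replica; hand back the first reply.
        replies = [getattr(r, method)(*args) for r in self.replicas]
        return replies[0]

class Account:
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount
        return self.balance

replicas = [Account(), Account(), Account()]
orb = MulticastInterceptor(replicas)
print(orb.invoke("deposit", 10))
# All replicas saw the invocation, although the client issued only one call.
```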
304
09 – 10 Distributed Object-Based Systems/9.1 CORBA
Naming Important: In CORBA, it is essential to distinguish specification-level and implementation-level object references Specification level: An object reference is considered to be the same as a proxy for the referenced object; having an object reference means you can directly invoke methods; there is no separate client-to-object binding phase Implementation level: When a client gets an object reference, the implementation ensures that, one way or the other, a proxy for the referenced object is placed in the client’s address space: ObjectReference ObjRef; ObjRef = bindTo(object O in server S at host H); Conclusion: Object references in CORBA used to be highly implementation dependent: different implementations of CORBA could normally not exchange their references. 09 – Distributed Object-Based Systems/9.1 CORBA
305
Interoperable Object References (1/2)
Observation: Recognizing that object references are implementation dependent, we need a separate referencing mechanism to cross ORB boundaries Solution: Object references passed from one ORB to another are transformed by the bridge through which they pass (different transformation schemes can be implemented) Observation: Passing an object reference refA from ORB A to ORB B circumventing the A-to-B bridge may be useless if ORB B doesn’t understand refA 09 – Distributed Object-Based Systems/9.1 CORBA
306
Interoperable Object References (2/2)
Observation: To allow all kinds of different systems to communicate, we standardize the reference that is passed between bridges: 09 – Distributed Object-Based Systems/9.1 CORBA
307
09 – 13 Distributed Object-Based Systems/9.1 CORBA
Naming Service Essence: CORBA’s naming service allows servers to associate a name with an object reference, and have clients subsequently bind to that object by resolving its name Observation: In most CORBA implementations, object references denote servers at specific hosts; naming makes it easier to relocate objects Observation: In the naming graph all nodes are objects; there are no restrictions on binding names to objects CORBA allows arbitrary naming graphs Question: How do you imagine cyclic name resolution stops? Observation: There is no single root; an initial context node is returned through a special call to the ORB. Also: the naming service can operate across different ORBs (interoperable naming service) 09 – Distributed Object-Based Systems/9.1 CORBA
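Since arbitrary naming graphs are allowed, a binding may (directly or via an alias) lead back into the graph and resolution could loop forever. One plausible answer to the question above is a hop limit, sketched here in a hypothetical Python model (node structure, alias encoding, and the `IOR:` string are all invented):

```python
class NamingContext:
    def __init__(self):
        self.bindings = {}

def resolve(root, path, max_hops=16):
    """Resolve a path in an arbitrary naming graph. Aliases may point back
    into the graph, so cycles are possible; a hop limit stops resolution."""
    node = root
    todo = list(path)
    hops = 0
    while todo:
        if hops >= max_hops:
            raise RuntimeError("too many hops: cyclic name resolution?")
        hops += 1
        name = todo.pop(0)
        target = node.bindings[name]
        if isinstance(target, tuple) and target[0] == "alias":
            node, todo = root, list(target[1]) + todo   # restart from the root
        else:
            node = target
    return node

root = NamingContext()
objs = NamingContext()
root.bindings["objects"] = objs
objs.bindings["printer"] = "IOR:printer"                 # invented reference
objs.bindings["loop"] = ("alias", ["objects", "loop"])   # cyclic binding

print(resolve(root, ["objects", "printer"]))
```

Resolving `["objects", "loop"]` keeps restarting from the root and aborts once the hop limit is reached, instead of spinning forever.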
308
09 – 14 Distributed Object-Based Systems/9.1 CORBA
Fault Tolerance Essence: Mask failures through replication, by putting objects into object groups. Object groups are transparent to clients: they appear as “normal” objects. This approach requires a separate type of object reference: the Interoperable Object Group Reference: Note: IOGRs have the same structure as IORs; the main difference is that they are used differently. In IORs an additional profile is used as an alternative; in IOGRs, it denotes another replica. 09 – Distributed Object-Based Systems/9.1 CORBA
309
09 – 15 Distributed Object-Based Systems/9.1 CORBA
Security Essence: Allow the client and object to be mostly unaware of all the security policies, except perhaps at binding time; the ORB does the rest. Specific policies are passed to the ORB as (local) objects and are invoked when necessary: Examples: Type of message protection, lists of trusted parties. 09 – Distributed Object-Based Systems/9.1 CORBA
310
09 – 16 Distributed Object-Based Systems/9.2 Distributed COM
DCOM: Distributed Component Object Model Microsoft’s solution to establishing inter-process communication, possibly across machine boundaries. Supports a primitive notion of distributed objects Evolved from early Windows versions to current NT-based systems (including Windows 2000) Comparable to CORBA’s object request broker 09 – Distributed Object-Based Systems/9.2 Distributed COM
311
09 – 17 Distributed Object-Based Systems/9.2 Distributed COM
DCOM Overview (1/2) Somewhat confused? DCOM is related to many things that have been introduced by Microsoft in the past couple of years: DCOM: Adds facilities to communicate across process and machine boundaries. 09 – Distributed Object-Based Systems/9.2 Distributed COM
312
09 – 18 Distributed Object-Based Systems/9.2 Distributed COM
DCOM Overview (2/2) SCM: Service Control Manager, responsible for activating objects (cf., to CORBA’s implementation repository). Proxy marshaler: handles the way that object references are passed between different machines 09 – Distributed Object-Based Systems/9.2 Distributed COM
313
09 – 19 Distributed Object-Based Systems/9.2 Distributed COM
COM Object Model An interface is a collection of semantically related operations Each interface is typed, and therefore has a globally unique interface identifier A client always requests an implementation of an interface: Locate a class that implements the interface Instantiate that class, i.e., create an object Throw the object away when the client is done 09 – Distributed Object-Based Systems/9.2 Distributed COM
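The client-side steps above can be sketched in Python (the registry, GUID, and class names are invented for this illustration; real COM goes through `CoCreateInstance` and reference counting):

```python
import uuid

# Hypothetical registry mapping interface IDs (GUIDs) to implementing classes.
IID_ICALC = uuid.UUID("11111111-2222-3333-4444-555555555555")

class Calc:
    """A class implementing the (invented) ICalc interface."""
    def add(self, a, b):
        return a + b

REGISTRY = {IID_ICALC: Calc}

def create_instance(iid):
    """Locate a class implementing the requested interface, instantiate it."""
    cls = REGISTRY[iid]          # step 1: locate a class for this interface
    return cls()                 # step 2: instantiate it, i.e. create an object

obj = create_instance(IID_ICALC)
print(obj.add(2, 3))             # use the object...
del obj                          # ...and throw it away when the client is done
```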
314
09 – 20 Distributed Object-Based Systems/9.2 Distributed COM
DCOM Services Note: COM+ is effectively COM plus services that were previously available in an ad-hoc fashion 09 – Distributed Object-Based Systems/9.2 Distributed COM
315
09 – 21 Distributed Object-Based Systems/9.2 Distributed COM
Communication Models Object invocations: Synchronous remote-method calls with at-most-once semantics. Asynchronous invocations are supported through a polling model, as in CORBA. Event communication: Similar to CORBA’s push-style model: Messaging: Completely analogous to CORBA messaging. 09 – Distributed Object-Based Systems/9.2 Distributed COM
316
09 – 22 Distributed Object-Based Systems/9.2 Distributed COM
Communication Models Observation: Objects are referenced by means of a local interface pointer. The question is how such pointers can be passed between different machines: Question: Where does the proxy marshaler come from? Do we always need it? 09 – Distributed Object-Based Systems/9.2 Distributed COM
317
09 – 23 Distributed Object-Based Systems/9.2 Distributed COM
Naming: Monikers Observation: DCOM can handle objects only as temporary instances of a class. To accommodate objects that can outlive their client, something else is needed. Moniker: A hack to support real objects A moniker associates data (e.g., a file) with an application or program Monikers can be stored A moniker can contain a binding protocol, specifying how the associated program should be “launched” with respect to the data. 09 – Distributed Object-Based Systems/9.2 Distributed COM
318
09 – 24 Distributed Object-Based Systems/9.2 Distributed COM
Active Directory Essence: a worldwide distributed directory service, but one that does not provide location transparency. Basics: Associate a directory service (called domain controller) with each domain; look up the controller using a normal DNS query: Note: Controller is implemented as an LDAP server 09 – Distributed Object-Based Systems/9.2 Distributed COM
319
09 – 25 Distributed Object-Based Systems/9.2 Distributed COM
Fault Tolerance Automatic transactions: Each class object (from which objects are created), has a transaction attribute that determines how its objects behave as part of a transaction: Note: Transactions are essentially executed at the level of a method invocation. 09 – Distributed Object-Based Systems/9.2 Distributed COM
320
09 – 26 Distributed Object-Based Systems/9.2 Distributed COM
Security (1/2) Declarative security: Register per object what the system should enforce with respect to authentication. Authentication is associated with users and user groups. There are different authentication levels: 09 – Distributed Object-Based Systems/9.2 Distributed COM
321
09 – 27 Distributed Object-Based Systems/9.2 Distributed COM
Security (2/2) Delegation: A server can impersonate a client depending on a level: Note: There is also support for programmatic security by which security levels can be set by an application, as well as the required security services (see book). 09 – Distributed Object-Based Systems/9.2 Distributed COM
322
09 – 28 Distributed Object-Based Systems/Globe
Experimental wide-area system currently being developed at Vrije Universiteit Unique for its focus on scalability by means of truly distributed objects Prototype version up and running across multiple machines distributed in NL and across Europe and the US. 09 – Distributed Object-Based Systems/Globe
323
09 – 29 Distributed Object-Based Systems/Globe
Object Model (1/3) Essence: A Globe object is a physically distributed shared object: the object’s state may be physically distributed across several machines Local object: A nondistributed object residing in a single address space, often representing a distributed shared object Contact point: A point where clients can contact the distributed object; each contact point is described through a contact address 09 – Distributed Object-Based Systems/Globe
324
09 – 30 Distributed Object-Based Systems/Globe
Object Model (2/3) Observation: Globe attempts to separate functionality from distribution by distinguishing different local subobjects: Semantics subobject: Contains the methods that implement the functionality of the distributed shared object Communication subobject: Provides a relatively simple, network-independent interface for communication between local objects 09 – Distributed Object-Based Systems/Globe
325
09 – 31 Distributed Object-Based Systems/Globe
Object Model (3/3) Replication subobject: Contains the implementation of an object-specific consistency protocol that controls exactly when a method on the semantics subobject may be invoked Control subobject: Connects the user-defined interfaces of the semantics subobject to the generic, predefined interfaces of the replication subobject 09 – Distributed Object-Based Systems/Globe
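How the subobjects cooperate on an invocation can be sketched in a minimal Python model (all names are invented, and the replication policy is a trivial always-allow stand-in for a real consistency protocol):

```python
class SemanticsSubobject:
    """Implements the actual functionality of the shared object."""
    def __init__(self):
        self.state = 0
    def increment(self):
        self.state += 1
        return self.state

class ReplicationSubobject:
    """Object-specific consistency protocol: decides when a method on the
    semantics subobject may run. This stand-in always allows it; a real
    protocol might first contact other replicas."""
    def start(self, method_name):
        return True

class ControlSubobject:
    """Glues the user-defined interface to the generic replication interface."""
    def __init__(self, semantics, replication):
        self.semantics = semantics
        self.replication = replication
    def invoke(self, method_name):
        if self.replication.start(method_name):
            return getattr(self.semantics, method_name)()
        raise RuntimeError("invocation not permitted by consistency protocol")

local_object = ControlSubobject(SemanticsSubobject(), ReplicationSubobject())
print(local_object.invoke("increment"))
```

Because the control subobject only talks to the generic replication interface, the consistency protocol can be swapped per object without touching the semantics code.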
326
09 – 32 Distributed Object-Based Systems/Globe
Client-to-Object Binding Observation: Globe’s contact addresses correspond to CORBA’s object references 09 – Distributed Object-Based Systems/Globe
327
09 – 33 Distributed Object-Based Systems/Globe
Globe Services 09 – Distributed Object-Based Systems/Globe
328
09 – 34 Distributed Object-Based Systems/Globe
Object References Essence: Globe uses location-independent object handles, which are to be resolved to contact addresses (which describe where and how an object can be contacted): Associated with a contact point of the distributed object Specifies (for example) a transport-level network address to which the object will listen Contains an implementation handle, specifying exactly what the client should implement if it wants to communicate through the contact point: ftp://ftp.globe.org/pub/common/ip/tcp/… …master-slave/standard/slave.jar “slave/master-slave/tcp/ip” Observation: Objects in Globe have their own object-specific implementations; there is no “standard” proxy that is implemented for all clients 09 – Distributed Object-Based Systems/Globe
329
09 – 35 Distributed Object-Based Systems/Globe
Naming Objects Observation: Globe separates naming from locating objects (as described in Chapter 04). The current naming service is based on DNS, using TXT records for storing object handles Observation: The location service is implemented as a generic, hierarchical tree, similar to the approach explained in Chapter 04. 09 – Distributed Object-Based Systems/Globe
330
09 – 36 Distributed Object-Based Systems/Globe
Caching and Replication Observation: Here’s where Globe differs from many other systems: The organization of a local object is such that replication is inherently part of each distributed shared object All replication subobjects have the same interface: This approach allows implementing any object-specific caching/replication strategy 09 – Distributed Object-Based Systems/Globe
331
09 – 37 Distributed Object-Based Systems/Globe
Security Essence: Additional security subobject checks for authorized communication, invocation, and parameter values. Globe can be integrated with existing security services: 09 – Distributed Object-Based Systems/Globe
332
09 – 38 Distributed Object-Based Systems/Globe
Comparison 09 – Distributed Object-Based Systems/Globe
333
Principles and Paradigms Distributed File Systems
Distributed Systems Principles and Paradigms Chapter 10 Distributed File Systems 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems 00 – /
334
10 – 1 Distributed File Systems/
Sun NFS Coda 10 – Distributed File Systems/
335
10 – 2 Distributed File Systems/10.1 NFS
Sun NFS Sun Network File System: Now at version 3; version 4 is coming up. Basic model: Remote file service: try to make a file system transparently available to remote clients. Follows the remote access model (a) instead of the upload/download model (b): 10 – Distributed File Systems/10.1 NFS
336
10 – 3 Distributed File Systems/10.1 NFS
NFS Architecture NFS is implemented using the Virtual File System abstraction, which is now used for lots of different operating systems: Essence: VFS provides a standard file system interface, and allows hiding the difference between accessing a local or a remote file system. Question: Is NFS actually a file system? 10 – Distributed File Systems/10.1 NFS
337
10 – 4 Distributed File Systems/10.1 NFS
NFS File Operations Question: Anything unusual between v3 and v4? 10 – Distributed File Systems/10.1 NFS
338
10 – 5 Distributed File Systems/10.1 NFS
Communication in NFS Essence: All communication is based on the (best-effort) Open Network Computing RPC (ONC RPC). Version 4 now also supports compound procedures: (a) Normal RPC (b) Compound RPC: the first failure breaks execution of the rest of the RPC Question: What’s the use of compound RPCs? 10 – Distributed File Systems/10.1 NFS
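The compound-procedure behavior can be modeled in a few lines of Python (a sketch, not ONC RPC code): a list of operations is executed in order, and the first failure stops the rest.

```python
def compound(operations):
    """Execute a list of operations in order; the first failure stops the
    rest, and the results so far plus the error go back to the client."""
    results = []
    for op in operations:
        try:
            results.append(("ok", op()))
        except Exception as err:
            results.append(("failed", str(err)))
            break                   # remaining operations are not executed
    return results

fs = {"/a.txt": "hello"}            # toy stand-in for the server's file store

ops = [
    lambda: fs["/a.txt"],           # LOOKUP-like: succeeds
    lambda: fs["/missing"],         # fails with KeyError
    lambda: fs["/a.txt"],           # never executed
]
print(compound(ops))
```

One answer to the question above: a client can bundle, say, lookup + open + read into a single round trip instead of paying network latency three times.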
339
10 – 6 Distributed File Systems/10.1 NFS
Naming in NFS (1/2) Essence: NFS provides support for mounting remote file systems (and even directories) into a client’s local name space: Watch it: Different clients may have different local name spaces. This may make file sharing extremely difficult (Why?). Question: What are the solutions to this problem? 10 – Distributed File Systems/10.1 NFS
340
10 – 7 Distributed File Systems/10.1 NFS
Naming in NFS (2/2) Note: A server cannot export an imported directory. The client must mount the server-imported directory: 10 – Distributed File Systems/10.1 NFS
341
10 – 8 Distributed File Systems/10.1 NFS
Automounting in NFS Problem: To share files, we partly standardize local name spaces and mount shared directories. Mounting very large directories (e.g., all subdirectories in home/users) takes a lot of time (Why?). Solution: Mount on demand — automounting Question: What’s the main drawback of having the automounter in the loop? 10 – Distributed File Systems/10.1 NFS
342
10 – 9 Distributed File Systems/10.1 NFS
File Sharing Semantics (1/2) Problem: When dealing with distributed file systems, we need to take into account the ordering of concurrent read/write operations, and expected semantics (=consistency). 10 – Distributed File Systems/10.1 NFS
343
10 – 10 Distributed File Systems/10.1 NFS
File Sharing Semantics (2/2)
UNIX semantics: a read operation returns the effect of the last write operation; can only be implemented for remote access models in which there is only a single copy of the file
Transaction semantics: the file system supports transactions on a single file; the issue is how to allow concurrent access to a physically distributed file
Session semantics: the effects of read and write operations are visible only to the client that has opened (a local copy of) the file; updates become visible only when the file is closed (only one client may actually win)
10 – Distributed File Systems/10.1 NFS
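Session semantics can be sketched as follows (a hypothetical Python model, not NFS code): each client works on a private copy obtained at open, and the last close wins.

```python
class FileServer:
    def __init__(self):
        self.files = {"notes.txt": "v0"}

class Session:
    """Session semantics: a client works on a private local copy;
    its writes become visible to others only when the file is closed."""
    def __init__(self, server, name):
        self.server, self.name = server, name
        self.local = server.files[name]            # fetch a copy on open
    def write(self, data):
        self.local = data                          # invisible to other clients
    def close(self):
        self.server.files[self.name] = self.local  # last close wins

server = FileServer()
a = Session(server, "notes.txt")
b = Session(server, "notes.txt")
a.write("A's version")
b.write("B's version")
print(server.files["notes.txt"])   # still "v0": neither session closed yet
a.close()
b.close()
print(server.files["notes.txt"])   # "B's version": the last close won
```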
344
10 – 11 Distributed File Systems/10.1 NFS
File Locking in NFS Observation: It could have been simple, but it isn’t. NFS supports an explicit locking protocol (stateful), but also an implicit share reservation approach: Question: What’s the use of these share reservations? 10 – Distributed File Systems/10.1 NFS
345
10 – 12 Distributed File Systems/10.1 NFS
Caching & Replication Essence: Clients are on their own. Open delegation: Server will explicitly permit a client machine to handle local operations from other clients on that machine. Good for performance. Does require that the server can take over when necessary: Question: Would this scheme fit into v3? Question: What kind of file access model are we dealing with? 10 – Distributed File Systems/10.1 NFS
346
10 – 13 Distributed File Systems/10.1 NFS
Fault Tolerance Important: Until v4, fault tolerance was easy due to the stateless servers. Now, problems come from the use of an unreliable RPC mechanism, but also from stateful servers that have delegated matters to clients. RPC: Cannot detect duplicates. Solution: use a duplicate-request cache: Locking/Open delegation: Essentially, a recovered server offers clients a grace period to reclaim locks. When the period is over, the server starts its normal local manager function again. 10 – Distributed File Systems/10.1 NFS
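The duplicate-request cache can be sketched in Python (the XID-keyed cache is the general idea; the names and the toy operation are invented):

```python
class RPCServer:
    """Caches replies by transaction ID (XID) so that a retransmitted request
    is answered from the cache instead of being executed a second time."""
    def __init__(self):
        self.reply_cache = {}
        self.counter = 0

    def handle(self, xid, op):
        if xid in self.reply_cache:          # duplicate: replay the old reply
            return self.reply_cache[xid]
        result = op(self)                    # execute the (non-idempotent) op
        self.reply_cache[xid] = result
        return result

def create_file(server):
    server.counter += 1                      # non-idempotent side effect
    return f"created #{server.counter}"

s = RPCServer()
print(s.handle(42, create_file))   # executed once
print(s.handle(42, create_file))   # same XID: same reply, no re-execution
```

Without the cache, the retransmission would create the file twice; with it, the client gets a consistent reply either way.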
347
10 – 14 Distributed File Systems/10.1 NFS
Security Essence: Set up a secure RPC channel between client and server: Secure NFS: Use Diffie-Hellman key exchange to set up a secure channel. However, it uses only 192-bit keys, which have been shown to be easy to break. RPCSEC GSS: A standard interface that allows integration with existing security services: 10 – Distributed File Systems/10.1 NFS
348
10 – 15 Distributed File Systems/10.2 Coda
Coda File System Developed in the 90s as a descendant of the Andrew File System (CMU) Now shipped with Linux distributions (after 10 years!) Emphasis: support for mobile computing, in particular disconnected operation. 10 – Distributed File Systems/10.2 Coda
349
10 – 16 Distributed File Systems/10.2 Coda
Coda Architecture Note: The core of the client machine is the Venus process. Note that most stuff is at user level. 10 – Distributed File Systems/10.2 Coda
350
10 – 17 Distributed File Systems/10.2 Coda
Communication in Coda (1/2) Essence: All client-server communication (and server-server communication) is handled by means of a reliable RPC subsystem. Coda RPC supports side effects: Note: side effects allow for a separate protocol to handle, e.g., multimedia streams. 10 – Distributed File Systems/10.2 Coda
351
10 – 18 Distributed File Systems/10.2 Coda
Communication in Coda (2/2) Issue: Coda servers allow clients to cache whole files. Modifications by other clients are signaled through invalidation messages, so there is a need for multicast RPC: (a) Sequential RPCs (b) Multicast RPCs Question: Why do multicast RPCs really help? 10 – Distributed File Systems/10.2 Coda
352
10 – 19 Distributed File Systems/10.2 Coda
Naming in Coda Essence: Similar remote mounting mechanism as in NFS, except that there is a shared name space between all clients: 10 – Distributed File Systems/10.2 Coda
353
10 – 20 Distributed File Systems/10.2 Coda
File Handles in Coda Background: Coda assumes that files may be replicated between servers. The issue becomes how to track a file in a location-transparent way: Files are contained in a volume (cf. a UNIX file system on disk) Volumes have a Replicated Volume Identifier (RVID) Volumes may be replicated; a physical volume has a VID 10 – Distributed File Systems/10.2 Coda
354
10 – 21 Distributed File Systems/10.2 Coda
File Sharing Semantics in Coda Essence: Coda assumes transactional semantics, but without the full-fledged capabilities of real transactions. Note: Transactional issues reappear in the form of “this ordering could have taken place.” 10 – Distributed File Systems/10.2 Coda
355
10 – 22 Distributed File Systems/10.2 Coda
Caching in Coda Essence: Combined with the transactional semantics, we obtain flexibility when it comes to letting clients operate on local copies: Note: A writer can continue to work on its local copy; a reader will have to get a fresh copy on the next open. Question: Would it be OK if the reader continued to use its own local copy? 10 – Distributed File Systems/10.2 Coda
356
10 – 23 Distributed File Systems/10.2 Coda
Server Replication in Coda (1/2) Essence: Coda uses ROWA (read one, write all) for server replication: Files are grouped into volumes (cf. a traditional UNIX file system) The collection of servers replicating the same volume forms that volume’s Volume Storage Group (VSG) Writes are propagated to a file’s VSG Reads are done from one server in a file’s VSG Problem: what to do when the VSG partitions and the partition is later healed? 10 – Distributed File Systems/10.2 Coda
357
10 – 24 Distributed File Systems/10.2 Coda
Server Replication in Coda (2/2) Solution: Detect inconsistencies using version vectors: CVVi(f)[j] = k means that server Si knows that server Sj has seen version k of file f. When a client reads file f from server Si, it receives CVVi(f). Updates are multicast to all reachable servers (the client’s accessible VSG, or AVSG), which increment their CVVi(f)[i]. When the partition is restored, comparison of version vectors will allow detection of conflicts and possible reconciliation. Note: the client informs a server about the servers in the AVSG where the update has also taken place. 10 – Distributed File Systems/10.2 Coda
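Version-vector comparison translates directly into code. This sketch is a generic element-wise comparison (vectors shown as plain lists of counters, invented for illustration); the "conflict" outcome corresponds to concurrent updates made in different partitions:

```python
def compare(cvv1, cvv2):
    """Compare two version vectors element-wise. Returns 'equal', 'dominates',
    'dominated', or 'conflict' (neither vector covers the other)."""
    ge = all(a >= b for a, b in zip(cvv1, cvv2))
    le = all(a <= b for a, b in zip(cvv1, cvv2))
    if ge and le:
        return "equal"
    if ge:
        return "dominates"
    if le:
        return "dominated"
    return "conflict"

# Two servers each updated file f while partitioned from one another:
print(compare([2, 1], [1, 2]))   # neither dominates: conflict to reconcile
print(compare([2, 2], [1, 2]))   # the first server has seen every update
```

When one vector dominates, the lagging replica can simply be brought up to date; only the "conflict" case needs reconciliation (possibly by the user).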
358
10 – 25 Distributed File Systems/10.2 Coda
Fault Tolerance Note: Coda achieves high availability through client-side caching and server replication Disconnected operation: When a client is no longer connected to one of the servers, it may continue with the copies of files that it has cached. Requires that the cache is properly filled (hoarding). Compute a priority for each file Bring the user’s cache into equilibrium (hoard walk): There is no uncached file with higher priority than a cached file The cache is full or no uncached file has nonzero priority Each cached file is a copy of a file maintained by the client’s AVSG Note: Disconnected operation works best when there is hardly any write-sharing. 10 – Distributed File Systems/10.2 Coda
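A hoard walk that establishes the first equilibrium condition can be sketched in Python (file names and priorities are invented for illustration; a real cache manager also tracks sizes and keeps the copies consistent with the AVSG):

```python
def hoard_walk(priorities, cache_size):
    """Bring the cache into equilibrium: keep the cache_size highest-priority
    files, so that no uncached file outranks any cached one."""
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    return set(ranked[:cache_size])

prio = {"mail/inbox": 90, "thesis.tex": 100, "tmp/scratch": 0, "todo.txt": 50}
cache = hoard_walk(prio, cache_size=2)
print(cache)   # the two highest-priority files
```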
359
10 – 26 Distributed File Systems/10.2 Coda
Security Essence: All communication is based on a secure RPC mechanism that uses secret keys. When logging into the system, a client receives: A clear token CT from an AS (containing a generated shared secret key KS). CT has time-limited validity. A secret token ST = Kvice([CT]Kvice), which is an encrypted and cryptographically sealed version of CT. 10 – Distributed File Systems/10.2 Coda
360
Principles and Paradigms Distributed Document-Based Systems
Distributed Systems Principles and Paradigms Chapter 11 Distributed Document-Based Systems 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems 00 – /
361
Distributed Document-Based Systems
World Wide Web Lotus Notes 11 – Distributed Document-Based Systems/
362
11 – 2 Distributed Document-Based Systems/11.1 World Wide Web
WWW: Overview Essence: The WWW is a huge client-server system with millions of servers; each server hosting thousands of hyperlinked documents: Documents are generally represented in text (plain text, HTML, XML) Alternative types: images, audio, video, but also applications (PDF, PS) Documents may contain scripts that are executed by the client-side software 11 – 2 Distributed Document-Based Systems/11.1 World Wide Web
363
11 – 3 Distributed Document-Based Systems/11.1 World Wide Web
Extensions to Basic Model Issue: Simple documents are not enough – we need a whole range of mechanisms to get information to a client 11 – 3 Distributed Document-Based Systems/11.1 World Wide Web
364
11 – 4 Distributed Document-Based Systems/11.1 World Wide Web
Communication (1/2) Essence: Communication in the Web is generally based on HTTP; a relatively simple client-server transfer protocol having the following request messages: 11 – 4 Distributed Document-Based Systems/11.1 World Wide Web
365
11 – 5 Distributed Document-Based Systems/11.1 World Wide Web
Communication (2/2) 11 – 5 Distributed Document-Based Systems/11.1 World Wide Web
366
11 – 6 Distributed Document-Based Systems/11.1 World Wide Web
WWW Servers Important: The majority of Web servers are configured Apache servers, which break down the handling of each HTTP request into eight phases. This approach allows flexible configuration of servers.
1. Resolving the document reference to a local file name
2. Client authentication
3. Client access control
4. Request access control
5. MIME type determination of the response
6. General phase for handling leftovers
7. Transmission of the response
8. Logging data on the processing of the request
11 – 6 Distributed Document-Based Systems/11.1 World Wide Web
367
11 – 7 Distributed Document-Based Systems/11.1 World Wide Web
Server Clusters (1/2) Essence: To improve performance and availability, WWW servers are often clustered in a way that is transparent to clients: Problem: The front end may easily get overloaded, so that special measures need to be taken. Transport-layer switching: Front end simply passes the TCP request to one of the servers, taking some performance metric into account. Content-aware distribution: Front end reads the content of the HTTP request and then selects the best server. 11 – 7 Distributed Document-Based Systems/11.1 World Wide Web
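The content-aware strategy can be sketched in Python: the front end inspects the requested path and routes the request to the server specialized for that kind of document (server names and prefixes are invented for this sketch):

```python
def content_aware_switch(servers_by_prefix, request):
    """Front end inspects the HTTP request and picks the server that is
    specialized for (and therefore likely caches) this kind of document.
    A transport-layer switch could not do this: it never sees the path."""
    for prefix, server in servers_by_prefix.items():
        if request["path"].startswith(prefix):
            return server
    return servers_by_prefix["/"]        # generic fallback server

servers = {"/images/": "server-img", "/video/": "server-vid", "/": "server-web"}
print(content_aware_switch(servers, {"path": "/images/logo.png"}))
print(content_aware_switch(servers, {"path": "/index.html"}))
```

Because requests for the same kind of content keep landing on the same server, that server's cache and disk layout can be tuned for it, which a purely TCP-level switch cannot achieve.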
368
11 – 8 Distributed Document-Based Systems/11.1 World Wide Web
Server Clusters (2/2) Question: Why can content-aware distribution be so much better? 11 – 8 Distributed Document-Based Systems/11.1 World Wide Web
369
11 – 9 Distributed Document-Based Systems/11.1 World Wide Web
Naming: URL URL: Uniform Resource Locator tells how and where to access a resource. Examples: 11 – 9 Distributed Document-Based Systems/11.1 World Wide Web
370
11 – 10 Distributed Document-Based Systems/11.1 World Wide Web
Synchronization: WebDAV Problem: There is a growing need for collaborative authoring of Web documents, but bare-bones HTTP can’t help here. Solution: Web Distributed Authoring and Versioning (WebDAV). Supports exclusive and shared write locks, which operate on entire documents A lock is passed by means of a lock token; the server registers the client(s) holding the lock Clients modify the document locally and post it back to the server along with the lock token Note: There is no specific support for crashed clients holding a lock. 11 – 10 Distributed Document-Based Systems/11.1 World Wide Web
371
11 – 11 Distributed Document-Based Systems/11.1 World Wide Web
Web Proxy Caching Basic idea: Sites install a separate proxy server that handles all outgoing requests. Proxies subsequently cache incoming documents. Cache-consistency protocols: Always verify validity by contacting server Age-based consistency: Texpire = α·(Tcached – Tlast_modified) + Tcached Cooperative caching, by which you first check your neighbors on a cache miss: 11 – 11 Distributed Document-Based Systems/11.1 World Wide Web
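The age-based expiry formula translates directly into code; a quick Python check with invented timestamps (the intuition: a document that has not changed for a long time is expected to stay valid longer):

```python
def expire_time(t_cached, t_last_modified, alpha=0.2):
    """Age-based consistency:
    T_expire = alpha * (T_cached - T_last_modified) + T_cached"""
    return alpha * (t_cached - t_last_modified) + t_cached

# Document last modified at t=60, fetched into the cache at t=100:
print(expire_time(100, 60))   # 0.2 * 40 + 100 = 108.0
# An older (more stable) document gets a longer expiry:
print(expire_time(100, 0))    # 0.2 * 100 + 100 = 120.0
```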
372
11 – 12 Distributed Document-Based Systems/11.1 World Wide Web
Server Replication Content Delivery Network: CDNs act as Web hosting services to replicate documents across the Internet providing their customers guarantees on high availability and performance (example: Akamai). Question: How would consistency be maintained in this system? 11 – 12 Distributed Document-Based Systems/11.1 World Wide Web
373
11 – 13 Distributed Document-Based Systems/11.1 World Wide Web
Security: TLS (SSL) Transport Layer Security: Modern version of the Secure Socket Layer (SSL), which “sits” between the transport layer and application protocols. Relatively simple protocol that can support mutual authentication using certificates: 11 – 13 Distributed Document-Based Systems/11.1 World Wide Web
374
11 – 14 Distributed Document-Based Systems/11.2 Lotus Notes
Lotus Notes: Overview Basics: All documents take the form of notes, which are collected in databases. A note is essentially a list of items. 11 – Distributed Document-Based Systems/11.2 Lotus Notes
375
11 – 15 Distributed Document-Based Systems/11.2 Lotus Notes
Domino Server Essence: A straightforward server design, in which a main server controls various server tasks, spawned as separate processes running on top of NOS: 11 – Distributed Document-Based Systems/11.2 Lotus Notes
376
11 – 16 Distributed Document-Based Systems/11.2 Lotus Notes
Server Clusters Essence: Simple approach – client contacts a known server and gets a list of servers in that cluster, along with a selection of the currently least-loaded one. Question: What happens if the initial server is too busy or down? 11 – Distributed Document-Based Systems/11.2 Lotus Notes
Naming Issue: Lotus is database oriented, and is therefore largely tailored to supporting directory services (and searches) rather than plain name resolution (as in traditional naming services). There is support for URLs: 11 – Distributed Document-Based Systems/11.2 Lotus Notes
Replication Connection documents: Special notes describing exactly when, how, and what to replicate. Servers have replication tasks that are responsible for carrying out replication schemes: Note: This scheme comes very close to the epidemic protocols from Chapter 6. To remove notes, deletion stubs are used, similar to death certificates in epidemic protocols. 11 – Distributed Document-Based Systems/11.2 Lotus Notes
Conflict Resolution (1/2) Problem: Notes allows concurrent modifications to replicated notes, but follows an optimistic approach (assuming that write sharing rarely occurs). Here’s where originator IDs (OIDs) come in (= UNID + sequence number & timestamp). Solution: Conflicts are detected by comparing OIDs: if they differ while their UNID is the same, we may have a potential conflict. Updates (per copy) are recorded in history lists. When an item is modified, the note’s sequence number is incremented and credited to the item. If one list is a subset of the other, update to the longer list. If the two lists are the same up to sequence number k, merge the copies only if the modifications took place on different items. 11 – Distributed Document-Based Systems/11.2 Lotus Notes
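The comparison of per-copy history lists can be sketched as follows. The function name and return values are hypothetical, not the actual Notes implementation:

```python
def resolve(history_a, history_b):
    """Each history is a list of (sequence_number, item) updates recorded
    for one copy of a note. Decide how to reconcile the two copies."""
    # One list is a prefix of the other: simply take the longer one.
    if history_a == history_b[:len(history_a)]:
        return "take_b"
    if history_b == history_a[:len(history_b)]:
        return "take_a"
    # Find the length k of the common prefix.
    k = 0
    while (k < min(len(history_a), len(history_b))
           and history_a[k] == history_b[k]):
        k += 1
    # Merge only if the diverging updates touched different items.
    items_a = {item for _, item in history_a[k:]}
    items_b = {item for _, item in history_b[k:]}
    if items_a.isdisjoint(items_b):
        return "merge"
    return "conflict"  # nonresolvable: declare a winner, let users sort it out
```

For example, two copies that diverge after update 1 but modified different items can be merged, while concurrent updates to the same item yield a conflict.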
Conflict Resolution (2/2) All other cases: There is a nonresolvable conflict; declare one the winner and let the users solve it. 11 – Distributed Document-Based Systems/11.2 Lotus Notes
Security Essence: Notes uses public-key cryptography for setting up secure channels. Validation of public keys becomes crucial. Example: Alice works in the CS department of Franeker University (FU); Bob in the EE department. They share the public key for FU. Finally: Having databases around, Lotus Notes has extensive access control mechanisms. See the book and references for details. 11 – Distributed Document-Based Systems/11.2 Lotus Notes
Distributed Systems: Principles and Paradigms Chapter 13 Distributed Coordination-Based Systems 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency and Replication 07 Fault Tolerance 08 Security 09 Distributed Object-Based Systems 10 Distributed File Systems 11 Distributed Document-Based Systems 12 Distributed Coordination-Based Systems
Distributed Coordination-Based Systems
Coordination models TIB/Rendezvous Jini 12 – Distributed Coordination-Based Systems/
Essence: We are trying to separate computation from coordination; coordination deals with all aspects of communication between processes, as well as their cooperation. Make a distinction between: Temporal coupling: Are cooperating/communicating processes alive at the same time? Referential coupling: Do cooperating/communicating processes know each other explicitly? 12 – Distributed Coordination-Based Systems/12.1 Coordination Models
TIB/Rendezvous: Overview Coordination model: makes use of subject-based addressing, leading to what is known as a publish-subscribe architecture. Receiving a message on subject X is possible only if the receiver has subscribed to X. Publishing a message on subject X means that the message is sent to all (currently running) subscribers to X. 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
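A subject-based publish-subscribe core can be sketched in a few lines. `RendezvousBus` and its methods are illustrative names, not the TIB/RV API:

```python
from collections import defaultdict

class RendezvousBus:
    """Minimal subject-based publish-subscribe bus (illustrative sketch)."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # subject -> list of callbacks

    def subscribe(self, subject, callback):
        self._subscribers[subject].append(callback)

    def publish(self, subject, message):
        # Only currently registered subscribers to this subject receive it:
        # publisher and subscribers are referentially uncoupled, but remain
        # temporally coupled (a later subscriber misses this message).
        for callback in self._subscribers.get(subject, []):
            callback(message)
```

Note how the publisher never names its receivers: delivery is decided entirely by the subject of the message.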
Overall Architecture Essence: TIB/RV uses multicasting to forward messages to subscribers. To cross large-scale networks, it effectively builds an overlay network with proprietary multicast routers: 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
Communication: Events (1/2) Events: Publish-subscribe systems are ideally supported by means of events: you are notified when someone publishes a message that is of interest to you. Listener event: local object that registers a callback for a specific subject. 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
Communication: Events (2/2) Event scheduling: Events for the same listener event are handled one after the other; they may also be lost/ignored if the listener event is destroyed at the “wrong” time: 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
Naming Essence: Names are important as they form the “address” of a message. Filtering facilities ensure that the right messages reach their subscribers: Filtering: using special wildcards: 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
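The wildcard filtering can be illustrated with dot-separated subject names, where `*` matches exactly one element and `>` matches the rest of the subject. This syntax is assumed from TIB/RV-style subject naming; the helper itself is illustrative:

```python
def subject_matches(pattern, subject):
    """Match a dot-separated subject against a pattern with wildcards:
    '*' matches one element, '>' matches the whole remainder (sketch)."""
    p_parts = pattern.split(".")
    s_parts = subject.split(".")
    for i, p in enumerate(p_parts):
        if p == ">":
            return True          # matches everything from here on
        if i >= len(s_parts):
            return False         # subject ran out of elements
        if p != "*" and p != s_parts[i]:
            return False         # literal element mismatch
    return len(p_parts) == len(s_parts)
```

So `news.*.headlines` matches `news.uk.headlines` but not `news.uk.sports.headlines`, while `news.>` matches both.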
Transactional Messaging Essence: Ensure that the messages sent by a single process are delivered only if the sender commits: store published messages until commit time, and only then make them available to subscribers. Note: Transactional messaging is not the same as a transaction; only a single process is involved. 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
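The buffer-until-commit idea can be sketched as a wrapper around any bus object that has a `publish(subject, message)` method. All names here are illustrative assumptions:

```python
class TransactionalPublisher:
    """Hold back published messages until commit (illustrative sketch)."""
    def __init__(self, bus):
        self._bus = bus          # any object with publish(subject, message)
        self._pending = []

    def publish(self, subject, message):
        self._pending.append((subject, message))  # stored, not yet visible

    def commit(self):
        # Only now are the buffered messages made available to subscribers.
        for subject, message in self._pending:
            self._bus.publish(subject, message)
        self._pending.clear()

    def abort(self):
        self._pending.clear()    # discard everything published since last commit
```

Because only the one sending process is involved, this gives all-or-nothing publication of its messages, not a multi-party transaction.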
Fault Tolerance: Multicasting Problem: TIB/RV relies on multicasting for publishing messages to all subscribers. This mechanism needs to be extended to wide-area networks and requires reliable multicasting. Solution: Pragmatic General Multicast (PGM): a NACK-based scheme in which receivers tell the sender that they are missing something (no hard guarantees). 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
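The receiver side of a NACK-based scheme boils down to gap detection over sequence numbers. This is a simplified illustration, not PGM itself, which batches, delays, and suppresses duplicate NACKs and still gives no hard delivery guarantees:

```python
def gaps_to_nack(received):
    """received: set of sequence numbers delivered so far.
    Return the missing numbers a NACK-based receiver would ask for
    (illustrative sketch of gap detection only)."""
    if not received:
        return []
    lo, hi = min(received), max(received)
    # Anything between the lowest and highest seen number that is absent
    # must have been lost in transit.
    return [seq for seq in range(lo, hi + 1) if seq not in received]
```

A receiver that has seen packets 1, 2, 5, and 6 would NACK 3 and 4; packets after the highest seen number cannot be detected as missing until a later packet arrives.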
Fault Tolerance: Process Groups Essence: Process resilience is provided through process groups; active members respond to all incoming messages, inactive ones just listen. Note: If the number of active members equals one, we have a primary-based replication protocol. Ranking: All members are ranked; TIB/RV ensures (automatically) that the highest-ranked process is activated when an active member crashes. Question: How can the middleware guarantee that a specific number of active members are running? 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
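The rank-based activation rule can be sketched as a pure function over the group membership. The function name and tuple layout are hypothetical, mimicking the behavior described above:

```python
def ranks_to_activate(members, target_active):
    """members: list of (rank, is_active) pairs for one process group.
    Return the ranks of inactive members to activate, highest rank first,
    so that `target_active` members end up active (illustrative sketch)."""
    active = [rank for rank, is_active in members if is_active]
    inactive = sorted((rank for rank, is_active in members if not is_active),
                      reverse=True)
    need = max(0, target_active - len(active))
    return inactive[:need]
```

With one active member the group behaves as primary-based replication: when the active member crashes, the highest-ranked listener takes over.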
Security Essence: Establish a secure channel between a specific publisher and a specific subscriber. Question: We are losing something in our coordination model – what? Note: The shared secret KA,B is established through a Diffie-Hellman key exchange. We are now trying to avoid a man-in-the-middle attack (Chuck pretending to be Bob to Alice, and Alice to Bob). 12 – Distributed Coordination-Based Systems/12.2 TIB/Rendezvous
Jini: Overview (1/2) Coordination model: temporal and referential uncoupling by means of JavaSpaces, a tuple-based storage system. A tuple is a typed set of references to objects. Tuples are stored in serialized, that is, marshaled, form in a JavaSpace. To read a tuple, construct a template with some fields left open. A template is matched against a tuple through a field-by-field comparison. 12 – Distributed Coordination-Based Systems/12.3 Jini
Jini: Overview (2/2) Write: A copy of a tuple (tuple instance) is stored in a JavaSpace Read: A template is compared to tuple instances; the first match returns a tuple instance Take: A template is compared to tuple instances; the first match returns a tuple instance and removes the matching instance from the JavaSpace 12 – Distributed Coordination-Based Systems/12.3 Jini
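The three operations above can be sketched with plain Python tuples, using `None` fields as the "open" template fields. The class is an illustrative sketch, not the JavaSpaces API; real JavaSpaces operations can also block with a timeout while waiting for a match:

```python
class TupleSpace:
    """JavaSpace-style storage (illustrative sketch). A None field in a
    template acts as a wildcard; matching is field by field."""
    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tuple(tup))  # store a copy of the tuple instance

    def _matches(self, template, tup):
        return len(template) == len(tup) and all(
            want is None or want == have
            for want, have in zip(template, tup))

    def read(self, template):
        # Non-destructive: return the first matching tuple instance, or None.
        for tup in self._tuples:
            if self._matches(template, tup):
                return tup
        return None

    def take(self, template):
        # Destructive: return and remove the first matching tuple instance.
        for i, tup in enumerate(self._tuples):
            if self._matches(template, tup):
                return self._tuples.pop(i)
        return None
```

Processes coordinate purely through the space: a writer and a taker never need to know each other or run at the same time.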
Communication: Notifications Essence: A process can register itself at an object to be notified when an event happens. Uses a callback mechanism through listener objects. A callback is implemented as an RMI. Note: You can also be notified for matches in a JavaSpace, but there may be a race: 12 – Distributed Coordination-Based Systems/12.3 Jini
JavaSpace Server (1/2) Essence: A JavaSpace is implemented by means of a single server; it turns out to be hard to distribute and replicate a JavaSpace. Replicated version: 12 – Distributed Coordination-Based Systems/12.3 Jini
JavaSpace Server (2/2) Distributed version: Scalability: Do not replicate, but use different JavaSpaces leading to nontransparent logical distributions. Possibly move a JavaSpace to places where a lot of clients are. 12 – Distributed Coordination-Based Systems/12.3 Jini
Transactions Essence: Jini provides only a standard interface to a two-phase commit (2PC) protocol; a default implementation of this protocol is offered as well. Question: What good will it do if you only provide interfaces? 12 – Distributed Coordination-Based Systems/12.3 Jini
Comparison 12 – Distributed Coordination-Based Systems/12.3 Jini