Examples of distributed systems Resource sharing and the web

Chapter 1: Characterization of Distributed Systems
Introduction
Examples of distributed systems
Resource sharing and the web
Challenges
Summary

Characteristics of Distributed System
Ubiquitous networks: Internet, mobile phone network, corporate network, campus network, home network, in-car network, personal network, …
A tremendous number of applications are built on these networks, e.g., Web, ICQ
Distributed system definition: a distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.
Characteristics:
  Concurrency: programs execute concurrently and share resources
  No global clock: programs coordinate their actions by exchanging messages
  Independent failures: when some systems fail, others may not know

The Internet
A vast interconnected collection of computer networks of many different types, using TCP/IP
A very large distributed system: WWW, FTP, VOD, etc.
[Figure legend: intranet, ISP, desktop computer, backbone, satellite link, server, network link]

Intranet
A portion of the Internet that is separately administered and has a boundary that can be configured to enforce local security policies
Composed of several LANs linked by backbone connections
Connected to the Internet via a router
Figure: a typical intranet
Main issues arising in the design of components for use in intranets:
  File services: enable users to share data
  Firewalls: protect the intranet, but tend to impede legitimate access to services
  The cost of software installation and support: reduced by the use of system architectures such as network computers and thin clients

A typical intranet

Mobile and ubiquitous computing
Mobile devices:
  Laptop computers
  Handheld devices, including PDAs, cell phones, pagers, video cameras and digital cameras
  Wearable devices, such as smart watches
  Devices embedded in appliances such as washing machines, hi-fi systems, cars
Mobile computing (nomadic computing):
  Users can still access resources while they are on the move, or visiting places other than their usual environment
  Location-aware computing: utilize resources that are conveniently nearby
Ubiquitous computing (pervasive computing):
  The harnessing of many small, cheap computational devices that are present in users' physical environments, including the home, office and elsewhere
An example of portable and handheld devices in a distributed system (figure)

Portable and handheld devices in a distributed system

What are people doing now? – Ongoing projects
Computational Grid (meta computing):
  Idle computers are ubiquitous
  Computers collaborate as a whole system across a WAN
  Transparent resource (process, storage, network, etc.) sharing for end users
  Examples: Globus, CERN Data Grid, ChinaGrid, Entropia.com
Distributed object computing:
  Object-oriented middleware for applications
  Examples: CORBA, DCOM, EJB, Globe
Peer-to-peer applications:
  A distributed system architecture in contrast to client/server
  Examples: Napster, Gnutella, FreeNet, OceanStore, JXTA
Commercial giants' perspective:
  Distributed computing environments
  Examples: .NET, Autonomous Computing

OceanStore overview

The JXTA Search network architecture

Motivation of distributed computing: resource sharing
Resource types:
  Hardware, e.g. printer, scanner, camera
  Data, e.g. file, database, web page
  Service, e.g. search engine, file
Some definitions:
  Service: manages a collection of related resources and presents their functionality to users and applications
  Server: a process on a networked computer that accepts requests from processes on other computers to perform a service, and responds appropriately
  Client: the requesting process
  Remote invocation: a complete interaction between client and server, from the point when the client sends its request to when it receives the server's response

Case study: World Wide Web
Motivation of the WWW: document sharing between physicists at CERN
The WWW is an open system:
  It can be extended and implemented in different ways based on standard protocols, with different servers and different browsers
  The types of shared resource can be extended, e.g. MIME types
Basic technological components:
  HTML: HyperText Markup Language
  URL: e.g. Ftp://ftp.cs.pku.edu.cn
  HTTP: request-reply interactions, content types, one resource per request, simple access control
Advanced features:
  Dynamic content: CGI, ASP, Servlet, etc.
  Dynamic web pages: JavaScript, Applet, etc.
Discussion:
  Dangling links: a resource is deleted or moved, but links to it may still remain
  Finding information easily: e.g., the Resource Description Framework, which standardizes the format of metadata about web resources
  Exchanging information easily: e.g., XML, a self-describing language
  Scalability: heavy load on popular web servers

Challenges (1)
1. Heterogeneity
  networks: Ethernet, token ring, etc.
  computer hardware: big endian / little endian
  operating systems: different message interfaces in Unix and Windows
  programming languages: different representations for data structures
  implementations by different developers: no common standards
  Middleware is a software layer that provides a programming abstraction and masks the heterogeneity of the underlying platform, e.g., CORBA, DCOM, Java RMI, etc.
2. Openness
  An open system can be extended and re-implemented in various ways, e.g. Unix, Internet
  How to deal with openness? Key interfaces are published, e.g. RFCs
3. Security
  confidentiality: protection against disclosure to unauthorized individuals, e.g. ACLs in the Unix file system
  integrity: protection against alteration or corruption, e.g. checksums
  availability: protection against interference with the means to access the resources, e.g. against denial of service
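The hardware heterogeneity listed above (big endian / little endian) can be made concrete with a small sketch. It assumes Python's standard `struct` module; the value and the helper name `decode_network_order` are illustrative only and do not come from the slides. The point is that both sides must agree on one byte order on the wire, which is the kind of detail middleware hides.

```python
import struct

value = 0x12345678

# '>' packs the most significant byte first (big endian, also "network order");
# '<' packs the least significant byte first (little endian).
big = struct.pack(">I", value)
little = struct.pack("<I", value)


def decode_network_order(data: bytes) -> int:
    """Decode a 4-byte unsigned integer that was sent in big-endian order."""
    return struct.unpack(">I", data)[0]


print(big.hex())                        # 12345678
print(little.hex())                     # 78563412
print(hex(decode_network_order(big)))   # 0x12345678

# Reading big-endian bytes with the wrong byte order silently yields a different number.
misread = struct.unpack("<I", big)[0]
print(hex(misread))                     # 0x78563412
```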

Challenges (2)
Security challenges that are not yet fully met: denial of service attacks, security of mobile code, etc.
4. Scalability
  A system is described as scalable if it will remain effective when there is a significant increase in the number of resources and the number of users
  The Internet is an example of a distributed system that is scalable
  Design challenges:
    controlling the cost of physical resources, e.g. at most O(n)
    controlling the performance loss, e.g. no worse than O(log n)
    preventing software resources from running out, e.g. IP addresses
    avoiding performance bottlenecks, e.g. the name table is partitioned and cached in DNS

  Date          Computers     Web servers   Percentage
  1993, July     1,776,000           130        0.008
  1995, July     6,642,000        23,500        0.4
  1997, July    19,540,000     1,203,096        6
  1999, July    56,218,000     6,598,697       12

Challenges (3)
5. Failure handling
  Detecting, e.g. a checksum for corrupted data. Detection is sometimes impossible, so a failure can only be suspected, e.g. a crashed remote server in the Internet
  Masking, e.g. retransmitting a message, standby server
  Tolerating, e.g. informing the user of the problem
  Recovery, e.g. roll back
  Redundancy, e.g. IP routes, replicated name table in DNS
  Availability: a measure of the proportion of time that a system is available for use
6. Concurrency
  Ensure that operations on a shared resource are correct in a concurrent environment
7. Transparency
  Access transparency: identical operations are used to access local and remote resources, e.g. a hyperlink in a web page
  Location transparency: resources can be accessed without knowledge of their location, e.g. URL
  Concurrency transparency: several processes operate concurrently using shared resources without interference between them

Challenges (4)
  Replication transparency: multiple instances of resources are used to increase reliability and performance without knowledge of the replicas by users or application programmers, e.g. web cache
  Failure transparency: users and applications can complete their tasks despite the failure of hardware and software components
  Mobility transparency: resources and clients can move within a system without affecting the operation of users and programs, e.g., mobile phone
  Performance transparency: allows the system to be reconfigured to improve performance as loads vary
  Scaling transparency: allows the system and applications to expand in scale without change to the system structure or the application algorithms

Summary
Distributed systems are pervasive
Resource sharing is the main motivation for constructing distributed systems
Characteristics of distributed systems:
  Concurrency
  No global clock
  Independent failures
Challenges in constructing distributed systems:
  Heterogeneity
  Openness
  Security
  Scalability
  Failure handling
  Concurrency
  Transparency

Chapter 2: System Model
Introduction
Architecture Models
Fundamental Models
Summary

Introduction
Why do we need models?
  Each model is intended to provide an abstract, simplified but consistent description of a relevant aspect of distributed system design
Architecture model
  Defines the way in which the components of systems interact with one another and the way in which they are mapped onto the underlying network of computers
  Client/Server vs. Peer to Peer
  Variants of C/S:
    partition of data or replication at cooperating servers
    caching of data by proxy servers and clients
    use of mobile code and mobile agents
    requirement to add and remove mobile devices in a convenient manner

Introduction …continued
Fundamental model
  Concerned with a more formal description of the properties that are common to all of the architectural models
  The interaction model deals with performance and with the difficulty of setting time limits in a distributed system
  The failure model attempts to give a precise specification of the faults that can be exhibited by processes and communication channels
  The security model discusses the possible threats to processes and communication channels

Concepts
What does the architecture model consider?
  The placement of the components across a network of computers, seeking to define useful patterns for the distribution of data and workload
  The interrelationships between the components: their functional roles and the patterns of communication between them
Software layers
  Originally referred to the structuring of software as layers or modules in a single computer
  More recently described in terms of services offered and requested between processes located in the same or different computers

Software layers
Software and hardware service layers in distributed systems
Platform
  The lowest-level hardware and software layers, e.g., Intel x86/Windows, SPARC/SunOS, PowerPC/MacOS
Middleware
  A layer of software that masks heterogeneity and provides a convenient programming model to application programmers
  Examples: RPC, RMI, CORBA, DCOM, Isis (group communication system)
  Limitations: some systems require support at the application level
    e.g., when transferring large messages, TCP provides some error detection and correction, but it cannot recover from major network interruptions, so support in the mail transfer service at the application layer is required
  'The end-to-end argument' [1984]: some communication-related functions can be completely and reliably implemented only with the knowledge and help of the application standing at the end points of the communication system
  This runs counter to the view of middleware supporters (complete transparency)

Software and hardware service layers in distributed systems

What is system architecture?
System architectures
What is system architecture?
  The division of responsibilities between system components (applications, servers and other processes) and the placement of the components on computers in the network
Main distributed system architectures
1. Client-server model
  Historically the most important, and still the most widely employed
  Servers may in turn be clients of other servers
2. Services provided by multiple servers
  Partition the set of service objects across different servers, e.g. workflow system
  Maintain replicated service objects on several hosts, e.g. Sun NIS
3. Proxy servers and caches
  A cache is a store of recently used data objects that is closer than the objects themselves
  E.g., web page cache at a web browser or web proxy server

Clients invoke individual servers

A service provided by multiple servers

Web Proxy Server

System architectures … continued
4. Peer processes (peer to peer)
  All of the processes play similar roles, interacting cooperatively as peers to perform a distributed activity or computation without any distinction between clients and servers
  Maintain consistency of application-level resources and synchronize application-level actions when necessary
  E.g., a peer-to-peer whiteboard

A distributed application based on peer processes

Variations on the client-server model
Reasons for variation:
  The use of mobile code and mobile agents
  Users' need for low-cost computers with limited hardware resources
  The requirement to add and remove mobile devices in a convenient manner
Several variations:
1. Mobile code
  Good interactive response, e.g., applet
2. Mobile agent
  A running program that travels from one computer to another in a network, carrying out a task on someone's behalf, e.g., aglet [IBM], worm [Xerox PARC]
3. Network computers
  Download the operating system and any application software from a remote file server
  All the application data and code is stored by a file server, so users may migrate

Web applets

Variations on the client-server model … continued
4. Thin client
  A software layer that supports a window-based user interface on a computer that is local to the user, while executing application programs on a remote computer
  Drawback: high latencies
  Implementations: X11, VNC [AT&T 1998]
5. Spontaneous networking
  The form of distribution that integrates mobile devices and other devices into a given network
  Key features: easy connection to a local network, easy integration with local services
  Key design issues:
    Convenient connection and integration
    Limited connectivity: mobile devices move around continuously and get disconnected
    Security and privacy
  Discovery services: registration service, lookup service

Thin clients and compute servers
[Figure labels: network computer or PC (thin client), network, compute server (application process)]

Spontaneous networking in a hotel
[Figure labels: Internet, gateway, hotel wireless network, discovery service, music service, alarm service, camera, TV/PC, guests' devices (laptop, PDA)]

Design requirements for distributed architectures
Performance issues
  Responsiveness: determined by the load and performance of the server and network, delays in the client and server operating systems' communication and middleware services, and the code of the service itself
  Throughput: the rate at which computational work is done; the throughput of the intervening software layers is important
  Balancing computational loads
Dependability issues
  Fault tolerance: redundancy, e.g., data and processes are replicated, messages are retransmitted
  Security: e.g. locate sensitive data in computers that can be secured effectively against attack
Quality of service (QoS)
  Reliability, security
  Performance: ability to meet timeliness guarantees
  Adaptability: meet changing system configurations and resource availability
  Availability: have the necessary computing and network resources at the appropriate times
Use of caching and replication, e.g. web caching protocol

Fundamental models
A system model should address the following questions:
  What are the main entities in the system?
  How do they interact?
  What are the characteristics that affect their individual and collective behavior?
Purpose of a model:
  Make explicit all the relevant assumptions about the system we are modeling
  Make generalizations concerning what is possible or impossible by logical analysis and mathematical proof
The fundamental models discuss:
  Interaction
  Failure
  Security

Interaction model
Examples of interaction in distributed systems:
  DNS, NIS: multiple server processes cooperate with one another
  P2P voice conference system: strict real-time constraints
Distributed algorithm: a definition of the steps to be taken by each of the processes of which the system is composed, including the transmission of messages between them
  It is difficult to describe all the states, because of failures of processes and message transmissions
Two significant factors affecting interacting processes:
1. Communication performance is often a limiting characteristic
  Latency: the delay between the sending of a message by one process and its receipt by another, including network delay, access delay and OS delay
  Bandwidth: the total amount of information that can be transmitted over a channel in a given time
  Jitter: the variation in the time taken to deliver a series of messages
2. It is impossible to maintain a single global notion of time
  Clock drift rate: the relative amount by which a computer clock differs from a perfect reference clock
  Timing events: e.g., GPS, logical time

Two variants of the interaction model
Synchronous distributed system
  The time to execute each step of a process has known lower and upper bounds
  Each message transmitted over a channel is received within a known bounded time
  Each process has a local clock whose drift rate from real time has a known bound
Asynchronous distributed system: no bounds on
  Process execution speeds: e.g. each step may take an arbitrarily long time
  Message transmission delays: e.g. a message may be received after an arbitrarily long time
  Clock drift rates: the drift rate of a clock is arbitrary
Examples of synchronous DS and asynchronous DS
  Asynchronous DS: e.g. FTP
  Synchronous DS: VOD, voice conference system

Event ordering
Example: disorder of messages
  A group includes X, Y, Z and A
  X sends "Meeting" to all; Y and Z reply "Re: Meeting" to all
  At A, the messages are received in the order Z: "Re: Meeting", X: "Meeting", Y: "Re: Meeting"
Logical time [Lamport 1978]
  Provides an ordering among the events at processes running on different computers in a distributed system
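A minimal sketch of how logical time can repair the ordering in the example above. It assumes the usual Lamport clock rules (increment on send; on receive take the maximum of the local and message timestamps, plus one). The class name `LamportClock` and the tuple layout are illustrative choices, not part of the slides.

```python
class LamportClock:
    """Logical clock for one process (illustrative sketch)."""

    def __init__(self) -> None:
        self.time = 0

    def send(self) -> int:
        # Sending is an event: advance the clock and stamp the message with it.
        self.time += 1
        return self.time

    def receive(self, message_time: int) -> int:
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, message_time) + 1
        return self.time


x, y, z = LamportClock(), LamportClock(), LamportClock()

# X sends "Meeting"; Y and Z each receive it and then reply.
meeting = (x.send(), "X", "Meeting")
y.receive(meeting[0])
reply_y = (y.send(), "Y", "Re: Meeting")
z.receive(meeting[0])
reply_z = (z.send(), "Z", "Re: Meeting")

# A receives the messages out of order, as in the example.
arrival_order_at_a = [reply_z, meeting, reply_y]

# Sorting by (timestamp, sender) places the original message before both replies.
ordered = sorted(arrival_order_at_a)
for timestamp, sender, subject in ordered:
    print(timestamp, sender, subject)
```

Ties between the two replies are broken by sender name here; that tie-break is an arbitrary assumption, since the replies are concurrent and logical time does not order them.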

Failure model
Failure model: defines the ways in which failures may occur, in order to provide an understanding of their effects and of how to overcome them
Failure models:
1. Omission failures
  A process or communication channel fails to perform actions that it is supposed to do
  Process omission failure: crash
    Fail-stop: a crash that other processes can detect with certainty, e.g., by timeouts in a synchronous DS
  Communication omission failures: dropping messages
    Send omission, receive omission, channel omission
  Omission failures are benign failures

Failure model …continued
2. Arbitrary (Byzantine) failures
  The worst possible failure semantics
  A process arbitrarily omits intended processing steps or takes unintended processing steps, e.g., returns a wrong value in response to an invocation
  Arbitrary failures in processes are hard to detect
  Arbitrary failures in communication channels exist but are rare, because faulty messages can be recognized and rejected

  Class of failure      | Affects            | Description
  Fail-stop             | Process            | Process halts and remains halted. Other processes may detect this state.
  Crash                 | Process            | Process halts and remains halted. Other processes may not be able to detect this state.
  Omission              | Channel            | A message inserted in an outgoing message buffer never arrives at the other end's incoming message buffer.
  Send-omission         | Process            | A process completes a send, but the message is not put in its outgoing message buffer.
  Receive-omission      | Process            | A message is put in a process's incoming message buffer, but that process does not receive it.
  Arbitrary (Byzantine) | Process or channel | Process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times, commit omissions; a process may stop or take an incorrect step.

Failure model …continued
3. Timing failures
  Applicable in synchronous distributed systems, but not in asynchronous DS

  Class of failure | Affects | Description
  Clock            | Process | Process's local clock exceeds the bounds on its rate of drift from real time.
  Performance      | Process | Process exceeds the bounds on the interval between two steps.
  Performance      | Channel | A message's transmission takes longer than the stated bound.

Masking failures
  Hide, e.g., replicated servers
  Convert, e.g., checksum: arbitrary failure -> omission failure
Reliability of one-to-one communication
  Validity: any message in the outgoing message buffer is eventually delivered to the incoming message buffer
  Integrity: the message received is identical to the one sent, and no messages are delivered twice; this guards against retransmission protocols and malicious messages
  Reliable communication is defined in terms of validity and integrity
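The "convert" form of masking above (checksum: arbitrary failure -> omission failure) can be sketched as follows. This assumes a CRC-32 checksum from Python's standard `zlib` module and an invented 4-byte framing; the function names `frame` and `deliver` are hypothetical. A corrupted message is dropped rather than delivered, so the receiver sees an omission instead of wrong data.

```python
import zlib
from typing import Optional


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 checksum (4 bytes, big endian)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def deliver(framed: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise drop the message."""
    checksum, payload = framed[:4], framed[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None  # corruption detected: behaves like an omission failure
    return payload


good = frame(b"transfer 100")

# Simulate a channel that flips one bit of the payload in transit.
corrupted = bytearray(good)
corrupted[-1] ^= 0x01

print(deliver(good))              # b'transfer 100'
print(deliver(bytes(corrupted)))  # None
```

A checksum like this guards against accidental corruption only; it is not a defence against deliberately forged messages.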

Security model
The security of a distributed system covers:
  The processes
  The communication channels
  The objects
Protecting the objects
  Access rights: who is allowed to perform the operations of an object
  Principal: the authority who has some rights on the object

Securing processes and their interactions
The enemy
Threats to processes
  To servers: invocation with a false identity, e.g. cheating a mail server
  To clients: receiving a false result, e.g. stealing an account password
Threats to communication channels
  Copy, alter or inject messages
  Save and replay, e.g., repeat a transfer of money from one account to another
Denial of service: excessive and pointless invocations on services or message transmissions in a network, resulting in overloading of physical resources (network bandwidth, server processing capacity)
Mobile code: malicious mobile programs, e.g. Trojan horse attachment
[Figure labels: process p, process q, communication channel, the enemy, copy of m, m']

Securing processes and their interactions … continued
Defeating security threats
Cryptography and shared secrets
  Processes identify each other by shared secrets that are known only to themselves; cryptography is the basis
Authentication: proving the identities supplied by senders
Secure channels
  Each process knows reliably the identity of the principal on whose behalf the other process is executing
  Ensure the privacy and integrity of the data transmitted across them
  Each message includes a physical or logical time stamp
[Figure labels: principal A, process p, secure channel, process q, principal B]

Summary
Architecture models
  Client/Server, e.g. Web, FTP, NEWS
  Multiple servers, e.g. DNS
  Proxy and cache, e.g. web cache
  Peer processes
  Variations of C/S: mobile code, mobile agent, network computer, thin client, spontaneous networks
Fundamental models
  Interaction models: synchronous DS and asynchronous DS
  Failure models: omission failures, arbitrary failures and timing failures
  Security model: the enemy and the approaches to defeating it

Chapter 3: Networking and Internetworking
Introduction
Types of network
Network principles
Internet protocols
Network case studies: Ethernet, wireless LAN and ATM
Summary

Concepts
Why should we study network technology?
  The network is the basis of a distributed system
  The performance, reliability, scalability, … of the underlying networks affect the behavior of distributed systems and therefore their design
Some concepts
  transmission media: wire, cable, fiber and wireless channels
  hardware devices: router, switch, bridge, hub, repeater and network interface
  software components: protocol stacks, communication handlers and drivers
  host: a computer or device that uses the network for communication purposes
  node: a computer or switching device attached to a network
  subnet: a set of interconnected nodes, all of which employ the same technology to communicate amongst themselves

Network issues for distributed systems
Performance
  latency: the delay that occurs after a send operation is executed and before data starts to become available at the destination, i.e. the time to transfer an empty message
  data transfer rate: the speed at which data can be transferred between two computers in the network once transmission has begun, in bits/s
  Message transmission time = latency + length / data transfer rate
  The data transfer rate is determined primarily by the physical characteristics of the network, whereas the latency is determined primarily by software overheads, routing delays and delays in accessing transmission channels
  In distributed systems, messages are often small, so latency is more significant than data transfer rate
  total system bandwidth: the total volume of traffic that can be transferred across the network in a given time
    Ethernet: the system bandwidth is the same as the data transfer rate
    WAN: multiple channels; performance deteriorates when there are too many messages
Comparison of different communication channels
  local network: a null message transmission time is under a millisecond
  local memory: … or more times faster than a local network
  local hard disk: … times slower than a fast local network
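The formula above (message transmission time = latency + length / data transfer rate) can be tried with a few numbers. The latency and rate used here are illustrative assumptions, not measurements from the slides; the sketch only shows why latency dominates for small messages.

```python
def transmission_time(latency_s: float, length_bits: float, rate_bits_per_s: float) -> float:
    """Message transmission time = latency + length / data transfer rate."""
    return latency_s + length_bits / rate_bits_per_s


# Assumed figures for illustration: 0.5 ms latency on a 10 Mbit/s network.
LATENCY_S = 0.0005
RATE_BPS = 10_000_000

small = transmission_time(LATENCY_S, 1_000, RATE_BPS)       # 1,000-bit message
large = transmission_time(LATENCY_S, 8_000_000, RATE_BPS)   # 8,000,000-bit message

for name, total in (("small", small), ("large", large)):
    share = LATENCY_S / total
    print(f"{name}: {total * 1000:.3f} ms total, latency is {share:.1%} of it")
```

With these assumed figures the small message spends most of its time in latency, while for the large message the transfer term dominates.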

Networking issues for distributed systems … continued
Scalability
  WWW: world wide wait?
  Future Internet: several billion nodes and hundreds of millions of active hosts; new addressing and routing mechanisms in IPv6
Reliability
  The reliability of most physical transmission media is very high
  Communication errors: timing failures in the software of the sender or receiver, or buffer overflow
Security
  Firewall: protects the resources in all of the computers inside the organization from access by external users or processes, and controls the use of resources outside the firewall by users inside the organization; always runs on a gateway
  Secure network environment, e.g. VPN
Mobility
  Although the current mechanisms have been adapted and extended to support mobile devices, the expected future growth in their use will require further extension
QoS
  Requires guaranteed bandwidth and bounded latencies
Multicasting
  Need for one-to-many communication

Types of network
Local area networks (LANs)
  High speed, connected to a single communication medium, no routing of messages, may have switches or hubs
  Ethernet, token rings and slotted rings (1970s); 10 Mbps -> 100 Mbps -> 1000 Mbps (1 Gbps)
Wide area networks (WANs)
  Lower speeds; links between different cities, countries or continents
  The communication medium is a set of communication circuits linking a set of dedicated computers called routers
Metropolitan area networks (MANs)
  Ethernet, ATM, DSL (digital subscriber line)
Wireless networks
  IEEE (WaveLAN): 2-11 Mbps over 150 meters
  WPANs (wireless personal area networks): infra-red links, Bluetooth (1-2 Mbps over 10 meters)
  Digital mobile phone networks: European GSM / USA CDPD (up to 2 Mbps)
  WAP (Wireless Application Protocol)

Types of networks … continued
Internetworks
  Several networks are linked together to provide common data communication facilities that conceal the technology differences
  Linked by routers and gateways, e.g. TCP/IP
Network comparisons
  Different failure models: TCP vs. UDP

Network Switching technology
Packet switching
  Store and forward, shared communication links, asynchronous
Packet transmission
  Messages: arbitrary length
  Packets: restricted length
  Sufficient buffer storage at each node, avoid undue delays
Data streaming
  Video stream: bandwidth requirement, continuous flow, e.g. frame N arrives no more than N/24 seconds after the first frame arrives
  Bandwidth, latency and reliability must be guaranteed: predefined route, e.g. ATM and IPv6
Switching schemes
  Broadcast: no switching, e.g. Ethernet
  Circuit switching: telephone system
  Frame relay: a combination of circuit switching and packet switching
  Delay: Internet, 200 milliseconds; telephone, 50 milliseconds; ATM, a few tens of microseconds

Network Protocols
Protocol
  A specification of the sequence of messages that must be exchanged
  A specification of the format of the data in the messages
Protocol layers
  Each layer presents an interface to the layers above it that extends the properties of the underlying communication system
  Layer encapsulation
  Protocol suites (protocol stacks)
    Seven-layer Reference Model for OSI, adopted by ISO
    Simplify and generalize the communication interface, but at significant cost, e.g., N copies

Other important concepts in network protocol
Packet assembly
  Network-layer protocol packets: MTU (maximum transfer unit), 1500 bytes in Ethernet, 64 KB in IP
Ports
  Software-definable destination points for communication within a host computer
  Transport address = network address + port number; port numbers above 1023 are available for general use
Packet delivery
  Datagram packet delivery, e.g. IP, Ethernet, and most wired and wireless LANs
  Virtual circuit packet delivery, e.g., ATM
  Different from the concepts of connection-oriented (TCP) and connectionless (UDP) in transport-layer protocols

Routing
Routing algorithm
  Determines the route taken by each packet: predetermined for circuit switching (e.g. X.25) and frame relay switching (e.g. ATM); determined on the fly for packet switching
  Dynamically updates its knowledge of the network
Distance vector
  Bellman-Ford algorithm [1957]
  RIP (Routing Information Protocol)
  Periodically, and whenever the local routing table changes, send the routing table to all accessible neighbours
  When a table is received from a neighbouring router, make the necessary updates
  Convergence problem
  Improvements: link costs that include bandwidth information, faster convergence, etc.
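The distance-vector update step described above (merge a neighbour's table into the local one) might look like the following sketch. The table layout (destination mapped to a cost and next hop) and the function name `merge_neighbour_table` are assumptions for illustration; real RIP adds timers, a maximum hop count and other details omitted here.

```python
def merge_neighbour_table(table, neighbour, neighbour_costs, link_cost):
    """Update `table` in place from one neighbour's advertised costs.

    table:            dict destination -> (cost, next_hop)
    neighbour_costs:  dict destination -> cost as seen by the neighbour
    Returns True if the local table changed and should be re-advertised.
    """
    changed = False
    for destination, cost in neighbour_costs.items():
        candidate = (link_cost + cost, neighbour)
        current = table.get(destination)
        if current is None:
            better = True  # new destination learned
        elif current[1] == neighbour:
            # The route already goes via this neighbour, so its news is authoritative.
            better = candidate[0] != current[0]
        else:
            better = candidate[0] < current[0]
        if better:
            table[destination] = candidate
            changed = True
    return changed


# Router A knows itself and its direct neighbour B (link cost 1).
table_a = {"A": (0, "local"), "B": (1, "B")}
advert_from_b = {"A": 1, "B": 0, "C": 1}

first = merge_neighbour_table(table_a, "B", advert_from_b, link_cost=1)
second = merge_neighbour_table(table_a, "B", advert_from_b, link_cost=1)
print(table_a)        # A learns a route to C via B with cost 2
print(first, second)  # True False
```

The second call returns False because nothing changed, which is the condition under which a router has no new information to send on.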

Routing … continued
Link state algorithm
  Each node maintains complete knowledge of the network
  Each node can compute appropriate routes based on that knowledge
  Avoids slow convergence and undesirable intermediate states
  E.g. OSPF
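Because a link-state node holds the whole topology, it can compute routes locally with a shortest-path algorithm. The sketch below uses Dijkstra's algorithm over an invented four-node graph; the graph, its costs and the function name `shortest_paths` are assumptions for illustration and are not taken from the slides.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm: least total link cost from `source` to every reachable node.

    graph: dict node -> dict of neighbour -> link cost (non-negative)
    """
    distance = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > distance.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        for neighbour, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < distance.get(neighbour, float("inf")):
                distance[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return distance


# Hypothetical topology known to every node in the link-state database.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

routes_from_a = shortest_paths(topology, "A")
print(routes_from_a)  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```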

Congestion control
A rule of thumb
  When the load on a network exceeds 80% of its capacity, the total throughput tends to drop as a result of packet losses
Congestion control
  Instead of allowing packets to travel through the network until they reach over-congested nodes, hold them at earlier nodes
Control approaches
  Inform nodes along the congested route to reduce their packet transmission rate, i.e. buffer packets for longer at intermediate nodes or queue them at the source host, e.g. choke packets requesting a reduction in transmission rate
  In the Internet, congestion control relies on end-to-end traffic control, e.g. in TCP

Internetworking
Internetwork
  Integrates many subnets that use different network technologies
Requirements
  A unified internetwork addressing scheme that enables packets to be addressed to any host connected to any subnet
  A protocol defining the format of internetwork packets and giving rules according to which they are handled
  Interconnecting components that route packets to their destinations in terms of internetwork addresses, transmitting the packets using subnets with a variety of network technologies
Example (figure)

Internetworking components
Router: performs routing, and additionally links networks of different types
Bridge: links networks of different types, but does not perform routing
Hub: connects hosts and extends segments of Ethernet and other broadcast local networks
Switch: performs a similar function to a router, but for LANs only
Tunnel: a software layer that transmits packets through an alien network environment

Internet Overview Protocol stack
TCP(UDP)/IP, web [HTTP], email [SMTP, POP], news [NNTP], FTP, SSL, etc.
Exceptions to the universal adoption of TCP/IP
  The use of WAP for wireless applications on portable devices
  Special protocols to support multimedia streaming applications
Support for heterogeneous underlying networks
  E.g., IP over ATM, IP over Ethernet, IP over PPP, etc.
  The success of TCP/IP: independence of the underlying transmission technology

IP principles
IP addressing
  2^32 addresses are inadequate due to Internet growth and inefficient allocation
The IP protocol
  Unreliable (best-effort) delivery semantics: packets can be lost, duplicated, delayed or delivered out of order
  Address resolution (ARP): IP address -> Ethernet address mapping; (IP address, Ethernet address) pairs are cached on each host
IP routing
  Internet topology: AS/areas
  Routing algorithms: RIP -> OSPF [Dijkstra 1959]
  Default routes: trade routing efficiency for table size
  Classless interdomain routing (CIDR): create subnets by subdividing addresses, or aggregate addresses, by means of a mask field, e.g. /24
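The address-space size and the CIDR mask idea above can be explored with Python's standard `ipaddress` module. The 192.0.2.0/24 block is a documentation-only range chosen here for illustration; it does not appear in the slides.

```python
import ipaddress

# Size of the IPv4 address space.
total_ipv4 = 2 ** 32
print(total_ipv4)  # 4294967296

# A /24 mask leaves 8 host bits, i.e. 256 addresses.
network = ipaddress.ip_network("192.0.2.0/24")
print(network.num_addresses)  # 256

# Subdividing: split the /24 into four /26 subnets.
subnets = list(network.subnets(new_prefix=26))
print([str(s) for s in subnets])

# Membership test: which subnet does a host address fall into?
host = ipaddress.ip_address("192.0.2.77")
print(host in subnets[1])  # True, since 192.0.2.64/26 covers .64 to .127

# Aggregating: two adjacent /25 blocks collapse into one /24 route.
halves = [ipaddress.ip_network("192.0.2.0/25"), ipaddress.ip_network("192.0.2.128/25")]
aggregated = list(ipaddress.collapse_addresses(halves))
print(aggregated)  # [IPv4Network('192.0.2.0/24')]
```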

Future of IP IPv6 Mobile IP
2^128 addresses, about 7 × 10^23 IP addresses per square meter of the Earth's surface
Routing speed: no checksum, no fragmentation
Real time: priority and flow label, which is used to reserve resources
Extension headers (routing information, authentication, etc.), multicast and anycast
Migration from IPv4: islands of IPv6 routers; the pace depends on economics
Mobile IP
  Home agent and foreign agent

TCP Connection oriented Message delivery Flow control
The two sides must perform a handshake to establish a bidirectional communication channel
Message delivery
  Delivers arbitrarily long sequences of bytes via a stream-based programming abstraction
  Sequencing: the stream is divided into data segments, with a sequence number on each segment
  Checksum: covers the header and the data in the segment
Flow control
  The receiver sends the highest sequence number received and a window size to the sender in an acknowledgement message
  Buffering: receiver and sender buffers are used for flow control
  In interactive applications, the receiver informs the sender when a timeout expires or the buffer reaches the MTU limit
  Retransmission: a segment is retransmitted when no acknowledgement arrives within a specified timeout

UDP features Connectionless Datagram delivery Con Pro
A UDP datagram is encapsulated inside an IP packet, up to 64 KB in size
Con: unreliable delivery, because IP is unreliable
Pro: minimal additional cost and transmission delay

The purpose of a firewall
monitor and control all communication into and out of an intranet, including service control, behavior control and user control Filter approaches IP packet filtering, e.g. router/filter TCP gateway, e.g. bastion Application level gateway, e.g. telnet proxy process Virtual private networks (VPN) Secure connections located at different sites using public Internet links By the use of cryptographically protected secure channels at the IP level

Ethernet IEEE 802.3[Xerox 1973] Ethernet packet layout
Carrier sensing, multiple access with collision detection (CSMA/CD)
Frame broadcasting
Bandwidth: 3 Mbps -> 10 Mbps -> 100 Mbps -> 1000 Mbps
Ethernet packet layout: 2^48 different addresses
Packet collisions
  Carrier sensing: wait until no signal is present, then transmit
  Collision detection: while transmitting through the output port, also listen on the input port and compare the two signals; if they differ, send a jamming signal
  Back-off: wait a time n before retransmitting, where n is a random integer
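The back-off step can be sketched as follows. The slide only says that n is a random integer; this sketch assumes the truncated binary exponential back-off used by IEEE 802.3, where the range of n doubles after each collision up to a cap of 2^10 slots. The class `Backoff` and its method names are hypothetical.

```java
import java.util.Random;

final class Backoff {
    // Assumption: the range stops doubling after 10 collisions, as in IEEE 802.3.
    static final int MAX_EXPONENT = 10;

    /** Number of possible waiting slots (0 .. result-1) after the given number of collisions. */
    static int maxSlots(int collisions) {
        return 1 << Math.min(collisions, MAX_EXPONENT);
    }

    /** Picks the random integer n: how many slot times to wait before retransmitting. */
    static int slotsToWait(int collisions, Random rng) {
        return rng.nextInt(maxSlots(collisions));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int collisions = 1; collisions <= 4; collisions++) {
            System.out.println("after collision " + collisions
                    + ": wait " + slotsToWait(collisions, rng) + " slots");
        }
    }
}
```

Widening the range after each collision makes it less likely that the same two stations choose the same waiting time again.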

Ethernet … continued Ethernet efficiency
Efficiency = number of packets transmitted successfully / theoretical maximum number without collisions
Affected by
  τ: the window of opportunity for collisions is 2τ
  the number of stations on the network
  the stations' level of activity

Collision detection failures in 802.11
Wireless LAN
Wireless LAN types
  Infrastructure network, e.g. IEEE 802.11
  Ad hoc network: a network built on the fly
Collision detection failures in 802.11
  Hidden stations: carrier sensing fails to detect that another station on the network is transmitting, leading to a collision at the base station
  Fading: the strength of radio signals diminishes rapidly with distance from the transmitter, which can defeat both carrier sensing and collision detection
  Collision masking

802.11 introduction
Slot reservation is added to the MAC protocol in 802.11
First, the sender senses the medium. If there is no carrier signal, one of the following holds:
  the medium is available
  an out-of-range station is in the process of requesting a slot
  an out-of-range station is using a slot
The sender sends an RTS (Request To Send) frame to the receiver; the receiver replies with a CTS (Clear To Send) frame. The effect of the exchange:
  stations within range of the sender pick up the RTS frame and take note of the duration
  stations within range of the receiver pick up the CTS frame and take note of the duration
The sender begins to transmit

802.11 introduction … continued
How 802.11 avoids collisions
  CTS frames avoid the hidden station and fading problems
  If an RTS or CTS frame is corrupted, a back-off period is used
  After a correct RTS/CTS exchange there are no collisions in the communication that follows, unless intermittent fading prevented a third party from receiving either frame
Security in 802.11
  Shared-key authentication mechanism
  XOR operation based on the shared key to prevent eavesdropping

Asynchronous Transfer Mode networks (ATM)
Deploying ATM on top of other networks: can be implemented over existing digital telephony networks, with bandwidth from 32 kbps (voice) to 622 Mbps
Native mode: over optical fiber, copper and other transmission media, with bandwidth up to several gigabits per second
ATM layers
  Adaptation layer: an end-to-end layer implemented at the sending and receiving hosts
  ATM layer: a connection-oriented service that transmits fixed-length packets called cells; it avoids flow control and error checking at the switches and provides bandwidth and latency guarantees
  VC (virtual channel): a logical unidirectional association between two endpoints of a link in the physical path from source to destination
  VP (virtual path): a bundle of virtual channels that are associated with a physical path between two switching nodes

ATM… continued The nodes in an ATM network can play three distinct roles Hosts: send and receive messages VP switches: hold correspondence information between incoming and outgoing VPs VP/VC switches: hold correspondence information for both VPs and VCs ATM cell: a 5-byte header and a 48-byte data field

Summary Layered protocols Delivery approach Routing mechanism
7 layers in OSI model / 5 layers in the Internet Delivery approach Packet switch, frame relay Routing mechanism distance vector / link state Congestion control The Internet TCP/IP Network cases Ethernet, WLAN, ATM

Encapsulation of a packet

OSI protocol summary
Layer: Description. Examples
Application: Protocols that are designed to meet the communication requirements of specific applications, often defining the interface to a service. Examples: HTTP, FTP, SMTP, CORBA IIOP
Presentation: Protocols at this level transmit data in a network representation that is independent of the representations used in individual computers, which may differ. Encryption is also performed in this layer, if required. Examples: Secure Sockets (SSL), CORBA Data Rep.
Session: At this level reliability and adaptation are performed, such as detection of failures and automatic recovery.
Transport: This is the lowest level at which messages (rather than packets) are handled. Messages are addressed to communication ports attached to processes. Protocols in this layer may be connection-oriented or connectionless. Examples: TCP, UDP
Network: Transfers data packets between computers in a specific network. In a WAN or an internetwork this involves the generation of a route passing through routers. In a single LAN no routing is required. Examples: IP, ATM virtual circuits
Data link: Responsible for transmission of packets between nodes that are directly connected by a physical link. In a WAN transmission is between pairs of routers or between routers and hosts. In a LAN it is between any pair of hosts. Examples: Ethernet MAC, ATM cell transfer, PPP
Physical: The circuits and hardware that drive the network. It transmits sequences of binary data by analogue signalling, using amplitude or frequency modulation of electrical signals (on cable circuits), light signals (on fibre optic circuits) or other electromagnetic signals (on radio and microwave circuits). Examples: Ethernet base-band signalling, ISDN

Distance-Vector Routing table for the network
[Figure: routing tables (To, Link, Cost) held at routers A, B, C, D and E, for a network of hosts and routers A to E joined by links 1 to 6]

Pseudo-code for RIP routing algorithm
Send: Each t seconds or when Tl changes, send Tl on each non-faulty outgoing link.
Receive: Whenever a routing table Tr is received on link n:
  for all rows Rr in Tr {
    if (Rr.link <> n) {
      Rr.cost = Rr.cost + 1;
      Rr.link = n;
      if (Rr.destination is not in Tl) add Rr to Tl; // add new destination to Tl
      else for all rows Rl in Tl {
        if (Rr.destination = Rl.destination and (Rr.cost < Rl.cost or Rl.link = n)) Rl = Rr;
        // Rr.cost < Rl.cost : remote node has better route
        // Rl.link = n : remote node is more authoritative
      }
    }
  }
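The Receive step of the pseudo-code can be sketched in Java as below. This is an illustrative in-memory version only: the names `RipTable`, `Route` and `LOCAL` are hypothetical, the Send step and link-failure handling are omitted, and the local table Tl is assumed to be keyed by destination.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RipTable {
    static final int LOCAL = -1; // assumed marker for "this router itself"

    static final class Route {
        final String destination;
        final int link;
        final int cost;

        Route(String destination, int link, int cost) {
            this.destination = destination;
            this.link = link;
            this.cost = cost;
        }
    }

    // The local table Tl, keyed by destination.
    final Map<String, Route> routes = new HashMap<>();

    /** Merges a table Tr received on link n into the local table Tl. */
    void receive(int n, List<Route> remoteTable) {
        for (Route rr : remoteTable) {
            if (rr.link == n) {
                continue; // the neighbour reaches this destination through us
            }
            // Reaching the destination via the neighbour costs one extra hop on link n.
            Route candidate = new Route(rr.destination, n, rr.cost + 1);
            Route local = routes.get(rr.destination);
            boolean unknown = local == null;
            boolean better = !unknown && candidate.cost < local.cost;
            // If we already route via link n, the neighbour is authoritative,
            // even when its cost has gone up.
            boolean authoritative = !unknown && local.link == n;
            if (unknown || better || authoritative) {
                routes.put(rr.destination, candidate);
            }
        }
    }
}
```

For example, a router that knows only itself and receives a neighbour's table on link 1 ends up with a one-hop route to that neighbour and two-hop routes to the neighbour's own neighbours.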

Simplified view of the QMW Computer Science network
[Figure: a campus router and a router/firewall connect the staff subnet and the student subnet; the subnets contain Ethernet switches (Eswitch), a hub, file, compute, dialup, web and other servers, a file server/gateway, printers and desktop computers, linked by 100 Mbps and 1000 Mbps Ethernet]

Tunnelling A B IPv6 IPv6 encapsulated in IPv4 packets Encapsulators
IPv4 network A B IP IP encapsulated in PPP packets Encapsulators PPP network

Internet protocol stack
Messages (UDP) or Streams (TCP) Application Transport Internet UDP or TCP packets IP datagrams Network-specific frames Message Layers Underlying network Network interface

Encapsulation in a message transmitted via TCP over an Ethernet
Application message TCP header IP header Ethernet header Ethernet frame port TCP IP

The programmer’s conceptual view of a TCP/IP Internet

Internet address structure

Decimal representation of Internet addresses
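Class A: octet 1 is 1 to 127; network ID followed by host ID
Class B: octet 1 is 128 to 191; network ID followed by host ID
Class C: octet 1 is 192 to 223; network ID followed by host ID
Class D (multicast): octet 1 is 224 to 239; multicast address
Class E (reserved): octet 1 is 240 to 255
Within a host ID the octets range from 0 to 255, with the final octet from 1 to 254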
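The first-octet ranges on this slide are enough to classify an address. A minimal sketch, assuming dotted-decimal input; the class `AddressClass` and its method are hypothetical names used only for illustration.

```java
final class AddressClass {
    /** Returns 'A' to 'E' according to the first octet of a dotted-decimal IPv4 address. */
    static char classify(String dotted) {
        int dot = dotted.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not a dotted-decimal address: " + dotted);
        }
        int first = Integer.parseInt(dotted.substring(0, dot));
        if (first >= 1 && first <= 127) return 'A';
        if (first >= 128 && first <= 191) return 'B';
        if (first >= 192 && first <= 223) return 'C';
        if (first >= 224 && first <= 239) return 'D'; // multicast
        if (first >= 240 && first <= 255) return 'E'; // reserved
        throw new IllegalArgumentException("first octet out of range: " + first);
    }

    public static void main(String[] args) {
        System.out.println(classify("10.1.2.3"));
        System.out.println(classify("224.0.0.1"));
    }
}
```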

IP packet layout

IPv6 header layout

The MobileIP routing mechanism
Sender Home Mobile host MH Foreign agent FA Internet agent First IP packet addressed to MH Address of FA returned to sender tunnelled to FA Subsequent IP packets

Internet topology

Firewall configurations

Ethernet frame layout
Preamble: 7 bytes
S (start frame delimiter): 1 byte
Destination address: 6 bytes
Source address: 6 bytes
Type: 2 bytes
Data for transmission: 46 bytes ≤ length ≤ 1500 bytes
Frame check sequence: 4 bytes

Wireless LAN configuration

ATM protocol layers

ATM cell layout

Chapter 4: Interprocess Communication
Introduction The API for the Internet protocols External data representation and marshalling Client-Server communication Group communication Case study: interprocess communication in UNIX Summary

TCP ( UDP) from a programmers point of view
Introduction Middleware layers TCP ( UDP) from a programmers point of view TCP : two stream UDP : datagram, message passing Java interface, UNIX Socket marshalling and demarshalling data form translation Client-Server communication Request-Reply protocols Java RMI, RPC Group communication

Middleware layers

The characteristics of interprocess communication
Synchronous and asynchronous communication
  A queue is associated with each message destination: the sending process adds messages to the remote queue, and the receiving process removes messages from its local queue
  Synchronous: send and receive are blocking operations
  Asynchronous: send is non-blocking; receive can be blocking or non-blocking (notification of a received message by polling or interrupt)
Message destination
  Internet address + local port
  Service name: resolved by a name service at run time
  Location-independent identifiers, e.g. in Mach
Reliability
  Validity: messages are guaranteed to be delivered despite a reasonable number of packets being dropped or lost
  Integrity: messages arrive uncorrupted and without duplication
Ordering
  Messages are delivered in sender order

Socket Endpoint for communication between processes
Both forms of communication (UDP and TCP) use the socket abstraction
Originated in BSD UNIX; also present in Linux, Windows NT, Macintosh OS, etc.
A socket is bound to a local port (2^16 possible port numbers) and one of the Internet addresses of the computer
A process cannot share ports with other processes on the same computer
[Figure: a client socket on any port sends a message to a server socket on the agreed port; each computer has an Internet address and other ports]

UDP datagram communication
UDP datagrams are sent without acknowledgement or retries
Issues relating to datagram communication
  Message size: no bigger than 64 KB; a message larger than the receiver's buffer is truncated on arrival
  Blocking: non-blocking sends (a message may be discarded at the destination if no socket is bound to the port) and blocking receives (which can time out)
  Timeout: set by the receiver on its socket
  Receive from any: receive does not specify an origin for messages, but a socket can be restricted to receive from or send to a particular remote port by a socket connect operation
Failure model
  Omission failure: a message may be dropped because of a checksum error or lack of buffer space at the sender side or the receiver side
  Ordering: messages may be delivered out of sender order
  The application must itself provide whatever reliability it needs over a UDP communication channel

Java API for UDP datagrams
DatagramPacket
DatagramSocket
  send and receive: transmit datagrams between a pair of sockets
  setSoTimeout: the receive method blocks for the time specified and then throws an InterruptedIOException
  connect: connect to a particular remote port and Internet address
Examples
  Acceptable to services that can tolerate occasional omission failures, e.g. DNS

UDP client sends a message to the server and gets a reply
import java.net.*;
import java.io.*;

public class UDPClient {
    public static void main(String args[]) {
        // args give message contents and server hostname
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket();
            byte[] m = args[0].getBytes();
            InetAddress aHost = InetAddress.getByName(args[1]);
            int serverPort = 6789;
            // use the byte length, which can differ from the character count
            DatagramPacket request = new DatagramPacket(m, m.length, aHost, serverPort);
            aSocket.send(request);
            byte[] buffer = new byte[1000];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            aSocket.receive(reply);
            // decode only the bytes actually received
            System.out.println("Reply: " + new String(reply.getData(), 0, reply.getLength()));
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (aSocket != null) aSocket.close();
        }
    }
}

UDP server repeatedly receives a request and sends it back to the client
import java.net.*;
import java.io.*;

public class UDPServer {
    public static void main(String args[]) {
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket(6789);
            byte[] buffer = new byte[1000];
            while (true) {
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                aSocket.receive(request);
                // echo the received bytes back to the sender's address and port
                DatagramPacket reply = new DatagramPacket(request.getData(),
                        request.getLength(), request.getAddress(), request.getPort());
                aSocket.send(reply);
            }
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (aSocket != null) aSocket.close();
        }
    }
}

TCP stream communication
The API to the TCP provide the abstraction of a stream of bytes to which data may be written and from which data may be read Hidden network characteristics message sizes lost messages flow control message duplication and ordering message destinations issues related to stream communication Matching of data items: agree to the contents of the transmitted data Blocking: send blocked until the data is written in the receiver’s buffer, receive blocked until the data in the local buffer becomes available Threads: server create a new thread when it accept a connection

TCP stream communication … continued
failure model integrity and validity have been achieved by checksum, sequence number, timeout and retransmission in TCP protocol connection could be broken due to unknown failures Can’t distinguish between network failure and the destination process failure Can’t tell whether its recent messages have been received or not

Java API for TCP Streams
ServerSocket accept: listen for connect requests from clients Socket constructor not only create a socket associated with a local port, but also connect it to the specified remote computer and port number getInputStream getOutputStream Examples

TCP client makes connection to server, sends request and receives reply
import java.net.*;
import java.io.*;

public class TCPClient {
    public static void main(String args[]) {
        // arguments supply message and hostname of destination
        Socket s = null;
        try {
            int serverPort = 7896;
            s = new Socket(args[1], serverPort);
            DataInputStream in = new DataInputStream(s.getInputStream());
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            out.writeUTF(args[0]); // UTF is a string encoding see Sn 4.3
            String data = in.readUTF();
            System.out.println("Received: " + data);
        } catch (UnknownHostException e) {
            System.out.println("Sock:" + e.getMessage());
        } catch (EOFException e) {
            System.out.println("EOF:" + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO:" + e.getMessage());
        } finally {
            if (s != null) try { s.close(); } catch (IOException e) { System.out.println("close:" + e.getMessage()); }
        }
    }
}

TCP server makes a connection for each client and then echoes the client’s request
import java.net.*;
import java.io.*;

public class TCPServer {
    public static void main(String args[]) {
        try {
            int serverPort = 7896;
            ServerSocket listenSocket = new ServerSocket(serverPort);
            while (true) {
                // block until a client connects, then serve it on its own thread
                Socket clientSocket = listenSocket.accept();
                Connection c = new Connection(clientSocket);
            }
        } catch (IOException e) {
            System.out.println("Listen :" + e.getMessage());
        }
    }
}
// this figure continues on the next slide

TCP Server … continued
class Connection extends Thread {
    DataInputStream in;
    DataOutputStream out;
    Socket clientSocket;

    public Connection(Socket aClientSocket) {
        try {
            clientSocket = aClientSocket;
            in = new DataInputStream(clientSocket.getInputStream());
            out = new DataOutputStream(clientSocket.getOutputStream());
            this.start();
        } catch (IOException e) {
            System.out.println("Connection:" + e.getMessage());
        }
    }

    public void run() {
        try { // an echo server
            String data = in.readUTF();
            out.writeUTF(data);
        } catch (EOFException e) {
            System.out.println("EOF:" + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO:" + e.getMessage());
        } finally {
            try { clientSocket.close(); } catch (IOException e) { /* close failed */ }
        }
    }
}

External data representation and marshalling introduction
Why does communicated data need an external data representation and marshalling?
  Different data formats on different computers, e.g., big-endian/little-endian integer order, ASCII (Unix) / Unicode character coding
How can any two computers exchange data values?
  The values are converted to an agreed external format before transmission and converted to the local form on receipt
  The values are transmitted in the sender's format, together with an indication of the format used, and the recipient converts the values if necessary
External data representation
  An agreed standard for the representation of data structures and primitive values
Marshalling (unmarshalling)
  The process of taking a collection of data items and assembling them into a form suitable for transmission in a message
  Usage: for data transmission or storing in files
Two alternative approaches
  CORBA's common data representation / Java's object serialization
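The byte-order problem mentioned above can be shown with the standard `java.nio.ByteBuffer` class. The sketch encodes the same integer in both orders and shows what happens when the receiver assumes the wrong one; the class name `ByteOrderDemo` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    /** Encodes a 32-bit integer as 4 bytes in the given byte order. */
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    /** Decodes 4 bytes as a 32-bit integer, assuming the given byte order. */
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] littleEndian = encode(1934, ByteOrder.LITTLE_ENDIAN);
        // Correct interpretation: sender and receiver agree on the order.
        System.out.println(decode(littleEndian, ByteOrder.LITTLE_ENDIAN));
        // Wrong interpretation: the receiver assumes big-endian and gets a different value.
        System.out.println(decode(littleEndian, ByteOrder.BIG_ENDIAN));
    }
}
```

Agreeing on one external format, or tagging the message with the sender's format, removes this ambiguity.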

CORBA’s Common Data Representation (CDR)
Represent all of the data types that can be used as arguments and return values in remote invocations in CORBA 15 primitive types Short (16bit), long(32bit), unsigned short, unsigned long, float, char, … Constructed types Types that composed by several primitive types A message example The type of a data item is not given with the data representation in message It is assumed that the sender and recipient have common knowledge of the order and types of the data items in a message. RMI and RPC are on the contrary

CORBA CDR for constructed types
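Type: Representation
sequence: length (unsigned long) followed by elements in order
string: length (unsigned long) followed by characters in order (can also have wide characters)
array: array elements in order (no length specified because it is fixed)
struct: in the order of declaration of the components
enumerated: unsigned long (the values are specified by the order declared)
union: type tag followed by the selected member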

CORBA CDR message The flattened form represents a Person
struct Person { string name; string place; long year; };
The flattened form represents a Person struct with value: {'Smith', 'London', 1934}
index in sequence of bytes: 4 bytes (notes on representation)
0–3: 5 (length of string)
4–7: "Smit" ('Smith')
8–11: "h___"
12–15: 6 (length of string)
16–19: "Lond" ('London')
20–23: "on__"
24–27: 1934 (unsigned long)
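The byte layout on this slide can be reproduced with a short sketch. It follows the slide's simplified layout (a 4-byte length, the characters, zero padding to a 4-byte boundary, big-endian integers) and is not a complete CORBA CDR encoder; the class `CdrSketch` and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CdrSketch {
    /** Rounds a length up to the next multiple of 4. */
    static int padded(int length) {
        return (length + 3) & ~3;
    }

    /** Writes a string as: 4-byte length, characters, zero padding to a 4-byte boundary. */
    static void putString(ByteBuffer buf, byte[] chars) {
        buf.putInt(chars.length);
        buf.put(chars);
        // Skip over the padding bytes; a freshly allocated buffer is already zero-filled.
        buf.position(buf.position() + padded(chars.length) - chars.length);
    }

    /** Flattens the Person struct {name, place, year} in declaration order. */
    static byte[] marshalPerson(String name, String place, int year) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        byte[] p = place.getBytes(StandardCharsets.US_ASCII);
        // ByteBuffer is big-endian by default.
        ByteBuffer buf = ByteBuffer.allocate(4 + padded(n.length) + 4 + padded(p.length) + 4);
        putString(buf, n);
        putString(buf, p);
        buf.putInt(year);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] message = marshalPerson("Smith", "London", 1934);
        System.out.println(message.length + " bytes");
    }
}
```

No type information is written: the receiver can only unmarshal the bytes because it already knows that a Person is two strings followed by a long.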

Java object serialization
Serialization (deserialization) The activity of flattening an object or a connected set of objects into a serial form that is suitable for storing on the disk or transmitting in a message Include information about the class of each object Handles: references to other objects are serialized as handles Each object is written once only Example Make use of Java serialization ObjectOutputStream.writeObject, ObjectInputStream.readObject The use of reflection Reflection : The ability to enquire about the properties of a class, such as the names and types of its instance variables and methods Reflection makes it possible to do serialization (deserialization) in a completely generic manner Remote object reference An identifier for a remote object that is valid throughout a distributed system Must ensure uniqueness over space and time

Indication of Java serialized form
public class Person implements Serializable {
    private String name;
    private String place;
    private int year;
    public Person(String aName, String aPlace, int aYear) {
        name = aName;
        place = aPlace;
        year = aYear;
    }
    // followed by methods for accessing the instance variables
}
Person p = new Person("Smith", "London", 1934);

Serialized values (Explanation)
Person, 8-byte version number, h0 (class name, version number)
3, int year, java.lang.String name:, java.lang.String place: (number, type and name of instance variables)
1934, 5 Smith, 6 London, h1 (values of instance variables)
The true serialized form contains additional type markers; h0 and h1 are handles
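A round trip through `ObjectOutputStream.writeObject` and `ObjectInputStream.readObject` can be sketched as follows. The wrapper class `SerializationDemo`, the helper methods and the getters on `Person` are assumptions added for this illustration; the streams write to an in-memory byte array rather than a file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationDemo {
    static final class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        private final String place;
        private final int year;

        Person(String aName, String aPlace, int aYear) {
            name = aName;
            place = aPlace;
            year = aYear;
        }

        String getName() { return name; }
        String getPlace() { return place; }
        int getYear() { return year; }
    }

    /** Flattens an object (and everything it references) into a byte array. */
    static byte[] serialize(Serializable object) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds the object from its serialized form. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Person("Smith", "London", 1934));
        Person copy = (Person) deserialize(data);
        System.out.println(copy.getName() + ", " + copy.getPlace() + ", " + copy.getYear());
    }
}
```

Unlike the CDR form, the serialized bytes carry the class name and field descriptions, so the reader needs no prior agreement about the layout, only access to the class.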

Representation of a remote object reference
Internet address port number time object number interface of remote object 32 bits

The request – reply protocol
Usually implemented over UDP rather than TCP
  Acknowledgements are redundant, since requests are followed by replies
  Establishing a connection involves two extra pairs of messages in addition to the pair required for a request and a reply
  Flow control is redundant for the majority of invocations, which pass only small arguments and results
Request-reply message structure
  requestID: used to detect duplicated requests and delayed replies

Request-reply communication
Server Client doOperation (wait) (continuation) Reply message getRequest execute method select object sendReply public byte[] doOperation (RemoteObjectRef o, int methodId, byte[] arguments) sends a request message to the remote object and returns the reply. The arguments specify the remote object, the method to be invoked and the arguments of that method. public byte[] getRequest (); acquires a client request via the server port. public void sendReply (byte[] reply, InetAddress clientHost, int clientPort); sends the reply message reply to the client at its Internet address and port.

Request-reply message structure
messageType requestId objectReference methodId arguments int (0=Request, 1= Reply) int RemoteObjectRef int or Method array of bytes

The request – reply protocol … continued
Failure model
  Timeout: doOperation returns an exception when repeated requests have all timed out
  Duplicate request messages: filter out duplicates by requestID; if the server has not yet sent the reply, it transmits the reply after finishing executing the operation
  Lost reply messages: if the server has already sent the reply, it executes the operation again to obtain the result. This is safe only for idempotent operations, e.g. adding an element to a set; a counter-example is appending an item to a sequence
  History: the server keeps a record of reply messages that have been transmitted, to avoid re-execution of operations
Implementing the request-reply protocol on TCP
  Costly, but the request-reply protocol need not deal with retransmission and filtering
  Successive requests and replies can use the same stream to reduce connection overhead
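The history mechanism can be sketched as a table of saved replies. This is a simplified, single-client illustration: the class `ReplyHistory` is hypothetical, replies are keyed by requestId alone (a real server would also need to identify the client), and entries are never discarded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ReplyHistory {
    // Replies already sent, keyed by request identifier.
    private final Map<Integer, byte[]> history = new HashMap<>();
    private int executions = 0;

    /**
     * Handles a request. A duplicate requestId gets the saved reply back,
     * so the operation itself is executed at most once.
     */
    byte[] handle(int requestId, Supplier<byte[]> operation) {
        byte[] saved = history.get(requestId);
        if (saved != null) {
            return saved; // duplicate request: retransmit the reply without re-executing
        }
        byte[] reply = operation.get();
        executions++;
        history.put(requestId, reply);
        return reply;
    }

    int executions() {
        return executions;
    }

    public static void main(String[] args) {
        ReplyHistory server = new ReplyHistory();
        Supplier<byte[]> append = () -> new byte[] {42}; // stands in for a non-idempotent operation
        server.handle(7, append);
        server.handle(7, append); // retransmitted request
        System.out.println("executions: " + server.executions());
    }
}
```

Without the history, the retransmitted request would run the append a second time, which is exactly the hazard the slide describes for non-idempotent operations.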

HTTP: an example of a request – reply protocol
Over TCP Each client-server interaction consists of the following steps The client requests and the server accepts a connection at the default server port or at a port specified in the URL The client sends a request message to the server The server sends a reply message to the client The connection is closed Persistent connection Connections that remain open over a series of request-reply exchanges between client and server Marshalling Request and replies are marshalled into messages as ASCII text string Resources are represented as byte sequences and may be compressed HTTP Methods GET, HEAD, POST, PUT, DELETE, OPTIONS, TRACE HTTP Request and reply messages

HTTP request / reply messages
GET // HTTP/ 1.1 URL or pathname method HTTP version headers message body HTTP/1.1 200 OK resource data HTTP version status code reason headers message body

Fault tolerance based on replicated services
The usage of Multicast Fault tolerance based on replicated services Client request are multicast to all the members of the group, each of which performs an identical operation Finding the discovery servers in spontaneous networking Multicast message can be used by servers and clients to locate available discovery services to register their interfaces or to look up the interfaces of other services Better performance through replicated data Data are replicated to increase the performance of a service, e.g., Web Cache. Each time the data changes, the new value is multicast to the processes managing the replicas Propagation of event notification Multicast to a group may be used to notify processes when something happens, e.g., the Jini system uses multicast to inform interested clients when new lookup services advertise their existence

IP Multicast – an implementation of group communication
A multicast group is specified by a class D Internet address Built on top of IP Available only via UDP The membership of a group is dynamic It is possible to send datagram to a multicast group without being a member IPv4 Multicast routers use the broadcast capability of the local network MTTL - specify the number of routers a multicast message is allowed to pass Multicast address allocation Permanent group – to Temporary group – the other addresses, set TTL to a small value Failure model: due to UDP, so it is a unreliable multicast Java API to IP multicast

Multicast peer joins a group and sends and receives datagrams
import java.net.*;
import java.io.*;

public class MulticastPeer {
    public static void main(String args[]) {
        // args give message contents & destination multicast group (e.g. " ")
        MulticastSocket s = null;
        try {
            InetAddress group = InetAddress.getByName(args[1]);
            s = new MulticastSocket(6789);
            s.joinGroup(group);
            byte[] m = args[0].getBytes();
            DatagramPacket messageOut = new DatagramPacket(m, m.length, group, 6789);
            s.send(messageOut);
            // this figure continued on the next slide

Multicast peers example… continued
            // get messages from others in group
            byte[] buffer = new byte[1000];
            for (int i = 0; i < 3; i++) {
                DatagramPacket messageIn = new DatagramPacket(buffer, buffer.length);
                s.receive(messageIn);
                // decode only the bytes actually received
                System.out.println("Received:" + new String(messageIn.getData(), 0, messageIn.getLength()));
            }
            s.leaveGroup(group);
        } catch (SocketException e) {
            System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("IO: " + e.getMessage());
        } finally {
            if (s != null) s.close();
        }
    }
}

Reliability and ordering of multicast
Failures Router failure prevent all recipients beyond it receiving the message Members of a group receive the same array of messages in different orders Some examples of the effects of reliability and ordering Fault tolerance based on replicated services if one of them misses a request, it will become inconsistent with the others Finding the discovery servers in spontaneous networking an occasional lost request is not an issue when locating a discovery server Reliable multicast or unreliable multicast? According to application’s requirement

Datagram communication
UNIX socket Datagram communication Datagram Socket Bind Sendto recvfrom Stream communication stream socket , bind Accept Connect Write and read

Sockets used for datagrams
ServerAddress and ClientAddress are socket addresses Sending a message Receiving a message bind(s, ClientAddress) sendto(s, "message", ServerAddress) bind(s, ServerAddress) amount = recvfrom(s, buffer, from) s = socket(AF_INET, SOCK_DGRAM, 0)

Sockets used for streams
Requesting a connection Listening and accepting a connection bind(s, ServerAddress); listen(s,5); sNew = accept(s, ClientAddress); n = read(sNew, buffer, amount) s = socket(AF_INET, SOCK_STREAM,0) connect(s, ServerAddress) write(s, "message", length) ServerAddress and ClientAddress are socket addresses

Two alternative building blocks
Summary Two alternative building blocks Datagram Socket: based on UDP, efficient but suffer from failures Stream Socket: based on TCP, reliable but expensive Marshalling CORBA’s CDR and Java serialization Request-Reply protocol Base on UDP or TCP Multicast IP multicast is a simple multicast protocol

Communication between distributed objects Remote procedure call
Chapter 5: Distributed objects and remote invocation Introduction Communication between distributed objects Remote procedure call Events and notifications Java RMI case study Summary

Provide a programming model Provide transparence
Middleware Layers of Middleware Provide a programming model Provide transparence Location Communication protocols Computer hardware Operating systems Programming languages

Distributed programming model
Remote procedure call (RPC) call procedure in separate process Remote method invocation (RMI) extension of local method invocation in OO model invoke the methods of an object of another process Event-based model Register interested events of other objects Receive notification of the events at other objects

Interface in distributed system
Interfaces Interface Specifies accessible procedures and variables Inner alteration won’t affect the user of the interface Interface in distributed system Can’t access variables directly Input argument and output argument Pointers can’t be passed as arguments or returned results

RPC’s Service interface
Interface cases RPC’s Service interface specification of the procedures of the server input and output arguments of each procedure RMI’s Remote interface Specification of the methods of an object that are available for objects in other processes may pass objects or remote object references as arguments or returned result Interface definition languages program language, e.g. Java RMI Interface definition language (IDL), e.g. CORBA IDL, DCE IDL and DCOM IDL

Discuss RMI under following headings
The object model Distributed objects The distributed object model Design issues semantics of remote invocations Implementation RMI above the request-reply protocol Distributed garbage collections

The object model Object references Interfaces Actions
Objects can be accessed via object references
  First-class values
Interfaces
  A definition of the signatures of a set of methods
  No constructors
  A class can implement several interfaces, e.g. Java
Actions
  Initiated by an object invoking a method in another object
  Two effects
    Change the state of the receiver
    Further invocations on methods in other objects

The object model … continued
Exceptions mechanism A clean way to deal with error conditions List exceptions at the method head throw user know exceptions Catch exceptions Garbage collection Freeing the space occupied by cancelled objects C++: collected by programmers Java: collected by JVM

Benefits of distributed objects
Natural extension physical distribution of objects into different processes or computers in a distributed system Benefits of distributed objects Enforce encapsulation can’t access variables directly Support heterogeneous systems Local object cache Assume other architectural models Replicated objects Migrated objects

The distributed objects model
Remote object reference
  A unique identifier in a distributed system
  May be passed as arguments and results of remote method invocations
Remote interface
  The remote object's class implements the methods of its remote interface
Actions in a distributed system
  May incur a chain of invocations on different computers
Garbage collection
  Usually based on reference counting
Exceptions
  The client is notified and handles the exceptions

Design Issues – Invocation semantics
Choices for different delivery guarantees: retry request message, duplicate filtering, retransmission of results
Three different semantics, by fault tolerance measures:
Maybe: retransmit request message: no; duplicate filtering: not applicable; re-execute procedure or retransmit reply: not applicable
At-least-once: retransmit request message: yes; duplicate filtering: no; re-execute procedure
At-most-once: retransmit request message: yes; duplicate filtering: yes; retransmit reply

Different invocation semantics
Maybe For invoker: executed once, or not at all Suffers from: (1) lost messages; (2) server crashes Useful for applications in which occasional failed invocations are acceptable At least once For invoker: executed at least once, or an exception Suffers from: (1) server crashes; (2) arbitrary failures for non-idempotent methods At most once For invoker: receives result, or an exception Prevents: omission failures by retrying; arbitrary failures by never executing a method more than once
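A minimal sketch of how duplicate filtering plus a reply history can give at-most-once behaviour. The class AtMostOnceFilter and its methods are hypothetical names for illustration, not part of Java RMI; request identifiers are assumed to be unique per invocation and reused on retransmission.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical server-side filter: remembers the reply for each request
// identifier so that a retransmitted request is never executed twice.
class AtMostOnceFilter {
    private final Map<Long, Integer> replyHistory = new HashMap<>();
    private final IntUnaryOperator procedure;
    private int executions = 0;

    AtMostOnceFilter(IntUnaryOperator procedure) {
        this.procedure = procedure;
    }

    // Handle a request; a duplicate requestId gets the stored reply
    // instead of a second execution of the procedure.
    synchronized int handle(long requestId, int argument) {
        Integer saved = replyHistory.get(requestId);
        if (saved != null) {
            return saved;                 // duplicate: retransmit the reply
        }
        executions++;
        int result = procedure.applyAsInt(argument);
        replyHistory.put(requestId, result);
        return result;
    }

    // Number of times the procedure itself has actually run.
    synchronized int executions() {
        return executions;
    }

    public static void main(String[] args) {
        int[] balance = {100};
        // A non-idempotent operation: each execution changes the balance.
        AtMostOnceFilter filter = new AtMostOnceFilter(amount -> balance[0] += amount);
        filter.handle(1L, 10);
        filter.handle(1L, 10);            // retransmission of the same request
        System.out.println(balance[0] + " after " + filter.executions() + " execution(s)");
    }
}
```

Without the reply history, the retransmitted deposit would run twice, which is the at-least-once behaviour that is only safe for idempotent operations.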

Design Issues - Transparency
What can be made transparent Marshalling Message passing Object locating and contacting What can’t be made transparent Vulnerability to failure Latency Current consensus Transparent in syntax, with the differences from local invocation expressed in the interface

Remote reference module
Implementation of RMI The inner scene of RMI Communication module Request/reply between client and server Select dispatcher at server side Remote reference module Translate between local and remote object reference Create remote object reference Remote object table entries for remote objects held by the process entries for local proxies

Implementation of RMI – RMI software
Proxy forward invocation to remote object one remote object one proxy Skeleton implement the method in the remote interface unmarshal the arguments in the request invoke the corresponding method in the remote object wait for the invocation complete marshal the result in the reply message Dispatcher select appropriate method in the skeleton one dispatcher and skeleton for one remote object

Implementation of RMI - execution
The classes for proxies, dispatchers and skeletons generated automatically by an interface compiler, e.g. rmic Server program create and initialize at least one of the remote objects register Client program look up the remote object references invoke The binder Maintain mapping information of textual names to remote object references

Implementation of RMI - Object state
Activation of remote objects to avoid resource waste, the servers can be started whenever they are needed a remote object could be active or passive Persistent object stores Persistent object an object that is guaranteed to live between activations of processes different passivate strategies at the end of a transaction when the program exit E.g., Persistent Java, PerDiS

Distributed garbage collection
The aim of a distributed garbage collector Retain an object (local or remote) while it is still referenced Collect the object when no one holds a reference to it Java distributed garbage collection algorithm Based on reference counting Server maintains the set of processes that hold remote object references to each of its remote objects Client notifies the server to modify the process set When the process set becomes empty, the server’s local garbage collector reclaims the space Leases in Jini Lease: the granting of the use of a resource for a period of time Avoids the need to discover whether the resource users are still interested or whether their programs have exited
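The holder-set bookkeeping described above can be sketched as follows. HolderTable, addRef and removeRef are hypothetical names chosen for illustration and are not the Java RMI API; object and process identifiers are assumed to be plain strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical server-side table for reference-counting distributed
// garbage collection: one holder set per remote object.
class HolderTable {
    private final Map<String, Set<String>> holders = new HashMap<>();

    // Called when a client process first receives a reference to the object.
    synchronized void addRef(String objectId, String clientProcess) {
        holders.computeIfAbsent(objectId, k -> new HashSet<>()).add(clientProcess);
    }

    // Called when a client's local collector finds its proxy unreachable.
    // Returns true when no remote holders remain, so the object may be
    // reclaimed once there are no local references either.
    synchronized boolean removeRef(String objectId, String clientProcess) {
        Set<String> set = holders.get(objectId);
        if (set == null) {
            return true;
        }
        set.remove(clientProcess);
        if (set.isEmpty()) {
            holders.remove(objectId);
            return true;
        }
        return false;
    }

    // Number of distinct client processes currently holding a reference.
    synchronized int holderCount(String objectId) {
        Set<String> set = holders.get(objectId);
        return set == null ? 0 : set.size();
    }
}
```

Using a set of process names rather than a plain counter means a repeated addRef from the same process does not inflate the count.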

RPC is very similar to RMI
Service interface: the procedures that are available for remote calling Invocation semantics choice: at-least-once or at-most-once Generally implemented over a request-reply protocol Building blocks Communication module Client stub procedure (like the proxy in RMI): marshalling, sending, unmarshalling Dispatcher: selects one of the server stub procedures Server stub procedure (like the skeleton in RMI): unmarshalling, calling, marshalling
[Figure: client process (client program, client stub procedure, communication module) exchanging Request and Reply with the server process (communication module, dispatcher, server stub procedure, service procedure)]

XDR - Interface definition language
Sun RPC case study Designed for NFS at-least-once semantics XDR - Interface definition language Interface name: Program number, version number Procedure identifier: procedure number Rpcgen – generator of RPC components client stub procedure server main procedure Dispatcher server stub procedure marshalling and unmarshalling procedure

Sun RPC case study …continued
Binding – portmapper Server: register ((program number, version number), port number) Client: request port number by (program number, version number) Authentication Each request contains the credentials of the user, e.g. uid and gid of the user Access control according to the credential information

Event-notification model
Idea One object reacts to a change occurring in another object Event examples Modification of a document An electronically tagged book being at a new location Publish/subscribe paradigm Event generators publish the types of events they make available Event receivers subscribe to the types of events that are of interest to them When an event occurs, the receivers are notified Distributed event-based systems: two characteristics Heterogeneous Asynchronous

Example - dealing room system
Requirements allow dealers to see the latest market price information System components Information provider receive new trading information publish stocks prices event stock price update notification Dealer process subscribe stocks prices event System architecture

Architecture for distributed event notification
Event service: maintains a database of published events and of subscribers’ interests Decouples the publishers from the subscribers
[Figure: object of interest, observer and subscriber exchanging notifications through the event service, in three numbered cases]

The roles of the participating objects
The object of interest its changes of state might be of interest to other objects Event the completion of a method execution Notification an object that contains information about an event Subscriber an object that has subscribed to some type of events in another object Observer objects the main purpose is to decouple an object of interest from its subscribers Publisher an object that declares that it will generate notifications of particular types of event
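A minimal in-process sketch of the publish/subscribe roles listed above. EventService, subscribe and publish are hypothetical names for illustration; a real distributed event service would deliver notifications by remote invocation rather than by local calls, and event types are assumed here to be plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical event service: records subscribers' interests by event type
// and forwards each published notification to the interested subscribers.
class EventService {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A subscriber registers interest in one type of event.
    synchronized void subscribe(String eventType, Consumer<String> subscriber) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(subscriber);
    }

    // The object of interest publishes a notification; returns the number
    // of subscribers that were notified.
    synchronized int publish(String eventType, String notification) {
        List<Consumer<String>> interested =
                subscribers.getOrDefault(eventType, new ArrayList<>());
        for (Consumer<String> subscriber : interested) {
            subscriber.accept(notification);
        }
        return interested.size();
    }

    public static void main(String[] args) {
        EventService service = new EventService();
        service.subscribe("stock-price", n -> System.out.println("dealer got: " + n));
        service.publish("stock-price", "ACME 42.10");
    }
}
```

The publisher never sees the subscriber list, which is the decoupling the event service is meant to provide.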

Notification delivery
Delivery semantics Unreliable Reliable real-time Roles for observers Forwarding send notifications to subscribers on behalf of one or more objects of interests Filtering of notifications Patterns of events Notification mailboxes notification be delayed until subscriber being ready to receive

Jini distributed event specification
EventGenerator interface Provide register method Event generator implement it Subscriber invoke it to subscribe to the interested events RemoteEventListener interface Provide notify method subscriber implement it receive notifications when the notify method is invoked RemoteEvent a notification that is passed as argument to the notify method Third-party agents interpose between an object of interest and a subscriber equivalent of observer

Arguments and return results of remote method
Java RMI introduction Remote object Must implement the remote interface Must handle remote exceptions Arguments and return results of remote method Must be serializable All primitive types are serializable Remote objects are serializable File handles are not serializable Remote objects are passed as remote object references; non-remote serializable objects are copied and passed by value RMIregistry Accessed through the Naming class

Example: shared whiteboard
Remote Interface Server program and Client program Callbacks A server’s action of notifying clients about an event Implementation Client creates a remote object Client passes the remote object reference to the server Whenever an event occurs, the server calls the client via the remote object Advantages Improves performance by avoiding constant polling Delivers information in a timely manner

Java classes supporting RMI
Design and implementation of Java RMI Java classes supporting RMI RemoteServer UnicastRemoteObject <servant class> Activatable RemoteObject

Summary Two paradigms for distributed programming RMI Sun RPC
RMI(RPC)/Event notification: sync./async. RMI Distributed object model Remote interface, remote exception, naming service Remote invocation semantics Maybe, at-least-once, at-most-once Example: whiteboard based on Java RMI Sun RPC Event-notification Publish/subscribe Event service Example: dealing room

Middleware layers Applications RMI, RPC and events Middleware
Request reply protocol External data representation Operating System RMI, RPC and events

CORBA IDL example
// In file Person.idl
struct Person {
    string name;
    string place;
    long year;
};
interface PersonList {
    readonly attribute string listname;
    void addPerson(in Person p);
    void getPerson(in string name, out Person p);
    long number();
};

A remote object and its remote interface
[Figure: a remote object holding data and the implementation of its methods, with its remote interface]

Remote and local method invocations
[Figure: remote and local invocations between objects, labelled B, C, D, E and F]

The role of proxy and skeleton in remote method invocation
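[Figure: on the client, object A calls the proxy for B, which uses the remote reference module and communication module to send the Request; on the server, the communication module, remote reference module, and the skeleton and dispatcher for B’s class deliver it to remote object B, and the Reply returns the same way]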

Files interface in Sun XDR
const MAX = 1000;
typedef int FileIdentifier;
typedef int FilePointer;
typedef int Length;
struct Data {
    int length;
    char buffer[MAX];
};
struct writeargs {
    FileIdentifier f;
    FilePointer position;
    Data data;
};
struct readargs {
    FileIdentifier f;
    FilePointer position;
    Length length;
};
program FILEREADWRITE {
    version VERSION {
        void WRITE(writeargs) = 1;
        Data READ(readargs) = 2;
    } = 2;
} = 9999;

Dealing room system
[Figure: external sources feed information providers, which send notifications to the dealer processes on the dealers’ computers]

The Naming class of Java RMIregistry
void rebind(String name, Remote obj)
This method is used by a server to register the identifier of a remote object by name, as shown in Figure 15.13, line 3.
void bind(String name, Remote obj)
This method can alternatively be used by a server to register a remote object by name, but if the name is already bound to a remote object reference an exception is thrown.
void unbind(String name)
This method removes a binding.
Remote lookup(String name)
This method is used by clients to look up a remote object by name, as shown in Figure line 1. A remote object reference is returned.
String[] list(String name)
This method returns an array of Strings containing the names bound in the registry identified by name.

Java Remote interfaces Shape and ShapeList
import java.rmi.*;
import java.util.Vector;
public interface Shape extends Remote {
    int getVersion() throws RemoteException;
    GraphicalObject getAllState() throws RemoteException;
}
public interface ShapeList extends Remote {
    Shape newShape(GraphicalObject g) throws RemoteException;
    Vector allShapes() throws RemoteException;
    int getVersion() throws RemoteException;
}

Java class ShapeListServant implements interface ShapeList
import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;
import java.util.Vector;
public class ShapeListServant extends UnicastRemoteObject implements ShapeList {
    private Vector theList; // contains the list of Shapes
    private int version;
    public ShapeListServant() throws RemoteException {...}
    public Shape newShape(GraphicalObject g) throws RemoteException {
        version++;
        Shape s = new ShapeServant(g, version);
        theList.addElement(s);
        return s;
    }
    public Vector allShapes() throws RemoteException {...}
    public int getVersion() throws RemoteException {...}
}

Java class ShapeListServer with main method
import java.rmi.*;
public class ShapeListServer {
    public static void main(String args[]) {
        System.setSecurityManager(new RMISecurityManager());
        try {
            ShapeList aShapeList = new ShapeListServant(); // 1
            Naming.rebind("Shape List", aShapeList); // 2
            System.out.println("ShapeList server ready");
        } catch (Exception e) {
            System.out.println("ShapeList server main " + e.getMessage());
        }
    }
}

Java client of ShapeList
import java.rmi.*;
import java.rmi.server.*;
import java.util.Vector;
public class ShapeListClient {
    public static void main(String args[]) {
        System.setSecurityManager(new RMISecurityManager());
        ShapeList aShapeList = null;
        try {
            aShapeList = (ShapeList) Naming.lookup("//bruno.ShapeList"); // 1
            Vector sList = aShapeList.allShapes(); // 2
        } catch (RemoteException e) {
            System.out.println(e.getMessage());
        } catch (Exception e) {
            System.out.println("Client: " + e.getMessage());
        }
    }
}

Callback mechanism in the whiteboard system
Client-created remote object:
public interface WhiteboardCallback extends Remote {
    void callback(int version) throws RemoteException;
}
Methods added to the ShapeList interface:
int register(WhiteboardCallback callback) throws RemoteException;
void deregister(int callbackID) throws RemoteException;

Chapter 6: Operating System Support
Introduction The operating system layer Protection Processes and Threads Communication and invocation Operating system architecture Summary

Network OS & Distributed OS
Network operating system network capability access remote resources, e.g., NFS, rlogin, telnet Multiple system images, one per node Examples: Windows NT, Unix Distributed operating system Single system image: transparency

Middleware and network OS
No distributed OS in general use Users have much invested in their application software Users tend to prefer to have a degree of autonomy for their machines Network OS: meet the requirements of middleware Efficient and robust access to physical resources Flexibility to implement a variety of resource management policies

The relationship between OS and Middleware
Operating System Tasks: processing, storage and communication Components: kernel, library, user-level services Middleware runs on a variety of OS-hardware combinations remote invocations Architecture

Functions that OS should provide for middleware
Encapsulation provide a set of operations that meet their clients’ needs Protection protect resource from illegitimate access Concurrent processing support clients access resource concurrently Supports for RMI Communication in RMI Pass operation parameters and results Scheduling in RMI Schedule the processing of the invoked operation

The core OS components Process manager Thread manager
Handles the creation of and operations upon processes. Thread manager Thread creation, synchronization and scheduling Communication manager Communication between threads attached to different processes on the same computer Memory manager Management of physical and virtual memory Supervisor Dispatching of interrupts, system call traps and other exceptions control of memory management unit and hardware caches processor and floating point unit register manipulations Figure

Illegitimate access Maliciously contrived code Benign code
contains a bug have unanticipated behavior Example: read and write in File System Illegal user vs. access right control Access the file pointer variable directly (setFilePointerRandomly) vs. type-safe language Type–safe language, e.g. Java Non-type-safe language, e.g. C or C++

Kernel and Protection Kernel Different execution mode
always runs complete access privileges for the physical resources Different execution mode supervisor mode (kernel process) / user mode (user process) Interface between kernel and user processes: system call trap Kernel design is a good choice for protection The price for protection switching between different processes take many processor cycles a system call trap is a more expensive operation than a simple method call

Concepts Process Thread
A unit of resource management, a single activity Problem: sharing between related activities is awkward and expensive Nowadays, a process consists of an execution environment together with one or more threads Thread Abstraction of a single activity Objective Maximize the degree of concurrent execution between operations E.g. overlap of computation with input and output E.g. concurrent processing on multiple processors within a process

Concepts … continued Execution environment
the unit of resource management Consist of An address space Thread synchronization and communication resources such as semaphores and communication interfaces (e.g. sockets) Higher-level resources such as open files and windows Shared by threads within a process, some times across process Heavyweight process / lightweight process

Address space Address space Region UNIX address space
A unit of management of a process’s virtual memory 2^32 bytes, 2^64 bytes Consists of one or more regions Region An area of contiguous virtual memory that is accessible by the threads of the owning process UNIX address space

Address space … continued
The number of regions is indefinite Support a separate stack for each thread access mapped file Share memory between processes The uses of shared region Libraries Kernel Data sharing and communication between two processes, or between process and kernel

Creation of new process in distributed system
Traditional process creation Fork, exec in Unix Process creation in distributed system The choice of a target host The creation of an execution environment, an initial thread Creation of a new execution environment initializing the address space Statically defined format With respect to an existing execution environment, e.g. fork Copy-on-write scheme

Choice of process host Choice of process host Load sharing policy
Run new processes at their originator’s computer Load sharing between a set of computers Load sharing policy Transfer policy: situate a new process locally or remotely? Location policy: which node should host the new process? Static policy Adaptive policy Migration policy: when and where should a running process be migrated? Load sharing system Centralized Hierarchical Decentralized

Threads – an example Example: client and server with threads
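Given: tp = 2 ms (processing time per request), tio = 8 ms (disk I/O time per request) Question: what is T, the maximum server throughput in requests per second?
[Figure: a client in which thread 1 generates results and thread 2 makes requests to the server; a server with N threads, showing receipt and queuing of requests and input-output]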

Different server implementations (1, 2)
A single thread: T = 1000/(2+8) = 100 requests/s; requests are handled one by one Two threads: T = 1000/8 = 125 requests/s; processing of one request can overlap the disk I/O of another, so the 8 ms disk access is the bottleneck

Different server implementations (3, 4)
Two threads, disk cache (75% hit rate): T = 1000/2.5 = 400 requests/s; average tio = 75%*0 + 25%*8 = 2 ms; searching the cache increases the processing delay, say to tp = 2.5 ms, which becomes the bottleneck Up to two threads, disk cache, two processors: T = 1000/2 = 500 requests/s; processing of different requests can be overlapped, so the 2 ms disk access is the bottleneck again
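The four throughput figures can be reproduced with a few lines of arithmetic. This is a sketch under the slides' assumptions; ServerThroughput and its method names are made up for illustration, and case 4 assumes that two processors halve the effective processing time per request.

```java
// Worked arithmetic for the four server configurations, in requests per second.
// Once stages overlap, the slowest stage per request bounds the throughput.
class ServerThroughput {
    // One thread: processing and disk I/O happen strictly one after the other.
    static double singleThread(double tpMs, double tioMs) {
        return 1000.0 / (tpMs + tioMs);
    }

    // Enough threads to overlap the stages: the slower stage is the bottleneck.
    static double overlapped(double tpMs, double tioMs) {
        return 1000.0 / Math.max(tpMs, tioMs);
    }

    // Average disk time per request when a fraction hitRate is served from cache.
    static double cachedIoMs(double tioMs, double hitRate) {
        return (1.0 - hitRate) * tioMs;
    }

    public static void main(String[] args) {
        System.out.println("case 1: " + singleThread(2, 8));
        System.out.println("case 2: " + overlapped(2, 8));
        System.out.println("case 3: " + overlapped(2.5, cachedIoMs(8, 0.75)));
        // Case 4 (assumption): two processors halve the effective processing time.
        System.out.println("case 4: " + overlapped(2.5 / 2, cachedIoMs(8, 0.75)));
    }
}
```

In case 3 the processor is the bottleneck (2.5 ms against 2 ms of disk time); adding a second processor in case 4 moves the bottleneck back to the disk.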

Architectures for multi-threaded servers
Worker pool Server creates a fixed pool of “worker” threads to process the requests when it starts up Pro: simple Cons Inflexibility: the number of worker threads may not match the current number of requests High level of switching between the I/O and worker threads Thread-per-request Server spawns a new worker thread for each new request and destroys it when the request processing finishes Pro: throughput is potentially maximized Con: overhead of thread creation and destruction

Architectures for multi-threaded servers … continued
Thread-per-connection Server creates a new worker thread when client creates a connection, destroys the thread when the client closes the connection Pro: lower thread management overheads compared with the thread-per-request Con: client may be delayed while a worker thread has several outstanding requests but another thread has no work to perform Thread-per-object Associate a thread with each remote object Pro&Con are similar to thread-per-connection
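As one illustration of these architectures, the worker-pool design can be sketched with the standard java.util.concurrent executor framework. WorkerPoolServer and serve are hypothetical names, and squaring a number stands in for real request processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical worker-pool server: a fixed number of worker threads is
// created up front and shared by all incoming requests.
class WorkerPoolServer {
    // Processes every request on the pool and returns replies in request order.
    static List<Integer> serve(List<Integer> requests, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Integer request : requests) {
                // Queue the request; an idle worker thread picks it up.
                pending.add(pool.submit(() -> request * request));
            }
            List<Integer> replies = new ArrayList<>();
            for (Future<Integer> reply : pending) {
                replies.add(reply.get());   // wait for the worker to finish
            }
            return replies;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> requests = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            requests.add(i);
        }
        System.out.println(serve(requests, 2));
    }
}
```

A thread-per-request server would instead create one thread per submitted task, trading the queueing delay of a fixed pool for thread creation and destruction overhead.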

Threads within clients
The client example First thread: generates results to be passed to a server by remote method invocation, but does not need a reply Second thread: perform the remote method and block while the first thread is able to continue computing further results Web browser Multiple threads handle multiple concurrent requests for web pages

Threads versus multiple processes
Main state components of Execution Environment and Thread The comparison of processes and threads Creating a new thread within an existing process is cheaper than creating a process: 1 ms vs. 11 ms Switching to a different thread within the same process is cheaper than switching between threads belonging to different processes: 0.4 ms vs. 1.8 ms Threads within a process may share data and other resources conveniently and efficiently compared with separate processes But threads within a process are not protected from one another

Threads programming Concurrent programming The Java thread class
Race condition, critical section, monitor, condition variable, semaphore C Threads package or pthreads for C, Java The Java thread class

Thread synchronization
Variables Each thread’s local variables in methods are private to it Threads have no private copies of static variables or object instance variables Java synchronized methods Example: multiple threads manipulate a queue Mutual exclusion Synchronized addTo() and removeFrom() methods in the queue class Producer-consumer wait(): block waiting on a condition variable notify(): unblock a waiting thread
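The queue example can be sketched as below, using Java's built-in monitor methods. SharedQueue and demoSum are hypothetical names; the sketch uses notifyAll() rather than notify(), a common defensive choice when more than one consumer might be waiting.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical shared queue protected by the object's monitor.
class SharedQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    // Mutual exclusion: only one thread at a time runs a synchronized method.
    synchronized void addTo(T item) {
        items.addLast(item);
        notifyAll();                   // wake any consumer blocked in removeFrom()
    }

    // Blocks until an item is available, then removes and returns it.
    synchronized T removeFrom() throws InterruptedException {
        while (items.isEmpty()) {      // re-check the condition after each wake-up
            wait();
        }
        return items.removeFirst();
    }

    // Runs one producer thread adding 1..n and consumes n items on the caller.
    static int demoSum(int n) {
        SharedQueue<Integer> queue = new SharedQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= n; i++) {
                queue.addTo(i);
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += queue.removeFrom();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + demoSum(3));
    }
}
```

The while loop around wait() matters: a woken thread must re-test the condition because another consumer may have emptied the queue first.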

Thread scheduling Preemptive scheduling Non-preemptive scheduling
A thread may be suspended at any point to make way for another thread Non-preemptive scheduling A thread runs until it makes a call to the threading system, when the system may de-schedule it and schedule another thread to run Avoids race conditions Can’t take advantage of a multiprocessor, since a thread runs exclusively The programmer needs to insert yield() calls

Threads implementation
Kernel (e.g., Windows NT, Solaris, Mach) / User level (runtime library) User-level implementation Cons: Threads within a process can’t take advantage of multiprocessors A thread that takes a page fault blocks the entire process Threads within different processes can’t be scheduled according to a single scheme of relative prioritization Pros: Less costly, e.g. no system call Customization More user-level threads can be supported Hybrid approach Two-tier scheduling The kernel provides access to multiple processors, while user-level code handles the details of scheduling policy Solaris 2 operating system Lightweight processes (kernel-level threads) and user-level threads

Communication primitives & protocols
TCP(UDP) Socket in Unix and Windows DoOperation, getRequest, sendReply in Amoeba Group communication primitives in V system Protocols and openness provide standard protocols that enable internetworking between middleware integrate low-level protocols without upgrading their application Static stack new layer to be integrated statically as a “driver” Dynamic stack protocol stack be composed on the fly E.g. web browser utilize wide-area wireless link on the road and faster Ethernet connection in the office

Invocation performance
Invocation costs Different invocations The factors that matter synchronous/asynchronous, domain transition, communication across a network, thread scheduling and switching Invocation over the network Delay: the total RPC call time experienced by a client Latency: the fixed overhead of an RPC, measured by null RPC Throughput: the rate of data transfer between computers in a single RPC An example Threshold: one extra packet to be sent, an extra acknowledge packet is needed

Main components accounting for remote invocation delay
Marshalling & unmarshalling Copy and convert data A significant overhead as the amount of data grows Data copying Across the user-kernel boundary Across each protocol layer (e.g., RPC/UDP/IP/Ethernet) Between the network interface and kernel buffers Packet initialization Initializing protocol headers and trailers Checksums: proportional to the amount of data sent Thread scheduling and context switching System calls as stubs invoke the kernel’s communication operations One or more server threads Network manager process and threads Waiting for acknowledgements Network transmission

Improve the performance of RPC
Memory sharing Rapid communication between processes in the same computer Choice of protocol TCP/UDP Persistent connections: several invocations during one connection OS’s buffer: collect several small messages and send them together Invocation within a computer Most cross-address-space invocations take place within a computer LRPC (lightweight RPC)

Asynchronous operation
Performance characteristics of the Internet High latencies, low bandwidths and high server loads Network disconnection and reconnection These can outweigh any benefits that the OS can provide Asynchronous operation Concurrent invocations E.g., the browser fetches multiple images in a home page by concurrent GET requests Asynchronous invocation: non-blocking call E.g., CORBA oneway invocation: maybe semantics, collect result by a separate call

Asynchronous operation … continued
Persistent asynchronous invocations Designed for disconnected operation Try indefinitely to perform the invocation, until it is known to have succeeded or failed, or until the application cancels the invocation QRPC (Queued RPC) Client queues outgoing invocation requests in a stable log Server queues invocation results The issue for programmers How can the user continue while the results of invocations are still not known?

Monolithic kernels and microkernels
Monolithic kernel Kernel is massive: performs all basic operating system functions, megabytes of code and data Kernel is undifferentiated: coded in a non-modular way E.g. Unix Pros: efficiency Cons: lack of structure Microkernel Kernel provides only the most basic abstractions: address spaces, threads and local interprocess communication All other system services are provided by servers that are dynamically loaded E.g., VM of IBM 370 Pros: extensibility, modularity, a small kernel is more likely to be free of bugs Cons: relative inefficiency Hybrid approaches

Summary Process & thread Remote invocation cost OS architecture
A process consists of an execution environment and one or more threads Multiple threads: cheaper concurrency, take advantage of multiprocessors for parallelism Remote invocation cost Marshalling & unmarshalling Data copying Packet initialization Thread scheduling and context switching Network transmission OS architecture Monolithic kernel & microkernel

System layers

Core OS functionality

Address space

Copy-on-write scheme
[Figure: (a) before write and (b) after write; region RB in process B’s address space is copied from region RA in process A’s address space, with A’s and B’s page tables in the kernel pointing to shared frames]

Alternative server threading architecture

State associated with execution environments and threads
Address space tables Saved processor registers Communication interfaces, open files Priority and execution state (such as BLOCKED ) Semaphores, other synchronization objects Software interrupt handling information List of thread identifiers Execution environment identifier Pages of address space resident in memory; hardware cache entries

Java Thread constructor and management methods
Thread(ThreadGroup group, Runnable target, String name)
Creates a new thread in the SUSPENDED state, which will belong to group and be identified as name; the thread will execute the run() method of target.
setPriority(int newPriority), getPriority()
Set and return the thread’s priority.
run()
A thread executes the run() method of its target object, if it has one, and otherwise its own run() method (Thread implements Runnable).
start()
Change the state of the thread from SUSPENDED to RUNNABLE.
sleep(long millisecs)
Cause the thread to enter the SUSPENDED state for the specified time.
yield()
Enter the READY state and invoke the scheduler.
destroy()
Destroy the thread.

Invocations between address spaces

Delay of an RPC operation when returned data size varies

A lightweight remote procedure call

Times for serialized and concurrent invocations

Monolithic kernel and Microkernel

Chapter 7: Security Introduction Overview of security techniques Cryptographic algorithms Digital signatures Cryptography pragmatics Case studies: Needham-Schroeder, Kerberos, SSL&Millicent Summary

Introduction History The emergence of cryptography into the public domain Public-key cryptography Much stronger DES Common terminology and approach Security policies Share resource within limited rights Security mechanisms Implement security policies

Threats and attacks Security threats Methods of attack
Leakage: acquisition of information by unauthorized recipients Tampering: unauthorized alteration of information Vandalism: interference with the proper operation of a system without gain to the perpetrator Methods of attack Eavesdropping: obtain copies of messages without authority Masquerading: send or receive messages using the identity of another principal without their authority Message tampering: intercept messages and alter their contents before passing them on to the intended recipient Replaying: store intercepted messages and send them at a later date Denial of service: flood a channel or other resources with messages in order to deny access to others Attacks in practice Discover loopholes Guess passwords

Threats from mobile code
Sandbox model in Java Security manager most applets can not access local files, printers or network sockets Two further measures to protect the local environment The downloaded classes are stored separately from the local classes, preventing them from replacing local classes with spurious versions The bytecodes are checked for validity, e.g. avoiding accessing illegal memory address

Securing electronic transactions
Examples depending crucially on security , purchase of goods and services, banking transactions, micro-transactions Requirements for securing web purchases Authenticate the vendor to the buyer Keep the buyer’s credit number and other payment details from falling into others’ hands and ensure that they are unaltered from the buyer to vendor Ensure downloadable contents are delivered without alteration and disclosure Authenticate the identity of the account holder to the bank before giving them access to their account Ensure account holder can’t deny they participated in a transaction (non-repudiation)

Design secure systems Design to the best available standards
Formal validation Construct a list of threats, and show that each of them is prevented by the mechanisms employed By informal argument, or logical proof Auditing methods Secure log: record security-sensitive system actions with details of the users performing the actions and their authority Cost and inconvenience Cost in computational effort and in network usage Inappropriately specified security measures may exclude legitimate users from performing necessary actions

Worst-case assumptions and design guidelines
Interfaces are exposed Networks are insecure Limit the lifetime and scope of each secret Algorithms and program code are available to attackers Publish the algorithms used for encryption and authentication, relying only on the secrecy of cryptographic keys Attackers may have access to large resources Minimize the trusted base Trusted base: the portion of a system that is responsible for the implementation of its security, and all the hardware and software components upon which it relies

Introduction Overview of security techniques Cryptographic algorithms Digital signatures Cryptography pragmatics Case studies: Needham-Schroeder, Kerberos, SSL&Millicent Summary

Cryptography Encryption Cryptographic key Shared secret keys
the process of encoding a message in such a way as to hide its contents Cryptographic key a parameter used in an encryption algorithm in such a way that the encryption can not be reversed without a knowledge of the key Shared secret keys The sender and the recipient must share a knowledge of the key and it must not be revealed to anyone else Public/private key pairs The sender of a message uses a public key – one that has already been published by the recipient – to encrypt the messages; the recipient uses a corresponding private key to decrypt the message.

Uses of cryptography – secrecy and integrity
Scenario 1: secret communication with a shared secret key Alice wishes to send some information secretly to Bob. Alice and Bob share a secret key KAB
Alice computes E(KAB, M) and sends {M}KAB to Bob; Bob computes D(KAB, {M}KAB) to recover M
Problem 1: how can Alice send a shared key KAB to Bob securely? Problem 2: how does Bob know that any {M}KAB is not a copy of an earlier encrypted message from Alice that was captured by Mallory and replayed later?

Uses of cryptography – authentication (1)
Scenario 2: authenticated communication with a server
Alice wishes to access files held by Bob. Sara is an authentication server that is securely managed, and it knows Alice’s key KA and Bob’s key KB.
Ticket: an encrypted item issued by an authentication server, containing the identity of the principal to whom it is issued and a shared key that has been generated for the current communication session
Challenge: Sara issues a ticket to Alice encrypted in Alice’s secret key
1. Alice -> Sara: “I am Alice, give me the ticket of Bob”
2. Sara -> Alice: {{Ticket}KB, KAB}KA, where {Ticket}KB = {KAB, Alice}KB
3. Alice decrypts the response using KA
4. Alice -> Bob: {Ticket}KB, Alice, R
5. Bob decrypts the ticket using KB
6. Alice and Bob exchange messages of the form {Message}KAB
Problem: no protection against the replay of old authentication messages

Uses of cryptography – authentication (2)
Scenario 3: authenticated communication with public keys
Bob has generated a public/private key pair.
1. Alice -> key distribution service: “give me the public key of Bob”
2. Key distribution service -> Alice: KBpub
3. Alice creates a new shared key KAB
4. Alice -> Bob: Keyname, {KAB}KBpub
5. Bob decrypts {KAB}KBpub using KBpriv
6. Alice and Bob exchange messages of the form {Message}KAB
Problem: Mallory may intercept Alice’s initial request to the key distribution service for Bob’s public-key certificate and send a response containing his own public key

Uses of cryptography – digital signatures
Digital signature: verifies to a third party that a message is an unaltered copy of one produced by the signer
Digest: a fixed-length compressed message
Secure digest function: similar to a checksum function, but unlikely to produce a similar digest value for two different messages
Scenario 4: digital signatures with a secure digest function
Alice wants to sign a document M so that any recipient can verify that she is its originator
1. Alice computes Digest(M)
2. Alice makes M, {Digest(M)}KApriv available
3. Bob obtains the signed document, extracts M and computes Digest(M)
4. Bob decrypts {Digest(M)}KApriv using Alice’s public key KApub and compares the result with his calculated Digest(M)

Certificates
Digital certificate: a statement signed by a principal
Scenario 5: the use of certificates
Bob: a bank; Alice: a customer who has an account with Bob’s bank; Carol: a vendor who accepts Alice’s transaction; Fred: a trusted authority
-3. Bob -> Fred: register the public key
-2. Fred -> Bob: certificate for Bob’s public key
-1. Carol -> Fred: “Give me Bob’s public key”
0. Fred -> Carol: certificate of Bob’s public key
1. Alice -> Bob: create an account
2. Bob -> Alice: Alice’s certificate
3. Alice -> Carol: certificate, TR
4. Carol verifies the certificate using KBpub

Certificates … continued
To make certificates useful:
A standard format and representation
Agreement on the manner in which chains of certificates are constructed, ending at a trusted authority
Revocation by time limit: include an expiry date

Access control
Protection domain: an execution environment shared by a collection of processes; it contains a set of <resource, rights> pairs, e.g. user ID and group ID in Unix
Capability: held by each process according to the domain in which it is located; unforgeable
Drawbacks: capabilities may be stolen; revocation problem (it is difficult to cancel capabilities)
Similarity between capabilities and certificates; a capability contains:
Resource identifier: a unique identifier for the target resource
Operations: a list of the operations permitted on the resource
Authentication code: a digital signature making the capability unforgeable

Access control … continued
Access control list A list stored with each resource The form of each entry: <domain, operations> E.g. ACL in Unix and Windows NT file systems Implementation Security API in CORBA and Java Digital signatures Credential Public-key certificate

Credentials and Firewalls
Credentials: a set of evidence provided by a principal when requesting access to a resource, e.g. a certificate
The “speaks for” idea
Cooperative credentials
Delegation certificate
Role-based credentials
Firewall

Plaintext/ciphertext Encryption function E Decryption function D
Several definitions Plaintext/ciphertext Encryption function E E(K,M) = {M}K Decryption function D D(K, E(K,M)) = M

Symmetric & Asymmetric algorithms
Symmetric algorithms
Secret-key cryptography
One-way property: F_K([M]) = E(K, M) is easy to compute, but the inverse F_K^-1([M]) is hard to compute without knowing K
Brute-force attack: try all possible K, computing E(K, M) to match {M}K; for an N-bit key this takes 2^(N-1) iterations on average and at most 2^N iterations
Asymmetric algorithms
Public-key cryptography
The pair of keys is derived from a common root, e.g. a pair of very large prime numbers in RSA

Different ciphers
Block ciphers
Operate on fixed-size blocks of data; 64 bits is a popular block size
Weakness when each block is encrypted independently: an attacker can recognize repeated patterns, and there is no integrity guarantee
Cipher block chaining (CBC): each plaintext block is combined with the preceding ciphertext block using the exclusive-or operation before it is encrypted; since each block depends on its predecessor, CBC is restricted to reliable connections
Stream ciphers
Convert plaintext to ciphertext one bit at a time
Keystream: an arbitrary-length sequence of bits; encrypt the keystream, then XOR it with the data stream
If the keystream is secure, so is the data stream
Keystream generator: e.g. a random number generator that is agreed between sender and receiver
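The chaining step can be sketched in C as follows. This is a minimal illustration, not a vetted implementation: it assumes TEA as the 64-bit block cipher, fixed-width uint32_t words, and an initialization vector iv that stands in for the "preceding ciphertext block" of the first block. The names tea_encrypt_block, cbc_encrypt and cbc_decrypt are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* TEA on one 64-bit block v[0..1] with a 128-bit key k[0..3]. */
static void tea_encrypt_block(const uint32_t k[4], uint32_t v[2]) {
    const uint32_t delta = 0x9e3779b9u;
    uint32_t y = v[0], z = v[1], sum = 0;
    for (int n = 0; n < 32; n++) {
        sum += delta;
        y += ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        z += ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
    }
    v[0] = y; v[1] = z;
}

static void tea_decrypt_block(const uint32_t k[4], uint32_t v[2]) {
    const uint32_t delta = 0x9e3779b9u;
    uint32_t y = v[0], z = v[1], sum = delta << 5;  /* 32 * delta mod 2^32 */
    for (int n = 0; n < 32; n++) {
        z -= ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
        y -= ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        sum -= delta;
    }
    v[0] = y; v[1] = z;
}

/* CBC encryption in place: XOR each plaintext block with the previous
   ciphertext block (the IV for the first block), then encrypt it. */
void cbc_encrypt(const uint32_t k[4], const uint32_t iv[2],
                 uint32_t *data, size_t nblocks) {
    uint32_t prev[2] = { iv[0], iv[1] };
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t *b = data + 2 * i;
        b[0] ^= prev[0]; b[1] ^= prev[1];
        tea_encrypt_block(k, b);
        prev[0] = b[0]; prev[1] = b[1];
    }
}

/* CBC decryption in place: decrypt each block, then XOR it with the
   previous ciphertext block (saved before it is overwritten). */
void cbc_decrypt(const uint32_t k[4], const uint32_t iv[2],
                 uint32_t *data, size_t nblocks) {
    uint32_t prev[2] = { iv[0], iv[1] };
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t *b = data + 2 * i;
        uint32_t c0 = b[0], c1 = b[1];
        tea_decrypt_block(k, b);
        b[0] ^= prev[0]; b[1] ^= prev[1];
        prev[0] = c0; prev[1] = c1;
    }
}
```

With chaining, two identical plaintext blocks encrypt to different ciphertext blocks, which hides the repeated patterns that independent block encryption would reveal.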

Design of cryptographic algorithms
Based on Information Theory information-preserving manipulations of M Confusion Combine each block of plaintext with the key Non-destructive operations, e.g. XOR, circular shifting Obscure the relationship between M and {M}K Diffusion Dissipate the regular patterns transpose portions of each plaintext block

Secret-key (symmetric) algorithms
TEA (Tiny Encryption Algorithm) [CU1994]
Cipher block: 64 bits (two 32-bit integers)
Encryption key: 128 bits (four 32-bit integers), long enough to resist brute-force attack
Confusion: XOR (^) and shift (<<, >>)
Diffusion: shift and swap
delta: obscures the key
Two very minor weaknesses
Application example

Secret-key (symmetric) algorithms …continued
DES [IBM1977]
Adopted as a US national standard
64-bit cipher block, 56-bit key
Cracked in a widely publicized brute-force attack
Triple-DES: E3DES(K1,K2,M) = EDES(K1, DDES(K2, EDES(K1,M)))
IDEA [1990]
Successor to DES, 128-bit key, 3 times faster than DES
AES [NIST1999]
128-bit, 192-bit or 256-bit key
Likely to become the most widely used symmetric encryption algorithm

Public-key (asymmetric) algorithms
D(Kd, E(Ke,M)) = M
Ke is public, Kd is secret
RSA: based on the use of two very large prime numbers
Ke = <e, N>, Kd = <d, N>
Security rests on the factorization of N being very time-consuming
In applications, N should be at least 768 bits long

Hybrid cryptographic protocols
Public-key cryptography
Pros: no need for a secure key-distribution mechanism
Cons: high processing cost
Secret-key cryptography
Pros: efficient
Cons: needs a secure key-distribution mechanism
Hybrid encryption scheme
Secret key distribution: public-key cryptography
Data transmission: secret-key cryptography

Handwritten Signature and Digital Signature
Handwritten signatures
Authentic: no alteration
Unforgeable
Non-repudiable
Digital signatures
Bind a unique and secret attribute of the signer to the document
Digital signing
Example of a signed document: M, A, [H(M)]KA, where A is the signer ID and KA is the signer’s key
Digest functions
Secure hash functions: ensure that H(M) <> H(M`) for different messages M and M`

Digital signature Keys
Digital signatures with public keys
A convenient solution in most situations
Digital signatures with secret keys
Problems caused by secret-key digital signatures:
A secure secret-key distribution mechanism is required
The verifier could forge the signer’s signature
MAC (message authentication code)
The sender passes a shared key to the receiver via a secure channel
No encryption, so 3-10 times faster than symmetric encryption
Suffers from the same problems

Secure digest functions
Requirements on secure digest function h = H(M) Given M, it is easy to compute h Given h, it is hard to compute M Given M, it is hard to find another message M`, such that H(M) = H(M`) Birthday attack MD5 [Rivest 1992], 128-bit digest SHA [NIST 1995], 160-bit digest

Certificate standards and certificate authorities
X.509 The most widely used standard format for certificates[CCITT 1988b] Based on the global uniqueness of distinguished names SPKI Creation and management of sets of public certificates Chains of certificates

Cryptography pragmatics
Performance of cryptographic algorithms
Applications of cryptography and political obstacles
NSA (National Security Agency): restrict the strength of cryptography
FBI: privileged access to all cryptographic keys
PGP (Pretty Good Privacy)
An example of a cryptographic method that is not controlled by the US government
Generates and manages public and secret keys
RSA for authentication and secret key transmission
IDEA or 3DES for data transmission

Simple authentication scenario
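1. Alice -> Sara: “I am Alice, give me the ticket of Bob”
2. Sara -> Alice: {{Ticket}KB, KAB}KA, where {Ticket}KB = {KAB, Alice}KB
3. Alice decrypts the response using KA
4. Alice -> Bob: {Ticket}KB, Alice, R
5. Bob decrypts the ticket using KB
6. Alice and Bob exchange messages of the form {Message}KAB
Problem: no protection against the replay of old authentication messages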

Needham and Schroeder authentication protocol
Each entry gives the header, the message, and notes.
1. A->S: A, B, NA
A requests S to supply a key for communication with B.
2. S->A: {NA, B, KAB, {KAB, A}KB}KA
S returns a message encrypted in A’s secret key, containing a newly generated key KAB and a ‘ticket’ encrypted in B’s secret key. The nonce NA demonstrates that the message was sent in response to the preceding one. A believes that S sent the message because only S knows A’s secret key.
3. A->B: {KAB, A}KB
A sends the ‘ticket’ to B.
4. B->A: {NB}KAB
B decrypts the ticket and uses the new key KAB to encrypt another nonce NB.
5. A->B: {NB - 1}KAB
A demonstrates to B that it was the sender of the previous message by returning an agreed transformation of NB.
Nonce: an integer that demonstrates message freshness
Weakness: message 3 may be stale, i.e. a replay of an old ticket. Remedy: include a timestamp t in the ticket, {KAB, A, t}KB

Kerberos
Three kinds of security objects
Ticket: a token that verifies that the sender has recently been authenticated
Authenticator: a token that proves the user’s identity and the currency of communication with a server
Session key: used to encrypt communication
System architecture of Kerberos
Kerberos protocol

Application of Kerberos
Campus network [MIT 1990]
Users’ passwords and services’ secrets: known only to their owner and to the authentication server
Login with Kerberos: the password is protected from eavesdropping
Access to servers with a Kerberos ticket, which contains an expiry time

Secure Socket Layer (SSL)
SSL protocol stack
Hybrid scheme: public-key cryptography for authentication, secret-key cryptography for data communication
Negotiable encryption and authentication algorithms: a requirement of an open network environment
Handshake protocol
Application [Netscape 1996]: de facto standard, used by https, integrated in web browsers and web servers

Summary Guide for designing a secure system
worst case assumptions Public-key and secret-key cryptography TEA RSA Access control mechanisms capability and ACL Needham-Schroeder authentication protocol Challenge, Ticket Kerberos Ticket, Authenticator, Session key

Historical context: the evolution of security needs
Columns run from the earliest period (left) to the one labelled Current (right).
Platforms: Multi-user timesharing computers | Distributed systems based on local networks | The Internet, wide-area services | The Internet + mobile devices
Shared resources: Memory, files | Local services (e.g. NFS), local networks | web sites, Internet commerce | Distributed objects, mobile code
Security requirements: User identification and authentication | Protection of services | Strong security for commercial transactions | Access control for individual objects, secure mobile code
Security management environment: Single authority, single authorization database (e.g. /etc/passwd) | delegation, replicated authorization databases (e.g. NIS) | Many authorities, no network-wide authorities | Per-activity authorities, groups with shared responsibilities

Cryptography notations
Alice: first participant
Bob: second participant
Carol: participant in three- and four-party protocols
Dave: participant in four-party protocols
Eve: eavesdropper
Mallory: malicious attacker
Sara: a server

Alice’s bank account certificate
1. Certificate type: Account number
2. Name: Alice
3. Account:
4. Certifying authority: Bob’s Bank
5. Signature: {Digest(field 2 + field 3)}KBpriv

Public-key certificate for Bob’s bank
1. Certificate type: Public key
2. Name: Bob’s Bank
3. Public key: KBpub
4. Certifying authority: Fred – The Bankers Federation
5. Signature: {Digest(field 2 + field 3)}KFpriv

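Cipher block chaining
Figure: each plaintext block (n+1, n+2, n+3, ...) is XORed with the preceding ciphertext block and then encrypted with E(K, M) to give the next ciphertext block (..., n-3, n-2, n-1, n).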

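Stream cipher
Figure: the output of a number generator (n+1, n+2, n+3, ...) is encrypted with E(K, M) to form the keystream, which is held in a buffer and XORed with the plaintext stream to give the ciphertext stream.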

TEA encryption function
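/* Encrypts one 64-bit block text[0..1] in place with the 128-bit key k[0..3].
   Assumes unsigned long is 32 bits wide; on platforms where it is wider,
   declare k, text, y, z, delta and sum as uint32_t (from <stdint.h>). */
void encrypt(unsigned long k[], unsigned long text[]) {
    unsigned long y = text[0], z = text[1];
    unsigned long delta = 0x9e3779b9, sum = 0;
    int n;
    for (n = 0; n < 32; n++) {          /* 32 rounds */
        sum += delta;
        y += ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        z += ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
    }
    text[0] = y; text[1] = z;
}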

TEA decryption function
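/* Decrypts one 64-bit block text[0..1] in place; the inverse of encrypt.
   Assumes unsigned long is 32 bits wide, so that delta << 5 wraps to 32 bits. */
void decrypt(unsigned long k[], unsigned long text[]) {
    unsigned long y = text[0], z = text[1];
    unsigned long delta = 0x9e3779b9, sum = delta << 5;   /* 32 * delta */
    int n;
    for (n = 0; n < 32; n++) {          /* undo the rounds in reverse order */
        z -= ((y << 4) + k[2]) ^ (y + sum) ^ ((y >> 5) + k[3]);
        y -= ((z << 4) + k[0]) ^ (z + sum) ^ ((z >> 5) + k[1]);
        sum -= delta;
    }
    text[0] = y; text[1] = z;
}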

TEA in use
#include <stdio.h>

void tea(char mode, FILE *infile, FILE *outfile, unsigned long k[]) {
    /* mode is 'e' for encrypt, 'd' for decrypt, k[] is the key.
       The casts below assume unsigned long is 32 bits wide, so that the
       8-byte buffer holds exactly one two-word block. */
    char Text[8];
    int i;
    while (!feof(infile)) {
        i = fread(Text, 1, 8, infile);       /* read 8 bytes from infile into Text */
        if (i <= 0) break;
        while (i < 8) { Text[i++] = ' '; }   /* pad last block with spaces */
        switch (mode) {
        case 'e': encrypt(k, (unsigned long*) Text); break;
        case 'd': decrypt(k, (unsigned long*) Text); break;
        }
        fwrite(Text, 1, 8, outfile);         /* write 8 bytes from Text to outfile */
    }
}

RSA Encryption - 1
To find a key pair e, d:
1. Choose two large prime numbers, P and Q (each greater than 10^100), and form:
N = P x Q
Z = (P-1) x (Q-1)
2. For d choose any number that is relatively prime with Z (that is, such that d has no common factors with Z).
Example: P = 13, Q = 17 -> N = 221, Z = 192, d = 5

RSA Encryption - 2
3. To find e solve the equation:
e x d = 1 mod Z
That is, e x d is the smallest element divisible by d in the series Z+1, 2Z+1, 3Z+1, ... .
Example: e x d = 1 mod 192 gives the candidates 1, 193, 385, ...; 385 is divisible by d, so e = 385/5 = 77
To encrypt text using the RSA method, the plaintext is divided into equal blocks of length k bits, where 2^k < N.
Example: since N = 221, set k = 7, because 2^7 = 128 < 221

RSA Encryption - 3
4. The function for encrypting a single block of plaintext M is: E'(e,N,M) = M^e mod N
Example: for a message M, the ciphertext is M^77 mod 221
5. The function for decrypting a block of encrypted text c to produce the original plaintext block is: D'(d,N,c) = c^d mod N
E' and D' are mutual inverses: E'(D'(x)) = D'(E'(x)) = x for all values of x in the range 0 ≤ x < N
Ke = (e,N), Kd = (d,N)
Attack: derive Kd from Ke, or Ke from Kd
It is difficult to get d from (e,N), or e from (d,N), without knowing P and Q
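The worked example (N = 221, e = 77, d = 5) can be checked with a few lines of C. This is a toy sketch only: real RSA needs multi-precision arithmetic and padding, and the names mod_pow, rsa_encrypt and rsa_decrypt are hypothetical helpers hard-wired to the example key pair.

```c
#include <stdint.h>

/* Computes (base^exp) mod n by square-and-multiply.
   Intermediate products fit in 64 bits as long as n < 2^32. */
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t n) {
    uint64_t result = 1 % n;
    base %= n;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % n;
        base = (base * base) % n;
        exp >>= 1;
    }
    return result;
}

/* E'(e,N,M) = M^e mod N with the example key Ke = (77, 221). */
uint64_t rsa_encrypt(uint64_t m) { return mod_pow(m, 77, 221); }

/* D'(d,N,c) = c^d mod N with the example key Kd = (5, 221). */
uint64_t rsa_decrypt(uint64_t c) { return mod_pow(c, 5, 221); }
```

Because 77 x 5 = 385 = 2 x 192 + 1, rsa_decrypt(rsa_encrypt(m)) returns m for every block value 0 ≤ m < 221, which is the mutual-inverse property stated above.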

Digital signatures with public keys

Low-cost signatures with a shared secret key

Birthday attack
Example
1. Prepare two versions, M and M`, of a contract
2. Make several subtly different versions of both M and M` that are visually indistinguishable from the originals
3. Compare the hashes of all the Ms and M`s
4. If a pair with matching hashes is found, the signature on one version can be passed off as a signature on the other
If hash values are 64 bits long, the attack requires only 2^32 versions of M and M` on average
Birthday paradox
The probability of finding a matching pair in a given set is far greater than that of finding a match for a given individual
The probability that two people in a set of 23 share a birthday is about the same as the probability that someone in a set of 253 people has a birthday on a given day
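The two probabilities in the birthday paradox can be computed directly. The sketch below assumes 365 equally likely birthdays; the function names are hypothetical.

```c
/* Probability that at least two of n people share a birthday,
   with `days` equally likely birthdays. */
double prob_shared_birthday(int n, int days) {
    double p_all_distinct = 1.0;
    for (int i = 0; i < n; i++)
        p_all_distinct *= (double)(days - i) / days;
    return 1.0 - p_all_distinct;
}

/* Probability that at least one of n people has a birthday
   on one particular, given day. */
double prob_given_birthday(int n, int days) {
    double p_no_match = 1.0;
    for (int i = 0; i < n; i++)
        p_no_match *= (double)(days - 1) / days;
    return 1.0 - p_no_match;
}
```

Both prob_shared_birthday(23, 365) and prob_given_birthday(253, 365) come out just above one half. Finding some matching pair needs far fewer candidates than matching one fixed value, and that gap is what the attack on digests exploits.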

X.509 Certificate format
Subject: Distinguished Name, Public Key
Issuer: Distinguished Name, Signature
Period of validity: Not Before Date, Not After Date
Administrative information: Version, Serial Number
Extended Information

System architecture of Kerberos
Figure: the Kerberos Key Distribution Centre contains the authentication service A, the ticket-granting service T and the authentication database. Client C and server S interact with it as follows.
Step A (login session setup): 1. Request for TGS ticket; 2. TGS ticket
Step B: 3. Request for server ticket; 4. Server ticket
Step C (DoOperation): 5. Service request; the request and the reply are encrypted with the session key

Kerberos protocol
Each entry gives the header, then the message.
1. C->A: Request for TGS ticket
C, T, n
2. A->C: TGS session key and ticket
{KCT, n}KC, {ticket(C,T)}KT, with ticket(C,T) containing C, T, t1, t2, KCT
3. C->T: Request ticket for service S
{auth(C)}KCT, {ticket(C,T)}KT, S, n, with {auth(C)}KCT = {C, t}KCT
4. T->C: Service ticket
{KCS, n}KCT, {ticket(C,S)}KS
5. C->S: Service request
{auth(C)}KCS, {ticket(C,S)}KS, request, n
6. S->C: Server authentication
{n}KCS

SSL protocol stack
Figure: layers from top to bottom.
SSL protocols: SSL Handshake protocol, SSL Change Cipher Spec, SSL Alert Protocol; other protocols: HTTP, Telnet
SSL Record Protocol
Transport layer (usually TCP)
Network layer (usually IP)

SSL handshake protocol

SSL handshake configuration options
Each entry gives the component, its description and an example.
Key exchange method: the method to be used for exchange of a session key. Example: RSA with public-key certificates
Cipher for data transfer: the block or stream cipher to be used for data. Example: IDEA
Message digest function: for creating message authentication codes (MACs). Example: SHA

Chapter 7: Distributed File Systems
Introduction File Service Architecture Sun Network File System The Andrew File System Recent advances Summary

Distributed file system
Introduction File system persistent storage Distributed file system information sharing similar (in some case better) performance and reliability Various kinds of storage systems

Characteristics of file systems
Responsibilities Organization, storage, retrieval, naming, sharing and protection Important concepts related to file File Include data and attributes Directory A special file that provides a mapping from text names to internal file identifiers Metadata Extra management information; including attribute, directory etc File system architecture File system operations Applications access via library procedures

Distributed file system requirements
Transparency access transparency location transparency mobility transparency performance transparency scaling transparency Concurrent file updates concurrency control File replication better performance & fault tolerance Hardware and operating system heterogeneity

Distributed file system requirements … continued
Fault tolerance
Idempotent operations: support at-least-once semantics
Stateless server: restarts after a crash without recovery
Consistency: one-copy update semantics
Security: authentication, access control, secure channel
Efficiency: comparable with, or better than, local file systems in performance and reliability

Case studies
Sun NFS
First file service that was designed as a product [1984]
Adopted as an Internet standard
Supported by almost all platforms, e.g. Windows NT, Unix
Andrew File System
Campus information sharing system at CMU [1986]
800 workstations and 40 servers at CMU [1991]

Three components of a file service
File service architecture Flat file service Operate on the contents of files Unique file identifier (UFID) Directory service Provide a mapping between text names to UFIDs Client module Support applications accessing remote file service transparently E.g. iterative request to directory service, cache files

Flat file service interface
Flat file service operations Comparison with Unix No open and close Read or write specifying a starting point Fault tolerance reasons for the differences Repeatable operations except for create, all operations are idempotent Stateless servers E.g. without pointer when operate on files restart after crash without recovery

Access control in DFS
Unix file system
The user’s access rights are checked against the access mode requested in the open call
Stateless DFS
The DFS interface is open to the public
The file server cannot retain the user ID between requests
Two approaches to access control: (1) authenticate based on a capability; (2) attach the user ID to each request
Kerberos in AFS and NFS

Directory service interface
Main task: translate text names to UFIDs
Lookup(Dir, Name) -> FileId, throws NotFound
Locates the text name in the directory and returns the relevant UFID. If Name is not in the directory, throws an exception.
AddName(Dir, Name, File), throws NameDuplicate
If Name is not in the directory, adds (Name, File) to the directory and updates the file’s attribute record. If Name is already in the directory, throws an exception.
UnName(Dir, Name)
If Name is in the directory, the entry containing Name is removed from the directory. If Name is not in the directory, throws an exception.
GetNames(Dir, Pattern) -> NameSeq
Returns all the text names in the directory that match the regular expression Pattern.

Hierarchic file system
Directory tree
Each directory is a special file that holds the names of the files and other directories that are accessible from it
Pathname
References a file or a directory
Multi-part name, e.g. “/etc/rc.d/init.d/nfsd”
Exploring the tree
Translate a pathname via multiple lookup operations
Directory cache at the client

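NFS architecture
Figure: on the client computer, an application program issues UNIX system calls to the UNIX kernel, where the virtual file system passes local requests to the UNIX file system (or another file system) and remote requests to the NFS client. The NFS client talks to the NFS server on the server computer using the NFS protocol; the NFS server reaches the server’s UNIX file system through its own virtual file system.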

Virtual file system
Keeps track of local and remote file systems
V-node
For a local file: refers to an i-node
For a remote file: refers to a file handle
File handle
The file identifier used in NFS
File handles are passed between client and server to refer to a file
Fields of a file handle: filesystem identifier, i-node number of file, i-node generation number

Design points Access control and authentication
User ID is attached to each request Kerberos embedded in NFS Authenticate user when mount Ticket, Authenticator and secure channel Client integration into the kernel No recompilation Single client module serving all user-level processes Retain encryption key used to authenticate user ID passed to the server NFS Server interface Defined in RFC 1813

Mount service
File server: /etc/exports contains the names of local file systems that are available for remote mounting
Client: the mount command includes the location and pathname of the remote directory
Example

Mount service … continued
Hard-mounted/soft-mounted
Hard-mounted: the process is suspended while the remote directory it is accessing is unavailable
Soft-mounted: an error is reported to the process after several retries
Automounter
Mounts a remote directory dynamically whenever an empty mount point is referenced by a client

Multi-part pathname translation
From pathname to file handle Multi-part pathname translation Client issues several separated lookup requests to server Directory cache Cache the results of translation conducted recently
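The step-by-step translation can be sketched as a loop that issues one lookup per pathname component. This is an illustration under stated assumptions: file handles are modelled as small integers (real NFS handles are opaque), the root handle is taken to be 1, and lookup here is a hypothetical in-memory stand-in for the remote lookup(dirfh, name) request.

```c
#include <stddef.h>
#include <string.h>

#define COMPONENT_MAX 32   /* sketch-only limit on component length */

/* Hypothetical in-memory stand-in for the server's directories. */
struct dir_entry { int dirfh; const char *name; int fh; };

static const struct dir_entry table[] = {
    {1, "etc", 2}, {2, "rc.d", 3}, {3, "init.d", 4}, {4, "nfsd", 5},
};

int lookup_calls = 0;   /* counts simulated requests to the server */

/* Stand-in for the remote lookup: returns the handle of `name`
   in directory `dirfh`, or -1 if there is no such entry. */
int lookup(int dirfh, const char *name) {
    lookup_calls++;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].dirfh == dirfh && strcmp(table[i].name, name) == 0)
            return table[i].fh;
    return -1;
}

/* Translates an absolute pathname to a file handle, one component
   at a time. Returns -1 if any component cannot be resolved. */
int translate(const char *path) {
    int fh = 1;                          /* assumed root file handle */
    const char *p = path;
    while (*p != '\0') {
        while (*p == '/') p++;           /* skip separators */
        if (*p == '\0') break;
        size_t len = strcspn(p, "/");    /* length of next component */
        char name[COMPONENT_MAX];
        if (len >= sizeof name) return -1;
        memcpy(name, p, len);
        name[len] = '\0';
        fh = lookup(fh, name);           /* one request per component */
        if (fh < 0) return -1;
        p += len;
    }
    return fh;
}
```

Translating "/etc/rc.d/init.d/nfsd" costs four lookup requests in this sketch, which is why a client-side cache of recent translations pays off.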

File cache at the server
Buffer cache in UNIX file system read-ahead delay-write sync periodically, e.g. 30 seconds Cache reading in NFS server Similar to local file system Cache writing in NFS server: enhance reliability write-through commit operation

File cache at the client
Cache file blocks at the client
Maintain coherence: the client polls the server to validate blocks when it uses them
Validity condition
Two timestamps are attached to each block in the cache:
Tc: the time when the cache entry was last validated
Tm: the time when the block was last modified at the server
A cache entry is valid at time T if (T - Tc < t) ∨ (Tm_client = Tm_server)
t, the freshness interval, is a compromise between consistency and efficiency, e.g. 3-30 s
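The validity condition maps onto a short predicate. In this sketch the struct layout and the names cache_entry and cache_entry_valid are assumptions; tm_server stands for the modification time that the client would obtain from the server (for example via getattr) when the freshness test fails.

```c
#include <stdbool.h>
#include <time.h>

struct cache_entry {
    time_t tc;         /* Tc: when this entry was last validated */
    time_t tm_client;  /* Tm as recorded by the client for this block */
};

/* Returns true if the entry may be used at time `now`:
   (T - Tc < t) or (Tm_client == Tm_server). */
bool cache_entry_valid(const struct cache_entry *e, time_t now,
                       time_t freshness, time_t tm_server) {
    if (now - e->tc < freshness)
        return true;                      /* fresh: no server contact needed */
    return e->tm_client == tm_server;     /* stale: compare modification times */
}
```

A larger freshness interval t means fewer requests to the server but a longer window in which a client may read out-of-date data.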

File cache at the client … continued
Write cache
Flush the results to the server when the file is closed
Perform a sync periodically
Cache semantics
Unix file consistency is not guaranteed

NFS summary
Access transparency
Location transparency
Mobility transparency: a filesystem must be remounted when it migrates
Scalability: limited performance for hot-spot files
File replication: replication of files with updates is not supported

NFS summary … continued
Hardware and operating system heterogeneity
Fault tolerance: stateless server, idempotent operations
Consistency
Security: can be enhanced by Kerberos
Efficiency

Motivation of AFS
Information sharing: share information among a large number of users
Scalability: a large number of users; a large number of files; a large number of users accessing hot files
Unusual designs: whole-file serving; whole-file caching

Typical scenario of using AFS
1. The client opens a remote file
2. A copy of the file is stored on the client computer
3. The client reads and writes the local copy
4. When the client closes the file, the copy is flushed to the server if it was updated

Observations of typical Unix file usage
Files are small: most files are less than 10 KB in size
Reads are 6 times more common than writes
Sequential access is common; random access is rare
Most files are accessed by only one user
Most shared files are modified by only one user
A recently used file is highly likely to be used again

The basis of the AFS design
The majority of files are infrequently updated and accessed by a single user
Working set: 100 megabytes
Database files are not supported

Implementation
System architecture
Name space
Unix kernel: modified BSD; intercepts relevant system calls and forwards them to Venus

Implementation … continued
Venus
Accesses files by fids
Step-by-step lookup: translates pathnames to fids
File cache
One file partition is used for the cache; it accommodates several hundred average-size files
Cache coherence is maintained by a callback mechanism
Vice
Flat file service
Accepts requests in terms of fids

Cache coherence
State of a cached file: valid or cancelled
Open a file
Venus fetches the file when it has no cached copy or the cached copy is cancelled
Vice remembers the location of each cached copy
Close a file
Venus flushes the file to Vice if the application has updated it
Vice executes the updates on the file in sequential order
Vice informs all other caches holding the file that their copies are cancelled
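The open and close rules can be sketched as a small state machine for a single file. This is a simplified model, not the Venus/Vice implementation: file contents are reduced to a version number, the set of clients is fixed, and every name (file_state, afs_open, afs_close_updated) is hypothetical.

```c
#define MAX_CLIENTS 4

enum cb_state { CB_NONE, CB_VALID, CB_CANCELLED };

/* Server-side and client-side state for one file (sketch only). */
struct file_state {
    int version;                         /* version of the server copy */
    enum cb_state promise[MAX_CLIENTS];  /* callback promise per client */
    int cached_version[MAX_CLIENTS];     /* version cached by each client */
    int fetches;                         /* whole-file fetches served */
};

/* Open: fetch the whole file only if this client has no copy or its
   callback promise was cancelled. Returns the version the client sees. */
int afs_open(struct file_state *f, int client) {
    if (f->promise[client] != CB_VALID) {
        f->cached_version[client] = f->version;  /* fetch from server */
        f->promise[client] = CB_VALID;           /* server records the promise */
        f->fetches++;
    }
    return f->cached_version[client];
}

/* Close after an update: flush to the server, which then cancels the
   callback promises of all other clients holding the file. */
void afs_close_updated(struct file_state *f, int client) {
    f->version++;
    f->cached_version[client] = f->version;
    for (int c = 0; c < MAX_CLIENTS; c++)
        if (c != client && f->promise[c] == CB_VALID)
            f->promise[c] = CB_CANCELLED;
}
```

Repeated opens of an unchanged file cost no fetch in this model; a client contacts the server again only after its promise has been cancelled.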

Cache coherence … continued
Validate a file
Venus validates the file when the client restarts or has not received a callback for a time T
Scalability
Because most files are read-only, callbacks reduce client-server interaction dramatically in contrast to client polling

Update semantics
Approximation to one-copy file semantics
AFS-1
After a successful open: latest(F, S)
After a failed open: failure(S)
After a successful close: updated(F, S)
After a failed close: failure(S)
AFS-2: weaker open semantics
After a successful open: latest(F, S, 0) or (lostCallback(S, T) and inCache(F) and latest(F, S, T))
No concurrency control at the server

Other aspects … continued
UNIX kernel modifications Location database Fully replicated map of volumes to servers Threads Multi-threads in Venus and Vice Read-only replicas Bulk transfers Partial file caching

Other aspects … continued
Performance AFS benchmark Server load of AFS and NFS under the same benchmark 40% / 100% Wide-area support Transarc Corp. 96%-98% cache hit

NFS enhancements
Sprite: one-copy update semantics
Multiple readers operate on cached copies
One writer and multiple readers operate on the same server copy
NQNFS: more precise consistency
Maintains cache consistency by leases
WebNFS: access to NFS servers via the Web
NFS version 4: applications in wide-area networks

Improvements in storage organization
Redundant Arrays of Inexpensive Disks (RAID)
Data is segmented into fixed-size chunks
Chunks are stored in stripes across several disks
Redundant error-correcting codes
Log-structured file storage (LFS)
Accumulate a set of small writes in memory
Commit the accumulated writes in large, contiguous, fixed-size segments

New design approaches
xFS (Serverless File System)
Separates file server management tasks from processing tasks
Distributes file server processing responsibility
Software RAID
Cooperative cache
Frangipani
Separates persistent storage responsibility from other service actions
Petal: a virtual disk abstraction across many disks located on multiple servers
Log-structured data store

Summary
Key design issues for DFS: effective use of client cache; cache coherence; recovery after client or server failure; high throughput for reading and writing; scalability
NFS: stateless, efficient, poor scalability
AFS: high scalability
DFS state of the art: support for applications in wide-area networks and pervasive computing

Storage systems and their properties
Properties compared: sharing, persistence, distributed cache/replicas, consistency maintenance
Each entry gives the storage system, then an example.
Main memory: RAM
File system: UNIX file system
Distributed file system: Sun NFS
Web: Web server
Distributed shared memory: Ivy (Ch. 16)
Remote objects (RMI/ORB): CORBA
Persistent object store (consistency maintenance: 1): CORBA Persistent Object Service
Persistent distributed object store: PerDiS, Khazana

File system modules

File attribute record structure
File length
Creation timestamp
Read timestamp
Write timestamp
Attribute timestamp
Reference count
Owner
File type
Access control list

Unix file system operations
filedes = open(name, mode): opens an existing file with the given name.
filedes = creat(name, mode): creates a new file with the given name.
Both operations deliver a file descriptor referencing the open file. The mode is read, write or both.
status = close(filedes): closes the open file filedes.
count = read(filedes, buffer, n): transfers n bytes from the file referenced by filedes to buffer.
count = write(filedes, buffer, n): transfers n bytes to the file referenced by filedes from buffer.
Both operations deliver the number of bytes actually transferred and advance the read-write pointer.
pos = lseek(filedes, offset, whence): moves the read-write pointer to offset (relative or absolute, depending on whence).
status = unlink(name): removes the file name from the directory structure. If the file has no other names, it is deleted.
status = link(name1, name2): adds a new name (name2) for a file (name1).
status = stat(name, buffer): gets the file attributes for file name into buffer.

File service architecture
Client computer Server computer Application program Client module Flat file service Directory service

Flat file service operations
Read(FileId, i, n) -> Data, throws BadPosition
If 1 ≤ i ≤ Length(File): reads a sequence of up to n items from a file starting at item i and returns it in Data.
Write(FileId, i, Data)
If 1 ≤ i ≤ Length(File)+1: writes a sequence of Data to a file, starting at item i, extending the file if necessary.
Create() -> FileId
Creates a new file of length 0 and delivers a UFID for it.
Delete(FileId)
Removes the file from the file store.
GetAttributes(FileId) -> Attr
Returns the file attributes for the file.
SetAttributes(FileId, Attr)
Sets the file attributes (only those attributes that are not shaded in ).
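Because Read and Write name their starting position explicitly, repeating a request has the same effect as issuing it once, which is what lets a stateless server tolerate retransmissions. The sketch below models one file in memory; the fixed capacity, the -1 return value standing for BadPosition, and the names flat_file, ff_read and ff_write are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define FILE_CAPACITY 256   /* sketch-only limit on file size */

struct flat_file {
    char items[FILE_CAPACITY];
    size_t length;           /* Length(File) */
};

/* Read(FileId, i, n): copies up to n items starting at item i (1-based)
   into out. Returns the number of items read, or -1 for BadPosition. */
long ff_read(const struct flat_file *f, size_t i, size_t n, char *out) {
    if (i < 1 || i > f->length) return -1;
    size_t avail = f->length - (i - 1);
    if (n > avail) n = avail;
    memcpy(out, f->items + (i - 1), n);
    return (long)n;
}

/* Write(FileId, i, Data): writes n items starting at item i (1-based),
   extending the file if necessary. Returns n, or -1 on a bad position
   or if the sketch's capacity would be exceeded. */
long ff_write(struct flat_file *f, size_t i, const char *data, size_t n) {
    if (i < 1 || i > f->length + 1) return -1;
    if (n > FILE_CAPACITY - (i - 1)) return -1;
    memcpy(f->items + (i - 1), data, n);
    if (i - 1 + n > f->length) f->length = i - 1 + n;
    return (long)n;
}
```

Contrast this with the UNIX read and write calls, which advance a read-write pointer held on behalf of the caller: repeating one of those calls returns or overwrites different data.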

NFS server operations (simplified) - 1
lookup(dirfh, name) -> fh, attr
Returns file handle and attributes for the file name in the directory dirfh.
create(dirfh, name, attr) -> newfh, attr
Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.
remove(dirfh, name) -> status
Removes file name from directory dirfh.
getattr(fh) -> attr
Returns file attributes of file fh. (Similar to the UNIX stat system call.)
setattr(fh, attr) -> attr
Sets the attributes (mode, user id, group id, size, access time and modify time) of a file. Setting the size to 0 truncates the file.
read(fh, offset, count) -> attr, data
Returns up to count bytes of data from a file starting at offset. Also returns the latest attributes of the file.
write(fh, offset, count, data) -> attr
Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.
rename(dirfh, name, todirfh, toname) -> status
Changes the name of file name in directory dirfh to toname in directory todirfh.
link(newdirfh, newname, dirfh, name)
Creates an entry newname in the directory newdirfh which refers to file name in the directory dirfh.

NFS server operations (simplified) - 2
symlink(newdirfh, newname, string) -> status
Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string but makes a symbolic link file to hold it.
readlink(fh) -> string
Returns the string that is associated with the symbolic link file identified by fh.
mkdir(dirfh, name, attr) -> newfh, attr
Creates a new directory name with attributes attr and returns the new file handle and attributes.
rmdir(dirfh, name) -> status
Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.
readdir(dirfh, cookie, count) -> entries
Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file handle, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. If the value of cookie is 0, reads from the first entry in the directory.
statfs(fh) -> fsstats
Returns file system information (such as block size, number of free blocks and so on) for the file system containing a file fh.

Local and remote file systems accessible on an NFS client

Distribution of processes in the Andrew File System

File name space seen by clients of AFS

System call interception in AFS

AFS fid: Volume number (32 bits), File handle (32 bits), Uniquifier (32 bits)

The main components of the Vice service interface
Fetch(fid) -> attr, data: Returns the attributes (status) and, optionally, the contents of the file identified by the fid and records a callback promise on it.
Store(fid, attr, data): Updates the attributes and (optionally) the contents of a specified file.
Create() -> fid: Creates a new file and records a callback promise on it.
Remove(fid): Deletes the specified file.
SetLock(fid, mode): Sets a lock on the specified file or directory. The mode of the lock may be shared or exclusive. Locks that are not removed expire after 30 minutes.
ReleaseLock(fid): Unlocks the specified file or directory.
RemoveCallback(fid): Informs the server that a Venus process has flushed a file from its cache.
BreakCallback(fid): This call is made by a Vice server to a Venus process. It cancels the callback promise on the relevant file.

Location database in AFS
[Figure: cell a with volumes i1, i2, i3 and cell b with volumes j1, j2, j3; each cell holds a map of the cell and a map of upper volumes; a client consults them]

Xfs architecture

Chapter 9: Name Services
Introduction Name services and the Domain Name System Directory and discovery services Case study of the Global Name Service Case study of the X.500 Directory Service Summary

Names, addresses and other attributes
Pure name: an uninterpreted bit pattern
Non-pure name: contains information about the object it names
Resolve: translate a name into the object it names
Bind: an association between a name and an object
Attribute: the value of a property associated with an object, e.g. IP address in DNS, person name in X.500, remote object reference in the CORBA Naming Service

Uniform Resource Identifiers
URL (Uniform Resource Locator) Address of a web resource Dangling problem: a resource may be moved URN (Uniform Resource Name) Intended to solve the dangling problem URN lookup service: mapping from URN to URL Format: urn:nameSpace:nameSpace-specificName. E.g. urn:ISBN: URC (Uniform Resource Characteristics) Descriptive attributes of the resource E.g. “author=Leslie Lamport”, “keywords=time”

Unification Integration
Name management is separated from other services Unification It is convenient for resources managed by different services to use the same naming scheme E.g. DNS Integration It is convenient to integrate service for sharing information by a common naming service

General name service requirements
Arbitrary number of names scalability Arbitrary number of administrative organizations feasibility A long lifetime Accommodate variations, feasibility High availability Fault isolation Isolate location failures from entire service Tolerance of mistrust

Name spaces
Name space A collection of all valid names recognized by a particular service Requires a syntactic definition Internal structure Hierarchic structure, e.g. /etc/passwd Each part is resolved relative to a separate context Potentially infinite Different contexts are managed by different people Alias Naming domains A name space with a single administrative authority, e.g. pku.edu.cn Naming domains are in general stored by different name servers

Combining and customizing name spaces
Homogeneous/heterogeneous name spaces Merging E.g. mounting file systems in Unix and NFS E.g. creating a higher-level root context Heterogeneity DCE name: /…/dcs.qmw.ac.uk/principals/Jean.Dollimore (cell: dcs.qmw.ac.uk, directory: principals) Customization One file with different names, e.g. NFS One name refers to different files, e.g. configuration for multiple platforms One name space per person, e.g. Plan 9

Name resolution
Name servers The name space is partitioned among different name servers Iterative navigation Performed by the client's name resolution software E.g., DNS, NFS Server-controlled navigation Non-recursive Recursive Suitable for environments where an administrative domain prohibits outside clients from contacting its name servers directly Caching Improves response time Reduces the workload of high-level name servers Isolates clients from failures of high-level name servers

The Domain Name System
Original Internet naming scheme A central master file, downloaded to all hosts by FTP Domain names [1987] The name space is partitioned both organizationally and according to geography com – Commercial organizations edu – Universities and other educational institutions gov – US governmental agencies mil – US military organizations net – Major network support centres org – Organizations not mentioned above int – International organizations us – United States uk – United Kingdom cn – China

DNS queries
Host name resolution From host (domain) name to IP address Mail host location Given a domain name, return a list of domain names of hosts that can accept mail for it E.g., Reverse resolution From IP address to domain name

DNS queries …continued
Host information E.g. the architecture type or operating system of a machine Well-known services A list of the services run by a computer Protocol used to obtain them (UDP & TCP)

DNS name servers
DNS names are divided into zones Zone Includes the names in the domain, less any sub-domains At least two name servers for the zone Holds the names of the name servers for the sub-domains Includes management parameters Each server holds zero or more zones Zero zones: a caching-only name server

DNS name servers …continued
Other servers that a name server holds Lower-level name servers Child name servers high-level name servers One or more root name servers Parent name server Iterative navigation / recursive navigation Example DNS resource types

DNS performance
Replication Zone data are replicated on at least two name servers Master server / secondary server Synchronize periodically Cache Any server is free to cache data Time-to-live value Availability & Scalability Achieved by a combination of replication, caching and partitioning Temporarily inconsistent naming data is acceptable

Directory services
A special kind of naming service <attribute, value> pairs Each entry holds a set of <attribute, value> pairs Lookup by known attributes, return the attributes of interest Yellow pages / white pages Examples: Active Directory Service, X.500, LDAP

Discovery services
A special kind of directory service Registers the services provided in a spontaneous network General operations Register / lookup / de-register E.g. a registered printer ResourceClass=printer, type=laser, colour=yes, resolution=600dpi, Location=room101, url=

Jini
A lookup service Services register an object with a set of attributes Clients query the lookup service Clients download the service object that matches the query How to locate the lookup service? A priori Multicast to a well-known IP multicast address Lookup services listen on the receiving socket Lookup services announce their existence Leases A limited period of time during which the service can be used

Introduction to GNS
Designed by DEC lab [Lampson 1986] Design objectives Millions of computer names Billions of addresses for users Long lifetime Accommodate changes

Architecture of GNS
Directory tree / value tree Directory identifier (DI): unique identifier of a directory Name: <directory name, value name> E.g. <EC/UK/AC/QMW, Peter.Smith> The directory tree is partitioned and stored in many servers Cache Inconsistent cached data is acceptable

How does GNS accommodate changes?
Merge two name spaces under a super-root How to make it transparent to client applications? Working root & well-known directories The client user agent stores the working DI E.g. for </UK/AC/QMW, Peter.Smith>, the client stores #599, the DI of EC Working DI + relative path Uniquely refers to a name in the merged tree E.g. <#599/UK/AC/QMW, Peter.Smith> Implementation: well-known directories Mapping from a working DI to its new absolute path Well-known directories must be replicated at each node, which is the bottleneck

X.500 Architecture
General-purpose directory service Service architecture Directory user agent (DUA) Directory service agent (DSA) Directory information tree (DIT) Partitioned and stored in different servers Organized according to distinguished names

Search in X.500
DIB entry Consists of a name and a set of attributes Search A base name + a filter expression LDAP Distributed object naming service based on LDAP

Summary
Basics of naming service Map between names and attributes of objects Context, binding, resolve Name space Syntactic rules Multiple name servers Cache & replication Cases DNS GNS: accommodating changes X.500: directory service

Composed naming domains used to access a resource from a URL
URL Resource ID (IP number, port number, pathname) Network address 2:60:8c:2:b0:5a file Web server WebExamples/earth.html 8888 DNS lookup Socket

Iterative navigation
A client iteratively contacts name servers NS1–NS3 in order to resolve a name

Non-recursive and recursive server-controlled navigation
A name server NS1 communicates with other name servers (NS2, NS3) on behalf of a client, either non-recursively or recursively

DNS name servers
Name servers and the zones they hold: a.root-servers.net (root), ns1.nic.uk (uk), ns0.ja.net (ac.uk), alpha.qmw.ac.uk (qmw.ac.uk), dns0.dcs.qmw.ac.uk (dcs.qmw.ac.uk), dns0-doc.ic.ac.uk (ic.ac.uk), ns.purdue.edu (purdue.edu)
Other domains shown: co.uk, yahoo.com

DNS resource records
Record type   Meaning                                  Main contents
A             A computer address                       IP number
NS            An authoritative name server             Domain name for server
CNAME         The canonical name for an alias          Domain name for alias
SOA           Marks the start of data for a zone       Parameters governing the zone
WKS           A well-known service description         List of service names and protocols
PTR           Domain name pointer (reverse lookups)    Domain name
HINFO         Host information                         Machine architecture and operating system
MX            Mail exchange                            List of <preference, host> pairs
TXT           Text string                              Arbitrary text

Service discovery in Jini
Printing service Lookup admin admin, finance finance Client Corporate infoservice 1. ‘finance’ lookup service? 2. Here I am: ..... 3. Request printing 4. Use printing Network

GNS directory tree and value tree for user Peter.Smith
UK FR AC QMW DI: 322 Peter.Smith password mailboxes DI: 599 (EC) DI: 574 DI: 543 DI: 437 Alpha Gamma Beta

Merging trees under a new root
EC UK FR DI: 599 DI: 574 DI: 543 NORTH AMERICA US DI: 642 DI: 457 DI: 732 #599 = #633/EC #642 = #633/NORTH AMERICA Well-known directories: CANADA DI: 633 (WORLD)

Restructuring the directory
UK FR DI: 599 DI: 574 DI: 543 NORTH AMERICA US DI: 642 DI: 457 DI: 732 #599 = #633/EC #642 = #633/NORTH AMERICA Well-known directories: CANADA DI: 633 (WORLD) #633/EC/US

X.500 Service Architecture
DSA DUA

Part of X.500 directory information tree
... France (country) Great Britain (country) Greece (country) BT Plc (organization) University of Gormenghast (organization) Department of Computer Science (organizationalUnit) Computing Service (organizationalUnit) Engineering Department (organizationalUnit) X.500 Service (root) Departmental Staff (organizationalUnit) Research Students (organizationalUnit) ely (applicationProcess) Alice Flintstone (person) Pat King (person) James Healey (person) Janet Papworth (person)

An X.500 DIB entry info alf mail
Alice Flintstone, Departmental Staff, Department of Computer Science, University of Gormenghast, GB commonName Alice.L.Flintstone Alice.Flintstone Alice Flintstone A. Flintstone surname Flintstone telephoneNumber uid alf mail roomNumber Z42 userClass Research Fellow

 Remote object naming service based on LDAP c=China st=Beijing
st=Hubei l=Beijing ou=Tsinghua l=Wuhan ou=Beida dc=Dean dc=Teacher dc=Course ou=Wuda  A B2 B1 C

Chapter 10: Time and Global States
Introduction Clocks,events and process states Synchronizing physical clocks Logical time and logical clocks Global states Distributed debugging Summary

Introduction
Time is an important issue in DS Need to measure time accurately E.g. auditing in e-commerce Algorithms depend on it E.g. consistency, make No universal physical clock Newton's opinion Einstein's Relativity Theory Practical approaches Approximately synchronized physical clocks Logical clocks

Model of a distributed system
A collection of N processes pi, i = 1, 2, …, N
si: the state of pi, e.g. its variables
Actions of pi: operations that transform pi's state, or send or receive a message between pi and pj
e: an event, the occurrence of a single action
→i: the occurred-before relation within pi, e.g. e →i e'; a total order of the events in pi
history(pi) = h_i = <e_i^0, e_i^1, e_i^2, …>

Clocks
Clock in a computer: a device that counts oscillations occurring in a crystal at a definite frequency
Hardware time: Hi(t) Relative time
Software time: Ci(t) = αHi(t) + β Used to timestamp events
Clock skew and clock drift
Skew: the instantaneous difference between the readings of any two clocks
Drift: crystals oscillate at different rates Clock drift cannot be avoided

Coordinated Universal Time
Standard second Atomic oscillator (International Atomic Time) Drift rate: one part in 10^13 9,192,631,770 periods of transition between the two hyperfine levels of the ground state of Cs-133 Since 1967 Astronomical time Rotation of the earth on its axis and about the Sun Skew between astronomical time and atomic time Coordinated Universal Time (UTC) Atomic time into which a leap second is occasionally inserted to keep it in step with astronomical time UTC is broadcast to the world E.g., by GPS or WWV

External & Internal synchronization
Ci : pi’s clock, I: an interval of real time External synchronization For a synchronization bound D > 0, and for a source S of UTC time, |S(t)-Ci(t)| < D, for i = 1, 2, … N and for all real times t in I Clocks Ci are accurate to within the bound D Internal synchronization For a synchronization bound D > 0, |Ci(t)-Cj(t)| < D for i, j =1,2, … N, and for all real times t in I Clocks Ci agree within the bound D If accurate to within D, then agree within 2D

General synchronization measures
Correctness of a hardware clock H
A bounded drift rate ρ, e.g. 10^-6 seconds/second: (1 - ρ)(t' - t) <= H(t') - H(t) <= (1 + ρ)(t' - t)
Correctness of a software clock
Monotonicity: t' > t ⇒ C(t') > C(t)
Setting a clock back violates monotonicity, e.g. it causes errors in make; change the clock rate instead
Clock failures
Crash failure: the clock stops ticking
Arbitrary failure, e.g. the Y2K bug

Synchronization in a synchronous system
Protocol Sender: send M(t) Receiver: set time to t + Ttrans Bounds are known in a synchronous system: min < Ttrans < max So, set Ttrans = (min + max) / 2 Receiver clock = t + (min + max) / 2 Clock skew is at most (max - min) / 2

Cristian’s method of synchronizing clocks
Application circumstance Client/server Useful when the round-trip time is short compared with the required accuracy Protocol Request mr, reply mt carrying the server's time t, measured round-trip time Tround Estimated time: t + Tround/2 Accuracy analysis If the minimum delay of a message transmission is min, then the server's time when mt arrives lies in [t + min, t + Tround - min], so the accuracy is ±(Tround/2 - min)
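The estimate and its accuracy bound can be written out as a few lines of arithmetic. This is a minimal sketch, not a full client: the function name `cristian_estimate` and the sample numbers are illustrative assumptions, and the network exchange itself is left out.

```python
def cristian_estimate(server_time, t_round, min_delay=0.0):
    """Return (estimated_time, accuracy) for Cristian's method.

    server_time: the time t carried in the server's reply
    t_round:     round-trip time measured by the client
    min_delay:   assumed minimum one-way transmission delay (min)
    """
    estimate = server_time + t_round / 2   # assume the reply took half the round trip
    accuracy = t_round / 2 - min_delay     # true time is within +/- accuracy of estimate
    return estimate, accuracy


# Hypothetical numbers: reply says t = 100.0 s, round trip 20 ms, min delay 4 ms.
est, acc = cristian_estimate(server_time=100.0, t_round=0.020, min_delay=0.004)
```

With these sample values the client sets its clock to about 100.010 s with an accuracy of about ±6 ms.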

The Berkeley algorithm
Internal synchronization Protocol The master polls the slaves' clocks The master estimates the slaves' clocks from the round-trip times Similar to Cristian's algorithm The master averages the clock values Cancels out the individual clocks' tendencies to run fast or slow The master sends back to each slave the amount by which that slave's clock should be adjusted Positive or negative value Avoids further uncertainty due to the message transmission time
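The averaging step can be sketched as follows. This is a simplified illustration under stated assumptions: the master includes its own reading in the average, the slave values are taken to be already corrected for round-trip time, and the fault-tolerant step of discarding outlying readings is omitted. The function name and the sample clock values are hypothetical.

```python
def berkeley_adjustments(master_time, slave_estimates):
    """Return the signed adjustment each clock should apply.

    master_time:     the master's own clock reading
    slave_estimates: dict mapping slave name -> estimated current clock value
    """
    clocks = {"master": master_time, **slave_estimates}
    average = sum(clocks.values()) / len(clocks)
    # Send the difference rather than the absolute time, so that the
    # transmission delay of the reply does not add further uncertainty.
    return {name: average - value for name, value in clocks.items()}


# Hypothetical readings in minutes: master 180, one fast slave, one slow slave.
adjustments = berkeley_adjustments(180.0, {"s1": 205.0, "s2": 170.0})
```

Here the average is 185, so the master advances by 5, s1 goes back by 20 (in practice by slowing its clock rate) and s2 advances by 15.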

Design aims of Network Time Protocol
External synchronization enable clients across the Internet to be synchronized accurately to UTC Reliability can survive lengthy losses of connectivity Redundant server & redundant path between servers Scalability Enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers Security Protect against interference with the time service

Network Time Protocol
Architecture Reconfigures as servers become unreachable Synchronization measures Multicast mode Intended for use on a high-speed LAN Assumes a small delay Low accuracy but efficient Procedure-call mode Similar to Cristian's algorithm Higher accuracy than multicast Symmetric mode The highest accuracy

Symmetric mode synchronization
Protocol with the highest accuracy
Assume t, t': actual transmission times of m, m'; o: actual offset of B's clock relative to A's
Then T_{i-2} = T_{i-3} + t + o and T_i = T_{i-1} + t' - o
Delay: d_i = t + t' = T_{i-2} - T_{i-3} + T_i - T_{i-1}
Offset: o = o_i + (t' - t)/2, where o_i = (T_{i-2} - T_{i-3} + T_{i-1} - T_i)/2
Estimated offset: o_i
Accuracy analysis: since t, t' >= 0, o_i - d_i/2 <= o <= o_i + d_i/2, so d_i is the measure of the accuracy
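The offset and delay estimates follow directly from the four timestamps. A small sketch, assuming the timestamps T_{i-3}, T_{i-2}, T_{i-1}, T_i are passed in as plain numbers; the function name and the sample values (true offset 5, t = 2, t' = 4) are illustrative assumptions.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Estimate (o_i, d_i) from one message pair.

    t_i3: A's clock when m was sent      (T_{i-3})
    t_i2: B's clock when m was received  (T_{i-2})
    t_i1: B's clock when m' was sent     (T_{i-1})
    t_i:  A's clock when m' was received (T_i)
    """
    delay = (t_i2 - t_i3) + (t_i - t_i1)         # d_i = t + t'
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2  # o_i
    return offset, delay


# Hypothetical exchange: true offset o = 5, t = 2, t' = 4.
# m leaves A at 100 and reaches B at 100 + 2 + 5 = 107;
# m' leaves B at 110 and reaches A at 110 + 4 - 5 = 109.
o_i, d_i = ntp_offset_delay(100, 107, 110, 109)
```

The estimate o_i = 4 differs from the true offset 5 by (t' - t)/2 = 1, which lies inside the bound o_i ± d_i/2 = 4 ± 3.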

Symmetric mode synchronization …continued
Implementation NTP servers retain the eight most recent pairs <o_i, d_i> The value of o_i that corresponds to the minimum value of d_i is chosen to estimate o An NTP server exchanges messages with several peers in addition to its parent Peers with lower stratum numbers are favoured Peers with the lowest synchronization dispersion are favoured

Happened-before relation
HB1: If e →i e' in some process pi, then e → e'
HB2: For any message m, send(m) → receive(m)
HB3: If e, e' and e'' are events such that e → e' and e' → e'', then e → e''
Also called causal ordering or potential causal ordering
Concurrent events: a || e means neither a → e nor e → a
Shortcomings
Not suitable for collaboration between processes that does not involve message transmission
Captures only potential causal ordering

Logical clocks
Lamport timestamps algorithm
LC1: Li is incremented before each event is issued at process pi: Li := Li + 1
LC2: (a) When a process pi sends a message m, it piggybacks on m the value t = Li; (b) on receiving (m, t), a process pj computes Lj := max(Lj, t) and then applies LC1 before timestamping the event receive(m)
e → e' ⇒ L(e) < L(e')
The converse does not hold: L(e) < L(e') implies only that either e → e' or e || e'
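Rules LC1 and LC2 fit in a few lines. The class below is a minimal sketch, with message passing reduced to handing the piggybacked value from one object to another; the class and method names are illustrative assumptions.

```python
class LamportClock:
    """Logical clock of one process, following LC1 and LC2."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # LC1: increment before each event is timestamped.
        self.time += 1
        return self.time

    def send(self):
        # LC2(a): the send event is timestamped, and that value is piggybacked.
        return self.tick()

    def receive(self, t):
        # LC2(b): take the maximum, then apply LC1 to timestamp receive(m).
        self.time = max(self.time, t)
        return self.tick()


p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.tick()        # local event at p1
b = p1.send()        # p1 sends m
c = p2.receive(b)    # p2 receives m
e = p3.tick()        # unrelated local event at p3
```

In this run a = 1, b = 2, c = 3 and e = 1: a → b → c is reflected in the timestamps, while L(e) < L(c) even though e and c are concurrent.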

Totally ordered logical clocks
Assumption Ti : local timestamp of e that is an event occurring at pi Tj : local timestamp of e` that is an event occurring at pj Define the timestamps of e, e` are (Ti, i), (Tj, j) Define < (Ti, i) < (Tj, j) if Ti < Tj , or Ti = Tj and i < j Useful in some applications

Vector clocks
Algorithm: each process pi keeps a vector clock Vi
VC1: Initially, Vi[j] = 0, for i, j = 1, 2, …, N
VC2: Just before pi timestamps an event, it sets Vi[i] := Vi[i] + 1
VC3: pi includes the value t = Vi in every message it sends
VC4: When pi receives a timestamp t in a message, it sets Vi[j] := max(Vi[j], t[j]), for j = 1, 2, …, N
Comparing vector timestamps
V = V' iff V[j] = V'[j] for j = 1, 2, …, N
V <= V' iff V[j] <= V'[j] for j = 1, 2, …, N
V < V' iff V <= V' and V ≠ V'
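Rules VC1 to VC4 and the comparison operators can be sketched as below. Processes are indexed from 0 here rather than 1, and the names `VectorClock`, `leq`, `less` and `concurrent` are illustrative assumptions; two timestamps are treated as concurrent when neither is <= the other.

```python
class VectorClock:
    """Vector clock of process i in a system of n processes (0-based index)."""

    def __init__(self, n, i):
        self.v = [0] * n            # VC1
        self.i = i

    def tick(self):
        self.v[self.i] += 1         # VC2
        return list(self.v)

    def send(self):
        return self.tick()          # VC3: the returned copy travels with the message

    def receive(self, t):
        self.v = [max(a, b) for a, b in zip(self.v, t)]   # VC4: component-wise max
        return self.tick()          # VC2 applies to the receive event itself


def leq(v, w):
    return all(a <= b for a, b in zip(v, w))

def less(v, w):
    return leq(v, w) and v != w

def concurrent(v, w):
    return not leq(v, w) and not leq(w, v)


p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
a = p1.tick()        # [1, 0, 0]
b = p1.send()        # [2, 0, 0]
c = p2.receive(b)    # [2, 1, 0]
e = p3.tick()        # [0, 0, 1]
```

Here less(a, c) holds because a happened before c, while concurrent(c, e) holds because neither timestamp dominates the other.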

Vector Clocks …continued
V(e) < V(e') ⇔ e → e'
e || e' ⇔ neither V(e) <= V(e') nor V(e') <= V(e)
Cost: O(N) storage and message payload; a size of N is unavoidable in general
Improvement: send smaller amounts of data and reconstruct the full vectors

Requirements of global states
Distributed garbage collection Based on reference counting Should include the state of communication channels Distributed deadlock detection Look for “waits-for” relationship Distributed termination detection Look for state in which all processes are passive Distributed debugging Need collect values of distributed variables at the same time

Global states and consistent cuts
The essential problem of global states: the absence of global time
History of process pi: h_i = <e_i^0, e_i^1, e_i^2, …>
Prefix of a process's history: h_i^k = <e_i^0, e_i^1, …, e_i^k>
Global history of the set of processes: H = h_1 ∪ h_2 ∪ … ∪ h_N
A global state: S = (s_1, s_2, …, s_N)
A cut of a system execution: C = h_1^{c_1} ∪ h_2^{c_2} ∪ … ∪ h_N^{c_N}
Frontier of a cut: the last event of each process in the cut, {e_i^{c_i} : i = 1, 2, …, N}
A cut C is consistent if, for all events e ∈ C, f → e ⇒ f ∈ C
Example: the cut with frontier <e_1^0, e_2^0> is inconsistent; the cut with frontier <e_1^2, e_2^2> is consistent

Global states and consistent cuts … continued
A consistent global state: corresponds to a consistent cut
The state s_i corresponding to the cut C is that of pi immediately after the last event processed by pi in C, i.e. the frontier of C
Execution of a distributed system: S0 → S1 → S2 → …
A run: a total ordering of all the events in a global history that is consistent with each local history's ordering →i
Not all runs pass through consistent global states
A linearization (consistent run): an ordering of the events in a global history that is consistent with the happened-before relation → on H
A linearization passes only through consistent global states
S' is reachable from a state S: there is a linearization that passes through S and then S'

Global state predicates, stability, safety and liveness
Global state predicate: a function that maps from the set of global states of the processes in the system to {True, False}
Characteristics of global state predicates
Stability: once the system enters a state in which the predicate is True, it remains True in all future states reachable from that state
Useful in deadlock detection or termination detection
Safety with respect to a predicate φ: φ evaluates to False for all states S reachable from S0
E.g., φ is the property of being deadlocked
Liveness with respect to a predicate φ: for any linearization L starting in the state S0, φ evaluates to True for some state S_L reachable from S0
E.g., φ is the property of reaching termination

The “snapshot” algorithm of Chandy and Lamport
Aim Capture consistent global state of distributed system Algorithm assumptions Neither channels nor processes fail unidirectional channels, FIFO message delivery Complete connection among all processes Any process may initiate a global snapshot at any time process may continue execution and send and receive normal message while snapshot takes place

The “snapshot” algorithm
Idea When one process records a state Si, make all other processes record states that are consistent with Si Method Incoming channels, outgoing channels Process state + channel state Marker message Marker sending rule: a process sends a marker over each outgoing channel after it has recorded its state, but before it sends any other messages Marker receiving rule: a process that has not yet recorded its state records it on receipt of a marker; otherwise it records the state of the incoming channel on which the marker arrived

The “snapshot” algorithm - example
p1 trades with p2 in widgets at $10 per item Initial state: p1 has sent $50 to p2 to buy 5 widgets, and p2 has received the order

Execution of the processes in the example
The final recorded state P1:<$1000, 0>; p2:<$50,1995>;c1:<(five widgets)>;c2:<>

Characterising the observed state
The recorded state is consistent
Consider two events ei at pi and ej at pj such that ei → ej
To prove: if ej occurred before pj recorded its state, then ei occurred before pi recorded its state
Proof by contradiction: suppose pi recorded its state before ei occurred
Because ei → ej, there is a sequence of messages m1, m2, … leading from pi to pj
By the marker sending rule and FIFO delivery, a marker precedes each of these messages, so pj would have recorded its state before ej, which contradicts the assumption
So the recorded state is consistent

Characterising the observed state … continued
Construct reachability relationship Reachability between the observed global state and the initial and final global states Sys = e0, e1, … : linearization of the system as it executed Find a permutation of Sys, Sys` = e0`, e1`, … such that all three states Sinit, Ssnap and Sfinal occur in Sys` Sys` is also a linearization Approach Find pre-snap events / post-snap events according to a snap figure

Distributed debug introduction
Example Safety condition of a distributed system: |xi - xj| <= δ Approach A monitor Collects the states of the distributed processes Applies a given global state predicate φ to the states Possibly φ: there is a consistent global state S through which a linearization of H passes such that φ(S) is True Definitely φ: for all linearizations L of H, there is a consistent global state S through which L passes such that φ(S) is True

Observing consistent global states
Vector clock at each process Timestamp each event occurring at each process Each process send the timestamped event to the monitor Find consistent global states by the monitor Let S = (s1, s2, …, sN) S is a global state drawn from the state messages that the monitor has received S is a consistent global state if and only if V(si)[i]>=V(sj)[i] for i,j = 1,2,…, N If one process’s state depends upon another, the global state also encompasses the state upon which it depends
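The consistency condition V(si)[i] >= V(sj)[i] translates directly into a check the monitor can run on a candidate global state. A minimal sketch, assuming 0-based process indices and that each local state is represented only by its vector timestamp; the function name and sample timestamps are illustrative.

```python
def is_consistent(timestamps):
    """timestamps[i] is V(s_i), the vector timestamp of process i's local state.

    The global state is consistent iff no process has seen more events of
    process i than process i's own recorded state contains.
    """
    n = len(timestamps)
    return all(timestamps[i][i] >= timestamps[j][i]
               for i in range(n) for j in range(n))


# Hypothetical two-process examples.
ok = is_consistent([[3, 0], [2, 2]])    # p2 has seen 2 events of p1; p1 is at 3
bad = is_consistent([[1, 0], [2, 1]])   # p2 has seen 2 events of p1; p1 is only at 1
```

In the second case p2's state depends on an event of p1 that p1's recorded state does not yet include, so the pair is rejected.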

Observing consistent global states … continued
Example of consistent and inconsistent global states Two processes try to maintain |x1 - x2| <= 50 When one process changes the value of its variable by a large amount, it informs the other process, which adjusts its own variable as well The lattice of collected global states The monitor constructs the reachability lattice using the consistent global state identification algorithm Find consistent global states Establish the reachability relation between states Sij is in level (i + j) The lattice shows all the linearizations corresponding to a history

Evaluating possibly φ and definitely φ
Evaluating possibly φ There is a downward path through the lattice that contains a state for which φ evaluates to True Evaluating definitely φ There is no downward path through the lattice that avoids every state for which φ evaluates to True Example If φ evaluates to True in the state at level 5, then definitely φ If φ evaluates to False in the state at level 5, then possibly φ

Evaluating possibly φ and definitely φ in synchronous systems
Asynchronous systems High time cost To find a consistent global state S = (s1, s2, …, sN), the monitor must examine every pair of local states si and sj Synchronous systems |Ci(t) - Cj(t)| < D for i, j = 1, 2, …, N Algorithm modification The observed process sends the vector time and the physical time with each event to the monitor The monitor finds consistent states: V(si)[i] >= V(sj)[i], and si and sj could have occurred at the same real time

Summary
Clock skew, clock drift Synchronizing physical clocks Cristian's algorithm Berkeley algorithm Network Time Protocol Logical time Happened-before relation Lamport timestamp algorithm Vector clock

Summary …continued
Global states Consistent cut, consistent state Snapshot algorithm Construct the reachability relationship from the snapshot Global debugging The monitor collects distributed events with vector timestamps Construct the reachability relationship Examine possibly φ and definitely φ

Skew between computer clocks in a DS

Clock synchronization using a time server
p Time server,S

An example synchronization subnet in an NTP implementation
1 2 3 Note: Arrows denote synchronization control, numbers denote strata.

Messages exchanged between a pair of NTP peers
[Figure: servers A and B on a time axis; A sends m at T_{i-3}, B receives it at T_{i-2}; B sends m' at T_{i-1}, A receives it at T_i]

Events occurring at three processes

Lamport timestamps for the events

Vector timestamps for the events

Detecting global properties

The “snapshot” algorithm
Marker receiving rule for process pi
On pi's receipt of a marker message over channel c:
  if (pi has not yet recorded its state)
    it records its process state now;
    records the state of c as the empty set;
    turns on recording of messages arriving over other incoming channels;
  else
    pi records the state of c as the set of messages it has received over c since it saved its state.
  end if
Marker sending rule for process pi
After pi has recorded its state, for each outgoing channel c:
  pi sends one marker message over c (before it sends any other message over c).
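The two rules can be sketched as a small class. This is an illustration under simplifying assumptions, not a full implementation: channels are identified by name, message transport is simulated by direct method calls, and the class name, method names and sample values (modelled loosely on the widget-trading example) are hypothetical.

```python
class SnapshotProcess:
    """One process taking part in a Chandy-Lamport snapshot (simulation only)."""

    def __init__(self, state, incoming, outgoing):
        self.state = state
        self.incoming = list(incoming)   # names of incoming channels
        self.outgoing = list(outgoing)   # names of outgoing channels
        self.has_recorded = False
        self.recorded_state = None
        self.channel_state = {}          # channel -> messages recorded for it
        self.recording = set()           # channels still being recorded
        self.buffers = {}

    def record_state(self):
        """Record the local state; return the markers to send (sending rule)."""
        self.has_recorded = True
        self.recorded_state = self.state
        self.recording = set(self.incoming)
        self.buffers = {c: [] for c in self.incoming}
        return [(c, "MARKER") for c in self.outgoing]

    def on_message(self, channel, msg):
        """An ordinary message arrives; remember it if the channel is being recorded."""
        if self.has_recorded and channel in self.recording:
            self.buffers[channel].append(msg)

    def on_marker(self, channel):
        """Marker receiving rule; returns any markers that must now be sent."""
        if not self.has_recorded:
            markers = self.record_state()
            self.channel_state[channel] = []          # state of c is empty
        else:
            markers = []
            self.channel_state[channel] = self.buffers[channel]
        self.recording.discard(channel)
        return markers


# c2 carries messages p1 -> p2, c1 carries messages p2 -> p1.
p1 = SnapshotProcess(("$1000", 0), incoming=["c1"], outgoing=["c2"])
p2 = SnapshotProcess(("$50", 2000), incoming=["c2"], outgoing=["c1"])
sent_by_p1 = p1.record_state()        # p1 initiates and sends a marker on c2
p2.state = ("$50", 1995)              # p2 ships five widgets on c1 first
sent_by_p2 = p2.on_marker("c2")       # p2 records its state, c2 is empty
p1.on_message("c1", "five widgets")   # the widgets arrive while p1 records c1
p1.on_marker("c1")                    # p1 closes the recording of c1
```

The run ends with p1 recorded as <$1000, 0>, p2 as <$50, 1995>, c1 holding the five widgets and c2 empty.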

Has Pi recorded its state?
[Figure: a marker in the message buffer of channel C relative to operations Op1, Op2, Op3 at Pi, shown for the cases where Pi has not recorded its state and where Pi has recorded its state; operations before the marker have executed, operations after it have not]

Reachability between states in the snapshot algorithm

Find pre-snap events and post-snap events
1. The snapshot is a consistent global state that records a set of events that occurred at the processes
2. Approach: swap adjacent events ej and ej+1 whenever ej is a post-snap event and ej+1 is a pre-snap event
3. Analysis
(1) This situation cannot happen if ej → ej+1: if ej+1 is a pre-snap event then, because the snapshot is a consistent global state, ej must also be a pre-snap event
(2) This situation can happen only if ej || ej+1; swapping ej and ej+1 then does not change the happened-before relation, so the result is still a linearization

Vector timestamps and variable values for the execution of Figure 10.9

The lattice of global states for the execution of Figure 10.14
Sij = global state after i events at process 1 and j events at process 2
[Figure: lattice with states S00, S10, S20, S21, S30, S31, S32, S22, S23, S33, S43 arranged in levels 0 to 7]

Algorithms to evaluate possibly φ and definitely φ

Evaluating definitely φ

Chapter 11: Coordination and Agreement
Introduction Distributed mutual exclusion Elections Multicast communication Consensus and related problems Summary

Introduction
Collaboration behaviors in DS Mutual exclusion Election Multicast Basic, reliable, ordered Agreement between processes Consensus, Byzantine agreement Failure assumptions No failures Benign failures Arbitrary failures

Failure assumptions
The real situation of the Internet Network partition Asymmetric routing Intransitive connectivity Channel assumption: reliable A failed link will be repaired or circumvented Process assumption Crash failure unless stated otherwise; where stated, arbitrary failure Correct process: exhibits neither crash failure nor arbitrary failure

Failure detectors
Unreliable failure detector: answers Unsuspected or Suspected; both are hints and may be inaccurate
Reliable failure detector: answers Unsuspected (a hint, may be inaccurate) or Failed (accurate: the process has crashed); requires that the system is synchronous
Implementing an unreliable failure detector: each process announces its liveness every T seconds; the detector suspects a process if no message from it arrives within T + D seconds, where D is the estimated maximum message transmission delay
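A timeout-based unreliable failure detector can be sketched as below. It assumes the common formulation in which a process is suspected once no heartbeat has arrived for T + D seconds (announcement period plus estimated maximum transmission delay). The class name is hypothetical, and times are passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Unreliable failure detector driven by periodic 'I am alive' messages."""

    def __init__(self, period, max_delay):
        self.timeout = period + max_delay   # T + D
        self.last_seen = {}                 # process id -> time of last heartbeat

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now

    def status(self, pid, now):
        last = self.last_seen.get(pid)
        if last is None or now - last > self.timeout:
            return "Suspected"              # only a hint: the process may just be slow
        return "Unsuspected"


detector = HeartbeatDetector(period=5, max_delay=1)
detector.heartbeat("p1", now=0)
early = detector.status("p1", now=4)   # within T + D of the last heartbeat
late = detector.status("p1", now=7)    # more than T + D has passed
```

A small D produces many false suspicions, while a large D makes crashes slow to notice, which is why the answers are treated as hints.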

Algorithms for mutual exclusion
Assumption Asynchronous system, no process fails, reliable channels Application-level protocol enter() resourceAccesses() exit() Essential requirements for mutual exclusion Safety At most one process may execute in the critical section at a time Liveness Requests to enter and exit the critical section (CS) eventually succeed Free from deadlock and starvation Ordering If one request to enter the CS happened-before another, then entry to the CS is granted in that order

Algorithms for mutual exclusion … continued
Evaluate the performance of the algorithms Bandwidth consumed The number of messages sent in each entry and exit operation Client Delay Incurred by a process at each entry and exit operations Throughput Synchronization delay: delay between one process exiting the critical section and the next process entering it

The Central server algorithm
Architecture Meet safety and liveness, but not ordering Performance Bandwidth consumption Enter(): A request message, a grant message Exit(): a release message Client Delay (no waiting processes) Request message + grant message Synchronization delay A release message + a grant message
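The server side of the central server algorithm amounts to a holder variable and a FIFO queue. A minimal sketch, with request, grant and release messages modelled as method calls and return values; the class and method names are illustrative assumptions.

```python
from collections import deque


class LockServer:
    """Central server that grants the critical section to one process at a time."""

    def __init__(self):
        self.holder = None
        self.queue = deque()   # waiting requests, oldest first

    def request(self, pid):
        """Handle a request message; True means a grant is sent immediately."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Handle a release message; returns the process granted next, if any."""
        if pid != self.holder:
            raise ValueError("release from a process that does not hold the token")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


server = LockServer()
first = server.request("p1")    # granted at once
second = server.request("p2")   # queued behind p1
nxt = server.release("p1")      # release + grant: the synchronization delay
```

Requests are served in the order they reach the server, which need not match happened-before order, so the ordering requirement is not met.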

Ring-based algorithm
Architecture Meets safety and liveness, but not ordering Performance Bandwidth consumed Continuously consumes network bandwidth Client delay Min: 0 messages, when the process has just received the token Max: N messages, when it has just passed on the token Synchronization delay Min: 1 message, when processes enter the CS one after another Max: N messages, when a process enters the CS repeatedly and no other process enters the CS

An algorithm using multicast and logical clocks
Idea A process enters the CS only if all other processes have replied Multicast + reply Concurrency control Lamport logical clock: avoids deadlock Algorithm Example p1 and p2 request entry concurrently; p2, whose request has the lower timestamp, enters first Performance Bandwidth Enter(): N-1 multicast messages, N-1 replies Client delay: round-trip time Synchronization delay: one message

Maekawa’s voting algorithm
Idea: a process enters the CS when a subset of the other processes (its voting set) has agreed. A voting set Vi is associated with each process pi, where Vi ⊆ {p1, p2, …, pN}, such that: pi ∈ Vi; Vi ∩ Vj ≠ ∅ for all i, j; |Vi| = K, so that, to be fair, every voting set has the same size (K ≈ √N in Maekawa’s optimal construction); each process pj is contained in M of the voting sets Vi, with M = K to be fair.
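One simple way to build voting sets with the required properties is to place the processes in a square grid and take each process's row plus its column. This is a sketch, not Maekawa's optimal construction: it gives sets of size 2√N − 1 rather than about √N, and the function name `grid_voting_sets` and the perfect-square restriction are assumptions made for illustration.

```python
import math
from itertools import combinations


def grid_voting_sets(n):
    """Return {i: Vi} for processes 0..n-1 laid out in a sqrt(n) x sqrt(n) grid."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    sets = {}
    for i in range(n):
        row, col = divmod(i, side)
        row_members = {row * side + c for c in range(side)}
        col_members = {r * side + col for r in range(side)}
        sets[i] = row_members | col_members   # Vi = row of pi plus column of pi
    return sets


voting = grid_voting_sets(9)
# pi is in Vi, every pair of sets overlaps, and all sets have the same size K.
assert all(i in voting[i] for i in voting)
assert all(voting[a] & voting[b] for a, b in combinations(voting, 2))
assert len({len(v) for v in voting.values()}) == 1
```

Any two sets overlap because the row of one process always crosses the column of the other.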

Algorithm A deadlock example Maekawa’s voting algorithm …continued
Request to lock Reply to admit to lock Release lock A deadlock example The improved algorithm A total order of each request The wait-for operation executes in accordance with the total order

Maekawa’s voting algorithm …continued
Performance: Bandwidth utilization: 2√N messages per entry to the CS if release messages are not counted; 3√N if they are, which is better than the 2(N-1) of the multicast algorithm when N > 4. Client delay: one round-trip time, the same as the multicast algorithm. Synchronization delay: one round-trip time, worse than the multicast algorithm, which needs only a single message transmission time.

Some concepts about Election algorithm
Choose a unique process to play a particular role. Defining the election: the process with the largest identifier is chosen; e.g., to elect the process with the lowest load, use id = 1/load. Requirements of an election algorithm: Safety: a participant process pi has electedi = ⊥ or electedi = P, where P is the non-crashed process with the largest identifier at the end of the run. Liveness: all processes pi participate and eventually either set electedi ≠ ⊥ or crash. Evaluating the performance of an election algorithm: total bandwidth utilization; turnaround time, the number of serialized message transmission times between the initiation and termination of a single run.

A ring-based election algorithm
All processes are arranged in a logical ring. Algorithm: Initially, every process is a non-participant. Calling an election: when necessary, a process sets idmsg = idlocal, sends the message {elect, idmsg} to its neighbour, and marks itself as a participant. Forwarding the election message: a receiver marks itself as a participant if it has not already done so, and forwards {elect, MAX(idlocal, idmsg)} to its neighbour. Electing the coordinator: when a process receives an election message with idmsg = idlocal, it marks itself as a non-participant, sets idcoordinator = idlocal, and sends {elected, idcoordinator} to its neighbour. Forwarding the elected message: each process that receives it marks itself as a non-participant, records idcoordinator, and forwards the message.

A ring-based election algorithm … continued
An example. Evaluating the performance: Worst-case turnaround: a process starts an election when its anti-clockwise neighbour has the highest identifier. The election message reaches that neighbour after N-1 messages; a further N messages pass before the neighbour sees its own identifier return and finds that it is the coordinator; the elected message then takes another N messages. So the worst-case turnaround is 3N-1 messages. Best-case turnaround: 2N messages. The algorithm tolerates no failures.
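The message counts can be checked with a small simulation of a single election on the ring. This is a sketch under stated assumptions: one initiator, no failures, messages travelling towards increasing list index, and a hypothetical function name `ring_election`.

```python
def ring_election(ids, initiator):
    """Simulate one ring election; return (coordinator id, messages sent)."""
    n = len(ids)
    participant = [False] * n
    participant[initiator] = True
    carried = ids[initiator]            # identifier in the election message
    pos = (initiator + 1) % n           # process that receives the message
    messages = 1
    while carried != ids[pos]:
        # Forward the larger of the received and the local identifier.
        # With a single initiator, no participant ever sees a smaller
        # identifier, so the "discard" rule of the full algorithm is not needed.
        if carried < ids[pos]:
            carried = ids[pos]
        participant[pos] = True
        pos = (pos + 1) % n
        messages += 1
    coordinator = ids[pos]              # its own identifier came back
    messages += n                       # elected message travels once around
    return coordinator, messages


# Worst case: the initiator's anti-clockwise neighbour holds the highest id.
print(ring_election([1, 2, 3, 9], initiator=0))   # (9, 11), i.e. 3N - 1 for N = 4
# Best case: the initiator itself holds the highest id.
print(ring_election([9, 1, 2, 3], initiator=0))   # (9, 8), i.e. 2N for N = 4
```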

The bully algorithm
Assumptions: the system is synchronous, so timeouts can be used to detect process failure (a reliable failure detector); each process can communicate with the processes that have higher identifiers, which divides the other processes into higher processes and lower processes.

The bully algorithm … continued
The algorithm: Initiating an election: a process P begins an election when it notices that the coordinator has failed, by sending election messages to the higher processes. Answering: every correct higher process replies with an answer message and initiates its own election. Announcing the coordinator: the highest process sends the coordinator message to the lower processes directly. If P receives no answers, it sends the coordinator message to the lower processes itself. If P has received some answers, it waits for a coordinator message; if none arrives within a timeout period, P initiates a new election. Each process pi that receives the coordinator message sets electedi = idcoordinator.

The bully algorithm … continued
An example. Evaluating the performance: Best case: bandwidth = N-2 messages, when the second-highest process initiates the election and sends N-2 coordinator messages to the lower processes; turnaround time = 1 message. Worst case: bandwidth is O(N²) messages, when the lowest process initiates the election.

Multicast introduction
Multicast/broadcast Challenges Efficiency Bandwidth utilization Total transmission time E.g. delivery tree in IP multicast Delivery guarantees Reliability Ordering Group management Processes joining and leaving group at arbitrary times

System model Multicast(g,m) Deliver(m)
A process send the message m to all members of the group g Deliver(m) Deliver the message m sent by multicast to the calling process Closed/Open groups

Basic multicast
Basic multicast: a correct process will eventually deliver the message, as long as the multicaster does not crash. Primitives: B-multicast / B-deliver. It differs from IP multicast in its reliability guarantee. The implementation is based on a reliable one-to-one send operation: to B-multicast(g, m): for each process p ∈ g, send(p, m); on receive(m) at p: B-deliver(m) at p.

Implement basic multicast
Multi-threads The multicaster performs the send operations concurrently Ack-implosion Large number of acknowledgments sent back The multicasting process’s buffer will fill and drop acknowledgements The multicasting process will retransmit the dropped messages

Reliable multicast semantics
Integrity A correct process delivers a message m at most once Validity If a correct process multicasts message m then it will eventually deliver m Agreement If a correct process delivers message m, then all other correct processes in group(m) will eventually deliver m Atomicity: all or nothing Different from basic multicast: it is not met if the multicaster fails when it is multicasting

Implementing reliable multicast over B-multicast
Reliable multicast algorithm

Implementing reliable multicast over B-multicast … continued
Validity: a correct process will eventually B-deliver the message to itself. Integrity: follows from the integrity of B-multicast, plus filtering of duplicate messages. Agreement: every correct process B-multicasts m to the group before it R-delivers m, so if some correct process R-delivers m, every other correct process eventually B-delivers m and therefore R-delivers it. Cost: the algorithm is expensive, since each message is sent |group| times to each process.

Reliable multicast over IP multicast
Characteristics: IP multicast communication is often successful. Piggybacked acknowledgement: a process attaches acknowledgements to the messages that it multicasts to the group. Negative acknowledgement: a process sends a separate response message only when it detects that it has missed a message.

Reliable multicast over IP multicast … continued
Algorithm. Hold-back queue. Evaluation: Integrity: detection of duplicates, plus the checksum in IP multicast. Validity. Agreement: achieved by retransmission, assuming processes retain copies of the messages they have delivered.

Uniform properties
Uniform property: a property that holds whether or not processes are correct. Uniform agreement: if a process, whether it is correct or fails, delivers message m, then all correct processes in group(m) will eventually deliver m. Example: if the lines “R-deliver m” and “if (q ≠ p) then B-multicast(g, m); end if” of the reliable multicast algorithm are reversed, the resulting algorithm does not satisfy uniform agreement.

Ordered multicast
FIFO ordering: if a correct process issues multicast(g, m) and then multicast(g, m`), then every correct process that delivers m` will deliver m before m`. Causal ordering: if multicast(g, m) → multicast(g, m`), where → is the happened-before relation, then any correct process that delivers m` will deliver m before m`. Total ordering: if a correct process delivers message m before it delivers m`, then any other correct process that delivers m` will deliver m before m`. Example: the bulletin board.

Hybrid ordered multicast
FIFO-total ordering Causal-total ordering Hybrid order and reliable protocol Atomic multicast: a reliable totally ordered multicast

Implementing FIFO ordering
FO-multicast/FO-deliver Algorithm Similar to the algorithm of reliable multicast over IP-multicast Sgp, Rgp, Hold-back queue

Implementing total ordering
TO-multicast/TO-deliver. Sequencer-based algorithm: Sg: sequencer(g) maintains a group-specific sequence number. TO-multicast: the sender B-multicasts the message m to the group and to the sequencer; the sequencer then B-multicasts a sequence number for m to the group. TO-deliver: each process delivers the messages in the order of their sequence numbers. Bottleneck: the sequencer may become a bottleneck.

Example Algorithm Performance The ISIS algorithm for total ordering
All correct processes ultimately agree on the same set of sequence numbers which are monotonically increasing Performance High latency: 3 messages

Implementing causal ordering
Vector timestamp Each process maintains its own vector timestamp CO-multicast Attach vector timestamp to the multicasted message CO-deliver Deliver messages according to its vector timestamp Algorithm
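The CO-deliver test can be sketched as a predicate over vector timestamps: a message from sender j is deliverable once it is the next message from j and the receiver has already delivered everything the sender had delivered when it sent the message. The function names below are hypothetical, and processes are assumed to be indexed 0..N-1.

```python
def can_co_deliver(local, msg_ts, sender):
    """True if a message with vector timestamp msg_ts from `sender` may be delivered.

    local  -- the receiver's vector (count of delivered messages per process)
    msg_ts -- the vector timestamp attached to the message
    """
    if msg_ts[sender] != local[sender] + 1:       # not the next message from sender
        return False
    return all(msg_ts[k] <= local[k]              # no causally earlier message missing
               for k in range(len(local)) if k != sender)


def co_deliver(local, sender):
    """Return the receiver's vector after delivering a message from `sender`."""
    updated = list(local)
    updated[sender] += 1
    return updated


local = [0, 0, 0]
reply = ([1, 1, 0], 1)      # sent by process 1 after it delivered process 0's message
first = ([1, 0, 0], 0)      # the original message from process 0
assert not can_co_deliver(local, *reply)   # held back: depends on `first`
assert can_co_deliver(local, *first)
local = co_deliver(local, 0)
assert can_co_deliver(local, *reply)       # now deliverable
```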

Overlapping groups
Global FIFO ordering: if a correct process issues multicast(g, m) and then multicast(g`, m`), then every correct process in g ∩ g` that delivers m` will deliver m before m`. Global causal ordering: if multicast(g, m) → multicast(g`, m`), then any correct process in g ∩ g` that delivers m` will deliver m before m`. Pairwise total ordering: if a correct process delivers message m sent to g before it delivers m` sent to g`, then any other correct process in g ∩ g` that delivers m` will deliver m before m`. Global total ordering: let “<” be the relation of ordering between delivery events; we require that “<” obeys pairwise total ordering and that it is acyclic.

Consensus introduction
Make agreement in a distributed manner Mutual exclusion: who can enter the critical region Totally ordered multicast: the order of message delivery Byzantine generals: attack or retreat? Consensus problem Agree on a value after one or more of the processes has proposed what the value should be Consensus, byzantine general problem, interactive consistency, totally ordered multicast Failure model Process crash failure, process byzantine (arbitrary) failure

Definition of the consensus problem
Defined variables pi: process i; vi: proposed value of pi ; di: decision variable of pi Example Requirements of a consensus algorithm Termination Eventually each correct process sets its decision variable Agreement If pi and pj are correct and have entered the decided state, then di = dj (i, j = 1,2, …, N) Integrity If the correct processes all proposed the same value, then any correct process in the decided state has chosen that value

Consensus algorithm in no-failure circumstance
Each process multicasts its proposed value. Each process collects the values proposed by the other processes. Each process computes V = majority(v1, v2, …, vN); the majority function may be replaced by minimum or maximum. Analysis: Termination: guaranteed by the reliability of the multicast. Agreement and integrity: guaranteed by the definition of majority. How should failures be dealt with?

The byzantine generals problem
One commander order “attack” or “retreat”, other generals execute the order There are treacherous generals All correct generals execute the same order Requirements of the algorithm Termination Eventually each correct process sets its decision variable Agreement If pi and pj are correct and have entered the decided state, then di = dj (i, j = 1,2, …, N) Integrity If the commander is correct, then all correct processes decide on the value that the commander proposed

Interactive consistency
Agree on a vector of values Decision vector: each element represents the decided value of each process Requirements of the algorithm Termination Eventually each correct process sets its decision variable Agreement The decision vector of all correct processes is the same Integrity If pi is correct, then all correct processes decide on vi as the ith component of their vector

Relate consensus to other problems
Definitions Ci(v1,v2, …, vN ) decision state of pi in consensus problem BGi(j,v) decision state of pi in byzantine general problem, in which pj is the commander ICi(v1,v2, …, vN )[j] the jth decision state of pi in interactive consistency problem

From BG to IC & From IC to C
Run BG N times, once with each process pi acting as the commander ICi(v1,v2, …, vN )[j] = BGi(j,v), (i, j = 1, 2, …, N) From IC to C Ci(v1,v2, …, vN ) = majority(ICi(v1,v2, …, vN )[1],…, ICi(v1,v2, …, vN )[N])

From C to BG The commander pj sends its proposed value v to itself and each of the remaining processes All processes run C with the values v1,v2, …, vN that they receive (pj may be faulty) BGi(j,v) = Ci(v1,v2, …, vN ), (i= 1, 2, …, N)

Consensus in a synchronous system
Crash failures: assume that up to f of the N processes exhibit crash failures. Algorithm: Values_i^r is the set of proposed values known to process pi at the beginning of round r (1 <= r <= f+1). In each of the f+1 rounds, each process multicasts the proposed values it has newly accepted; then di = minimum(Values_i^(f+2)).
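The round structure can be illustrated with a small simulation. This is a sketch under stated assumptions: rounds are lock-step, a crash is modelled as a process reaching only some recipients in one round and then stopping, and the function name and the `crashes` parameter format are hypothetical.

```python
def synchronous_consensus(proposals, f, crashes=None):
    """Run f+1 rounds; return {process: decision} for the surviving processes.

    crashes -- {pid: (round, recipients reached before crashing)}
    """
    crashes = crashes or {}
    n = len(proposals)
    values = [{v} for v in proposals]     # values known to each process
    sent = [set() for _ in range(n)]      # values each process has already multicast
    alive = [True] * n
    for r in range(1, f + 2):             # rounds 1 .. f+1
        inbox = [set() for _ in range(n)]
        for p in range(n):
            if not alive[p]:
                continue
            new = values[p] - sent[p]     # multicast only newly accepted values
            recipients = range(n)
            if p in crashes and crashes[p][0] == r:
                recipients = crashes[p][1]    # partial multicast, then crash
                alive[p] = False
            for q in recipients:
                inbox[q] |= new
            sent[p] |= new
        for q in range(n):
            if alive[q]:
                values[q] |= inbox[q]
    return {p: min(values[p]) for p in range(n) if alive[p]}


# p3 (proposing 1) crashes in round 1 after reaching only p0.
print(synchronous_consensus([5, 3, 8, 1], f=1, crashes={3: (1, [0])}))
# With only one round (f = 0) the same crash would leave the survivors disagreeing.
print(synchronous_consensus([5, 3, 8, 1], f=0, crashes={3: (1, [0])}))
```

The second call deliberately breaks the assumption that at most f processes crash, to show why f+1 rounds are needed.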

Consensus in a synchronous system …continued
Analysis: Termination: ensured by the synchronous system. Agreement and integrity, by contradiction: assume that at the end pi holds a value v that pj does not hold. Then some process pk1 that sent v to pi crashed before it sent v to pj; some pk2 that sent v to pk1 crashed before it sent v to pj; …; and some pk(f+1) that sent v to pkf crashed before it sent v to pj. That requires f+1 crashes, but there are only f crashed processes. So the sets of values that pi and pj hold are the same, and minimum(Values_i^(f+2)) is the same at every correct process.

The byzantine generals problem in a synchronous system
Arbitrary failures: up to f of the N processes exhibit arbitrary failures. N <= 3f: there is no solution that reaches agreement. N >= 3f+1: solutions exist; Lamport et al. [1982] gave one.

Impossibility with three processes
The model: three processes, one of them faulty. If there were a solution, then by the integrity condition p2 would choose 1:v in the left scenario. p2 cannot distinguish the two scenarios, so p2 would choose 1:w in the right scenario. By symmetry, p3 would choose 1:x in the right scenario. This contradicts the agreement condition. Reaching agreement: if the generals digitally sign their messages, byzantine agreement can be reached for 3 generals with one of them faulty.

Impossibility with N<=3f
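Divide the N processes into 3 groups n1, n2, n3, with N = n1 + n2 + n3 and n1, n2, n3 <= N/3. Let three processes p1, p2, p3 simulate the behaviours of n1, n2 and n3 respectively. Since N <= 3f, each group has at most f members, so all the faulty processes can fall within one group and only one of p1, p2, p3 is faulty. If there were a solution that reaches agreement among the N processes, then there would be a solution among p1, p2 and p3 with one of them faulty. This contradicts the impossibility result for 3 processes.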

Solution with one faulty process
Two-round algorithm for N = 4, f = 1: (1) the commander sends a value to each of the lieutenants; (2) each lieutenant sends the value it received to its peers; then di = majority(received values). Illustration: Left scenario: d2 = majority(v, u, v) = v; d4 = majority(v, v, w) = v. Right scenario: d2 = d3 = d4 = majority(u, v, w) = ⊥.
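The decision step can be reproduced with a small majority function that returns a special "no majority" value (⊥, represented here as `None`). The function name is hypothetical; the inputs are the values from the two scenarios above.

```python
from collections import Counter


def majority(values):
    """Return the value held by more than half of the entries, else None (⊥)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


# Left scenario: a faulty lieutenant cannot outvote the commander's value.
assert majority(["v", "u", "v"]) == "v"     # d2
assert majority(["v", "v", "w"]) == "v"     # d4
# Right scenario: a faulty commander sends three different values, so all
# correct lieutenants see the same three values and agree on ⊥.
assert majority(["u", "v", "w"]) is None
```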

Performance discussion
Questions: how many message rounds does it take? How many messages are sent, and of what size? Lamport’s algorithm: f+1 rounds, O(N^(f+1)) messages. Conclusion from Fischer and Lynch [1982]: any deterministic algorithm needs at least f+1 rounds of messages.

Impossibility in asynchronous systems
No algorithm can guarantee to reach consensus in an asynchronous system Can not tell a crashed process from a slow one Masking faults Recover from crash to mask crash failure Consensus using failure detectors Consensus using randomization Introduce an element of chance in the process’s behavior

Distributed mutual exclusion
Summary Distributed mutual exclusion Central server ring-based algorithm multicast-based algorithm using logical clocks Maekawa’s voting algorithm Elections Ring-based algorithm bully algorithm

Multicast communication
Summary Multicast communication Basic multicast Reliable multicast Over basic multicast, over IP multicast Ordered multicast FIFO delivery ordering, total delivery ordering, causal delivery ordering Consensus byzantine generals interactive consistency

A network partition

Server managing a mutual exclusion token for a set of processes

A ring of processes transferring a mutual exclusion token

Ricart and Agrawala’s algorithm
On initialization
    state := RELEASED;
To enter the section
    state := WANTED;
    T := request’s timestamp;
    Multicast request to all processes;
    Wait until (number of replies received = (N – 1));
    state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
    if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
    then queue request from pi without replying;
    else reply immediately to pi;
    end if
To exit the critical section
    state := RELEASED;
    reply to any queued requests;
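The rule applied on receipt of a request can be isolated as a single predicate. In this sketch, requests are modelled as (timestamp, process id) pairs so that Python's tuple comparison breaks timestamp ties by process id; the function name `should_defer` is hypothetical.

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"


def should_defer(state, own_request, incoming_request):
    """True if the receiver queues the incoming request instead of replying.

    own_request and incoming_request are (timestamp, process_id) pairs;
    own_request is only consulted when state is WANTED.
    """
    if state == HELD:
        return True
    return state == WANTED and own_request < incoming_request


# p2 requested at time 34, p1 at time 41: p2 defers p1's request, while p1
# replies to p2 at once, so p2 enters the critical section first.
assert should_defer(WANTED, (34, 2), (41, 1))
assert not should_defer(WANTED, (41, 1), (34, 2))
```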

Multicast synchronization
[Figure: multicast synchronization among p1, p2 and p3, showing requests timestamped 34 and 41 and the Reply messages]

Maekawa’s algorithm
On initialization
    state := RELEASED;
    voted := FALSE;
For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi;
    Wait until (number of replies received = K);
    state := HELD;
On receipt of a request from pi at pj
    if (state = HELD or voted = TRUE)
    then queue request from pi without replying;
    else send reply to pi; voted := TRUE;
    end if

Maekawa’s algorithm … continued
For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi;
On receipt of a release from pi at pj
    if (queue of requests is non-empty)
    then remove head of queue – from pk, say; send reply to pk; voted := TRUE;
    else voted := FALSE;
    end if

Deadlock example in Maekawa’s algorithm
P = {p1, p2, p3}; V1 = {p1, p2}, V2 = {p2, p3}, V3 = {p3, p1}. Deadlock situation: 1. p1, p2 and p3 request to enter the CS concurrently. 2. Each of p1, p2 and p3 votes for itself, setting its own voted to TRUE, and then waits for the other member of its voting set to reply, which never happens. [Figure: p1, p2 and p3, each with voted = TRUE, linked by the voting sets V1, V2 and V3]

A ring-based election in progress

The bully algorithm The election of coordinator p2,
after the failure of p4 and then p3

Open and Closed groups

Algorithm of reliable multicast over IP multicast
S_g^p: each process p maintains a sequence number for group g, zero initially. R_g^q: each process records the sequence number of the latest message it has delivered from process q that was sent to group g. R-multicast a message: p piggybacks S_g^p and the acknowledgements <q, R_g^q> on the message, then sets S_g^p = S_g^p + 1. R-deliver a message at process r: 1. Deliver a message m from p if and only if m.S = R_g^p + 1; then set R_g^p = R_g^p + 1. 2. If m.S <= R_g^p, r has delivered m before and discards it. 3. If m.S > R_g^p + 1, or m.R > R_g^q for some piggybacked acknowledgement <q, R>, then r has missed one or more messages; it retains m in the hold-back queue and requests the missing messages by sending negative acknowledgements.
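The three delivery cases can be sketched as a receiver with a per-sender hold-back queue. Assumptions made for this illustration: messages from each sender are numbered from 1 and the receiver's counter starts at 0, piggybacked acknowledgements are ignored, and the class name `HoldBackReceiver` is hypothetical.

```python
class HoldBackReceiver:
    """Delivers each sender's messages in sequence order, holding back gaps."""

    def __init__(self):
        self.latest = {}       # sender -> sequence number of latest delivered message
        self.hold_back = {}    # sender -> {sequence number: message}
        self.delivered = []    # messages in delivery order

    def receive(self, sender, seq, message):
        """Process one arrival; return the sequence numbers to request by negative ack."""
        r = self.latest.get(sender, 0)
        if seq <= r:
            return []                          # case 2: duplicate, discard
        queue = self.hold_back.setdefault(sender, {})
        queue[seq] = message                   # case 3: retain until the gap is filled
        while r + 1 in queue:                  # case 1: deliver every in-order message
            r += 1
            self.delivered.append(queue.pop(r))
        self.latest[sender] = r
        if not queue:
            return []
        return [s for s in range(r + 1, max(queue)) if s not in queue]


rx = HoldBackReceiver()
rx.receive("p", 1, "a")
missing = rx.receive("p", 3, "c")   # message 2 has been missed
rx.receive("p", 2, "b")             # the retransmission fills the gap
print(missing, rx.delivered)        # [2] ['a', 'b', 'c']
```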

The hold-back queue for arriving multicast messages

Total, FIFO and causal ordering of multicast messages
Notice the consistent ordering of totally ordered messages T1 and T2, the FIFO-related messages F1 and F2 and the causally related messages C1 and C3 – and the otherwise arbitrary delivery ordering of messages.

Display from bulletin board program
Bulletin board: os.interesting
Item  From         Subject
23    A.Hanlon     Mach
24    G.Joseph     Microkernels
25                 Re: Microkernels
26    T.L’Heureux  RPC performance
27    M.Walker     Re: Mach
end

Total ordering using a sequencer

The ISIS algorithm for total ordering
[Figure: the ISIS algorithm for total ordering among four processes. Message labels: 1 Message, 2 Proposed Seq, 3 Agreed Seq]

The ISIS total ordering algorithm
A_g^q: each process q maintains the largest agreed sequence number it has observed so far for group g. P_g^q: each process q maintains its own largest proposed sequence number. TO-multicast: p B-multicasts <m, i> to g, where i is a unique identifier for m. Proposing: each process q (1) sets P_g^q = max(A_g^q, P_g^q) + 1; (2) attaches P_g^q to m, which it places in its hold-back queue; (3) replies to p with P_g^q. Agreeing: p collects the proposed sequence numbers, selects the largest one, a, and B-multicasts <i, a> to g. Each process q in g sets A_g^q := max(A_g^q, a) and attaches a to m as its agreed sequence number. TO-deliver: a message is delivered when it has an agreed sequence number and is at the front of the hold-back queue.

Causal ordering using vector timestamps

Consensus for three processes

Consensus in a synchronous system

Three byzantine generals
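[Figure: three byzantine generals p1 (Commander), p2 and p3; faulty processes are shown shaded. Messages in the left scenario: 1:v, 2:1:v, 3:1:u. Messages in the right scenario: 1:x, 1:w, 2:1:w, 3:1:x]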

Four byzantine generals
[Figure: four byzantine generals p1 (Commander), p2, p3 and p4; faulty processes are shown shaded. Messages shown: 1:v, 2:1:v, 3:1:u, 4:1:v, 3:1:w, 1:w, 1:u, 2:1:u]

Chapter 12: Transactions and Concurrency Control
Introduction Transactions Nested transactions Locks Optimistic concurrency control Timestamp ordering Comparison of methods for concurrency control Summary

The goal of transactions
Introduction The goal of transactions All objects remain in a consistent state when they are accessed by multiple transactions and in the presence of server crashes Concurrency control Enhance reliability Recovery from failures Record in permanent storage The banking example

Simple synchronization (without transaction)
Multi-threaded banking server. Atomic operations: operations that are free from interference from concurrent operations being performed in other threads; only one thread can access an account at a time, e.g. public synchronized void deposit(int amount) {…}. Synchronization of server operations: mutual exclusion (synchronized …); producer/consumer coordination, e.g. wait and notify in Java.

Failure model for transactions [lamport1981]
Writes to permanent storage may fail, either by writing nothing or by writing a wrong value; file storage may also decay. Reads of bad data can be detected (by checksum). Servers may crash occasionally: memory is recovered to the last updated state, and recovery continues using information in permanent storage; there are no arbitrary failures. Messages may be delayed arbitrarily, and a message may be lost, duplicated or corrupted; the recipient can detect corrupted messages.

Transactions
What is a transaction? A sequence of separate operations that execute in an atomic manner: free from interference by operations that belong to other transactions, with all-or-nothing semantics. An example. All or nothing: failure atomicity; durability. Isolation.

ACID properties
Atomicity: a transaction must be all or nothing. Consistency: a transaction takes the system from one consistent state to another consistent state. Isolation: the state during a transaction is invisible to other transactions; executions are serially equivalent (serializable). Durability: the effects of a committed transaction are saved in permanent storage.

Transaction coordinator
Use a transaction Transaction coordinator Each transaction is created and managed by a coordinator Result of a transaction Success Aborted Initiated by client Initiated by server Example

The lost update problem
Concurrency control The lost update problem The final balance of b should be $242 rather than $220 Inconsistent retrievals The total should be $400 rather than $300
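The two balances quoted for the lost update problem can be reproduced with a few lines of arithmetic. This sketch assumes that account b starts at $200 and that transactions T and U each raise it by 10%; the starting balance is inferred from the $242 and $220 figures rather than stated in the text, and the function names are hypothetical.

```python
def serial_execution(balance):
    """T runs to completion, then U: each reads the balance the other left."""
    for _ in ("T", "U"):
        balance = balance * 11 // 10       # bal = getBalance(); setBalance(bal * 1.1)
    return balance


def lost_update(balance):
    """T and U both read before either writes, so one update overwrites the other."""
    read_by_t = balance
    read_by_u = balance
    balance = read_by_t * 11 // 10         # T writes 220
    balance = read_by_u * 11 // 10         # U overwrites it with its own 220
    return balance


print(serial_execution(200))   # 242
print(lost_update(200))        # 220
```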

What is serial equivalence?
An interleaving of the operations of transactions in which the combined effect is the same as if the transactions had been performed one at a time in some order Significance The criterion for correct concurrent execution Avoid lost update and inconsistent retrieval Example 1 Example 2

Conflicting operations
Conflict between a pair of operations The combined effect depends on the order in which they are executed Conflicting rules Serial equivalence of two transactions All pairs of conflicting operations of the two transactions be executed in the same order at all of the objects they both access A non-serially equivalent example
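The conflict rules and the serial-equivalence condition can be checked mechanically for a given interleaving. In this sketch a schedule is a list of (transaction, operation, object) triples in execution order; that representation and the function names are assumptions for illustration. Two operations on the same object conflict unless both are reads.

```python
def conflicts(op1, op2):
    """read/read does not conflict; read/write and write/write do."""
    return op1 == "write" or op2 == "write"


def serially_equivalent(schedule, t, u):
    """True if every conflicting pair of operations of t and u runs in the same order."""
    first_movers = set()
    for i, (tx1, op1, obj1) in enumerate(schedule):
        for tx2, op2, obj2 in schedule[i + 1:]:
            if {tx1, tx2} == {t, u} and obj1 == obj2 and conflicts(op1, op2):
                first_movers.add(tx1)      # transaction that went first in this pair
    return len(first_movers) <= 1


# T accesses i before U, but U accesses j before T: not serially equivalent.
bad = [("T", "read", "i"), ("T", "write", "i"),
       ("U", "read", "j"), ("U", "write", "j"),
       ("T", "write", "j"), ("U", "read", "i")]
good = [("T", "read", "i"), ("T", "write", "i"),
        ("U", "read", "i"), ("T", "write", "j"), ("U", "write", "j")]
print(serially_equivalent(bad, "T", "U"), serially_equivalent(good, "T", "U"))
```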

Recoverability from aborts
Dirty reads Strategy: any commits must be delayed until after the commitment of any other transaction whose uncommitted state has been observed Cascading aborts Strategy: any read operation must be delayed until other transactions that applied a write operation to the same object have committed or aborted Premature writes Some DBMS implement the action of abort by restoring “before images” of all the writes of a transaction Strategy: any write operations must be delayed until earlier transactions that updated the same objects have either committed or aborted

Recoverability from aborts …continued
Strict executions of transactions Delays both read and write operations on an object until all transactions that previously wrote that object have either committed or aborted Tentative versions Recoverable object: all of the update operations performed during a transaction are done in tentative versions of objects in volatile memory

The advantages of nested transactions
Additional concurrency Subtransactions at one level may run concurrently with other subtransactions at the same level E.g. concurrent getBalances in branchTotal operation More robust Subtransactions can commit or abort independently

The rules for commitment of nested transactions
Transaction commit(abort) after its child complete A transaction may commit or abort only after its child transactions have completed Child completes: commit provisionally or abort When a subtransaction completes, it makes an independent decision either to commit provisionally or to aborts. Its decision to abort is final Parent abort, children abort When a parent aborts, all of its subtransactions are aborted

The rules for commitment of nested transactions
Child abort, parent abort or not When a subtransaction aborts, the parent can decide whether to abort or not Top level transaction commit, all provisionally committed subtransactions commit If the top-level transaction commits, then all of the subtransactions that have provisionally committed can commit too, provided that none of their ancestors has aborted

Simple exclusive locks
Lock any object that is about to be used by any operation of a client’s transaction. Example: transactions T and U each run bal = b.getBalance() and b.setBalance(bal*1.1); T then calls a.withdraw(bal/10) and U calls c.withdraw(bal/10). [Table: operations and locks for T and U, from openTransaction (the first bal = b.getBalance() locks B, and the other transaction waits for that lock) to closeTransaction (unlock)]

Two-phase locking
To ensure serial equivalence of any two transactions, a transaction is not allowed to acquire any new locks after it has released a lock. Growing phase: acquire locks. Shrinking phase: release locks. Strict two-phase locking: any locks applied during the progress of a transaction are held until the transaction commits or aborts. In fact, exclusive locking between two read operations is unnecessary, since reads do not conflict.

Lock rules
Lock granularity: as small as possible, to enhance concurrency. Read lock / write lock: before accessing an object, a transaction first acquires the appropriate lock. Lock compatibility: if a transaction T has already performed a read operation on an object, then a concurrent transaction U must not write that object until T commits or aborts; if a transaction T has already performed a write operation on an object, then a concurrent transaction U must not read or write that object until T commits or aborts.

Prevent lost update and inconsistent retrieval Promotion of a lock
Lock rules … continued Prevent lost update and inconsistent retrieval Promotion of a lock From read lock to write lock Promotion can not be conducted if the read lock is shared by another transaction Two-phase locking implementation

Lock implementation
The Lock class holds: the identifier of the locked object; the identifiers of the transactions that currently hold the lock; a lock type. The lock manager: all requests to set locks and to release them are sent to an instance of Lockmanager.

Locking requirement for nested transactions
Each set of nested transaction Must be prevented from observing the partial effects of any other set of nested transactions Locks that are acquired by subtransactions are inherited by its parent when it completes Prevent other set of nested transactions getting the locks Each transaction within a set of nested transactions Must be prevented from observing the partial effects of the other transactions in the set Parent is not allowed to run concurrently with their children, parent’s lock could be allocated to its children temporarily Subtransactions at the same level are allowed to run concurrently So that, subtransactions in the set access same object in a sequential order

Locking rules for nested transactions
Acquire a read lock on an object For a subtransaction to acquire a read lock, no other active transaction can have a write lock on that object, and the only retainers of a write lock are its ancestors Acquire a write lock on an object For a subtransaction to acquire a write lock, no other active transaction can have a read or a write lock on that object, and the only retainers of read and write lock are its ancestors Commit When a subtransaction commits, its locks are inherited by its parent, allowing the parent to retain the locks in the same mode as the child Abort When a subtransaction aborts, its locks are discarded. If the parent already retains the locks, it can continue to do so.

Dead locks Example of dead lock with write locks
Definition of deadlock each member of a group of transactions is waiting for some other member to release a lock Wait-for graph Example of a cycle in wait-for graph Example of a deadlock happening
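Finding a deadlock amounts to finding a cycle in the wait-for graph. The sketch below uses a depth-first search; representing the graph as a dictionary that maps each transaction to the set of transactions it waits for is an assumption made for illustration, as is the function name.

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle in the wait-for graph, or None."""
    visited = set()
    path = []                                  # transactions on the current DFS path

    def visit(node):
        if node in path:                       # reached a transaction already on the path
            return path[path.index(node):]
        if node in visited:
            return None
        visited.add(node)
        path.append(node)
        for waited_on in wait_for.get(node, ()):
            cycle = visit(waited_on)
            if cycle:
                return cycle
        path.pop()
        return None

    for start in wait_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# T waits for U's lock and U waits for T's lock: a cycle, hence a deadlock.
print(find_deadlock({"T": {"U"}, "U": {"T"}}))
print(find_deadlock({"T": {"U"}, "U": set()}))   # None: U can proceed
```

A lock manager that detects such a cycle would then select one transaction in it to abort.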

Lock all of the objects used by a transaction when it starts
Deadlock prevention Lock all of the objects used by a transaction when it starts A single atomic step Restrict access to shared resources Impossible to predict at the start of a transaction which objects will be used Request locks on objects in a predefined order premature locking Reduction in concurrency

Deadlock detection
Lock manager: finds cycles in the wait-for graph periodically and selects one transaction to abort. Timeouts: each lock is invulnerable for a limited period T; after T, the lock becomes vulnerable. If no other transaction competes for the vulnerable lock, the original transaction retains it; otherwise, the lock is broken and the original transaction aborts.

Two-version locking Read/Write Locking rules
Write to tentative versions of objects Read from the committed version Locking rules Read lock: set on an object when attempt to read it write lock: set on an object when attempt to write it commit lock: convert the write lock to commit lock of the object when attempt to commit it

Two-version locking …continued
Significance Read is delayed only while the transaction is being committed More concurrency than read-write locks

Hierarchic locks Mixed granularity locks Locking rules Examples
Read lock, write lock, intentional read lock, intentional write lock Control the conflicts between parent access and children access Before a child node is granted a read/write lock, an intention to read/write lock is set on the parent node and its ancestors Examples Significance Reduce the number of locks Control the granularity of locks

Pessimistic/Optimistic measures
Pessimistic measures (locks): Overhead: incurred even for read-only operations. Deadlock: the available resolutions are unsatisfactory. Reduced concurrency: the strict two-phase locking scheme is restrictive. Optimistic measures: Observation: the likelihood of two transactions accessing the same object is low. Scheme: no check when accessing objects; check for conflicts when committing; if there is a conflict, abort transactions.

Optimistic concurrency control
Working phase A tentative version of objects per transaction Initially, it is a copy of the most recently committed version Read are performed on the tentative version Written values are recorded as tentative version Reading set / write set per transaction Validation phase Check the conflicts between overlapped transactions when closeTransaction is issued Success: commit Fail: abort Update phase Updates in tentative versions are made permanent

Validation of transaction
Transaction number: each transaction is assigned a transaction number when it enters the validation phase, an integer assigned in ascending sequence. Transactions enter the validation phase, and commit, in the order of their transaction numbers. Since the validation and update phases are short, only one transaction is allowed in them at a time. Conflict rules: applied to check that Tv is serializable with respect to each overlapping transaction Ti.

Test the previous overlapped transactions
Backward validation Test the previous overlapped transactions startTn The biggest transaction number assigned to some other committed transaction at the time when transaction Tv started its working phase finishTn The biggest transaction number assigned to some other committed transaction at the time when Tv entered the validation phase Serial equivalence of all committed transactions Since backward validation can ensure the result that Tv commits after all previously committed transactions, so all transactions are committed in a serial equivalent order Example

Backward validation … continued
Backward validation algorithm
boolean valid = true;
for (int Ti = startTn + 1; Ti <= finishTn; Ti++) {
    if (read set of Tv intersects write set of Ti) valid = false;
}
How to resolve a conflict? Abort Tv, the transaction being validated.
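The backward validation loop can be made concrete with a small Java sketch. It assumes that objects are identified by strings and that the write sets of committed transactions are kept in a map keyed by transaction number; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.Set;

class BackwardValidation {
    // Returns true if Tv may commit. committedWriteSets maps a transaction
    // number to the write set of that committed transaction (assumed layout).
    static boolean validate(Set<String> readSetOfTv,
                            Map<Integer, Set<String>> committedWriteSets,
                            int startTn, int finishTn) {
        for (int ti = startTn + 1; ti <= finishTn; ti++) {
            Set<String> writeSet = committedWriteSets.get(ti);
            if (writeSet == null) continue;   // number belongs to an aborted transaction
            for (String object : writeSet) {
                if (readSetOfTv.contains(object)) return false;  // Tv read something Ti wrote
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> committed = Map.of(2, Set.of("A"), 3, Set.of("B"));
        System.out.println(validate(Set.of("C"), committed, 1, 3));
        System.out.println(validate(Set.of("B", "C"), committed, 1, 3));
    }
}
```

Only the read set of Tv is consulted, which is why a transaction with an empty read set always passes backward validation.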

Forward validation Test against the later overlapping transactions
Active transactions: transactions that are still in their working phase when Tv enters the validation phase. Serial equivalence of all committed transactions: forward validation ensures that Tv is serialized before every overlapping transaction that commits after it. Algorithm
boolean valid = true;
for (int Tid = active1; Tid <= activeN; Tid++) {
    if (write set of Tv intersects read set of Tid) valid = false;
}
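A minimal Java sketch of forward validation follows. It assumes string object identifiers and that the current read sets of the active transactions are available as a collection; the names are hypothetical.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ForwardValidation {
    // Returns true if the write set of Tv does not intersect the read set
    // of any transaction that is still active.
    static boolean validate(Set<String> writeSetOfTv, Collection<Set<String>> activeReadSets) {
        for (Set<String> readSet : activeReadSets) {
            if (!Collections.disjoint(writeSetOfTv, readSet)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<String>> active = List.of(Set.of("B"), Set.of("A", "C"));
        System.out.println(validate(Set.of("D"), active));
        System.out.println(validate(Set.of("A"), active));
    }
}
```

A read-only transaction has an empty write set, so it always passes this check.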

Forward validation …continued
How to resolve a conflict? Option 1: defer the validation until a later time when the conflicting transactions have finished; some of them may have aborted by then. Option 2: abort all the conflicting active transactions and commit the transaction being validated. Option 3: abort the transaction being validated; however, the conflicting active transactions may themselves abort later, in which case this abort was unnecessary.

Comparison of forward and backward validation
Backward validation Overhead of comparison: read sets are generally bigger than write sets, so comparing the read set of Tv against old write sets is heavier than the comparison in forward validation. Overhead of storage: old write sets must be stored until they are no longer needed. Forward validation Overhead of time: validating a transaction may have to wait until the conflicting active transactions have finished.

The basic idea A unique timestamp for each transaction
Each transaction is assigned a unique timestamp value when it starts. Transactions wait only for earlier ones, so no deadlock can occur. The execution order of any two conflicting operations must conform to the timestamps of the transactions they belong to. A transaction's request to write an object is valid only if that object was last read and written by earlier transactions. A transaction's request to read an object is valid only if that object was last written by an earlier transaction.

Timestamp ordering mechanism
Each object has: a write timestamp, the maximum timestamp of the committed transactions that wrote it; a set of tentative versions, which record uncommitted writes, each associated with a write timestamp; a read timestamp, the maximum timestamp of the transactions that have read the object. Write operation: write operations are recorded in tentative versions. Read operation: directed to the version with the maximum write timestamp that does not exceed the transaction timestamp.

Timestamp ordering mechanism …continued
Commit Write the tentative version to the permanent storage Conflict rules

Timestamp ordering write rule
Algorithm: decide whether to accept a write operation requested by transaction Tc on object D
if (Tc ≥ maximum read timestamp on D && Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with write timestamp Tc
else /* write is too late */
    abort transaction Tc

Timestamp ordering read rule
Algorithm: decide whether to accept immediately, to wait or to reject a read operation requested by transaction Tc on object D
if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum write timestamp ≤ Tc
    if (Dselected is committed)
        perform read operation on the version Dselected
    else
        wait until the transaction that made version Dselected commits or aborts, then reapply the read rule
} else
    abort transaction Tc
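The write rule and the read rule can be tried out with the following Java sketch. It is an illustration under stated assumptions, not a full implementation: timestamps are plain long values, commit and abort handling is omitted, a WAIT result is returned instead of blocking, and a transaction is allowed to read its own tentative version immediately. All names are hypothetical.

```java
import java.util.TreeMap;

class TimestampOrdering {
    enum Outcome { OK, WAIT, ABORT }

    static class ObjectD {
        long committedWriteTs;   // write timestamp of the committed version
        long maxReadTs;          // maximum read timestamp of the object
        // tentative versions: write timestamp -> tentative value
        final TreeMap<Long, Integer> tentative = new TreeMap<>();

        ObjectD(long committedWriteTs) { this.committedWriteTs = committedWriteTs; }

        // Write rule: accept only if no later transaction has read or committed a write.
        Outcome write(long tc, int value) {
            if (tc >= maxReadTs && tc > committedWriteTs) {
                tentative.put(tc, value);
                return Outcome.OK;
            }
            return Outcome.ABORT;   // write is too late
        }

        // Read rule: select the version with the maximum write timestamp <= tc.
        Outcome read(long tc) {
            if (tc <= committedWriteTs) return Outcome.ABORT;   // read is too late
            Long selected = tentative.floorKey(tc);
            if (selected != null && selected.longValue() != tc) {
                return Outcome.WAIT;   // Dselected is another transaction's tentative version
            }
            maxReadTs = Math.max(maxReadTs, tc);
            return Outcome.OK;         // committed version, or Tc's own tentative version
        }
    }

    public static void main(String[] args) {
        ObjectD b = new ObjectD(0);          // committed by a transaction with timestamp 0
        System.out.println(b.read(1));       // T (timestamp 1) reads B
        System.out.println(b.write(1, 220)); // T writes a tentative version
        System.out.println(b.read(2));       // U (timestamp 2) must wait for T
    }
}
```

The calls in main mirror the situation in which U has to wait for T before reading the balance of B.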

Timestamps in transactions T and U
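When transaction U is ready to get the balance of B, it waits for T to complete so that it can read the value set by T if T commits. [Figure: the interleaved operations of T (openTransaction; bal = b.getBalance(); b.setBalance(bal*1.1); a.withdraw(bal/10); commit) and U (openTransaction; bal = b.getBalance(), which waits for T; b.setBalance(bal*1.1); c.withdraw(bal/10)), with the read timestamps (RTS) and write timestamps (WTS) of objects A, B and C after each step; S is the transaction that committed the initial versions.]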

Multiversion timestamp ordering
Idea: each object maintains a list of old committed versions as well as tentative versions. When a read arrives late, it is allowed to read from an old committed version instead of being rejected. Conflict rules: Tc must not write objects that have been read by any Ti where Ti > Tc. Rules 2 and 3 are satisfied automatically by keeping multiple committed versions of each object.

Multiversion timestamp ordering write rule
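Algorithm: the server directs each read operation to the most recent version of the object whose write timestamp does not exceed the transaction timestamp, so only writes need to be checked. Let DmaxEarlier be the version of D with the maximum write timestamp ≤ Tc.
if (read timestamp of DmaxEarlier <= Tc)
    perform write operation on a tentative version of D with write timestamp Tc
else /* write is too late */
    abort transaction Tc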

Timestamp ordering vs. two phase locking
Timestamp ordering: decides the serialization order statically, when each transaction starts; better than locking for read-dominated transactions. Two-phase locking: decides the serialization order dynamically, according to the order in which objects are accessed; better than timestamp ordering for update-dominated transactions. Both are pessimistic methods.

Pessimistic methods vs. optimistic methods
Optimistic methods: efficient when there are few conflicts; no waiting during the working phase; a substantial amount of work may have to be repeated when a transaction is aborted. Pessimistic methods: less concurrency, but simpler than optimistic methods.

Summary
ACID properties of transactions: Atomicity, Consistency, Isolation, Durability. Nested transactions: concurrent execution of subtransactions in separate servers; independent recovery of parts of a transaction. Concurrency control Criterion: serial equivalence. Two problems to avoid: lost updates and inconsistent retrievals. Recoverability from aborts: avoid dirty reads and premature writes.

Summary … continued Concurrency control methods
Strict two-phase locking: when conflicts happen, some operations are delayed; deadlocks must be prevented or detected. Timestamp ordering: orders transactions' accesses to objects by timestamp; some transactions may have to abort; multiversion timestamp ordering is effective. Optimistic methods: allow transactions to proceed without any form of conflict checking; conflicts are detected at validation, by backward or forward validation; effective, at the expense of storage overhead and complexity.

Operations of the Account Interface
deposit(amount): deposit amount in the account
withdraw(amount): withdraw amount from the account
getBalance() -> amount: return the balance of the account
setBalance(amount): set the balance of the account to amount
Operations of the Branch interface
create(name) -> account: create a new account with a given name
lookUp(name) -> account: return a reference to the account with the given name
branchTotal() -> amount: return the total of all the balances at the branch

A client’s banking transaction
Transaction T: a.withdraw(100); b.deposit(100); c.withdraw(200); b.deposit(200);

Operations in Coordinator interface
openTransaction() -> trans; starts a new transaction and delivers a unique TID trans. This identifier will be used in the other operations in the transaction. closeTransaction(trans) -> (commit, abort); ends a transaction: a commit return value indicates that the transaction has committed; an abort return value indicates that it has aborted. abortTransaction(trans); aborts the transaction.

Transaction life histories
Successful Aborted by client Aborted by server openTransaction operation server aborts transaction operation ERROR reported to client closeTransaction abortTransaction

The lost update problem
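Transaction T: balance = b.getBalance(); b.setBalance(balance*1.1); a.withdraw(balance/10)
Transaction U: balance = b.getBalance(); b.setBalance(balance*1.1); c.withdraw(balance/10)
Interleaving:
T: balance = b.getBalance()       $200
U: balance = b.getBalance()       $200
U: b.setBalance(balance*1.1)      $220
T: b.setBalance(balance*1.1)      $220
T: a.withdraw(balance/10)         $80
U: c.withdraw(balance/10)         $280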
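The effect of the lost update on account B can be reproduced deterministically. The Java sketch below hard-codes the two orderings rather than using threads, and uses whole dollars so that the 10% increase is exact; both are simplifying assumptions, and the names are hypothetical.

```java
class LostUpdateDemo {
    // A 10% increase in whole dollars, as applied by both T and U.
    static long raise(long balance) { return balance * 11 / 10; }

    // Both transactions read B before either writes it back.
    static long interleavedWithoutControl(long b) {
        long balanceT = b;        // T: balance = b.getBalance()
        long balanceU = b;        // U: balance = b.getBalance()
        b = raise(balanceU);      // U: b.setBalance(balance*1.1)
        b = raise(balanceT);      // T: b.setBalance(balance*1.1) overwrites U's update
        return b;
    }

    // T runs to completion before U starts.
    static long serial(long b) {
        b = raise(b);             // T
        b = raise(b);             // U
        return b;
    }

    public static void main(String[] args) {
        System.out.println(interleavedWithoutControl(200)); // 220: U's update is lost
        System.out.println(serial(200));                    // 242: both updates applied
    }
}
```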

The inconsistent retrieval problem
Transaction V: a.withdraw(100); b.deposit(100)
Transaction W: aBranch.branchTotal()
Interleaving:
V: a.withdraw(100)                    $100
W: total = a.getBalance()             $100
W: total = total+b.getBalance()       $300
W: total = total+c.getBalance()
V: b.deposit(100)                     $300

A serially equivalent interleaving of T and U
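Transaction T: balance = b.getBalance(); b.setBalance(balance*1.1); a.withdraw(balance/10)
Transaction U: balance = b.getBalance(); b.setBalance(balance*1.1); c.withdraw(balance/10)
Interleaving:
T: balance = b.getBalance()       $200
T: b.setBalance(balance*1.1)      $220
U: balance = b.getBalance()       $220
U: b.setBalance(balance*1.1)      $242
T: a.withdraw(balance/10)         $80
U: c.withdraw(balance/10)         $278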

A serially equivalent interleaving of V and W
Transaction V: a.withdraw(100); b.deposit(100)
Transaction W: aBranch.branchTotal()
Interleaving:
V: a.withdraw(100)                    $100
V: b.deposit(100)                     $300
W: total = a.getBalance()             $100
W: total = total+b.getBalance()       $400
W: total = total+c.getBalance()
...

Read and write operation conflict rules
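Operations of different transactions   Conflict   Reason
read    read     No     Because the effect of a pair of read operations does not depend on the order in which they are executed
read    write    Yes    Because the effect of a read and a write operation depends on the order of their execution
write   write    Yes    Because the effect of a pair of write operations depends on the order of their execution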

A non-serially equivalent interleaving of operations of transactions T and U
T: x = read(i)
T: write(i, 10)
U: y = read(j)
U: write(j, 30)
T: write(j, 20)
U: z = read(i)

A dirty read when transaction T aborts
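Transaction T: a.getBalance(); a.setBalance(balance + 10)
Transaction U: a.getBalance(); a.setBalance(balance + 20)
Interleaving:
T: balance = a.getBalance()       $100
T: a.setBalance(balance + 10)     $110
U: balance = a.getBalance()       $110
U: a.setBalance(balance + 20)     $130
U: commit transaction
T: abort transaction
If T aborts and U commits, the final balance is $130, which is wrong because U read a value that was never committed.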

Overwriting uncommitted values
Transaction T: a.setBalance(105)      $100 -> $105
Transaction U: a.setBalance(110)      $105 -> $110
The before image of T is a=$100; the before image of U is a=$105. If U commits but T then aborts, a is restored to T's before image of $100, although it should be $110.

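Nested transactions
[Figure: the top-level transaction T opens subtransactions T1 and T2 with openSubTransaction; T1 opens T11 and T12, and T2 opens T21, which opens T211. Each subtransaction either provisionally commits or aborts before T commits.]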

Lock compatibility
For one object                Lock requested
                              read     write
Lock already set    none      OK       OK
                    read      OK       wait
                    write     wait     wait

Use of locks in strict two-phase locking
1. When an operation accesses an object within a transaction: (a) If the object is not already locked, it is locked and the operation proceeds. (b) If the object has a conflicting lock set by another transaction, the transaction must wait until it is unlocked. (c) If the object has a non-conflicting lock set by another transaction, the lock is shared and the operation proceeds. (d) If the object has already been locked in the same transaction, the lock will be promoted if necessary and the operation proceeds. (Where promotion is prevented by a conflicting lock, rule (b) is used.) 2. When a transaction is committed or aborted, the server unlocks all objects it locked for the transaction.
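Rules (a) to (d) can be exercised without threads by a lock table whose acquire operation reports whether the caller would have to wait. This Java sketch makes that simplification deliberately: transactions and objects are identified by strings, and tryAcquire returns false where a real server would block. The class is hypothetical and is not the Lock class of these slides.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class StrictTwoPhaseLockTable {
    enum LockType { READ, WRITE }

    static class LockEntry {
        final Set<String> holders = new HashSet<>();
        LockType type = LockType.READ;
    }

    private final Map<String, LockEntry> locks = new HashMap<>();

    // Returns true if the lock is granted, false if the caller must wait (rule b).
    boolean tryAcquire(String trans, String object, LockType requested) {
        LockEntry e = locks.computeIfAbsent(object, k -> new LockEntry());
        if (e.holders.isEmpty()) {                       // rule (a): not locked
            e.holders.add(trans);
            e.type = requested;
            return true;
        }
        boolean soleHolder = e.holders.size() == 1 && e.holders.contains(trans);
        if (soleHolder) {                                // rule (d): promote if necessary
            if (requested == LockType.WRITE) e.type = LockType.WRITE;
            return true;
        }
        if (e.type == LockType.READ && requested == LockType.READ) {
            e.holders.add(trans);                        // rule (c): share the read lock
            return true;
        }
        return false;                                    // rule (b), including blocked promotion
    }

    // Rule 2: all locks are released together at commit or abort.
    void releaseAll(String trans) {
        for (LockEntry e : locks.values()) e.holders.remove(trans);
    }

    public static void main(String[] args) {
        StrictTwoPhaseLockTable table = new StrictTwoPhaseLockTable();
        System.out.println(table.tryAcquire("T", "A", LockType.READ));
        System.out.println(table.tryAcquire("U", "A", LockType.READ));
        System.out.println(table.tryAcquire("T", "A", LockType.WRITE));
    }
}
```

Releasing every lock in one step at the end of the transaction is what makes the scheme strict.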

Lock class
public class Lock {
    private Object object;      // the object being protected by the lock
    private Vector holders;     // the TIDs of current holders
    private LockType lockType;  // the current type

    public synchronized void acquire(TransID trans, LockType aLockType) {
        while (/* another transaction holds the lock in conflicting mode */) {
            try {
                wait();
            } catch (InterruptedException e) { /*...*/ }
        }
        if (holders.isEmpty()) { // no TIDs hold lock
            holders.addElement(trans);
            lockType = aLockType;
        } else if (/* another transaction holds the lock, share it */) {
            if (/* this transaction not a holder */) holders.addElement(trans);
        } else if (/* this transaction is a holder but needs a more exclusive lock */) {
            lockType.promote();
        }
    }

Lock class … continued
    public synchronized void release(TransID trans) {
        holders.removeElement(trans); // remove this holder
        // set locktype to none
        notifyAll();
    }
}

Lock manager class
public class LockManager {
    private Hashtable theLocks;

    public void setLock(Object object, TransID trans, LockType lockType) {
        Lock foundLock;
        synchronized (this) {
            // find the lock associated with object
            // if there isn't one, create it and add to the hashtable
        }
        foundLock.acquire(trans, lockType);
    }

    // synchronize this one because we want to remove all entries
    public synchronized void unLock(TransID trans) {
        Enumeration e = theLocks.elements();
        while (e.hasMoreElements()) {
            Lock aLock = (Lock) (e.nextElement());
            if (/* trans is a holder of this lock */) aLock.release(trans);
        }
    }
}

Deadlock with write locks
Transaction   Operation          Locks
T             a.deposit(100)     write lock A
U             b.deposit(200)     write lock B
T             b.withdraw(100)    waits for U's lock on B
U             a.withdraw(200)    waits for T's lock on A

The wait-for graph [Figure: A is held by T and B is held by U; T waits for B and U waits for A, so T waits for U and U waits for T.]

A cycle in a wait-for graph
[Figure: a cycle of waits-for edges among transactions T, U and V.]

Another wait-for graph
[Figure: wait-for graph involving transactions T, U, V and W and objects B and C.] T, U and V share a read lock on object C. W holds a write lock on object B, which V is waiting for. Deadlock happens when T and W then request write locks on object C.

Resolution of the deadlock
Transaction   Operation          Locks
T             a.deposit(100)     write lock A
U             b.deposit(200)     write lock B
T             b.withdraw(100)    waits for U's lock on B
U             a.withdraw(200)    waits for T's lock on A
                                 (timeout elapses) T's lock on A becomes vulnerable; unlock A, abort T
U             a.withdraw(200)    write locks A; unlock A, B

Lock compatibility (read, write and commit locks)
[Table: for one object, the compatibility (OK or wait) of a lock to be set (read, write or commit) with the lock already set (none, read, write or commit).]

Lock hierarchy for the banking example
[Figure: the Branch is the root of the hierarchy, with the Accounts A, B and C beneath it.]

Lock hierarchy for a diary
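[Figure: the Week is the root of the hierarchy; beneath it are the days Monday, Tuesday, Wednesday, Thursday and Friday; beneath each day are the time slots 9:00–10:00, 10:00–11:00, 11:00–12:00, 12:00–13:00, 13:00–14:00, 14:00–15:00 and 15:00–16:00.]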

Lock compatibility table for hierarchic locks
[Table: for one object, the compatibility (OK or wait) of a lock to be set (read, write, I-read or I-write) with the lock already set (none, read, write, I-read or I-write).]

Serializability of transaction Tv with respect to transaction Ti
Rule   Tv      Ti      Condition
1.     write   read    Ti must not read objects written by Tv
2.     read    write   Tv must not read objects written by Ti
3.     write   write   Ti must not write objects written by Tv and Tv must not write objects written by Ti

Validation of transactions
[Figure: the working, validation and update phases over time of the earlier committed transactions T1, T2 and T3, the transaction Tv being validated, and the later active transactions active1 and active2.]

Operation conflicts for timestamp ordering
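Rule   Tc      Ti
1.     write   read    Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc ≥ the maximum read timestamp of the object.
2.     write   write   Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.
3.     read    write   Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > write timestamp of the committed object.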

Write operations and timestamps
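[Figure: four cases (a) to (d) of a write by T3, showing the committed and tentative versions of the object before and after the write, with T1 < T2 < T3 < T4; in case (d) the transaction aborts. Key: a version labelled Ti is the object produced by transaction Ti, with write timestamp Ti.]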

Read operations and timestamps
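[Figure: four cases (a) to (d) of a read by T3, with T1 < T2 < T3 < T4, showing the selected version in each case: the read proceeds, the read waits, or the transaction aborts. Key: a version labelled Ti is the object produced by transaction Ti, with write timestamp Ti; versions are either tentative or committed.]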

Late write operation would invalidate a read
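[Figure: a timeline on which T5 reads before T4 writes; the late write by T4 would invalidate the read by T5. Key: each version is labelled with the transaction Ti that produced it (write timestamp Ti) and with its read timestamp Tk; versions are either tentative or committed.]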

Chapter 13: Distributed Transactions
Introduction Flat and nested distributed transactions Atomic commit protocols Concurrency control in distributed transactions Distributed deadlocks Transaction recovery Summary

Distributed transaction
Introduction Distributed transaction A flat or nested transaction that accesses objects managed by multiple servers Atomicity of transaction All or nothing for all involved servers Two phase commit Concurrency control Serialize locally + serialize globally Distributed deadlock

Flat and nested distributed transactions
Flat transaction Nested transaction Nested banking transaction The four subtransactions run in parallel

The architecture of distributed transactions
The coordinator Accept client request Coordinate behaviors on different servers Send result to client Record a list of references to the participants The participant One participant per server Keep track of all recoverable objects at each server Cooperate with the coordinator Record a reference to the coordinator Example

One-phase atomic commit protocol
The protocol: the client requests the end of a transaction. The coordinator communicates the commit or abort request to all of the participants and keeps repeating the request until all of them have acknowledged that they have carried it out. The problem: some servers may commit while others abort. How should the situation be handled when some servers decide to abort?

Introduction to two-phase commit protocol
Allows any participant to abort. First phase: each participant votes to commit or abort. Second phase: all participants reach the same decision. If any one participant votes to abort, then all abort. If all participants vote to commit, then all commit. The challenge: to work correctly when failures occur. Failure model: servers may crash and messages may be lost.

The two-phase commit protocol
When the client requests an abort: the coordinator informs all participants to abort. When the client requests a commit: First phase: the coordinator asks all participants whether they are prepared to commit. If a participant is prepared to commit, it saves in permanent storage all of the objects that it has altered in the transaction and replies Yes. Otherwise, it replies No. Second phase: the coordinator tells all participants to commit (or abort).

The two-phase commit protocol … continued
Operations for two-phase commit protocol The two-phase commit protocol: updates that are prepared to commit are recorded in permanent storage. When a server crashes, the information can be retrieved by a new process. If the coordinator decides to commit, all participants will commit eventually.

Timeout actions in the two-phase commit protocol
Communication in two-phase commit protocol New processes to mask crash failures: a crashed coordinator or participant process is replaced by a new process. Timeouts for the participant: timeout while waiting for canCommit?: abort. Timeout while waiting for doCommit: the participant is in the uncertain state; it keeps its updates in permanent storage and sends a getDecision request to the coordinator. Timeouts for the coordinator: timeout while waiting for votes: abort. Timeout while waiting for haveCommitted: do nothing; the protocol works correctly without the confirmation.

Nested transaction semantics
Two-phase commit protocol for nested transactions Nested transaction semantics Subtransaction: either commits provisionally or aborts. Parent transaction: if it aborts, all of its subtransactions abort; if it commits, the subtransactions that aborted are excluded. Distributed nested transaction: when a subtransaction completes, its provisionally committed updates are not saved in permanent storage.

Distributed nested transactions commit protocol
Each subtransaction: if it commits provisionally, it reports its status and that of its descendants to its parent; if it aborts, it reports the abort to its parent. Top-level transaction: receives a list of the status of all subtransactions, then starts the two-phase commit protocol on all subtransactions that have committed provisionally.

The information held by each coordinator
Example of a distributed nested transaction The execution process The information held by each coordinator Top-level coordinator The participant list: the coordinators of all the subtransactions in the tree that have provisionally committed and do not have an aborted ancestor. Two-phase commit protocol: conducted on the participants of T, T1 and T12.

Different two-phase commit protocol
Hierarchic two-phase commit protocol Messages are transferred according to the hierarchic relationship between successful participants The interface Flat two-phase commit protocol Messages are transferred from top-level coordinator to all successful participants directly

Serial equivalence on all servers
Objective: serial equivalence on all involved servers. If transaction T is before transaction U in their conflicting access to objects at one of the servers, then they must be in that order at all of the servers whose objects are accessed in a conflicting manner by both T and U. Approach: each server applies concurrency control to its own objects; all servers cooperate to reach the objective.

Locking
Each participant locks objects locally, using the strict two-phase locking scheme. Atomic commit protocol: a server cannot release any locks until it knows that the transaction has been committed or aborted at all the servers involved. Distributed deadlock: may occur, because servers set their locks independently.

Timestamp ordering concurrency control
Globally unique transaction timestamp: issued to the client by the first coordinator accessed by a transaction. The transaction timestamp is passed to the coordinator at each server. Each server accesses shared objects according to the timestamp. Resolution of a conflict: abort the transaction at all servers.

Optimistic concurrency control
The validation takes place during the first phase of two phase commit protocol Parallel validation Suitable for distributed transaction Rule 3 must be checked as well as rule 2 for backward validation Possibly different validation order on different server measure 1: global validation check that the combination of the orderings at the individual servers is serializable measure 2: each server validates according to a globally unique transaction number of each transaction

Distributed deadlocks
Distributed deadlock: a cycle in the global wait-for graph. Simple resolution: a centralized deadlock detector collects the latest copy of each server's local wait-for graph, constructs the global wait-for graph, and finds cycles in it. Drawbacks: poor availability, lack of fault tolerance, poor scalability; the cost of collecting the information is high.
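The centralized detector described above can be sketched in Java as follows. The sketch assumes that each local wait-for graph arrives as a map from a waiting transaction to the set of transactions it waits for, and it uses a depth-first search to find cycles; the class name and the example edges are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class GlobalDeadlockDetector {
    // Global wait-for graph: transaction -> transactions it waits for.
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    // Merge one server's local wait-for graph into the global graph.
    void addLocalGraph(Map<String, Set<String>> local) {
        local.forEach((t, targets) ->
            waitsFor.computeIfAbsent(t, k -> new HashSet<>()).addAll(targets));
    }

    boolean hasCycle() {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String t : waitsFor.keySet()) {
            if (visit(t, visited, onPath)) return true;
        }
        return false;
    }

    // Depth-first search; reaching a node already on the current path closes a cycle.
    private boolean visit(String t, Set<String> visited, Set<String> onPath) {
        if (onPath.contains(t)) return true;
        if (!visited.add(t)) return false;   // explored earlier without finding a cycle
        onPath.add(t);
        for (String next : waitsFor.getOrDefault(t, Set.of())) {
            if (visit(next, visited, onPath)) return true;
        }
        onPath.remove(t);
        return false;
    }

    public static void main(String[] args) {
        GlobalDeadlockDetector detector = new GlobalDeadlockDetector();
        detector.addLocalGraph(Map.of("W", Set.of("U")));   // reported by server X
        detector.addLocalGraph(Map.of("U", Set.of("V")));   // reported by server Y
        System.out.println(detector.hasCycle());
        detector.addLocalGraph(Map.of("V", Set.of("W")));   // reported by server Z
        System.out.println(detector.hasCycle());
    }
}
```

No single local graph contains a cycle here; the deadlock only becomes visible once all three graphs are merged.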

Phantom deadlocks
A phantom deadlock is a deadlock that is detected but is not really a deadlock. It may occur when some of the transactions involved abort or release locks while the detection is in progress. An example: at server Y, U requests a lock held by V; at server X, U releases the lock that T is waiting for. If the message from server Y arrives at the global deadlock detector earlier than the message from server X, a phantom deadlock is detected.

Edge chasing
Idea: detect deadlock in a distributed manner. Each server involved in the deadlock forwards its partial knowledge of the wait-for edges, in messages called probes, to other servers so that the wait-for graph can be pieced together. Question: when should a probe be sent?

Edge chasing algorithm
Initiation: when a server finds that a transaction T starts waiting for another transaction U, where U is waiting to access an object at another server, it initiates detection by sending a probe containing the edge <T → U> to the server of the object at which transaction U is blocked. Detection: a server that receives a probe merges it with its local wait-for knowledge and checks for a cycle to detect whether deadlock has occurred. It then decides whether to forward the probe: if the transaction at the end of the path is itself waiting for another transaction V whose object is elsewhere, the extended probe is forwarded. Resolution: when a cycle is detected, a transaction in the cycle is aborted.

Edge chasing algorithm - example
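[Figure: transactions U, V and W and objects A, B and C at servers X, Y and Z; a probe is initiated at one server and forwarded along the waits-for edges until the deadlock is detected.]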

Transaction priorities
The problem of the edge-chasing algorithm: concurrent initiation may cause more than one transaction to be aborted. Example: the same cycle is detected at two servers. Approach: all transactions are totally ordered; probe initiation, probe forwarding and transaction abort are all conducted according to that order.

Each transaction is given a priority
Abort: when a deadlock is detected, the transaction with the lowest priority is selected to abort. Initiation: detection is initiated only when a higher-priority transaction starts to wait for a lower-priority one. Forwarding downhill: probes are forwarded only from transactions with higher priorities to transactions with lower priorities. Example: set the priorities T > U > V > W; detection is initiated only when T begins to wait for U.

A pitfall of the transaction priority scheme
The pitfall: because initiation depends on priority, some deadlocks will not be detected. Example: when T begins to wait for U but W is not yet waiting for V, the deadlock will not be detected. Resolution Probe queue: each coordinator saves copies of all the probes received on behalf of each transaction in a probe queue. Forwarding the probe queue: when a transaction starts waiting for an object, it forwards the probes in its queue to the server of the object, which propagates the probes on downhill routes.

A pitfall of the transaction priority scheme… continued
Example Priorities: U > V > W. The deadlock will be detected when W begins to wait for U.

What is transaction recovery?
Requirements of transactions Durability All objects are saved in permanent storage Failure atomicity Effects of transactions are atomic even when the server crashes Recovery Restoring the server with the latest committed versions of its objects from permanent storage

Recovery file
Assumption: a server keeps all of its objects in volatile memory and records its committed objects in a recovery file. The task of a recovery manager: save objects in permanent storage (in a recovery file) for committed transactions; restore the server's objects after a crash; reorganize the recovery file to improve the performance of recovery; reclaim storage space (in the recovery file).

Intentions list of a transaction
Keep track of the objects accessed by transactions An intention list per active transaction Contains a list of the references and the values of all the objects that are altered by the transaction When the transaction is committed Replace the committed version of each object by the tentative version object Write the new value to the server’s recovery file When the transaction is aborted Delete the tentative version object

Entries in a recovery file
Type of entry         Description of contents of entry
Object                A value of an object.
Transaction status    Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list       Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.

Logging
Recovery file: a log containing the history of all the transactions performed by a server. The recovery manager is called when a transaction: Prepares to commit: append all the objects in its intentions list to the recovery file, followed by the current status of the transaction (prepared) together with its intentions list. Commits or aborts: append the corresponding status of the transaction to the recovery file. Each transaction status entry contains a pointer to the previous transaction status entry.

Recovery of objects When a server is replaced after a crash
Set default initial values for all objects, then hand over to the recovery manager. Recovery manager's task: include all the effects of all the committed transactions, performed in the correct order, and none of the effects of incomplete or aborted transactions. Two approaches: (1) find the most recent checkpoint, and then replay all committed transactions after the checkpoint with the help of the intentions lists and the committed values of objects; (2) read the recovery file backwards until all objects have been restored to their most recent committed values.
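The second approach, reading the recovery file backwards, can be sketched in Java as below. The sketch assumes a simplified in-memory log in which an entry's position is its list index, object values are long integers, and the checkpoint is supplied as a separate map; the entry layout and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BackwardLogRecovery {
    enum Kind { OBJECT, PREPARED, COMMITTED, ABORTED }

    // One recovery-file entry; its position is its index in the log list.
    static class LogEntry {
        final Kind kind;
        final String name;                      // object name or transaction identifier
        final long value;                       // used by OBJECT entries only
        final Map<String, Integer> intentions;  // object name -> position of its value

        LogEntry(Kind kind, String name, long value, Map<String, Integer> intentions) {
            this.kind = kind; this.name = name; this.value = value; this.intentions = intentions;
        }
        static LogEntry object(String name, long value) {
            return new LogEntry(Kind.OBJECT, name, value, Map.of());
        }
        static LogEntry prepared(String tid, Map<String, Integer> intentions) {
            return new LogEntry(Kind.PREPARED, tid, 0, intentions);
        }
        static LogEntry status(Kind kind, String tid) {
            return new LogEntry(kind, tid, 0, Map.of());
        }
    }

    static Map<String, Long> recover(Map<String, Long> checkpoint, List<LogEntry> log) {
        Map<String, Long> restored = new HashMap<>();
        Set<String> committed = new HashSet<>();
        for (int pos = log.size() - 1; pos >= 0; pos--) {   // newest entry first
            LogEntry e = log.get(pos);
            if (e.kind == Kind.COMMITTED) {
                committed.add(e.name);
            } else if (e.kind == Kind.PREPARED && committed.contains(e.name)) {
                // Only committed transactions contribute; the most recent value wins.
                for (Map.Entry<String, Integer> i : e.intentions.entrySet()) {
                    restored.putIfAbsent(i.getKey(), log.get(i.getValue()).value);
                }
            }
        }
        checkpoint.forEach(restored::putIfAbsent);           // objects untouched since the checkpoint
        return restored;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            LogEntry.object("A", 80), LogEntry.object("B", 220),
            LogEntry.prepared("T", Map.of("A", 0, "B", 1)),
            LogEntry.status(Kind.COMMITTED, "T"),
            LogEntry.object("C", 278), LogEntry.object("B", 242),
            LogEntry.prepared("U", Map.of("C", 4, "B", 5)));
        System.out.println(recover(Map.of("A", 100L, "B", 200L, "C", 300L), log));
    }
}
```

In this run U is prepared but never committed, so its values for B and C are ignored and C falls back to the checkpoint value.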

Example – recovery of objects
If the server fails at the point reached at P7 Restore by the second approach P7 is ignored P4 is committed, so find P3 Restore A and B by the intention list of P3 Restore C by P0 Reorganize the recovery file Add an aborted transaction status to the recovery file for transaction U

Reorganize the recovery file
Checkpointing: the process of writing the current committed values of a server's objects to a new recovery file, together with the transaction status entries and intentions lists of transactions that have not yet been fully resolved. Checkpoint: the information stored by the checkpointing process. The purpose of making checkpoints: to reduce the number of transactions to be dealt with during recovery, and to reclaim file space. When to make a checkpoint: immediately after recovery, or from time to time. Recovery from the checkpoint: the old recovery file can be discarded.

Shadow versions
An alternative way to organize the recovery file. Version store: a file that records object values in the order of their versions. Map: locates the versions of the objects in the version store. Recovery: to restore objects, locate them in the version store by means of the map.

Shadow versions mechanism
When a transaction is prepared to commit Updated objects are appended to the version store these new as yet tentative versions are called shadow versions When a transaction commits New map is made by copying the old map and entering the positions of the shadow versions Example

Shadow versions vs. logging
Faster recovery The positions of the current committed objects are recorded in the map No need to search throughout the log for objects Slower normal activity Switch from the old map to the new map must be performed in a single atomic step, so as to lead to an additional stable storage write Logging: only need a sequence of append operations on the same file

Summary
Flat and nested distributed transactions. Two-phase commit protocol: may take an unbounded amount of time to complete but is guaranteed to complete eventually. Concurrency control: locking, timestamp ordering, optimistic concurrency control.

Summary … continued
Distributed deadlock: edge-chasing algorithm. Recovery: logging, shadow versions.

Distributed transactions
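[Figure: (a) a flat transaction T in which the client accesses servers X, Y and Z in turn; (b) nested transactions in which T opens subtransactions T1 and T2, which open T11, T12, T21 and T22 at servers M, N and P.]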

Nested banking transaction
[Figure: the client runs T = openTransaction; openSubTransaction a.withdraw(10); openSubTransaction b.withdraw(20); openSubTransaction c.deposit(10); openSubTransaction d.deposit(20); closeTransaction. The subtransactions T1 to T4 operate on accounts A, B, C and D held at servers X, Y and Z.]

A distributed banking transaction
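[Figure: the client runs T = openTransaction; a.withdraw(4); c.deposit(4); b.withdraw(3); d.deposit(3); closeTransaction. Accounts A, B, C and D are held at BranchX, BranchY and BranchZ; the participant at each branch joins T, and requests carry the transaction identifier, e.g. b.withdraw(T, 3).] Note: the coordinator is in one of the servers, e.g. BranchX.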

Operations for two-phase commit protocol
canCommit?(trans) -> Yes / No: call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.
doCommit(trans): call from coordinator to participant to tell participant to commit its part of a transaction.
doAbort(trans): call from coordinator to participant to tell participant to abort its part of a transaction.
haveCommitted(trans, participant): call from participant to coordinator to confirm that it has committed the transaction.
getDecision(trans) -> Yes / No: call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.

Phase 1 (voting phase): 1. The coordinator sends a canCommit? request to each of the participants in the transaction. 2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No the participant aborts immediately.

Phase 2 (completion according to outcome of vote): 3. The coordinator collects the votes (including its own). (a) If there are no failures and all the votes are Yes the coordinator decides to commit the transaction and sends a doCommit request to each of the participants. (b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes. 4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
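The two phases above can be sketched from the coordinator's point of view in Java. The sketch is deliberately simplified: calls are local and synchronous, a failed canCommit call is treated as a No vote, and timeouts, getDecision, haveCommitted and permanent storage are left out. The Participant interface mirrors the operation names of the protocol, while SimpleParticipant is a hypothetical stand-in used only for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommitSketch {
    interface Participant {
        boolean canCommit(String trans);   // the participant's vote
        void doCommit(String trans);
        void doAbort(String trans);
    }

    // Hypothetical in-memory participant with a fixed vote.
    static class SimpleParticipant implements Participant {
        final boolean vote;
        String state = "active";
        SimpleParticipant(boolean vote) { this.vote = vote; }
        public boolean canCommit(String trans) {
            state = vote ? "prepared" : "aborted";   // a No voter aborts immediately
            return vote;
        }
        public void doCommit(String trans) { state = "committed"; }
        public void doAbort(String trans) { state = "aborted"; }
    }

    // Returns true if the transaction committed, false if it aborted.
    static boolean closeTransaction(String trans, List<Participant> participants) {
        List<Participant> votedYes = new ArrayList<>();
        boolean allYes = true;
        for (Participant p : participants) {          // phase 1: voting
            boolean vote;
            try {
                vote = p.canCommit(trans);
            } catch (RuntimeException failure) {
                vote = false;                         // a failure counts as No
            }
            if (vote) votedYes.add(p); else allYes = false;
        }
        if (allYes) {                                 // phase 2: completion
            for (Participant p : participants) p.doCommit(trans);
            return true;
        }
        for (Participant p : votedYes) p.doAbort(trans);   // only Yes voters are still waiting
        return false;
    }

    public static void main(String[] args) {
        SimpleParticipant x = new SimpleParticipant(true);
        SimpleParticipant y = new SimpleParticipant(false);
        System.out.println(closeTransaction("T", List.of(x, y)));
        System.out.println(x.state + " " + y.state);
    }
}
```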

Communication in two-phase commit protocol
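[Figure: message sequence between coordinator and participant. Step 1: the coordinator, prepared to commit (waiting for votes), sends canCommit?. Step 2: the participant, prepared to commit (uncertain), replies Yes. Step 3: the coordinator, committed, sends doCommit. Step 4: the participant, committed, sends haveCommitted; the coordinator is then done.]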

Transaction T decides whether to commit
[Figure: the tree of T. T1: provisional commit (at X); T2: aborted (at Y); T11: abort (at M); T12: provisional commit (at N); T21: provisional commit (at N); T22: provisional commit (at P).]

Information held by coordinators of nested transactions
[Table: for the coordinator of each transaction (T, T1, T2, T11, T12, T21, T22), its child transactions, whether it is a participant (yes, no because aborted, or no because its parent aborted), its provisional commit list and its abort list.]

canCommit? for hierarchic two-phase commit protocol
canCommit?(trans, subTrans) -> Yes / No: call from a coordinator to the coordinator of a child subtransaction to ask whether it can commit the subtransaction subTrans. The first argument, trans, is the transaction identifier of the top-level transaction. Participant replies with its vote Yes / No.

canCommit? for flat two-phase commit protocol
canCommit?(trans, abortList) -> Yes / No Call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote Yes / No.

Example of a distributed deadlock
Transaction   Operation   Server   Locks
T             Write(A)    at X     locks A
U             Write(B)    at Y     locks B
T             Read(B)     at Y     waits for U
U             Read(A)     at X     waits for T

Interleavings of transactions U, V and W
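[Table: operations of U, V and W on objects held at servers X, Y and Z. Lock-acquiring operations: d.deposit(10) locks D, b.deposit(10) locks B, a.deposit(20) locks A, c.deposit(30) locks C. Blocked operations: b.withdraw(30), c.withdraw(20) and a.withdraw(20) each wait at the server holding the object.]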

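Distributed deadlock
[Figure: (a) wait-for graph including objects A, B, C and D at servers X, Y and Z, with "held by" and "waits for" edges among U, V and W; (b) the resulting cycle among the transactions.]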

Local and global wait-for graphs
[Figure: local wait-for graphs at servers X and Y involving transactions T, U and V, which are sent to a global deadlock detector.]

Two probes initiated
[Figure: transactions T, U, V and W linked by waits-for edges. (a) initial situation; (b) detection initiated at object requested by T; (c) detection initiated at object requested by W.]

Probes travel downhill
[Figure: (a) V stores the probe in its probe queue when U starts waiting; (b) the probe is forwarded when V starts waiting. Transactions U, V and W; waits-for edges for objects B and C.]

Type of entry in a recovery file
Type of entry: Description of contents of entry
Object: A value of an object.
Transaction status: Transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list: Transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.

Log for banking service
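[Figure: recovery log with numbered positions holding Object entries for A, B and C (values 100, 200, 300, 80, 220, 278, 242) and Trans entries for T and U (prepared, committed) with their intentions lists; the checkpoint and the end of the log are marked.]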

Shadow versions
[Figure: the map at start and the map when T commits, both pointing into a version store that holds the values 100, 200, 300, 80, 220, 278 and 242; the checkpoint is marked.]

Chapter 14: Replication Introduction System model and group communication Fault-tolerant services Highly available services Transactions with replicated data Summary

Replication for distributed service
Replication is a key technology for enhancing services Performance enhancement Examples: caches in DNS servers, replicated web servers Load balancing Proximity-based response

Replication for distributed service …continued
Increased availability Factors that affect availability: server failures and network partitions If each of n replicated servers crashes independently with probability p, the availability of the service is 1 - p^n Fault tolerance Guarantees strictly correct behavior despite a certain number and type of faults Requires strict data consistency between all replicated servers
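As a worked example of the availability formula, the short function below evaluates 1 - p^n. It assumes that servers fail independently and that the service is available whenever at least one replica is up; the function name is made up for the illustration.

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n independent replicas is up,
    when each replica is down with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - p ** n


# With a 5% chance that any one server is down, two replicas give
# 1 - 0.05**2 = 0.9975, that is, 99.75% availability.
two_replicas = availability(0.05, 2)
```

Each extra replica multiplies the unavailability by p, so the benefit per added server shrinks quickly.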

A basic architectural model Replica manager
System model A basic architectural model Replica manager One replica manager per replica Receive FE’s request, apply operations to its replicas atomically Front end One front end per client Receive client’s request, communicate with RM by message passing

Request Coordination Execution
An operation executed on a replicated object Request The front end issues the request to one or more replica managers Coordination The replica managers coordinate in preparation for executing the request consistently Different ordering Execution The replica managers execute the request (perhaps tentatively)

An operation executed on the replicated object (2)
Agreement The replica managers reach consensus on the effect of the request Response One or more replica managers respond to the front end

Multicast in a dynamic group
Group communication Multicast in a dynamic group Processes may join and leave the group as the system executes A group membership service Manage the dynamic membership of groups Multicast communication An example

Role of the group membership service
Provide an interface for group membership changes Create and destroy process groups Add or withdraw a process to or from a group Implement a failure detector Mark processes as suspected or unsuspected No messages will be delivered to the suspected process Approaches to treat network partition Primary partition Partitionable

Role of the group membership service (2)
Notify members of group membership changes Group view: a list of identifiers of all active processes, in the order in which they joined Perform group address expansion A process multicasts a message addressed by a group identifier rather than by a list of processes

View delivery is distinct from view receiving
Group view The list of the current group members Delivering a view: when a membership change occurs, the application is notified of the new membership The group management service delivers to any member process p ∈ g a series of views v0(g), v1(g), v2(g), etc. View delivery is distinct from view receiving

Basic requirements for view delivery
Order If a process p delivers view v(g) and then view v'(g), then no other process q ≠ p delivers v'(g) before v(g) Integrity If process p delivers view v(g), then p ∈ v(g) Non-triviality If process q joins a group and is or becomes indefinitely reachable from process p ≠ q, then eventually q is always in the views that p delivers If the group partitions and remains partitioned, then eventually the views delivered in any one partition will exclude any processes in another partition

View-synchronous group communication
Agreement Correct processes deliver the same set of messages in any given view If a process delivers message m in view v(g) and subsequently delivers the next view v'(g), then all processes that survive to deliver the next view v'(g), that is, the members of v(g) ∩ v'(g), also deliver m in the view v(g) Integrity If a process p delivers message m, then it will not deliver m again

View-synchronous group communication (2)
Validity Correct processes always deliver the messages that they send If the system fails to deliver a message to any process q, then it notifies the surviving processes by delivering a new view with q excluded, immediately after the view in which any of them delivered the message Let p be any correct process that delivers message m in view v(g); if some process q ∈ v(g) does not deliver m in view v(g), then the next view v'(g) that p delivers has q ∉ v'(g)

Discussion of view-synchronous group communication
The basic idea Extend the reliable multicast semantics to take account of changing group views Example Case c: q and r deliver a message that was not sent by a member of view (q, r) Case d: agreement is not met Significance When a process delivers a new view, it knows the set of messages that the other correct processes have delivered Implementation Originally developed in ISIS [Birman 1993]

Replication for fault-tolerance
Service replication is an effective measure for fault tolerance Provide a single image for users Strict consistency among all replicas A negative example Inconsistency between replicas defeats fault tolerance

The interleaved sequence of operations
Linearizability The interleaved sequence of operations Assume client i performs operations o_i0, o_i1, o_i2, … Then a sequence of operations issued by two clients and executed on one replica may be: o_20, o_21, o_10, o_22, o_11, … Linearizability criteria The interleaved sequence of operations meets the specification of a (single) correct copy of the objects The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution

Linearizability … continued
Example of “a single correct copy of the objects” A correct bank account For auditing purposes, if one account update occurred after another, then the first update should be observed if the second has been observed Linearizability is not for transactions It concerns only the interleaving of individual operations It is the strictest consistency between replicas Linearizability is hard to achieve

Sequential consistency
Weaker consistency than linearizability Sequential consistency criteria The interleaved sequence of operations meets the specification of a (single) correct copy of the objects The order of operations in the interleaving is consistent with the program order in which each individual client executed them Example

Passive (primary-backup) replication
One primary replica manager, one or more secondary (backup) replica managers When the primary replica manager fails, one of the backups is promoted to act as the primary The architecture

The sequence of events when a client issues a request
Request The front end issues the request, containing a unique identifier, to the primary replica manager Coordination The primary takes each request atomically, in the order in which it receives it Execution The primary executes the request and stores the response

The sequence of events when a client issues a request (2)
Agreement If the request is an update then the primary sends the updated state, the response and the unique identifier to all the backups The backups send an acknowledgement Response The primary responds to the front end, which hands the response back to the client

Linearizability of passive replication
If the primary is correct The system obviously implements linearizability If the primary fails, linearizability is retained provided that The primary is replaced by a unique backup The replica managers that survive agree on which operations had been performed at the point when the replacement primary takes over Approach The primary uses view-synchronous group communication to send the updates to the backups

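Active replication
The front end multicasts each request to the replica managers The architecture [Figure: clients (C) attached to front ends (FE), each of which multicasts to the group of replica managers (RM).]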

Active replication scheme
Request The front end attaches a unique identifier to the request and multicasts it to the group of replica managers, using a totally ordered, reliable multicast primitive Coordination The group communication system delivers the request to every correct replica manager in the same order Execution Every replica manager executes the request Agreement No agreement phase is needed Response Each replica manager sends its response to the front end

Active replication performance
Achieves sequential consistency Reliable multicast: all correct replica managers process the same set of requests Total order: all correct replica managers process requests in the same order FIFO order: maintained by each front end No linearizability: the total order is not necessarily the same as the real-time order

High availability vs. fault tolerance
Fault tolerance “eager” consistency: all replicas reach agreement before control is passed back to the client High availability “lazy” consistency: consistency is not reached until the next access; agreement is reached after control is passed back to the client Examples: Gossip, Bayou, Coda

The gossip architecture
The architecture A front end connects to any replica manager Query/Update Replica managers exchange “gossip” messages periodically to maintain consistency Two guarantees Each client obtains a consistent service over time Relaxed consistency between replicas All replica managers eventually receive all updates and they apply updates with ordering guarantees

Request Update response Coordination
Queries and updates in a gossip service Request The front end sends the request to a replica manager Query: the client may be blocked Update: the client is not blocked Update response The replica manager replies immediately Coordination The request is held until it can be applied The replica manager may receive gossip messages sent from other replica managers in the meantime

Execution Query response Agreement
Queries and updates in a gossip service … continued Execution The replica manager executes the request Query response Reply at this point Agreement exchange gossip messages which contain the most recent updates applied on the replica Exchange occasionally Ask the particular replica manager to send when some replica manager finds it has missed one

The front end’s version timestamp
Client exchange data Access the gossip service Communicate directly A vector timestamp at each front end Contain an entry for each replica manager Attached to every message sent to the gossip service or other front ends When front end receives a message Merge the local vector timestamp with the timestamp in the message The significance of the vector timestamp Reflect the version of the latest data values accessed by the front end

Replica manager state Value Value timestamp Update log
Value timestamp Represents the updates that are reflected in the value E.g., (2,3,5): the value reflects 2 updates accepted at the 1st replica manager, 3 at the 2nd and 5 at the 3rd Update log Records all received updates until they become stable Replica timestamp Represents the updates that have been accepted by the replica manager Executed operation table Filters duplicate updates that could be received from front ends and other replica managers Timestamp table Contains a vector timestamp for each other replica manager, used to identify which updates have been applied at those replica managers

Query operations in gossip service
When the query reaches the replica manager If q.prev <= valueTS Return immediately The timestamp in the returned message is valueTS Otherwise Hold the query in a hold-back queue until the condition is met E.g. valueTS = (2,5,5), q.prev = (2,4,6): one update from replica manager 2 is missing When the query returns frontEndTS := merge(frontEndTS, new)
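The query condition is a component-wise comparison of vector timestamps. The sketch below uses hypothetical helper names (`leq`, `merge`, `can_answer`, `missing_updates`) and replays the example from this slide, with replica managers numbered from 0.

```python
def leq(a, b):
    """Vector timestamp order: a <= b when every entry of a is <= b's."""
    return all(x <= y for x, y in zip(a, b))


def merge(a, b):
    """Component-wise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def can_answer(q_prev, value_ts):
    # The replica may answer only if its value already reflects every
    # update the front end has seen.
    return leq(q_prev, value_ts)


def missing_updates(q_prev, value_ts):
    """Map replica manager index -> number of updates still missing."""
    return {i: q - v
            for i, (q, v) in enumerate(zip(q_prev, value_ts))
            if q > v}


value_ts = (2, 5, 5)
q_prev = (2, 4, 6)
held_back = not can_answer(q_prev, value_ts)   # query goes to the hold-back queue
missing = missing_updates(q_prev, value_ts)    # one update from replica manager 2
```

Once the missing update arrives and the query is answered, the front end merges the returned timestamp into its own.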

Processing updates in causal order
A front end sends the update as (u.op, u.prev, u.id) u.prev: the timestamp of the front end When replica manager i receives the update Discard it if it is already in the executed operation table or in the log Otherwise, save it in the log: replicaTS[i] = replicaTS[i] + 1; ts = u.prev with ts[i] = replicaTS[i]; logRecord = <i, ts, u.op, u.prev, u.id> Pass ts back to the front end, which sets frontEndTS = merge(frontEndTS, ts)
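A minimal sketch of how a replica manager might accept an update. It assumes the usual formulation of the gossip architecture, in which replica manager i increments entry i of its replica timestamp and copies that entry into the update's timestamp ts; the class and field names are invented for the example.

```python
class ReplicaManager:
    """Hypothetical replica manager holding only the state needed here."""

    def __init__(self, index, replica_ts):
        self.i = index
        self.replica_ts = list(replica_ts)
        self.log = []          # records not yet applied
        self.executed = set()  # ids of updates already applied

    def receive_update(self, op, prev, uid):
        # Discard duplicates: already applied, or already in the log.
        if uid in self.executed or any(r["id"] == uid for r in self.log):
            return None
        # Count the newly accepted update in this manager's own entry.
        self.replica_ts[self.i] += 1
        # Unique timestamp for the update: u.prev with entry i replaced.
        ts = list(prev)
        ts[self.i] = self.replica_ts[self.i]
        self.log.append({"rm": self.i, "ts": tuple(ts), "op": op,
                         "prev": tuple(prev), "id": uid})
        return tuple(ts)  # passed back to the front end


def merge(a, b):
    """Component-wise maximum, used by the front end on the returned ts."""
    return tuple(max(x, y) for x, y in zip(a, b))


rm0 = ReplicaManager(index=0, replica_ts=(2, 4, 6))
ts = rm0.receive_update("deposit(10)", prev=(2, 3, 4), uid="u1")
front_end_ts = merge((2, 3, 4), ts)
```

Resending the same update id returns `None`, which is how retransmissions by a front end are filtered out.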

Processing updates in causal order … continued
Check whether the update has become stable: u.prev <= valueTS Example: a stable update at RM 0 ts = (3,3,4), u.prev = (2,3,4), valueTS = (2,4,6) Apply the stable update value = apply(value, r.u.op) valueTS = merge(valueTS, r.ts) executed = executed ∪ {r.u.id}
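The stability check and the three assignments can be written as one function. This is a sketch: the record layout and the `apply_op` callback are assumptions, and the numbers are those of the example on this slide.

```python
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))


def merge(a, b):
    return tuple(max(x, y) for x, y in zip(a, b))


def apply_if_stable(value, value_ts, executed, record, apply_op):
    """Apply a log record if it is stable (record.prev <= valueTS).

    Returns (value, value_ts, executed, applied)."""
    if record["id"] in executed or not leq(record["prev"], value_ts):
        return value, value_ts, executed, False
    return (apply_op(value, record["op"]),
            merge(value_ts, record["ts"]),
            executed | {record["id"]},
            True)


# Example from the slide: the update is stable at RM 0.
record = {"id": "u1", "op": 10, "prev": (2, 3, 4), "ts": (3, 3, 4)}
result = apply_if_stable(100, (2, 4, 6), frozenset(), record,
                         apply_op=lambda balance, amount: balance + amount)
```

After the update is applied, the value timestamp becomes (3, 4, 6), the merge of (2, 4, 6) and (3, 3, 4).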

Exchange gossip message
Gossip messages Exchange gossip messages Estimate which messages another replica manager has missed from its entry in the timestamp table Exchange gossip messages periodically, or when some other replica manager asks The format of a gossip message m.log: one or more updates from the source replica manager's log m.ts: the replica timestamp of the source replica manager

When receiving a gossip message
Check each record r in m.log Discard r if r.ts <= replicaTS The record r is already in the local log or has already been applied to the value Otherwise, insert r in the local log replicaTS = merge(replicaTS, m.ts) Find the stable updates Sort the update log to find the stable ones, and apply them to the value in the “≤” order of their timestamps Update the timestamp table If the gossip message is from replica manager j, then tableTS[j] = m.ts

When receiving a gossip message …continued
Discard a useless update r from the log Discard r if tableTS[i][c] >= r.ts[c] for all i, where c is the replica manager that created r Discard entries in the executed operation table

How often to exchange gossip messages?
Update propagation How often to exchange gossip messages? Minutes, hours or days Depends on the requirements of the application How to choose partners for the exchange? Random Deterministic Use a simple function of the replica manager's state to choose the partner Topological Mesh, circle, tree

The Coda file system
Limits of AFS Read-only replicas The objective of Coda Constant data availability Coda extends AFS with Read-write replicas Optimistic strategy to resolve conflicts Disconnected operation

Volume storage group (VSG) Available volume storage group (AVSG)
The Coda architecture Venus/Vice Vice: replica manager (server side) Venus: hybrid of front end and replica manager (client side) Volume storage group (VSG) The set of servers holding replicas of a file volume Available volume storage group (AVSG) The subset of the VSG that a client can currently reach Venus knows the AVSG of each file Accessing a file The file is serviced by any server in the AVSG

The Coda architecture … continued
On closing a file Copies of modified files are broadcast in parallel to all of the servers in the AVSG File modification is allowed while the network is partitioned When the network partition is repaired, new updates are reapplied to the file copies in the other partition Meanwhile, file conflicts are detected Disconnected operation Occurs when the file's AVSG becomes empty and the file is in the cache Updates made during disconnected operation are applied on the servers later, when the AVSG becomes nonempty If there are conflicts, they are resolved manually

The replication strategy
Coda version vector (CVV) Attached to each version of a file Each element of the CVV is an estimate of the number of modifications performed on the version of the file that is held at the corresponding server v1: CVV of a file copy at server1, v2: CVV of the file copy at server2 v1>=v2, or v1<=v2 : no conflict Neither v1>=v2, nor v2>=v1: conflict
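The CVV comparison rule can be expressed directly. The function names below are made up for the example; the vectors are the ones used in the partition example of this chapter.

```python
def dominates(v1, v2):
    """True when v1 >= v2 in every element."""
    return all(a >= b for a, b in zip(v1, v2))


def compare_cvv(v1, v2):
    """Classify two Coda version vectors of the same file."""
    if len(v1) != len(v2):
        raise ValueError("CVVs must have one element per server")
    ge, le = dominates(v1, v2), dominates(v2, v1)
    if ge and le:
        return "equal"
    if ge:
        return "first_newer"    # v1 >= v2: copy 2 is simply out of date
    if le:
        return "second_newer"   # v1 <= v2: copy 1 is simply out of date
    return "conflict"           # neither dominates: concurrent updates


outcome = compare_cvv((2, 2, 1), (1, 1, 3))
```

When one vector dominates, the stale copy can be replaced automatically; only the "conflict" outcome needs resolution.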

When a modified file is closed
How to construct a CVV When a modified file is closed Broadcast the file with its current CVV to the AVSG Each server in the AVSG increases its own element of the CVV and returns it to the client The client merges all returned CVVs into the new CVV and distributes it to the AVSG Example: CVV = (2,2,1) The file on server1 has received 2 updates The file on server2 has received 2 updates The file on server3 has received 1 update

Example File F is replicated at 3 servers: s1,s2,s3 Initially
VSG = {s1,s2,s3} F is modified at the same time by c1 and c2 Because of a network partition, the AVSG of c1 is {s1,s2} and the AVSG of c2 is {s3} Initially The CVVs for F at all 3 servers are [1,1,1] c1 updates the file and closes it The CVVs at s1 and s2 become [2,2,1] One update has been applied on s1 and s2 since the beginning c2 updates the file and closes it twice The CVV at s3 becomes [1,1,3] Two updates have been applied on s3 since the beginning

When the network failure is repaired
Example … continued When the network failure is repaired c2 changes its AVSG to {s1,s2,s3} and requests the CVVs for F from all members of the new AVSG c2 finds that neither [2,2,1] >= [1,1,3] nor [1,1,3] >= [2,2,1], which means a conflict has occurred A conflict means that concurrent updates were made while the network was partitioned The conflict is resolved manually at c2

One-copy serializability
What is one-copy serializability The effect of transactions performed by clients on replicated objects should be the same as if they had been performed one at a time on a single set of objects Architecture of replicated transactions Where to forward a client request? How many replica managers are required to complete an operation? … Different replication schemes Available copies Quorum consensus Virtual partition

Architectures for replicated transactions
Cooperation of the replica managers Read-one/write-all Quorum consensus Update propagation Lazy approach: do not forward the updates to other replica managers until after a transaction commits Eager approach: forward the updates to other replica managers within a transaction, before it commits

The two-phase commit protocol
Two-level nested two phase commit protocol Top level subtransaction for the primary object The second level subtransaction for the other objects

Read-one / write-all scheme
A simple replication scheme How to obtain one-copy serializability? Write lock When applying a write operation, set a write lock on every replica of the object Read lock When applying a read operation, set a read lock on any one replica of the object Deadlock may happen But one-copy serializability is maintained

Available copies replication
Read-one / write-all is not realistic Some of the replica managers may be unavailable Available copies replication Read: performed on any one available replica Write: performed on all available replicas How to obtain one-copy serializability? Can the locking scheme of read-one/write-all work? Example

Replica manager failure
Inconsistency due to server crash An RM may crash during a transaction Example X fails after T has performed getBalance but before U performs deposit N fails after U has performed getBalance but before T performs deposit The concurrency control on A at RM X does not prevent transaction U from updating A at RM Y, so inconsistency results Local concurrency control is not sufficient to ensure one-copy serializability

Local validation
Concurrency control in addition to locking Ensure that no failure or recovery event appears to happen during the progress of a transaction Example T has read from an object at X and observes the failure of N when it attempts to update, so if T is valid the order of failures and T must be: N fails → T reads object A at X; T writes object B at M and P → T commits → X fails Similarly, for U the order must be: X fails → U reads object B at N; U writes object A at Y → U commits → N fails The two orders conflict, so if T is validated first then U is aborted, and vice versa.

Network partition for replicated transactions
Network partitions Network partition for replicated transactions May lead to inconsistency Deal with partition in available copies scheme Assumption: partitions will eventually be repaired Compensate scheme: if find conflict when partition is repaired, abort some transactions Precedence graph: detect inconsistencies between partitions

Quorum consensus methods
Avoid inconsistency in the case of partition Conflicting operations can be carried out within only one of the partitions Quorum A subgroup whose size gives it the right to carry out operations

Gifford's algorithm
Votes Each copy of an object is assigned a number of votes, an integer that can be regarded as a weighting related to the desirability of using that copy Quorum scheme Read quorum Before a read: must obtain a read quorum of R votes Write quorum Before a write: must obtain a write quorum of W votes W > half the total votes, R + W > total number of votes for the group Hence any pair of conflicting operations must be performed on at least one common copy
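The two quorum constraints are simple integer checks. In the sketch below the function names are hypothetical, and the configuration of four copies with one vote each is an assumption chosen to match R = 2, W = 3.

```python
def valid_quorums(votes, r, w):
    """Check Gifford's constraints: W > half the total votes, R + W > total."""
    total = sum(votes)
    return 2 * w > total and r + w > total


def has_quorum(votes, reachable, needed):
    """True when the reachable copies together hold at least `needed` votes."""
    return sum(votes[i] for i in reachable) >= needed


votes = [1, 1, 1, 1]                       # assumed: one vote per copy
ok = valid_quorums(votes, r=2, w=3)        # 3 > 2 and 2 + 3 > 4
can_write = has_quorum(votes, {0, 1, 2}, needed=3)
can_read_alone = has_quorum(votes, {3}, needed=2)
```

Because R + W exceeds the total, every read quorum shares at least one copy with every write quorum, and because W exceeds half the total, two write quorums cannot form in different partitions.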

Configurability of groups of replica managers
Different performance or reliability Decrease W (or R): increase the performance of write (or read) Increase W (or R): increase the reliability of write (or read) Weak representatives Local cache at client computers Vote = 0 A read may be performed on it, once a read quorum has been obtained and it is up-to-date

An example from Gifford
Example 1 A file with a high read-to-write ratio Replication is used to enhance the performance of the system, not the reliability Example 2 A file with a moderate read-to-write ratio Reads can be satisfied from the local RM Writes must access one remote RM Example 3 A file with a very high read-to-write ratio Read-one/write-all

Virtual partition algorithm
Combination of the available copies algorithm and the quorum consensus algorithm Quorum consensus algorithm Works correctly in the presence of partitions Available copies algorithm Less expensive for read operations Virtual partition A partition that has enough RMs to meet the quorum criteria Perform the available copies algorithm within a virtual partition

Example Four RMs of a file: V, X, Y and Z Initially
R=2, W=3 Initially V, X, Y and Z can contact with each other Conduct available copies algorithm Network partition happens Create a virtual partition V keeps on trying to contact Y and Z until one or both of them replies V, X and Y comprise a virtual partition since they are sufficient to form read and write quorum Conduct available copies algorithm in the virtual partition

Implementation of virtual partitions
Overlapping virtual partitions E.g. when Y and Z create virtual partitions simultaneously Conflict A read lock on Z will not conflict with a write lock in another virtual partition, so one-copy serializability is broken Approach Logical timestamp of a virtual partition: the creation time of the virtual partition If virtual partitions are being created simultaneously, create the one with the higher logical timestamp

Summary
Replication for distributed systems High performance, high availability, fault tolerance Group communication Group management service: view delivery View-synchronous group communication Replication for fault tolerance Linearizability and sequential consistency Primary-backup replication Maintains linearizability Uses view-synchronous group communication Active replication Maintains sequential consistency Based on total-order, reliable group communication

Summary (2)
Replication for high availability Gossip protocol Lazy consistency Coda Transactions with replicated data Read-one/write-all Available copies replication Quorum consensus methods Virtual partition Combination of available copies replication and quorum consensus methods

A basic architectural model for the management of replicated data
[Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the replica managers (RM) that make up the service.]

FIFO ordering Causal ordering Total ordering
Different ordering of coordination between replicated objects FIFO ordering if a front end issues request r then request r’, then any correct replica manager that handles r’ handles r before it Causal ordering if the issue of request r happened-before the issue of request r’, then any correct replica manager that handles r’ handles r before it Total ordering if a correct replica manager handles r before request r’, then any correct replica manager that handles r’ handles r before it.

Services provided for process groups
[Figure: a process group served by group address expansion, multicast communication and group membership management; processes join, leave, fail and send to the group.]

View-synchronous group communication
[Figure: four scenarios in which p crashes and the view changes from (p, q, r) to (q, r): a (allowed), b (allowed), c (disallowed), d (disallowed).]

An example of inconsistency between two replications
Computers A and B each maintain replicas of two bank accounts x and y Client accesses any one of the two computers, updates synchronized between the two computers Server A X Y Server B X Y Synchronize Client 2 Client 1

An example of inconsistency between two replications
Client1: setBalanceB(x,1) Server B failed… setBalanceA(y,2) Client2: getBalanceA(y)=2 getBalanceA(x)=0 Inconsistency happens since computer B fails before propagating new value to computer A

An example of sequential consistency
Client1: setBalanceB(x,1) Server B failed… setBalanceA(y,2) Client2: getBalanceA(y)=0 getBalanceA(x)=0 An interleaving of the operations at server A: getBalanceA(y)=0, getBalanceA(x)=0, setBalanceB(x,1), setBalanceA(y,2) Does not satisfy linearizability Satisfies sequential consistency

The passive model for fault tolerance
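[Figure: clients (C) connect through front ends (FE) to the primary replica manager (RM), which is linked to the backup replica managers.]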

Query and update operations in a gossip service
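[Figure: clients issue queries and updates through front ends (FE). A query is sent as (Query, prev) and answered with (Val, new); an update is sent as (Update, prev) and answered with the update id. The replica managers (RM) of the service exchange gossip.]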

Front ends propagate their timestamps whenever clients communicate directly
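[Figure: the clients' front ends (FE) pass vector timestamps to each other and to the replica managers (RM) of the service, which exchange gossip.]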

A gossip replica manager, showing its main state components
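[Figure: a replica manager holding the value, value timestamp, replica timestamp, update log (entries with OperationID, Update, Prev), executed operation table and timestamp table. Updates arrive from front ends (FE), gossip messages are exchanged with other replica managers, and stable updates are applied to the value.]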

Transactions on replicated objects
Client + front end getBalance(A) Replica managers deposit(B,3); U T

Available copies Concurrency control
[Figure: clients and front ends run T, which performs getBalance(A) and deposit(B,3), and U, which performs getBalance(B) and deposit(A,3). Replica managers X and Y hold A; M, N and P hold B.] Concurrency control At X, transaction T has read A and therefore transaction U is not allowed to update A with the deposit operation until transaction T has completed

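Network partition
[Figure: a network partition separates the client and front end running T, which issues withdraw(B, 4), from the client and front end running U, which issues deposit(B,3), and divides the replica managers holding B.]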

Gifford’s quorum consensus examples
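[Table: three example configurations. Rows: latency in milliseconds of Replica 1, Replica 2 and Replica 3; voting configuration per replica; quorum sizes R and W; derived performance of the file suite (read and write latency and blocking probability). Most cell values were lost in extraction; surviving values include latencies 75, 65, 100, 750 and blocking probabilities 0.01, 0.0002, 0.0101, 0.03.]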

Two network partitions
[Figure: a network partition divides replica managers V, X, Y and Z into two groups while transaction T is running.]

Virtual partitions
[Figure: replica managers X, V, Y and Z; a virtual partition is drawn around a subset of them while the network partition persists.]

Two overlapping virtual partitions
[Figure: virtual partitions V1 and V2 overlapping across replica managers Y, X, V and Z.]

Creating a virtual partition
Phase 1:
• The initiator sends a Join request to each potential member. The argument of Join is a proposed logical timestamp for the new virtual partition.
• When a replica manager receives a Join request, it compares the proposed logical timestamp with that of its current virtual partition.
  - If the proposed logical timestamp is greater, it agrees to join and replies Yes;
  - If it is less, it refuses to join and replies No.
Phase 2:
• If the initiator has received sufficient Yes replies to have read and write quora, it may complete the creation of the new virtual partition by sending a Confirmation message to the sites that agreed to join. The creation timestamp and list of actual members are sent as arguments.
• Replica managers receiving the Confirmation message join the new virtual partition and record its creation timestamp and list of actual members.
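The two phases can be sketched as follows. This is an illustrative model only: the function names and the dictionary-based state are assumptions, the initiator is counted among the potential members, and a proposed timestamp equal to the current one is treated as a No, a case the slide leaves open.

```python
def handle_join(current_ts, proposed_ts):
    """Phase 1 at a replica manager: agree only to a newer virtual partition."""
    return proposed_ts > current_ts


def create_virtual_partition(proposed_ts, current_ts_of, votes, r, w):
    """Run both phases; return the sorted member list, or None on failure.

    current_ts_of maps each potential member to the logical timestamp of
    its current virtual partition; votes maps each member to its votes."""
    agreed = [m for m, ts in current_ts_of.items()
              if handle_join(ts, proposed_ts)]
    collected = sum(votes[m] for m in agreed)
    # Phase 2: confirm only when the Yes replies form read and write quora.
    if collected >= r and collected >= w:
        return sorted(agreed)
    return None


votes = {"V": 1, "X": 1, "Y": 1, "Z": 1}
members = create_virtual_partition(
    proposed_ts=2,
    current_ts_of={"V": 1, "X": 1, "Y": 1, "Z": 5},
    votes=votes, r=2, w=3)
```

With R = 2 and W = 3, the three Yes replies from V, X and Y are enough, while Z stays in its newer virtual partition.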

Chapter 15: Distributed Multimedia Systems
Introduction Characteristics of multimedia data Quality of service management Resource management Stream adaptation Case study: the Tiger video file server Summary

Characteristics of distributed multimedia applications
Networked video library, Internet telephony, videoconference Characteristics of multimedia applications Timely delivery of streams of multimedia data to end-users Audio sample, video frame To meet the timing requirements QoS( quality of service)

Traditional real-time system
QoS management Traditional real-time system E.g. telephone switching Small quantities of data, strict time requirements QoS management: a fixed schedule that ensures worst-case requirements are always met Different requirements of multimedia apps. General environment Compete with other distributed apps. for network bandwidth and computing resources Dynamic resource requirements E.g. the number of participants in a video conference may vary Users participate in the control of resource consumption

Existing distributed multimedia Apps. without QoS
Web-based multimedia Extensive buffering Best-effort Network phone and audio conference Relatively low bandwidth Efficient compression techniques High interactive latency Video-on-demand services Require sufficient dedicated network bandwidth

Highly interactive applications
Examples Videoconference Distributed online ensemble Requirements Low-latency communication: round-trip delays < 100 ms Synchronous distributed state If one user stops a video on a given frame, the other users should see it stopped at the same frame Media synchronization All participants in a music performance should hear the performance at approximately the same time External synchronization Sometimes other information needs to be synchronized with the time-based multimedia streams Rigorous QoS management is expected

The window of scarcity A history of computer systems that support distributed data access

Characteristics of multimedia data
Continuous Refers to the user's view of the data Video: an image array is replaced 25 times per second Audio: the amplitude value is replaced 8000 times per second Time-based The delivery delay for each element is bounded Bulky multimedia streams Data compression Reduces bandwidth requirements by factors between 10 and 100 E.g. GIF, TIFF, JPEG, MPEG-1, MPEG-2, MPEG-4

Traditional computer systems
Multimedia apps. compete with other apps. for Processor cycles, bus cycles, buffer capacity Physical transmission links, switches, gateways Best-effort policies Multi-task OS: round-robin scheduling Each task is allocated very few cycles when there are many tasks Ethernet: collision detection Many collisions occur when the network is heavily loaded Best-effort policies are not suitable for multimedia apps.

QoS management for distributed multimedia apps.
Architecture of a typical system Source Stream processors Connections Network connection In-memory transfer Target Each process must be allocated adequate CPU time, memory capacity and network bandwidth Resource requirement

The QoS manager's task
Quality of service negotiation Apps. specify the resource requirements QoS manager evaluates the feasibility Give a positive or negative response Admission control Applications run under a resource contract Recycle the released resource

Resource requirements specification
Quality of service negotiation Resource requirements specification Bandwidth The rate at which a multimedia stream flows Latency The time required for an individual data element to move through a stream from the source to the destination Loss rate Data is lost when resource requirements are not met The specification gives the rate of data loss that can be accepted, e.g. 1%

The usage of resource requirements spec.

Describe a multimedia stream: describe the characteristics of a multimedia stream in a particular environment, e.g. a video conference: bandwidth 1.5 Mbps; delay 150 ms; loss rate 1%. Describe the resources: describe the capabilities of resources to transport a stream, e.g. a network may provide: bandwidth 64 kbps; delay 10 ms; loss rate 1/1000.

Specify the QoS parameters for streams

Bandwidth: specified as a minimum-maximum value or an average value. Required bandwidth varies according to the compression rate of the video, e.g. 1:50 to 1:100 for MPEG video. Specify burstiness: streams with the same average bandwidth can have different traffic patterns. LBAP (linear-bounded arrival process) model: the amount of data arriving in any interval t is at most Rt + B, where R is the rate and B is the maximum size of a burst.

Specify the QoS parameters for streams (2)
Latency: the frames of a stream should be processed at the same rate at which they arrive. The delay should stay below what humans perceive, e.g. 150 ms for interactive apps, 500 ms for video on demand. No jitter (jitter: the variation in the period between the delivery of two adjacent frames). Loss rate: typically expressed as a probability; calculated based on worst-case assumptions or on standard distributions.

Traffic shaping

Traffic shaping: output buffering to smooth the flow of data elements. The leaky bucket algorithm eliminates bursts. R: a stream will never flow at a rate higher than R. B: the size of the buffer, which bounds the time for which an element will remain in the buffer.
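The leaky bucket can be illustrated with a small discrete-time simulation. This is only a sketch under stated assumptions: time is divided into equal steps, `rate` plays the role of R (elements released per step), `buffer_size` plays the role of B, and arrivals that do not fit in the buffer are dropped. The function name and the drop policy are illustrative choices, not taken from the slides.

```python
def leaky_bucket(arrivals, rate, buffer_size):
    """Simulate a leaky bucket in discrete time steps.

    arrivals[i] is the number of elements arriving in step i.
    At most `rate` elements leave per step (R) and at most
    `buffer_size` elements wait in the buffer (B).
    Assumption: elements that do not fit in the buffer are dropped.
    Returns (elements sent in each step, number of dropped elements).
    """
    queued = 0
    dropped = 0
    sent = []
    for n in arrivals:
        accepted = min(n, buffer_size - queued)  # fill remaining buffer space
        dropped += n - accepted
        queued += accepted
        out = min(rate, queued)                  # never exceed the rate R
        queued -= out
        sent.append(out)
    while queued > 0:                            # drain what is still buffered
        out = min(rate, queued)
        queued -= out
        sent.append(out)
    return sent, dropped


if __name__ == "__main__":
    sent, dropped = leaky_bucket([5, 0, 0, 0], rate=2, buffer_size=4)
    print(sent, dropped)
```

In this model a burst of 5 elements leaves as a smooth flow of at most 2 per step, and an accepted element waits for roughly B / R steps at most.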

Traffic shaping (2)

The token bucket algorithm allows larger bursts. Tokens are generated at a fixed rate R and collected in a bucket of size B. Data of size S can be sent only if at least S tokens are in the bucket; the sender process then removes these S tokens. This ensures that over any interval t, the amount of data sent is not larger than Rt + B, so it is an implementation of the LBAP model.
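A minimal sketch of the token bucket follows. It assumes the bucket starts full and that the caller passes the current time explicitly (so the example is deterministic); the class and method names are hypothetical.

```python
class TokenBucket:
    """Token bucket with rate R (tokens per second) and size B (tokens)."""

    def __init__(self, rate, size):
        self.rate = rate        # R: tokens generated per second
        self.size = size        # B: maximum number of stored tokens
        self.tokens = size      # assumption: the bucket starts full
        self.last = 0.0         # time of the last refill

    def _refill(self, now):
        # Add the tokens generated since the last call, capped at B.
        self.tokens = min(self.size, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_send(self, amount, now):
        """Send `amount` units at time `now` if enough tokens are available."""
        self._refill(now)
        if amount <= self.tokens:
            self.tokens -= amount   # the sender removes S tokens
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=10, size=50)
    print(bucket.try_send(50, now=0.0))  # a burst of size B is allowed
    print(bucket.try_send(1, now=0.0))   # the bucket is now empty
    print(bucket.try_send(10, now=1.0))  # R tokens were generated in one second
```

After one second the sender has emitted 60 units, which equals Rt + B for R = 10, t = 1 and B = 50, the bound stated above.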

Flow specification – RFC 1363

Bandwidth: the maximum transmission unit and maximum transmission rate; the burstiness of the stream, given by the token bucket size and rate. Delay: the minimum delay that an application can notice; the maximum jitter it can accept. Loss rate: the total acceptable number of losses over a certain interval; the maximum number of consecutive losses.

Negotiation procedures

Simple approach: follow the flow of data along each stream from the source to the target. The source sends a flow spec to its local QoS manager. The manager checks against its database of available resources whether the requested QoS can be provided. The flow spec is forwarded to the next node, and so on until the last node. The result is passed back from the target to the source. Transactional QoS negotiation procedure: deals with concurrent QoS negotiations.
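The simple hop-by-hop procedure might look like the sketch below. `FlowSpec`, `Node` and `negotiate` are hypothetical names, the flow spec is reduced to bandwidth and delay, and the rollback of partial reservations on failure is an added assumption that the slides do not describe.

```python
from dataclasses import dataclass


@dataclass
class FlowSpec:
    bandwidth_kbps: float   # requested bandwidth
    max_delay_ms: float     # acceptable end-to-end delay


@dataclass
class Node:
    name: str
    free_kbps: float        # bandwidth still available at this node
    delay_ms: float         # delay this node adds to the stream


def negotiate(path, spec):
    """Pass the flow spec from source to target, reserving at every node.

    Returns True if every node can provide the requested QoS.
    Assumption: on failure, reservations made so far are released.
    """
    reserved = []
    total_delay = 0.0
    for node in path:
        total_delay += node.delay_ms
        if node.free_kbps < spec.bandwidth_kbps or total_delay > spec.max_delay_ms:
            for earlier in reserved:            # negative response: roll back
                earlier.free_kbps += spec.bandwidth_kbps
            return False
        node.free_kbps -= spec.bandwidth_kbps   # reserve and forward the spec
        reserved.append(node)
    return True


if __name__ == "__main__":
    path = [Node("source", 2000, 5), Node("router", 1000, 10), Node("target", 2000, 5)]
    print(negotiate(path, FlowSpec(1500, 150)))  # the router cannot carry 1.5 Mbps
    print(negotiate(path, FlowSpec(800, 150)))   # fits on every node
```

The sketch handles one negotiation at a time; concurrent requests are what the transactional procedure is meant to address.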

Admission control

Admission control avoids resource overload. Bandwidth reservation: reserve the maximum bandwidth exclusively; used for applications that cannot adapt to different QoS levels, e.g. x-ray video. Statistical multiplexing: reserve the minimum or average bandwidth; handle bursts that occasionally cause the service level to drop. Hypothesis: for a large number of streams, the aggregate bandwidth required remains nearly constant regardless of the bandwidth of individual streams.
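The difference between the two reservation policies can be shown with a small admission test. This is a sketch only: `Stream`, `admit` and the policy names `"peak"` and `"mean"` are hypothetical, and each stream is described by just an average and a maximum bandwidth.

```python
from typing import NamedTuple


class Stream(NamedTuple):
    name: str
    mean_kbps: float   # average bandwidth of the stream
    peak_kbps: float   # maximum bandwidth of the stream


def admit(streams, link_kbps, policy):
    """Return the names of the streams admitted on a link.

    policy "peak": reserve the maximum bandwidth exclusively.
    policy "mean": statistical multiplexing, reserve the average only.
    """
    admitted = []
    reserved = 0.0
    for s in streams:
        need = s.peak_kbps if policy == "peak" else s.mean_kbps
        if reserved + need <= link_kbps:   # admit only if the contract fits
            admitted.append(s.name)
            reserved += need
    return admitted


if __name__ == "__main__":
    streams = [Stream("a", 1000, 2000), Stream("b", 1000, 2000), Stream("c", 1000, 2000)]
    print(admit(streams, 4000, "peak"))
    print(admit(streams, 4000, "mean"))
```

Reserving peaks admits fewer streams but never overloads the link; reserving averages admits more and relies on the hypothesis that bursts of different streams rarely coincide.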

Resource scheduling

Fair scheduling: round-robin, either packet-by-packet or bit-by-bit. In practice, the scheduler calculates the time at which each packet should be sent out on the basis of bit-by-bit round-robin, and then sends the packet at that time. Real-time scheduling: earliest-deadline-first (EDF). Each media element is assigned a deadline by which it must be sent out; the scheduler sends media elements in order of their deadlines.
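Earliest-deadline-first ordering can be sketched with a priority queue. The function name and the `(deadline, name)` representation of a media element are assumptions for illustration; ties are broken by name simply because tuples compare element by element.

```python
import heapq


def edf_order(elements):
    """Return element names in the order an EDF scheduler would send them.

    `elements` is an iterable of (deadline, name) pairs; the element with
    the earliest deadline is always sent first.
    """
    heap = list(elements)
    heapq.heapify(heap)                 # min-heap keyed on the deadline
    order = []
    while heap:
        _deadline, name = heapq.heappop(heap)
        order.append(name)
    return order


if __name__ == "__main__":
    pending = [(30, "video-2"), (10, "audio-1"), (20, "video-1")]
    print(edf_order(pending))
```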

Stream adaptation

An application adapts to changing QoS levels when a certain QoS cannot be guaranteed. Drop pieces of information: Audio stream: a drop is noticed immediately by the listener. Video stream: Motion JPEG: easy, since frames are independent; MPEG: difficult, since frames are interdependent. Increase delay: acceptable for non-interactive applications.

When to perform scaling
Adapt a stream to the bandwidth available in the system before it enters a bottleneck resource. Scaling approach: subsample, e.g. reduce the audio sampling rate or drop a channel in a stereo transmission. Implementation: a monitor process at the target and a scaler process at the source.

Different scaling methods
Temporal scaling: decrease the number of video frames transmitted within an interval. Spatial scaling: reduce the number of pixels of each image in a video stream. Frequency scaling: modify the compression algorithm applied to an image. Amplitudinal scaling: reduce the color depth of each image pixel. Color space scaling: reduce the number of entries in the color space, e.g. from color to grey-scale presentation.

Filtering

Scaling is not suitable for a stream that has several receivers: since scaling is conducted at the source, a scale-down message will degrade the quality of all streams. Filtering: a stream is partitioned into a set of hierarchical sub-streams. The capacity of the nodes on a path determines the number of sub-streams a target receives.
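How path capacity limits the number of sub-streams can be sketched as follows. The function name, the representation of sub-streams as a list of bandwidths (base layer first) and the rule that a target receives a contiguous prefix of the hierarchy are assumptions made for illustration.

```python
def substreams_received(layer_kbps, path_capacities_kbps):
    """Count the hierarchical sub-streams that reach a target.

    layer_kbps: bandwidth of each sub-stream, base layer first.
    path_capacities_kbps: capacity of every node on the path to the target.
    Assumption: layers are forwarded in order until the bottleneck is full.
    """
    bottleneck = min(path_capacities_kbps)   # the weakest node limits the path
    used = 0.0
    count = 0
    for kbps in layer_kbps:
        if used + kbps > bottleneck:
            break                            # this and higher layers are filtered out
        used += kbps
        count += 1
    return count


if __name__ == "__main__":
    layers = [500, 500, 1000]
    print(substreams_received(layers, [10000, 1200]))  # medium-bandwidth target
    print(substreams_received(layers, [3000]))         # high-bandwidth target
```

Unlike scaling at the source, each target here gets the best quality its own path supports.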

Design goals

Video-on-demand for a large number of users: a large stored digital movie library; the delay before receiving the first frame is within a few seconds; users can pause, rewind and fast-forward. Quality of service: constant rate, a maximum jitter and low loss rate. Scalable and distributed: Support up to clients simultaneously. Low-cost hardware: constructed from commodity PCs. Fault tolerant: tolerant to the failure of any single server or disk.

System architecture

One controller: connected to each server by a low-bandwidth network. Cubs (the server group): each cub has a number of disks (2-4) attached; cubs are connected to clients by ATM.

[Figure: Controller; Cub 0 to Cub n with numbered disks; low-bandwidth network; high-bandwidth ATM switching network; video distribution to clients; Start/Stop requests from clients.]

Storage organization

Striping: a movie is divided into blocks. The blocks of a movie are stored on disks attached to different cubs, in disk-number sequence. To deliver a movie, deliver its blocks from the different disks in sequence. This balances the load when delivering hotspot movies. Mirroring: each block is divided into several portions (secondaries), which are stored on the successor disks: if a block is on disk i, its secondaries are stored on disks i+1 to i+d. This gives fault tolerance for a single cub or disk failure.
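The placement rules can be written as two small functions. This is a sketch: the function names are hypothetical, and wrapping around to disk 0 after the last disk (the modulo) is an assumption, since the slides only give the range i+1 to i+d.

```python
def primary_disk(block, first_disk, num_disks):
    """Disk holding block number `block` of a movie that starts on `first_disk`.

    Striping: consecutive blocks go to consecutive disk numbers.
    Assumption: numbering wraps around after the last disk.
    """
    return (first_disk + block) % num_disks


def secondary_disks(disk, d, num_disks):
    """Disks holding the d secondaries of a block stored on `disk`.

    Mirroring: the portions are placed on disks i+1 to i+d.
    """
    return [(disk + k) % num_disks for k in range(1, d + 1)]


if __name__ == "__main__":
    print([primary_disk(b, first_disk=3, num_disks=8) for b in range(6)])
    print(secondary_disks(6, d=3, num_disks=8))
```

Because the secondaries of one block are spread over d disks, the extra load caused by a failed disk is shared among its successors rather than falling on a single mirror.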

Distributed schedule

Slot: the work to be done to play one block of a movie. Deliver a stream: deliver the blocks of the stream disk by disk; this can be viewed as a slot moving along the disks step by step. Deliver multiple streams: multiple slots moving along the disks step by step. Viewer state: the address of the client computer; the identity of the file being played; the viewer’s position in the file; the viewer’s sequence number; some bookkeeping information.

Distributed schedule (2)
Block play time (T): the time required for a viewer to display a block on the client computer; typically about 1 second for all streams. Delivery of the next block of a stream must begin T after delivery of the current block begins. Block service time (t, a slot): read the next block into a buffer; deliver it to the client; update the viewer state in the schedule and pass the updated slot to the next cub. T / t is typically greater than 4. The maximum number of streams the Tiger system can support simultaneously is T/t * the number of disks.
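The capacity formula can be checked with a short calculation. The sketch assumes times are given as whole milliseconds and that only complete slots count, so T/t is rounded down; the 200 ms service time in the usage example is a made-up figure, not a measurement from the slides.

```python
def max_streams(block_play_ms, block_service_ms, num_disks):
    """Upper bound on simultaneous streams: (T / t) * number of disks.

    Assumption: a disk can only serve a whole number of slots per
    block play time, so T / t is rounded down.
    """
    slots_per_disk = block_play_ms // block_service_ms
    return slots_per_disk * num_disks


if __name__ == "__main__":
    # T = 1 second, hypothetical t = 200 ms, 56 disks
    print(max_streams(1000, 200, 56))
```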

Performance and scalability
Initial prototype [1994]: five 133 MHz Pentium PCs (48 MB RAM, 2 GB SCSI disk, Windows NT); 68 simultaneous streams with perfect quality; with one cub failed, the loss rate is 0.02%. 14 cubs, 56 disks [1997]: 602 simultaneous streams (2 Mbps). 1000 cubs simultaneous viewers. Microsoft NetShow Theater Server.

Summary

Characteristics of distributed multimedia apps: large volumes of time-dependent data. QoS management: QoS negotiation (bandwidth, latency, loss rate); traffic shaping (leaky bucket algorithm, token bucket algorithm); admission control. Scheduling: fair scheduling; real-time scheduling.

Summary

Stream adaptation: scaling; filtering. The Tiger video file server: data layout (striping, mirroring); distributed schedule (slots).

A typical distributed multimedia system

Characteristics of typical multimedia streams
| Stream | Data rate (approximate) | Sample or frame size | Frequency |
|---|---|---|---|
| Telephone speech | 64 kbps | 8 bits | 8000/sec |
| CD-quality sound | 1.4 Mbps | 16 bits | 44,000/sec |
| Standard TV video (uncompressed) | 120 Mbps | up to 640 x 480 pixels | 24/sec |
| Standard TV video (MPEG-1 compressed) | 1.5 Mbps | variable | |
| HDTV video (uncompressed) | 1000–3000 Mbps | up to 1920 x 1080, 24 bits | 24–60/sec |
| HDTV video (MPEG-2 compressed) | 10–30 Mbps | | |
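The uncompressed data rates follow from sample size times frequency, which a few lines of arithmetic can confirm. The helper names are hypothetical. Two inputs are assumptions rather than values from this slide: CD-quality sound is taken to have two channels, and standard TV video is taken to use 16 bits per pixel, the depth given for the camera (640x480x16 bits) in the QoS specification slide.

```python
def audio_rate_bps(bits_per_sample, samples_per_sec, channels=1):
    """Uncompressed audio data rate in bits per second."""
    return bits_per_sample * samples_per_sec * channels


def video_rate_bps(width, height, bits_per_pixel, frames_per_sec):
    """Uncompressed video data rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_sec


if __name__ == "__main__":
    print(audio_rate_bps(8, 8000) / 1e3, "kbps telephone speech")
    print(audio_rate_bps(16, 44000, channels=2) / 1e6, "Mbps CD-quality sound")
    print(video_rate_bps(640, 480, 16, 24) / 1e6, "Mbps standard TV video")
    print(video_rate_bps(1920, 1080, 24, 60) / 1e6, "Mbps HDTV video at 60 frames/sec")
```

Under these assumptions the results (64 kbps, about 1.4 Mbps, about 118 Mbps and just under 3000 Mbps) agree with the approximate figures listed above.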

Typical infrastructure components for multimedia applications

QoS specifications for components of the application shown in Figure 15.4
| Component | Bandwidth | Latency | Loss rate | Resources required |
|---|---|---|---|---|
| Camera | Out: 10 frames/sec, raw video 640x480x16 bits | | Zero | |
| A Codec | In: 10 frames/sec, raw video; Out: MPEG-1 stream | Interactive | Low | 10 ms CPU each 100 ms; 10 Mbytes RAM |
| B Mixer | In: 2 × 44 kbps audio; Out: 1 × 44 kbps audio | | Very low | 1 ms CPU each 100 ms; 1 Mbytes RAM |
| H Window system | In: various; Out: 50 frame/sec framebuffer | | | 5 ms CPU each 100 ms; 5 Mbytes RAM |
| K Network connection | In/Out: MPEG-1 stream, approx. 1.5 Mbps | | | 1.5 Mbps, low-loss stream protocol |
| L | Audio 44 kbps | | | 44 kbps, very low-loss |

The QoS manager’s task

Traffic shaping algorithm
[Figure: (a) Leaky bucket; (b) Token bucket, fed by a token generator.]

The RFC 1363 flow spec

Protocol version. Bandwidth: maximum transmission unit; token bucket rate; token bucket size; maximum transmission rate. Delay: minimum delay noticed; maximum delay variation. Loss: loss sensitivity; burst loss sensitivity; loss interval. Quality of guarantee.

Filtering

[Figure: a source and its targets, which receive high, medium and low bandwidth.]

Tiger schedule

[Figure: the schedule as a row of slots holding viewer state. Slot 0: viewer 4; slot 1: free; slot 3: viewer 0; slot 4: viewer 3; slot 5: viewer 2; slot 7: viewer 1. Block play time T; block service time t.]
