Download presentation
Presentation is loading. Please wait.
1
Introduction to Distributed Systems
INF5040/9040 Autumn 2017 Lecturer: Eli Gjørven (ifi/UiO) August 22, 2017
2
Outline Definition of a distributed system
Goals of a distributed system Implications of distributed systems Pitfalls in developing distributed systems Types of distributed systems INF5040, ifi/UiO
3
From a Single Computer to DS
: Computers Large and expensive Operated independently 1985-now: two advances in technology Powerful microprocessors High-speed computer networks Result: Distributed Systems Putting together computing systems composed of a large number of computers connected by a high-speed network INF5040, ifi/UiO
4
What Is a Distributed System?
Definition Operational perspective: A distributed system is one in which hardware or software components, located at networked computers, communicate and coordinate their actions only by passing messages. [Coulouris] User perspective: A distributed system is a collection of independent computers that appears to its users as a single coherent system. [Coulouris ] (Operational perspective) A distributed system consists of hardware and software components located in a network of computers that communicate and coordinate their actions only by passing messages [Tanenbaum & van Steen] (user perspective) A distributed system is a collection of independent computers that appears to its users as a single coherent system. [Lamport] (DS characteristics pe[Coulouris ] (Operational perspective) [Lamport] (DS characteristics perspective) A distributed system is a system that prevents you from doing any work when a computer you have never heard about, fails. Many defs. Is it because we cannot agree on what is a DS? Not necessarily. Different perspectives. [Tanenbaum] INF5040, ifi/UiO
5
What Is a Distributed System?
A DS organized as middleware extending over multiple machines offering each application the same interface From TvS: Distributed systems are often organized with a software layere placed between users and applications, and operating systems and networks. This organization is often called «middleware». Middleware hides, as best as possible, the difference in hardware and operating systems from the applications running above it, making it easier for application programmers to make applications communicate. INF5040, ifi/UiO
6
Examples of Distributed Systems
Web search Indexing the entire contents of the Web Massively multiplayer online games Very large number of users sharing a virtual world. Financial trading Real time access and processing of a wide rage of information sources. Delivery of items of interest in a timely manner Coulouris: Web search: tens of billions of web pages with different kind of content. The search engine itself represents a major challenge for distributed system design (cf Google). The Google infrastructure is addressed as a use case studied in chap 21 of the Coulouris book. Today Google manages several millions of servers to handle the services they offer 365/7/24. A challenge how to engineer. Massively multiplayer online games: (MMOG): Complex and large shared virtual worlds, tens of thousands of simultaneous online players. The engineering of MMOGs represents a major challenge for DS technologies: fast response time, real-time propagation of events, maintaining a consistent view of the shared world (e.g centralized vs decentralized architecture) Financial trading: real time access to e.g. current share prices, and trends. Key is communication and processing of items of interest, known as events in DSs. Events need to be delivered reliably and in a timely manner to potentially a very large number of clients who have stated interest in such information items. Maybe high rate event streams that need to processed rapidly to detect patterns that indicate trading opportunities. Other app areas: health care, education, transport and logistics, science (Grid computing), environment management (WSNs: earth quake warning, flood warning, mammal tracking), emergency response (coordination)… INF5040, ifi/UiO
7
Outline Definition of a distributed system
Goals of a distributed system Implications of distributed systems Pitfalls in developing distributed systems Types of distributed systems INF5040, ifi/UiO
8
Goals of Distributed Systems
Resource sharing Distribution transparency Openness Scalability Fault tolerance Allowing heterogeneity Heterogeneity: network and hardware, operating system, programming languages, implementations by different developers INF5040, ifi/UiO
9
Resource Sharing Making resources accessible:
accessing remote resources sharing them in a controlled and efficient way Examples: printers, storages, files, etc. One reason to share: economics e.g., 1-to-many printer access, rather than 1-to-1 Resource managers control access, offer a scheme for naming, and control concurrency A resource sharing model describes how resources are made available resources can be used service provider and user interact with each other The main goal of a distributed system is to make it easy for users and applications to access remote resources and to share them in a controlled and efficient way. INF5040, ifi/UiO
10
Models for Resource Sharing
Client-server resource model Server processes act as resource managers, and offer services (collection of procedures) Client processes send requests to servers HTTP implements a client-server resource model Object-based resource model Any entity in a process is modeled as an object with a message based interface that provides access to its operations Any shared resource is modeled as an object Object-based middleware (CORBA, Java RMI) defines object-based resource models INF5040, ifi/UiO
11
Distribution Transparency
An important goal of a DS: hiding the fact that its processes and resources are physically distributed across multiple computers Definition: What kind of transparency? A distributed system that is able to present itself to its users and applications as if it were only a single computer system is said to be transparent. INF5040, ifi/UiO
12
Forms of Transparency Degree of transparency
Situations in which full transparency is not good – better to expose than to mask effects of distribution? Trade-off between a high degree of transparency and performance Access: systems use different OSs with different naming and file access methods Location: cannot tell where is the physical location of a resource is. So HOW to access: by assigning logical names, e.g., web URL Migration vs relocation: Migration resources can be moved without affecting how the resources are accessed. For example a service moving from one server to another, while the user still access the same logical address. Relocation – resources are moved while they are being used, without the user or application noticing. For example: Applications on mobile devices continue to function when the user is moving. If replication is used THEN location transparency must be supported. Hardest issue in DS: addressing Failure transparency => inability to distinguish between a dead resource and a slow resource, e.g., a web page Transparency generally makes developing, maintaining and using a DS easier. However, hiding all distribution aspects are not necessarily a good idea (and – some say – probably not possible). It can be useful to consider different degrees of trasparency for different types of applications. EXAMPLE: Internet applications may fail to connect to services for many different reasons. It may be a good idea to NOT mask failures for the user, and rather let the user decide what to do (give up, retry, or find another service). As we will diskuss later in the course, In location and context-aware systems, e.g., ubiquitous DS it may be better to expose than to mask There will ofte be a trade-off between transparency and performance: when server makes attempts to mask failure in connecting to a server by retrying, it may slow down the system as a whole. INF5040, ifi/UiO
13
Openness Definition: E.g., in computer networks: format, content, and meaning of messages In DS: services specified through interfaces Interface Definition Language (IDL): syntax Web Service Definition Language (WSDL): syntax Semantics? the hard part to specify Extensibility: an open DS can be extended and improved incrementally add or replace components An open distributed system is a system that offers services according to standard rules that describe syntax and semantics of those services. Interface: function name, params, return value, and exceptions Semantics: often informal way through natural language We want systems to be Extensible: extend or replace parts of the system without affecting the rest of the system INF5040, ifi/UiO
14
Scalability Scalability in three dimensions: Scalable in size: Users and resources can easily be added Geographically scalable: Users and resources may lie far apart Administratively scalable: The system spans many administrative organizations Scaling up along these dimensions often come with a performance cost Scalability denotes the ability of a system to handle future scaling up Examples: Internet: Size, geography and number of administrative organizations has grown enormously Google: scaled over the years to handle O(100) billion queries a month, expected query time 0.2 secs. INF5040, ifi/UiO
15
Scalability Problems Problems with size scalability:
Often caused by centralized solutions Problems with geographical scalability: traditional synchronous communication in LAN unreliable communications in WAN Problems with administrative scalability: Conflicting policies, complex management, security problems Centralized solutions: Single point of failure, and bottleneck. Centralized services: Single server can become a bottleneck. Even if the server is very powerful, communication eventually becomes a bottleneck prohibiting further growth. Centralized data: The online telephone book (for 50 mill people say) can be kept on one disk served by one server. Again may be a problem of communication capacity into the server. Another example is DNS that maintains information about millions of computers. Only one DNS server in the world? Would anybody use the web in that case? Centralized algorithms are also a bad idea for scalability: Such algorithms (located to a single server) would usually need to collect complete information about all machines and lines (such as a routing algorithm). That part of the system easily becomes a bottleneck (plus single point of failure). Synchronous in LAN is based on fast, reliable communication. Clients blocks until a reply is received. WAN: much larger communication delay may cause trouble. Also, communication through a WAN is unreliable – fails much more often. Administrative scalability: When systems cross several domains, resources are shared between the domains. Conflicting interests and policies. INF5040, ifi/UiO
16
Scaling Techniques Distribution Replication
splitting a resource (such as data) into smaller parts, and spreading the parts across the system (cf DNS) Replication replicate resources (services, data) across the system increases availability, helps to balance load caching (special form of replication) Hiding communication latencies avoid waiting for responses to remote service requests (use asynchronous communication or design to reduce the amount of remote requests) HIT. DNS: hierarchical structure of DNS to resolve a name (finding the host address) Caching vs. replication: caching is in the proximity of the client, diff: caching decision is made by client while replication is made by server Hiding communication latencies to address geographical scalability. Not always possible, depends on application requirements. One way to reduce the amount of remote request is: moving the computing code to the client, e.g. moving the form checking functions of an app to the client instead of doing them at the server side. INF5040, ifi/UiO
17
Fault Tolerance Hardware, software and network fail!!
DS must maintain availability even in cases where hardware/software/network have low reliability Failures in distributed systems are partial makes error handling particularly difficult Many techniques for handling failures Detecting failures (checksum) Masking failures (retransmission in protocols) Tolerating failures (as in web-browsers) Recovery from failures (roll back) Redundancy (replicate servers in failure-independent ways) Partial failures means that parts of the system (network or machines) can be faulty, but it is desirable for the rest of the system to function correctly. IF not, many systems would be malfunctioning a lot of the time. Google: practically no outages of search engines (lately once in 2009, and once in 2013 ) outage lasted 1 minute. Estimated loss of revenue: USD 500,000. Long outages would be costly. INF5040, ifi/UiO
18
Example: Google File System
Goals: Fault tolerance (Coulouris ch ) Google’s need for a huge capacity file system: Proprietary Google File System (new version in 2010 Colossus). Pictures illustres development in scale. GFS: A scalable distributed file system for distributed data intensive applications: the search engine and other Google applications. Hundreds of clusters is deployed in data centers around the world. The size of a data center can be in the scale of servers, 1000s of storage nodes, and 10s of PetaByte of disk storage per cluster. A DS specialized for Googles requirements (2003-paper) Must run reliably on commodity hardware – even though SW and HW will fail (this is normal). Usage pattern within google: Large number of files (at the time not massive), files tend to be massive in size up to GB size Sequential reads through large files, sequentila writes that append to files (i.e. storage of many web pages which are scanned by analysis programs). (Not like «normal» file systems where random reads and writes are common). Scalable (number of files and clients), reliable in spite of partial failures, high and sustained throughput for reading data. INF5040, ifi/UiO
19
Outline Definition of a distributed system
Goals of a distributed system Implications of distributed systems Pitfalls in developing distributed systems Types of distributed systems INF5040, ifi/UiO
20
Implications of Distributed Systems
Concurrency components execute in concurrent processes that read and update shared resources. Requires coordination No global clock makes coordination difficult (ordering of events) Independent failure of components “partial failure” & incomplete information Unreliable communication Loss of connection and messages. Message bit errors Unsecure communication Possibility of unauthorised recording and modification of messages Expensive communication Communication between computers usually has less bandwidth, longer latency, and costs more, than between independent processes on the same computer INF5040, ifi/UiO
21
Pitfalls When Developing DS
False assumptions made by first time developer: The network is reliable. The network is secure. The network is homogeneous. The topology does not change. Latency is zero. Bandwidth is infinite. Transport cost is zero. There is one administrator. Engineers at Sun Microsystems has described what they called pitfalls, or «fallacies», of distributed computing. Reliable: Assumtion about no networking errors, packet sendt will be delivered. Secure: Assumption about no malicious users and programs in the system Homogeneous: Assumption about technologies and solutions being similar in different parts of the system. Topology: System and network is static in its topology, thus stable in its performance. Latency: Data arrives at once, and in the same order as it was sent Bandwidth: Just send all the data you want, the network does not get congested. Transport cost: Just choose the best option without considering cost of building, maintaining and using networks, lunch is for free! Adm: One administrator decides all, defines policies and priorities. INF5040, ifi/UiO
22
Outline Definition of a distributed system
Goals of a distributed system Implications of distributed systems Pitfalls in developing distributed systems Types of distributed systems INF5040, ifi/UiO
23
Types of Distributed Systems
Type1: Distributed Computing Systems Used for high performance computing tasks Cluster and Cloud computing systems Grid computing systems Type2: Distributed Information Systems Systems mainly for management and integration of business functions Transaction processing systems Enterprise application integration Type3: Distributed Pervasive (or Ubiquitous) Systems Mobile and embedded systems Home systems Sensor networks To be explained on the next slides.. INF5040, ifi/UiO
24
Type1: Cluster Computing Systems
Collection of similar PCs, closely connected, all run same OS, e.g.: A collection of computing nodes + master node Master runs middleware: parallel execution and management The master node receives client requests, handle the job queue and the allocation of nodes to a particular parallel program to run on the cluster. The computing nodes may only need to run a standard operating system. INF5040, ifi/UiO
25
Type1: Grid Computing Systems
Federation of autonomous and heterogeneous computer systems (HW,OS,...), several adm domains Cluster computing: homogen, while Grid Heterogen! Constructed as a federation of computer systems, where each system may fall under a different administrative domain, and may be very different when it comes to hardware, software, and deployed networks. A key issue is how resources from different organizations are brought together and how access to resources are provied to its users and applications. A four layered architecture provides different levels of transparency for applications. <<META: Skip rest of comments. Comment only if asked.>> Fabric layer: provides interfaces to local resources at a specific site (allow sharing of resources) Connectivity layer: comm protocols for supporting grid transactions than span usage of mutiple reources Resource layer: responsible for managing a single resource (provide config information, creating processes or reading data) Collective layer: handles access to multiple resources, resource discovery, allocation and scheduling of tasks onto multiple resources. Applications: make use of the grid computing environment A layered architecture for grid computing systems INF5040, ifi/UiO
26
Type1: Distributed Computing as a Utility
View: distributed resources as a commodity or utility Resources are provided by service suppliers and effectively rented rather than owned by the end user. The term cloud computing capture the vision of computing as a utility SalesForce CRM SaaS Clients LotusLive Google App Engine PaaS Internet Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.. The term cloud computing capture the vision of computing as a utility The term cloud is used as a metaphor for the Internet, based on how the Internet is depicted in computer network diagrams and is an abstraction for the complex infrastructure it conceals. The concept generally incorporates combinations of the following: infrastructure as a service (IaaS) , platform as a service (PaaS) , software as a service (SaaS) IaaS delivers computer infrastructure, typically a platform virtualization environment, as a service. Rather than purchasing servers, software, data center space or network equipment, clients instead buy those resources as a fully outsourced service. PaaS deliver a computing platform where the developers can develop their own applications. SaaS is a model of software deployment where the software applications are provided to the customers as a service that are accessed from a thin client such as a web browser or an app, while the software and data are stored on the servers. Clouds are generally implemented on clusters of computers to provide the necessary scale and performance required by such services. The difference between a cloud and a grid can be expressed as below: Ownership: A grid is a collection of computers which is owned by multiple parties in multiple locations and connected together so that users can share the combined power of resources. Whereas a cloud is a collection of computers usually owned by a single party. Resource distribution: Cloud appears to the user as a centralized model (“one cloud”) where resource allocation happens automatically inside the cloud. Grid computing is a decentralized model where users to a larger degree must related to computation which occurs over many administrative domains. IaaS 26 INF5040, ifi/UiO
27
Type2: Enterprise Application Integration
Allowing existing applications to directly exchange information using communication middleware Middleware as a communication facilitator in enterprise application integration With enterprise application integration, existing applications are more easily integrated through communication middleware. Several types of comm middleware exist: RPC, RMI, message-oriented middleware, pub/sub (increasing importance – example later and next class) INF5040, ifi/UiO
28
Type2: Example Communication Middleware: CORBA
Clients may invoke methods of remote objects without worrying about: object location, programming language, operating system platform, communication protocols or hardware. X invoke Z’s method foo() Y Z foo() Different programming languages (or object models) IDL IDL IDL Common object model Interoperable Inter-orb protocol Object Request Broker (ORB) RMI over IIOP INF5040, ifi/UiO
29
Type3: Distributed Pervasive Systems
(Mobile) Devices: discover the environment (its services) and establish themselves in this environment as best as possible. Requirements for pervasive applications Embrace contextual changes Encourage ad hoc and dynamic composition Recognize sharing as the default exploiting the increasing integration of services and (small/tiny) computing devices in our everyday physical world Encourage ad hoc and dynamic composition: many devices in pervasive env. are used in different ways by different users: it should be easy to to configure applications running on a device Recognize sharing as default: Means for store, read, access, share information: intermittent and changing connectivity of devices: the space where accessible information resides will change INF5040, ifi/UiO
30
Type3: Example: Smart Home System
Active research area: Smart home: a Wireless Sensor Network of sensors, called motes, helps older people to live on their own longer. The motes pass information among themselves and to a PC (or gateway that communicates with a back end system in the cloud) The data they gather is analyzed to infer activities of daily living, which can give important clues to a person´s state of health and allow for intervention. Fall detection (using video), pill dispensors, motes in shoes: e.g. the person is outside; is he/she wearing shoes? Motes in the bed tell if there is anyone lying there (video can also be used), etc etc INF5040, ifi/UiO
31
Type3: Example: Smart Home System
Integrates personal devices (mobile, wearables, ..) Integrates home apparatus (light, alarms, entertainment systems, termostat control..) Self-configuring and self-managing Takes care of your information (music, photo, video, calendar, health information…) Protects your personal space from others! Exampe: Google home INF5040, ifi/UiO
32
Summary Distributed systems:
components located in a network that communicates and coordinates their actions exclusively by sending messages. Goals like resource sharing, distribution transparency, openness, scalability, fault tolerance and heterogeneity can be satisfied by distributed systems Consequences of distributed systems Independent failure of components Unsecure communication No global clock Many pitfalls when developing distributed systems Novel applications in pervasive systems: e.g., smart homes INF5040, ifi/UiO
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.