Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Fundamentals of Distributed Systems. Jim Gray Researcher Microsoft Corp. Prof. Andreas Reuter Professor U. Stuttgart

Similar presentations


Presentation on theme: "1 Fundamentals of Distributed Systems. Jim Gray Researcher Microsoft Corp. Prof. Andreas Reuter Professor U. Stuttgart"— Presentation transcript:

1 1 Fundamentals of Distributed Systems. Jim Gray Researcher Microsoft Corp. Prof. Andreas Reuter Professor U. Stuttgart

2 2 Outline Concepts and Terminology Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts Goal: What you need to know to understand Microsoft Transaction Server (or CORBA or …)

3 3 Whats a Distributed System? Centralized: Centralized: everything in one place everything in one place stand-alone PC or Mainframe stand-alone PC or Mainframe Distributed: Distributed: some parts remote some parts remote distributed users distributed users distributed execution distributed execution distributed data distributed data

4 4 Why Distribute? No best organization No best organization Companies constantly swing between Companies constantly swing between Centralized: focus, control, economy Centralized: focus, control, economy Decentralized: adaptive, responsive, competitive Decentralized: adaptive, responsive, competitive Why distribute? Why distribute? reflect organization or application structure reflect organization or application structure empower users / producers empower users / producers improve service (response / availability) improve service (response / availability) distributed load distributed load use PC technology (economics) use PC technology (economics)

5 5 What Should Be Distributed? Users and User Interface Users and User Interface Thin client Thin client Processing Processing Trim client Trim client Data Data Fat client Fat client Will discuss tradeoffs later Will discuss tradeoffs later Database Business Objects workflow Presentation

6 6 Transparency in Distributed Systems Make distributed system as easy to use and manage as a centralized system Make distributed system as easy to use and manage as a centralized system Give a Single-System Image Give a Single-System Image Location transparency: Location transparency: hide fact that object is remote hide fact that object is remote hide fact that object has moved hide fact that object has moved hide fact that object is partitioned or replicated hide fact that object is partitioned or replicated Name doesnt change if object is replicated, partitioned or moved. Name doesnt change if object is replicated, partitioned or moved.

7 7 Naming- The basics Objects have Objects have Globally Unique Identifier (GUIDs) Globally Unique Identifier (GUIDs) location(s) = address(es) location(s) = address(es) name(s) name(s) addresses can change addresses can change objects can have many names objects can have many names Names are context dependent: Names are context dependent: KGB CIA) KGB CIA) Many naming systems Many naming systems UNC: \\node\device\dir\dir\dir\object UNC: \\node\device\dir\dir\dir\object Internet: Internet: LDAP: ldap://ldap.domain.root/o=org,c=US,cn=dir LDAP: ldap://ldap.domain.root/o=org,c=US,cn=dir guid Jim Address James

8 8 Name Servers in Distributed Systems Name servers translate names + context to address (+ GUID) Name servers translate names + context to address (+ GUID) Name servers are partitioned (subtrees of name space) Name servers are partitioned (subtrees of name space) Name servers replicate root of name tree Name servers replicate root of name tree Name servers form a hierarchy Name servers form a hierarchy Distributed data from hell: Distributed data from hell: high read traffic high read traffic high reliability & availability high reliability & availability autonomy autonomy root North South Southern names Northern names root

9 9 Autonomy in Distributed Systems Owner of site (or node, or application, or database) Wants to control it Owner of site (or node, or application, or database) Wants to control it If my part is working, must be able to access & manage it (reorganize, upgrade, add user,…) If my part is working, must be able to access & manage it (reorganize, upgrade, add user,…) Autonomy is Autonomy is Essential Essential Difficult to implement. Difficult to implement. Conflicts with global consistency Conflicts with global consistency examples: naming, authentication, admin… examples: naming, authentication, admin…

10 10 Security The Basics Authentication server subject + Authenticator => (Yes + token) | No Authentication server subject + Authenticator => (Yes + token) | No Security matrix: Security matrix: who can do what to whom who can do what to whom Access control list is column of matrix Access control list is column of matrix who is authenticated ID who is authenticated ID In a distributed system, who and what and whom are distributed objects In a distributed system, who and what and whom are distributed objects subject Object Permissions

11 11 Security in Distributed Systems Security domain: nodes with a shared security server. Security domain: nodes with a shared security server. Security domains can have trust relationships: Security domains can have trust relationships: A trusts B: A believes B when it says this is A trusts B: A believes B when it says this is Security domains form a hierarchy. Security domains form a hierarchy. Delegation: passing authority to a server when A asks B to do something (e.g. print a file, read a database) B may need As authority Delegation: passing authority to a server when A asks B to do something (e.g. print a file, read a database) B may need As authority Autonomy requires: Autonomy requires: each node is an authenticator each node is an authenticator each node does own security checks each node does own security checks Internet Today: Internet Today: no trust among domains (fire walls, many passwords) no trust among domains (fire walls, many passwords) trust based on digital signatures trust based on digital signatures

12 12 Clusters The Ideal Distributed System. Cluster is distributed system BUT single Cluster is distributed system BUT single location location manager manager security policy security policy relatively homogeneous relatively homogeneous communications is communications is high bandwidth high bandwidth low latency low latency low error rate low error rate Clusters use distributed system techniques for load distribution storage execution growth fault tolerance

13 13 Cluster: Shared What? Shared Memory Multiprocessor Shared Memory Multiprocessor Multiple processors, one memory Multiple processors, one memory all devices are local all devices are local DEC or SGI or Sequent 16x nodes DEC or SGI or Sequent 16x nodes Shared Disk Cluster Shared Disk Cluster an array of nodes an array of nodes all shared common disks all shared common disks VAXcluster + Oracle VAXcluster + Oracle Shared Nothing Cluster Shared Nothing Cluster each device local to a node each device local to a node ownership may change ownership may change Tandem, SP2, Wolfpack Tandem, SP2, Wolfpack

14 14 Outline Concepts and Terminology Why Distribute Why Distribute Distributed data & objects Distributed data & objects Partitioned Partitioned Replicated Replicated Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts

15 15 Partitioned Data Break file into disjoint groups Exploit data access locality Exploit data access locality Put data near consumer Put data near consumer Less network traffic Less network traffic Better response time Better response time Better availability Better availability Owner controls data autonomy Owner controls data autonomy Spread Load Spread Load data or traffic may exceed single store data or traffic may exceed single store Orders N.A. S.A. Europe Asia

16 16 How to Partition Data? How to Partition How to Partition by attribute or by attribute or random or random or by source or by source or by use by use Problem: to find it must have Problem: to find it must have Directory (replicated) or Directory (replicated) or Algorithm Algorithm Encourages attribute-based partitioning Encourages attribute-based partitioning N.A. S.A. Europe Asia

17 17 Replicated Data Place fragment at many sites Pros: Pros: +Improves availability +Disconnected (mobile) operation +Distributes load +Reads are cheaper Cons: Cons: N times more updates N times more updates N times more storage N times more storage Placement strategies: Placement strategies: Dynamic: cache on demand Dynamic: cache on demand Static: place specific Static: place specific Catalog

18 18 Updating Replicated Data When a replica is updated, how do changes propagate? When a replica is updated, how do changes propagate? Master copy, many slave copies (SQL Server) Master copy, many slave copies (SQL Server) always know the correct value (master) always know the correct value (master) change propagation can be change propagation can be transactional transactional as soon as possible as soon as possible periodic periodic on demand on demand Symmetric, and anytime (Access) Symmetric, and anytime (Access) allows mobile (disconnected) updates allows mobile (disconnected) updates updates propagated ASAP, periodic, on demand updates propagated ASAP, periodic, on demand non-serializable non-serializable colliding updates must be reconciled. colliding updates must be reconciled. hard to know real value hard to know real value

19 19 Replication and Partitioning Compared Central Scaleup 2x more work Central Scaleup 2x more work Partition Scaleup 2x more work Partition Scaleup 2x more work Replication Scaleup 4x more work Replication Scaleup 4x more work Replication

20 20 Outline Concepts and Terminology Why Distribute Why Distribute Distributed data & objects Distributed data & objects Partitioned Partitioned Replicated Replicated Distributed execution Distributed execution remote procedure call remote procedure call queues queues Three tier architectures Three tier architectures Transaction concepts Transaction concepts

21 21 Distributed Execution Threads and Messages Thread is Execution unit (software analog of cpu+memory) Thread is Execution unit (software analog of cpu+memory) Threads execute at a node Threads execute at a node Threads communicate via Threads communicate via Shared memory (local) Shared memory (local) Messages (local and remote) Messages (local and remote) threads shared memory messages

22 22 Peer-to-Peer or Client-Server Peer-to-Peer is symmetric: Peer-to-Peer is symmetric: Either side can send Either side can send Client-server Client-server client sends requests client sends requests server sends responses server sends responses simple subset of peer-to-peer simple subset of peer-to-peer request response

23 23 Connection-less or Connected Connection-less Connection-less request contains request contains client id client id client context client context work request work request client authenticated on each message client authenticated on each message only a single response message only a single response message e.g. HTTP, NFS v1 e.g. HTTP, NFS v1 Connected (sessions) Connected (sessions) open - request/reply - close open - request/reply - close client authenticated once client authenticated once Messages arrive in order Messages arrive in order Can send many replies (e.g. FTP) Can send many replies (e.g. FTP) Server has client context (context sensitive) Server has client context (context sensitive) e.g. Winsock and ODBC e.g. Winsock and ODBC HTTP adding connections HTTP adding connections

24 24 Remote Procedure Call: The key to transparency Object may be local or remote Object may be local or remote Methods on object work wherever it is. Methods on object work wherever it is. Local invocation Local invocation y = pObj->f(x); f() x val y = val; return val;

25 25 Remote Procedure Call: The key to transparency Remote invocation Remote invocation Obj Local? x val y = val; f() return val; y = pObj->f(x); marshal un marshal un marshal x proxy un marshal pObj->f(x) marshal x stub Obj Local? x val f() return val; val

26 26 Transaction Object Request Broker (ORB) Orchestrates RPC Registers Servers Registers Servers Manages pools of servers Manages pools of servers Connects clients to servers Connects clients to servers Does Naming, request-level authorization, Does Naming, request-level authorization, Provides transaction coordination (new feature) Provides transaction coordination (new feature) Old names: Old names: Transaction Processing Monitor, Transaction Processing Monitor, Web server, Web server, NetWare NetWare Object-Request Broker

27 27 Solaris UNIX International OSF DCE Open software Foundation (OSF) NT ODBC XA / TX DCE RPC GUIDs IDL DNS Kerberos COM Object Management Group (OMG) CORBA Open Group History and Alphabet Soup X/Open Microsoft DCOM based on OSF-DCE Technology DCOM and ActiveX extend it

28 28 Send updates to correct partition Send updates to correct partition Using RPC for Transparency Partition Transparency part Local? x y = pfile->write(x); send to correct partition send to correct partition x un marshal pObj->write(x) marshal x x val write() return val; val

29 29 Send updates to EACH node Send updates to EACH node Using RPC for Transparency Replication Transparency x y = pfile->write(x); Send to each replica Send to each replica x val

30 30 Client/Server Interactions All can be done with RPC Request-Response response may be many messages Request-Response response may be many messages Conversational server keeps client context Conversational server keeps client context Dispatcher three-tier: complex operation at server Dispatcher three-tier: complex operation at server Queued de-couples client from server allows disconnected operation Queued de-couples client from server allows disconnected operation CS CS C S S S C S S

31 31 Queued Request/Response Time-decouples client and server Time-decouples client and server Three Transactions Three Transactions Almost real time, ASAP processing Almost real time, ASAP processing Communicate at each others convenience Allows mobile (disconnected) operation Communicate at each others convenience Allows mobile (disconnected) operation Disk queues survive client & server failures Disk queues survive client & server failures Client Server Submit Perform Response

32 32 Why Queued Processing? Prioritize requests ambulance dispatcher favors high-priority calls Prioritize requests ambulance dispatcher favors high-priority calls Manage Workflows Manage Workflows Deferred processing in mobile apps Deferred processing in mobile apps Interface heterogeneous systems EDI, MOM: Message-Oriented-Middleware DAD: Direct Access to Data Interface heterogeneous systems EDI, MOM: Message-Oriented-Middleware DAD: Direct Access to Data OrderBuildShipInvoicePay

33 33 Outline Concepts and Terminology Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution remote procedure call remote procedure call queues queues Three tier architectures Three tier architectures what what why why Transaction concepts Transaction concepts

34 34 Work Distribution Spectrum Presentation and plug-ins Presentation and plug-ins Workflow manages session & invokes objects Workflow manages session & invokes objects Business objects Business objects Database Database Fat Thin Fat Thin Database Business Objects workflow Presentation

35 35 Transaction Processing Evolution to Three Tier Intelligence migrated to clients Mainframe Batch processing (centralized) Mainframe Batch processing (centralized) Dumb terminals & Remote Job Entry Dumb terminals & Remote Job Entry Intelligent terminals database backends Intelligent terminals database backends Workflow Systems Object Request Brokers Application Generators Workflow Systems Object Request Brokers Application Generators Mainframe cards Active green screen 3270 Server TP Monitor ORB

36 36 Web Evolution to Three Tier Intelligence migrated to clients (like TP) Character-mode clients, smart servers Character-mode clients, smart servers GUI Browsers - Web file servers GUI Browsers - Web file servers GUI Plugins - Web dispatchers - CGI GUI Plugins - Web dispatchers - CGI Smart clients - Web dispatcher (ORB) pools of app servers (ISAPI, Viper) workflow scripts at client & server Smart clients - Web dispatcher (ORB) pools of app servers (ISAPI, Viper) workflow scripts at client & server archie ghopher green screen Web Server Mosaic WAIS NS & IE Active

37 37 PC Evolution to Three Tier Intelligence migrated to server Stand-alone PC (centralized) Stand-alone PC (centralized) PC + File & print server message per I/O PC + File & print server message per I/O PC + Database server message per SQL statement PC + Database server message per SQL statement PC + App server message per transaction PC + App server message per transaction ActiveX Client, ORB ActiveX server, Xscript ActiveX Client, ORB ActiveX server, Xscript disk I/O IO request reply SQL Statement Transaction

38 38 The Pattern: Three Tier Computing Clients do presentation, gather input Clients do presentation, gather input Clients do some workflow (Xscript) Clients do some workflow (Xscript) Clients send high-level requests to ORB (Object Request Broker) Clients send high-level requests to ORB (Object Request Broker) ORB dispatches workflows and business objects -- proxies for client, orchestrate flows & queues ORB dispatches workflows and business objects -- proxies for client, orchestrate flows & queues Server-side workflow scripts call on distributed business objects to execute task Server-side workflow scripts call on distributed business objects to execute task Database Business Objects workflow Presentation

39 39 The Three Tiers Web Client HTML VB or Java Script Engine VB or Java Virt Machine VBscritpt JavaScrpt VB Java plug-ins Internet ORB HTTP+ DCOM Object server Pool Middleware ORB TP Monitor Web Server... DCOM (oleDB, ODBC,...) Object & Data server. LU6.2 IBM Legacy Gateways

40 40 Why Did Everyone Go To Three-Tier? Manageability Manageability Business rules must be with data Business rules must be with data Middleware operations tools Middleware operations tools Performance (scaleability) Performance (scaleability) Server resources are precious Server resources are precious ORB dispatches requests to server pools ORB dispatches requests to server pools Technology & Physics Technology & Physics Put UI processing near user Put UI processing near user Put shared data processing near shared data Put shared data processing near shared data Database Business Objects workflow Presentation

41 41 DADsRaw Data Customer comes to store Takes what he wants Fills out invoice Leaves money for goods Easy to build No clerks Why Put Business Objects at Server? Customer comes to store with list Gives list to clerk Clerk gets goods, makes invoice Customer pays clerk, gets goods Easy to manage Clerks controls access Encapsulation MOMs Business Objects

42 42 What Middleware Does ORB, TP Monitor, Workflow Mgr, Web Server Registers transaction programs workflow and business objects (DLLs) Registers transaction programs workflow and business objects (DLLs) Pre-allocates server pools Pre-allocates server pools Provides server execution environment Provides server execution environment Dynamically checks authority (request-level security) Dynamically checks authority (request-level security) Does parameter binding Does parameter binding Dispatches requests to servers Dispatches requests to servers parameter binding parameter binding load balancing load balancing Provides Queues Provides Queues Operator interface Operator interface

43 43 Server Side Objects Easy Server-Side Execution ORB gives simple execution environment ORB gives simple execution environment Object gets Object gets start start invoke invoke shutdown shutdown Everything else is automatic Everything else is automatic Drag & Drop Business Objects Drag & Drop Business Objects Network Thread Pool Queue Connections Context Security Shared Data Receiver Synchronization Service logic Configuration Management A Server

44 44 Why Server Pools? Server resources are precious. Clients have 100x more power than server. Server resources are precious. Clients have 100x more power than server. Pre-allocate everything on server Pre-allocate everything on server preallocate memory preallocate memory pre-open files pre-open files pre-allocate threads pre-allocate threads pre-open and authenticate clients pre-open and authenticate clients Keep high duty-cycle on objects (re-use them) Keep high duty-cycle on objects (re-use them) Pool threads, not one per client Pool threads, not one per client Classic example: TPC-C benchmark Classic example: TPC-C benchmark 2 processes 2 processes everything pre-allocated everything pre-allocated 7,000 clients IIS SQL Pool of DBC links HTTP N clients x N Servers x F files = N x N x F file opens!!! IE

45 45 Classic Three-Tier Example TPC-C Transaction Processing Performance Council (TPC): standard performance benchmarks Transaction Processing Performance Council (TPC): standard performance benchmarks 5 transaction types 5 transaction types order entry, payment, status (oltp) order entry, payment, status (oltp) delivery (mini-batch) delivery (mini-batch) restock (mini-DSS) restock (mini-DSS) Metrics: Throughput, Price/Performance Metrics: Throughput, Price/Performance Shows best practices: Shows best practices: everyone three tier everyone three tier 2 processes at server 2 processes at server everything pre-allocated everything pre-allocated HTTP ODBC SQL SQL IIS = Web 7,000 Web clients Pool of DBC links

46 46 Classic Mistakes Thread per terminal fix: DB server thread pools fix: server pools Thread per terminal fix: DB server thread pools fix: server pools Process per request (CGI) fix: ISAPI & NSAPI DLLs fix: connection pools Process per request (CGI) fix: ISAPI & NSAPI DLLs fix: connection pools Many messages per operation fix: stored procedures fix: server-side objects Many messages per operation fix: stored procedures fix: server-side objects File open per request fix: cache hot files File open per request fix: cache hot files

47 47 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures why: manageability & performance why: manageability & performance what: server side workflows & objects what: server side workflows & objects Transaction concepts Transaction concepts Why transactions? Why transactions? Using transactions Using transactions Two Phase Commit Two Phase Commit How transactions? How transactions?

48 48 Thesis Transactions are key to structuring distributed applications Transactions are key to structuring distributed applications ACID properties ease exception handling ACID properties ease exception handling Atomic: all or nothing Atomic: all or nothing Consistent: state transformation Consistent: state transformation Isolated: no concurrency anomalies Isolated: no concurrency anomalies Durable: committed transaction effects persist Durable: committed transaction effects persist

49 49 What Is A Transaction? Programmers view: Programmers view: Bracket a collection of actions Bracket a collection of actions A simple failure model A simple failure model Only two outcomes: Only two outcomes: Begin() action action Commit() Success! Begin()actionactionactionRollback()Begin()actionactionactionRollback() Failure! Fail !

50 50 Why Bother: Atomicity? RPC semantics: RPC semantics: At most once: try one time At most once: try one time At least once: keep trying till acknowledged At least once: keep trying till acknowledged Exactly once: keep trying till acknowledged and server discards duplicate requests Exactly once: keep trying till acknowledged and server discards duplicate requests ? ? ?

51 51 Why Bother: Atomicity? Example: insert record in file Example: insert record in file At most once: time-out means maybe At most once: time-out means maybe At least once: retry may get duplicate error or retry may do second insert At least once: retry may get duplicate error or retry may do second insert Exactly once: you do not have to worry Exactly once: you do not have to worry What if operation involves What if operation involves Insert several records? Insert several records? Send several messages? Send several messages? Want ALL or NOTHING for group of actions Want ALL or NOTHING for group of actions

52 52 Why Bother: Consistency Begin-Commit brackets a set of operations Begin-Commit brackets a set of operations You can violate consistency inside brackets You can violate consistency inside brackets Debit but not credit (destroys money) Debit but not credit (destroys money) Delete old file before create new file in a copy Delete old file before create new file in a copy Print document before delete from spool queue Print document before delete from spool queue Begin and commit are points of consistency Begin and commit are points of consistency State transformations new state under construction Begin Commit

53 53 Why Bother: Isolation Running programs concurrently on same data can create concurrency anomalies Running programs concurrently on same data can create concurrency anomalies The shared checking account example The shared checking account example Programming is hard enough without having to worry about concurrency Programming is hard enough without having to worry about concurrency Begin() read BAL read BAL add 10 add 10 write BAL write BALCommit() Bal = 100 Bal = 70 Bal = 110 Bal = 100 Begin() read BAL read BAL Subtract 30 Subtract 30 write BAL write BALCommit()

54 54 Isolation It is as though programs run one at a time It is as though programs run one at a time No concurrency anomalies No concurrency anomalies System automatically protects applications System automatically protects applications Locking (DB2, Informix, Microsoft ® SQL Server, Sybase…) Locking (DB2, Informix, Microsoft ® SQL Server, Sybase…) Versioned databases (Oracle, Interbase…) Versioned databases (Oracle, Interbase…) Begin() read BAL read BAL add 10 add 10 write BAL write BALCommit() Bal = 100 Bal = 110 Bal = 80 Bal = 110 Begin() read BAL read BAL Subtract 30 Subtract 30 write BAL write BALCommit()

55 55 Why Bother: Durability Once a transaction commits, want effects to survive failures Once a transaction commits, want effects to survive failures Fault tolerance: old master-new master wont work: Fault tolerance: old master-new master wont work: Cant do daily dumps: would lose recent work Cant do daily dumps: would lose recent work Want continuous dumps Want continuous dumps Redo lost transactions in case of failure Redo lost transactions in case of failure Resend unacknowledged messages Resend unacknowledged messages

56 56 Why ACID For Client/Server And Distributed ACID is important for centralized systems ACID is important for centralized systems Failures in centralized systems are simpler Failures in centralized systems are simpler In distributed systems: In distributed systems: More and more-independent failures More and more-independent failures ACID is harder to implement ACID is harder to implement That makes it even MORE IMPORTANT That makes it even MORE IMPORTANT Simple failure model Simple failure model Simple repair model Simple repair model

57 57 ACID Generalizations Taxonomy of actions Taxonomy of actions Unprotected: not undone or redone Unprotected: not undone or redone Temp files Temp files Transactional: can be undone before commit Transactional: can be undone before commit Database and message operations Database and message operations Real: cannot be undone Real: cannot be undone Drill a hole in a piece of metal, print a check Drill a hole in a piece of metal, print a check Nested transactions: subtransactions Nested transactions: subtransactions Work flow: long-lived transactions Work flow: long-lived transactions

58 58 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts Why transactions? Why transactions? ACID: atomic, consisistent, isolated, durable ACID: atomic, consisistent, isolated, durable Using transactions Using transactions programming programming save points save points nested, chained nested, chained workflow workflow Two Phase Commit Two Phase Commit How transactions? How transactions?

59 59 Programming & Transactions The Application View You Start (e.g. in TransactSQL) : You Start (e.g. in TransactSQL) : Begin [Distributed] Transaction Begin [Distributed] Transaction Perform actions Perform actions Optional Save Transaction Optional Save Transaction Commit or Rollback Commit or Rollback You Inherit a XID You Inherit a XID Caller passes you a transaction Caller passes you a transaction You return or Rollback. You return or Rollback. You can Begin / Commit sub-trans. You can Begin / Commit sub-trans. You can use save points You can use save points Begin Commit Begin RollBack Return RollBack Return XID

60 60 Transaction Save Points Backtracking within a transaction Allows app to cancel parts of a transaction prior to commit Allows app to cancel parts of a transaction prior to commit This is in most SQL products (save transaction in MS SQL Server) This is in most SQL products (save transaction in MS SQL Server) BEGIN WORK:1 SAVE WORK:2 action SAVE WORK:4 SAVE WORK:3 ROLLBACK WORK(2) SAVE WORK:5 action SAVE WORK:6 action SAVE WORK:7 action ROLLBACK WORK(7) SAVE WORK:8 action COMMIT WORK

61 61 Chained Transactions Commit of T1 implicitly begins T2. Carries context forward to next transaction cursors cursors locks locks other state other state Transaction #1 Transaction #2 CommitCommit BeginBegin Processing contextestablished Processing contextused

62 62 Nested Transactions Going Beyond Flat Transactions Need transactions within transactions Need transactions within transactions Sub-transactions commit only if root does Sub-transactions commit only if root does Only root commit is durable. Only root commit is durable. Subtransactions may rollback if so, all its subtransactions rollback Subtransactions may rollback if so, all its subtransactions rollback Parallel version of nested transactions Parallel version of nested transactions T1 T114 T113 T112 T111 T11 T123 T122 T121 T12 T131 T132 T133 T13

63 63 Workflow: A Sequence of Transactions Application transactions are multi-step Application transactions are multi-step order, build, ship & invoice, reconcile order, build, ship & invoice, reconcile Each step is an ACID unit Each step is an ACID unit Workflow is a script describing steps Workflow is a script describing steps Workflow systems Workflow systems Instantiate the scripts Instantiate the scripts Drive the scripts Drive the scripts Allow query against scripts Allow query against scripts Examples Manufacturing Work In Process (WIP) Queued processing Loan application & approval, Hospital admissions… Examples Manufacturing Work In Process (WIP) Queued processing Loan application & approval, Hospital admissions… Database Business Objects workflow Presentation

64 64 Workflow Scripts Workflow scripts are programs (could use VBScript or JavaScript) Workflow scripts are programs (could use VBScript or JavaScript) If step fails, compensation action handles error If step fails, compensation action handles error Events, messages, time, other steps cause step. Events, messages, time, other steps cause step. Workflow controller drives flows Workflow controller drives flows Source Step branch case fork loop Compensation Action join

65 65 Workflow and ACID Workflow is not Atomic or Isolated Workflow is not Atomic or Isolated Results of a step visible to all Results of a step visible to all Workflow is Consistent and Durable Workflow is Consistent and Durable Each flow may take hours, weeks, months Each flow may take hours, weeks, months Workflow controller Workflow controller keeps flows moving keeps flows moving maintains context (state) for each flow maintains context (state) for each flow provides a query and operator interface e.g.: what is the status of Job # 72149? provides a query and operator interface e.g.: what is the status of Job # 72149?

66 66 ACID Objects Using ACID DBs The easy way to build transactional objects Application uses transactional objects (objects have ACID properties) Application uses transactional objects (objects have ACID properties) If object built on top of ACID objects, then object is ACID. If object built on top of ACID objects, then object is ACID. Example: New, EnQueue, DeQueue on top of SQL Example: New, EnQueue, DeQueue on top of SQL SQL provides ACID SQL provides ACID SQL Business Object: Customer Business Object Mgr: CustomerMgr SQL dim c as Customer dim CM as CustomerMgr... set C = CM.get(CustID)... C.credit_limit = CM.update(C, CustID).. Persistent Programming languages automate this.

67 67 ACID Objects From Bare Metal The Hard Way to Build Transactional Objects Object Class is a Resource Manager (RM) Object Class is a Resource Manager (RM) Provides ACID objects from persistent storage Provides ACID objects from persistent storage Provides Undo (on rollback) Provides Undo (on rollback) Provides Redo (on restart or media failure) Provides Redo (on restart or media failure) Provides Isolation for concurrent ops Provides Isolation for concurrent ops Microsoft SQL Server, IBM DB2, Oracle,… are Resource managers. Microsoft SQL Server, IBM DB2, Oracle,… are Resource managers. Many more coming. Many more coming. RM implementation techniques described later RM implementation techniques described later

68 68 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts Why transactions? Why transactions? Using transactions Using transactions programming programming save points save points nested, chained nested, chained workflow workflow Two Phase Commit Two Phase Commit Prepare and commit phases Prepare and commit phases Transaction & Resource Managers Transaction & Resource Managers How transactions? How transactions?

69 69 Transaction Manager Transaction Manager (TM): manages transaction objects. Transaction Manager (TM): manages transaction objects. XID factory XID factory tracks them tracks them coordinates them coordinates them App gets XID from TM App gets XID from TM Transactional RPC Transactional RPC passes XID on all calls passes XID on all calls manages XID inheritance manages XID inheritance TM manages commit & rollback TM manages commit & rollback App RM TM begin XID call(..XID) enlist

70 70 TM Two-Phase Commit Dealing with multiple RMs If all use one RM, then all or none commit If all use one RM, then all or none commit If multiple RMs, then need coordination If multiple RMs, then need coordination Standard technique: Standard technique: Marriage: Do you? I do. I pronounce…Kiss Marriage: Do you? I do. I pronounce…Kiss Theater: Ready on the set? Ready! Action! Act Theater: Ready on the set? Ready! Action! Act Sailing: Ready about? Ready! Helms a-lee! Tack Sailing: Ready about? Ready! Helms a-lee! Tack Contract law: Escrow agent Contract law: Escrow agent Two-phase commit: Two-phase commit: 1. Voting phase: can you do it? 1. Voting phase: can you do it? 2. If all vote yes, then commit phase: do it! 2. If all vote yes, then commit phase: do it!

71 71 Two-Phase Commit In Pictures Transactions managed by TM Transactions managed by TM App gets unique ID (XID) from TM at Begin() App gets unique ID (XID) from TM at Begin() XID passed on Transactional RPC XID passed on Transactional RPC RMs Enlist when first do work on XID RMs Enlist when first do work on XID AppRM1 TM RM2 Begin XID Call(..XID..) Enlist Call(..XID..) Enlist

72 72 When App Requests Commit Two Phase Commit in Pictures TM tracks all RMs enlisted on an XID TM tracks all RMs enlisted on an XID TM calls enlisted RMs Prepared() callback TM calls enlisted RMs Prepared() callback If all vote yes, TM calls RMs Commit() If all vote yes, TM calls RMs Commit() If any vote no, TM calls RMs Rollback() If any vote no, TM calls RMs Rollback() App RM1 TM RM2 1. Application requests Commit Commit 1 Prepare TM broadcasts prepared? Yes RMs all vote Yes 4. TM decides Yes, broadcasts 4 Commit 4 5. RMs acknowledge 5 Yes 5 yes 6. TM says yes

73 73 X/Open Standardizes Two-Phase Commit TM Client RM CommmgrServer Commmgr RM TM TX:begincommitrollback SQLorMTSor.. XA+:outgoingincoming XA:enlist,PrepareCommit Standardized APIs for apps and to RMs Points to OSI/TP for interoperation Commmgr

74 74 How Does This Relate To Microsoft? SQL Server is transactional (so is Oracle, DB2, Informix, Sybase) SQL Server is transactional (so is Oracle, DB2, Informix, Sybase) MS Distributed Transaction Coordinator (DTC) packaged with SQL Server, MTS, and other RMs MS Distributed Transaction Coordinator (DTC) packaged with SQL Server, MTS, and other RMs Connects to CICS, Encina, Topend, Tuxedo Connects to CICS, Encina, Topend, Tuxedo Any RM ( SNA LU6.2, DB2, Oracle, Sybase, Informix, …) can participate in transactions Any RM ( SNA LU6.2, DB2, Oracle, Sybase, Informix, …) can participate in transactions

75 75 OLE Transactions: the Movie Client DtcGetTransactionManager () TM I TransactionDispenser BeginTransaction ITransaction GetTransactionInfo Commit Abort Transaction Comm Mgr Resource Manager aka (sql, viper,…) Commit / Abort Two styles: (1) Bind an RM connection to the transaction. All work on that connection is now part of that transaction. (2) pass transaction object on every RM call. Not shown: client can get async notification of transaction outcome. begin commit rollback

76 76 OLE Transactions RM Enlist begin commit rollback Resource Manager aka (sql, viper,…) DtcGetTransactionManager () TM IResourceManagerFactory Create IResourceManager Enlist ReEnlist ReEnlistmentComplete RM Transaction ITransactionResourceAsync PrepareRequest CommitRequest AbortRequest TMDown ENLIST ()!!!! ITransactionEnlistmentAsync PrepareReqDone CommitReqDone AbortReqDone Enlistment RM registers with TM RM Enlists in transaction (provides callbacks)

77 77 begin commit rollback Resource Manager aka (sql, viper,…) Transaction COMMIT OLE Transactions RM Commit TM Two phase commit Enlisted RMs get prepare & commit callbacks Abort callbacks are similar ITransactionResourceAsync PrepareRequest CommitRequest AbortRequest TMDown ITransactionEnlistmentAsync PrepareReqDone CommitReqDone AbortReqDone Enlistment

78 78 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts Why transactions? Why transactions? Using transactions Using transactions Two Phase Commit Two Phase Commit Prepare and commit phases Prepare and commit phases Transaction and Resource Managers Transaction and Resource Managers How transactions? How transactions? logging logging locking or versioning locking or versioning

79 79 Implementing Transactions Atomicity Atomicity The DO/UNDO/REDO protocol The DO/UNDO/REDO protocol Idempotence Idempotence Two-phase commit Two-phase commit Durability Durability Durable logs Durable logs Force at commit Force at commit Isolation Isolation Locking or versioning Locking or versioning

80 80 Each action generates a log record Each action generates a log record Has an UNDO action Has an UNDO action Has a REDO action Has a REDO action DO/UNDO/REDO New state Old state DO Log New state Old state UNDOLog New state Old state REDOLog

81 81 What Does A Log Record Look Like? Log record has Log record has Header (transaction ID, timestamp… ) Header (transaction ID, timestamp… ) Item ID Item ID Old value Old value New value New value For messages: just message text and sequence # For messages: just message text and sequence # For records: old and new value on update For records: old and new value on update Keep records small Keep records small ? Log ?

82 82 Transaction Is A Sequence Of Actions Each action changes state Each action changes state Changes database Changes database Sends messages Sends messages Operates a display/printer/drill press Operates a display/printer/drill press Leaves a log trail Leaves a log trail New state Old state DO Log Log New state Old state DO DO Log New state Old state DO Log New state

83 83 Transaction UNDO Is Easy Read log backwards Read log backwards UNDO one step at a time UNDO one step at a time Can go half-way back to get nested transactions Can go half-way back to get nested transactions New state Old state UNDO Log Log UNDO New state UNDO Log Old state New state Old state UNDO Log New state Old state UNDO Log Log UNDO New state UNDO Log Old state New state Old state UNDO Log Log UNDO New state Old state UNDO Log

84 84 Durability: Protecting The Log When transaction commits When transaction commits Put its log in a durable place (duplexed disk) Put its log in a durable place (duplexed disk) Need log to redo transaction in case of failure Need log to redo transaction in case of failure System failure: lost in-memory updates System failure: lost in-memory updates Media failure (lost disk) Media failure (lost disk) This makes transaction durable This makes transaction durable Log is sequential file Log is sequential file Converts random IO to single sequential IO Converts random IO to single sequential IO See NTFS or newer UNIX file systems See NTFS or newer UNIX file systemsLogLog Log Log Log Log Log Log Write

85 85 Recovery After ASystem Failure During normal processing, write checkpoints on non-volatile storage During normal processing, write checkpoints on non-volatile storage When recovering from a system failure… When recovering from a system failure… return to the checkpoint state return to the checkpoint state Reapply log of all committed transactions Reapply log of all committed transactions Force-at-commit insures log will survive restart Force-at-commit insures log will survive restart Then UNDO all uncommitted transactions Then UNDO all uncommitted transactions New state Old state REDO Log New state Old state REDO Log REDO New state Old state Log Log REDO New state

86 86 Idempotence Dealing with failure What if fail during restart? What if fail during restart? REDO many times REDO many times What if new state not around at restart? What if new state not around at restart? UNDO something not done UNDO something not done Old state REDO New state Log Log REDO UNDO Old state Log Log UNDO

87 87 Idempotence Dealing with failure Solution: make F(F(x))=F(x) (idempotence) Solution: make F(F(x))=F(x) (idempotence) Discard duplicates Discard duplicates Message sequence numbers to discard duplicates Message sequence numbers to discard duplicates Use sequence numbers on pages to detect state Use sequence numbers on pages to detect state (Or) make operations idempotent (Or) make operations idempotent Move to position x, write value V to byte B… Move to position x, write value V to byte B… Old state REDO New state Log Log REDO UNDO Old state Log Log UNDO

88 88 Recap ACID makes it easy to program distributed applications ACID makes it easy to program distributed applications DO/UNDO/REDO + log allows atomicity DO/UNDO/REDO + log allows atomicity Multiple logs need two-phase commit Multiple logs need two-phase commit Persistent log gives durability Persistent log gives durability Recover from system failure Recover from system failure Recover from media failure Recover from media failure

89 89 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts Why transactions? Why transactions? Using transactions Using transactions Two Phase Commit Two Phase Commit How transactions? How transactions? logging logging locking or versioning locking or versioning

90 90 Concurrency Control Locking How to automatically prevent concurrency bugs? How to automatically prevent concurrency bugs? Serialization theorem: Serialization theorem: If you lock all you touch and hold to commit: no bugs If you lock all you touch and hold to commit: no bugs If you do not follow these rules, you may see bugs If you do not follow these rules, you may see bugs Automatic Locking: Automatic Locking: Set automatically (well-formed) Set automatically (well-formed) Released at commit/rollback (two-phase locking) Released at commit/rollback (two-phase locking) Greater concurrency for locks: Greater concurrency for locks: Granularity: objects or containers or server Granularity: objects or containers or server Mode: shared or exclusive or… Mode: shared or exclusive or…

91 91 Reduced Isolation Levels It is possible to lock less and risk fuzzy data It is possible to lock less and risk fuzzy data Example: want statistical summary of DB Example: want statistical summary of DB But do not want to lock whole database But do not want to lock whole database Reduced levels: Reduced levels: Repeatable Read: may see fuzzy inserts/delete Repeatable Read: may see fuzzy inserts/delete But will serialize all updates But will serialize all updates Read Committed: see only committed data Read Committed: see only committed data Read Uncommitted: may see uncommitted updates Read Uncommitted: may see uncommitted updates

92 92 Multiversion Concurrency Control Run transaction at some timestamp in the past Run transaction at some timestamp in the past No locking needed, reconstruct old state from log No locking needed, reconstruct old state from log Add in your transactions updates Add in your transactions updates At commit assure updates do not collide with other committed transactions At commit assure updates do not collide with other committed transactions Almost as good as serializable (only obscure bugs) Almost as good as serializable (only obscure bugs)

93 93 Summary ACID eases error handling ACID eases error handling Atomic: all or nothing Atomic: all or nothing Consistent: correct transformation Consistent: correct transformation Isolated: no concurrency bugs Isolated: no concurrency bugs Durable: survives failures Durable: survives failures Allows you to build robust distributed applications Allows you to build robust distributed applications ACID becoming standard part of systems ACID becoming standard part of systems Its real Its real

94 94 Outline Why Distributed Why Distributed Distributed data & objects Distributed data & objects Distributed execution Distributed execution Three tier architectures Three tier architectures Transaction concepts Transaction concepts 2-Tier 3-Tier Acid Atomic Autonomy Commit Consistent Delegation Durable Fat Client Idempotent Isolated Lock Log ORB Partitioned Data Queue Queued or Direct Replicated Data Resource Manager Rollback (Abort) RPC Serializable Server Pool Thin Client Transaction Manager Two Phase Commit Undo/Redo Update Anywhere Workflow XID Queue Queued or Direct Replicated Data Resource Manager Rollback (Abort) RPC Serializable Server Pool Thin Client Transaction Manager Two Phase Commit Undo/Redo Update Anywhere Workflow XID

95 95 References Essential Client/Server Survival Guide 2nd ed. Essential Client/Server Survival Guide 2nd ed. Orfali, Harkey & Edwards, J. Wiley, 1996 Orfali, Harkey & Edwards, J. Wiley, 1996 Principles of Transaction Processing Principles of Transaction Processing Bernstein & Newcomer, Morgan Kaufmann, 1997 Bernstein & Newcomer, Morgan Kaufmann, 1997 Transaction Processing Concepts and Techniques Transaction Processing Concepts and Techniques Gray & Reuter, Morgan Kaufmann, 1993 Gray & Reuter, Morgan Kaufmann, 1993

96 96


Download ppt "1 Fundamentals of Distributed Systems. Jim Gray Researcher Microsoft Corp. Prof. Andreas Reuter Professor U. Stuttgart"

Similar presentations


Ads by Google