1
Distributed Database Systems © 2001 M. Tamer Özsu and Patrick Valduriez
2
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design (Briefly) Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
3
Instructor Introduction
Bharat Bhargava, Professor of Computer Sciences, Purdue University, West Lafayette, IN 47907. Professor Bhargava has taught the “Distributed Database Systems” course twenty times. He has graduated the largest number of Ph.D. students in the Computer Sciences Department at Purdue University and has been inducted into the “Book of Great Teachers” at Purdue. Professor Bhargava's research involves both theoretical and experimental studies in distributed systems. His research group has implemented a robust and adaptable distributed database system called RAID and an adaptable video conferencing system, and is involved in networking research. Prof. Bhargava has conducted experiments in large-scale distributed systems, communications, authentication, key management, fault tolerance, and quality of service. His current interests are in secure mobile systems, multimedia security, and QoS as a security parameter.
4
Distributed Database Systems
Ingredients: a computer network (communication system), database systems, and users (programs, transactions).
Examples: Distributed INGRES (UC Berkeley), SDD-1 (Computer Corporation of America), DB2 and System R* (IBM), SIRIUS-DELTA (INRIA, France), RAID (Purdue)
5
Distributed Database Systems
Computer networks: Ethernet, ATM, FDDI, ARPANET, BITNET, Internet2, …
Communications: UDP/IP, TCP/IP, ISO
User interaction: SQL, transactions
6
Fundamental References
Bharat Bhargava (ed.), Concurrency Control and Reliability in Distributed Systems, Van Nostrand Reinhold, 1987.
A. Helal, A. Heddaya, and B. Bhargava, Replication Techniques in Distributed Systems, Kluwer Academic Publishers, 1996.
J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.
M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems, 2nd edition, Prentice Hall, 1999.
S. Ceri and G. Pelagatti, Distributed Databases: Principles and Systems, McGraw-Hill, 1984.
D.A. Bell and J.B. Grimson, Distributed Database Systems, Addison-Wesley, 1992.
7
Fundamental References (see Website)
B. Bhargava, Building Distributed Database Systems.
B. Bhargava and J. Riedl, The RAID Distributed Database System, IEEE Transactions on Software Engineering, 15(6), June 1989.
B. Bhargava, Concurrency Control in Database Systems, IEEE Transactions on Knowledge and Data Engineering, 11(1), Jan.-Feb. 1999.
B. Bhargava and J. Riedl, A Model for Adaptable Systems for Transaction Processing, IEEE Transactions on Knowledge and Data Engineering, 1(4), Dec. 1989.
B. Bhargava and M. Annanalai, A Framework for Communication Software and Measurements for Digital Libraries, Journal of Multimedia Systems, 2000.
B. Bhargava and C. Hua, A Causal Model for Analyzing Distributed Concurrency Control Algorithms, IEEE Transactions on Software Engineering, SE-9, 1983.
E. Mafla and B. Bhargava, Communication Facilities for Distributed Transaction Processing Systems, IEEE Computer, 24(8), 1991.
Y. Zhang and B. Bhargava, WANCE: Wide Area Network Communication Emulation Systems, IEEE Workshop on Parallel and Distributed Systems, 1993.
G. Ding and B. Bhargava, Peer-to-Peer File-Sharing over Mobile Ad Hoc Networks, First International Workshop on Mobile Peer-to-Peer Computing, Orlando, Florida, March 2004.
M. Hefeeda, A. Habib, B. Botev, D. Xu, and B. Bhargava, PROMISE: Peer-to-Peer Media Streaming Using CollectCast, in Proc. of ACM Multimedia 2003, 45-54, Berkeley, CA, November 2003.
Y. Lu, W. Wang, D. Xu, and B. Bhargava, Trust-Based Privacy Preservation for Peer-to-Peer, 1st NSF/NSA/AFRL Workshop on Secure Knowledge Management (SKM), Buffalo, NY, Sep.
B. Bhargava, Y. Zhang, and E. Mafla, Evolution of a Communication System for Distributed Transaction Processing in RAID, Computing Systems, 4(3), 1991.
E. Pitoura and B. Bhargava, Data Consistency in Intermittently Connected Distributed Systems, IEEE TKDE, 11(6), 1999.
8
Fundamental References (cont’d)
E. Pitoura and B. Bhargava, Maintaining Consistency of Data in Mobile Distributed Environments, ICDCS, 1995.
A. Zhang, M. Nodine, and B. Bhargava, Global Scheduling for Flexible Transactions in Heterogeneous Distributed Database Systems, IEEE TKDE, 13(3), 2001.
P. Bernstein and N. Goodman, Concurrency Control in Distributed Database Systems, ACM Computing Surveys, 13(2), 1981.
P. Bernstein, D. Shipman, and J. Rothnie, Concurrency Control in a System for Distributed Databases (SDD-1), ACM Transactions on Database Systems, 5(1), 1980.
J. Gray, The Transaction Concept: Virtues and Limitations, VLDB, 1981.
H.T. Kung and J.T. Robinson, On Optimistic Methods for Concurrency Control, ACM Transactions on Database Systems, 6(2), 1981.
C. Papadimitriou, The Serializability of Concurrent Database Updates, Journal of the ACM, 26(4), 1979.
D. Skeen, A Decentralized Termination Protocol, IEEE Symposium on Reliability in Distributed Software and Database Systems, July 1981.
D. Skeen, Nonblocking Commit Protocols, ACM SIGMOD, 1981.
D. Skeen and M. Stonebraker, A Formal Model of Crash Recovery in a Distributed System, IEEE Transactions on Software Engineering, 9(3), 1983.
W.W. Chu, Optimal File Allocation in a Multiple Computer System, IEEE Transactions on Computers, October 1969.
B. Bhargava and L. Lilien, Private and Trusted Collaborations, in Proceedings of Secure Knowledge Management (SKM), Amherst, NY, Sep.
S.B. Davidson, Optimism and Consistency in Partitioned Distributed Database Systems, ACM Transactions on Database Systems, 9(3), 1984.
S.B. Davidson, H. Garcia-Molina, and D. Skeen, Consistency in Partitioned Networks, ACM Computing Surveys, 17(3), 1985.
B. Bhargava, Resilient Concurrency Control in Distributed Database Systems, IEEE Transactions on Reliability, R-31(5), 1984.
D. Parker, Jr., et al., Detection of Mutual Inconsistency in Distributed Systems, IEEE Transactions on Software Engineering, SE-9, 1983.
9
Other References Transaction Management:
P.A. Bernstein and E. Newcomer, Principles of Transaction Processing for the Systems Professional, Morgan Kaufmann, 1997.
P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987 (out of print).
M. Buretta, Data Replication, Wiley, 1997.
V. Kumar (ed.), Performance of Concurrency Control Mechanisms in Centralized Database Systems, Prentice Hall, 1996.
V. Kumar and S.H. Son, Database Recovery, Kluwer, 1998.
C.H. Papadimitriou, The Theory of Database Concurrency Control, Computer Science Press, 1986.
10
Other References Interoperability:
A.K. Elmagarmid, M. Rusinkiewicz, and A. Sheth (eds.), Management of Heterogeneous and Autonomous Database Systems, Morgan Kaufmann, 1998.
A. Bouguettaya, B. Benatallah, and A. Elmagarmid (eds.), Interconnecting Heterogeneous Information Systems, Kluwer, 1998.
J. Siegel (ed.), CORBA Fundamentals and Programming, Wiley, 1996.
K. Brockschmidt, Inside OLE, 2nd edition, Microsoft Press, 1995.
K. Geiger, Inside ODBC, Microsoft Press, 1995.
11
Other References Data Warehousing
There are many books. A small sample:
W. Inmon, Building the Data Warehouse, John Wiley and Sons, 1992.
A. Berson and S.J. Smith, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, 1997.
S. Chaudhuri and U. Dayal, An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 26(1), March 1997.
IEEE Data Engineering Bulletin, Special Issue on Materialized Views and Data Warehousing, 18(2), June 1995.
12
Other References Mobile Databases
A. Helal et al., Any Time, Anywhere Computing, Kluwer, 1999.
T. Imielinski and H. Korth, Mobile Computing, Kluwer Publishers, 1996.
E. Pitoura and G. Samaras, Data Management for Mobile Computing, Kluwer Publishers, 1998.
T. Imielinski and B.R. Badrinath, Data Management Issues in Mobile Computing, Communications of the ACM, 37(10), 18-28, October 1994.
M.H. Dunham and A. Helal, Mobile Computing and Databases: Anything New?, ACM SIGMOD Record, 24(4), 5-9, December 1995.
G.H. Forman and J. Zahorjan, The Challenges of Mobile Computing, IEEE Computer, 27(4), 38-47, April 1994.
13
Other References Web Data Management
S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web, Morgan Kaufmann, 2000.
D. Florescu, A. Levy, and A. Mendelzon, Database Techniques for the World Wide Web: A Survey, ACM SIGMOD Record, 27(3), 59-74, 1998.
S. Bhowmick, S. Madria, and W.K. Ng, Web Data Management: A Warehouse Approach, Springer, 2003.
14
Outline Introduction Background Distributed DBMS Architecture
What is a distributed DBMS Problems Current state-of-affairs Background Distributed DBMS Architecture Distributed Database Design (Briefly) Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
15
File Systems
Diagram: each application program (program 1, program 2, ...) carries its own data description and accesses its own files directly.
16
Database Management
Diagram: application programs 1-3 (with data semantics) access a shared database through the DBMS, which centralizes data description, manipulation, and control.
17
Integrate Database and Communication Technology
Diagram: database technology and computer networks combine; integration plus distribution yields distributed database systems.
18
Distributed Computing
A number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks.
19
Distributed Computing
Synonymous terms: distributed data processing; multiprocessors/multicomputers; satellite processing; backend processing; dedicated/special-purpose computers; timeshared systems; functionally modular systems; peer-to-peer systems.
20
What is distributed? Processing logic, functions, data, and control.
21
What is a Distributed Database System?
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (D–DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users. Distributed database system (DDBS) = DB + Communication
22
What is not a DDBS?
A timesharing computer system
A loosely or tightly coupled multiprocessor system
A database system which resides at one of the nodes of a network of computers: this is a centralized database on a network node
23
Centralized DBMS on a Network
Diagram: sites 1-5 connected by a communication network, with the DBMS and the database residing at a single site.
24
Distributed DBMS Environment
Diagram: sites 1-5 connected by a communication network, with data and DBMS functionality at multiple sites.
25
Implicit Assumptions
Data stored at a number of sites; each site logically consists of a single processor.
Processors at different sites are interconnected by a computer network (no multiprocessors, i.e., not parallel database systems).
A distributed database is a database, not a collection of files: data is logically related, as exhibited in the users' access patterns (relational data model).
A D-DBMS is a full-fledged DBMS: not a remote file system, not a TP system.
26
Shared-Memory Architecture
Diagram: processors P1 … Pn share a common memory M and a common disk D. Examples: symmetric multiprocessors (Sequent, Encore) and some mainframes (IBM 3090, Bull's DPS8).
27
Shared-Nothing Architecture
Diagram: each processor Pi has its own private memory Mi and disk Di; nodes communicate only over the interconnect. Examples: Teradata's DBC, Tandem, Intel's Paragon, NCR's 3600 and 3700.
28
Applications
Manufacturing, especially multi-plant manufacturing
Military command and control
Electronic fund transfers and electronic trading
Corporate MIS
Airline reservations
Hotel chains
Any organization with a decentralized organizational structure
29
Distributed DBMS Promises
Transparent management of distributed, fragmented, and replicated data Improved reliability/availability through distributed transactions Improved performance Easier and more economical system expansion
30
Transparency
Transparency is the separation of the higher-level semantics of a system from lower-level implementation issues. The fundamental issue is to provide data independence in the distributed environment:
Network (distribution) transparency
Replication transparency
Fragmentation transparency (horizontal fragmentation: selection; vertical fragmentation: projection; hybrid)
31
Example
EMP:
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

ASG:
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E7   P5   Engineer    23
E8   P3   Manager     40

PROJ:
PNO  PNAME              BUDGET
P1   Instrumentation    150000
P2   Database Develop.  135000
P3   CAD/CAM            250000
P4   Maintenance        310000

PAY:
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000
32
Transparent Access
SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE
Diagram: the data is spread across a communication network linking Paris (Paris projects, employees, assignments), Boston (Boston projects, employees, assignments), Montreal (Montreal projects, employees, assignments, plus New York projects with budgets above a threshold), New York (New York projects, employees, assignments), and Tokyo; the query is written as if against a single database.
33
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design (Briefly) Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
34
Distributed Database - User View
35
Distributed DBMS - Reality
Diagram: user applications and user queries at several sites each run against local DBMS software; the DBMS software instances cooperate through a communication subsystem, so users see one database.
36
Potentially Improved Performance
Proximity of data to its points of use: requires some support for fragmentation and replication.
Parallelism in execution: inter-query parallelism and intra-query parallelism.
37
System Expansion
Issue is database scaling
Peer-to-peer systems
Communication overhead
38
Distributed DBMS Issues
Distributed Database Design: how to distribute the database; replicated and non-replicated database distribution; a related problem in directory management.
Query Processing: convert user transactions to data manipulation instructions; optimization problem: min{cost = data transmission + local processing}; the general formulation is NP-hard.
39
Distributed DBMS Issues
Concurrency Control: synchronization of concurrent accesses; consistency and isolation of transactions' effects; deadlock management.
Reliability: how to make the system resilient to failures; atomicity and durability.
Privacy/Security: keep database access private; protect against malicious activities.
Trusted Collaborations (emerging requirements): evaluate trust among users and database sites; enforce policies for privacy; enforce integrity.
40
Relationship Between Issues
Diagram: the issues are interrelated; distribution design, directory management, query processing, concurrency control, deadlock management, and reliability each constrain the others.
41
Related Issues
Operating System Support: an operating system with proper support for database operations; there is a dichotomy between general-purpose processing requirements and database processing requirements.
Open Systems and Interoperability.
Distributed Multidatabase Systems: the more probable scenario.
Parallel issues.
Network behavior.
42
Outline Introduction Background Distributed DBMS Architecture
Introduction to Database Concepts Architecture, Schema, Views Alternatives in Distributed Database Systems Datalogical Architecture Implementation Alternatives Component Architecture Distributed Database Design (Briefly) Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
43
Architecture of a Database System
Background material on database architecture. An architecture defines the structure of the system: the components are identified, the functions of each component are defined, and the interrelationships and interactions between components are defined.
44
ANSI/SPARC Architecture
Diagram (ANSI/SPARC three-level architecture): users interact with external views defined by external schemas; these map to a single conceptual view defined by the conceptual schema, which maps to the internal view defined by the internal schema.
45
Standardization Reference Model Approaches
A reference model is a conceptual framework whose purpose is to divide standardization work into manageable pieces and to show at a general level how these pieces are related to one another.
Approaches:
Component-based: components of the system are defined together with the interrelationships between components; good for design and implementation of the system.
Function-based: classes of users are identified together with the functionality that the system will provide for each class; the objectives of the system are clearly identified, but how do you achieve these objectives?
Data-based: identify the different types of data and specify the functional units that will realize and/or use data according to these views.
46
Conceptual Schema Definition
RELATION EMP [
  KEY = {ENO}
  ATTRIBUTES = {
    ENO   : CHARACTER(9)
    ENAME : CHARACTER(15)
    TITLE : CHARACTER(10)
  }
]
RELATION PAY [
  KEY = {TITLE}
  ATTRIBUTES = {
    TITLE : CHARACTER(10)
    SAL   : NUMERIC(6)
  }
]
47
Conceptual Schema Definition
RELATION PROJ [
  KEY = {PNO}
  ATTRIBUTES = {
    PNO    : CHARACTER(7)
    PNAME  : CHARACTER(20)
    BUDGET : NUMERIC(7)
  }
]
RELATION ASG [
  KEY = {ENO, PNO}
  ATTRIBUTES = {
    ENO  : CHARACTER(9)
    PNO  : CHARACTER(7)
    RESP : CHARACTER(10)
    DUR  : NUMERIC(3)
  }
]
48
Internal Schema Definition
RELATION EMP [
  KEY = {ENO}
  ATTRIBUTES = {
    ENO   : CHARACTER(9)
    ENAME : CHARACTER(15)
    TITLE : CHARACTER(10)
  }
]
INTERNAL_REL EMPL [
  INDEX ON E# CALL EMINX
  FIELD = {
    HEADER : BYTE(1)
    E#     : BYTE(9)
    ENAME  : BYTE(15)
    TIT    : BYTE(10)
  }
]
49
External View Definition – Example 1
Create a BUDGET view from the PROJ relation:
CREATE VIEW BUDGET(PNAME, BUD)
AS SELECT PNAME, BUDGET
FROM PROJ
50
External View Definition – Example 2
Create a PAYROLL view from relations EMP and PAY:
CREATE VIEW PAYROLL (ENO, ENAME, SAL)
AS SELECT EMP.ENO, EMP.ENAME, PAY.SAL
FROM EMP, PAY
WHERE EMP.TITLE = PAY.TITLE
51
Outline Introduction Background Distributed DBMS Architecture
Introduction to Database Concepts Alternatives in Distributed Database Systems Datalogical Architecture Implementation Alternatives Component Architecture Distributed Database Design (Briefly) Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
52
Alternatives in Distributed Database Systems
Diagram: the design space has three axes (distribution, autonomy, heterogeneity); points in it include client/server and peer-to-peer distributed DBMSs, federated DBMSs, multi-DBMSs, and distributed multi-DBMSs.
53
Dimensions of the Problem
Distribution: whether the components of the system are located on the same machine or not.
Heterogeneity: arises at various levels (hardware, communications, operating system); the DBMS level is the important one: data model, query language, transaction management algorithms.
Autonomy: not well understood and most troublesome; comes in various versions:
Design autonomy: ability of a component DBMS to decide on issues related to its own design.
Communication autonomy: ability of a component DBMS to decide whether and how to communicate with other DBMSs.
Execution autonomy: ability of a component DBMS to execute local operations in any manner it wants to.
54
Datalogical Distributed DBMS Architecture
Diagram: external schemas ES1 … ESn sit over a single global conceptual schema (GCS); the GCS maps to local conceptual schemas LCS1 … LCSn, each implemented by a local internal schema LIS1 … LISn. (ES: external schema; GCS: global conceptual schema; LCS: local conceptual schema; LIS: local internal schema.)
55
Datalogical Multi-DBMS Architecture
Diagram: global external schemas GES1 … GESn sit over a global conceptual schema (GCS); each component database also keeps its own local external schemas LESi1 … LESim over its local conceptual schema LCSi and local internal schema LISi. (GES: global external schema; LES: local external schema; LCS: local conceptual schema; LIS: local internal schema.)
56
Timesharing Access to a Central Database
Diagram: terminals or PC terminal emulators with no local data storage send batch requests over the network to a host running all software (communications, application software, DBMS services) in front of the database; responses flow back to the terminals.
57
Multiple Clients/Single Server
Diagram: several clients, each running applications over client services and a communications layer, connect via a LAN to a single server providing communications and DBMS services over the database; clients send high-level requests and receive filtered data only.
58
Task Distribution
Diagram: the client side runs the application, the QL interface, the programmatic interface, and a communications manager; an SQL query travels to the server's communications manager, behind which the query optimizer, lock manager, storage manager, and page & cache manager operate on the database, and a result table is returned.
59
Advantages of Client-Server Architectures
More efficient division of labor Horizontal and vertical scaling of resources Better price/performance on client machines Ability to use familiar tools on client machines Client access to remote data (via standards) Full DBMS functionality provided to client workstations Overall better system price/performance
60
Problems With Multiple-Client/Single Server
The server forms a bottleneck
The server forms a single point of failure
Database scaling is difficult
61
Multiple Clients/Multiple Servers
Diagram: clients connect over the LAN to multiple servers, each with its own communications, DBMS services, and database; new issues at the client include directory caching, query decomposition, and commit protocols.
62
Server-to-Server
Diagram: clients offer an SQL interface, a programmatic interface, and other application support environments; the servers communicate directly with one another to execute requests against their databases.
63
Components of a Multi-DBMS
Diagram: the user issues global requests to a global layer (GUI, GTP, GQP, GQO, GS, GRM: global user interface, transaction processing, query processing, query optimization, scheduling, and recovery components) and receives responses; component interface processors (CIPs) translate these into local requests against component DBMSs, each with its own user interface, transaction manager, query processor, query optimizer, scheduler, recovery manager, and runtime support processor.
64
Directory Issues
Three dimensions: type (local vs. global), location (central vs. distributed), and replication (replicated vs. non-replicated). The combinations (question marks flag options of doubtful utility):
Local & central & non-replicated (?)
Local & distributed & non-replicated
Global & central & non-replicated
Global & distributed & non-replicated (?)
Local & central & replicated (?)
Global & central & replicated (?)
Local & distributed & replicated
Global & distributed & replicated
65
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Fragmentation Data Location Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
66
Design Problem
In the general setting: making decisions about the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself.
In a distributed DBMS, the placement of applications entails both placement of the distributed DBMS software and placement of the applications that run on the database.
67
Dimensions of the Problem
Diagram: the design space has three dimensions: access pattern behavior (static vs. dynamic), level of knowledge about it (partial vs. complete information), and level of sharing (data only vs. data + program).
68
Distribution Design
Top-down: mostly in designing systems from scratch; mostly in homogeneous systems.
Bottom-up: when the databases already exist at a number of sites.
69
Distribution Design Issues
Why fragment at all? How to fragment? How much to fragment? How to test correctness? How to allocate? Information requirements?
70
Fragmentation
Can't we just distribute relations? What is a reasonable unit of distribution?
Relation: views are subsets of relations, so shipping whole relations means extra communication.
Fragments of relations (sub-relations): allow concurrent execution of a number of transactions that access different portions of a relation; but views that cannot be defined on a single fragment will require extra processing, and semantic data control (especially integrity enforcement) becomes more difficult.
71
Fragmentation Alternatives – Horizontal
PROJ:
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1: projects with budgets less than $200,000
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2: projects with budgets greater than or equal to $200,000
PNO  PNAME    BUDGET  LOC
P3   CAD/CAM      250000  New York
P4   Maintenance  310000  Paris
P5   CAD/CAM      500000  Boston
72
Fragmentation Alternatives – Vertical
PROJ:
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1: information about project budgets
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2: information about project names and locations
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston
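As a quick illustration (a sketch, not from the slides), the two alternatives look like this in Python, with a relation modeled as a list of dicts: horizontal fragmentation is a selection on BUDGET, and vertical fragmentation is a pair of projections that both keep the key PNO so the relation can be rebuilt by a join.

    # Horizontal and vertical fragmentation of PROJ; illustrative only.
    PROJ = [
        {"PNO": "P1", "PNAME": "Instrumentation",   "BUDGET": 150000, "LOC": "Montreal"},
        {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
        {"PNO": "P3", "PNAME": "CAD/CAM",           "BUDGET": 250000, "LOC": "New York"},
        {"PNO": "P4", "PNAME": "Maintenance",       "BUDGET": 310000, "LOC": "Paris"},
        {"PNO": "P5", "PNAME": "CAD/CAM",           "BUDGET": 500000, "LOC": "Boston"},
    ]

    # Horizontal fragmentation: a selection on the BUDGET attribute.
    PROJ1_h = [t for t in PROJ if t["BUDGET"] < 200000]
    PROJ2_h = [t for t in PROJ if t["BUDGET"] >= 200000]

    # Vertical fragmentation: projections that both retain the key PNO.
    PROJ1_v = [{"PNO": t["PNO"], "BUDGET": t["BUDGET"]} for t in PROJ]
    PROJ2_v = [{"PNO": t["PNO"], "PNAME": t["PNAME"], "LOC": t["LOC"]} for t in PROJ]

    print(len(PROJ1_h), len(PROJ2_h))  # 2 3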
73
Degree of Fragmentation
The degree of fragmentation ranges over a finite number of alternatives, from whole relations (coarsest) down to individual tuples or attributes (finest). The design problem is finding the suitable level of partitioning within this range.
74
Correctness of Fragmentation
Completeness: decomposition of relation R into fragments R1, R2, …, Rn is complete if and only if each data item in R can also be found in some Ri.
Reconstruction: if relation R is decomposed into fragments R1, R2, …, Rn, then there should exist some relational operator ∇ such that R = ∇1≤i≤n Ri.
Disjointness: if relation R is decomposed into fragments R1, R2, …, Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).
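A minimal sketch (assuming fragments are represented as lists of dicts, as in the example above) of how the three rules can be checked mechanically for a horizontal fragmentation, where the reconstruction operator ∇ is union:

    # Checking completeness, reconstruction, and disjointness for a
    # horizontal fragmentation. Illustrative sketch only.
    def as_set(relation):
        return {frozenset(t.items()) for t in relation}

    def complete(R, fragments):
        # every data item of R appears in some fragment
        return as_set(R) <= set().union(*map(as_set, fragments))

    def reconstructs(R, fragments):
        # for horizontal fragments the operator ∇ is union
        return as_set(R) == set().union(*map(as_set, fragments))

    def disjoint(fragments):
        sets = list(map(as_set, fragments))
        return all(sets[i].isdisjoint(sets[j])
                   for i in range(len(sets)) for j in range(i + 1, len(sets)))

    R = [{"PNO": "P1", "BUDGET": 150000}, {"PNO": "P3", "BUDGET": 250000}]
    F1 = [t for t in R if t["BUDGET"] < 200000]
    F2 = [t for t in R if t["BUDGET"] >= 200000]
    print(complete(R, [F1, F2]), reconstructs(R, [F1, F2]), disjoint([F1, F2]))
    # True True True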
75
Other Fragmentation Issues
Privacy; security; bandwidth of connection; reliability; replication; consistency; local user needs.
76
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Fragmentation Data Location Distributed Query Processing (Briefly) Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
77
Useful References
W.W. Chu, Optimal File Allocation in a Multiple Computer System, IEEE Transactions on Computers, October 1969.
78
Allocation Alternatives
Non-replicated: partitioned, i.e., each fragment resides at only one site.
Replicated: fully replicated (each fragment at each site) or partially replicated (each fragment at some of the sites).
Rule of thumb: if (read-only queries)/(update queries) ≥ 1, replication is advantageous; otherwise replication may cause problems.
79
Replication Alternatives
Comparison of Replication Alternatives

                      Full replication      Partial replication   Partitioning
Query processing      Easy                  Same difficulty       Same difficulty
Directory management  Easy or nonexistent   Same difficulty       Same difficulty
Concurrency control   Moderate              Difficult             Easy
Reliability           Very high             High                  Low
Reality               Possible application  Realistic             Possible application
80
Information Requirements
Four categories: Database information Application information Communication network information Computer system information
81
Fragment Allocation: Problem Statement
Given:
F = {F1, F2, …, Fn} fragments
S = {S1, S2, …, Sm} network sites
Q = {q1, q2, …, qq} applications
find the "optimal" distribution of F to S.
Optimality:
Minimal cost: communication + storage + processing (read and update), usually expressed in terms of time.
Performance: response time and/or throughput.
Constraints: per-site constraints (storage and processing).
82
Information Requirements
Database information: selectivity of fragments; size of a fragment.
Application information: access types and numbers; access localities.
Communication network information: bandwidth; latency; communication overhead.
Computer system information: unit cost of storing data at a site; unit cost of processing at a site.
83
Allocation File Allocation (FAP) vs Database Allocation (DAP):
Fragments are not individual files: relationships have to be maintained.
Access to databases is more complicated: the remote file access model is not applicable; there is a relationship between allocation and query processing.
Cost of integrity enforcement should be considered.
Cost of concurrency control should be considered.
84
Allocation – Information Requirements
Database information: selectivity of fragments; size of a fragment.
Application information: number of read accesses of a query to a fragment; number of update accesses of a query to a fragment; a matrix indicating which queries update which fragments; a similar matrix for retrievals; originating site of each query.
Site information: unit cost of storing data at a site; unit cost of processing at a site.
Network information: communication cost per frame between two sites; frame size.
85
Allocation Model: General Form
min(Total Cost) subject to a response time constraint, a storage constraint, and a processing constraint.
Decision variable:
x_ij = 1 if fragment Fi is stored at site Sj, and 0 otherwise.
86
Allocation Model: Total Cost
Total cost = Σ_{all queries} (query processing cost) + Σ_{all sites} Σ_{all fragments} (cost of storing a fragment at a site)
Storage cost of fragment Fj at site Sk = (unit storage cost at Sk) × (size of Fj) × x_jk
Query processing cost (for one query) = processing component + transmission component
87
Allocation Model: Query Processing Cost
Processing component = access cost + integrity enforcement cost + concurrency control cost
Access cost = Σ_{all sites} Σ_{all fragments} (no. of update accesses + no. of read accesses) × x_ij × (local processing cost at a site)
Integrity enforcement and concurrency control costs can be calculated similarly.
88
Allocation Model: Query Processing Cost
Transmission component = cost of processing updates + cost of processing retrievals
Cost of updates = Σ_{all fragments} Σ_{all sites} (update message cost) + Σ_{all fragments} Σ_{all sites} (acknowledgment cost)
Retrieval cost = Σ_{all fragments} min_{all sites} (cost of retrieval command + cost of sending back the result)
89
Allocation Model: Constraints
Response time: execution time of a query ≤ maximum allowable response time for that query
Storage constraint (for a site): Σ_{all fragments} (storage requirement of a fragment at that site) ≤ storage capacity of that site
Processing constraint (for a site): Σ_{all queries} (processing load of a query at that site) ≤ processing capacity of that site
90
Allocation Model: Solution Methods
FAP is NP-complete; DAP is also NP-complete.
Heuristics are therefore used, based on: single-commodity warehouse location (for FAP); the knapsack problem; branch-and-bound techniques; network flow.
91
Allocation Model
Attempts to reduce the solution space: assume all candidate partitionings are known and select the "best" one; ignore replication at first; use a sliding window on fragments.
92
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Query Processing Methodology Distributed Query Optimization Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
93
Query Processing
Diagram: a high-level user query goes into the query processor, which produces low-level data manipulation commands.
94
Query Processing Components
The query language that is used: SQL, the "intergalactic dataspeak".
Query execution methodology: the steps that one goes through in executing high-level (declarative) user queries.
Query optimization: how do we determine the "best" execution plan?
95
Selecting Alternatives
SELECT ENAME                    -- project
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO         -- join
AND DUR > 37                    -- select
Strategy 1: Π_ENAME(σ_{DUR>37 ∧ EMP.ENO=ASG.ENO}(EMP × ASG))
Strategy 2: Π_ENAME(EMP ⋈_ENO σ_{DUR>37}(ASG))
Strategy 2 avoids the Cartesian product, so it is "better".
96
What is the Problem?
The data is fragmented: ASG1 = σ_{ENO≤"E3"}(ASG) at site 1 and ASG2 = σ_{ENO>"E3"}(ASG) at site 2; EMP1 = σ_{ENO≤"E3"}(EMP) at site 3 and EMP2 = σ_{ENO>"E3"}(EMP) at site 4; the result is needed at site 5.
Strategy 1 (reduce at the data, ship only results): at sites 1 and 2 compute ASG1' = σ_{DUR>37}(ASG1) and ASG2' = σ_{DUR>37}(ASG2); ship them to sites 3 and 4 and compute EMP1' = EMP1 ⋈_ENO ASG1' and EMP2' = EMP2 ⋈_ENO ASG2'; ship EMP1' and EMP2' to site 5, where result = EMP1' ∪ EMP2'.
Strategy 2 (ship everything to site 5): result2 = (EMP1 ∪ EMP2) ⋈_ENO σ_{DUR>37}(ASG1 ∪ ASG2).
97
Cost of Alternatives
Assume: size(EMP) = 400, size(ASG) = 1000; tuple access cost = 1 unit; tuple transfer cost = 10 units.
Strategy 1:
produce ASG': (10 + 10) × tuple access cost = 20
transfer ASG' to the sites of EMP: (10 + 10) × tuple transfer cost = 200
produce EMP': (10 + 10) × tuple access cost × 2 = 40
transfer EMP' to result site: (10 + 10) × tuple transfer cost = 200
Total cost = 460
Strategy 2:
transfer EMP to site 5: 400 × tuple transfer cost = 4,000
transfer ASG to site 5: 1,000 × tuple transfer cost = 10,000
produce ASG': 1,000 × tuple access cost = 1,000
join EMP and ASG': 400 × 20 × tuple access cost = 8,000
Total cost = 23,000
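The slide's arithmetic is easy to verify; a small script under the stated assumptions (tuple access cost 1, tuple transfer cost 10, and the 20 qualifying tuples split 10 + 10 across the two sites):

    # Recomputing the costs of the two strategies.
    access, transfer = 1, 10

    s1 = ((10 + 10) * access        # produce ASG'            ->    20
          + (10 + 10) * transfer    # ship ASG' to EMP sites  ->   200
          + (10 + 10) * access * 2  # produce EMP'            ->    40
          + (10 + 10) * transfer)   # ship EMP' to result     ->   200
    print(s1)   # 460

    s2 = (400 * transfer            # ship EMP to site 5      ->  4000
          + 1000 * transfer         # ship ASG to site 5      -> 10000
          + 1000 * access           # produce ASG'            ->  1000
          + 400 * 20 * access)      # join EMP and ASG'       ->  8000
    print(s2)   # 23000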
98
Query Optimization Objectives
Minimize a cost function I/O cost + CPU cost + communication cost These might have different weights in different distributed environments Wide area networks communication cost will dominate (80 – 200 ms) low bandwidth low speed high protocol overhead most algorithms ignore all other cost components Local area networks communication cost not that dominant (1 – 5 ms) total cost function should be considered Can also maximize throughput
99
Complexity of Relational Operations
Assume relations of cardinality n:
Select, Project (without duplicate elimination): O(n), sequential scan
Project (with duplicate elimination), Group: O(n log n)
Join, Semi-join, Division, Set operators: O(n log n)
Cartesian product: O(n²)
100
Query Optimization Issues – Types of Optimizers
Exhaustive search: cost-based; optimal; combinatorial complexity in the number of relations.
Heuristics: not optimal; regroup common sub-expressions; perform selection and projection first; replace a join by a series of semijoins; reorder operations to reduce intermediate relation size; optimize individual operations.
101
Query Optimization Issues – Optimization Granularity
Single query at a time: cannot use common intermediate results.
Multiple queries at a time: efficient if there are many similar queries; the decision space is much larger.
102
Query Optimization Issues – Optimization Timing
Static: compile and optimize prior to execution; difficult to estimate the size of intermediate results (error propagation); can amortize over many executions (R*).
Dynamic: run-time optimization; exact information on intermediate relation sizes; must reoptimize for each execution (Distributed INGRES).
Hybrid: compile using a static algorithm; if the error in estimated sizes exceeds a threshold, reoptimize at run time (MERMAID).
103
Query Optimization Issues – Statistics
Relation: cardinality; size of a tuple; fraction of tuples participating in a join with another relation.
Attribute: cardinality of domain; actual number of distinct values.
Common assumptions: independence between different attribute values; uniform distribution of attribute values within their domain.
104
Query Optimization Issues – Decision Sites
Centralized: a single site determines the "best" schedule; simple; needs knowledge about the entire distributed database.
Distributed: cooperation among sites to determine the schedule; needs only local information; cost of cooperation.
Hybrid: one site determines the global schedule; each site optimizes the local subqueries.
105
Query Optimization Issues – Network Topology
Wide area networks (WAN, point-to-point): low bandwidth, low speed, high protocol overhead; communication cost will dominate, so all other cost factors are ignored; build a global schedule to minimize communication cost and local schedules according to centralized query optimization.
Local area networks (LAN): communication cost not that dominant, so the total cost function should be considered; broadcasting can be exploited (joins); special algorithms exist for star networks.
106
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Query Processing Methodology Distributed Query Optimization Distributed Transaction Management (Extensive) Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
107
Distributed Query Processing Methodology
Diagram (layering of distributed query processing):
1. Query decomposition (control site, using the global schema): calculus query on distributed relations → algebraic query on distributed relations.
2. Data localization (control site, using the fragment schema): → fragment query.
3. Global optimization (control site, using statistics on fragments): → optimized fragment query with communication operations.
4. Local optimization (local sites, using local schemas): → optimized local queries.
108
Restructuring
Convert relational calculus to relational algebra; make use of query trees.
Example: find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.
SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME ≠ "J. Doe"
AND PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
Query tree: Π_ENAME at the root; below it the selections σ_{DUR=12 ∨ DUR=24}, σ_{PNAME="CAD/CAM"}, and σ_{ENAME≠"J. Doe"}; then the joins ⋈_PNO and ⋈_ENO over the base relations PROJ, ASG, and EMP.
109
Restructuring: Transformation Rules
Commutativity of binary operations: R × S ⇔ S × R; R ⋈ S ⇔ S ⋈ R; R ∪ S ⇔ S ∪ R.
Associativity of binary operations: (R × S) × T ⇔ R × (S × T); (R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T).
Idempotence of unary operations: Π_A'(Π_A''(R)) ⇔ Π_A'(R), where R[A], A' ⊆ A, A'' ⊆ A, and A' ⊆ A''; σ_{p1(A1)}(σ_{p2(A2)}(R)) = σ_{p1(A1) ∧ p2(A2)}(R).
Commuting selection with projection.
110
Restructuring – Transformation Rules
Commuting selection with binary operations:
σ_{p(A)}(R × S) ⇔ σ_{p(A)}(R) × S
σ_{p(Ai)}(R ⋈_{(Aj,Bk)} S) ⇔ σ_{p(Ai)}(R) ⋈_{(Aj,Bk)} S
σ_{p(Ai)}(R ∪ T) ⇔ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(T), where Ai belongs to both R and T.
Commuting projection with binary operations:
Π_C(R × S) ⇔ Π_A'(R) × Π_B'(S)
Π_C(R ⋈_{(Aj,Bk)} S) ⇔ Π_A'(R) ⋈_{(Aj,Bk)} Π_B'(S)
Π_C(R ∪ S) ⇔ Π_C(R) ∪ Π_C(S)
where R[A] and S[B]; C = A' ∪ B' with A' ⊆ A, B' ⊆ B.
111
Example
Recall the previous example: find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME ≠ "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
Query tree: Π_ENAME over the selections σ_{DUR=12 ∨ DUR=24}, σ_{PNAME="CAD/CAM"}, and σ_{ENAME≠"J. Doe"}, over the joins ⋈_PNO and ⋈_ENO of PROJ, ASG, and EMP.
112
Equivalent Query
Diagram: Π_ENAME at the root; a single combined selection σ_{PNAME="CAD/CAM" ∧ (DUR=12 ∨ DUR=24) ∧ ENAME≠"J. Doe"}; below it the joins ⋈_PNO and ⋈_ENO over ASG, PROJ, and EMP.
113
Restructuring
Diagram: the rewritten tree pushes the selections σ_{PNAME="CAD/CAM"} (on PROJ), σ_{DUR=12 ∨ DUR=24} (on ASG), and σ_{ENAME≠"J. Doe"} (on EMP) down to the base relations, inserts projections (Π_PNO, Π_{PNO,ENO}, Π_{PNO,ENAME}) to shrink the intermediate results, and then performs the joins on PNO and ENO before the final Π_ENAME.
114
Cost Functions
Total time (or total cost): reduce each cost component (in terms of time) individually; do as little of each cost component as possible; optimizes the utilization of the resources and increases system throughput.
Response time: do as many things as possible in parallel; may increase total time because of increased total activity.
115
Total Cost: summation of all cost factors
Total cost = CPU cost + I/O cost + communication cost
CPU cost = unit instruction cost × no. of instructions
I/O cost = unit disk I/O cost × no. of disk I/Os
Communication cost = message initiation cost + transmission cost
116
Total Cost Factors
Wide area network: message initiation and transmission costs are high; local processing cost is low (fast mainframes or minicomputers); ratio of communication to I/O costs is about 20:1.
Local area networks: communication and local processing costs are more or less equal; the ratio is about 1:1.6.
117
Response Time: elapsed time between the initiation and the completion of a query
Response time = CPU time + I/O time + communication time
CPU time = unit instruction time × no. of sequential instructions
I/O time = unit I/O time × no. of sequential I/Os
Communication time = unit message initiation time × no. of sequential messages + unit transmission time × no. of sequential bytes
118
Example
Diagram: site 1 sends x units to site 3, and site 2 sends y units to site 3. Assume that only the communication cost is considered.
Total time = 2 × message initialization time + unit transmission time × (x + y)
Response time = max{time to send x from 1 to 3, time to send y from 2 to 3}, where
time to send x from 1 to 3 = message initialization time + unit transmission time × x
time to send y from 2 to 3 = message initialization time + unit transmission time × y
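A small sketch of the two formulas in Python; the numbers passed in at the end are hypothetical:

    # Total time vs. response time for the figure above: site 1 sends x
    # units and site 2 sends y units to site 3, possibly in parallel.
    def total_time(x, y, msg_init, per_unit):
        return 2 * msg_init + per_unit * (x + y)

    def response_time(x, y, msg_init, per_unit):
        return max(msg_init + per_unit * x,   # send x from site 1 to 3
                   msg_init + per_unit * y)   # send y from site 2 to 3

    print(total_time(100, 60, msg_init=5, per_unit=1))     # 170
    print(response_time(100, 60, msg_init=5, per_unit=1))  # 105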
119
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
120
Useful References
C. Papadimitriou, The Serializability of Concurrent Database Updates, Journal of the ACM, 26(4), 1979.
S.B. Davidson, Optimism and Consistency in Partitioned Distributed Database Systems, ACM Transactions on Database Systems, 9(3), 1984.
B. Bhargava and C. Hua, A Causal Model for Analyzing Distributed Concurrency Control Algorithms, IEEE Transactions on Software Engineering, SE-9, 1983.
121
Transaction
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency: concurrency transparency and failure transparency.
Diagram: the database is in a consistent state at Begin Transaction and again at End Transaction, but may be temporarily in an inconsistent state during execution.
122
Formal Definitions and Models
Definition 1: A history is a quadruple h = (n, π, M, S) where
n is a positive integer;
π is a permutation of the set Σn = {R1, W1, R2, W2, …, Rn, Wn}, equivalently a one-to-one function π : Σn → {1, 2, …, 2n} such that π(Ri) < π(Wi) for i = 1, 2, …, n;
M is a finite set of variables representing physical data items;
S is a function mapping Σn to 2^M.
Definition 2: A transaction Ti is a pair (Ri, Wi). A transaction is a single execution of a program; this program may be a simple query statement expressed in a query language.
Definition 3: The read set of Ti is denoted by S(Ri) and the write set of Ti by S(Wi).
123
Formal Definitions and Models
Definition 4: A history h = (n, π, M, S) is serial if π(Wi) = π(Ri) + 1 for all i = 1, 2, …, n; in other words, a history is serial if Ri immediately precedes Wi for each i.
Definition 5: A history is serializable if there is some serial history hs such that the effect of the execution of h is equivalent to hs. Note that serializability requires only that there exist some serial order equivalent to the actual interleaved execution history; there may in fact be several such equivalent serial orderings.
Definition 6: A history h is strongly serializable if in hs the following conditions hold: π(Wi) = π(Ri) + 1 and π(Ri+1) = π(Wi) + 1, where Ti+1 is the next transaction that arrived and obtained the next timestamp after Ti. In a strongly serializable history, the following constraint must hold: if a transaction Ti is issued before a transaction Tj, then the total effect on the database should be equivalent to the effect of executing Ti before Tj. Note that if Ti and Tj are independent, e.g., {S(Ri) ∪ S(Wi)} ∩ {S(Rj) ∪ S(Wj)} = ∅, then the effect of executing TiTj or TjTi is the same.
124
Formal Definitions and Models
Live transactions of a history: the set of live transactions can be found in O(n · |V|) time. Two histories are equivalent (≡) if they have the same set of live transactions; equivalence can be determined in O(n · |V|) time.
Theorem: testing whether a history h is serializable is NP-complete, even if h has no dead transactions.
Polygraph: a pair of arcs between nodes (non-circular).
Satisfiability: the problem for Boolean formulas in conjunctive normal form with two/three literals per clause (SAT).
125
Concatenation of histories:
same true for Ri
126
Two-phase locking: a history h is 2PL if there exist distinct non-integer real numbers ℓi (lock points), for i = 1, …, n, such that π(Ri) < ℓi < π(Wi), together with ordering conditions relating the lock points of conflicting transactions.
127
Transaction Example – A Simple SQL Query
Transaction BUDGET_UPDATE
begin
  EXEC SQL UPDATE PROJ
           SET BUDGET = BUDGET * 1.1
           WHERE PNAME = "CAD/CAM"
end.
128
Example Database
Consider an airline reservation example with the relations:
FLIGHT(FNO, DATE, SRC, DEST, STSOLD, CAP)
CUST(CNAME, ADDR, BAL)
FC(FNO, DATE, CNAME, SPECIAL)
129
Example Transaction – SQL Version
Begin_transaction Reservation
begin
  input(flight_no, date, customer_name);
  EXEC SQL UPDATE FLIGHT
           SET STSOLD = STSOLD + 1
           WHERE FNO = flight_no AND DATE = date;
  EXEC SQL INSERT INTO FC(FNO, DATE, CNAME, SPECIAL)
           VALUES (flight_no, date, customer_name, null);
  output("reservation completed")
end. {Reservation}
130
Termination of Transactions
Begin_transaction Reservation
begin
  input(flight_no, date, customer_name);
  EXEC SQL SELECT STSOLD, CAP INTO temp1, temp2
           FROM FLIGHT
           WHERE FNO = flight_no AND DATE = date;
  if temp1 = temp2 then
    output("no free seats");
    Abort
  else
    EXEC SQL UPDATE FLIGHT
             SET STSOLD = STSOLD + 1
             WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT INTO FC(FNO, DATE, CNAME, SPECIAL)
             VALUES (flight_no, date, customer_name, null);
    Commit;
    output("reservation completed")
  endif
end. {Reservation}
131
Example Transaction – Reads & Writes
Begin_transaction Reservation
begin
  input(flight_no, date, customer_name);
  temp ← Read(flight(date).stsold);
  if temp = flight(date).cap then
  begin
    output("no free seats");
    Abort
  end
  else begin
    Write(flight(date).stsold, temp + 1);
    Write(flight(date).cname, customer_name);
    Write(flight(date).special, null);
    Commit;
    output("reservation completed")
  end
end. {Reservation}
132
Characterization
Read set (RS): the set of data items that are read by a transaction Ti.
Write set (WS): the set of data items whose values are changed by this transaction.
Base set (BS): BS = RS ∪ WS.
133
Formalization Based on Textbook
Let Oij(x) be some operation Oj of transaction Ti operating on entity x, where Oj ∈ {read, write} and Oj is atomic. Let OSi = ∪j Oij denote the set of all operations of Ti, and let Ni ∈ {abort, commit} be its termination condition.
Transaction Ti is a partial order Ti = {Σi, <i} where Σi = OSi ∪ {Ni}, and:
For any two operations Oij, Oik ∈ OSi, if Oij = R(x) and Oik = W(x) for any data item x, then either Oij <i Oik or Oik <i Oij.
∀ Oij ∈ OSi, Oij <i Ni.
134
Example
Consider a transaction T: Read(x), Read(y), x ← x + y, Write(x), Commit. Then:
Σ = {R(x), R(y), W(x), C}
< = {(R(x), W(x)), (R(y), W(x)), (W(x), C), (R(x), C), (R(y), C)}
135
DAG Representation
Assume < = {(R(x), W(x)), (R(y), W(x)), (R(x), C), (R(y), C), (W(x), C)}.
Diagram: R(x) and R(y) each have an arc to W(x); R(x), R(y), and W(x) each have an arc to C.
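A quick Python rendering of this example (illustrative only): the partial order is stored as a set of pairs, and the two defining conditions of a transaction's partial order are checked directly.

    # The transaction T above as a partial order (Σ, <).
    sigma = ["R(x)", "R(y)", "W(x)", "C"]
    prec = {("R(x)", "W(x)"), ("R(y)", "W(x)"),
            ("W(x)", "C"), ("R(x)", "C"), ("R(y)", "C")}

    def ordered(a, b):
        return (a, b) in prec or (b, a) in prec

    # conflicting read/write on the same item must be ordered
    assert ordered("R(x)", "W(x)")
    # every operation must precede the termination action C
    assert all(ordered(op, "C") for op in sigma if op != "C")
    print("T is a well-formed transaction partial order")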
136
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
137
Properties of Transactions
ATOMICITY: all or nothing
CONSISTENCY: no violation of integrity constraints
ISOLATION: concurrent changes invisible to other transactions
DURABILITY: committed updates persist
138
Atomicity: either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
139
Consistency: internal consistency
A transaction which executes alone against a consistent database leaves it in a consistent state; transactions do not violate database integrity constraints.
Transactions are correct programs.
140
Consistency Degrees
Degree 0: transaction T does not overwrite dirty data of other transactions. (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
Degree 1: T does not overwrite dirty data of other transactions, and T does not commit any writes before EOT.
141
Consistency Degrees (cont’d)
Degree 2: T does not overwrite dirty data of other transactions; T does not commit any writes before EOT; and T does not read dirty data from other transactions.
Degree 3: Degree 2, plus: other transactions do not dirty any data read by T before T completes.
142
Isolation
Serializability: if several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results: an incomplete transaction cannot reveal its results to other transactions before its commitment; this is necessary to avoid cascading aborts.
143
Isolation Example
Consider the following two transactions:
T1: Read(x); x ← x + 1; Write(x); Commit
T2: Read(x); x ← x + 1; Write(x); Commit
Possible execution sequences:
Serial: T1:Read(x), T1:x ← x+1, T1:Write(x), T1:Commit, T2:Read(x), T2:x ← x+1, T2:Write(x), T2:Commit
Interleaved: T1:Read(x), T1:x ← x+1, T2:Read(x), T1:Write(x), T2:x ← x+1, T2:Write(x), T1:Commit, T2:Commit
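A small simulation (a sketch, not from the slides) makes the difference concrete: the serial sequence yields x = 2, while the interleaved one loses T1's update.

    # Replaying the two execution sequences above on x, starting at 0.
    def run(schedule):
        db = 0
        local = {}
        for tid, op in schedule:
            if op == "read":
                local[tid] = db
            elif op == "inc":
                local[tid] += 1
            elif op == "write":
                db = local[tid]
        return db

    serial = [(1, "read"), (1, "inc"), (1, "write"),
              (2, "read"), (2, "inc"), (2, "write")]
    interleaved = [(1, "read"), (1, "inc"), (2, "read"),
                   (1, "write"), (2, "inc"), (2, "write")]
    print(run(serial))       # 2
    print(run(interleaved))  # 1 -- T1's increment is overwritten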
144
SQL-92 Isolation Levels
Phenomena:
Dirty read: T1 modifies x, which is then read by T2 before T1 terminates; if T1 aborts, T2 has read a value that never existed in the database.
Non-repeatable (fuzzy) read: T1 reads x; T2 then modifies or deletes x and commits; T1 tries to read x again but reads a different value or can't find it.
Phantom: T1 searches the database according to a predicate while T2 inserts new tuples that satisfy the predicate.
145
SQL-92 Isolation Levels (cont’d)
Read Uncommitted: all three phenomena are possible.
Read Committed: fuzzy reads and phantoms are possible, but dirty reads are not.
Repeatable Read: only phantoms are possible.
Anomaly Serializable: none of the phenomena are possible.
146
Durability: once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This is the province of database recovery.
147
Characterization of Transactions
Based on:
Application areas: non-distributed vs. distributed; compensating transactions; heterogeneous transactions.
Timing: on-line (short-life) vs. batch (long-life).
Organization of read and write actions: two-step; restricted; action model.
Structure: flat (or simple) transactions; nested transactions; workflows.
148
Transaction Structure
Flat transaction: consists of a sequence of primitive operations embraced between begin and end markers:
Begin_transaction Reservation
  …
end. {Reservation}
Nested transaction: the operations of a transaction may themselves be transactions:
Begin_transaction Reservation
  …
  Begin_transaction Airline
    …
  end. {Airline}
  Begin_transaction Hotel
    …
  end. {Hotel}
end. {Reservation}
149
Nested Transactions
Have the same properties as their parents, and may themselves have other nested transactions. They introduce concurrency control and recovery concepts within the transaction.
Types:
Closed nesting: subtransactions begin after their parents and finish before them; commitment of a subtransaction is conditional upon the commitment of the parent (commitment through the root).
Open nesting: subtransactions can execute and commit independently; compensation may be necessary.
150
Workflows
“A collection of tasks organized to accomplish some business process.” [D. Georgakopoulos]
Types:
Human-oriented workflows: involve humans in performing the tasks; system support for collaboration and coordination, but no system-wide consistency definition.
System-oriented workflows: computation-intensive and specialized tasks that can be executed by a computer; system support for concurrency control and recovery, automatic task execution, notification, etc.
Transactional workflows: in between the previous two; may involve humans, require access to heterogeneous, autonomous, and/or distributed systems, and support selective use of ACID properties.
151
Workflow Example
T1: customer request obtained; T2: airline reservation performed; T3: hotel reservation performed; T4: auto reservation performed; T5: bill generated.
Diagram: T1 feeds T2; T2 fans out to T3 and T4; T3 and T4 feed T5; the tasks share access to the customer database.
152
Transactions Provide…
Atomic and reliable execution in the presence of failures
Correct execution in the presence of multiple user accesses
Correct management of replicas (if they support it)
153
Transaction Processing Issues
Transaction structure (usually called the transaction model): flat (simple) vs. nested.
Internal database consistency: semantic data control (integrity enforcement) algorithms.
Reliability protocols: atomicity and durability; local recovery protocols; global commit protocols.
154
Transaction Processing Issues
Concurrency control algorithms: how to synchronize concurrent transaction executions (correctness criterion); intra-transaction consistency; isolation.
Replica control protocols: how to control the mutual consistency of replicated data; one-copy equivalence and ROWA.
155
Architecture Revisited
Diagram: the distributed execution monitor receives Begin_transaction, Read, Write, Commit, and Abort requests and returns results; inside it, the transaction manager (TM) communicates with other TMs and passes scheduling/descheduling requests to the scheduler (SC), which talks to other SCs and to the data processor.
156
Centralized Transaction Execution
Diagram: user applications submit Begin_transaction, Read, Write, Abort, and EOT to the transaction manager (TM) and receive results and user notifications; the TM passes Read, Write, Abort, and EOT to the scheduler (SC); the SC hands scheduled operations to the recovery manager (RM), and results flow back up.
157
Distributed Transaction Execution
Diagram: the user application submits Begin_transaction, Read, Write, EOT, and Abort and receives results and notifications; TMs at the participating sites coordinate through a replica control protocol; SCs coordinate through a distributed concurrency control protocol; RMs apply a local recovery protocol at each site.
158
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
159
Useful References
J.D. Ullman, Principles of Database Systems, Computer Science Press, Rockville, 1982.
J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.
B. Bhargava, Concurrency Control in Database Systems, IEEE Transactions on Knowledge and Data Engineering, 11(1), Jan.-Feb. 1999.
160
Concurrency Control
Interleaved execution of a set of transactions that satisfies given consistency constraints.
Mechanisms: locking (two-phase locking); conflict graphs; knowledge about incoming transactions or transaction typing; optimistic methods, which require validation (backout and starvation).
Some examples: centralized locking, distributed locking, majority voting, local and centralized validation.
161
Basic Terms for Concurrency Control
Concurrent processing; conflict; consistency; mutual consistency; history; serializability; serial history; database; database entity (item, object); distributed database; program; transaction (read set, write set); actions; atomic.
162
Basic Terms for Concurrency Control
Serializable history; concurrency control; centralized control; distributed control; scheduler; locking; read lock, write lock; two-phase locking, lock point; crash; node failure; network partition; log; livelock; deadlock; conflict graph (acyclic); timestamp; version number; rollback; validation and optimistic methods; commit; redo log; undo log; recovery; abort.
163
Concurrency Control once again
The problem of synchronizing concurrent transactions such that the consistency of the database is maintained while, at the same time, the maximum degree of concurrency is achieved.
Anomalies:
Lost updates: the effects of some transactions are not reflected in the database.
Inconsistent retrievals: a transaction, if it reads the same data item more than once, should always read the same value.
164
Execution Schedule (or History)
An order in which the operations of a set of transactions are executed. A schedule (history) can be defined as a partial order over the operations of a set of transactions.
T1: Read(x), Write(x), Commit
T2: Write(x), Write(y), Read(z), Commit
T3: Read(x), Read(y), Read(z), Commit
H1 = {W2(x), R1(x), R3(x), W1(x), C1, W2(y), R3(y), R2(z), C2, R3(z), C3}
165
Formalization of Schedule
A complete schedule SC(T) over a set of transactions T = {T1, …, Tn} is a partial order SC(T) = {ΣT, <T} where
ΣT = ∪i Σi, for i = 1, 2, …, n
<T ⊇ ∪i <i, for i = 1, 2, …, n
For any two conflicting operations Oij, Okl ∈ ΣT, either Oij <T Okl or Okl <T Oij.
166
Complete Schedule: Example
Given three transactions:
T1: Read(x), Write(x), Commit
T2: Write(x), Write(y), Read(z), Commit
T3: Read(x), Read(y), Read(z), Commit
A possible complete schedule is given as a DAG over {R1(x), W2(x), R3(x), W1(x), W2(y), R3(y), R2(z), R3(z), C1, C2, C3} in which every pair of conflicting operations is ordered.
167
Schedule Definition
A schedule is a prefix of a complete schedule such that only some of the operations and only some of the ordering relationships are included.
T1: Read(x), Write(x), Commit
T2: Write(x), Write(y), Read(z), Commit
T3: Read(x), Read(y), Read(z), Commit
Diagram: the prefix keeps R1(x), W2(x), R3(x), W2(y), R3(y), R2(z), and R3(z), along with the ordering among them, while dropping W1(x) and the commits.
168
Serial History
All the actions of a transaction occur consecutively; no interleaving of transaction operations. If each transaction is consistent (obeys integrity rules), then the database is guaranteed to be consistent at the end of executing a serial history.
T1: Read(x), Write(x), Commit
T2: Write(x), Write(y), Read(z), Commit
T3: Read(x), Read(y), Read(z), Commit
Hs = {W2(x), W2(y), R2(z), C2, R1(x), W1(x), C1, R3(x), R3(y), R3(z), C3}
169
Serializable History Transactions execute concurrently, but the net effect of the resulting history upon the database is equivalent to some serial history. Equivalent with respect to what? Conflict equivalence: the relative order of execution of the conflicting operations belonging to unaborted transactions in the two histories is the same. Conflicting operations: two incompatible operations (e.g., Read and Write) conflict if they both access the same data item. Incompatible operations of each transaction are assumed to conflict; their execution order is not changed. If two operations from two different transactions conflict, the corresponding transactions are also said to conflict.
170
Serializable History
T1: Read(x), Write(x), Commit
T2: Write(x), Write(y), Read(z), Commit
T3: Read(x), Read(y), Read(z), Commit
The following are not conflict equivalent:
Hs={W2(x),W2(y),R2(z),C2,R1(x),W1(x),C1,R3(x),R3(y),R3(z),C3}
H1={W2(x),R1(x),R3(x),W1(x),C1,W2(y),R3(y),R2(z),C2,R3(z),C3}
The following are conflict equivalent; therefore H2 is serializable:
H2={W2(x),R1(x),W1(x),C1,R3(x),W2(y),R3(y),R2(z),C2,R3(z),C3}
171
Serializability in Distributed DBMS
Somewhat more involved. Two histories have to be considered: local histories global history For global transactions (i.e., global history) to be serializable, two conditions are necessary: Each local history should be serializable. Two conflicting operations should be in the same relative order in all of the local histories where they appear together.
172
Global Non-serializability
T1: Read(x), x ← x+5, Write(x), Commit
T2: Read(x), x ← x*15, Write(x), Commit
The following two local histories are individually serializable (in fact serial), but the two transactions are not globally serializable:
LH1={R1(x),W1(x),C1,R2(x),W2(x),C2}
LH2={R2(x),W2(x),C2,R1(x),W1(x),C1}
173
Evaluation Criterion for Concurrency Control
1. Degree of concurrency: the scheduler recognizes or reshuffles the requested history into the executed history; less reshuffling → higher degree of concurrency.
2. Resources used to recognize: lock tables, timestamps, read/write sets; complexity.
3. Costs: programming ease.
174
General Comments Information needed by Concurrency Controllers
Locks on database objects
Timestamps on database objects
Timestamps on transactions
Observations:
Timestamp mechanisms are more fundamental than locking.
Timestamps carry more information.
Checking locks costs less than checking timestamps.
175
General Comments (cont.)
When to synchronize:
At first access to an object (locking, pessimistic validation)
At each access (question of granularity)
After all accesses and before commitment (optimistic validation)
Fundamental notions: rollback, identification of useless transactions, delaying the commit point, semantics of transactions.
176
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
177
Concurrency Control Algorithms
Pessimistic Two-Phase Locking-based (2PL) Centralized (primary site) 2PL Primary copy 2PL Distributed 2PL Timestamp Ordering (TO) Basic TO Multiversion TO Conservative TO Hybrid Optimistic Locking-based Timestamp ordering-based
178
Locking-Based Algorithms
Transactions indicate their intentions by requesting locks from the scheduler (called the lock manager). Locks are either read locks (rl) [also called shared locks] or write locks (wl) [also called exclusive locks]. Read locks and write locks conflict (because Read and Write operations are incompatible):
      rl    wl
rl    yes   no
wl    no    no
Locking works nicely to allow concurrent processing of transactions.
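A minimal sketch of the compatibility check above; the LockManager class and its method names are illustrative, not part of the lecture material.

```python
# Lock-manager sketch: grant a lock only if it is compatible with all
# locks currently held on the item (rl-rl is the only compatible pair).
READ, WRITE = "rl", "wl"

COMPATIBLE = {(READ, READ): True, (READ, WRITE): False,
              (WRITE, READ): False, (WRITE, WRITE): False}

class LockManager:
    def __init__(self):
        self.locks = {}          # item -> list of (transaction, mode)

    def request_lock(self, txn, item, mode):
        """Grant iff mode is compatible with every lock held on item."""
        held = self.locks.setdefault(item, [])
        for holder, held_mode in held:
            if holder != txn and not COMPATIBLE[(held_mode, mode)]:
                return False     # caller must wait (or abort)
        held.append((txn, mode))
        return True

    def release_locks(self, txn):
        for held in self.locks.values():
            held[:] = [(t, m) for t, m in held if t != txn]

lm = LockManager()
assert lm.request_lock("T1", "x", READ)        # rl-rl: compatible
assert lm.request_lock("T2", "x", READ)
assert not lm.request_lock("T3", "x", WRITE)   # wl conflicts with rl
```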
179
Two-Phase Locking (2PL)
A transaction locks an object before using it. When an object is locked by another transaction, the requesting transaction must wait. When a transaction releases a lock, it may not request another lock.
[Figure: number of locks held over time between BEGIN and END; Phase 1 (growing) obtains locks up to the lock point, Phase 2 (shrinking) releases them]
180
Strict 2PL Hold locks until the end.
[Figure: locks are obtained during the transaction and released only at END; the period of data item use extends over the whole transaction duration]
181
Testing for Serializability
Consider transactions T1, T2, …, Tk. Create a directed graph (called a conflict graph) whose nodes are transactions. Consider a history of transactions: if T1 unlocks an item and T2 locks it afterwards, draw an edge from T1 to T2, implying T1 must precede T2 in any serial history (T1→T2). Repeat this for all unlock and lock actions for different transactions. If the graph has a cycle, the history is not serializable. If the graph is acyclic, a topological sort gives the serial history. (A sketch of this test follows.)
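One possible coding of this test, assuming the precedence edges have already been extracted from the history; function and variable names are illustrative.

```python
# Build the conflict graph from (unlock, lock) precedence pairs and
# topologically sort it; a cycle means the history is not serializable.
from graphlib import TopologicalSorter, CycleError

def serial_order(edges):
    """edges: iterable of (Ti, Tj) meaning Ti unlocked an item that Tj
    locked afterwards, i.e. Ti must precede Tj. Returns a serial history
    (list of transactions) or None if the conflict graph has a cycle."""
    graph = {}
    for ti, tj in edges:
        graph.setdefault(tj, set()).add(ti)   # tj depends on ti
        graph.setdefault(ti, set())
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None

# Edges from the example that follows: T1->T2 and T2->T3.
print(serial_order([("T1", "T2"), ("T2", "T3")]))  # ['T1', 'T2', 'T3']
print(serial_order([("T1", "T2"), ("T2", "T1")]))  # None: cycle
```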
182
Example
T1: Lock X
T1: Unlock X
T2: Lock X
T2: Lock Y
T2: Unlock X
T2: Unlock Y
T3: Lock Y
T3: Unlock Y
Edges: T1→T2 (on X), T2→T3 (on Y)
[Figure: acyclic conflict graph T1 → T2 → T3]
183
Theorem Two-phase locking is a sufficient condition to ensure serializability. Proof: by contradiction. If a history is not serializable, a cycle must exist in the conflict graph, i.e., there is a path T1→T2→T3 … Tk→T1. This implies T1 released a lock before T2 acquired it and then, after Tk, requested a lock again. Requesting a lock after releasing one violates the two-phase locking condition.
184
Centralized 2PL There is only one 2PL scheduler in the distributed system. Lock requests are issued to the central scheduler.
[Figure: message flow between the coordinating TM, the central-site LM, and the data processors at participating sites: 1. Lock Request, 2. Lock Granted, 3. Operation, 4. End of Operation, 5. Release Locks]
185
Distributed 2PL 2PL schedulers are placed at each site. Each scheduler handles lock requests for data at that site. A transaction may read any of the replicated copies of item x, by obtaining a read lock on one of the copies of x. Writing into x requires obtaining write locks for all copies of x.
186
Distributed 2PL Execution
[Figure: message flow among the coordinating TM, participating LMs, and participating DPs: 1. Lock Request, 2. Operation, 3. End of Operation, 4. Release Locks]
187
Timestamp Ordering
Each transaction (Ti) is assigned a globally unique timestamp ts(Ti). The transaction manager attaches the timestamp to all operations issued by the transaction. Each data item is assigned a write timestamp (wts) and a read timestamp (rts):
rts(x) = largest timestamp of any read on x
wts(x) = largest timestamp of any write on x
Conflicting operations are resolved by timestamp order. Basic T/O:
for Ri(x):
  if ts(Ti) < wts(x) then reject Ri(x)
  else accept Ri(x); rts(x) ← max(rts(x), ts(Ti))
for Wi(x):
  if ts(Ti) < rts(x) or ts(Ti) < wts(x) then reject Wi(x)
  else accept Wi(x); wts(x) ← max(wts(x), ts(Ti))
188
Conservative Timestamp Ordering
Basic timestamp ordering tries to execute an operation as soon as it receives it: progressive, but it suffers too many restarts since there is no delaying. Conservative timestamping delays each operation until there is an assurance that it will not be restarted. Assurance? No other operation with a smaller timestamp can arrive at the scheduler. Note that the delay may result in the formation of deadlocks.
189
Multiversion Timestamp Ordering
Do not modify the values in the database; create new versions. A Ri(x) is translated into a read on one version of x: find a version of x (say xv) such that ts(xv) is the largest timestamp less than ts(Ti). A Wi(x) is translated into Wi(xw) and accepted if the scheduler has not yet processed any Rj(xr) such that ts(xr) < ts(Ti) < ts(Tj).
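A small sketch of the version-selection rule for reads; the sorted version store is an assumption of this example, not part of the lecture.

```python
# Multiversion TO read: pick the version with the largest timestamp
# strictly less than ts(Ti), from a version list sorted by timestamp.
import bisect

versions = {"x": [(0, "v0"), (5, "v5"), (9, "v9")]}  # (ts, value), sorted

def mv_read(ts_ti, item):
    """Return the version of item with the largest ts below ts(Ti)."""
    stamps = [ts for ts, _ in versions[item]]
    i = bisect.bisect_left(stamps, ts_ti)   # first version with ts >= ts(Ti)
    return versions[item][i - 1] if i > 0 else None

print(mv_read(7, "x"))    # (5, 'v5'): largest timestamp below 7
print(mv_read(20, "x"))   # (9, 'v9')
```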
190
Optimistic Concurrency Control Algorithms
Pessimistic execution: Validate → Read → Compute → Write
Optimistic execution: Read → Compute → Validate → Write
191
Week 7, Lecture 1 Midterm Review
192
Week 7, Lecture 2 Midterm Exam
193
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
194
Useful References H.T. Kung and John T. Robinson, On Optimistic Methods for Concurrency Control, ACM Trans. Database Systems, 6(2), 1981. B. Bhargava, Concurrency Control in Database Systems, IEEE Trans on Knowledge and Data Engineering,11(1), Jan.-Feb. 1999
195
Optimistic Concurrency Control Algorithms
Transaction execution model: divide the transaction into subtransactions, each of which executes at a site. Tij: transaction Ti that executes at site j. Transactions run independently at each site until they reach the end of their read phases. All subtransactions are assigned a timestamp at the end of their read phase. A validation test is performed during the validation phase; if one subtransaction fails, all are rejected.
196
Optimistic Concurrency Control Processing
[Flowchart: Start → Read, Compute, and Write Local → Semi-Commit on Initiating Site → Integrity Control & Local Validation → on Success: Commit, Global Write → Finish; on Fail: abort/restart]
197
Transaction Types on a Site
Committed Transactions Semi-Committed Transactions Transactions Still Reading/Computing Validating Transactions
198
Example of Locking vs Optimistic
Transactions Ti and Tj have read sets and write sets S(Ri), S(Wi), S(Rj), S(Wj). If S(Ri) ∩ S(Wj) ≠ ø and Ri precedes Wj, then Ti → Tj.
Locking: Ri Rj Wi Wj
Optimistic: Ri Rj Wj Wi
199
Locking: This history not allowed
Example: R1 R2 R3 … Rn W1 W2 W3 … Wn
W2 is blocked by R1, so T2 cannot finish before T1. What if T1 is a long transaction and T2 is a small one? T1 blocks T2, and can likewise block T3 … Tn.
Optimistic [Kung]: let Ti (i = 2,…,n) commit, saving each Wi for validation; R1 is then validated against the Wi, and T1 is aborted if validation fails. Switch to the modifications below.
200
Optimistic Validation
(First modification) Try this or switch: the Ti's can commit, with Wi and Ri saved for validation. W1 is validated against the Wi and Ri; T1 is aborted if validation fails.
(Second modification) Switch R1 to the right, after W2, W3 … Wn; switch W1 to the left, before Rn, Rn-1 … R2. If R1 and W1 become adjacent, T1 is successful.
201
Probability that two transactions do not share an object
Lower bound on this probability; maximum probability that two transactions will share an object:
BS    M      Probability of conflict
5     100    .0576
10    500    .0025
20    1000   .113
Probability of a cycle = O(PC²), small
202
Concurrency/Multiprogramming level is low
Example: I/O = .005 seconds, CPU = .0001 seconds, transaction size = 5 operations.
Time to execute a transaction: 5 × (.005 + .0001) ≈ .0255 seconds.
For another transaction to meet this transaction in the system, the arrival rate must exceed 1/.0255, i.e., > 40 per second.
203
Optimistic CC Validation Test
If all transactions Tk where ts(Tk) < ts(Tij) have completed their write phase before Tij has started its read phase, then validation succeeds. Transaction executions are in serial order.
[Timeline: Tk runs its R, V, W phases; Tij starts its R phase only after Tk's W phase ends]
204
Optimistic CC Validation Test
If there is any transaction Tk such that ts(Tk) < ts(Tij) which completes its write phase while Tij is in its read phase, then validation succeeds if WS(Tk) ∩ RS(Tij) = Ø. The read and write phases overlap, but Tij does not read data items written by Tk.
[Timeline: Tk's W phase overlaps Tij's R phase]
205
Optimistic CC Validation Test
If there is any transaction Tk such that ts(Tk) < ts(Tij) which completes its read phase before Tij completes its read phase, then validation succeeds if WS(Tk) ∩ RS(Tij) = Ø and WS(Tk) ∩ WS(Tij) = Ø. They overlap, but don't access any common data items.
[Timeline: Tk's R, V, W phases overlap Tij's R phase]
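One possible coding of the three validation tests above, in the Kung-Robinson style; the Txn fields and names are illustrative assumptions.

```python
# Validate Tij against every committed Tk with a smaller timestamp,
# applying tests 1-3 from the preceding slides in order.
from dataclasses import dataclass, field

@dataclass
class Txn:
    ts: int
    read_start: int      # when the read phase started
    read_end: int        # when the read phase ended (validation time)
    write_end: int       # when the write phase ended
    rs: set = field(default_factory=set)   # read set
    ws: set = field(default_factory=set)   # write set

def validate(tij, committed):
    for tk in (t for t in committed if t.ts < tij.ts):
        if tk.write_end < tij.read_start:
            continue                                  # test 1: serial
        if tk.write_end < tij.read_end:
            if tk.ws & tij.rs:                        # test 2
                return False
            continue
        if tk.ws & tij.rs or tk.ws & tij.ws:          # test 3
            return False
    return True

t_old = Txn(ts=1, read_start=0, read_end=2, write_end=3, ws={"x"})
t_new = Txn(ts=2, read_start=4, read_end=6, write_end=7, rs={"x"})
print(validate(t_new, [t_old]))   # True: test 1 (serial order) applies
```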
206
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
207
Deadlock A transaction is deadlocked if it is blocked and will remain blocked until there is intervention. Locking-based CC algorithms may cause deadlocks; TO-based algorithms that involve waiting may also cause deadlocks. Wait-for graph (WFG): if transaction Ti waits for another transaction Tj to release a lock on an entity, then Ti → Tj in the WFG.
208
Local versus Global WFG
Assume T1 and T2 run at site 1, and T3 and T4 run at site 2. Also assume T3 waits for a lock held by T4, which waits for a lock held by T1, which waits for a lock held by T2, which, in turn, waits for a lock held by T3.
[Figure: local WFGs: site 1 has T1 → T2, site 2 has T3 → T4. The global WFG contains the cycle T1 → T2 → T3 → T4 → T1]
209
Deadlock Management Ignore Prevention Avoidance Detection and Recovery
Ignore: let the application programmer deal with it, or restart the system.
Prevention: guaranteeing that deadlocks can never occur in the first place; check the transaction when it is initiated; requires no run-time support.
Avoidance: detecting potential deadlocks in advance and taking action to ensure that deadlock will not occur; requires run-time support.
Detection and Recovery: allowing deadlocks to form and then finding and breaking them; as in the avoidance scheme, this requires run-time support.
210
Deadlock Prevention All resources which may be needed by a transaction must be predeclared. The system must guarantee that none of the resources will be needed by an ongoing transaction. Resources must only be reserved, but not necessarily allocated, a priori. The scheme is generally unsuitable in a database environment; it is suitable for systems that have no provisions for undoing processes. Evaluation: reduced concurrency due to preallocation; evaluating whether an allocation is safe adds overhead; the needed resources are difficult to determine (partial order); no transaction rollback or restart is involved.
211
Deadlock Avoidance Transactions are not required to request resources a priori. Transactions are allowed to proceed unless a requested resource is unavailable. In case of conflict, transactions may be allowed to wait for a fixed time interval. Order either the data items or the sites and always request locks in that order. More attractive than prevention in a database environment.
212
Deadlock Avoidance – Wait-Die & Wound-Wait Algorithms
WAIT-DIE Rule: If Ti requests a lock on a data item which is already locked by Tj, then Ti is permitted to wait iff ts(Ti)<ts(Tj). If ts(Ti)>ts(Tj), then Ti is aborted and restarted with the same timestamp.
if ts(Ti)<ts(Tj) then Ti waits else Ti dies
non-preemptive: Ti never preempts Tj; the younger requester dies
WOUND-WAIT Rule: If Ti requests a lock on a data item which is already locked by Tj, then Ti is permitted to wait iff ts(Ti)>ts(Tj). If ts(Ti)<ts(Tj), then Tj is aborted and the lock is granted to Ti.
if ts(Ti)<ts(Tj) then Tj is wounded else Ti waits
preemptive: Ti preempts Tj if Tj is younger
Both rules favor older transactions; a sketch of both follows.
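The two rules above expressed as decision functions (smaller timestamp = older); a sketch only, with "wait"/"abort" handling left to the caller.

```python
def wait_die(ts_requester, ts_holder):
    """Ti (requester) vs Tj (holder): older waits, younger dies."""
    return "wait" if ts_requester < ts_holder else "abort_requester"

def wound_wait(ts_requester, ts_holder):
    """Older requester wounds (aborts) the younger holder; a younger
    requester waits for the older holder."""
    return "abort_holder" if ts_requester < ts_holder else "wait"

print(wait_die(1, 2))     # wait: T1 is older than T2
print(wait_die(3, 2))     # abort_requester: T3 is younger, it dies
print(wound_wait(1, 2))   # abort_holder: T1 wounds the younger T2
print(wound_wait(3, 2))   # wait: the younger T3 waits for T2
```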
213
Deadlock Detection Transactions are allowed to wait freely.
Wait-for graphs and cycles. Topologies for deadlock detection algorithms Centralized Distributed Hierarchical
214
Centralized Deadlock Detection
One site is designated as the deadlock detector for the system. Each scheduler periodically sends its local WFG to the central site, which merges them into a global WFG to determine cycles. How often to transmit? Too often → higher communication cost but lower delays due to undetected deadlocks. Too infrequently → higher delays due to deadlocks, but lower communication cost. A reasonable choice if the concurrency control algorithm is also centralized. Proposed for Distributed INGRES.
215
Hierarchical Deadlock Detection
Build a hierarchy of detectors.
[Figure: deadlock-detector hierarchy with DD0x at the root, DD11 and DD14 at the intermediate level, and local detectors DD21, DD22, DD23, DD24 at sites 1-4]
216
Distributed Deadlock Detection
Sites cooperate in detection of deadlocks. One example: The local WFGs are formed at each site and passed on to other sites. Each local WFG is modified as follows: Since each site receives the potential deadlock cycles from other sites, these edges are added to the local WFGs The edges in the local WFG which show that local transactions are waiting for transactions at other sites are joined with edges in the local WFGs which show that remote transactions are waiting for local ones. Each local deadlock detector: looks for a cycle that does not involve the external edge. If it exists, there is a local deadlock which can be handled locally. looks for a cycle involving the external edge. If it exists, it indicates a potential global deadlock. Pass on the information to the next site.
217
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
218
Useful References J. Gray and A. Reuter. Transaction Processing - Concepts and Techniques. Morgan Kaufmann, 1993. Bharat Bhargava (Ed.), Concurrency Control and Reliability in Distributed Systems, Van Nostrand and Reinhold Publishers, 1987.
219
Reliability In case of a crash, recover to a consistent (or correct) state and continue processing. Types of failures: node failure, communication line failure, loss of a message (or transaction), network partition, or any combination of the above.
220
Approaches to Reliability
Audit trails (or logs) Two phase commit protocol Retry based on timing mechanism Reconfigure Allow enough concurrency which permits definite recovery (avoid certain types of conflicting parallelism) Crash resistance design
221
Recovery Controller Types of failures: * transaction failure
* site failure (local or remote)
* communication system failure
Transaction failure: UNDO/REDO logs; transparent transactions (effects of execution kept in a private workspace); the failure does not affect the rest of the system.
Site failure: volatile storage lost; stable storage lost; processing capability lost (no new transactions accepted).
222
System Restart
Types of transactions at restart:
1. In commitment phase
2. Committed, actions reflected in real/stable storage
3. Have not yet begun
4. In prelude (have done only undoable actions)
We need: a stable undo log; a stable redo log (at commit); performing of the redo log (after commit).
Problem: making the entry into the undo log and the performing of the action atomic.
Solution: undo actions <T, A, E> must be restartable (idempotent), as sketched below:
DO – UNDO ≡ DO – UNDO – UNDO – … – UNDO
223
Site Failures (simple ideas)
Local site failure:
- Transaction committed → do nothing
- Transaction semi-committed → abort
- Transaction computing/validating → abort
AVOIDS BLOCKING
Remote site failure:
- Assume the failed site will accept the transaction
- Send abort/commit messages to the failed site via spoolers
Initialization of a failed site:
- Update for globally committed transactions before validating other transactions
- If the spooler crashed, request other sites to send their lists of committed transactions
224
Communication Failures (simple ideas)
Communication system failure:
- Network partition
- Lost message
- Message order scrambled
Network partition solutions:
- Semi-commit in all partitions and commit on reconnection (updates available to users with a warning)
- Commit transactions if the primary copy token for all entities lies within the partition
- Consider commutative actions
- Compensating transactions
225
Compensation Compensating transactions Recomputing cost
Compensating transactions:
- Commit transactions in all partitions
- Break cycles by removing semi-committed transactions
- Otherwise abort transactions that are invisible to the environment (no incident edges)
- Pay the price of committing such transactions and issue compensating transactions
Recomputing cost:
- Size of readset/writeset
- Computation complexity
226
Reliability and Fault-Tolerance Parameters
Problem: how to maintain the atomicity and durability properties of transactions.
227
Fundamental Definitions
Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced any failures within a given time period. Typically used to describe systems that cannot be repaired or where the continuous operation of the system is critical. Availability The fraction of the time that a system meets its specification. The probability that the system is operational at a given time t.
228
Basic System Concepts
[Figure: a SYSTEM of components (1, 2, 3) with internal and external state, receiving stimuli from and sending responses to its ENVIRONMENT]
229
Fundamental Definitions
Failure The deviation of a system from the behavior that is described in its specification. Erroneous state The internal state of a system such that there exist circumstances in which further processing, by the normal algorithms of the system, will lead to a failure which is not attributed to a subsequent fault. Error The part of the state which is incorrect. Fault An error in the internal states of the components of a system or in the design of a system.
230
Faults to Failures
Fault → (causes) → Error → (results in) → Failure
231
Types of Faults
Hard faults: permanent; the resulting failures are called hard failures.
Soft faults: transient or intermittent; account for more than 90% of all failures; the resulting failures are called soft failures.
232
Fault Classification
[Figure: incorrect design leads to permanent faults and permanent errors; unstable or marginal components lead to intermittent errors; an unstable environment and operator mistakes lead to transient errors; all can result in system failure]
233
Failure Timeline
[Timeline: a fault occurs and causes an error; MTTD is the time until detection of the error, MTTR the time until repair, and MTBF the time between successive failures. Multiple errors can occur between the occurrence of a fault and its repair.]
234
Fault Tolerance Measures
Reliability
R(t) = Pr{0 failures in time [0,t] | no failures at t=0}
If the occurrence of failures is Poisson, R(t) = Pr{0 failures in time [0,t]}, and
Pr{k failures in time [0,t]} = e^(−m(t)) [m(t)]^k / k!
where m(t) = ∫₀ᵗ z(x) dx and z(x) is the hazard function, which gives the time-dependent failure rate of the component.
235
Fault-Tolerance Measures
Reliability
The mean number of failures in time [0,t] can be computed as
E[k] = Σ_{k=0}^∞ k · e^(−m(t)) [m(t)]^k / k! = m(t)
and the variance as
Var[k] = E[k²] − (E[k])² = m(t)
Thus, the reliability of a single component is R(t) = e^(−m(t)), and of a system consisting of n non-redundant components
Rsys(t) = Π_{i=1}^n Ri(t)
236
Fault-Tolerance Measures
Availability
A(t) = Pr{system is operational at time t}
Assume Poisson failures with rate λ and exponentially distributed repair time with mean 1/µ. Then the steady-state availability is
A = lim_{t→∞} A(t) = µ / (λ + µ)
237
Fault-Tolerance Measures
MTBF: mean time between failures, MTBF = ∫₀^∞ R(t) dt
MTTR: mean time to repair
Availability A = MTBF / (MTBF + MTTR)
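Worked numbers for the measures above, assuming a constant failure rate λ (so m(t) = λt) and the exponential repair time stated earlier; the rates chosen are illustrative.

```python
# Reliability, MTBF, and availability for a constant-hazard component.
import math

lam, mu = 1 / 1000.0, 1 / 10.0    # failures/hour, repairs/hour (assumed)

R = lambda t: math.exp(-lam * t)  # R(t) = e^{-m(t)} with m(t) = lam*t
print(R(100))                     # prob. of surviving 100 hours ~ 0.905

MTBF = 1 / lam                    # integral of R(t) dt for constant hazard
MTTR = 1 / mu
print(MTBF / (MTBF + MTTR))       # availability ~ 0.990 = mu/(lam+mu)
```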
238
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
239
Types of Failures Transaction failures System (site) failures
Transaction aborts (unilaterally or due to deadlock) Avg. 3% of transactions abort abnormally System (site) failures Failure of processor, main memory, power supply, … Main memory contents are lost, but secondary storage contents are safe Partial vs. total failure Media failures Failure of secondary storage devices such that the stored data is lost Head crash/controller failure (?) Communication failures Lost/undeliverable messages Network partitioning
240
Local Recovery Management – Architecture
Volatile storage Consists of the main memory of the computer system (RAM). Stable storage Resilient to failures; loses its contents only in the presence of media failures (e.g., head crashes on disks). Implemented via a combination of hardware (non-volatile storage) and software (stable-write, stable-read, clean-up) components.
[Figure: the Local Recovery Manager in main memory issues Fetch/Flush to the Database Buffer Manager, which reads and writes between the database buffers (volatile database) and the stable database on secondary storage]
241
Update Strategies In-place update Out-of-place update
Each update causes a change in one or more data values on pages in the database buffers Out-of-place update Each update causes the new value(s) of data item(s) to be stored separate from the old value(s)
242
In-Place Update Recovery Information
Database Log: every action of a transaction must not only perform the action, but must also write a log record to an append-only file.
[Figure: an update operation takes the old stable database state to the new stable database state, recording the change in the database log]
243
Logging The log contains information used by the recovery process to restore the consistency of a system. This information may include transaction identifier type of operation (action) items accessed by the transaction to perform the action old value (state) of item (before image) new value (state) of item (after image) …
244
Why Logging? Upon recovery:
all of T1's effects should be reflected in the database (REDO if necessary due to a failure)
none of T2's effects should be reflected in the database (UNDO if necessary)
[Timeline: T1 begins and ends before the system crash; T2 begins before the crash but has not committed when it occurs]
245
REDO Protocol REDO'ing an action means performing it again.
[Figure: REDO applies the database log to the old stable database state to produce the new stable database state]
The REDO operation uses the log information and performs the action that might have been done before, or not done due to failures. It generates the new image.
246
UNDO Protocol New stable database state Old stable database state UNDO Database Log UNDO'ing an action means to restore the object to its before image. The UNDO operation uses the log information and restores the old value of the object.
247
When to Write Log Records Into Stable Store
Assume a transaction T updates a page P Fortunate case System writes P in stable database System updates stable log for this update SYSTEM FAILURE OCCURS!... (before T commits) We can recover (undo) by restoring P to its old state by using the log Unfortunate case SYSTEM FAILURE OCCURS!... (before stable log is updated) We cannot recover from this failure because there is no log record to restore the old value. Solution: Write-Ahead Log (WAL) protocol
248
Write–Ahead Log Protocol
Notice: If a system crashes before a transaction is committed, then all the operations must be undone. Only need the before images (undo portion of the log). Once a transaction is committed, some of its actions might have to be redone. Need the after images (redo portion of the log). WAL protocol : Before a stable database is updated, the undo portion of the log should be written to the stable log When a transaction commits, the redo portion of the log must be written to stable log prior to the updating of the stable database.
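The WAL ordering expressed as code; this is a sketch with an in-memory "stable log" and "stable database", and all names are illustrative.

```python
# WAL rule 1: append the undo record BEFORE updating the stable database.
# WAL rule 2: append the redo records BEFORE writing the commit record.
stable_log, stable_db = [], {}

def update_page(txn, page, before, after):
    stable_log.append(("undo", txn, page, before))  # undo record first
    stable_db[page] = after                         # only then the page

def commit(txn, redo_records):
    for page, after in redo_records:
        stable_log.append(("redo", txn, page, after))
    stable_log.append(("commit", txn))              # the commit point

update_page("T1", "P", before=1, after=2)
commit("T1", [("P", 2)])
print(stable_log)   # undo(P,1), redo(P,2), commit -- in that order
```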
249
Logging Interface (see book)
[Figure: the Local Recovery Manager writes log records into log buffers in main memory, which are flushed to the stable log on secondary storage; database buffers are fetched/flushed to the stable database through the Database Buffer Manager]
250
Out-of-Place Update Recovery Information (see book)
Shadowing: when an update occurs, don't change the old page; create a shadow page with the new values and write it into the stable database. Update the access paths so that subsequent accesses go to the new shadow page. The old page is retained for recovery.
Differential files: for each file F, maintain a read-only part FR and a differential file consisting of an insertions part DF+ and a deletions part DF−. Thus F = (FR ∪ DF+) − DF−. Updates are treated as delete old value, insert new value.
251
Execution of Commands (see book)
Commands to consider: begin_transaction read write commit abort recover Independent of execution strategy for LRM
252
Execution Strategies (see book)
Dependent upon Can the buffer manager decide to write some of the buffer pages being accessed by a transaction into stable storage or does it wait for LRM to instruct it? fix/no-fix decision Does the LRM force the buffer manager to write certain buffer pages into stable database at the end of a transaction's execution? flush/no-flush decision Possible execution strategies: no-fix/no-flush no-fix/flush fix/no-flush fix/flush
253
No-Fix/No-Flush (see book)
Abort Buffer manager may have written some of the updated pages into stable database LRM performs transaction undo (or partial undo) Commit LRM writes an “end_of_transaction” record into the log. Recover For those transactions that have both a “begin_transaction” and an “end_of_transaction” record in the log, a partial redo is initiated by LRM For those transactions that only have a “begin_transaction” in the log, a global undo is executed by LRM
254
No-Fix/Flush (see book)
Abort Buffer manager may have written some of the updated pages into stable database LRM performs transaction undo (or partial undo) Commit LRM issues a flush command to the buffer manager for all updated pages LRM writes an “end_of_transaction” record into the log. Recover No need to perform redo Perform global undo
255
Fix/No-Flush (see book)
Abort None of the updated pages have been written into stable database Release the fixed pages Commit LRM writes an “end_of_transaction” record into the log. LRM sends an unfix command to the buffer manager for all pages that were previously fixed Recover Perform partial redo No need to perform global undo
256
Fix/Flush (see book) Abort
None of the updated pages have been written into stable database Release the fixed pages Commit (the following have to be done atomically) LRM issues a flush command to the buffer manager for all updated pages LRM sends an unfix command to the buffer manager for all pages that were previously fixed LRM writes an “end_of_transaction” record into the log. Recover No need to do anything
257
Checkpoints Simplifies the task of determining which actions of transactions need to be undone or redone when a failure occurs. A checkpoint record contains a list of active transactions. Steps: (1) write a begin_checkpoint record into the log; (2) collect the checkpoint data into stable storage; (3) write an end_checkpoint record into the log.
258
Media Failures –Full Architecture (see book)
[Figure: the logging architecture extended for media failures; in addition to the stable log and stable database, writes also go to an archive log and an archive database]
259
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
260
Useful References D. Skeen and M. Stonebraker, A Formal Model of Crash Recovery in a Distributed System, IEEE Trans. Software Eng., 9(3), 1983. D. Skeen, A Decentralized Termination Protocol, IEEE Symposium on Reliability in Distributed Software and Database Systems, July 1981.
261
Byzantine General Problem
Two generals are situated on adjacent hills and the enemy is in the valley in between. The enemy can defeat either general alone, but not both. To succeed, both generals must agree to either attack or retreat. The generals can communicate via messengers, who are subject to capture or getting lost. The generals may themselves be traitors or send inconsistent information.
262
Byzantine Agreement The problem of getting a set of processors to agree on a common value for an object. Processors may fail arbitrarily, die and revive randomly, send messages when they are not supposed to, etc.
263
Atomicity Control from Book
Commit protocols: how to execute the commit command for distributed transactions. Issue: how to ensure atomicity and durability?
Termination protocols: if a failure occurs, how can the remaining operational sites deal with it? Non-blocking: the occurrence of failures should not force the sites to wait until the failure is repaired to terminate the transaction.
Recovery protocols: when a failure occurs, how do the sites where the failure occurred deal with it? Independent: a failed site can determine the outcome of a transaction without having to obtain remote information.
Independent recovery ⟹ non-blocking termination.
264
General Terminology for Commit/Termination/Recovery Protocols
Committed: Effects are installed to the database. Aborted: Does not execute to completion and any partial effects on database are erased. Consistent state: Derived state from serial execution. Inconsistency caused by: Concurrently executing transaction. Failures causing partial or incorrect execution of a transaction.
265
General Terminology for Commit/Termination/Recovery Protocols
Commit protocols Protocols for directing the successful execution of a simple transaction Termination protocols Protocols at operational site to commit/abort an unfinished transaction after a failure Recovery protocols Protocols at failed site to complete all transactions outstanding at the time of failure
266
General Terminology for Commit/Termination/Recovery Protocols
Distributed Crash Recovery: centralized, hierarchical, linear, and decentralized protocols.
Phase: consists of a message round in which all sites exchange messages.
Two Phase Commit Protocol: ARGUS, LOCUS, INGRES
Four Phase Commit Protocol: SDD-1
Quorum: the minimum number of sites needed to proceed with an action.
267
Commit/Termination Protocols
Two Phase Commit Three Phase Commit Four Phase Commit Linear, Centralized, Hierarchical, Decentralized Protocols
268
Two Phase Commit
Site 1: (1) The transaction arrives; a message asking for a vote is sent to the other site(s). (2) The vote is received; if the vote is Y on both sites, then Commit, else Abort.
Site 2: The message is recorded; the site votes Y or N (abort); the vote is sent to site 1; the site then either Commits or Aborts based on the decision of site 1.
269
Two-Phase Commit (2PC) Phase 1 : The coordinator gets the participants ready to write the results into the database Phase 2 : Everybody writes the results into the database Coordinator :The process at the site where the transaction originates and which controls the execution Participant :The process at the other sites that participate in executing the transaction Global Commit Rule: The coordinator aborts a transaction if and only if at least one participant votes to abort it. The coordinator commits a transaction if and only if all of the participants vote to commit it.
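A minimal coordinator sketch implementing the Global Commit Rule above; the Participant interface (vote/decide) is an assumption of this example, not a real API.

```python
# 2PC coordinator: phase 1 collects votes, phase 2 broadcasts the decision.
class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.outcome = None
    def vote(self):                    # phase 1: VOTE-COMMIT / VOTE-ABORT
        return self.will_commit
    def decide(self, commit):          # phase 2: apply the global decision
        self.outcome = "commit" if commit else "abort"

def two_phase_commit(participants):
    votes = [p.vote() for p in participants]   # phase 1
    decision = all(votes)                      # abort iff any "no" vote
    for p in participants:                     # phase 2
        p.decide(decision)
    return "commit" if decision else "abort"

print(two_phase_commit([Participant(), Participant()]))        # commit
print(two_phase_commit([Participant(), Participant(False)]))   # abort
```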
270
Local Protocols for the Centralized Two-Phase Commit Protocol
[Figure: state diagrams for Site 1 (coordinator) and Site 2 (slave), with states q (initial), w (wait), a (abort), c (commit); the coordinator moves q1 → w1 on "xact request / start xact", then w1 → c1 on "yes / commit" or w1 → a1 on "no / abort"; the slave moves q2 → w2 and then w2 → c2 or w2 → a2 accordingly]
271
Decentralized Two-Phase Commit Protocol
[Figure: state diagram for site i (i = 1, 2, …, n) in the decentralized protocol; from qi the site receives the transaction and sends yesi or noi to all other sites, committing (ci) only if yes1 … yesn are received, otherwise aborting (ai)]
272
Centralized 2PC (see book)
[Figure: Phase 1: coordinator sends "ready?" to all participants, who reply yes/no. Phase 2: coordinator sends "commit/abort?", participants reply "committed/aborted"]
273
SDD-1 Four-Phase Commit Protocol
[Figure: state diagrams for the SDD-1 four-phase commit: Site 1 (coordinator) with states q1, w1, c1/a1 and primed back-up states w1', c1', a1'; Site 2 (back-up) with q2, w2, c2/a2; slave sites i (i = 3, 4) with qi, wi, ci/ai. The sites exchange xact, yes/no, commit/abort, and ack messages, with the back-up taking over if the coordinator fails]
274
2PC Protocol Actions (see book)
[Flowchart of 2PC actions:
Coordinator: INITIAL → write begin_commit in log, send PREPARE → WAIT. On any VOTE-ABORT: write abort in log, send GLOBAL-ABORT → ABORT. On VOTE-COMMIT from all: write commit in log, send GLOBAL-COMMIT → COMMIT. After all ACKs: write end_of_transaction in log.
Participant: INITIAL → on PREPARE, if ready to commit, write ready in log and send VOTE-COMMIT → READY; else write abort in log and send VOTE-ABORT → ABORT. In READY: on GLOBAL-ABORT, write abort in log and send ACK → ABORT; on GLOBAL-COMMIT, write commit in log and send ACK → COMMIT.]
275
Linear 2PC
[Figure: sites 1, 2, …, N in a chain; Phase 1 moves forward from site 1 to site N carrying Prepare and then VC/VA; Phase 2 moves backward from site N to site 1 carrying GC/GA]
VC: Vote-Commit, VA: Vote-Abort, GC: Global-commit, GA: Global-abort
276
State Transitions in 2PC (see book)
[State diagrams:
Coordinator: INITIAL -(Commit command / Prepare)-> WAIT; WAIT -(Vote-abort / Global-abort)-> ABORT; WAIT -(Vote-commit (all) / Global-commit)-> COMMIT.
Participant: INITIAL -(Prepare / Vote-commit)-> READY; INITIAL -(Prepare / Vote-abort)-> ABORT; READY -(Global-abort / Ack)-> ABORT; READY -(Global-commit / Ack)-> COMMIT.]
277
Site Failures - 2PC Termination (see book)
COORDINATOR
Timeout in INITIAL: who cares.
Timeout in WAIT: cannot unilaterally commit; can unilaterally abort.
Timeout in ABORT or COMMIT: stay blocked and wait for the acks.
[2PC coordinator state diagram as before]
278
Site Failures - 2PC Termination
PARTICIPANTS
Timeout in INITIAL: the coordinator must have failed in the INITIAL state; unilaterally abort.
Timeout in READY: stay blocked.
[2PC participant state diagram as before]
279
Site Failures - 2PC Recovery
COORDINATOR
Failure in INITIAL: start the commit process upon recovery.
Failure in WAIT: restart the commit process upon recovery.
Failure in ABORT or COMMIT: nothing special if all the acks have been received; otherwise the termination protocol is involved.
[2PC coordinator state diagram as before]
280
Site Failures - 2PC Recovery
PARTICIPANTS
Failure in INITIAL: unilaterally abort upon recovery.
Failure in READY: the coordinator has been informed about the local decision; treat as a timeout in the READY state and invoke the termination protocol.
Failure in ABORT or COMMIT: nothing special needs to be done.
[2PC participant state diagram as before]
281
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
282
2PC Recovery Protocols –Additional Cases (see book)
Arise due to non-atomicity of log and message send actions Coordinator site fails after writing “begin_commit” log and before sending “prepare” command treat it as a failure in WAIT state; send “prepare” command Participant site fails after writing “ready” record in log but before “vote-commit” is sent treat it as failure in READY state alternatively, can send “vote-commit” upon recovery Participant site fails after writing “abort” record in log but before “vote-abort” is sent no need to do anything upon recovery
283
2PC Recovery Protocols –Additional Case (see book)
Coordinator site fails after logging its final decision record but before sending its decision to the participants coordinator treats it as a failure in COMMIT or ABORT state participants treat it as timeout in the READY state Participant site fails after writing “abort” or “commit” record in log but before acknowledgement is sent participant treats it as failure in COMMIT or ABORT state coordinator will handle it by timeout in COMMIT or ABORT state
284
Problem With 2PC Blocking Independent recovery is not possible
Ready implies that the participant waits for the coordinator If coordinator fails, site is blocked until recovery Blocking reduces availability Independent recovery is not possible However, it is known that: Independent recovery protocols exist only for single site failures; no independent recovery protocol exists which is resilient to multiple-site failures. So we search for these protocols – 3PC
285
Three-Phase Commit 3PC is non-blocking.
A commit protocol is non-blocking iff it is synchronous within one state transition, and its state transition diagram contains no state which is "adjacent" to both a commit and an abort state, and no non-committable state which is "adjacent" to a commit state.
Adjacent: possible to go from one state to another with a single state transition.
Committable: all sites have voted to commit the transaction (e.g., the COMMIT state).
286
State Transitions in 3PC
[State diagrams:
Coordinator: INITIAL -(Commit command / Prepare)-> WAIT; WAIT -(Vote-abort / Global-abort)-> ABORT; WAIT -(Vote-commit / Prepare-to-commit)-> PRE-COMMIT; PRE-COMMIT -(Ready-to-commit / Global commit)-> COMMIT.
Participant: INITIAL -(Prepare / Vote-commit)-> READY; INITIAL -(Prepare / Vote-abort)-> ABORT; READY -(Global-abort / Ack)-> ABORT; READY -(Prepared-to-commit / Ready-to-commit)-> PRE-COMMIT; PRE-COMMIT -(Global commit / Ack)-> COMMIT.]
287
Communication Structure (see book)
[Figure: communication between coordinator C and participants P in three phases. Phase 1: "ready?" / yes-no. Phase 2: "pre-commit?" (or pre-abort?) / yes-no. Phase 3: "commit/abort" / ack]
288
Formalism for Commit Protocols
A local protocol is modeled as a finite state automaton ⟨Q, I, O, q0, A, C⟩:
Q: finite set of states
I: messages addressed to the site
O: messages sent by the site
q0: initial state
A ⊆ Q: abort states
C ⊆ Q: commit states
289
Formalism for Commit Protocols
Properties: the abort states A and commit states C are disjoint subsets of Q. Protocols are non-deterministic: sites make local decisions, and messages can arrive in any order.
290
Global State Definition
Global state vector containing the states of the local protocols. Outstanding messages in the network A global state transition occurs whenever a local state transition occurs at a participating site. Exactly one global transition occurs for each local transition.
291
Global State Graph
[Figure: global state graph for two sites, moving from (q1, q2) via "xact req / start xact" to (w1, q2) and (w1, w2); a "yes" message leads through (c1, w2) to the commit state (c1, c2), while a "no / abort" leads through (a1, w2) to the abort state (a1, a2)]
A global state is inconsistent if its state vector contains both a commit and an abort state.
292
Concurrency Sets and Sender Sets
Two states are potentially concurrent if there exists a reachable global state that contains both local states. The concurrency set of s, C(s), is the set of all local states that are potentially concurrent with it; e.g., C(w1) = {v2, a2, w2}. Let M be the set of messages that are received by s. The sender set for s is S(s) = {t | t is a local state that sends a message m ∈ M}.
293
Status of Various States in the Commit Protocol
A global state is inconsistent if it contains both a local commit state and a local abort state.
Final state: all local states are final.
Terminal state: there exists no immediately reachable successor state (deadlock).
Committable state (local): all sites have voted yes on committing the transaction; otherwise, non-committable.
294
An Example when Only a Single Site Remains Operational
This site can safely abort the transaction if and only if the concurrency set for its local state does not contain a commit state.
This site can safely commit only if its local state is "committable" and the concurrency set for its state does not contain an abort state.
A blocking situation arises when the concurrency set for the local state contains both a commit and an abort state, or when the site is in a "non-committable" state and the concurrency set for that state contains a commit state: the site cannot commit because it cannot infer that all sites have voted yes on committing, and it cannot abort because another site may have committed the transaction before crashing.
These observations imply the simple but powerful result on the next slide.
295
Fundamental Non-blocking Theorem
Definition: a protocol is synchronous within one state transition if one site never leads another site by more than one state transition.
Fundamental non-blocking theorem: a protocol is non-blocking iff (1) there exists no local state s whose concurrency set C(s) contains both an abort and a commit state, and (2) there exists no non-committable state s whose concurrency set C(s) contains a commit state.
Lemma: a protocol that is synchronous within one state transition is non-blocking iff no local state is adjacent to both a commit and an abort state, and no non-committable state is adjacent to a commit state.
296
Three-Phase Commit Protocol
[Figure: state diagrams for the three-phase commit protocol. Site 1 (coordinator) moves q1 → w1 → p1 → c1 (or a1), exchanging xact, yes/no, prepare, ack, and commit/abort messages with the slaves; each slave site i (i = 2, 3, …, n) moves qi → wi → pi → ci (or ai) accordingly]
297
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
298
Useful References D. Skeen and M. Stonebraker, A Formal Model of Crash Recovery in a Distributed System, IEEE Trans. Software Eng., 9(3), 1983. D. Skeen, A Decentralized Termination Protocol, IEEE Symposium on Reliability in Distributed Software and Database Systems, July 1981. D. Skeen, Nonblocking Commit Protocols, ACM SIGMOD, 1981.
299
Termination Protocols
Message sent by an operational site:
abort – if the transaction state is abort
committable – if the transaction state is committable (in p or c)
non-committable – if the transaction state is neither committable nor abort (in initial or wait)
If at least one committable message is received, then commit the transaction; else abort it.
300
Problem with Simple Termination Protocol
Issue 1: an operational site fails immediately after making a commit decision.
Issue 2: a site does not know the current operational status (i.e., up or down) of other sites.
The simple termination protocol is not robust:
[Figure: Site 1 (committable) commits and fails before sending its message to Site 3; Site 2 (non-committable) crashes before sending its message to Site 3. Site 3 does not know whether Site 1 was up at the beginning, and does not know that it got inconsistent messages.]
Resilient protocols require at least two rounds unless no site fails during the execution of the protocol.
301
Resilient Termination Protocols
First message round:
Type of transaction state        Message sent
Final abort state                abort
Committable state                committable
All other states                 non-committable
302
Resilient Termination Protocols
Second and subsequent rounds:
Message received from previous round        Message sent
One or more abort messages                  abort
One or more committable messages            committable
All non-committable messages                non-committable
Summary of rules for sending messages.
303
Resilient Termination Protocols
The transaction is terminated if:
Condition                                                        Final state
Receipt of a single abort message                                abort
Receipt of all committable messages                              commit
Two successive rounds of all non-committable messages
(and no site failure between them)                               abort
Summary of commit and termination rules.
304
Rules for Commit and Termination
Commit Rule: a transaction is committed at a site only after the receipt of a round consisting entirely of committable messages.
Termination Rule: if a site ever receives two successive rounds of non-committable messages and it detects no site failures between the rounds, it can safely abort the transaction.
Lemma: Ni(r+1) ⊆ Ni(r), where Ni(r) is the set of sites sending non-committable messages to site i during round r.
Lemma: if Ni(r+1) = Ni(r), then all messages received by site i during rounds r and r+1 were non-committable messages.
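A sketch of the per-round message rule and the commit/termination test above; the message encoding and round-loop framing are our own assumptions.

```python
# Resilient termination: what a site sends next, and when it terminates.
ABORT, COMMITTABLE, NONCOMMITTABLE = "a", "c", "n"

def next_message(received):
    """Rule for second and subsequent rounds."""
    if ABORT in received:
        return ABORT
    if COMMITTABLE in received:
        return COMMITTABLE
    return NONCOMMITTABLE

def terminate(prev_round, this_round, site_failed):
    """Commit and termination rules; None means keep exchanging rounds."""
    if ABORT in this_round:
        return "abort"
    if all(m == COMMITTABLE for m in this_round):
        return "commit"
    all_non = lambda r: all(m == NONCOMMITTABLE for m in r)
    if all_non(prev_round) and all_non(this_round) and not site_failed:
        return "abort"    # two successive non-committable rounds
    return None

print(next_message([NONCOMMITTABLE, COMMITTABLE]))            # 'c'
print(terminate(["n", "n"], ["n", "n"], site_failed=False))   # abort
```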
305
Worst Case Execution of the Resilient Transition Protocol
Messages received (C = committable, N = non-committable, '-' = no message). Initial state: Site 1 committable, Sites 2–5 non-committable.
Round 1 (1): Site 2 receives CNNNN; Sites 3–5 receive -NNNN.
Round 2: Site 1 FAILED; Site 3 receives -CNNN; Sites 4–5 receive --NNN.
Round 3: Site 2 FAILED; Site 4 receives --CNN; Site 5 receives ---NN.
Round 4: Site 3 FAILED; Site 5 receives ---CN.
Round 5: Site 4 FAILED; Site 5 receives ----C.
NOTE: (1) each site fails after sending a single message.
306
Recovery Protocols Recovery Protocols: Classes of failures:
Recovery protocols: protocols at a failed site to complete all transactions outstanding at the time of failure.
Classes of failures: site failure, lost messages, network partitioning, Byzantine failures.
Effects of failures: inconsistent database; transaction processing is blocked; failed component unavailable.
307
Independent Recovery A recovering site makes a transition directly to a final state without communicating with other sites.
Lemma: for a protocol, if a local state's concurrency set contains both an abort and a commit state, it is not resilient to an arbitrary failure of a single site: si cannot commit, because another site may be in abort; si cannot abort, because another site may be in commit.
Rule 1: for an intermediate state s, if C(s) contains a commit state, assign a failure transition from s to a commit state; otherwise, assign a failure transition from s to an abort state.
308
Theorem for Single Site Failure
Rule 2: for each intermediate state si, if tj in S(si) has a failure transition to a commit (abort) state, then assign a timeout transition from si to a commit (abort) state.
Theorem: Rules 1 and 2 are sufficient for designing protocols resilient to a single site failure.
[Figure: a protocol p is extended to p' = p + failure + timeout transitions; if site 1 fails in state s1 with failure transition f1, the timeout transition f2 taken from the concurrent state s2 must agree with f1, otherwise the outcome is inconsistent]
309
Independent Recovery when Two Sites Fail?
Theorem: there exists no protocol using independent recovery that is resilient to arbitrary failures by two sites.
[Proof sketch: consider the global state vectors G0, G1, …, Gk-1, Gk, …, Gm, where G0 leads to abort, Gm leads to commit, and exactly one site makes a transition between successive states. Somewhere the decision flips: in Gk-1 site j recovers to abort while the other sites also recover to abort, but in Gk site j recovers to commit while failure of any other site still leads to recovery to abort. If site j and another site both fail at that point, their recovered states are inconsistent.]
310
Resilient Protocol when Messages are Lost
Theorem: there exists no protocol resilient to network partitioning when messages are lost.
Rules 3 & 4: isomorphic to Rules 1 & 2 (undelivered message ↔ timeout, timeout ↔ failure).
Theorem: Rules 3 & 4 are necessary and sufficient for making protocols resilient to a partition in a two-site protocol.
Theorem: there exists no protocol resilient to a multiple partition.
311
Site Failures – 3PC Termination (see book)
Coordinator
Timeout in INITIAL: who cares.
Timeout in WAIT: unilaterally abort.
Timeout in PRECOMMIT: participants may not be in PRE-COMMIT, but are at least in READY; move all the participants to the PRECOMMIT state and terminate by globally committing.
[3PC coordinator state diagram as before]
312
Site Failures – 3PC Termination (see book)
Coordinator
Timeout in ABORT or COMMIT: just ignore and treat the transaction as completed; participants are either in the PRECOMMIT or READY state and can follow their termination protocols.
[3PC coordinator state diagram as before]
313
Site Failures – 3PC Termination (see book)
Participants
Timeout in INITIAL: the coordinator must have failed in the INITIAL state; unilaterally abort.
Timeout in READY: the participant voted to commit but does not know the coordinator's decision; elect a new coordinator and terminate using a special protocol.
Timeout in PRECOMMIT: handle the same as a timeout in the READY state.
[3PC participant state diagram as before]
314
Termination Protocol Upon Coordinator Election (see book)
The new coordinator can be in one of four states: WAIT, PRECOMMIT, COMMIT, ABORT. The coordinator sends its state to all of the participants, asking them to assume its state. Participants "back up" and reply with appropriate messages, except those in the ABORT and COMMIT states; those respond with "Ack" but stay in their states. The coordinator then guides the participants towards termination: if the new coordinator is in the WAIT state, participants can be in the INITIAL, READY, ABORT or PRECOMMIT states, and the new coordinator globally aborts the transaction. If the new coordinator is in the PRECOMMIT state, the participants can be in the READY, PRECOMMIT or COMMIT states, and the new coordinator will globally commit the transaction. If the new coordinator is in the ABORT or COMMIT states, at the end of the first phase the participants will have moved to that state as well.
315
Site Failures – 3PC Recovery (see book)
Coordinator
Failure in INITIAL: start the commit process upon recovery.
Failure in WAIT: the participants may have elected a new coordinator and terminated the transaction; the new coordinator could be in the WAIT or ABORT state, in which case the transaction was aborted; ask around for the fate of the transaction.
Failure in PRECOMMIT: ask around for the fate of the transaction.
[3PC coordinator state diagram as before]
316
Site Failures – 3PC Recovery (see book)
Coordinator
Failure in COMMIT or ABORT: nothing special if all the acknowledgements have been received; otherwise the termination protocol is involved.
[3PC coordinator state diagram as before]
317
Site Failures – 3PC Recovery (see book)
Participants
Failure in INITIAL: unilaterally abort upon recovery.
Failure in READY: the coordinator has been informed about the local decision; upon recovery, ask around.
Failure in PRECOMMIT: ask around to determine how the other participants have terminated the transaction.
Failure in COMMIT or ABORT: no need to do anything.
[3PC participant state diagram as before]
318
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Transaction Concepts and Models Distributed Concurrency Control Distributed Reliability Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
319
Useful References S. B. Davidson, Optimism and Consistency in Partitioned Distributed Database Systems, ACM Transactions on Database Systems, 9(3), 1984. S. B. Davidson, H. Garcia-Molina, and D. Skeen, Consistency in Partitioned Networks, ACM Computing Surveys, 17(3), 1985. B. Bhargava, Resilient Concurrency Control in Distributed Database Systems, IEEE Trans. on Reliability, R-31(5), 1984. D. Parker, Jr., et al., Detection of Mutual Inconsistency in Distributed Systems, IEEE Trans. on Software Engineering, SE-9, 1983.
320
Site Failure and Recovery
Maintain consistency of replicated copies during site failure. Announce failure and restart of a site. Identify out-of-date data items. Update stale data items.
321
Main Ideas and Concepts
Read one Write all available protocol. Fail locks and copier transactions. Session vectors. Control transactions.
322
Logical and Physical Copies of Data
X: logical data item
xk: a copy of item X on site k
Strict read-one write-all (ROWA) requires reading at least one copy and writing all copies:
Read(X) = {read(xk)} for some copy xk of X
Write(X) = {write(xk) | for every copy xk of X}
323
Session Numbers and Nominal Session Numbers
Each operational session of a site is designated with an integer, session number. Failed site has session number = 0. as[k] is actual session number of site k. nsi[k] is nominal session number of site k at site i. NS[k] is nominal session number of site k. A nominal session vector consisting of nominal session numbers of all sites is stored at each site. nsi is the nominal session vector at site i.
324
Read one Write all Available (ROWAA)
A transaction initiated at site i reads and writes as follows:
Read(X) = {read(xk)} for some copy xk of X with nsi[k] ≠ 0
Write(X) = {write(xk) | xk is a copy of X and nsi[k] ≠ 0}
At site k, nsi[k] is checked against as[k]; if they are not equal, the transaction is rejected. The transaction is not sent to a failed site, for which nsi[k] = 0.
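A sketch of ROWAA target selection using the nominal session vector; the data layout (copies per item) and names here are illustrative assumptions.

```python
# Read one / write all available: route reads to any nominally-up copy,
# writes to every nominally-up copy (failed sites get fail-locked).
ns = {"s1": 2, "s2": 0, "s3": 1}        # nominal sessions; s2 is down
copies = {"X": ["s1", "s2", "s3"]}      # sites holding a copy of X

def read_targets(item):
    """Read one: any single nominally-up copy."""
    up = [k for k in copies[item] if ns[k] != 0]
    return up[:1]

def write_targets(item):
    """Write all available: every nominally-up copy."""
    return [k for k in copies[item] if ns[k] != 0]

print(read_targets("X"))    # e.g. ['s1']
print(write_targets("X"))   # ['s1', 's3'] -- s2 gets a fail lock instead
```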
325
Control Transactions for Announcing Recovery
Type 1: Claims that a site is nominally up. Updates the session vector of all operational sites with the recovering site’s new session number. New session number is one more than the last session number (like an incarnation). Example: as[k] = 1 initially as[k] = 0 after site failure as[k] = 2 after site recovers as[k] = 3 after site recovers second time
326
Control Transactions for Announcing Failure
Type 2: Claims that one or more sites are down. Claim is made when a site attempts and fails to access a data item on another site. Control transaction type 2 sets a value 0 for a failed site in the nominal session vectors at all operational sites. This allows operational sites to avoid sending read and write requests to failed sites.
327
Fail Locks A fail lock is set at an operational site on behalf of a failed site if a data item is updated. Fail lock can be set per site or per data item. Fail lock used to identify out-of-date items (or missed updates) when a site recovers. All fail locks are released when all sites are up and all data copies are consistent.
328
Copier Transaction A copier transaction reads the current values (of fail-locked items) on operational sites and writes them over the out-of-date items on the recovering site.
329
Site Recovery Procedure
1. When a site k starts, it loads its actual session number as[k] with 0, meaning that the site is ready to process control transactions but not user transactions.
2. Next, the site initiates a control transaction of type 1. It reads an available copy of the nominal session vector and refreshes its own copy; then this control transaction writes a newly chosen session number into nsi[k] for all operational sites i, including itself, but not into as[k] as yet.
3. Using the fail locks at the operational sites, the recovering site marks the data copies that have missed updates since the site failed. (Steps 2 and 3 can be combined.)
4. If the control transaction in step 2 commits, the site is nominally up. The site converts its state from recovering to operational by loading the new session number into as[k].
5. If step 2 fails due to a crash of another site, the recovering site must initiate a control transaction of type 2 to exclude the newly crashed site, and then must try steps 2 and 3 again.
Note that the recovery procedure is delayed by the failure of another site, but the algorithm is robust as long as there is at least one operational site coordinating the transactions in the system.
330
[Figure: status in site recovery and availability of data items for transaction processing. States progress from "site is down: none of the data items are available" through "control transaction 1 running", "partial recovery: unmarked data objects are available", and "continued recovery: copies on the failed site marked and fail locks released", to "site is up: all data items are available (all fail locks for this site released)".]
331
Transaction Processing when Network Partitioning Occurs
Three alternatives after a partition:
A. Allow each group of nodes to process new transactions.
B. Allow at most one group to process new transactions.
C. Halt all transaction processing.
Alternative A:
Database values will diverge; the database is inconsistent when the partition is eliminated.
Undo some transactions: requires a detailed log; expensive.
Integrate the inconsistent values: if database item X has values v1, v2, the new value = v1 + v2 − the value of X at the time of partition.
332
Network Partition Alternatives
Alternative B:
How to guarantee that only one group processes transactions? Assign a number of points to each site; the partition with a majority of points proceeds.
The partition and site failure cases are equivalent in the sense that in both situations we have a group of sites which know that no other site outside the group may process transactions.
What if no group has a majority? Should we allow transactions to proceed to the commit point? Delay the commit decision? Force transactions to commit or cancel?
333
Planes of Serializability
[Figure: planes of serializability, showing partitions A, B, and C between "begin partition" and "end partition", with a rollback plane cutting across.]
334
Merging Semi-Committed Transactions
Merger of semi-committed transactions from several partitions:
Combine DCG1, DCG2, …, DCGN (DCG = dynamic cyclic graph), minimizing rollback if a cycle exists; this is NP-complete (the minimum feedback vertex set problem).
Alternative: consider each DCG as a single transaction and check the acyclicity of this N-node graph (too optimistic!).
Alternative: assign a weight to the transactions in each partition, start from the DCG with maximum weight (say DCG1), and select transactions from the other DCGs that do not create cycles.
335
Breaking Cycle by Aborting Transactions
Two choices:
Abort the transactions that create cycles: consider each transaction that creates a cycle one at a time.
Abort the transactions that optimize rollback (complexity O(n³)); this minimization is not necessarily optimal globally.
336
Commutative Actions and Semantics
Semantics of transaction computation:
Commutative actions, e.g., "give a $5000 bonus to every employee".
Commutativity can be predetermined or recognized dynamically.
Maintain a log (REDO/UNDO) of commutative and noncommutative actions.
Partially roll back transactions to their first noncommutative action.
337
Compensating Actions Compensating Transactions
Commit transactions in all partitions; break cycles by removing semi-committed transactions. Otherwise abort transactions that are invisible to the environment (no incident edges), or pay the price of committing such transactions and issue compensating transactions.
Recomputing cost: size of readset/writeset; computation complexity.
338
Network Partitioning
Simple partitioning: only two partitions.
Multiple partitioning: more than two partitions.
Formal bounds:
There exists no non-blocking protocol that is resilient to a network partition if messages are lost when the partition occurs.
There exist non-blocking protocols which are resilient to a single network partition if all undeliverable messages are returned to the sender.
There exists no non-blocking protocol which is resilient to multiple partitions.
339
Independent Recovery Protocols for Network Partitioning
No general solution is possible; allow one group to terminate while the other is blocked, to improve availability.
How to determine which group may proceed? The group with a majority.
How does a group know if it has a majority?
Centralized: whichever partition contains the central site should terminate the transaction.
Voting-based (quorum): different for replicated vs. non-replicated databases.
340
Quorum Protocols for Non-Replicated Databases
The network partitioning problem is handled by the commit protocol.
Every site is assigned a vote Vi; the total number of votes in the system is V.
Abort quorum Va, commit quorum Vc: Va + Vc > V, where 0 ≤ Va, Vc ≤ V.
Before a transaction commits, it must obtain a commit quorum Vc.
Before a transaction aborts, it must obtain an abort quorum Va.
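A small sketch of these rules (function and variable names are ours):

```python
# Illustrative quorum checks for the non-replicated case. The invariant
# Va + Vc > V guarantees that commit and abort quorums cannot both form.

def valid_quorums(Va, Vc, V):
    return 0 <= Va <= V and 0 <= Vc <= V and Va + Vc > V

def can_commit(votes_collected, Vc):
    return votes_collected >= Vc

def can_abort(votes_collected, Va):
    return votes_collected >= Va
```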
341
State Transitions in Quorum Protocols
[Figure: state transitions in quorum protocols. The coordinator moves INITIAL → WAIT (on the commit command it sends Prepare) → PRE-COMMIT or PRE-ABORT (on Vote-commit it sends Prepare-to-commit; on Vote-abort, Prepare-to-abort) → COMMIT or ABORT (on Ready-to-commit it sends Global-commit; on Ready-to-abort, Global-abort). Each participant moves INITIAL → READY (replying Vote-commit or Vote-abort to Prepare) → PRE-COMMIT or PRE-ABORT (replying Ready-to-commit or Ready-to-abort) → COMMIT or ABORT (acknowledging Global-commit or Global-abort).]
342
Quorum Protocols for Replicated Databases
Network partitioning is handled by the replica control protocol. One implementation:
Assign a vote Vi to each copy of a replicated data item such that Σi Vi = V.
Each operation has to obtain a read quorum (Vr) to read and a write quorum (Vw) to write a data item.
The following rules must be obeyed in determining the quorums:
Vr + Vw > V: a data item is not read and written by two transactions concurrently.
Vw > V/2: two write operations from two transactions cannot occur concurrently on the same data item.
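The corresponding replicated-case checks, again as an illustrative sketch:

```python
# Read/write quorum validation for a replicated item; illustrative only.
# votes maps each copy to its vote; V is the total number of votes.

def quorums_are_safe(Vr, Vw, V):
    # Read/write quorums intersect, and two write quorums intersect.
    return Vr + Vw > V and 2 * Vw > V

def has_quorum(reachable_copies, votes, needed):
    """True if the copies we can reach carry enough votes."""
    return sum(votes[c] for c in reachable_copies) >= needed
```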
343
Use for Network Partitioning
Simple modification of the ROWA rule: When the replica control protocol attempts to read or write a data item, it first checks if a majority of the sites are in the same partition as the site that the protocol is running on (by checking its votes). If so, execute the ROWA rule within that partition. Assumes that failures are “clean” which means: failures that change the network's topology are detected by all sites instantaneously each site has a view of the network consisting of all the sites it can communicate with
344
Open Problems
Replication protocols: experimental validation; replication of computation and communication.
Transaction models: changing requirements; cooperative sharing vs. competitive sharing; interactive transactions; longer duration; complex operations on complex data; relaxed semantics; non-serializable correctness criteria.
345
Other Issues Detection of mutual inconsistency in distributed systems
A distributed system uses replication for reliability (availability) and efficient access. Maintaining the consistency of all copies is hard to do efficiently. Handling discovered inconsistencies is not always possible and is semantics-dependent.
346
Replication and Consistency
Tradeoffs between: the degree of replication of objects; the access time of an object; the availability of an object (during partition); the synchronization of updates (the overhead of consistency).
Ideals: all objects should always be available; all objects should always be consistent.
"Partitioning can destroy mutual consistency in the worst case."
Basic design issue: a single failure must not affect the entire system (robust, reliable).
347
Availability and Consistency
Previous work maintains consistency by: voting (majority consent); tokens (unique/resource); primary site (LOCUS); reliable networks (SDD-1).
These prevent inconsistency at a cost and do not address detection or resolution issues.
We want to provide availability and correct propagation of updates.
348
Detecting Inconsistency
The network may continue to partition or partially merge for an unbounded time. Semantics also differ with replication: naming, creation, deletion… Names in one partition do not relate to entities in another partition. We need a globally unique system name and user name(s), usable in all partitions.
349
Types of Conflicting Consistency
A system name consists of an <Origin, Version> pair:
Origin: globally unique creation name.
Version: vector of modification history.
Two types of conflicts:
Name: two files have the same user-name.
Version: two incompatible versions of the same file. Conflicting files may be identical; the semantics of the update determine the action.
Detection of version conflicts:
Timestamp: overkill.
Version vector: "necessary + sufficient".
Update log: needs global synchronization.
350
Version Vector
In the version vector approach, each file has a version vector of (Si : ui) pairs:
Si: a site on which the file is stored.
ui: the number of updates made at that site.
Example: <A:4; B:2; C:0; D:1>
Two vectors are compatible when one is at least as large as the other at every site in the vector:
<A:1; B:2; C:4; D:3> dominates <A:0; B:2; C:2; D:3> (compatible).
<A:1; B:2; C:4; D:3> and <A:1; B:2; C:3; D:4> are not compatible (their element-wise max is <A:1; B:2; C:4; D:4>).
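A minimal sketch of the comparison and of the element-wise merge used in reconciliation, with vectors as Python dicts mapping site to update count:

```python
# Version-vector compatibility test and element-wise merge; illustrative.

def dominates(v1, v2):
    sites = set(v1) | set(v2)
    return all(v1.get(s, 0) >= v2.get(s, 0) for s in sites)

def compatible(v1, v2):
    # Compatible iff one vector is at least as large at every site.
    return dominates(v1, v2) or dominates(v2, v1)

def merge(v1, v2):
    # Element-wise max, e.g. the basis for reconciliation after a conflict.
    return {s: max(v1.get(s, 0), v2.get(s, 0)) for s in set(v1) | set(v2)}

# The slide's examples:
assert compatible({'A': 1, 'B': 2, 'C': 4, 'D': 3},
                  {'A': 0, 'B': 2, 'C': 2, 'D': 3})
assert not compatible({'A': 1, 'B': 2, 'C': 4, 'D': 3},
                      {'A': 1, 'B': 2, 'C': 3, 'D': 4})
```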
351
Additional Comments
Committed updates at site Si increment ui by one; deletion and renaming are updates.
Resolution at site Si increments ui (on top of the element-wise max) to maintain consistency later.
Storing a file at a new site makes the vector longer by one site.
Inconsistency is determined as early as possible.
This only works for single-file consistency, not for transactions…
352
Example of Conflicting Operation in Different Partitions
[Figure: conflicting operations in different partitions. The version vector is VVi = (Si ; vi), where vi counts updates to file f at site Si. Sites A, B, C all start with <A:0, B:0, C:0>. A updates the file twice, giving <A:2, B:0, C:0>; in the other partition B's version is adopted, giving <A:2, B:0, C:1>; A then updates f once more, giving <A:3, B:0, C:0>. On merge the vectors <A:3, B:0, C:0> and <A:2, B:0, C:1> conflict: 3 > 2, 0 = 0, 0 < 1, so neither dominates.]
353
Example of Partition and Merge
[Figure: example of partition and merge among sites A, B, C, D; "+" marks an update applied within a partition before the groups merge again.]
354
[Figure: all four sites start at <A:0, B:0, C:0, D:0>. After partitioning, site A's group applies updates, reaching <A:2, B:0, C:0, D:0> and then <A:3, B:0, C:0, D:0>, while the other group reaches <A:2, B:0, C:1, D:0>. Merging the groups exposes a conflict, which reconciliation at site B resolves to <A:3, B:1, C:1, D:0>.]
355
General resolution rules are not possible: external (irrevocable) actions prevent reconciliation, rollback, etc. Resolution should be inexpensive. The system must address: detection of conflicts (when, how); the meaning of a conflict (accesses); resolution of conflicts (automatic or user-assisted).
356
Conclusions
An effective detection procedure provides access without mutual exclusion (consent).
Robust during partitions (no loss).
Occasional inconsistency is tolerated for the sake of availability.
Reconciliation semantics: recognize the dependence upon semantics.
357
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
358
Useful References B. Bhargava and John Riedl, The Raid Distributed Database System, IEEE Trans on Software Engineering, 15(6), June 1989. B. Bhargava and John Riedl, A Model for Adaptable Systems for Transaction Processing, IEEE Transactions on Knowledge and Data Engineering, 1(4), Dec 1989. B. Bhargava, Building Distributed Database Systems. Y. Zhang and B. Bhargava, WANCE: Wide area network communication emulation systems, IEEE workshop on Parallel and Distributed Systems, 1993. E. Mafla, and B. Bhargava, Communication Facilities for Distributed Transaction Processing Systems, IEEE Computer, 24(8), 1991. B. Bhargava, Y. Zhang, and E. Mafla, Evolution of a communication system for distributed transaction processing in RAID, Computing Systems, 4(3), 1991.
359
Implementations
LOCUS (UCLA): file system, OS level.
TABS (Camelot) (CMU): data servers, OS level.
RAID (Purdue): database level (server).
SDD-1 (Computer Corp. of America): transaction manager, data manager.
System R* (IBM): database level.
ARGUS (MIT): guardian (server).
360
Architecture of RAID System
[Figure: architecture of the RAID system. A user transaction passes through the Parser (interprets transactions) and the Action Driver to produce compiled transactions; the Concurrency Controller ensures serializability, and the Atomic Controller ensures transaction atomicity across sites j, k, l, … (abort or commit); updates reach the database through a log/diff file, with read-only access alongside.]
361
RAID Transactions
[Figure: transactions flow from the query language DBMS through the concurrency controller and the atomicity controller to completed transactions.]
362
RAID Distributed System
[Figure: a RAID distributed system; on each site, the DBMS and other applications run over the OS (or a DBOS).] RAID supports reliability, transactions, stable storage, and buffer pool management.
363
Transaction Management in one Server
[Figure: transaction management in one server. A user process (UI and AD) interacts with the TM process (AM, AC, CC, RC) and the local database, and exchanges two messages with remote RAID sites.]
364
Server CPU Time (second)
[Table: CPU time (seconds, user and system) used by RAID servers AC, CC, AD, and AM in executing four transactions: select one tuple, select eleven tuples, insert twenty tuples, update one tuple. Values extracted from the slide, in order: AC/CC: 0.04, 0.14, 0.06; 0.08, 0.02; 0.20, 0.16, 0.12, 0.13; 0.10. AD/AM: 0.34, 0.90, 0.00; 0.54, 1.48; 1.23, 3.10, 0.14, 0.71; 0.76, 0.04, 0.58. Several cells did not survive conversion.]
365
RAID Elapsed Time for Transactions in seconds
[Table: elapsed time in seconds on 1, 2, 3, and 4 sites for select one tuple, select eleven tuples, insert twenty tuples, and update one tuple. Surviving values: select one tuple, 0.3 to 0.4; insert twenty tuples, 0.6 to 0.8; the remaining cells did not survive conversion.]
366
RAID Execution Time in seconds
[Table: execution time in seconds on 1, 2, 3, and 4 sites. Surviving values: select one tuple, 0.4; select eleven tuples, 0.5; insert twenty tuples, 0.7 to 0.8; the remaining cells did not survive conversion.]
367
Performance Comparison of the Communication Libraries
Message († multicast, dest = 5) | Length (bytes) | Raidcomm V.1 (µs) | Raidcomm V.2 (µs) | Raidcomm V.3 (µs)
SendNull | 44 | 2462 | 1113 | 683
MultiNull † | | 12180 | 1120 | 782
Send Timestamp | 48 | 2510 | 1157 | 668
Send Relation Descriptor | 76 | 2652 | 1407 | 752
Send Relation Descriptor † | 72 | 12330 | 1410 | 849
Send Relation | 156 | 3864 | 2665 | 919
Send Write Relation | 160 | 3930 | 2718 | 1102
368
Experiences with RAID Distributed Database
Unix influences must be factored out. Communications software costs dominate everything else. Server based systems can provide modularity and efficiency. Concurrent execution in several server types is hard to achieve. Need very tuned system to conduct experiments. Data is not available from others for validation. Expensive research direction, but is respected and rewarded.
369
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
370
Useful References E. Pitoura and B. Bhargava, Data Consistency in Intermittently Connected Distributed Systems, IEEE TKDE, 11(6), 1999. E. Pitoura and G. Samaras, Data Management for Mobile Computing, Kluwer Academic Publishers, 1998. S. Bhowmick, S. Madria, and W. K. Ng, Web Data Management: A Warehouse Approach, Springer, 2003.
371
What is Pervasive Computing?
“Pervasive computing is a term for the strongly emerging trend toward: – Numerous, casually accessible, often invisible computing devices – Frequently mobile or embedded in the environment – Connected to an increasingly ubiquitous network structure.” – NIST, Pervasive Computing 2001
372
Mobile and Wireless Computing
Goal: Access Information Anywhere, Anytime, and in Any Way. Aliases: Mobile, Nomadic, Wireless, Pervasive, Invisible, Ubiquitous Computing. Distinction: Fixed wired network: Traditional distributed computing. Fixed wireless network: Wireless computing. Wireless network: Mobile Computing. Key Issues: Wireless communication, Mobility, Portability.
373
Why Mobile Data Management?
Wireless connectivity and the use of PDAs and handheld computing devices are on the rise. Workforces will carry extracts of corporate databases with them to have continuous connectivity. Central database repositories are needed to serve these work groups and keep them fairly up-to-date and consistent.
374
Mobile Applications Applications:
Expected to create an entirely new class of applications and massive new markets in conjunction with the Web: mobile information appliances combining personal computing and consumer electronics.
Applications:
Vertical: vehicle dispatching, tracking, point of sale.
Horizontal: mail-enabled applications, filtered information provision, collaborative computing…
375
Mobile Data Applications
Sales Force Automation - especially in pharmaceutical industry, consumer goods, parts Financial Consulting and Planning Insurance and Claim Processing - Auto, General, and Life Insurance Real Estate/Property Management - Maintenance and Building Contracting Mobile E-commerce
376
Mobility – Impact on DBMS
Handling/representing fast-changing data
Scale
Data shipping vs. query shipping
Transaction management
Replica management
Integrity constraint enforcement
Recovery
Location management
Security
User interfaces
377
DBMS Industry Scenario
Most RDBMS vendors support the mobile scenario - but no design and optimization aids Specialized Environments for mobile applications: Sybase Remote Server Synchrologic iMOBILE Microsoft SQL server - mobile application support Oracle Lite Xtnd-Connect-Server (Extended Technologies) Scoutware (Riverbed Technologies)
378
Query Processing
New issues: energy-efficient query processing; location-dependent query processing.
Old issues in a new context: cost model.
379
Location Management
New issues: tracking mobile users.
Old issues in a new context: managing update-intensive location information; providing replication to reduce latency for location queries; consistent maintenance of location information.
380
Transaction Processing
New issues: recovery of mobile transactions; lock management in mobile transactions.
Old issues in a new context: extended transaction models; partitioning objects while maintaining correctness.
381
Data Processing Scenario
One server or many servers; shared data.
Some local data per client, mostly a subset of the global data.
Need for accurate, up-to-date information, but some applications can tolerate bounded inconsistency.
Client-side and server-side computing.
Long disconnection should not constrain availability.
Mainly serial transactions at mobile hosts.
Update propagation and installation.
382
Mobile Network Architecture
383
Wireless Technologies
Wireless local area networks (WaveLAN, Aironet): possible transmission errors; 1.2 Kbps to 15 Mbps.
Cellular wireless (GSM, TDMA, CDMA): low bandwidth, low speed, long range; digital, Kbps range.
Packet radio (Metricom): low bandwidth, high speed, low range and cost.
Paging networks: one-way.
Satellites (Inmarsat, Iridium (LEO)): long latency, long range, high cost.
384
Terminologies
GSM (Global System for Mobile Communication): allows eight simultaneous calls on the same radio frequency and uses narrowband TDMA; it uses time as well as frequency division.
TDMA (Time Division Multiple Access): a frequency band is chopped into several channels or time slots which are then stacked into shorter time units, facilitating the sharing of a single channel by several calls.
CDMA (Code Division Multiple Access): data can be sent over multiple frequencies simultaneously, optimizing the use of available bandwidth; data is broken into packets, each given a unique identifier, so that they can be sent out over multiple frequencies and then re-built in the correct order by the receiver.
385
Mobility Characteristics
Location changes: location management; the cost to locate is added to communication.
Heterogeneity in services: bandwidth restrictions and variability.
Dynamic replication of data: data and services follow users.
Querying data: location-based responses.
Security and authentication.
System configuration is no longer static: bursty network activity during connections.
386
What Needs to be Reexamined?
Operating systems: TinyOS
File systems: CODA
Database systems: TinyDB
Communication architecture and protocols
Hardware and architecture
Real-time, multimedia, QoS
Security
Application requirements and design
PDA design: interfaces, languages
387
Mobility Constraints
CPU power
Variable bandwidth; delay tolerance, but unreliable
Physical size
Constraints on peripherals and GUIs
Frequent location changes
Security
Heterogeneity
Expensive
Frequent disconnections, but predictable
388
What is Mobility?
A device that moves between different geographical locations and between different networks.
A person who moves between different networks, different communication devices, and different applications.
389
Device Mobility
A laptop moves between Ethernet, WaveLAN, and Metricom networks: wired and wireless network access.
Potentially continuous connectivity, but there may be breaks in service.
Network address changes; radically different network performance on different networks; network interface changes.
Can we achieve the best of both worlds: the continuous connectivity of wireless access and the performance of better networks when available?
390
Mobility Means Changes
Addresses: IP addresses.
Network performance: bandwidth, delay, bit error rates, cost, connectivity.
Network interfaces: PPP, eth0, strip.
Between applications: different interfaces over phone and laptop.
Within applications: loss of bandwidth triggers a change from color to B&W.
Available resources: files, printers, displays, power, even routing.
391
Bandwidth Management
Clients are assumed to have weak and/or unreliable communication capabilities.
Broadcast: scalable but high latency.
On-demand: less scalable and requires a more powerful client, but better response.
Client caching allows bandwidth conservation.
392
Energy Management
Battery life is expected to increase by only 20% in the next 10 years.
Reduce the number of messages sent.
Doze modes.
Power-aware system software.
Power-aware microprocessors.
Indexing wireless data to reduce tuning time.
393
Wireless characteristics
Variant connectivity: low bandwidth and reliability; frequent disconnections, predictable or sudden.
Asymmetric communication: broadcast medium.
Monetarily expensive: charges per connection or per message/packet.
Connectivity is weak, intermittent, and expensive.
394
Portable Information Devices
PDAs, personal communicators: light, small, and durable, to be easily carried around; dumb terminals, palmtops, wristwatch PC/phone.
Will run on AA+/Ni-Cd/Li-Ion (Lithium-Ion) batteries; may be diskless.
I/O devices: the mouse is out, the pen is in.
Wireless connection to information networks: either infrared or cellular phone.
Specialized hardware (for compression/encryption).
395
Portability Characteristics
Battery power restrictions: transmit/receive, disk spinning, display, CPUs, and memory all consume power.
Battery lifetime will see only a very small increase, so we need energy-efficient hardware (CPUs, memory) and system software.
Planned disconnections: doze mode.
Power consumption vs. resource utilization.
396
Portability Characteristics Cont.
Resource constraints: mobile computers are resource-poor.
Reduce program size: interpret script languages (Mobile Java?).
Computation and communication load cannot be distributed equally.
Small screen sizes: asymmetry between static and mobile computers; Query By Icons (QBI), an iconic visual language [Massari & Chrysanthis 95].
397
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
398
Useful References
B. Bhargava and L. Lilien, Private and Trusted Collaborations, in Proceedings of Secure Knowledge Management (SKM), Amherst, NY, Sep.
W. Wang, Y. Lu, and B. Bhargava, On Security Study of Two Distance Vector Routing Protocols for Mobile Ad Hoc Networks, in Proc. of IEEE Intl. Conf. on Pervasive Computing and Communications (PerCom), Dallas-Fort Worth, TX, March 2003.
B. Bhargava, Y. Zhong, and Y. Lu, Fraud Formalization and Detection, in Proc. of 5th Intl. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), Prague, Czech Republic, September 2003.
B. Bhargava, C. Farkas, L. Lilien, and F. Makedon, Trust, Privacy, and Security, Summary of a Workshop Breakout Session at the National Science Foundation Information and Data Management (IDM) Workshop held in Seattle, Washington, September 2003, CERIAS Tech Report, CERIAS, Purdue University, November 2003.
P. Ruth, D. Xu, B. Bhargava, and F. Regnier, E-Notebook Middleware for Accountability and Reputation Based Trust in Distributed Data Sharing Communities, in Proc. of the Second International Conference on Trust Management (iTrust), Oxford, UK, March 2004.
399
Motivation Sensitivity of personal data
Sensitivity of personal data: 82% are willing to reveal their favorite TV show, but only 1% are willing to reveal their SSN.
Business losses due to privacy violations: online consumers worry about revealing personal data; this fear held back $15 billion in online revenue in 2001.
Federal privacy acts to protect privacy: e.g., the Privacy Act of 1974 for federal agencies, and the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
Still, there are many examples of privacy violations, even by federal agencies: e.g., JetBlue Airways revealed travelers' data to the federal government.
400
Privacy and Trust Privacy Problem Trust must be established
Consider computer-based interactions, from a simple transaction to a complex collaboration. Interactions involve the dissemination of private data; it is voluntary, "pseudo-voluntary," or required by law. Threats of privacy violations result in lower trust, and lower trust leads to isolation and lack of collaboration.
Trust must be established:
Data: provide quality and integrity.
End-to-end communication: sender authentication, message integrity.
Network routing algorithms: deal with malicious peers, intruders, security attacks.
401
Fundamental Contributions
Provide measures of privacy and trust.
Empower users (peers, nodes) to control privacy in ad hoc environments: privacy of user identification; privacy of user movement.
Provide privacy in data dissemination: collaboration; data warehousing; location-based services.
Tradeoff between privacy and trust: minimal privacy disclosures; disclose only the private data absolutely necessary to gain the level of trust required by the partner system.
402
Outline Assuring privacy in data dissemination Privacy-trust tradeoff
Privacy metrics
403
1. Privacy in Data Dissemination
[Figure: a private data owner ("Owner") entrusts "Data" to an original guardian, which passes it along to second- and third-level guardians (Guardian 1 through Guardian 6).]
"Guardian": an entity entrusted by private data owners with the collection, storage, or transfer of their data. An owner can be a guardian for its own private data, and an owner can be an institution or a system. Guardians are allowed or required by law to share private data: with the owner's explicit consent, or without the consent as required by law (research, court order, etc.).
404
Problem of Privacy Preservation
A guardian passes private data to another guardian in a data dissemination chain (a chain within a graph, possibly cyclic).
Owner privacy preferences are not transmitted due to neglect or failure; the risk grows with chain length and with the fallibility and hostility of the milieu.
If the preferences are lost, the receiving guardian is unable to honor them.
405
Challenges
Ensuring that an owner's metadata are never decoupled from his data (metadata include the owner's privacy preferences).
Efficient protection in a hostile milieu. Threats, for example: uncontrolled data dissemination; intentional or accidental data corruption, substitution, or disclosure.
Detection of data or metadata loss.
Efficient data and metadata recovery (recovery by retransmission from the original guardian is most trustworthy).
406
Proposed Approach Design self-descriptive private objects
Construct a mechanism for apoptosis of private objects apoptosis = clean self-destruction Develop proximity-based evaporation of private objects
407
A. Self-descriptive Private Objects
Comprehensive metadata include: the owner's privacy preferences; guardian privacy policies; metadata access conditions; enforcement specifications; data provenance; context-dependent and other components.
These describe: how to read and write private data (for the original and/or subsequent data guardians); how to verify and modify metadata; how to enforce preferences and policies; who created, read, modified, or destroyed any portion of the data; application-dependent elements; customer trust levels for different contexts; other metadata elements.
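One way to picture such an object is the sketch below; all field names are our own invention, not a published format.

```python
# Hypothetical layout of a self-descriptive private object.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProvenanceRecord:
    actor: str        # who created, read, modified, or destroyed data
    action: str
    timestamp: float

@dataclass
class PrivateObject:
    data: bytes                           # the private data itself
    owner_preferences: Dict[str, str]     # owner's privacy preferences
    guardian_policies: Dict[str, str]     # guardian privacy policies
    access_conditions: Dict[str, str]     # metadata access conditions
    enforcement_spec: str                 # how preferences/policies are enforced
    provenance: List[ProvenanceRecord] = field(default_factory=list)
```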
408
Notification in Self-descriptive Objects
Self-descriptive objects simplify notifying owners or requesting their permissions. Contact information is available in the data provenance component. Notifications and requests are sent to owners immediately, periodically, or on demand, via pagers, SMSs, e-mail, mail, etc.
409
Optimization of Object Transmission
Transmitting complete objects between guardians is inefficient They describe all foreseeable aspects of data privacy For any application and environment Solution: prune transmitted metadata Use application and environment semantics along the data dissemination chain
410
B. Apoptosis of Private Objects
Assuring privacy in data dissemination:
In benevolent settings: use an atomic self-descriptive object with retransmission recovery.
In malevolent settings: when an attacked object is threatened with disclosure, use apoptosis (clean self-destruction).
Implementation: detectors, triggers, code.
False positives: dealt with by retransmission recovery; repetitions are limited to prevent denial-of-service attacks.
False negatives.
411
C. Proximity-based Evaporation of Private Data
Perfect data dissemination not always desirable Example: Confidential business data shared within an office but not outside Idea: Private data evaporate in proportion to their “distance” from their owner “Closer” guardians trusted more than “distant” ones Illegitimate disclosures more probable at less trusted “distant” guardians Different distance metrics Context-dependent
412
Examples of Metrics
Examples of one-dimensional distance metrics:
Distance ~ business type.
Distance ~ distrust level: more trusted entities are "closer".
Multi-dimensional distance metrics: security/reliability as one of the dimensions.
[Figure: Bank I is the original guardian; Banks II and III are nearest, Insurance Companies A, B, and C are farther, and Used Car Dealers 1, 2, and 3 are farthest.]
If a bank is the original guardian, then any other bank is "closer" than any insurance company, and any insurance company is "closer" than any used car dealer.
413
Evaporation Implemented as Controlled Data Distortion
Distorted data reveal less, protecting privacy. Examples (accurate → more and more distorted):
250 N. Salisbury Street, West Lafayette, IN → Salisbury Street, West Lafayette, IN → somewhere in West Lafayette, IN
[home address], [home phone] → [office address] (250 N. University Street), [office phone] → [P.O. box] (P.O. Box 1234), [office fax]
414
Evaporation as Apoptosis Generalization
Context-dependent apoptosis for implementing evaporation Apoptosis detectors, triggers, and code enable context exploitation Conventional apoptosis as a simple case of data evaporation Evaporation follows a step function Data self-destructs when proximity metric exceeds predefined threshold value
415
Outline Assuring privacy in data dissemination Privacy-trust tradeoff
Privacy metrics
416
2. Privacy-trust Tradeoff
Problem: to build trust in open environments, users provide digital credentials that contain private information. How can one gain a certain level of trust with the least loss of privacy?
Challenges: privacy and trust are fuzzy and multi-faceted concepts. The amount of privacy lost by disclosing a piece of information is affected by: who will get this information; possible uses of this information; information disclosed in the past.
417
Proposed Approach Formulate the privacy-trust tradeoff problem
Estimate privacy loss due to disclosing a set of credentials Estimate trust gain due to disclosing a set of credentials Develop algorithms that minimize privacy loss for required trust gain
418
A. Formulate Tradeoff Problem
Given: a set of private attributes that the user wants to conceal, and a set of credentials, split into the subset of revealed credentials R and the subset of unrevealed credentials U.
Choose a subset of credentials NC from U such that:
NC satisfies the requirements for trust building, and
PrivacyLoss(NC ∪ R) − PrivacyLoss(R) is minimized.
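Read literally, this formulation admits a naive exhaustive search; in the sketch below, privacy_loss and satisfies_trust stand in for the estimators developed in the following slides.

```python
# Naive exhaustive search for the tradeoff problem; illustrative only.

from itertools import combinations

def choose_credentials(U, R, privacy_loss, satisfies_trust):
    best, best_cost = None, float("inf")
    for k in range(len(U) + 1):
        for NC in combinations(U, k):
            disclosed = set(NC) | set(R)
            if satisfies_trust(disclosed):
                cost = privacy_loss(disclosed) - privacy_loss(set(R))
                if cost < best_cost:
                    best, best_cost = set(NC), cost
    return best, best_cost
```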
419
Formulate Tradeoff Problem - cont.1
If multiple private attributes are considered: Weight vector {w1, w2, …, wm} for private attributes Privacy loss can be evaluated using: The weighted sum of privacy loss for all attributes The privacy loss for the attribute with the highest weight
420
B. Estimate Privacy Loss
Query-independent privacy loss Provided credentials reveal the value of a private attribute User determines her private attributes Query-dependent privacy loss Provided credentials help in answering a specific query User determines a set of potential queries that she is reluctant to answer
421
Privacy Loss Estimation Methods
Probability method:
Query-independent privacy loss is measured as the difference between entropy values.
Query-dependent privacy loss for a query is measured as the difference between entropy values; the total privacy loss is determined by the weighted average.
Conditional probabilities are needed for entropy evaluation; Bayes networks and kernel density estimation will be adopted.
Lattice method (estimates query-independent loss):
Each credential is associated with a tag indicating its privacy level with respect to an attribute aj.
The tag set is organized as a lattice.
Privacy loss is measured as the least upper bound of the privacy levels of the candidate credentials.
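The entropy-difference idea behind the probability method can be sketched as follows; obtaining the distributions (e.g., via Bayes networks) is out of scope here.

```python
# Privacy loss for an attribute as the drop in entropy of its value
# distribution once credentials are revealed; distributions are plain dicts.

import math

def entropy(dist):
    """dist maps attribute values to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def privacy_loss(dist_before, dist_after):
    # Positive when disclosure made the attribute value more predictable.
    return entropy(dist_before) - entropy(dist_after)

# E.g., a credential narrowing "home state" from uniform over 50 states to
# certainty loses log2(50) ~ 5.6 bits of privacy.
```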
422
C. Estimate Trust Gain Increasing trust level
Increasing trust level: adopt research on trust establishment and management.
Benefit function B(trust_level): provided by the service provider or derived from the user's utility function.
Trust gain = B(trust_level_new) − B(trust_level_prev).
423
D. Minimize Privacy Loss for Required Trust Gain
Can measure privacy loss (B) and can estimate trust gain (C) Develop algorithms that minimize privacy loss for required trust gain User releases more private information System’s trust in user increases How much to disclose to achieve a target trust level?
424
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
425
Outline Assuring privacy in data dissemination Privacy-trust tradeoff
Privacy metrics
426
3. Privacy Metrics
Problem: how to determine that a certain degree of data privacy is provided?
Challenges: different privacy-preserving techniques or systems claim different degrees of data privacy, and metrics are usually ad hoc and customized (for a user model, or for a specific technique/system). We need to develop uniform privacy metrics to confidently compare different techniques/systems.
427
Requirements for Privacy Metrics
Privacy metrics should account for:
Dynamics of legitimate users: how do users interact with the system? E.g., repeated patterns of accessing the same data can leak information to a violator.
Dynamics of violators: how much information does a violator gain by watching the system for a period of time?
Associated costs: storage, injected traffic, consumed CPU cycles, delay.
428
Proposed Approach Anonymity set size metrics Entropy-based metrics
429
A. Anonymity Set Size Metrics
The larger the set of indistinguishable entities, the lower the probability of identifying any one of them. This can be used to "anonymize" a selected private attribute value within the domain of all its possible values: "hiding in a crowd". A member of a crowd of n is "more" anonymous (probability 1/n) than a member of a crowd of 4 (probability 1/4).
430
Anonymity Set Anonymity set A A = {(s1, p1), (s2, p2), …, (sn, pn)}
si: subject i who might access private data, or the i-th possible value for a private data attribute.
pi: the probability that si accessed the private data, or the probability that the attribute assumes the i-th possible value.
431
Effective Anonymity Set Size
The effective anonymity set size L is computed from the probabilities pi (the slide's formula was lost in conversion). The maximum value of L is |A|, attained iff all pi's are equal to 1/|A|. L falls below the maximum when the distribution is skewed, i.e., when the pi's have different values. Deficiency: L does not consider the violator's learning behavior.
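Because the formula itself is missing here, the sketch below uses one common instantiation with the stated properties (maximum |A| exactly at the uniform distribution, smaller when skewed): the entropy-based effective size 2^H. This is an assumption, not necessarily the slides' definition.

```python
# Effective anonymity set size, instantiated (by assumption) as 2**H(p).

import math

def effective_size(probs):
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** H

print(effective_size([0.25, 0.25, 0.25, 0.25]))  # 4.0 == |A| (uniform)
print(effective_size([0.7, 0.1, 0.1, 0.1]))      # ~2.57 < 4  (skewed)
```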
432
B. Entropy-based Metrics
Entropy measures the randomness, or uncertainty, in private data When a violator gains more information, entropy decreases Metric: Compare the current entropy value with its maximum value The difference shows how much information has been leaked
433
Dynamics of Entropy Decrease of system entropy with attribute disclosures (capturing dynamics): when entropy reaches a threshold (b), data evaporation can be invoked to increase entropy by controlled data distortions; when entropy drops to a very low level (c), apoptosis can be triggered to destroy private data. Entropy increases (d) if the set of attributes grows or the disclosed attributes become less valuable, e.g., obsolete, or more data is now available.
[Figure: entropy level over time for all attributes vs. disclosed attributes, starting at the maximum H*, with phases (a) through (d).]
434
Quantifying Privacy Loss
Privacy loss D(A,t) at time t, when a subset of attribute values A might have been disclosed, is the gap between the maximum and current entropy:
D(A,t) = H*(A) − H(A,t)
H*(A): the maximum entropy, computed when the probability distribution of the pi's is uniform.
H(A,t): the entropy at time t.
wj: weights capturing the relative privacy "value" of attributes, used in computing the entropies.
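A sketch under the assumption that the attribute entropies are combined as a weighted sum; the exact weighting in the lost formula may differ.

```python
# Weighted privacy-loss sketch: D = H*(A) - H(A,t), with per-attribute
# entropies combined by weights w_j (the weighting scheme is assumed).

import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def privacy_loss(dists_now, dists_uniform, weights):
    H_star = sum(w * H(d) for w, d in zip(weights, dists_uniform))
    H_now = sum(w * H(d) for w, d in zip(weights, dists_now))
    return H_star - H_now
```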
435
Using Entropy in Data Dissemination
Specify two thresholds for D For triggering evaporation For triggering apoptosis When private data is exchanged Entropy is recomputed and compared to the thresholds Evaporation or apoptosis may be invoked to enforce privacy
436
Secure Data Warehouse
437
Basics of Data Warehouse
Data warehouse is an integrated repository derived from multiple distributed source databases. Created by replicating or transforming source data to new representation. Some data can be web-database or regular databases (relational, files, etc.). Warehouse creation involves reading, cleaning, aggregating, and storing data. Warehouse data is used for strategic analysis, decision making, market research types of applications. Open access to third party users.
438
Examples: Human genome databases.
Drug-drug interactions database created by thousands of doctors in hundreds of hospitals. Stock prices, analyst research. Teaching material (slides, exercises, exams, examples). Census data or similar statistics collected by government.
439
Ideas for Security
Replication
Aggregation and generalization
Exaggeration and mutilation
Anonymity
User profiles, access permissions
440
Anonymity One can divulge information to a third party without revealing where it came from and without necessarily revealing the system has done so. User privacy and warehouse data privacy. User does not know the source of data. Warehouse system does not store the results and even the access path for the query. Separation of storage system and audit query system. Non-intrusive auditing and monitoring. Distribution of query processing, logs, auditing activity. Secure multi-party computation. Mental poker (card distribution).
441
Equivalent Views Witness (Permission Inference)
A user can execute query Q if there is an equivalent query Q′ for which the user has permission. Security is on the result, not the computation. Create views over mutually suspicious organizations by filtering out sensitive data.
442
Similarity Depends on Application
Two objects might be similar to a K-12 student, but not to a scientist. The 1999 and 1995 annual reports of the CS department might be similar to a graduate school applicant, but not to a faculty applicant.
Goal: use ideas of replication to provide security by using a variety of similarity criteria, with different QoS to match different classes of users.
443
Similarity Based Replication
SOME DEFINITIONS:
Distance functions are used to determine how similar two objects are (distance-preserving transformations).
Precision: the fraction of retrieved data that is needed (relevant) for the user query.
False positive: an object is retrieved that appears similar to the data needed by the query, but is not.
False negative: an object is needed by the query, but is not retrieved.
444
Access Permission
Information permission (system-wide): e.g., employee salary is releasable to payroll clerks and cost analysts.
Physical permission (local): e.g., cost analysts are allowed to run queries on the warehouse.
445
Cooperation Instead of Autonomy in Warehouse
In the UK, the Audit Commission estimated losses on the order of $2 billion; the Japanese Yakuza made a profit of $7 billion. A secure organization needs to secure data as well as its interpretation (interpretation integrity): the integrity of the data may be intact, yet the benefit rules can be interpreted wrongly and misapplied.
446
Extensions to the SQL Grant/Revoke Security Model
Limitation is a generalization of revoke. Limitation predicates should apply only to paths (this reduces the chance of inadvertent and malicious denial of service). One can add either limitation or reactivation, or both. Limitation respects lines of authority, and flexibility can be provided to limitation.
447
Aggregation and Generalization
Summaries, Statistics (over large or small set of records) (various levels of granularity) Graphical image with numerical data. Reduce the resolution of images. Approximate answers (real-time vs. delayed quotes, blood analysis results) Inherit access to related data.
448
Dynamic
Authenticate users dynamically and provide access privileges. A mobile agent interacts with the user and provides authentication and personalized views based on analysis and verification: a rule-based interaction session; analysis of the user input; determination of the user's validity, creation of a session id for the user, and assignment of access permissions.
449
Exaggeration and Misleading
Give the low or high range of normal values initially (semantically normal). Give partially incorrect or difficult-to-verify data; quality improves if security is assured. Give old data, check the damage done, then give better data. Give projected values rather than actual values.
450
User Profile User profiles are used for providing different levels of security. Each user can have a profile stored at the web server or at third party server. User can change profile attributes at run-time. User behavior is taken into account based on past record. Mobile agent accesses the web page on behalf of the user and tries to negotiate with web server for the security level.
451
User Profile
Personal category: personal identification; name, DOB, SSN, etc.
Data category: document content (keywords); document structure (audio/video, links); source of data.
Delivery data: web views, …
Secure data category.
452
Static
A predefined set of user names, domain names, and access restrictions for each (restricted and inflexible). Virtual views, materialized views, query-driven.
Build user profiles and represent them by: past behavior; feedback; earlier queries (type, content, and duration).
453
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Design Distributed Query Processing Distributed Transaction Management Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems
454
Useful References Y. Lu, W. Wang, D. Xu, and B. Bhargava, Trust-Based Privacy Preservation for Peer-to-peer, in the 1st NSF/NSA/AFRL workshop on secure knowledge management (SKM), Buffalo, NY, Sep
455
Problem statement Privacy in peer-to-peer systems is different from the anonymity problem. Preserve the privacy of the requester: a mechanism is needed to remove the association between the identity of the requester and the data needed.
456
Proposed solution A mechanism is proposed that allows the peers to acquire data through trusted proxies to preserve the privacy of the requester. The data request is handled through the peer's proxies; a proxy can become a supplier later and mask the original requester.
457
Related work Trust in privacy preservation
Authorization based on evidence and trust Developing pervasive trust Hiding the subject in a crowd K-anonymity Broadcast and multicast
458
Related work (2)
Fixed servers and proxies: Publius.
Building a multi-hop path to hide the real source and destination: FreeNet, Crowds, Onion routing.
459
Related work (3)
Herbivore provides sender-receiver anonymity by transmitting packets to a broadcast group, and provides provable anonymity in peer-to-peer communication systems by adopting dining-cryptographer networks.
460
Privacy measurement A tuple <requester ID, data handle, data content> is defined to describe a data acquirement. For each element, "0" means that the peer knows nothing, while "1" means that it knows everything. A state in which the requester's privacy is compromised can be represented as a vector <1, 1, y> (y ∈ [0,1]), from which one can link the ID of the requester to the data that it is interested in.
461
Privacy measurement (2)
For example, line k in the accompanying figure represents the states in which the requester's privacy is compromised.
462
Mitigating collusion An operation “*” is defined as:
This operation describes the information revealed after a collusion of two peers, when each peer knows a part of the "secret". The number of collusions required to compromise the secret can be used to evaluate the achieved privacy.
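The formal definition of "*" did not survive conversion; a natural reading, assumed here, combines each element of the two peers' knowledge vectors by taking the larger value:

```python
# Assumed element-wise reading of the "*" collusion operation over
# <requester ID, data handle, data content> knowledge vectors.

def collude(k1, k2):
    """Knowledge after two peers pool what they each know."""
    return tuple(max(a, b) for a, b in zip(k1, k2))

# A proxy that knows the requester's ID colludes with a supplier that
# knows the data content:
print(collude((1, 0.5, 0), (0, 0.5, 1)))  # (1, 0.5, 1): privacy compromised
```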
463
Trust based privacy preservation scheme
The requester asks one proxy to look up the data on its behalf. Once the supplier is located, the proxy will get the data and deliver it to the requester.
Advantage: other peers, including the supplier, do not know the real requester.
Disadvantage: the privacy depends solely on the trustworthiness and reliability of the proxy.
464
Trust based scheme – Improvement 1
To avoid specifying the data handle in plain text, the requester calculates the hash code and only reveals a part of it to the proxy. The proxy sends it to possible suppliers. Receiving the partial hash code, the supplier compares it to the hash codes of the data handles that it holds. Depending on the revealed part, multiple matches may be found. The suppliers then construct a bloom filter based on the remaining parts of the matched hash codes and send it back. They also send back their public key certificates.
465
Trust based scheme – Improvement 1
Examining the filters, the requester can eliminate some candidate suppliers and find those who may have the data. It then encrypts the full data handle and a data transfer key kdata with the public key. The supplier sends the data back using kdata through the proxy.
Advantages:
It is difficult to infer the data handle through the partial hash code.
The proxy alone cannot compromise the privacy.
By adjusting the revealed portion of the hash code, the allowable error of the Bloom filter can be determined.
466
Data transfer procedure after improvement 1
[Figure: message flow Requester ↔ Proxy of Requester ↔ Supplier. R: requester; S: supplier.]
Steps 1, 2: R sends out the partial hash code of the data handle.
Steps 3, 4: S sends back the Bloom filter of the matching handles and the public key certificates.
Steps 5, 6: R sends the data handle and the data transfer key kdata, encrypted with the public key.
Steps 7, 8: S sends the required data, encrypted with kdata.
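A toy end-to-end sketch of this exchange; the hash split, filter size, and all names are illustrative choices, not the paper's parameters.

```python
# Partial-hash lookup with a toy Bloom filter, as in Improvement 1.

import hashlib

def handle_hash(handle: str) -> str:
    return hashlib.sha256(handle.encode()).hexdigest()

def partial(h: str, prefix_len: int = 8) -> str:
    return h[:prefix_len]  # the requester reveals only this prefix

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _positions(self, item):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:4], "big") % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p
    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def supplier_response(handles, revealed_prefix):
    # Filter built over the *remaining* parts of the matched hash codes.
    bf = BloomFilter()
    for h in map(handle_hash, handles):
        if h.startswith(revealed_prefix):
            bf.add(h[len(revealed_prefix):])
    return bf

# Requester checks whether a candidate supplier may hold the data:
h = handle_hash("song.mp3")
bf = supplier_response(["song.mp3", "movie.avi"], partial(h))
assert bf.might_contain(h[8:])
```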
467
Trust based scheme – Improvement 2
The above scheme does not protect the privacy of the supplier To address this problem, the supplier can respond to a request via its own proxy
468
Trust based scheme – Improvement 2
[Figure: data transfer with both proxies: Requester ↔ Proxy of Requester ↔ Proxy of Supplier ↔ Supplier.]
469
Trustworthiness of peers
The trust value of a proxy is assessed based on its behaviors and other peers’ recommendations Using Kalman filtering, the trust model can be built as a multivariate, time-varying state vector
470
Experimental platform - TERA
The trust-enhanced role mapping (TERM) server assigns roles to users based on: uncertain and subjective evidence; dynamic trust.
The reputation server is a dynamic trust information repository; it evaluates reputation from trust information using algorithms specified by the TERM server.
471
Trust enhanced role assignment architecture (TERA)
472
Conclusion A trust-based privacy preservation method for peer-to-peer data sharing is proposed; it adopts the proxy scheme during data acquirement.
Extensions: solid analysis and experiments on large-scale networks are required; a security analysis of the proposed mechanism is required.
473
Peer to Peer Systems and Streaming
474
Useful References G. Ding and B. Bhargava, Peer-to-peer File-sharing over Mobile Ad hoc Networks, in the First International Workshop on Mobile Peer-to-Peer Computing, Orlando, Florida, March 2004. M. Hefeeda, A. Habib, B. Botev, D. Xu, and B. Bhargava, PROMISE: Peer-to-Peer Media Streaming Using CollectCast, In Proc. of ACM Multimedia 2003, 45-54, Berkeley, CA, November 2003.
475
Overview of Peer-to-Peer (P2P) Systems
Autonomy: no central server; peers have similar power.
Share resources among a large number of peers.
P2P is a distributed system where peers collaborate to accomplish tasks.
476
P2P Applications
P2P file-sharing: Napster, Gnutella, KaZaA, eDonkey, etc.
P2P communication: instant messaging; mobile ad hoc networks.
P2P computation
477
P2P Searching Algorithms
Search for a file, data, or peer.
Unstructured: Napster, Gnutella, KaZaA, eDonkey, etc.
Structured: Chord, Pastry, Tapestry, CAN, etc.
478
Napster: Central Directory Server
If Bob wants to contact Alice, he must go through the central server.
Benefits: efficient search; limited bandwidth usage; no per-node state.
Drawbacks: central point of failure; limited scale; copyrights.
[Figure: peers Bob, Alice, Judy, and Jane connected through a central server.]
479
Gnutella: Distributed Flooding
If Bob wants to talk to Alice, he must broadcast the request and gets the information from Jane.
Benefits: no central point of failure; limited per-node state.
Drawbacks: slow searches; bandwidth-intensive; scalability.
[Figure: peers Carl, Jane, Bob, Alice, and Judy connected in a flooding overlay.]
480
KaZaA: Hierarchical Searching
Bob talks to Alice via super-node servers SB and SA.
Popularity: more than 3 M peers; over 3,000 terabytes; >50% of Internet traffic(?).
Benefits: only super-nodes do searching; parallel downloading; recovery.
Drawbacks: copyrights.
[Figure: Bob reaches Alice through super-nodes SB and SA.]
481
P2P Streaming
Peers are characterized as: highly diverse; dynamic; having limited capacity and reliability.
Problem: how to select and coordinate multiple peers to render the best possible streaming quality? Streaming large videos is our focus.
Streaming has stringent real-time constraints and consumes a lot of resources (bandwidth and storage). Bulk download may not provide a satisfactory answer: imagine how long it would take to download a one-hour movie in its entirety before watching.
The work has applications beyond multimedia streaming: data streaming, e.g., scientific data sharing and processing, where data generators (instruments) produce huge amounts of data that may need to be transferred in real time to a processing site with special-purpose equipment; video surveillance with several cameras sending.
Using multiple senders is critical: you do not want to overload suppliers, and sometimes some suppliers cannot send at full rate.
482
CollectCast (Developed at Purdue)
CollectCast is a new P2P service: a middleware layer between a P2P lookup substrate and applications that collects data from multiple senders.
Functions: infer and label the topology; select the best sending peers for each session; aggregate and coordinate contributions from peers; adapt to peer failures and network conditions.
Unlike other "casts", e.g., concast, CollectCast probabilistically selects peers based on peer and network conditions, and dynamically switches senders.
483
CollectCast (cont’d) CollectCast is installed on peers between the p2p substrate and the applications. PROMISE is a sample application. Other applications may include scientific data streaming.
484
Simulations Compare selection techniques in terms of
The aggregated received rate, and The aggregated loss rate With and without peer failures Impact of peer availability on size of candidate set Size of active set Load on peers
485
Simulation: Setup Topology Streaming session Peers
On average 600 routers and 1,000 peers Hierarchical (Internet-like) Streaming session Rate R0 = 1 Mb/s Duration = 60 minutes Loss tolerance level αu = 1.2 Peers Offered rate: uniform in [0.125R0, 0.5R0] Availability: uniform in [0.1, 0.9] Diverse P2P community Results are averaged over 100 runs with different seeds
486
Aggregate Rate: No Failures
Careful selection pays off!
487
PROMISE and Experiments on PlanetLab (Test-bed at Purdue)
PROMISE is a P2P media streaming system built on top of CollectCast. It has been tested in local and wide area environments; Pastry was extended to support multiple-peer lookup.
488
PlanetLab Experiments
PROMISE is installed on 15 nodes, using several MPEG-4 movie traces. Peers are selected using the topology-aware technique (the one used in CollectCast) and an end-to-end technique.
Evaluate: packet-level performance; frame-level performance and initial buffering; the impact of changing system parameters; peer failure and dynamic switching.
489
Packet-Level: Aggregated Rate
Smoother aggregated rate achieved by CollectCast
490
Conclusions New service for P2P networks (CollectCast)
Infer and leverage network performance information in selecting and coordinating peers PROMISE is built on top of CollectCast to demonstrate its merits Internet Experiments show proof of concept Streaming from multiple, heterogeneous, failure-prone, peers is indeed feasible Extend P2P systems beyond file sharing Concrete example of network tomography
491
Week 14, Lecture 2 Final Review