Distribution of Data and Processes The cost of acquiring a distributed system is now often within the budget of an individual department or plant, with no need for a massive corporate mainframe budget. However, the need for close cooperation with other departments also taking distributed options grows as departments gain autonomy: some consistency and uniformity across processes and departments remains desirable. The low cost and small scale of distributed systems make it easy for a department to ignore its role as a member of a wider community.
This lack of integration can negate any benefits of moving to a distributed option. 'Corporate' distributed applications are claimed to achieve the largest 'leverage' and payback. The major issues in distributed database systems centre not on the physical configuration of nodes and communications links but on processing within the architecture, data storage, and information flow.
The major organisational decisions are:
1. Organisation of corporate and private data
2. Separating operational and decision support functions
3. Organisation of the data warehouse
Operational and Decision Support Processing Operational processing operates on current-value data. Requirement-driven: it operates to a pre-defined requirement. Manipulates data on a detailed record-by-record basis. Has access to archival data that has a high probability of access and typically low volume. Ownership of data is very important. Mainly for the benefit of the community making up-to-the-second decisions.
Decision Support Systems Processing Supports managerial roles. Operates on archival data not subject to updates (snapshots). Operates on integrated data (from a variety of operational systems). Processing essentially data driven i.e. queries can be ad-hoc and driven by results of previous queries. No specific requirement statement. Operates on data in a collective way e.g. summaries and statistics. Used in making medium to long term decisions.
Private and Corporate Data There is a need for uniformity and consistency of data within a distributed environment. Within the network there is data of interest to individuals and other data of interest to the corporate whole. e.g. one node may be for a vehicle insurance actuary. No one is concerned with what the actuary does in terms of analysing vehicle data. But when the actuary modifies insurance rates, coverages, lengths of policies, renewal terms etc., the changes have to be made in a 'corporate' fashion so that everyone is aware of them.
At each node there is likely to be both corporate data and private data. Freely mixing corporate and private data in an unplanned manner will cause problems, as will mixing operational data and decision support data. Mixing all of these together in an unplanned manner is a recipe for disaster.
Node Residency The bulk of the data that a node needs to process should be held locally. This aids the definition of the system of record. It also reduces the complexity of the client-server network. Performance is improved.
Process Based (Functional) Each node holds data pertinent to the processing carried out there, e.g. the accounting node, the marketing node, the actuary's node. This will lead to much redundant data between nodes: unique processes, redundant data. It will also lead to a great deal of integrity checking between nodes. However, there is a strong desire to align node residency with organisational arrangements.
Data Based Each node holds one type of data, e.g. customer information, accounts information, premium information. Data redundancy is minimised but there is much overlap of processing between nodes, as different organisational functions want data integrated from the separate nodes: unique data, redundant processes. It is easier to produce the same software to execute on many nodes than software that runs on one node and 'shuffles' transactions across the network to maintain data integrity (see process-based). So data-based node residency leads to simplicity in terms of processing. Unfortunately it does not correspond to the organisational structures of enterprises, so there is some loss of local autonomy and control, itself a driver for client-server architectures.
Geography Based e.g. one node dealing with business in the South-West, one for the North-East. Although data and processing are redundant, traffic between nodes is minimal. Data structures and programs are replicated across each node.
‘Pure client/server’ Data not distributed. All data (and software possibly) held on a server node. Clients read data when required. System of record needs to be defined if clients allowed to update data.
System of Record Notion that for all data that can be updated at any moment in time there is one and only one node responsible for update. If, at any instant in time, 2 or more parties have 'control' of some data at the same time then there is a flaw in the system. Many different ways that the SOR can be implemented- hence the general term. The system of record needs to be clearly and accurately defined.
e.g. Insurance policy updates: a single node could be responsible for updating all policies; one node updates A-M, another N-Z; one node updates all policies except those formally passed to a different node for a specific time and then returned; etc. In all cases there is one 'owner' of the data at any moment in time.
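The A-M / N-Z arrangement above can be sketched as a simple routing rule. This is a minimal illustration, not a prescribed implementation; the node names and the split point are assumptions.

```python
# Sketch of a system-of-record lookup for the policy example above.
# Node names and the A-M / N-Z split are illustrative assumptions.

def owner_node(policy_holder: str) -> str:
    """Return the single node responsible for updating this policy."""
    first = policy_holder[0].upper()
    return "node_AM" if "A" <= first <= "M" else "node_NZ"

# Every update is routed to exactly one owner node, so no two nodes
# ever control the same policy at the same moment in time.
```

Whatever the chosen scheme, the essential property is the same: the routing function is deterministic, so there is exactly one owner per piece of data at any instant.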
The most 'natural' form of node residency is process-based, mainly for reasons of control and node autonomy. Hence much redundant data, and hence a high risk of data inconsistency and loss of database integrity. A change to a particular set of data must be replicated at all nodes holding the same data.
How is the system of record to be established? Which node will trigger the replication? A node can be given 'ownership' of data and the copy at that node is always regarded as correct. Owner node triggers changes on nodes holding copies. No other node allowed to update that data and local changes by non-owner nodes ignored if carried out. Non-owner nodes must request data from owner node if unsure of correctness of local data or to be sure of using correct data.
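The ownership scheme described above can be sketched as follows; the node names are hypothetical and the in-memory dictionaries stand in for real databases.

```python
# Sketch of owner-triggered replication: only the owner node may
# update, and its change is then pushed to every copy.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.data = {}          # stands in for the node's database

owner = Node("actuary")                      # system-of-record node
replicas = [Node("branch1"), Node("branch2")]  # nodes holding copies

def owned_update(key, value):
    """All updates go through the owner, which triggers replication."""
    owner.data[key] = value      # the owner's copy is always correct
    for node in replicas:        # owner pushes the change to copies
        node.data[key] = value
```

A non-owner node that is unsure of its local copy would read back from `owner.data` rather than trusting its own.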
Ensuring the integrity of transactions in a distributed system can be very complex. This is normally dealt with using the two-phase commit protocol.
Two Phase Commit Required when a transaction changes data in more than one database server. A simple commit will not work because some servers may succeed and others fail, breaching the principle of atomicity (all or nothing). First phase: all locations involved are sent a request asking them to prepare to commit. Second phase: can only start when all locations have responded positively to the first phase; a request is sent to all locations to commit. The commit fails if any location responds that it cannot commit or fails the first phase.
One location takes control of the whole transaction to provide coordination. Bear in mind that a transaction may involve mixed operations, e.g. an insert on one system, a delete on another, and an update on another. If the coordinator process receives a fail message from any location then it must tell the other locations to roll back, or write log information so that failed nodes can bring themselves up to date eventually. So the actual protocol for a two-phase commit can be complex.
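The two phases above can be sketched as a coordinator loop. This is a minimal sketch only: it assumes each participant exposes prepare/commit/rollback operations and ignores timeouts, logging and node crashes, which a real protocol must handle.

```python
# Minimal sketch of a two-phase commit coordinator. The Participant
# class and its method names are assumptions for illustration.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether this location can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"    # phase 2: global commit

    def rollback(self):
        self.state = "rolled_back"  # undo any prepared work

def two_phase_commit(participants):
    # Phase 1: ask every location to prepare to commit.
    if all(p.prepare() for p in participants):
        # Phase 2: every location voted yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any negative vote: tell all locations to roll back.
    for p in participants:
        p.rollback()
    return False
```

Note how atomicity is preserved: either every participant reaches `committed`, or every participant ends up `rolled_back`; no mixed outcome is possible in this sketch.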
3 Phase Commit Three-phase commit adds a 'pre-commit' phase. First phase (vote): all locations involved are sent a request asking them to prepare to commit. Second phase (pre-commit): all locations are told whether or not the transaction is going to commit. Third phase (global decision to commit): can only start when all locations have responded positively to the pre-commit phase.
DSS System of Record DSS processing is based on archival data, so it is important that there is consistency in how archival data is recorded; otherwise the criteria used to select data for archiving are likely to differ between nodes. Archive data is a 'snapshot' at a particular time, so all archive data is time-stamped. Data at all nodes must therefore be archived using the same selection criteria and in the same time-slot.
For instance, crime records: every crime, Monday-Friday only, night-time, day-time, major/minor, or what? The system of record in this context is the node responsible for archiving: the data warehouse. The warehouse may be a separate node from the clients doing operational or DSS processing; typically a high-performance, massive-storage, dedicated machine.
Fragmentation/Partitioning Horizontal: distribution of different rows of the same table to different sites. Suitable when different sites are carrying out the same functions on data. Vertical: distribution of data based on columns. This may involve redundantly including the primary key at different sites to ensure uniqueness. Suitable when different sites are carrying out different functions.
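The two fragmentation styles can be illustrated on a small table. The table layout, site names and column names below are assumptions chosen to echo the insurance examples earlier.

```python
# Illustrative sketch of horizontal vs. vertical fragmentation of a
# policies table; rows, sites and columns are assumed for the example.

policies = [
    {"policy_id": 1, "holder": "Adams", "region": "SW", "premium": 300},
    {"policy_id": 2, "holder": "Smith", "region": "NE", "premium": 450},
]

# Horizontal: different rows of the same table go to different sites.
site_sw = [r for r in policies if r["region"] == "SW"]
site_ne = [r for r in policies if r["region"] == "NE"]

# Vertical: different columns go to different sites, with the primary
# key (policy_id) redundantly included in each fragment so rows can
# be matched back up and uniqueness ensured.
finance_frag = [{"policy_id": r["policy_id"], "premium": r["premium"]}
                for r in policies]
marketing_frag = [{"policy_id": r["policy_id"], "holder": r["holder"]}
                  for r in policies]
```

Joining `finance_frag` and `marketing_frag` on `policy_id` reconstructs the original table, which is why the key must appear in every vertical fragment.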
Integrity Checking The amount and complexity of checking required depend on the degree and type of fragmentation, replication, and the distribution policy. With replication, transactions must also be replicated successfully. Horizontal: constraints are more complex, e.g. key constraints must be checked on every insert at each fragment (including referential integrity), unless distribution is based on a partition of key values. Vertical: fragments must contain the primary key attribute(s).
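The key-constraint point for horizontal fragments can be sketched as a global uniqueness check. The fragment layout and the `policy_id` key are illustrative assumptions; a real system would push this check down to each site rather than scan fragments centrally.

```python
# Sketch of a global key-uniqueness check across horizontal fragments:
# an insert must consult every fragment unless the distribution is a
# partition of key values (in which case only one fragment can clash).

def insert_ok(new_key, fragments):
    """A new key may be inserted only if no fragment already holds it."""
    return all(new_key != row["policy_id"]
               for frag in fragments for row in frag)

# Two horizontal fragments of the same (assumed) policies table.
fragments = [
    [{"policy_id": 1}, {"policy_id": 2}],   # fragment at site 1
    [{"policy_id": 3}],                     # fragment at site 2
]
```

This is why horizontal fragmentation without key-based partitioning is expensive: every insert implies a check at every site.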