Slide 1: Summary of Selections from Chapters 9, 10, prepared by Kirk Scott. SQL Unit 18. Data Management: Databases and Organizations, Richard Watson.
Slide 2: Chapter 9, The Relational Model and Relational Algebra. Generally speaking, the contents of this chapter should not be too difficult. The idea is that most of the information has been introduced inductively in the foregoing sections. This chapter puts some of the earlier information into context and sums up the idea of relational databases.
Slide 3: Background. Databases existed before the development of the relational model. They were based on networks or hierarchies. In other words, their implementation was based on linked data structures. These kinds of databases were not easy to understand or code. Note how everything is cyclical: O-O databases are basically modern hierarchical databases.
Slide 4: The general idea of storing data in tables was an obvious alternative from the beginning. However, it was unclear whether the relational data model was a practical alternative for large data sets. The main apparent problem was performance. Linked code can run quickly. Performing joins, for example, by traversing two tables is not very efficient.
Slide 5: More basic questions included, “What is a database?” and, “What should the user interface be like?” E. F. Codd is recognized as the main figure in the development of the relational model as a practical alternative to existing dbms’s. He and others addressed all of these questions.
Slide 6: Normalization was one of the results of this work. It illustrates the value of theory. Garden variety idiots think they understand tables. “What is there not to understand?” they think. Without a theoretical understanding there was no clarity about tables or how to use them.
Slide 7: The two remaining areas of development were related. Efficient algorithms for performing relational operations were necessary. Would the user be exposed to the implementations of the operations? Or would the user be given a different language as an interface?
Slide 8: Codd made the following observations about existing systems: 1. They forced programmers to write low level code. This meant that queries were more difficult to write, took longer to write, and typically required debugging because they were error prone.
Slide 9: 2. No commands were available for processing multiple records at a time. Existing systems used procedural algorithms which used loops to traverse linked data structures. Linked list traversal can be efficient, but the code can be difficult to write. By definition, inside the loop one record at a time was accessed.
Slide 10: The relational model was inherently set-based. It would be desirable to give the user a set-based interface. That would require the implementation of set level commands in the db internals. Efficient implementations would be needed before the relational model could be adopted.
Slide 11: 3. The existing systems were not amenable to ad hoc querying. Trained programmers are needed in order to write procedural code. SQL is simple enough that an end user can learn it (maybe). Also, the development time for an SQL query is short enough that it becomes practical to write one-time queries, not suites of programs.
Slide 12: Observations about existing systems and the contrasts with the relational model led Codd to these three goals for a database management system:
1. Data independence
2. Communicability
3. Set processing
Slide 13: 1. Data independence. The users of databases should not have to worry about how the data is physically stored. They should be free to envision the data simply as a collection of related tables, regardless of the physical implementation. Any physical level questions would be at the operating system or database administrator level.
Slide 14: 2. Communicability. The basic idea here is that the relational model, based on tables, records, keys, and values, is relatively easily understood by both users and programmers, making it easier for clients and developers to work together. This is in marked contrast to earlier database models.
Slide 15: 3. Set processing. This is basically just a repetition of information given above. The beauty of the relational model is that it allows queries to be non-procedural and still supports the retrieval of multiple records. The model is “tell what you want” rather than “tell how to get it”.
Slide 16: The Major Components of the Relational Model. The relational data model has three major components:
Data structures
Integrity rules
Operators used to retrieve, derive, or modify data
Slide 17: Data Structures. The following terms summarize the data structures that the relational model is based on:
Domains (fields)
Relations (collections of fields)
Primary key
Candidate key = Alternate key
Foreign key
Relational database (relations in a primary to foreign key relationship)
Slide 18: Integrity Rules. These are the integrity rules of the relational model:
Entity integrity: The primary key is unique and not null.
Referential integrity: Every foreign key value has to have a matching primary key value.
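The two integrity rules can be seen in action with a small sketch. It uses Python's built-in sqlite3 module only as a convenient way to run SQL; the tables Parent and Child are hypothetical names made up for this illustration, and other dbms products may report violations differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Parent (pid INTEGER NOT NULL PRIMARY KEY)")
conn.execute(
    "CREATE TABLE Child (cid INTEGER NOT NULL PRIMARY KEY, "
    "pid INTEGER REFERENCES Parent(pid))"
)
conn.execute("INSERT INTO Parent VALUES (1)")
conn.execute("INSERT INTO Child VALUES (100, 1)")  # a matching primary key exists

violations = []
attempts = [
    ("entity", "INSERT INTO Parent VALUES (1)"),          # duplicate primary key
    ("referential", "INSERT INTO Child VALUES (101, 2)"),  # no Parent row with pid = 2
]
for label, sql in attempts:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        violations.append(label)

print(violations)  # ['entity', 'referential']
```

Both bad inserts are rejected by the dbms itself, without any checking logic in the program.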
Slide 19: Operators = Manipulation Languages. A complete dbms has to support two kinds of functionality. The two kinds of functionality may be together in one language or they may be implemented in separate forms:
DDL = data definition language = defining the database tables
DML = data manipulation language = inserting, updating, and deleting data
Slide 20: There are essentially four language or manipulation options when it comes to relational databases:
Relational calculus
Relational algebra
SQL
QBE
Slide 21: Relational Calculus. Relational calculus is based on the mathematical underpinnings of the relational model. It has never been implemented as a language in a widely accepted dbms product. Relational calculus will not be pursued at all.
Slide 22: Relational Algebra. Relational algebra also emphasizes the mathematical underpinnings of the relational model. The query language for Ingres, Quel, was based on relational algebra. In the marketplace, it has largely been superseded by SQL.
Slide 23: Relational algebra will be pursued for two reasons: It provides a useful vocabulary for talking about queries. Even without delving into the theory, it is possible to make some useful observations about the necessary contents of a query language based on relational algebra concepts.
Slide 24: Relational algebra is fundamentally based on 8 operations:
1. Restrict (select): This picks a subset of rows from a table.
2. Project: This picks a subset of columns from a table.
3. Product: This forms all possible pairings of the rows of two tables.
Slide 25:
4. Union: This forms a vertical combination of the rows of two tables.
5. Intersect: This finds the rows that appear in both of two tables.
6. Difference: This finds the rows that appear in one table but not another.
7. Join: This finds a subset of rows of a product, typically where corresponding field values match.
Slide 26:
8. Divide: Relational divide is not as simple as the other concepts and has not been fully explained yet. For the sake of completeness it will be explained in the following overheads. After explaining division, the discussion will return to relational concepts in general.
Slide 27: Relational Division. Divide should be (and is) the converse of product. Division was mentioned in passing in the unit on SQL querying that covered double NOT EXISTS. In the context of the products of relations, the logical concept of FOR ALL is closely linked to the concept of division. Practically speaking, division will be accomplished using double NOT EXISTS.
Slide 28: Relational Division Example. The plan for this section is to explain relational division with the help of a few examples. These examples are actually the last four questions on the assignment for this unit. (Note that the current offering of the course may not include this assignment for credit.) The answers to these questions will be given here as part of the explanation.
Slide 29: If you do the assignment, your goal should not be to copy the answers given. Instead, after having read the explanatory material, hopefully enough of it will stick in your memory that you can come up with the correct answer on your own. If not, you can refer back to the explanations again.
Slide 30: TableX, TableY, and TableZ are given for the questions/examples. TableY is the table in the middle in an m – n relationship between TableX and TableZ. TableY contains a subset of the Cartesian product of the pk (id) fields of TableX and TableZ. The tables are shown on the following overhead.
Slide 32: Two tables are needed in order to do division. In this example we are interested in the quotient of TableY and TableZ. In other words, we’re interested in finding TableY DIVIDED BY TableZ. TableX is included in the example in order to help visualize the relationship between TableY and TableZ.
Slide 33: Dividing TableY by TableZ won’t yield TableX. It will yield a subset of TableX. This is because TableY isn’t the full Cartesian product of TableX and TableZ. If TableY were the full Cartesian product of TableX and TableZ, then TableY DIVIDED BY TableZ would give TableX.
Slide 34: Part of the goal of this discussion is to show how relational division can be accomplished in SQL. Division in SQL is done by means of double NOT EXISTS queries. The familiar structure of such queries consists of double nesting with three tables. Therefore, it’s convenient to have TableX available along with TableY and TableZ.
Slide 35: Now consider TableY and TableZ. The first column of TableY is the field xid. The second column of TableY is the field zid. TableY and TableZ have the field zid in common. The second field of TableZ, field zone, does not play a role in the division. The division of the two tables is based on the common field, zid. The result of the division will be in terms of the first field in TableY, xid.
Slide 36: The definition of relational division can be explained using these two tables as an example. The verbal expression of what TableY DIVIDED BY TableZ is supposed to produce as a result is this: It should find all of those values of xid, the first field in TableY, where those values of xid are matched with every value of zid, the common field, that appears in TableZ.
Slide 37: The verbal expression can be restated in this way: The division of the two tables should find those values of xid in TableY that are in a Cartesian product with the values of zid in TableZ. The division operation will not include in the results any values of xid in TableY that are not matched with every value of zid in TableZ.
Slide 38: TableY divided by TableZ on the fields TableY.zid and TableZ.zid, respectively, gives a one column result table containing xid values taken from TableY. TableX, TableY, and TableZ are repeated on the next overhead. The result of dividing TableY by TableZ is shown on the overhead following that one.
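The overheads with the table contents are not part of this transcript, so the verbal definition can be checked against a small made-up data set instead. The rows below are hypothetical sample values, not the ones from the original tables; the sketch simply applies the definition "keep each xid that is matched with every zid in TableZ".

```python
# Hypothetical sample rows (the original table contents are not available here).
table_y = {(1, 10), (1, 20), (2, 10), (3, 10), (3, 20)}  # (xid, zid) pairs
table_z = {(10, "north"), (20, "south")}                 # (zid, zone) rows

# Only the common field zid takes part in the division; zone plays no role.
z_ids = {zid for zid, _zone in table_z}

# Keep each xid that is paired in TableY with every zid in TableZ.
quotient = {
    xid
    for xid, _zid in table_y
    if all((xid, zid) in table_y for zid in z_ids)
}

print(sorted(quotient))  # [1, 3]
```

In this sample, xid 2 is left out because it is matched with zid 10 but not with zid 20.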
Slide 41: Another Example. In order to help the idea stick, another example is explained here verbally without completely illustrating it with tables. Suppose some TableR was the full Cartesian product of the xid values in TableX and the zid values in TableZ. What would the result be of dividing TableR by TableZ on their common field zid?
Slide 42: Except for the fact that it's stated verbally rather than completely illustrated, this question is easier than the first one. In this example TableR replaces TableY. If TableR is the Cartesian product of TableX.xid and TableZ.zid, then every xid value in TableR will be in the result of TableR divided by TableZ. In other words, the actual results of the division would be the table shown on the next overhead.
Slide 44: Relational Division Using SQL. From a mathematical point of view, relational division is a binary operation. Using SQL syntax, relational division can be accomplished with double NOT EXISTS. Double NOT EXISTS on three different tables is easier to keep track of than double NOT EXISTS on two tables, where one table appears once and the other table appears twice in the query. That’s why TableX is included in the example.
Slide 45: Relationally, the operation of interest is TableY divided by TableZ. Operationally, this means finding those values of TableX.xid which are paired in TableY with all of the values of TableZ.zid. In other words, find those TableX.xid values where there does not exist a TableZ.zid value that it's not matched with in TableY. The desired results can be phrased using universal quantification, all, or double negation.
Slide 46: This is the indication that in SQL the desired result can be obtained with a double NOT EXISTS query. If this query is written correctly, the result set of TableX.xid values will equal the set of TableY.xid values that would result from dividing TableY by TableZ on the fields TableY.zid and TableZ.zid, respectively. The desired query is shown on the next overhead.
Slide 48: Phrased informally, as was done in the unit that covered the double NOT EXISTS queries, this query asks for those values of xid in TableX where there is not a zid value in TableZ that it's not matched with, through the table in the middle, TableY.
Slide 49: Notice that this query follows the pattern for double NOT EXISTS queries. The first query opens the left base table. The second query opens the right base table. The third query opens the table in the middle. For reasons of scope, both of the joining conditions are in the third query.
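The overhead containing the three-table query itself is not reproduced in this transcript. The following is a sketch of a query that fits the description given (left base table, right base table, table in the middle, both joining conditions in the innermost query). It is a reconstruction under that assumption, not necessarily the original slide. It is run through Python's sqlite3 module against hypothetical sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; the original table contents are not available.
conn.executescript("""
    CREATE TABLE TableX (xid INTEGER PRIMARY KEY);
    CREATE TABLE TableZ (zid INTEGER PRIMARY KEY, zone TEXT);
    CREATE TABLE TableY (xid INTEGER, zid INTEGER, PRIMARY KEY (xid, zid));
    INSERT INTO TableX VALUES (1), (2), (3);
    INSERT INTO TableZ VALUES (10, 'north'), (20, 'south');
    INSERT INTO TableY VALUES (1, 10), (1, 20), (2, 10), (3, 10), (3, 20);
""")

rows = conn.execute("""
    SELECT xid
    FROM TableX                      -- first query: left base table
    WHERE NOT EXISTS
        (SELECT *
         FROM TableZ                 -- second query: right base table
         WHERE NOT EXISTS
            (SELECT *
             FROM TableY             -- third query: table in the middle
             WHERE TableY.xid = TableX.xid
             AND TableY.zid = TableZ.zid))
    ORDER BY xid
""").fetchall()

three_table_result = [xid for (xid,) in rows]
print(three_table_result)  # [1, 3]
```

The ORDER BY is only there to make the printed result predictable; it is not part of the division.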
Slide 50: It is also possible to write such a query using just the two tables that are involved in the division. When considering the double NOT EXISTS query, an example was given where all of the relevant fields were in the table in the middle and it could be opened three times with aliases in order to achieve the desired results.
Slide 51: In the division example the table in the middle, TableY, is both the thing that is being divided (the dividend) and the thing that has the result field in it (the quotient). TableZ is the thing you're dividing by (the divisor).
Slide 52: Using the terms for division as the aliases, TableY can be substituted for TableX in the previous example. This is possible because the result field of interest is xid, which is in TableY as well as TableX. The desired query is shown on the next overhead.
Slide 53:
SELECT DISTINCT xid
FROM TableY AS Quotient
WHERE NOT EXISTS
(SELECT *
FROM TableZ AS Divisor
WHERE NOT EXISTS
(SELECT *
FROM TableY AS Dividend
WHERE Quotient.xid = Dividend.xid
AND Dividend.zid = Divisor.zid));
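As a check, the two-table form of the query can be run against a small made-up data set. The rows are hypothetical, and Python's sqlite3 module is used only as a convenient way to execute the SQL. TableY is opened twice, under the aliases Quotient and Dividend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for illustration only.
conn.executescript("""
    CREATE TABLE TableZ (zid INTEGER PRIMARY KEY, zone TEXT);
    CREATE TABLE TableY (xid INTEGER, zid INTEGER, PRIMARY KEY (xid, zid));
    INSERT INTO TableZ VALUES (10, 'north'), (20, 'south');
    INSERT INTO TableY VALUES (1, 10), (1, 20), (2, 10), (3, 10), (3, 20);
""")

rows = conn.execute("""
    SELECT DISTINCT xid
    FROM TableY AS Quotient
    WHERE NOT EXISTS
        (SELECT *
         FROM TableZ AS Divisor
         WHERE NOT EXISTS
            (SELECT *
             FROM TableY AS Dividend
             WHERE Quotient.xid = Dividend.xid
             AND Dividend.zid = Divisor.zid))
    ORDER BY xid
""").fetchall()

two_table_result = [xid for (xid,) in rows]
print(two_table_result)  # [1, 3]
```

DISTINCT matters here: without it, each qualifying xid would appear once for every row it has in TableY.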
Slide 54: What does the division operation have to do with the Cartesian product? Are division and product complementary or inverse operations in a relational system? If TableY were the full Cartesian product of the xid from TableX and the zid from TableZ, then TableY divided by TableZ would return all of the xid values in TableX. Yes, they’re relational inverses.
Slide 55: The special case of the first example that was used to illustrate division is actually the more common case. TableY is not a full Cartesian product of TableX and TableZ. Only some of the values of xid have been matched with all of the values of zid. Division is also defined in this case, as explained above.
Slide 56: In essence, division finds the inverse for any values that could or would have been the result of a Cartesian product. Relational division ignores those values that did not participate in a full Cartesian product.
Slide 57: As you may already have noted, relational algebra is not the same as arithmetic algebra. If it were, we would be working with numbers, not relations. It seems that in the special case, which is the common case, relational division is not a full inverse. However, there is another way of viewing this.
Slide 58: When doing integer division, there is a remainder. In a sense, when doing relational division there is also a remainder. Those values of xid in TableY which did not participate in a Cartesian product are left over. Those values are in some sense the remainder upon relational division.
Slide 59: For those interested in things mathematical and logical, it is interesting that the SQL syntax for implementing relational division is the same syntax for implementing the logical quantifier FOR ALL. Pursuing an explanation of this aspect of the situation is beyond the scope of these notes.
Slide 60: Relational Algebra. This, then, is the full list of the eight relational algebra operations:
Restrict (Select)
Project
Product
Union
Intersect
Difference
Join
Divide
Slide 61: A Primitive Set of Relational Operators. The truth is that there are only five basic relational operations:
Restrict
Project
Product
Union
Difference
Slide 62: The five basic operations are basic for the following reason: They cannot be defined in terms of any of the other basic operations. Put another way, the effects they achieve cannot be achieved using any other combination of basic operations. However, the remaining 3 operations can be defined in terms of the basic 5.
Slide 63: The assertion that these five operations are basic will not be demonstrated. However, for those who are interested in the question, the following can be noted: The five basic operations can be viewed as corresponding to basic operations in a simple algebraic system. To a mathematician, the “basicness” of the operations would not be in doubt.
Slide 64: The three non-basic operations are join, intersection, and division. Showing that these three can be defined in terms of the other five will be pursued.
Slide 65: Defining the Join. The join can be defined in terms of the Cartesian product, selection, and projection. First, form the Cartesian product of two tables. Then do a selection (restriction) which applies the joining condition to the two corresponding fields which are internal to the product table. Then do a projection to obtain only those fields that you want.
Slide 66: Defining Intersection. The intersection can be defined in terms of the union and set differences. This is illustrated on the next overhead with the help of some Venn diagrams.
Slide 67: A intersect B = (A union B) – (A – B) – (B – A). (Venn diagram, with the two outer regions labeled A – B and B – A.)
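The identity can be checked on a small example using Python's built-in set type, where | is union, - is difference, and & is intersection. The two sets are arbitrary sample values.

```python
# Arbitrary sample sets standing in for two union compatible relations.
a = {1, 2, 3, 4}
b = {3, 4, 5}

# Intersection built only from union and difference.
derived = (a | b) - (a - b) - (b - a)

print(sorted(derived))  # [3, 4]
print(derived == a & b)  # True
```

Removing everything that is only in A and everything that is only in B from the union leaves exactly what the two have in common.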
Slide 68: Defining Division. Just as defining division was a bit messy, explaining why it isn’t a basic operation is also a bit messy. Let TableX, TableY, and TableZ again be given as a starting point for the discussion. TableY is a partial product, not necessarily a full Cartesian product, of TableX and TableZ.
Slide 69: We’re interested in dividing TableY by TableZ. We would like to find a sequence of basic relational algebra operations that will result in the same contents as TableY divided by TableZ.
Slide 70: Let TableC = the Cartesian product of TableX and TableZ. Consider the difference TableC – TableY. Let some xid be in TableY. Consider such an xid that is matched with every zid in TableZ.
Slide 71: When you find the difference, TableC – TableY, all occurrences of an xid value that is in the result of TableY divided by TableZ would be eliminated by the subtraction. In the result of the subtraction, no xid value would remain that was in TableY divided by TableZ.
Slide 72: TableY can hold xid values that don’t participate in the full Cartesian product. These are the remainder xid values. Only some occurrences of the remainder values would be eliminated by the subtraction. In other words, in TableC – TableY, some occurrences of the remainder values in TableC, the full Cartesian product, would not be eliminated.
Slide 73: Now do a projection on TableX on the xid column, giving a single column table, TableAllXid, containing all values of xid. Also do a projection on (TableC – TableY) on the xid column, giving a single column table, TableRemainders, containing all of the remainder values of xid. Then the result of the division would be TableAllXid – TableRemainders.
Slide 74: In summary:
TableC = TableX CARTESIAN PRODUCT TableZ
TableRemainders = projection on xid(TableC – TableY)
TableAllXid = projection on xid(TableX)
TableY DIVIDED BY TableZ = TableAllXid – TableRemainders
In short, division can be accomplished with a combination of a Cartesian product, two subtractions, and two projections. Division is not a basic operation because it can be accomplished by a combination of basic operations.
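The summary can be traced step by step with Python sets standing in for the tables. The rows are hypothetical sample values, and for brevity TableX and TableZ are reduced to their key columns; the variable names follow the names used in the summary.

```python
# Hypothetical sample rows, reduced to the key fields.
table_x = {1, 2, 3}                                        # xid values
table_z = {10, 20}                                         # zid values
table_y = {(1, 10), (1, 20), (2, 10), (3, 10), (3, 20)}    # (xid, zid) pairs

# TableC = TableX CARTESIAN PRODUCT TableZ
table_c = {(xid, zid) for xid in table_x for zid in table_z}

# TableRemainders = projection on xid(TableC - TableY)
table_remainders = {xid for xid, _zid in table_c - table_y}

# TableAllXid = projection on xid(TableX)
table_all_xid = set(table_x)

# TableY DIVIDED BY TableZ = TableAllXid - TableRemainders
quotient = table_all_xid - table_remainders

print(sorted(table_c - table_y))   # [(2, 20)]
print(sorted(table_remainders))    # [2]
print(sorted(quotient))            # [1, 3]
```

The only pairing missing from TableY is (2, 20), so xid 2 is the one remainder value and is subtracted out of the result.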
Slide 75: Who Cares About the Primitive Operators? Some database management systems used relational algebra as their query language. The Quel language of Ingres is an example. This has largely been supplanted by SQL. The point of the basic relational operators is that a system with a language that can accomplish what the five basic operators can accomplish is known as relationally complete.
Slide 76: In other words, all data stored in the database is retrievable. All systems can be measured against this standard. Theoretically speaking, SQL is a bit of a syntactical mish-mash. Whether successful or not, the designers’ goal was to make it friendly to users, not necessarily theoretically beautiful.
Slide 77: In any case, SQL is relationally complete. This is easily established by showing that it supports the five basic operations.
1. The WHERE clause implements restriction (selection).
Slide 78:
2. The listing of the desired fields in a SELECT statement implements projection.
3. A join without a joining condition implements the Cartesian product.
4. SQL has a UNION operator, so it implements union.
Slide 79:
5. Finally, relational subtraction is implemented through NOT EXISTS. Let relations A and B be given. Let A and B be union compatible. In other words, they have the same set of attributes. For the sake of illustration, let the attributes simply be named 1, 2, …, n.
Slide 80: Then this SQL query would find A – B:
SELECT *
FROM A
WHERE NOT EXISTS
(SELECT *
FROM B
WHERE A.1 = B.1 AND A.2 = B.2 AND … AND A.n = B.n)
Slide 81: In other words, find all of those records of A where there is no record in B that is exactly the same. Any record of A where there was a record in B that was exactly the same would be subtracted out.
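A concrete version of the subtraction query can be run with two attributes. Since 1 and 2 are not usable column names, the hypothetical names c1 and c2 are used instead, and the rows are made-up sample values. For comparison the sketch also runs the standard SQL EXCEPT operator, which is not mentioned in these notes but gives the same rows on this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical union compatible tables with made-up rows.
conn.executescript("""
    CREATE TABLE A (c1 INTEGER, c2 TEXT);
    CREATE TABLE B (c1 INTEGER, c2 TEXT);
    INSERT INTO A VALUES (1, 'p'), (2, 'q'), (3, 'r');
    INSERT INTO B VALUES (2, 'q'), (3, 'x');
""")

not_exists_rows = conn.execute("""
    SELECT *
    FROM A
    WHERE NOT EXISTS
        (SELECT *
         FROM B
         WHERE A.c1 = B.c1 AND A.c2 = B.c2)
    ORDER BY c1
""").fetchall()

except_rows = conn.execute(
    "SELECT * FROM A EXCEPT SELECT * FROM B ORDER BY c1"
).fetchall()

print(not_exists_rows)  # [(1, 'p'), (3, 'r')]
print(except_rows)      # [(1, 'p'), (3, 'r')]
```

The row (3, 'r') stays in the result because the row (3, 'x') in B is not exactly the same. Note that the two forms can differ when the tables contain duplicate rows or nulls.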
Slide 82: As you know, SQL also supports joining with separate syntax. This is part of what makes SQL a mish-mash, but in this instance, it certainly helps make SQL more user friendly.
Slide 83: A Fully Relational Database. As stated at the beginning, a relational dbms has three components:
Structures: domains and relations
Integrity rules: entity and referential
A manipulation language: DDL, DML. For example, relational algebra, or something else which is relationally complete.
Relational completeness has just been explained: the query language supports the five basic operations.
Slide 84: Do not confuse the phrase “relationally complete” with the phrase “fully relational”. The book notes that there are commercially available systems that advertise themselves as relational but which have certain limitations. For example, the systems may have an implementation of SQL but not support domains or integrity rules. The question is, is it fair to call these systems relational?
Slide 85: The answer is that they are not fully relational. E. F. Codd was one of the people instrumental in developing relational databases. He came up with a list of 12 characteristics that a fully relational dbms implementation should have. These are the accepted measuring stick for whether a system is fully relational.
Slide 86: When considering the current state of the dbms market it is worth noting that Codd’s rules were enunciated in 1985. When reading the rules, it may be helpful to read them “negatively.” In other words, for every rule there is or has been a dbms advertised as relational that did not have that characteristic.
Slide 87: Codd’s Rules for a Fully Relational Database. 1. The information rule. Regardless of the underlying implementation, from the user’s point of view, there is only one logical representation of data in a database: values stored in fields stored in tables.
Slide 88: 2. The guaranteed access rule. Every value in a database has to be accessible by specifying the table name, the column name, and the primary key value of the row in which it’s stored.
Slide 89: 3. Systematic treatment of null values. The system has to support the semantics of null. It can’t rely on devices such as storing blanks or 0’s or other default values to signify null. The system also has to support the syntax of null in the query language.
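A short sketch shows what supporting the semantics and syntax of null looks like in practice. The table Reading and its rows are hypothetical; SQLite, reached through Python's sqlite3 module, is used as the example dbms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: row 1 stores a real 0, row 2 stores null.
conn.executescript("""
    CREATE TABLE Reading (rid INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO Reading VALUES (1, 0), (2, NULL), (3, 5);
""")

# A stored 0 is a value; it is not the same thing as null.
zero_ids = [rid for (rid,) in conn.execute(
    "SELECT rid FROM Reading WHERE amount = 0")]

# Null has its own syntax in the query language: IS NULL.
null_ids = [rid for (rid,) in conn.execute(
    "SELECT rid FROM Reading WHERE amount IS NULL")]

# Comparing with = NULL is never true, so no rows qualify.
equals_null_count = conn.execute(
    "SELECT COUNT(*) FROM Reading WHERE amount = NULL").fetchone()[0]

print(zero_ids, null_ids, equals_null_count)  # [1] [2] 0
```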
Slide 90: 4. Active online catalog of the relational model. The system has to maintain an online catalog. This will include tables like SYSTABLE, SYSCOLUMN, SYSINDEX, etc. It should be possible for the user to query the catalog and find out all of the information about a given user database. Note that informally a data dictionary is at least a partial representation of the contents of the system catalog.
Slide 91: 5. The comprehensive data sublanguage rule. The system has to have a language or languages that support the following:
Data definition
Data manipulation
Security and integrity
Transaction processing
Interactive querying and querying embedded in a programming language
Even if a graphical user interface is provided, a text based language supporting these functions has to be provided. Note that SQL meets all of these requirements.
Slide 92: 6. The view updating rule. The dbms has to be able to update any view that is theoretically updatable. Comment mode on: Note that when views were covered, it was explained that a change to a view should cause a change in the underlying table(s). This rule tells you that some systems have not implemented views in this theoretically correct way.
Slide 93: 7. High-level insert, update, and delete. The system has to support set-at-a-time operations. In other words, it has to be possible to insert, update, and delete multiple records at a time. Comment mode on: Note that this is a swipe at graphical user interface-only systems. Without a real language, like SQL, it is unlikely that a graphically based system will be able to support multiple inserts, updates, and deletes.
Slide 94: 8. Physical data independence. The logical appearance of tables and data to users will not change even if there is some change in their physical storage. For example, a database may be ported to a different machine, hard drive, etc. As long as the dbms is the same, the db should seem unchanged. This is also true for changes such as adding indexes.
Slide 95: A user may notice a change in performance, but every query should still run, and it should not be necessary for the user to write queries with syntax that specifies that an index should be used when executing them. The system itself is responsible for all access issues at the physical level.
Slide 96: 9. Logical data independence. Information-preserving changes to the base tables should not affect queries or applications. For example, adding a new table to a db (even if it’s in a relationship with an existing table) should in no way affect any pre-existing applications. Or, adding a new field to a table shouldn’t affect existing queries.
Slide 97: 10. Integrity independence. Integrity constraints should be part of the dbms’s function. Application programs should not have to contain the logic for maintaining the constraints. It should be possible to change the constraints in the system without affecting existing applications. Note that this should not be confused with data integrity, which is a user problem.
Slide 98: 11. Distribution independence. If a dbms advertises itself as distributed, the distribution should be entirely transparent. In other words, all tables, data and applications should be accessible and work in the same way as they do without distribution, without any changes needed on the part of the user.
Slide 99: 12. The non-subversion rule. It should not be possible to get around the security or integrity constraints by using some other interface or access into the database.
Slide 100: Rule 0. At a later time Codd also stated this rule: The dbms should make it possible to manage a database entirely through its relational capacities. In other words, you may supply a graphical user interface or some user tools that are not explicitly relational, but you also have to provide the relational interface.
Slide 101: By way of explanation, the author now introduces another phrase, “totally relational”. The idea is that the system won’t allow non-relational tools to subvert the database. It also has a complete set of relational tools to manage the database. If these two conditions are met, along with the other 12 (plus rule 0), the dbms is totally relational, even though it may also provide other kinds of interfaces for convenience.
Slide 102: Chapter 10, SQL. Chapter 10 in the book reviews SQL syntax and then presents some additional information. The syntax review will be ignored. The additional information will be summarized.
Slide 103: User Defined Functions. SQL allows for the creation of a user defined function. The syntax is CREATE FUNCTION…the specifics aren’t important. The general idea is that the user can create a simple numerical/arithmetic function.
Slide 104: User Defined Procedures. SQL allows for the creation of a user defined procedure. The syntax is CREATE PROCEDURE…the specifics aren’t important. The general idea is that the user can package together a sequence of SQL commands/operations/queries in order to support multi-part transactions.
Slide 105: User Defined Triggers. SQL allows for the creation of a user defined trigger. The syntax is CREATE TRIGGER…the specifics aren’t important. The general idea is that the user can create a type of stored procedure which is automatically triggered when some action is taken on the database such as inserting, updating, or deleting the rows of a table. Triggers can be used to enforce business rules, data integrity checking, transaction logging, etc.
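Although the specifics aren't important, one small example may make the idea of a trigger concrete. The sketch below uses SQLite's dialect of CREATE TRIGGER (other products spell the details differently) to do simple transaction logging. The tables Account and AuditLog and the trigger name LogBalanceChange are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Account (aid INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE AuditLog (aid INTEGER, old_balance INTEGER, new_balance INTEGER);

    -- Fires automatically whenever a balance is updated.
    CREATE TRIGGER LogBalanceChange
    AFTER UPDATE OF balance ON Account
    BEGIN
        INSERT INTO AuditLog VALUES (OLD.aid, OLD.balance, NEW.balance);
    END;

    INSERT INTO Account VALUES (1, 100);
    UPDATE Account SET balance = 75 WHERE aid = 1;
""")

# Nothing inserted into AuditLog directly; the trigger did it.
log = conn.execute("SELECT * FROM AuditLog").fetchall()
print(log)  # [(1, 100, 75)]
```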
Slide 106: Database Security. SQL supports security by making it possible to grant or revoke the ability to take certain actions to individuals or groups of users. This is the basic syntax:
GRANT privilege(s) ON object(s) TO user(s) [WITH GRANT OPTION]
Slide 107: These are the privileges that apply to base tables and views: SELECT, INSERT, UPDATE, DELETE. These are the privileges that apply only to base tables: ALTER, INDEX. It is also possible to specify the following: ALL PRIVILEGES.
Slide 108: Users can be lists of userid’s or potentially all users, PUBLIC. The WITH GRANT OPTION tells whether or not a user who has been granted a privilege also has the right to grant it to another user. Privileges can be withdrawn with REVOKE. If REVOKE is issued on a user who granted a privilege to another user, the privilege is also revoked from this other user (cascading REVOKE).
Slide 109: The System Catalog. The system catalog was touched on briefly in the previous chapter. The catalog is a db in its own right. By querying tables like SYSCATALOG, SYSCOLUMNS, SYSINDEXES, etc., it is possible to find out everything there is to know about the databases recorded in the catalog. Note: It is a mystery why SYSCOLUMNS is plural rather than singular in this discussion.
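Catalog table names differ from product to product. As one concrete illustration, SQLite keeps its catalog in a table named sqlite_master and exposes column information through PRAGMA table_info, rather than through tables named SYSCATALOG or SYSCOLUMNS. The sketch below queries that catalog with ordinary SQL; the table TableZ and the index ZoneIndex are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableZ (zid INTEGER PRIMARY KEY, zone TEXT)")
conn.execute("CREATE INDEX ZoneIndex ON TableZ (zone)")

# The catalog is itself a table and can be queried like any other.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()

# Column details for one table; the second field of each row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(TableZ)")]

print(catalog)  # [('table', 'TableZ'), ('index', 'ZoneIndex')]
print(columns)  # ['zid', 'zone']
```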
Slide 110: Natural Language Processing. Some vendors may offer natural language processing as a feature of their dbms. This would allow users to write queries in English. The system would translate them to SQL.
Slide 111: This is problematic because of the possible ambiguities in English. It is also possibly problematic because a user who doesn’t understand the database well enough to apply SQL to it may not be able to form clear, meaningful queries against the database in English.
Slide 112: Database Connectivity and Drivers. ODBC and JDBC stand for Open Database Connectivity and Java Database Connectivity, respectively. These are sets of standards/technology with the following purpose: In a client-server environment, a client can use a server database where the server dbms may be one of several different kinds. This is accomplished by defining one standard interface and writing a driver for each kind of dbms which supports the common interface.
Slide 113: Embedded SQL. This topic will be relevant at the end of the course when considering PHP. By that time, you will be working on your project. Most likely you will figure out how this works just by following examples, not by listening to lectures on the topic.
Slide 114: SQL can be used as a stand-alone language for ad hoc queries. Procedural programming languages also have syntax allowing for SQL statements to be embedded in them. This allows a program to process the results of a query, for example. It also allows a program to enter data into tables.
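The course itself uses PHP for this, but the same two ideas (a program entering data into a table, and a program processing the results of a query) can be sketched in a few lines of Python with the built-in sqlite3 module. The table contents are hypothetical sample rows, and the ? placeholders are the module's way of passing program values into an SQL statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableZ (zid INTEGER PRIMARY KEY, zone TEXT)")

# The program enters data into the table; each ? is filled in from a tuple.
new_rows = [(10, "north"), (20, "south"), (30, "north")]
conn.executemany("INSERT INTO TableZ VALUES (?, ?)", new_rows)

# The program processes the query results one row at a time in a loop.
counts = {}
for zid, zone in conn.execute("SELECT zid, zone FROM TableZ ORDER BY zid"):
    counts[zone] = counts.get(zone, 0) + 1

print(counts)  # {'north': 2, 'south': 1}
```

The counting could also be done in SQL with GROUP BY; it is done in the loop here only to show the host language handling the rows.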
Slide 115: SQL Standardization. SQL was first standardized in the 1980’s. For example, there was a standard known as SQL-89. SQL-92, also known as SQL2, is the current gold standard. In other words, most vendors support this standard, potentially with additional features.
Slide 116: SQL-99 added object-oriented features. It is not clear yet whether vendors will follow this standard or go their own way. It’s also not clear whether it’s an improvement to keep adding new features to a standard that has been relatively simple and successful. SQLJ refers to another direction taken in SQL standardization, trying to integrate it with Java.
Slide 117: Summary. Purists may quibble about one or more features of SQL. Also, SQL keeps on developing and it’s not clear all of the developments will succeed in the marketplace. However, the core of SQL has been around for some time. There is no sign that SQL is going to go away any sooner than relational database management systems are going to go away.
Slide 119: Why is there a remainder in relational division? What's left behind are those values that couldn't have been the result of a product in the first place, because they are not matched with all of the other values. In any event, relational division is certainly related to, and complementary to, the operation of finding a product.
Slide 120: Those xid values that were not eliminated in TableC – TableY would be the same as those xid values that were in the remainder. The remainder values, by definition, are those that didn’t match with all of the zid values. That’s how come there will be remainder values left after the set subtraction.