Fall 2000Fall ‘01CSE330CIS5501 CIS550: Introduction to Database Management Systems Fall ‘01
Administrative Stuff
What you should know to take this class.
Handouts: Syllabus and Homework 1.
Resources: text, TA, Web site, bulletin board and office hours.
Coursework: homeworks, exams, project.
Computer accounts.
What the subject is about
–Organization of data
–Efficient retrieval of data
–Reliable storage of data
–Maintaining consistent data
Not surprisingly, all these topics are interrelated.
What is a DBMS?
A database (DB) is a large, integrated collection of data.
A DB models a real-world enterprise.
A database management system (DBMS) is a software package designed to store and manage databases.
Why study databases?
Everybody needs them, i.e. $$$.
There are lots of interesting problems, both in database research and in implementation.
Good design is always a challenge.
Connection to other areas of CS…
–Programming languages and software engineering (obviously)
–Algorithms (obviously)
–Logic, discrete math, and theory of computation
–“Systems” issues: concurrency, operating systems, file organization and networks
But 80% of the world’s data is not in a DB!
Examples:
–scientific data (large images, complex programs that analyze the data)
–personal data
–WWW
Why don't we “program up” databases when we need them?
For simple and small databases this is often the best solution. Flat files and grep get us a long way.
We run into problems when
–The structure is complicated (more than a simple table)
–The database gets large
–Many people want to use it simultaneously
Example: Personal Calendar
We might start by building a file with the following structure:
    What     Day     When   Who           Where
    Lunch    10/24   1pm    Rick          Joe’s Diner
    CS123    10/25   9am    Dr. Egghead   Morris234
    Biking   10/26   9am    Jane          Jane’s house
    Dinner   10/26   6PM    Jane          Café Le Boeuf
This text file is easy to deal with. So there's no need for a DBMS!
Problem 1: Data Organization
Consider the all-important “who” field. Do we also want to keep addresses, telephone numbers etc? Expand our file to look like:
    What   When   Who-name   Who-   Who-tel   ….   Where   …
Now we are keeping our address book in our calendar and doing so redundantly.
“Link” Calendar with Address Book?
Two conceptual “entities” -- contact information and calendar -- with a relationship between them, linking people in the calendar to their contact information.
This link could be based on something as simple as the person's name.
Problem 2: Efficiency
The size of a personal address book is probably less than one hundred entries, but there are things we'd like to do quickly and efficiently.
–“Give me all appointments on 10/28”
–“When am I next meeting Jim?”
We want to “program” these as quickly as possible, and to have these programs executed efficiently.
What would happen if you were using a “corporate” calendar with hundreds of thousands of entries?
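As a rough illustration of the flat-file approach, the sketch below answers “give me all appointments on a given day” by scanning every line of the calendar. The tab-separated layout and the names `CALENDAR_TSV` and `appointments_on` are assumptions made for this example, not part of the slides. The scan touches every row, which is fine for a personal calendar but grows linearly with file size.

```python
import csv
import io

# Tab-separated copy of the personal calendar shown earlier (layout is assumed).
CALENDAR_TSV = """What\tDay\tWhen\tWho\tWhere
Lunch\t10/24\t1pm\tRick\tJoe's Diner
CS123\t10/25\t9am\tDr. Egghead\tMorris234
Biking\t10/26\t9am\tJane\tJane's house
Dinner\t10/26\t6PM\tJane\tCafé Le Boeuf
"""

def appointments_on(day, text=CALENDAR_TSV):
    """Linear scan: read every row and keep those whose Day field matches."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row["Day"] == day]

busy = appointments_on("10/26")
```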
Problem 3: Concurrency and Reliability
Suppose other people are allowed to access your calendar and to modify it. How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess?
Suppose the system crashes while we are changing the calendar. How do we recover our work?
Example
Suppose I schedule a meeting with a student after class today (3:00pm) and at the same time my secretary schedules me to meet with the Chairman.
We both see that the time is open, but presumably only one of the two meetings will show on the calendar later.
Transactions
Key concept for concurrency is that of a transaction: an atomic sequence of database actions (read/write) on data items (e.g. calendar entry).
Key concept for recoverability is that of a log: keeping track of all actions carried out by the DB.
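The double-booking scenario can be sketched with Python's built-in `sqlite3` module. This is only one possible illustration: the `Calendar` table, its `Slot` key and the helper `book` are hypothetical names, and the conflict is caught by a key constraint inside a transaction rather than by any mechanism the slides describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per time slot; the primary key makes a second booking of a slot illegal.
conn.execute("CREATE TABLE Calendar (Slot TEXT PRIMARY KEY, What TEXT)")

def book(conn, slot, what):
    """Try to book a slot as one atomic unit; return False if it is taken."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO Calendar VALUES (?, ?)", (slot, what))
        return True
    except sqlite3.IntegrityError:
        return False

first = book(conn, "3:00pm", "Meet student")
second = book(conn, "3:00pm", "Meet Chairman")
rows = conn.execute("SELECT Slot, What FROM Calendar").fetchall()
```

Only the first booking succeeds, and the failed attempt leaves no partial change behind.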
Database architecture -- the traditional view
It is common to describe databases in two ways:
–The logical structure. What users see. The program or query language interface.
–The physical structure. How files are organized. What indexing mechanisms are used.
Further, it is traditional to split the logical level into two components: overall database design (conceptual) and the views that various users get to see.
Three-level architecture
    View 1    View 2    …    View N
    Conceptual Level (Schema)
    Physical Level (file organization, indexing)
Data independence
A user of a relational database system should be able to use SQL to query the database without knowing precisely how the data is stored, e.g.
    SELECT When, Where FROM Calendar WHERE Who = 'Bill'
After all, you don't worry much about how numbers are stored when you program some arithmetic or use a computer-based calculator.
More on data independence
Logical data independence protects the user from changes in the logical structure of the data -- could completely reorganize the calendar “schema” without changing how I query it.
Physical data independence protects the user from changes in the physical structure of data: could add an index on Who without changing how the user would write the query, but the query would execute faster (query optimization).
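Physical data independence can be tried out in a small way with `sqlite3`: the same query text runs before and after an index is added on Who. The rows and the index name `idx_who` are made up for this sketch, and the column names When and Where are quoted because they are SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "When" and "Where" are SQL keywords, so they are quoted as identifiers.
conn.execute('CREATE TABLE Calendar (What TEXT, "When" TEXT, Who TEXT, "Where" TEXT)')
conn.executemany(
    "INSERT INTO Calendar VALUES (?, ?, ?, ?)",
    [
        ("Lunch", "1pm", "Bill", "Joe's Diner"),      # illustrative rows
        ("Biking", "9am", "Jane", "Jane's house"),
    ],
)

QUERY = 'SELECT "When", "Where" FROM Calendar WHERE Who = ?'

before = conn.execute(QUERY, ("Bill",)).fetchall()
conn.execute("CREATE INDEX idx_who ON Calendar (Who)")  # physical change only
after = conn.execute(QUERY, ("Bill",)).fetchall()
```

The query string is untouched by the index; only the access path available to the system changes.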
That's the traditional view, but...
Three-level architecture is not always “achievable” for database programmers. When databases get big, queries must be carefully written to achieve efficiency.
There are databases over which we have no control. The Web is a giant, disorganized database.
There are also well-organized databases on the Web for which the terminology does not quite apply.
In this course...
Study relational databases, their design, how to query, what forms of indices to use.
Beyond relational algebra: a logical model of data (Datalog), recursion.
Beyond “first-normal form”: object-oriented databases, how to query, using OO design techniques.
Newer applications and models:
–On-Line Analytical Processing (OLAP)
–XML and semi-structured data models
What we won’t cover in any depth...
The Relational Model: Relational Algebra
Data Models and database design
When we design a database we try to think “logically”, but need some kind of framework in which to design the database.
It is like designing a data structure in some programming language. You might use arrays, lists, etc. depending on what is available. A data model is like a type system, but is abstract.
In the relational data model we organize the data into tables. We don't (initially) worry about how these tables are implemented.
The Relational Model -- An introduction
In the first few lectures we are going to discuss relational query languages.
–We'll start by discussing the relational algebra, a “theoretical language”. Later we'll discuss -- and use -- the “commercial standard”, SQL.
–Limitations of the relational algebra will also be discussed by contrast with a logical language, Datalog.
The “theoretical language” is also used as an “internal language” to implement and optimize SQL.
What is a relational db?
As you probably guessed, it is a collection of tables.
Routes:
    RId   RName         Grade   Rating   Height
    1     Last Tango    II      …        …
    2     Garden Path   I       …        …
    …     The Sluice    I       8        60
    …     Picnic        III     …        …
Climbers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
Climbs:
    CId   RId   Date      Duration
    123   …     …/10/88   5
    123   …     …/08/87   1
    313   …     …/08/89   5
    214   …     …/07/92   2
    313   …     …/07/94   3
Why is the database like this?
Each route has an id, a name, a grade (an estimate of the time needed), a rating (how difficult it is), and a height.
Each climber has an id, a name, a skill level and an age.
A climb records who climbed what route on what date and how long it took (duration).
We will deal with how we arrive at such a design later. Right now observe that the data values in these tables are all “simple”. None of them are complex structures -- like other relations.
Some terminology
The column names of a relation are often called attributes or fields.
The rows of a relation are called tuples.
Each attribute has values taken from a domain. For example, the domain of CName is string and that for rating is real.
A relation is a set of tuples; no tuple can occur more than once. Objects differ in that they have “identity”.
Describing Relations
Relations are described by a schema which can be expressed in various ways, but to a DBMS is usually expressed in a data definition language (DDL) -- something like a type system of a programming language.
    Routes(RId:int, RName:string, Grade:string, Rating:int, Height:int)
    Climbers(CId:int, CName:string, Skill:string, Age:int)
    Climbs(CId:int, RId:int, Date:date, Duration:int)
A note on domains
Relational DBMSs have fixed “built-in” domains, such as int, string etc. There are also some other domains like date but not, for example, roman-numeral (which might be useful here).
In object-oriented and object-relational systems, new domains can be added by the programmer/user or are sold by the vendor.
Database people, when they are discussing design, often get sloppy and forget domains. They write, for example,
    Routes(RId, RName, Grade, Rating, Height)
Integrity Constraints
Domains are, in a sense, a primitive form of constraint on a valid instance of the schema. Other important constraints include:
–Key constraints: each tuple must be distinct. A key is a subset of fields that uniquely identifies a tuple, and for which no subset of the key has this property.
–Inclusion dependencies (referential integrity constraints): a field in one relation may refer to a tuple in another relation by including its key. The referenced tuple must exist in the other relation for the database instance to be valid.
Typically, a relation may have several candidate keys, one of which is chosen as the primary key.
Expressing constraints
In SQL-92, these constraints are defined as follows:
    CREATE TABLE Climbers
      (CId INTEGER,
       CName CHAR(20),
       Skill CHAR(4),
       Age INTEGER,
       PRIMARY KEY (CId),
       UNIQUE (CName, Age))

    CREATE TABLE Climbs
      (CId INTEGER,
       RId INTEGER,
       Date DATE,
       Duration INTEGER,
       PRIMARY KEY (CId, RId),
       FOREIGN KEY (CId) REFERENCES Climbers,
       FOREIGN KEY (RId) REFERENCES Routes)
Example
The instances below satisfy these constraints.
–Insert (123, Jeremy, MED, 16) into Climbers?
–Insert (456, 2, 09/13/98, 3) into Climbs?
–Delete (313, Bridget, EXP, 33) from Climbers?
–Modify 123 to 456 in Climbers?
Wouldn't it be nice if the web enforced some form of referential integrity!
Climbers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
Climbs:
    CId   RId   Date      Duration
    123   …     …/10/88   5
    123   …     …/08/87   1
    313   …     …/08/89   5
    214   …     …/07/92   2
    313   …     …/07/94   3
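The four questions can be tried against a small SQLite database. This is a sketch under assumptions: the helper `allowed` is a made-up name, the ages, dates and route ids are illustrative rather than the slide's values, and SQLite must be told to enforce foreign keys with a PRAGMA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE Routes   (RId INTEGER PRIMARY KEY, RName TEXT);
CREATE TABLE Climbers (CId INTEGER, CName CHAR(20), Skill CHAR(4), Age INTEGER,
                       PRIMARY KEY (CId), UNIQUE (CName, Age));
CREATE TABLE Climbs   (CId INTEGER, RId INTEGER, Date DATE, Duration INTEGER,
                       PRIMARY KEY (CId, RId),
                       FOREIGN KEY (CId) REFERENCES Climbers,
                       FOREIGN KEY (RId) REFERENCES Routes);
-- Illustrative rows only.
INSERT INTO Routes   VALUES (1, 'Last Tango'), (2, 'Garden Path');
INSERT INTO Climbers VALUES (123, 'Edmund', 'EXP', 80), (313, 'Bridget', 'EXP', 33);
INSERT INTO Climbs   VALUES (123, 1, '1988-10-10', 5), (313, 1, '1989-12-08', 5);
""")

def allowed(sql):
    """Run one statement in its own transaction; False if a constraint rejects it."""
    try:
        with conn:  # rolls back automatically when the statement fails
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

insert_dup_key = allowed("INSERT INTO Climbers VALUES (123, 'Jeremy', 'MED', 16)")
insert_orphan  = allowed("INSERT INTO Climbs VALUES (456, 2, '1998-09-13', 3)")
delete_parent  = allowed("DELETE FROM Climbers WHERE CId = 313")
modify_key     = allowed("UPDATE Climbers SET CId = 456 WHERE CId = 123")
```

With this sample data all four changes are rejected: the first by the primary key, the other three by referential integrity.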
Relational Algebra
Relational algebra is a set of operations (functions) each of which takes a relation (or relations) as input and produces a relation as output.
There are five basic operations:
–Projection
–Selection
–Union
–Difference
–Product
Using these we can build up sophisticated database queries.
Projection
Given a list of column names A and a relation R, the projection π_A(R) extracts the columns in A from the relation.
Example: π_{RId, Height}(Routes)
Routes:
    RId   RName         Grade   Rating   Height
    1     Last Tango    II      …        …
    2     Garden Path   I       …        …
    …     The Sluice    I       8        60
    …     Picnic        III     …        …
Result columns: RId, Height
Projection, cont.
Suppose the result of a projection has a repeated value, how do we treat it?
(Two candidate answers were shown, each a single Height column.)
In “pure” relational algebra the answer is always a set (the second answer). However, SQL and some other languages return, by default, a multiset.
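The set versus multiset distinction can be sketched in a few lines of Python. The function names `project` and `project_bag` and the sample heights are assumptions for this example; rows are modeled as dictionaries keyed by attribute name.

```python
def project(rows, columns):
    """Set-semantics projection: keep the named columns, drop duplicate tuples."""
    return {tuple(row[c] for c in columns) for row in rows}

def project_bag(rows, columns):
    """Multiset projection, as SQL does without DISTINCT: duplicates survive."""
    return [tuple(row[c] for c in columns) for row in rows]

# Illustrative rows; two routes share the same height.
routes = [
    {"RId": 1, "RName": "Last Tango", "Height": 100},
    {"RId": 2, "RName": "Garden Path", "Height": 60},
    {"RId": 3, "RName": "The Sluice", "Height": 60},
]

heights_set = project(routes, ["Height"])
heights_bag = project_bag(routes, ["Height"])
```

Here `heights_bag` keeps both copies of 60, while `heights_set` keeps one.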
Selection
Selection σ_C(R) takes a relation R and extracts those rows from it that satisfy the condition C. For example, a selection on Routes can give:
    RId   RName         Grade   Rating   Height
    2     Garden Path   I       …        …
    …     The Sluice    I       8        60
What can go in a condition?
Conditions are built up from boolean-valued operations on the field names. E.g. Height >= 100, RName = "Picnic".
Predicates can be constructed from these using logical or, and, not.
It turns out that we don't lose any expressive power if we don't have complex predicates in the language, but they are convenient and useful in practice.
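A selection can be sketched as a filter whose condition is any boolean-valued function of a row. The name `select` and the sample rows are assumptions for this illustration; the conditions reuse the two atomic tests mentioned above and combine them with `and` and `not`.

```python
def select(rows, condition):
    """Keep the rows for which the boolean-valued condition holds."""
    return [row for row in rows if condition(row)]

# Illustrative rows, not the slide's data.
routes = [
    {"RId": 1, "RName": "Last Tango", "Height": 100},
    {"RId": 3, "RName": "The Sluice", "Height": 60},
    {"RId": 4, "RName": "Picnic", "Height": 400},
]

tall = select(routes, lambda r: r["Height"] >= 100)
tall_not_picnic = select(
    routes, lambda r: r["Height"] >= 100 and not r["RName"] == "Picnic"
)
```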
Set operations -- Union
If two relations have the same structure (Database terminology: are union-compatible. Programming language terminology: have the same type) we can perform set operations.
Climbers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
Hikers:
    CId   CName     Skill   Age
    214   Arnold    BEG     …
    …     Jane      MED     39
Climbers ∪ Hikers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
    …     Jane      MED     39
Set operations -- difference
An example:
Climbers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
Beginners:
    CId   CName     Skill   Age
    214   Arnold    BEG     …
Climbers – Beginners:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    …     James     MED     27
    313   Bridget   EXP     33
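Because a relation is a set of tuples, union and difference map directly onto Python's set operators. In the sketch below the tuples are (CId, CName, Skill); Jane's id is invented for the example and the Age column is omitted.

```python
# Union-compatible relations: every tuple is (CId, CName, Skill).
climbers = {
    (123, "Edmund", "EXP"),
    (214, "Arnold", "BEG"),
    (313, "Bridget", "EXP"),
}
hikers = {
    (214, "Arnold", "BEG"),
    (900, "Jane", "MED"),   # hypothetical id
}
beginners = {t for t in climbers if t[2] == "BEG"}

everyone = climbers | hikers          # union: Arnold appears once
non_beginners = climbers - beginners  # difference
```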
Set operations -- other
It turns out we can implement the other set operations using those we already have. For example, for any relations (sets) R, S:
    R ∩ S = R – (R – S)
Again, we have to be careful. Although it is mathematically nice to have fewer operators, operations like set difference may be less efficient than intersection.
Optimizations -- a hint of things to come
We mentioned earlier that compound predicates in selections were not “essential” to relational algebra. This is because we can translate selections with compound predicates into set operations.
Example:
However, which do you think is more efficient?
Also, how would you translate ?
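One way to read the translation, offered here as an assumption rather than as the slide's own example: a conjunction becomes an intersection of two selections, a disjunction becomes a union, and a negation becomes a difference from the input relation. The sketch checks these identities on a small relation; `select`, `tall` and `picnic` are made-up names and the heights are illustrative.

```python
# Tuples are (RId, RName, Height); values are illustrative.
routes = {
    (1, "Last Tango", 100),
    (2, "Garden Path", 60),
    (3, "The Sluice", 60),
    (4, "Picnic", 400),
}

def select(rel, cond):
    """Set-semantics selection."""
    return {t for t in rel if cond(t)}

tall = lambda t: t[2] >= 100
picnic = lambda t: t[1] == "Picnic"

# Conjunction: one pass with a compound predicate ...
both_direct = select(routes, lambda t: tall(t) and picnic(t))
# ... versus two selections followed by an intersection.
both_as_sets = select(routes, tall) & select(routes, picnic)

# Disjunction as a union, negation as a difference.
either_as_sets = select(routes, tall) | select(routes, picnic)
not_tall_as_sets = routes - select(routes, tall)
```

The set-based forms scan the relation more than once and build intermediate results, which hints at why the compound predicate is usually the cheaper plan.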
Database Queries
Queries are formed by building up expressions with the operations of the relational algebra. Even with the operations we have defined so far we can do something useful. For example, select-project expressions are very common:
–What does this mean in English?
–Also, could we interchange the order of the σ and π? Can we always do this?
As another example, how would you “delete” the climber named James from the database?
Joins
Join is a generic term for a variety of operations that connect two relations that are not union compatible. The basic operation is the product, R x S, which concatenates every tuple in R with every tuple in S.
R:
    A    B
    a1   b1
    a2   b2
S:
    C    D
    c1   d1
    c2   d2
    c3   d3
R x S:
    A    B    C    D
    a1   b1   c1   d1
    a1   b1   c2   d2
    a1   b1   c3   d3
    a2   b2   c1   d1
    a2   b2   c2   d2
    a2   b2   c3   d3
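The product can be sketched with `itertools.product`, concatenating each pair of tuples. The function name `cross` is an assumption for this example; the data matches the R and S relations above.

```python
from itertools import product

def cross(r, s):
    """Relational product: concatenate every tuple of r with every tuple of s."""
    return {t + u for t, u in product(r, s)}

R = {("a1", "b1"), ("a2", "b2")}
S = {("c1", "d1"), ("c2", "d2"), ("c3", "d3")}

RxS = cross(R, S)
```

The result has |R| times |S| tuples, which is why products are rarely materialized on their own.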
Products, cont.
What happens when we form a product of two relations with columns with the same name?
Details vary, but a common answer is to suffix the attribute names with 1 and 2. Climbs x Climbers will have a schema:
    (CId.1, RId, Date, Duration, CId.2, CName, Skill, Age)
Climbers:
    CId   CName     Skill   Age
    123   Edmund    EXP     …
    214   Arnold    BEG     …
    313   Bridget   EXP     33
    …     James     MED     27
Climbs:
    CId   RId   Date      Duration
    123   …     …/10/88   5
    123   …     …/08/87   1
    313   …     …/08/89   5
    214   …     …/07/92   2
    313   …     …/07/94   3
Products, cont.
Products are hardly ever used alone; they are typically used in conjunction with a selection, for example σ_{CId.1 = CId.2}(Climbs x Climbers):
    CId.1   RId   Date      Duration   CId.2   CName     Skill   Age
    123     …     …/10/88   5          123     Edmund    EXP     …
    123     …     …/08/87   1          123     Edmund    EXP     …
    313     …     …/08/89   5          313     Bridget   EXP     33
    214     …     …/07/92   2          214     Arnold    BEG     …
    313     …     …/07/94   3          313     Bridget   EXP     33
Note that this relation has useful information. We can tell, for example, the names of climbers who have climbed a certain route.
Theta Joins
The combination of a selection and a product is so common that we give it a special symbol (and name):
    R ⋈_C S = σ_C(R x S)
Example: Climbs ⋈_{CId.1 = CId.2} Climbers
The condition in a theta join is almost always an equality or conjunction of equalities.
(Note: the name “theta” refers to the condition, C; this is also called the “conditional” join.)
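A theta join can be sketched literally as a selection applied to a product. The function name `theta_join`, the reduced column sets and the row values are assumptions for this illustration; the condition receives one tuple from each input.

```python
from itertools import product

def theta_join(r, s, condition):
    """Selection over a product: keep concatenated pairs that satisfy the condition."""
    return [t + u for t, u in product(r, s) if condition(t, u)]

# Illustrative rows. Climbs: (CId, RId, Duration); Climbers: (CId, CName).
climbs = [(123, 1, 5), (313, 1, 5), (214, 2, 2)]
climbers = [(123, "Edmund"), (214, "Arnold"), (313, "Bridget")]

# Equality condition on the two CId columns (positions 0 and 0).
joined = theta_join(climbs, climbers, lambda c, p: c[0] == p[0])
```

Each result tuple carries the climber id twice, once from each input.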
Renaming
Our example yields a relation with fields CId.1 and CId.2 with the same information. Almost certainly we want to get rid of one of them, and this can be done using projection.
We probably also want to rename the remaining field CId.1 to CId. For this we need a renaming operation, which renames the attribute a of R to b.
In practical query languages, renaming is carried out by a different means, and we shall usually ignore this unimportant operation.
Natural Join
The most common join to do is an equality join of two relations on commonly named fields, and to leave one copy of those fields in the resulting relation. This is what we just did with Climbs and Climbers.
This is called natural join and its symbol is ⋈ (no subscript).
    CId   RId   Date      Duration   CName     Skill   Age
    123   …     …/10/88   5          Edmund    EXP     …
    123   …     …/08/87   1          Edmund    EXP     …
    313   …     …/08/89   5          Bridget   EXP     33
    214   …     …/07/92   2          Arnold    BEG     …
    313   …     …/07/94   3          Bridget   EXP     33
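With rows modeled as dictionaries, a natural join can be sketched as: match on every attribute name the two rows share, then merge the rows so the shared attributes appear once. The function name `natural_join` and the sample rows are assumptions for this example.

```python
def natural_join(r, s):
    """Join on all commonly named fields, keeping one copy of each."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[k] == u[k] for k in shared):
                out.append({**t, **u})  # merged row; shared fields appear once
    return out

# Illustrative rows with reduced column sets.
climbs = [
    {"CId": 123, "RId": 1, "Duration": 5},
    {"CId": 313, "RId": 1, "Duration": 5},
]
climbers = [
    {"CId": 123, "CName": "Edmund"},
    {"CId": 214, "CName": "Arnold"},
    {"CId": 313, "CName": "Bridget"},
]

joined = natural_join(climbs, climbers)
```

When the inputs share no attribute names, every pair matches and the result is the plain product.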
Examples
This completes the basic operations of the relational algebra. We shall soon find out in what sense this is an adequate set of operations. Try writing queries for these:
–The names of climbers older than 32.
–The names of climbers who have climbed route 1.
–The names of climbers who have climbed the route named Last Tango.
–The names of climbers with age less than 40 who have climbed a route with rating higher than 5.
–The names of climbers who have not climbed anything.
Division (not in the book)
Division is a somewhat messy operation and can be expressed in terms of the operations we have already defined. It is used to express queries such as “The CIds of climbers who have climbed all routes”.
Another way of phrasing this is to ask for “The CIds of climbers for which there does not exist a route that they haven’t climbed.”
Division, cont.
Let's express this query with the operations we have already defined. First we can build a relation with all possible pairs of routes and climbers:
Let's call this relation Allpairs.
Next, compute the set of all (CId, RId) pairs for which climber CId has not climbed route RId. Let’s call this relation NotClimbed:
Division, cont.
Next, π_CId(NotClimbed) is the set of ids of climbers who have not climbed some route.
Finally, the climbers who have climbed all routes are the ones who have not failed to climb some route:
    π_CId(Climbers) – π_CId(NotClimbed)
Division: the operator
Rather than write this long expression, it is easier to use a division operator applied to relations R and S.
The schema of R must be a superset of the schema of S, and the result has schema schema(R)-schema(S).
We could write “Climbers who have climbed all routes” as
What about “Routes that have been climbed by all climbers”?
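The steps described for division (all pairs, minus the pairs that occur, then discard anyone with a missing pair) can be sketched for the binary case. The function name `divide` and the data are assumptions for this example, and candidates are drawn from the dividend itself, so a climber with no climbs at all never appears.

```python
def divide(r, s):
    """r: set of (x, y) pairs; s: set of y values.
    Returns every x in r that is paired with all y in s."""
    candidates = {x for x, _ in r}
    all_pairs = {(x, y) for x in candidates for y in s}   # analogue of Allpairs
    missing = all_pairs - r                               # analogue of NotClimbed
    return candidates - {x for x, _ in missing}

# Illustrative data: (CId, RId) pairs and the set of all route ids.
climbs = {(123, 1), (123, 2), (313, 1), (214, 2)}
route_ids = {1, 2}
climber_ids = {123, 214, 313}

climbed_everything = divide(climbs, route_ids)

# "Routes climbed by all climbers": flip the pairs and divide by the climber ids.
routes_by_all = divide({(rid, cid) for cid, rid in climbs}, climber_ids)
```

With this data only climber 123 has climbed every route, and no route has been climbed by every climber.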