Relational Algebra
– Projection: π(R, c1, …, cn) = π_{c1,…,cn}(R)
  select a subset c1 … cn of the columns of R
– Selection: σ(R, pred) = σ_pred(R)
  select the subset of rows that satisfy pred
– Cross product: R1 × R2 (aka Cartesian product), where ||R|| = #attrs in R and |R| = #rows in R
  combine R1 and R2, producing a new relation with ||R1|| + ||R2|| attrs and |R1| * |R2| rows
– Join: ⋈(R1, R2, pred) = R1 ⋈_pred R2 = σ_pred(R1 × R2)
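The four operators above can be sketched in a few lines of Python. This is only an illustration, assuming a relation is a list of dicts and that the two inputs to a cross product have disjoint attribute names; the function names and the sample rows are made up for the example.

```python
from itertools import product

def project(rel, cols):
    """π: keep only the named columns, dropping duplicate rows (set semantics)."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(cols, key)))
    return out

def select(rel, pred):
    """σ: keep only the rows for which pred(row) is true."""
    return [row for row in rel if pred(row)]

def cross(r1, r2):
    """×: pair every row of r1 with every row of r2 (attribute names assumed disjoint)."""
    return [{**a, **b} for a, b in product(r1, r2)]

def join(r1, r2, pred):
    """⋈: defined as σ_pred(r1 × r2), exactly as on the slide."""
    return select(cross(r1, r2), pred)

# Hypothetical sample data, not from the slides.
animals = [{"name": "tim", "keptby": 1}, {"name": "sal", "keptby": 2}]
keepers = [{"kid": 1, "kname": "joe"}, {"kid": 2, "kname": "ann"}]

pairs = cross(animals, keepers)    # |R1| * |R2| rows, ||R1|| + ||R2|| attrs
joined = join(animals, keepers, lambda t: t["keptby"] == t["kid"])
keeper_names = project(joined, ["kname"])
```

Computing a join as a filtered cross product is the definition, not how a real system would execute it; it materializes all |R1| * |R2| pairs first.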
Relational Algebra (cont'd)
– Union: A ∪ B
  all distinct tuples that appear in A or in B
– Intersection: A ∩ B
  the subset of tuples that appear in both A and B
– Aggregation (not in the relational model, but a common extension)
  sum, count, avg, min, max
  group_by G avg(v) (R): group the tuples of R by G, apply the function to each group, and produce one output per group
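A small sketch of these operators, assuming relations are Python sets (or lists) of tuples. The helper name group_avg and the sample tuples are hypothetical.

```python
from collections import defaultdict

# Hypothetical relations with the same schema (name, age).
A = {("tim", 3), ("sal", 5)}
B = {("sal", 5), ("bob", 7)}

union = A | B          # all distinct tuples that appear in A or in B
intersection = A & B   # tuples that appear in both A and B

def group_avg(rel, group_idx, value_idx):
    """group_by G avg(v): one average per distinct value of the grouping column."""
    groups = defaultdict(list)
    for row in rel:
        groups[row[group_idx]].append(row[value_idx])
    return {g: sum(vs) / len(vs) for g, vs in groups.items()}

# Hypothetical relation (species, age); a list, so duplicate rows would be kept.
R = [("lion", 4), ("lion", 6), ("bear", 10)]
avg_age = group_avg(R, 0, 1)
```

Using a set for A and B gives the "distinct tuples" behavior for free; the aggregation input is a list because averages are sensitive to duplicates.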
Relational Algebra ↔ SQL
– SELECT list ↔ projection
– FROM list ↔ all tables referenced
– WHERE ↔ selection and join predicates
– Many equivalent relational algebra expressions exist for any one SQL query (due to relational identities):
  join reordering, select reordering, select pushdown
Example
animals(name, age, species, cageno, keptby, feedtime)
keepers(kid, name)
Cages kept by Joe:
  π_cageno(σ_{keepers.name='joe'}(animals ⋈_{keptby=kid} keepers))
  SELECT cageno FROM keepers, animals WHERE keptby = kid AND keepers.name = 'joe';
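To try the query, one option is an in-memory SQLite database. The sketch below assumes column types and sample rows that are not in the slides, and adds ORDER BY only to make the output order deterministic. The second query is one equivalent form with the selection on keepers pushed below the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keepers(kid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE animals(name TEXT, age INTEGER, species TEXT,
                     cageno INTEGER, keptby INTEGER, feedtime TEXT);
-- Hypothetical rows for illustration only.
INSERT INTO keepers VALUES (1, 'joe'), (2, 'ann');
INSERT INTO animals VALUES
  ('tim', 3, 'lion', 10, 1, '09:00'),
  ('sal', 5, 'bear', 11, 2, '10:00'),
  ('rex', 2, 'lion', 12, 1, '09:30');
""")

# The query from the slide: join first, then filter on the keeper's name.
cages = [r[0] for r in conn.execute(
    "SELECT cageno FROM keepers, animals "
    "WHERE keptby = kid AND keepers.name = 'joe' ORDER BY cageno")]

# Equivalent form with the selection pushed down onto keepers before the join.
cages_pushdown = [r[0] for r in conn.execute(
    "SELECT cageno FROM animals "
    "JOIN (SELECT kid FROM keepers WHERE name = 'joe') AS k "
    "ON keptby = k.kid ORDER BY cageno")]
```

Both queries return the same cages; which form runs faster is up to the query optimizer, which is free to pick either plan regardless of how the SQL is written.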
Multiple Feedtimes
animals(name STRING, cageno INT, keptby INT, age INT, feedtime TIME)
  CREATE TABLE feedtimes(aname STRING, feedtime TIME);
  ALTER TABLE animals RENAME TO animals2;
  ALTER TABLE animals2 DROP COLUMN feedtime;
  CREATE VIEW animals AS
    SELECT name, cageno, keptby, age,
           (SELECT feedtime FROM feedtimes WHERE aname = name LIMIT 1) AS feedtime
    FROM animals2;
Views enable logical data independence by emulating the old schema on top of the new schema.
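A runnable sketch of the view, using SQLite in memory. It assumes the migration has already happened, so it creates animals2 and feedtimes directly instead of running the ALTER TABLE steps (DROP COLUMN needs a fairly recent SQLite). The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- New schema: feed times live in their own table.
CREATE TABLE animals2(name TEXT, cageno INTEGER, keptby INTEGER, age INTEGER);
CREATE TABLE feedtimes(aname TEXT, feedtime TEXT);

-- View that presents the old single-feedtime schema under the old table name.
CREATE VIEW animals AS
  SELECT name, cageno, keptby, age,
         (SELECT feedtime FROM feedtimes WHERE aname = name LIMIT 1) AS feedtime
  FROM animals2;

-- Hypothetical rows: 'tim' now has two feed times, 'sal' has one.
INSERT INTO animals2 VALUES ('tim', 10, 1, 3), ('sal', 11, 2, 5);
INSERT INTO feedtimes VALUES ('tim', '09:00'), ('tim', '17:00'), ('sal', '10:00');
""")

# An application written against the old schema keeps working unchanged.
old_style = dict(conn.execute("SELECT name, feedtime FROM animals").fetchall())
```

Because the subquery uses LIMIT 1 without ORDER BY, which of tim's feed times the view reports is not specified; old applications see one arbitrary feed time per animal.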
Questions
1) What SQL query is this expression equivalent to?
   π_bldg(rooms ⋈_{rid=c_rid} (σ_{c_name='339'} classes))
2) Write an equivalent relational algebra expression for:
   SELECT s_name FROM student, takes, classes
   WHERE t_sid = sid AND t_cid = cid AND c_name = '339'
   a) Are there other possible expressions?
   b) Do you think one would be more "efficient" to execute? Why?
Hobby Schema
  SSN   Name   Address    Hobby    Cost
  123   john   main st    dolls    $
  123   john   main st    bugs     $
  345   mary   lake st    tennis   $$
  456   joe    first st   dolls    $
"Wide" schema: has redundancy, and anomalies in the presence of updates, inserts, and deletes
Table key is (Hobby, SSN)
[Figure: Entity Relationship Diagram. Entities Person (SSN, Name, Address) and Hobby (Cost), connected by an n:n relationship.]
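The anomalies can be seen directly on the table above. A minimal sketch, assuming the rows are held as Python dicts; the new address 'elm st' is a made-up value for the example.

```python
# The wide Hobby table from the slide, one dict per row.
hobbies = [
    {"ssn": 123, "name": "john", "address": "main st",  "hobby": "dolls",  "cost": "$"},
    {"ssn": 123, "name": "john", "address": "main st",  "hobby": "bugs",   "cost": "$"},
    {"ssn": 345, "name": "mary", "address": "lake st",  "hobby": "tennis", "cost": "$$"},
    {"ssn": 456, "name": "joe",  "address": "first st", "hobby": "dolls",  "cost": "$"},
]

# Update anomaly: John's address is stored twice, so changing only one copy
# (hypothetical new value) leaves the table contradicting itself.
hobbies[0]["address"] = "elm st"
addresses_for_john = {r["address"] for r in hobbies if r["ssn"] == 123}

# Delete anomaly: removing Mary's only hobby also removes her name and address.
remaining = [r for r in hobbies if not (r["ssn"] == 345 and r["hobby"] == "tennis")]
mary_still_known = any(r["ssn"] == 345 for r in remaining)
```

After the partial update John has two addresses on record, and after the delete nothing about Mary is left in the table.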
Database Normalization
Superkey: a set of attrs whose combined values are distinct for each tuple
Functional dependencies (FDs): X → Y (tuples that agree on X also agree on Y)
Normal forms
– 1NF: all values are atomic, a single value per attr
– 2NF: every non-key attr is functionally dependent on the whole primary key
– 3NF: no transitive deps within a relation (e.g., X → Y and Y → Z)
Boyce-Codd Normal Form (BCNF)
A relation R is in BCNF if, for every non-trivial functional dependency X → Y in the set of functional dependencies F over R, X is a superkey of R (where superkey means that X contains a key of R).
A set of relations is in BCNF if every relation in it is.
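One common way to test the superkey condition is to compute the attribute closure of X under F. The sketch below assumes attributes are single characters and an FD is a (lhs, rhs) pair of strings; the function names are illustrative. The relation and FDs are the Hobbies schema (S = SSN, H = Hobby, N = Name, A = Addr, C = Cost).

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we also get the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_superkey(attrs, relation, fds):
    """attrs is a superkey if its closure covers every attribute of the relation."""
    return set(relation) <= closure(attrs, fds)

def violates_bcnf(fd, relation, fds):
    """A non-trivial FD X -> Y violates BCNF when X is not a superkey."""
    lhs, rhs = fd
    nontrivial = not set(rhs) <= set(lhs)
    return nontrivial and not is_superkey(lhs, relation, fds)

R = "SHNAC"
F = [("SH", "NAC"), ("S", "NA"), ("H", "C")]

s_closure = closure("S", F)
```

Here S alone determines only S, N, and A, so S → N,A violates BCNF on the wide relation, while S,H → N,A,C does not because (S, H) is a key.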
BCNFify
Start with one "universal relation"
While some relation R is not in BCNF:
  Find an FD X → Y that violates BCNF on R
  Split R into R1 = (X ∪ Y), R2 = R – Y
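The loop can be sketched as follows. This is a simplified version under stated assumptions: attributes are single characters, FDs are (lhs, rhs) string pairs, and violations are only searched for among the left-hand sides of the given FDs, with Y taken as everything X determines inside R. A complete implementation would also consider FDs implied by F. All names are illustrative.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(rel, fds):
    """Return (X, Y) for an FD X -> Y that violates BCNF on rel, or None."""
    for lhs, _ in fds:
        x = set(lhs)
        if not x <= rel:
            continue                      # FD does not apply to this relation
        x_plus = closure(x, fds)
        y = (x_plus & rel) - x            # what X determines inside rel
        if y and not rel <= x_plus:       # non-trivial and X is not a superkey
            return x, y
    return None

def bcnfify(universal, fds):
    """Split relations until none has a violating FD (simplified search)."""
    todo, done = [frozenset(universal)], []
    while todo:
        rel = todo.pop()
        violation = find_violation(rel, fds)
        if violation is None:
            done.append(rel)
        else:
            x, y = violation
            todo.append(frozenset(x | y))   # R1 = X ∪ Y
            todo.append(frozenset(rel - y)) # R2 = R – Y
    return done

# Hobbies schema: S = SSN, H = Hobby, N = Name, A = Addr, C = Cost.
result = bcnfify("SHNAC", [("SH", "NAC"), ("S", "NA"), ("H", "C")])
```

On the Hobbies schema this yields the three relations (S, N, A), (H, C), and (S, H). Each split produces strictly smaller relations, so the loop terminates.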
BCNFify Example for Hobbies
S = SSN, H = Hobby, N = Name, A = Addr, C = Cost
Iter 1
  Schema (S, H, N, A, C)   FDs: S,H → N,A,C;  S → N,A;  H → C   (violates BCNF; split on S → N,A)
Iter 2
  Schema (S, N, A)         FDs: S → N,A
  Schema (S, H, C)         FDs: H → C   (violates BCNF)
Iter 3
  Schema (H, C)            FDs: H → C
  Schema (S, H)            FDs: none (key)
Study Break #2
Patient database: want to represent patients at hospitals with doctors
– Patients have names, birthdates
– Doctors have names, specialties
– Hospitals have names, addresses
– One doctor can treat multiple patients; each patient has one doctor
– Each patient is in one hospital; hospitals have many patients
– Doctors work for one hospital; hospitals have many doctors
1) Draw an ER diagram
2) What are the functional dependencies?
3) What is the normalized schema? Is it redundancy-free?
Denormalization
– Normalization reduces data redundancy
– But it requires more joins at query time
– Sometimes we sacrifice disk space (by storing redundant data) for better performance
– Whether that is worthwhile depends on the application
Conclusions
– Use set operations to manipulate data
– SQL enables users to express relational algebra more clearly
– Start schema design with an ER diagram
– Identify functional dependencies
– Normalize the schema by iteratively breaking it down
– Trade off the number of joins against the amount of redundancy