Hierarchical databases Multi-dimensional indexes

Hierarchical databases Multi-dimensional indexes
B+-tree Hashing Introduction ER-model, Data modeling, Normalization Lossless join system architecture, Basic concepts, Relational algebra, Relational data model SQL: DDL, DML Hierarchical databases Multi-dimensional indexes JDBC (DB application in Java) not covered Sept. 2012 Yangjun Chen ACS-3902

The main characters of a database The basic database design method
What is a database? The main characters of a database The basic database design method The entity-relationship data model for application modeling Introduction to the database systems Sept. 2012 Yangjun Chen ACS-3902

The main characteristics of the database approach:
single repository of data sharable by multiple users concurrency control and transaction concept security and integrity constraints self-describing - system catalogue contains meta data program-data independence some changes to the database are transparent to programs/users multiple views of data - to support individual needs of programs/users Sept. 2012 Yangjun Chen ACS-3902

Database schema, Schema evolution, Database state
Working process with a database system Database system architecture Data independence concept Concepts and Architecture Sept. 2012 Yangjun Chen ACS-3902

Course CName CNo CrHrs Dept
Database schema Relation schema Schema evolution Database state Course CName CNo CrHrs Dept Database CS C CS Student Name StNo Class Major Grades StNo Sid Grade Smith CS A Brown CS B Section SId CNo Semester Yr Instructor Spring Smith Winter Smith Spring Jones Sept. 2012 Yangjun Chen ACS-3902

Working process with a database system:
Definition record structure data elements names data types constraints etc Manipulation querying updating Construction create database files populate the database with records Sept. 2012 Yangjun Chen ACS-3902

Database Management System (DBMS)
collection of software facilitating the definition, construction and manipulation of databases DBMS Meta data Request manager Storage manager, Query evaluation Stored database Users/actors Sept. 2012 Yangjun Chen ACS-3902

Three-schema architecture
A specific user or groups view of the database External view External view Describes the whole database for all users Conceptual schema Physical storage structures and details Internal schema Sept. 2012 Yangjun Chen ACS-3902

Entity-relationship model - Entity types - strong entities
- weak entities - Relationships among entities - Attributes - attribute classification - Constraints - cardinality constraints - participation constraints - is-a relationship ER-to-Relation-mapping Data modeling using ER-model Sept. 2012 Yangjun Chen ACS-3902

ER-model: department employee project dependent works for controls
name number location 1 fname lname works for minit N department salary address name sex M number of employees startdate 1 ssn controls employee manages bdate 1 M 1 N M hours N supervisor supervisee supervision 1 works on project dependents of name number location N dependent relationship name sex birthdate Sept. 2012 Yangjun Chen ACS-3902

student graduate undergraduate
Subtype is determined by the student_class attribute student A student must be a graduate or undergraduate Student_class The bubble and the d imply disjoint subtypes (o - overlap subtypes) d The arc implies graduate and undergraduate are subtypes of student graduate undergraduate Participation of supertype may be mandatory or optional Subtypes may be disjoint or overlapping a predicate (on an attribute) determines the subtype: e.g. attribute Student_class Student_class = ‘graduate’; Student_class = ‘undergraduate’ Sept. 2012 Yangjun Chen ACS-3902

Mapping to a relational database 4 choices:
1. Create separate relations for the supertype and each of the subtypes. 2. Create relations for the subtypes only - each contains attributes from the supertype. 3. (disjoint subtypes) Create only one relation - includes all of the attributes for the supertype and all for the subtypes, and one discriminator attribute. 4. (overlapping subtypes) Create only one relation - includes all of the attributes for the supertype and all for the subtypes, and one logical discriminator attribute per subtype. PK is always the same - determined from the supertype Sept. 2012 Yangjun Chen ACS-3902

fname, minit, lname, ssn, bdate, address, JobType
bDates JobType name Ssn EMPLOYEE TypingSpeed d EngType TGrade SECRETARY TECHNICIAN ENGINEER EMPLOYEE fname, minit, lname, ssn, bdate, address, JobType SECRETARY TECHNICIAN ENGINEER Essn, TypingSpeed Essn, TGrade Essn, EngType Sept. 2012 Yangjun Chen ACS-3902

Example for super- & sub-types: choice 2
Price LicensePlate VehicleId Vehicle TNoOfPassengers d NoOfAxles CAR TRUCK Tonnage MaxSpeed CAR VehicleId, LicensePlate, Price, MaxSpeed, NoOfPassenger TRUCK VehicleId, LicensePlate, Price, NoOfAxles, Tonnage Sept. 2012 Yangjun Chen ACS-3902

fname lname minit Address bDates JobType name Ssn EMPLOYEE d TypingSpeed EngType TGrade SECRETARY TECHNICIAN ENGINEER EMPLOYEE fname, minit, lname, ssn, bdate, address, JobType, TypingSpeed, Tgrade, EngType Sept. 2012 Yangjun Chen ACS-3902

Description PartNo Part manufactureDate o Supplier DrawingNo ListPrice Manufacture_Part Purchased_Part BatchNo Part PartNo, Desription, MFlag, Drawing, ManufactureDate, BatchNo, Pflag, Supplier, ListPrice Sept. 2012 Yangjun Chen ACS-3902

static hashing & dynamic hashing hash function
external hashing static hashing & dynamic hashing hash function mathematical function that maps a key to a bucket address collisions collision resolution scheme - open addressing - chaining - multiple hashing linear hashing Hashing technique Sept. 2012 Yangjun Chen ACS-3902

External hashing: the data are on the disk.
Static hashing: using a hashing function to map keys to bucket addresses primary area can not be changed collision resolution schema: open addressing chaining multiple hashing Dynamic hashing: primary area can be changed linear hashing Sept. 2012 Yangjun Chen ACS-3902

What bucket will be chosen to split next?
Linear hashing: What is a phase? How to split a bucket? When to split a bucket? What bucket will be chosen to split next? Sept. 2012 Yangjun Chen ACS-3902

initially hash file contains M buckets
Linear hashing: initially hash file contains M buckets hi = key mod 2iM (i = 0, 1, 2, ...) insertion process can be divided into several phases phase 1: insertion using h0 = key mod M splitting using h1 = key mod 2M splitting rule: overflow of a bucket or if load factor > constant (e.g., 0.70) overflow will be put in the overflow area or redistributed through splitting a bucket splitting buckets from n = 0 to n = M- 1 (after each splitting n is increased by 1. Phase 1 finishes when n = M (in this case, the primary area becomes 2M buckets long) Sept. 2012 Yangjun Chen ACS-3902

insertion using h1 = key mod 2M splitting using h2 = key mod 4M
phase 2: insertion using h1 = key mod 2M splitting using h2 = key mod 4M splitting rule: overflow of a bucket or if load factor > constant (e.g., 0.70) overflow will be put in the overflow area or redistributed through splitting a bucket splitting buckets from n = 0 to n = 2M- 1 (after each splitting n is increased by 1. Phase 1 finishes when n = 2M (in this case, the primary area will contain 4M buckets.) phase 3: ... … h2 = …, h3 = …, ... Sept. 2012 Yangjun Chen ACS-3902

- root, internal, leaf, subtree - parent, child, sibling
balanced, unbalanced b+-tree - splits on overflow; merge on underflow - in practice it is usually 3 or 4 levels deep search, insert, delete algorithms Multi-level index Sept. 2012 Yangjun Chen ACS-3902

B+-tree provides a short access path.
Motivation B+-tree provides a short access path. file of records page1 Index page2 page3 Inverted index Signature file B+-tree Hashing … … Sept. 2012 Yangjun Chen ACS-3902

A B+-tree pinternal = 3, pleaf = 2. 5 3 7 8 1 3 5 6 7 8 9 12
Records in a file Sept. 2012 Yangjun Chen ACS-3902

B+-tree insertion: leaf node splitting, internal node splitting
Leaf splitting When a leaf splits, a new leaf is allocated the original leaf is the left sibling, the new one is the right sibling key and pointer pairs are redistributed: the left sibling will have smaller keys than the right sibling a 'copy' of the key value which is the largest of the keys in the left sibling is promoted to the parent 22 33 33 12 22 31 33 insert 31 Sept. 2012 Yangjun Chen ACS-3902

Internal node splitting
If an internal node splits and it is not the root, insert the key and pointer and then determine the middle key a new 'right' sibling is allocated everything to its left stays in the left sibling everything to its right goes into the right sibling the middle key value along with the pointer to the new right sibling is promoted to the parent (the middle key value 'moves' to the parent to become the discriminator between this left and right sibling) 26 55 55 22 33 22 33 Insert 26 Sept. 2012 Yangjun Chen ACS-3902

Internal node splitting
When a new root is formed, a key value and two pointers must be placed into it. 40 26 55 Insert 40 Sept. 2012 Yangjun Chen ACS-3902

Deleting nodes from a B+-tree:
When deleting a key from a node A, check whether the number of the remaining keys (or pointers) is  p/2. If it is not the case, redistribute the keys in the left sibling B or in the right sibling C if it is possible. Otherwise, merge A and B or merge A and C. When redistributing or merging, change the key values in the parent node so that the following condition is satisfied: < P1, K1, P2, K2, …, Pq-1, Kq-1, Pq > K1 < K2 < ... < Kq (i.e. it is an ordered set) for the key values, X, in the subtree pointed to by Pi Ki-1 < X <= Ki for 1 < i < q X <= K1 for i = 1 Kq-1 < X for i = q Sept. 2012 Yangjun Chen ACS-3902

A b+-tree p = 3, pleaf = 2. 5 3 7 8 1 3 6 7 5 8 9 12 Records
Sept. 2012 Yangjun Chen ACS-3902

Deleting 8 causes the node redistribute.
Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 7 9 1 3 6 7 5 9 12 Deleting 8 causes the node redistribute. Sept. 2012 Yangjun Chen ACS-3902

Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 7 1 3 6 7 5 9
12 is removed. Sept. 2012 Yangjun Chen ACS-3902

Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 6 1 3 5 6 7
9 is removed. Sept. 2012 Yangjun Chen ACS-3902

Entry deletion - deletion sequence: 8, 12, 9, 7 5 3 6 1 3 5 6
Deleting 7 makes this pointer no use. Therefore, a merge at the level above the leaf level occurs. Sept. 2012 Yangjun Chen ACS-3902

Entry deletion - deletion sequence: 8, 12, 9, 7 5 A 3 5 5 B 1 3 C 5 6
This point becomes useless. The corresponding node should also be removed. 3 5 5 B 1 3 C 5 6 For this merge, 5 will be taken as a key value in A since any key value in B is less than or equal to 5 but any key value in C is larger than 5. Sept. 2012 Yangjun Chen ACS-3902

Entry deletion - deletion sequence: 8, 12, 9, 7 3 5 5 1 3 5 6
Sept. 2012 Yangjun Chen ACS-3902

- relational schema (database schema)
Relational Data Model - relational schema (database schema) - relation schema, relations, database state - integrity constraints and updating Relational algebra - select, project, join, cartesian product - division - set operations: union, intersection, difference, Data modeling using Relational model Relational algebra Sept. 2012 Yangjun Chen ACS-3902

Integrity Constraints
any database will have some number of constraints that must be applied to ensure correct data (valid states) 1. domain constraints a domain is a restriction on the set of valid values domain constraints specify that the value of each attribute A must be an atomic value from the domain dom(A). 2. key constraints a superkey is any combination of attributes that uniquely identify a tuple: t1[superkey]  t2[superkey]. - Example: <Name, SSN> (in Employee) a key is superkey that has a minimal set of attributes - Example: <SSN> (in Employee) Sept. 2012 Yangjun Chen ACS-3902

Integrity Constraints
If a relation schema has more than one key, each of them is called a candidate key. one candidate key is chosen as the primary key (PK) foreign key (FK) is defined as follows: i) Consider two relation schemas R1 and R2; ii ) The attributes in FK in R1 have the same domain(s) as the primary key attributes PK in R2; the attributes FK are said to reference or refer to the relation R2; iii) A value of FK in a tuple t1 of the current state r(R1) either occurs as a value of PK for some tuple t2 in the current state r(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2. Example: FK Employee(SSN, …, Dno) Dept(Dno, … ) Sept. 2012 Yangjun Chen ACS-3902

Integrity Constraints 3. entity integrity no part of a PK can be null
4. referential integrity domain of FK must be same as domain of PK FK must be null or have a value that appears as a PK value 5. semantic integrity other rules that the application domain requires: state constraint: gross salary > net income transition constraint: Widowed can only follow Married; salary of an employee cannot decrease Sept. 2012 Yangjun Chen ACS-3902

Other SQL capabilities Assertions can be used for some constraints
e.g. Create Assertion Executed and enforced by DBMS Constraint: The salary of an employee must not be greater than the salary of the manager of the department that the employee works for. CREATE ASSERTION salary_constraint CHECK (NOT EXISTS (SELECT * FROM employee e, employee m, department d where e.salary > m.salary and e.dno=d.dnumber and d.mgrssn=m.ssn)); Sept. 2012 Yangjun Chen ACS-3902

Retrieve for each female employee a list of the names of her
Relational algebra Retrieve for each female employee a list of the names of her dependents: FEMALE_EMPS  SEX = ‘F’ (EMPLOYEE) EMPNAMES  FNAME,LNAME, SSN(FEMALE_EMPS) ACTUAL_DEPENDENTS  EMPNAMES DEPENDENT SSN = ESSN RESULT FNAME, LNAME, DEPENDENT_NAME(ACTUAL_DEPENDENTS ) Sept. 2012 Yangjun Chen ACS-3902

- select-from-where clause - group by, having, order by - update
DDL - creating schemas - modifying schemas DML - select-from-where clause - group by, having, order by - update - view SQL Sept. 2012 Yangjun Chen ACS-3902

DDL - Examples: Create schema:
Create schema COMPANY authorization JSMITH; Create table: Create table EMPLOYEE (FNAME VARCHAR(15) NOT NULL, MINIT CHAR, LNAME VARCHAR(15) NOT NULL, SSN CHAR(9) NOT NULL, BDATE DATE, ADDRESS VARCHAR(30), SEX CHAR, SALARY DECIMAL(10, 2), SUPERSSN CHAR(9), DNO INT NOT NULL, PRIMARY KEY(SSN), FOREIGN KEY(SUPERSSN) REFERENCES EMPLOYEE(SSN), FOREIGN KEY(DNO) REFERENCES DEPARTMENT(DNUMBER)); Sept. 2012 Yangjun Chen ACS-3902

DDL - Examples: drop schema drop table alter table
DROP SCHEMA CAMPANY CASCADE; DROP SCHEMA CAMPANY RESTRICT; drop table DROP TABLE DEPENDENT CASCADE; DROP TABLE DEPENDENT RESTRICT; alter table ALTER TABLE COMPANY.EMPLOYEE ADD JOB VARCHAR(12); DROP ADDRESS CASCADE; Sept. 2012 Yangjun Chen ACS-3902

DDL - Examples: Specifying constraints: Create table EMPLOYEE (…,
DNO INT NOT NULL DEFAULT 1, CONSTRAINT EMPPK PRIMARY KEY(SSN), CONSTRAINT EMPSUPERFK FOREIGN KEY(SUPERSSN) REFERENCES EMPLOYEE(SSN) ON DELETE SET NULL ON UPDATE CASCADE, CONSTRAINT EMPDEPTFK FOREIGN KEY(DNO) REFERENCES DEPARTMENT(DNUMBER) ON DELETE SET DEFAULT ON UPDATE CASCADE); Create domain: CREATE DOMAIN SSN_TYPE AS CHAR(9); Sept. 2012 Yangjun Chen ACS-3902

set null or cascade: strategies to maintain data consistency
Employee delete ssn supervisor null Employee ssn supervisor null not reasonable delete cascade delete Sept. 2012 Yangjun Chen ACS-3902

set null or cascade: strategies to maintain data consistency
Employee delete ssn supervisor null Employee ssn supervisor reasonable null set null null delete Sept. 2012 Yangjun Chen ACS-3902

set default: strategy to maintain data consistency
Employee ssn DNO 4 change this value to the default value 1. … … Department DNUMBER … … 1 … … … … 4 delete Sept. 2012 Yangjun Chen ACS-3902

DML - select-from-where clause
Retrieve a list of employees and the projects they are working on, ordered by department, within each department, ordered alphabetically by last name, first name: SELECT DNAME, LNAME, FNAME, PNAME FROM DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT WHERE DNUMBER = DNO AND SSN = ESSN AND PNO = PNUMBER ORDER BY DNAME, LNAME, FNAME order by – clause group by – clause having – clause aggregation functions: max, min, average, count, sum Sept. 2012 Yangjun Chen ACS-3902

DML - select-from-where clause Insert Update Delete
INSERT INTO employee ( fname, lname, ssn, dno ) VALUES ( "Joe", "Smith", 909, 1); UPDATE employee SET salary = WHERE ssn=909; DELETE FROM employee WHERE ssn=909; Note that Access changes the above to read: INSERT INTO employee ( fname, lname, ssn, dno ) SELECT "Joe", "Smith", 909, 1; Sept. 2012 Yangjun Chen ACS-3902

Use a Create View command
View definition Use a Create View command essentially a select specifying the data that makes up the view Create View Enames as select lname, fname from employee CREATE VIEW Enames (lname, fname) AS SELECT LNAME, FNAME FROM EMPLOYEE Sept. 2012 Yangjun Chen ACS-3902

CREATE VIEW DEPT_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL)
AS SELECT DNAME, COUNT(*), SUM(SALARY) FROM DEPARTMENT, EMPLOYEE WHERE DNUMBER = DNO GROUP BY DNAME; Sept. 2012 Yangjun Chen ACS-3902

(Database application in Java)
JDBC (Database application in Java) Sept. 2012 Yangjun Chen ACS-3902

To develop a database application, JDBC or ODBC should be used.
JDBC – JAVA Database Connectivity ODBC – Open Database Connectivity JDBC-ODBC Bridge ODBC Driver Database Client Client Server Database Sept. 2012 Yangjun Chen ACS-3902

Connection to a database: Loading driver class
Class.forName(“sun.jdbc.odbc.JdbcOdbcDriver”); 2. Connection to a database String url = “jdbc:odbc:<databaseName>”; Connction con = DriverManager.getConnection(url, <userName>, <password>) Sept. 2012 Yangjun Chen ACS-3902

Sending SQL statements Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(“SELECT * FROM Information WHERE Balance >= 5000”); 4. Getting results while (rs.next()) { … } a table name Sept. 2012 Yangjun Chen ACS-3902

password import java.sql.*; public class DataSourceDemo1
{ public static void main(String[] args) { Connection con = null; try { //load driver class Class.forName{“sun.jdbs.odbs.JdbsOdbcDriver”); //data source String url = “jdbs:odbc:Customers”; //get connection con = DriverManager.getConnection(url, “sa”, “ “) password Sept. 2012 Yangjun Chen ACS-3902

//create SQL statement Statement stmt = con.createStatement();
//execute query Result rs = stmt.executeQuery(“SELECT * FROM Information WHERE Balance >= 5000”); String firstName, lastName; Date birthDate; float balance; int accountLevel; Sept. 2012 Yangjun Chen ACS-3902

{ firstName = rs.getString(“FirstName”);
while(rs.next()) { firstName = rs.getString(“FirstName”); lastName = rs.getString(“lastName”); balance = rs.getFloat(“Balance”); System.out.println(firstName + “ “ + lastName + “, balance = “ + balance); } catch(Exception e) { e.printStackTrace();} finally { try{con.close();} catch(Exception e){ } Sept. 2012 Yangjun Chen ACS-3902

Programming in an dynamical environment:
Disadvantage of DataSourceDemo1: If the JDBC-ODBC driver, database, user names, or password are changed, the program has to be modifid. Solution: Configuration file: config.driver=sun.jdbc.odbc.JdbcOdbcDriver config.protocol=jdbc config.subprotocol=odbc config.dsname=Customers config.username=sa config.password= file name: datasource.config <property> = <property value> config – datasource name Sept. 2012 Yangjun Chen ACS-3902

import java.util.Properties; public class DatabaseAccess
import java.sql.*; import java.io.*; import java.util.Properties; public class DatabaseAccess { private String configDir; //directory for configuration file private String dsDriver = null; private String dsProtocol = null; private String dsSubprotocol = null; private String dsName = null; private String dsUsername = null; private String dsPassword = null; Sept. 2012 Yangjun Chen ACS-3902

public DatabaseAccess(String configDir)
{ this.configDir = configDir; } public DatabaseAccess() { this(“.”); } //source: data source name //configFile: source configuration file public Connection getConnection(String source, String configFile) throws SQLException, Exception { Connection con; try { Properties prop = loadConfig(ConfigDir, ConfigFile); getConnection(“config”, “datasource.config”); Sept. 2012 Yangjun Chen ACS-3902

{ dsDriver = prop.getProperty(source + “.driver”);
if (prop != null) { dsDriver = prop.getProperty(source + “.driver”); dsProtocol = prop.getPropert(source + “.protocol”); dsSubprotocol = prop.getPropert(source + “.subprotocol”); if (dsName == null) dsName = prop.getProperty(source + “.dsName”); if (dsUsername == null) dsUsername = prop.getProperty(source + “.username”); if (dsPassword == null) dsPassword = prop.getProperty(source + “.password”); Sept. 2012 Yangjun Chen ACS-3902

Class.forName(dsDriver); //connect to data source
//load driver class Class.forName(dsDriver); //connect to data source String url = dsProtocol + “:” + dsSubprotocol + “:” + dsName; con = DriverManager.getConnection(url, dsUsername, dsPassword) } else throw new Exception(“*Cannot find property file” + configFile); return con; catch (ClassNotFoundException e) { throw new Exception(“* Cannot find driver class “ + dsDriver + “!”); } Sept. 2012 Yangjun Chen ACS-3902

//dir: directory of configuration file //filename: file name
public Properties loadConfig(String dir, String filename) throws Exception { File inFile = null; Properties prop = null; try { inFile = new File(dir, filename); if (inFile.exists() { prop = new Properties(); prop.load(new FileInputStream(inFile)); } else throw new Exception(“* Error in finding “ + inFile.toString()); finally {return prop;} Sept. 2012 Yangjun Chen ACS-3902

DatabaseAccess db = new databaseAccess();
Using class DatabaseAccess, DataSourceDemo1 should be modified a little bit: DatabaseAccess db = new databaseAccess(); con = db.getConnection(“config”, “datasource.config”); Sept. 2012 Yangjun Chen ACS-3902

function dependencies - data redundancy, update anomalies
- what is a function dependency? - inference rules, minimal set of FDs normal forms - first normal form - second normal form - third normal form - Boyce Codd normal form Normalization Sept. 2012 Yangjun Chen ACS-3902

Data redundancy and update anomalies:
EmployeeDepartment ename ssn bdate address dnumber dname This is similar to Employee, but we have included dname. Sept. 2012 Yangjun Chen ACS-3902

This is similar to Works_on, but we have included ename and plocation
EmployeeProject ssn pnumber hours ename plocation This is similar to Works_on, but we have included ename and plocation Sept. 2012 Yangjun Chen ACS-3902

Redundant data leads to additional space requirements update anomalies
In the two prior cases with EmployeeDepartment and EmployeeProject, we have redundant information in the database … if two employees work in the same department, then that department name is replicated if more than one employee works on a project then the project location is replicated if an employee works on more than one project his/her name is replicated Redundant data leads to additional space requirements update anomalies Sept. 2012 Yangjun Chen ACS-3902

modification anomalies
Suppose EmployeeDepartment is the only relation where department name is recorded insert anomalies adding a new department is complicated unless there is also an employee for that department deletion anomalies if we delete all employees for some department, what should happen to the department information? modification anomalies if we change the name of a department, then we must change it in all tuples referring to that department Sept. 2012 Yangjun Chen ACS-3902

Functional dependencies:
Suppose we have a relation R comprising attributes X,Y, … We say a functional dependency exists between the attributes X and Y, if, whenever a tuple exists with the value x for X, it will always have the same value y for Y. X Y X Y LHS RHS Sept. 2012 Yangjun Chen ACS-3902

Student course_no student_no student_name gender Student_no
Given a specific student number, there is only one value for student name and only one value for gender found with it. Sept. 2012 Yangjun Chen ACS-3902

Inference Rules for Function Dependencies
From a set of FDs, we can derive some other FDs Example: F = {ssn  {Ename, Bdate, Address, dnumber}, dnumber  {dname, dmgrssn}} inference ssn  {dname, dmgrssn}, ssn  dnumber, dnumber  dname. F+ (closure of F): The set of all FDs that can be deduced from F (with F together) is called the closure of F. Sept. 2012 Yangjun Chen ACS-3902

Inference Rules for Function Dependencies Inference rules:
- IR1 (reflexive rule): If X  Y, then X  Y. (X  X.) - IR2 (augmentation rule): {X  Y} |= ZX  ZY. - IR3 (transitive rule): {X  Y, Y  Z} |= X  Z. - IR4 (decomposition, or projective, rule): {X  ZY} |= X  Y, X  Z. - IR5 (union, or additive, rule): {X  Y, Y  Z} |= X  ZY. - IR6 (pseudotransitive rule): {X  Y, WY  Z} |= WX  Z. Sept. 2012 Yangjun Chen ACS-3902

Equivalence of Sets of FDs E and F are equivalent if E+ = F+.
Minimal sets of FDs every dependency has a single attribute on the RHS the attributes on the LHS of a dependency are minimal we cannot remove any dependency from F and still have a set of dependencies that is equivalent to F. ssn pnumber hours ename plocation {ssn, pnumber}  hours, ssn  ename, pnumber  plocation. Sept. 2012 Yangjun Chen ACS-3902

We’ll consider 1NF, 2NF, 3NF, and BCNF.
Normal Forms A series of normal forms are known that have, successively, better update characteristics. We’ll consider 1NF, 2NF, 3NF, and BCNF. A technique used to improve a relation is decomposition, where one relation is replaced by two or more relations. When we do so, we want to eliminate update anomalies without losing any information. Sept. 2012 Yangjun Chen ACS-3902

Bellaire, Sugarland, Houston
1NF - First Normal Form The domain of an attribute must only contain atomic values. This disallows repeating values, sets of values, relations within relations, nested relations, … In the example database we have a department located in possibly several locations: department 5 is located in Bellaire, Sugarland, and Houston. If we had the relation then it would not be 1NF because there are multiple values to be kept in dlocations. Department dnumber dname dmgrssn dlocations 5 Research Bellaire, Sugarland, Houston Sept. 2012 Yangjun Chen ACS-3902

Generally considered the best solution 5 Houston
1NF - First Normal Form If we have a non-1NF relation we can decompose it, or modify it appropriately, to generate 1NF relations. There are 3 options: option 1: split off the problem attribute into a new relation (create a DepartmentLocation relation). Department DepartmentLocation dnumber dname dmgrssn dnumber dlocation 5 Research 5 Bellaire 5 Sugarland Generally considered the best solution 5 Houston Sept. 2012 Yangjun Chen ACS-3902

full functional dependency
2NF - Second Normal Form full functional dependency X  Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more. ssn pnumber hours ename plocation EmployeeProject {ssn, pnumber}  hours is a full dependency (neither ssn  hours , nor pnumber  hours). Sept. 2012 Yangjun Chen ACS-3902

partial functional dependency
2NF - Second Normal Form partial functional dependency X  Y is a partial functional dependency if removal of some attribute A from X does not affect the dependency. {ssn, pnumber}  ename is a partial dependency because ssn ename holds.) ssn pnumber hours ename plocation EmployeeProject Sept. 2012 Yangjun Chen ACS-3902

A relation schema is in 2NF if (1) it is in 1NF and
2NF - Second Normal Form A relation schema is in 2NF if (1) it is in 1NF and (2) every non-key attribute must be fully functionally dependent on the primary key. If we had the relation EmployeeProject ssn pnumber hours ename plocation then this relation would not be 2NF because of two separate violations of the 2NF definition: Sept. 2012 Yangjun Chen ACS-3902

ssn pnumber ssn pnumber ssn pnumber
2NF - Second Normal Form We correct this by decomposing the relation into three relations - splitting off the offending attributes - splitting off partial dependencies on the key. EmployeeProject ssn pnumber hours ename plocation ssn pnumber hours 2NF ssn ename pnumber plocation Sept. 2012 Yangjun Chen ACS-3902

Transitive dependency
3NF - Third Normal Form Transitive dependency A functional dependency X  Y in a relation schema R is a transitive dependency if there is a set of attributes Z that is not a subset of any key of R, and both X  Z and Z  Y hold. EmployeeDept ename ssn bdate address dnumber dname ssn  dnumber and dnumber  dname Sept. 2012 Yangjun Chen ACS-3902

A relation schema is in 3NF if (1) it is in 2NF and
3NF - Third Normal Form A relation schema is in 3NF if (1) it is in 2NF and (2) each non-key attribute must not be fully functionally dependent on another non-key attribute (there must be no transitive dependency of a non-key attribute on the PK) If we had the relation ename ssn bdate address dnumber dname then this relation would not be 3NF because dname is functionally dependent on dnumber and neither is a key attribute Sept. 2012 Yangjun Chen ACS-3902

3NF - Third Normal Form We correct this by decomposing - splitting off the transitive dependencies EmployeeDept ename ssn bdate address dnumber dname ename ssn bdate address dnumber 3NF dnumber dname Sept. 2012 Yangjun Chen ACS-3902

Boyce Codd Normal Form, BCNF
Consider a different definition of 3NF, which is equivalent to the previous one. A relation schema R is in 3NF if, whenever a function dependency X  A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R. A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S R with the propertity that no tuples t1 and t2 in any legal state r of R will have t1[S] = t2[S]. An attribute is called a prime attribute if it is a member of any key. Sept. 2012 Yangjun Chen ACS-3902

If we remove (b) from the previous definition for 3NF, we have the definition for BCNF. A relation schema is in BCNF if every determinant is a superkey key. Stronger than 3NF: - no partial dependencies - no transitive dependencies where a non-key attribute is dependent on another non-key attribute - no non-key attributes appear in the LHS of a functional dependency. Sept. 2012 Yangjun Chen ACS-3902

Consider: Instructor teaches one course only. student_no course_no instr_no In 3NF! Student takes a course and has one instructor. {student_no, course_no}  instr_no instr_no  course_no Sept. 2012 Yangjun Chen ACS-3902

Boyce Codd Normal Form, BCNF Some sample data:
student_no course_no instr_no 121 1803 99 Instructor 99 teaches 1803 Instructor 77 teaches 1903 Instructor 66 teaches 1803 121 1903 77 222 1803 66 222 1903 77 Sept. 2012 Yangjun Chen ACS-3902

student_no course_no instr_no 121 1803 99 1903 77 222 66 Instructor 99 teaches 1803 Instructor 77 teaches 1903 Instructor 66 teaches 1803 Deletion anomaly: If we delete all rows for course 1803 we’ll lose the information that instructors 99 teaches student 121 and 66 teaches student 222. Insertion anomaly: How do we add the fact that instructor 55 teaches course 2906? Sept. 2012 Yangjun Chen ACS-3902

Which decomposition preserves all the information? S# C# C# I# 121 1803 1803 99 student_no course_no 121 1903 1903 77 ? course_no instr_no 222 1803 1803 66 222 1903 student_no course_no Joining these two tables leads to spurious tuples - result includes student_no instr_no student_no instr_no course_no instr_no Sept. 2012 Yangjun Chen ACS-3902

 student_no course_no instr_no student_no course_no course_no
121 1803 99 1903 77 222 66 121 1803 1903 222 S# C# 1803 99 1903 77 66 C# I#  Sept. 2012 Yangjun Chen ACS-3902

Which decomposition preserves all the information? S# C# S# I# 121 1803 121 99 student_no course_no 121 1903 121 77 course_no instr_no 222 1803 222 66 222 1903 222 77 student_no course_no ? Joining these two tables leads to spurious tuples - result includes student_no instr_no student_no instr_no course_no instr_no Sept. 2012 Yangjun Chen ACS-3902

 student_no course_no instr_no student_no course_no student_no
121 1803 99 1903 77 222 66 121 1803 1903 222 S# C# 99 77 66 121 222 I# S#  Sept. 2012 Yangjun Chen ACS-3902

This decomposition preserves all the information. S# I# C# I# 121 99 1803 99 student_no instr_no 121 77 1903 77 course_no instr_no 222 66 1803 66 222 77 Only FD is instr_no course_no but the join preserves {student_no, course_no} instr_no Sept. 2012 Yangjun Chen ACS-3902

student_no course_no instr_no student_no Instr_no course_no instr_no
121 1803 99 1903 77 222 66 121 99 77 222 66 S# I# 1803 1903 99 77 66 C# I# Sept. 2012 Yangjun Chen ACS-3902

Definition of lossless join property - relation decomposition
Testing algorithm - matrix construction - matrix initialization - matrix modification Lossless join Sept. 2012 Yangjun Chen ACS-3902

Basic definition of Lossless-join
A decomposition D = {R1, R2,..., Rm} of R has the lossless join property with respect to the set of dependencies F on R if, for every relation r of R that satisfies F, the following holds, (R1(r), ..., Rm(r)) = r, where  is the natural join of all the relations in D. The word loss in lossless refers to loss of information, not to loss of tuples. Sept. 2012 Yangjun Chen ACS-3902

F = {SSN  ENAME, PNUM  {PNAME, PLOCATION}, {SSN, PNUM}  hours}
Emp_PROJ SSN PNUM hours ENAME PNAME PLOCATION F = {SSN  ENAME, PNUM  {PNAME, PLOCATION}, {SSN, PNUM}  hours} R1 R2 SSN ENAME PNUM PNAME PLOCATION R3 SSN PNUM hours Lossless join Sept. 2012 Yangjun Chen ACS-3902

decomposion-1 A1 SSN A2 ENAME A3 PNUM A4 PNAME A5 PLOCATION A6 hours
b11 b21 b31 b12 b22 b32 b13 b23 b33 b14 b24 b34 b15 b25 b35 b16 b26 b36 R1 R2 R3 a1 b21 a2 b22 b32 b13 a3 b14 a4 b34 b15 a5 b35 b16 b26 a6 R1 R2 R3 Sept. 2012 Yangjun Chen ACS-3902

PNUM  {PNAME, PLOCATION}
SSN  ENAME SSN ENAME R1 R2 R3 a1 a2 b13 b14 b15 b16 b21 b22 a3 a4 a5 b26 a1 a2 a3 b34 b35 a6 PNUM  {PNAME, PLOCATION} PNUM PNAME PLOCATION R1 R2 R3 a1 a2 b13 b14 b15 b16 b21 b22 a3 a4 a5 b26 a1 a2 a3 a4 a5 a6 Sept. 2012 Yangjun Chen ACS-3902

Example: decomposition-2
Emp_PROJ SSN PNUM hours ENAME PNAME PLOCATION F = {SSN  ENAME, PNUM  {PNAME, PLOCATION}, {SSN, PNUM}  hours} R1 ENAME PLOCATION Not lossless join R2 SSN PNUM hours PNAME PLOCATION Sept. 2012 Yangjun Chen ACS-3902

The matrix can not be changed!
decomposition-2 A1 SSN A2 ENAME A3 PNUM A4 PNAME A5 PLOCATION A6 hours b11 b21 b12 b22 b13 b23 b14 b24 b15 b25 b16 b26 R1 R2 R1 b11 a2 b13 b14 a5 b16 R2 a1 b22 a3 a4 a5 a6 SSN  ENAME PNUM  {PNAME, PLOCATION} {SSN, PNUM}  hours The matrix can not be changed! Sept. 2012 Yangjun Chen ACS-3902

Multi-Dimensional Indexes
Multiple-key indexes kd-trees Quad trees R-trees Bit map Inverted files Sept. 2012 Yangjun Chen ACS-3902

(Indexes over more than one attributes)
Multiple-key indexes (Indexes over more than one attributes) Employee ename ssn age salary dnumber Aaron, Ed Abbott, Diane Adams, John Adams, Robin Sept. 2012 Yangjun Chen ACS-3902

(Indexes over more than one attributes)
Multiple-key indexes (Indexes over more than one attributes) Index on age Index on salary Sept. 2012 Yangjun Chen ACS-3902

Multiple-key indexes 25 30 45 50 60 70 85 400 350 260 75 100 120 275 110 140 Sept. 2012 Yangjun Chen ACS-3902

(A generalization of binary trees)
kd-Trees (A generalization of binary trees) A kd-tree is a binary tree in which interior nodes have an associated attribute a and a value v that splits the data points into two parts: those with a-value less than v and those with a-value equal or larger than v. Sept. 2012 Yangjun Chen ACS-3902

kd-Trees salary 150 age 60 age 47 salary 80 70, 110 85, 140 salary 300
50, 275 60, 260 age 38 50, 100 50, 120 30, 260 25, 400 45, 350 25, 60 45, 60 50, 75 Sept. 2012 Yangjun Chen ACS-3902

Insert a new entry into a kd-tree:
salary 150 age 60 age 47 salary 80 70, 110 85, 140 salary 300 50, 275 60, 260 age 38 50, 100 50, 120 30, 260 25, 400 45, 350 25, 60 45, 60 50, 75 Sept. 2012 Yangjun Chen ACS-3902

Insert a new entry into a kd-tree:
salary 150 insert(35, 500): age 60 age 47 salary 80 70, 110 85, 140 salary 300 50, 275 60, 260 age 38 50, 100 50, 120 30, 260 age 35 25, 60 45, 60 50, 75 25, 400 35, 500 45, 350 Sept. 2012 Yangjun Chen ACS-3902

Quad-trees In a Quad-tree, each node corresponds to a square region in two dimensions, or to a k-dimensional cube in k dimensions. If the number of data entries in a square is not larger than what will fit in a block, then we can think of this square as a leaf node. If there are too many data entries to fit in one block, then we treat the square as an interior node, whose children correspond to its four quadrants. Sept. 2012 Yangjun Chen ACS-3902

Quad-trees … … … 400k salary 100 age name age salary … 25 400
100 age Sept. 2012 Yangjun Chen ACS-3902

Quad-trees SW NW SE NE SW – south-west SE – south-east NW – north-west
400k Quad-trees 50, 200 100 SW NW SE NE 25, 60 46, 60 50, 275 60, 260 75, 100 25, 300 50, 75 50, 100 85, 140 50, 120 70, 110 30, 260 25, 400 45, 350 SW – south-west SE – south-east NW – north-west NE – north-east Sept. 2012 Yangjun Chen ACS-3902

An R-tree is an extension of B-trees for multidimensional data.
R-trees An R-tree is an extension of B-trees for multidimensional data. An R-tree corresponds to a whole area (a rectangle for two-dimensional data.) In an R-tree, any interior node corresponds to some interior regions, or just regions, which are usually a rectangle Each region x in an interior node n is associated with a link to a child of n, which corresponds to all the subregions within x. Sept. 2012 Yangjun Chen ACS-3902

Suppose that the local cellular phone company adds a POP (point
of presence, or base station) at the position shown below. 100 school POP road1 house2 house1 pipeline road2 100 Sept. 2012 Yangjun Chen ACS-3902

R-trees ((0, 0), (60, 50)) ((20, 20), (100, 80)) road1 road2 house1
((0, 0), (60, 50)) ((20, 20), (100, 80)) road1 road2 house1 school house2 pipeline pop Sept. 2012 Yangjun Chen ACS-3902

Insert a new region r into an R-tree.
100 school POP road1 house2 house1 pipeline ((70, 5), (980, 15)) road2 house3 100 Sept. 2012 Yangjun Chen ACS-3902

Search the R-tree, starting at the root. If the encountered node is internal, find a subregion into which r fits. If there is more than one such region, pick one and go to its corresponding child. If there is no subregion that contains r, choose any subregion such that it needs to be expanded as little as possible to contain r. ((70, 5), (980, 15)) ((0, 0), (60, 50)) ((20, 20), (100, 80)) road1 road2 house1 school house2 pipeline pop Sept. 2012 Yangjun Chen ACS-3902

If we expand the lower subregion, corresponding to the first
Two choices: If we expand the lower subregion, corresponding to the first leaf, then we add 1000 square units to the region. If we extend the other subregion by lowering its bottom by 5 units, then we add 1200 square units. ((0, 0), (80, 50)) ((20, 20), (100, 80)) road1 road2 house1 house3 school house2 pipeline pop Sept. 2012 Yangjun Chen ACS-3902

If the encountered node v is a leaf, insert r into it. If there is no room for r, split the leaf into two and distribute all subregions in them as evenly as possible. Calculate the ‘parent’ regions for the new leaf nodes and insert them into v’s parent. If there is the room at v’s parent, we are done. Otherwise, we recursively split nodes going up the tree. Add POP (point of presence, or base station) Suppose that each leaf has room for 6 regions. ((0, 0), (100, 100)) road1 road2 house1 school house2 pipeline Sept. 2012 Yangjun Chen ACS-3902

Bit map Image that the records of a file are numbered 1, …, n.
A bitmap for a data field F is a collection of bit-vector of length n, one for each possible value that may appear in the field F. The vector for a specific value v has 1 in position i if the ith record has v in the field F, and it has 0 there if not. Sept. 2012 Yangjun Chen ACS-3902

Example Employee Bit maps for age: 30: 1100000 40: 0010000 50: 0001000
ename ssn age salary dnumber Aaron, Ed Abbott, Diane Adams, John Adams, Robin Brian, Robin Brian, Mary Widom, Jones 30 40 50 55 60 75 78 80 100 Bit maps for age: 30: 40: 50: 55: 60: Bit maps for salary: 60: 75: 78: 80: 100: Sept. 2012 Yangjun Chen ACS-3902

Query evaluation Select ename From Employee
Where age = 55 and salary = 80 In order to evaluate this query, we intersect the vectors for age = 55 and salary = 80. vector for age = 55 vector for salary = 80  This indicates the 6th tuple is the answer. Sept. 2012 Yangjun Chen ACS-3902

Range query evaluation
Select ename From Employee Where 30 < age < 55 and 60 < salary < 78 We first find the bit-vectors for the age values in (30, 50); there are only two: and for 40 and 50, respectively. Take their bitwise OR:  = Next find the bit-vectors for the salary values in (60, 78) and take their bitwise OR:  =  The 3rd and 4th tuples are the answer. Sept. 2012 Yangjun Chen ACS-3902

Compression of bitmaps Run-length encoding:
Run in a bit vector: a sequence of i 0’s followed by a 1. This bit vector contains two runs. Run compression: a run r is represented as another bit string r’ composed of two parts. part 1: i expressed as a binary number, denoted as b1(i). part 2: Assume that b1(i) is j bits long. Then, part 2 is a sequence of (j – 1) 1’s followed by a 0, denoted as b2(i). r’ = b2(i)b1(i). Sept. 2012 Yangjun Chen ACS-3902

Compression of bitmaps Run-length encoding:
Run in a bit vector s: a sequence of i 0’s followed by a 1. This bit vector contains two runs. r’ = b2(i)b1(i). r1 = r1’ = b11 = 7 = 111, b12 = 110 r2 = 0001 b11 = 3 = 11, b12 = 10 r2’ = 1011 Sept. 2012 Yangjun Chen ACS-3902

Decoding a compressed sequence s’:
Starting at the beginning, find the first 0 at the 3rd bit, so j = 3. The next 3 bits are 111, so we determine that the first integer is 7. In the same way, we can decode1011. r1’ r2’ = Decoding a compressed sequence s’: 1. Scan s’ from the beginning to find the first 0. 2. Let the first 0 appears at position j. Check the next j bits. The corresponding value is a run. 3. Remove all these bits from s’. Go to (1). Sept. 2012 Yangjun Chen ACS-3902

Inverted files An inverted file - A list of pairs of the form: <key word, pointer> … the cat is fat cat … was raining cats and dogs … dog … Fido the Dogs … a bucket of pointers Sept. 2012 Yangjun Chen ACS-3902

Inverted files When we use “buckets” of pointers to occurrences of each word, we may extend the idea to include in the bucket array some information about each occurrence. type position … title 5 … the cat is fat header 10 cat anchor 3 … text 57 … was raining cats and dogs … dog … … … Fido the Dogs … Sept. 2012 Yangjun Chen ACS-3902

Hierarchical database schema - hierarchical schema
- record type, PCR type - virtual PCR: virtual child, virtual parent Database languages - HDDL - HDML Hierarchical databases Sept. 2012 Yangjun Chen ACS-3902

ERD for Chapter 6 database example dependent
1 1 Works on employee m n n 1 project n 1 1 1 department Dept_locations n 1 Sept. 2012 Yangjun Chen ACS-3902

Virtual Parent-child Relationships
- Hierarchical schema using VPCR - for a Company database D Department Dname Dnum E Employee Ename Minit … ... L P Dlocation Location Project Esupervisee Pname … ... SPTR S Y Demployee Dependent T EPTR ... DEPname Minit M Dmanager W StartDate MPTR Pworker Hours WPTR Sept. 2012 Yangjun Chen ACS-3902

Hierarchical databases Multi-dimensional indexes

Similar presentations

Presentation on theme: "Hierarchical databases Multi-dimensional indexes"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Hierarchical databases Multi-dimensional indexes

Similar presentations

Presentation on theme: "Hierarchical databases Multi-dimensional indexes"— Presentation transcript:

Similar presentations

About project

Feedback