Download presentation
Presentation is loading. Please wait.
1
Compiler Concepts for Database Systems
Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT (860)
2
Overview Motivation and Background Database System Architecture
Exploring its Capabilities Focusing on Compiler-Related Concepts Compile Time Issues in Database Systems The SQL Query Language Optimization Issues in Database Systems Typing Runtime Issues in Database Systems Transaction Processing Execution for Complex Joins
3
Database System Architecture
What are the Various Components? How do they Relate to Compilers?
4
How Does it Compare to Java Environment?
5
Database Concepts - Summary
Schema vs. Data Database-Structured Collection of Data Describing Objects of Universe of Discourse being Modeling. A Database Consists of Schema and Data Schema: Describes the Intension (Type) of Objects Data: Describes the Extension (Instances) of Objects What is Schema w.r.t. Compilers? What is Data?
6
What is a DBMS? A Database Management System (DBMS) is the Generalized Tool that Facilitates the Management of and Access to the Database Main Functions: Defining a Database: Specifying Data Types, Structures, and Constraints Constructing a Database: the Process of Storing the Data Itself on Some Storage Medium Manipulating a Database: Function for Querying Specific Data in the Database and Updating the Database What are the Analogies of Each of the Main Functions w.r.t. Programming Languages and Compilers?
7
What is a DBMS? Additional Functions: Interaction with File Manager
So that Details Related to Data Storage and Access are Removed From Application Programs Integrity Enforcement Guarantee Correctness, Validity, Consistency Security Enforcement Prevent Data From Illegal Uses Concurrency Control Control the Interference Between Concurrent Programs Recovery from Failure Query Processing and Optimization Again – What are Relevant Compiler Concepts?
8
DBMS Architecture DBMS Languages Data Definition Language (DDL)
Data Manipulation Language (DML) From Embedded Queries or DB Commands Within a Program “Stand-alone” Query Language Host Language: DML Specification (e.g., SQL) is Embedded in a “Host” Programming Language (e.g., Java, C++) DBMS Interfaces Menu-Based Interface Graphical Interface Forms-Based Interface Interface for DBA (DB Administrator)
9
DBMS Architecture Main DBMS Modules DDL Compiler DML Compiler
Ad-hoc (Interactive) Query Compiler Run-time Database Processor Stored Data Manager Concurrency/Back-Up/Recovery Subsystem DBMS Utility Modules Loading Routines Backup Utility System Catalog/data Dictionary
10
Components of a DBMS
11
ANSI/SPARC - Three Schema Architecture
External Data Schema (Users’ view) Conceptual Data Schema (Logical Schema) Internal Data Schema (Physical Schema) What are the Programming Language Analogies?
12
Conceptual Schema Describes the Meaning of Data in the Universe of Discourse Emphasizes on General, Conceptually Relevant, and Often Time Invariant Structural Aspects of the Universe of Discourse Excludes the Physical Organization and Access Aspects of the Data This could be a UML Design that Realizes a Set of Classes (no data) or Java Class Declarations (APIs)
13
Conceptual Schema Another Example – A Programming Language Level Definition
14
External Schema Describes Parts of the Information in the Conceptual Schema in a form Convenient to a Particular User Group’s View Derived from the Conceptual Schema What is the View of the Outside World in OO? Akin to Public Interface
15
External Schema Another Example
16
Internal Schema Describes How the Information Described in the Conceptual Schema is Physically Represented in a Database to Provide the Overall Best Performance
17
Internal Schema Another Example
This Corresponds to Data Typing and Layout in Compilers from Runtime Environment!
18
Unified Example of Three Schemas
19
Database Access Process
What Does This Access Process Resemble? Akin to Runtime Execution Environment! A More Complex Activation Process!
20
Metadata vs. Data Recall Introspection and Reflection in Java where you Can “Look” into the Class Definitions Themselves!
21
Data Independence Ability that Allows Application Programs Not Being Affected by Changes in Irrelevant Parts of the Conceptual Data Representation, Data Storage Structure and Data Access Methods Invisibility (Transparency) of the Details of Entire Database Organization, Storage Structure and Access Strategy to the Users Recall Software Engineering Concepts: Abstraction the Details of an Application's Components Can Be Hidden, Providing a Broad Perspective on the Design Representation Independence: Changes Can Be Made to the Implementation that have No Impact on the Interface and Its Users Realized in Today’s Modern PLs!
22
What are System Components?
How are these Similar to Complier/PL Concepts?
23
Relational Model Relational Model of Data Based on the Concept of a Relation Relation - a Mathematical Concept Based on Sets Strength of the Relational Approach to Data Management Comes From the Formal Foundation Provided by the Theory of Relations RELATION: A Table of Values A Relation May Be Thought of as a Set of Rows A Relation May Alternately be Though of as a Set of Columns Each Row of the Relation May Be Given an Identifier Each Column Typically is Called by its Column Name or Column Header or Attribute Name
24
Relational Tables - Rows/Columns/Tuples
25
Relational Database Definition
CREATE TABLE Student: Name(CHAR(30)), SSN(CHAR(9)), Gpa(FLOAT(2)) CREATE TABLE Faculty: Name(CHAR(30)), SSN(CHAR(9)), Ophone(CHAR(7)) CREATE TABLE Courses: Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)), PCourse#(CHAR(6)) CREATE TABLE Formats: Section#(INTEGER(3)), Quarter(CHAR(10)), Campus(CHAR(15)) CREATE TABLE TakeorTeach: SSN(CHAR(9)), Course#(CHAR(6)), Section#(INTEGER(3)) CREATE TABLE COfferings: Course#(CHAR(6)), Section#(INTEGER(3)) Student(Name*, SSN, Gpa) Faculty(Name*, SSN, Ophone) Courses(Course#*, Title, Descrip, PCourse#*) Formats(Section#*, Quarter, Campus) TakeorTeach(SSN, Course#, Section#) COfferings(Course#, Section#)
26
Relational Views Two Views Derived From Prior Tables
Student Transcript View Course Prerequisite View
27
SQL: Tuple Relational Calculus-Based
SQL is a Partial Example of a Tuple Relational Language Simple Queries are all Declarative More Complex Queries are both Declarative and Procedural (e.g., joins, nested queries) Find the names of employees working on the CAD/CAM project SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE (EMP.ENO= WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = “CAD/CAM”) SQL Defines a Programming Language and Associated Semantics for Usage and Processing
28
SQL Components Data Definition Language (DDL)
For External and Conceptual Schemas Views - DDL for External Schemas Data Manipulation Language (DML) Interactive DML Against External and Conceptual Schemas Embedded DML in Host PLs (EQL, JDBC, etc.) Note: Separation of Definition (DDL) from Usage (DML) – Is there Something Similar in PLs? Others Integrity (Allowable Values/Referential) Transaction Control (Long-Duration and Batch) Authorization (Who can Do What When)
29
SQL DDL and DML Data Definition Language (DDL) - Declarations
Defining the Relational Schema - Relations, Attributes, Domains - The Meta-Data CREATE TABLE Student: Name(CHAR(30)),SSN(CHAR(9)),GPA(FLOAT(2)) CREATE TABLE Courses: Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)), PCourse#(CHAR(6)) Data Manipulation Language (DML) - Code Defining the Queries Against the Schema SELECT Name, SSN From Student Where GPA > 3.00
30
Data Definition Language - DDL
A Pre-Defined set of Primitive Types Numeric Character-string Bit-string Additional Types Defining Domains Defining Schema Defining Tables Defining Views Note: Each DBMS May have their Own DBMS Specific Data Types - Is this Good or Bad? What is this Similar to re. Different C++ Compilers? These are Akin to PL Data Types! primitive types numeric (integers, floating-point real number REAL, DOUBLE PRECISION, FLOAT(N) DECIMAL(P,D), with P digits of which D are to the right of the decimal point character-string CHAR(N) (or CHARACTER(N)) is a fixed-length character string VARCHAR(N) (or CHAR VARYING(N), or CHARACTER VARYING(N)) is a variable-length character string with at most N characters bit-strings BIT(N) is a fixed-length bit string VARBIT(N) (or BIT VARYING(N)) is a bit string with at most N bits
31
DDL - Primitive Types Numeric INTEGER (or INT), SMALLINT
REAL, DOUBLE PRECISION FLOAT(N) Floating Point with at Least N Digits DECIMAL(P,D) (DEC(P,D) or NUMERIC(P,D)) have P Total Digits with D to Right of Decimal Note that INTs and REALs are Machine Dependent (Based on Hardware/OS Platform) Again – this is Similar to PLs/Compilers and Code Generation – Data Layout primitive types numeric (integers, floating-point real number REAL, DOUBLE PRECISION, FLOAT(N) DECIMAL(P,D), with P digits of which D are to the right of the decimal point character-string CHAR(N) (or CHARACTER(N)) is a fixed-length character string VARCHAR(N) (or CHAR VARYING(N), or CHARACTER VARYING(N)) is a variable-length character string with at most N characters bit-strings BIT(N) is a fixed-length bit string VARBIT(N) (or BIT VARYING(N)) is a bit string with at most N bits
32
DDL - Primitive Types Character-String CHAR(N) or CHARACTER(N) - Fixed
VARCHAR(N), CHAR VARYING(N), or CHARACTER VARYING(N) Variable with at Most N Characters Bit-Strings BIT(N) Fixed VARBIT(N) or BIT VARYING(N) Variable with at Most N Bits primitive types numeric (integers, floating-point real number REAL, DOUBLE PRECISION, FLOAT(N) DECIMAL(P,D), with P digits of which D are to the right of the decimal point character-string CHAR(N) (or CHARACTER(N)) is a fixed-length character string VARCHAR(N) (or CHAR VARYING(N), or CHARACTER VARYING(N)) is a variable-length character string with at most N characters bit-strings BIT(N) is a fixed-length bit string VARBIT(N) (or BIT VARYING(N)) is a bit string with at most N bits
33
DDL - Primitive Types These Specialized Primitive Types are Used to:
Simplify Modeling Process Include “Popular” Types Reduce Composite Attributes/Programming DATE : YYYY-MM-DD TIME: HH-MM-SS TIME(I): HH-MM-SS-F....F - I Fraction Seconds TIME WITH TIME ZONE: HH-MM-SS-HH-MM TIME-STAMP: YYYY-MM-DD-HH-MM-SS-F...F{-HH-MM} PLs also have Specialized Types! Problem: Different Database Systems Sometime Implement these Types very Differently This Impacts Portability! primitive types numeric (integers, floating-point real number REAL, DOUBLE PRECISION, FLOAT(N) DECIMAL(P,D), with P digits of which D are to the right of the decimal point character-string CHAR(N) (or CHARACTER(N)) is a fixed-length character string VARCHAR(N) (or CHAR VARYING(N), or CHARACTER VARYING(N)) is a variable-length character string with at most N characters bit-strings BIT(N) is a fixed-length bit string VARBIT(N) (or BIT VARYING(N)) is a bit string with at most N bits
34
What is a SQL Schema? A Schema in SQL is the Major Meta-Data Construct
Supports the Definition of: Relation - Table with Name Attributes - Columns and their Types Identification - Primary Key Constraints - Referential Integrity (FK) Two Part Definition CREATE Schema - Named Database or Conceptually Related Tables CREATE Table - Individual Tables of the Schema
35
DDL-Create/Drop a Schema
Creating a Schema: CREATE SCHEMA MY_COMPANY AUTHORIZATION Demurjian; Schema MY_COMPANY bas Been Created and is Owner by the User “Demurjian” Tables can now be Created and Added to Schema Dropping a Schema: DROP SCHEMA MY_COMPANY RESTRICT; DROP SCHEMA MY_COMPANY CASCADE; Restrict: Drop Operation Fails If Schema is Not Empty Cascade: Drop Operation Removes Everything in the Schema
36
DDL - Create Tables CREATE TABLE EMPLOYEE
( FNAME VARCHAR(15) NOT NULL , MINIT CHAR , LNAME VARCHAR(15) NOT NULL , SSN CHAR(9) NOT NULL , BDATE DATE ADDRESS VARCHAR(30) , SEX CHAR , SALARY DECIMAL(10,2) , SUPERSSN CHAR(9) , DNO INT NOT NULL , PRIMARY KEY (SSN) , FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN) , FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) ) ;
37
DDL - Create Tables (continued)
CREATE TABLE DEPARTMENT ( DNAME VARCHAR(15) NOT NULL , DNUMBER INT NOT NULL , MGRSSN CHAR(9) NOT NULL , MGRSTARTDATE DATE , PRIMARY KEY (DNUMBER) , UNIQUE (DNAME) , FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN) ) ; CREATE TABLE DEPT_LOCATIONS (DNUMBER INT NOT NULL , DLOCATION VARCHAR(15) NOT NULL , PRIMARY KEY (DNUMBER, DLOCATION) , FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT(DNUMBER) ) ;
38
DDL - Create Tables (continued)
CREATE TABLE PROJECT (PNAME VARCHAR(15) NOT NULL , PNUMBER INT NOT NULL , PLOCATION VARCHAR(15) , DNUM INT NOT NULL , PRIMARY KEY (PNUMBER) , UNIQUE (PNAME) , FOREIGN KEY (DNUM) REFERENCES DEPARTMENT(DNUMBER) ) ; CREATE TABLE WORKS_ON (ESSN CHAR(9) NOT NULL , PNO INT NOT NULL , HOURS DECIMAL(3,1) NOT NULL , PRIMARY KEY (ESSN, PNO) , FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN) , FOREIGN KEY (PNO) REFERENCES PROJECT(PNUMBER) ) ;
39
DDL - Create Tables with Constraints
CREATE TABLE EMPLOYEE ( , DNO INT NOT NULL DEFAULT 1, CONSTRAINT EMPPK PRIMARY KEY (SSN) , CONSTRAINT EMPSUPERFK FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN) ON DELETE SET NULL ON UPDATE CASCADE , CONSTRAINT EMPDEPTFK FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) ON DELETE SET DEFAULT ON UPDATE CASCADE );
40
DDL - Create Tables with Constraints
CREATE TABLE DEPARTMENT ( , MGRSSN CHAR(9) NOT NULL DEFAULT ' ' , . . . , CONSTRAINT DEPTPK PRIMARY KEY (DNUMBER) , CONSTRAINT DEPTSK UNIQUE (DNAME), CONSTRAINT DEPTMGRFK FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN) ON DELETE SET DEFAULT ON UPDATE CASCADE ); Is there an Equivalent to Keys and Constraints in PLs? What Does Java Have Internally? Constraints Facilitate Type Checking at Data Level!
41
Data Manipulation Language - DML
SQL has the SELECT Statement for Retrieving Info. from a Database (Not Relational Algebra Select) SQL vs. Formal Relational Model SQL Allows a Table (Relation) to have Two or More Identical Tuples in All Their Attribute Values Hence, an SQL Table is a Multi-set (Sometimes Called a Bag) of Tuples; it is Not a Set of Tuples SQL Relations Can Be Constrained to Sets by PRIMARY KEY or UNIQUE Attributes Using the DISTINCT Option in a Query Implied Processing and Procedural Semantics SQL Queries have Specific Semantics These Semantics Dictate Processing Includes Code Generation, Optimization, etc. primitive types numeric (integers, floating-point real number REAL, DOUBLE PRECISION, FLOAT(N) DECIMAL(P,D), with P digits of which D are to the right of the decimal point character-string CHAR(N) (or CHARACTER(N)) is a fixed-length character string VARCHAR(N) (or CHAR VARYING(N), or CHARACTER VARYING(N)) is a variable-length character string with at most N characters bit-strings BIT(N) is a fixed-length bit string VARBIT(N) (or BIT VARYING(N)) is a bit string with at most N bits
42
Interactive DML - Main Components
Select-from-where Statement Contains: Select Clause - Chosen Attributes/Columns From Clause - Involved Tables Where Clause - Constrain Tuple Values Tuple Variables - Distinguish Among Same Names in Different Tables String Matching - Detailed Matching Including Exact Starts With Near Ordering of Rows - Sorting Tuple Results
43
Recall Prior Schema
44
…and Corresponding DB Tables
Which Represent Tuples/Instances of Each Relation A S C null W B 1 4 5
45
…and Corresponding DB Tables
46
Simple SQL Queries Query 0: Retrieve the Birthdate and Address of the Employee whose Name is 'John B. Smith'. SELECT BDATE, ADDRESS FROM EMPLOYEE WHERE FNAME='John' AND MINIT='B’ AND LNAME='Smith’ Which Row(s) are Selected? Note: While All of these Next Queries are from Chapter 8, Some are From “Earlier” Edition B S C null W
47
Simple SQL Queries Query 1: Retrieve Name and Address of all Employees who work for the 'Research' Department SELECT FNAME, MINIT, LNAME, ADDRESS, DNAME FROM EMPLOYEE, DEPARTMENT WHERE DNAME='Research' AND DNUMBER=DNO What Action is Being Performed? Join! Cartesian Product!
48
Simple SQL Queries - Result
Theta Join on DNO=DNUMBER
49
Simple SQL Queries Query 2: For Every Project in 'Stafford', list the Project Number, the Controlling Dept. Number, and the Dept. Manager's Last Name, Address, and Birthdate SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS FROM PROJECT, DEPARTMENT, EMPLOYEE WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford' In Q2, there are Two Join Conditions: The Join Condition DNUM=DNUMBER Relates a Project to its Controlling Department The Join Condition MGRSSN=SSN Relates the Controlling Department to the Employee who Manages that Department
50
Query Results SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS FROM PROJECT, DEPARTMENT, EMPLOYEE WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford' A S C null W B
51
Qualification of Attributes
In SQL, the Same Name for Two (or More) Attributes is Allowed if Attributes are in Different Relations In Those Cases, Query Must Qualify by Prefixing the Relation Name to the Attribute Name EMPLOYEE.LNAME, DEPARTMENT.DNAME Aliases: When Queries Must Refer to the Same Relation Twice Alias is Akin to a Variable – Reference in PL! In These Situations, it is Considered that there are Two Different Copies of the Same Relation Let’s See Examples of Both Concepts
52
Attribute Qualification
Query 8: For Each Employee, Retrieve the Employee's Name, and Name of his or her Immediate Supervisor SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME FROM EMPLOYEE E S WHERE E.SUPERSSN=S.SSN E and S are aliases for the EMPLOYEE relation E Represents Employees in the Role of Supervisees S Represents Employees in the Role of Supervisor Another Form of Query 8 is: SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME FROM EMPLOYEE AS E, EMPLOYEE AS S WHERE E.SUPERSSN=S.SSN
53
Query Results SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME FROM EMPLOYEE AS E, EMPLOYEE AS S WHERE E.SUPERSSN=S.SSN A S C null W B
54
Nested Queries SQL SELECT Nested Query is Specified within WHERE-clause of another Query (the Outer Query) Query 1A: Retrieve the Name and Address of all Employees who Work for the 'Research' Department SELECT FNAME, LNAME, ADDRESS FROM EMPLOYEE WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research' ) Note: This Reformulates Earlier Query 1 The End Result is Essentially: Outer and Inner For/While Loops!
55
How Does Nested Query Work?
The Nested Query Selects Number of 'Research' Dept. The Outer Query Selects an EMPLOYEE Tuple If Its DNO Value Is in the Result of Either Nested Query IN represents Set Inclusion of Result Set We Can Have Several Levels of Nested Queries SELECT FNAME, LNAME, ADDRESS FROM EMPLOYEE WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE Dname=’Research' )
56
NULLS in SQL Queries SQL Allows Queries that Check if a value is NULL (Missing or Undefined or not Applicable) SQL uses IS or IS NOT to compare NULLs since it Considers each NULL value Distinct from other NULL Values, so Equality Comparison is not Appropriate Query 18: Retrieve the names of all employees who do not have supervisors. SELECT FNAME, LNAME FROM EMPLOYEE WHERE SUPERSSN IS NULL Why Would Such a Capability be Useful? Downloading/Crossloading a Database Promoting a Attribute to PK/FK
57
Aggregate Functions in SQL Queries
Query 19: Find Maximum Salary, Minimum Salary, and Average Salary among all Employees SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY) FROM EMPLOYEE Query 20: Find maximum and Minimum Salaries among 'Research' Department Employees SELECT MAX(SALARY), MIN(SALARY) FROM EMPLOYEE, DEPARTMENT WHERE DNAME='Research' AND DNUMBER=DNO What Does Query 22 Do? SELECT COUNT(*) FROM EMPLOYEE, DEPARTMENT WHERE DNAME='Research' AND DNUMBER=DNO
58
Grouping in SQL Queries
Query 24: For Each Department, Retrieve the DNO, Number of Employees, and Their Average Salary SELECT DNO, COUNT (*), AVG (SALARY) FROM EMPLOYEE GROUP BY DNO EMPLOYEE tuples are Divided into Groups; each group has the Same Value for Grouping Attribute DNO COUNT and AVG functions are applied to each Group of Tuples Aeparately SELECT-clause Includes only the Grouping Attribute and the Functions to be Applied on each Tuple Group Are there PL Equivalents to these Data Oriented Actions? Yes – in Specific APIs but Not PL Itself!
59
Results of Query 24: SELECT DNO, COUNT (*), AVG (SALARY) FROM EMPLOYEE GROUP BY DNO
60
INSERT SQL Queries Add one or more Tuples to a Relation, with Attribute values Listed in the order specified in the CREATE Update 1: INSERT INTO EMPLOYEE VALUES ('Richard','K','Marini', ' ', '30-DEC-52', '98 Oak Forest,Katy,TX', 'M', ,' ', 4 ) Another Form of Update 1: INSERT INTO EMPLOYEE (FNAME, LNAME, SSN) VALUES ('Richard','K','Marini') All PK and FK Values must be Provided Nulls are Allowed DDL Constraints are Enforced Another form of “Type Checking” at Instance Level This is Akin to Dynamic Type Checking!
61
DELETE SQL Queries Sample Deletes Include
DELETE FROM EMPLOYEE WHERE LNAME='Brown' DELETE FROM EMPLOYEE WHERE SSN=' ’ DELETE FROM EMPLOYEE WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research') DELETE FROM EMPLOYEE No. of Tuples Deleted Dependent on WHERE Clause Referential Integrity (Type Checking!) is Enforced During DELETE
62
UPDATE SQL Queries Give all Employees in the 'Research' Dept. a 10% raise UPDATE EMPLOYEE SET SALARY = SALARY *1.1 WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT WHERE DNAME='Research') Modified SALARY Value Depends on the Original SALARY Value in each Tuple SALARY = SALARY *1.1 - Use PL Interpretation
63
Query Processing and Optimization
What are the Processing Issues for DBs? Database Applications of Today and Tomorrow Require High Volumes of Information! Increase of Information Still Requires High Performance! Throughput and Response Time Where's the Bottleneck in DBS? CPU ?? Main Memory Size/Speed ?? Virtual Memory Limitations ?? Communications Bus ?? I/O Channel ?? How Does this Relate to Compilers/PLs?
64
90-10 Rule for Database Processing
Load (Transaction per second) vs. Performance (Response Time of Transactions) Processing of Large Amounts of Raw Data Addressed in Secondary Storage Staged to Main Memory Identifying Relevant Data Large Amounts of Raw Data Discarded Focus on Data Most Likely to Contain Answers Possible Loss of CPU and Main Memory Cycles This is Double Jeopardy! Load of DBS Must be Reduced Performance of DBS Degrades
65
90-10 Rule for Conventional DBS
Only 10% of Relevant Data has Answers Note: Naive Approach to Database Searching Often Occurs (Little or No Indexing in Practice!) Only 10% of Raw Data is Relevant Application Programs Operating System Database Functions On-Line I/O Disk I/O
66
Query Optimization Goal
Limit Costly Join Operation by Reducing Data to be Scanned or that Participates in the Join While Improving Selection and Projection can Help, the Main Objective is Join In Worst Case - Cartesian Product Can Improve by Introducing Indices on the Join Attributes (R.B and S.C) to Limit “Product” Can Further Improve by Sorting on the Join Attributes (R.B and S.C) This Reduces Block Accesses by Limiting the Number of Blocks that Must be Examined in a Join If B’s Values Range from 0 to 100 and C from 50 to 150, only need to Compare from 50 to 100 Focus is on Reducing Costly Ops – Same as PL Optimization to Replace * with +
67
Query Processing Internal Data Structure Memory Hierarchy
Main Memory + Secondary Memory Information Must be Staged from Secondary to Primary Memory for Database Operation Sequential Search Brute force Approach Direct Access (Indexed Search) Hash, Inverted Index file, Binary Search Tree, B-tree, B+-tree Improves Selection by Focusing on Subset of Tuples that are Involved in the Answer and Equijoin by Not Having to Compare All Blocks in Two Relations
68
Algorithms for Database Query Operators
Largely Fall into Three Classes: Sorting-Based Methods, Hash-Based Methods, Index-Based Methods Such Algorithms are Divided into Three Degrees of Difficulty and Cost (Limiting Factor is Size of Data) One Pass Algorithms Where Data is Only Read Once From Disk Two-pass Algorithms Data is Read from Disk, Processed in Some Way, Written Back to Disk, Read Again for Processing, etc. Multi-pass Algorithms Where 3 or More Passes Are Required, i.e., Recursive Generalization of the Two-pass Algorithms Akin to Multiple Pass Compilers at Data Level
69
Database Join and Sort are External
Suppose that your DBS has 1,000 1K Blocks of Memory Available for Performing Operations (e.g., Select, Project, Join, Union, Aggregation, etc.) Suppose Sort R by R.B R Contains 5000 Blocks In order to Perform a Sort/Merge - You Must Use External Algorithm since all 5000 Blocks Can Fit Into Memory at the Same Time Suppose Join R (500 Blocks) and S (800 Blocks) Again - their Total Exceeds Memory - Hence you Must Take an Approach that Compares One Block of R with All Blocks of S, etc. (Slides 22,23) 2 1 3 1000
70
Database Join and Sort are External
What’s True about Today’s DBMS Like Oracle? Oracle Recommends 2 Gigabytes of Primary Memory That 2 Gigabytes Must be Shared by: Operating System Other Applications Running on “Same” Server (Web Server, etc.) Database Management Software Even if there was 1.5 Gigabytes Available, Modern DBs can Exceed that size Very Easily Moreover, Cartesian Product Could Exceed Available Mem. Join Could Require External Approach Since All Tables Involved in Join Can’t fit in 1.5 Gigabytes External Sorting/Block Oriented Processing is Norm
71
The System Catalog Store the Meta Information that Describes Each Database, Including a Description of Conceptual Database Schema (Logical Data Model) Relations, Attributes, Keys, Indexes, Views Internal Schema External Schema Store Information Needed by Specific DBMS Modules Query Optimization Module Security and Authorization
72
Example of Catalog Information
73
Relational DBMS Catalog
All Metadata Stored as Relations Example of Metadata Tables are:
74
Uses of System Catalog DDL Compilers:
Correct Definition of Relations and Attributes DML (Query) Compiler: DML Parser Guided by the Description of DML Syntax and the Schema Information in the Catalog, Generates a Query Tree after Parser Optimizer Generates Access Paths that is Relatively Optimal for Executing a Query/ DML Command, by Accessing the Database Structure Information (Schemas), and Mapping High-level SQL Queries Into Low-level File Access Commands SELECT EMP.ENAME FROM EMP, WORKS, PROJ WHERE (EMP.ENO= WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = “CAD/CAM”)
75
Revisit Typical Database Processing
Parsed and Optimized User Trans. Concurrency Control Pre-Processing - Parser/Lexical - Optimizer/Views Lock Request Response Lock Request User Transaction Low-Level Processing - Enqueue Trans. - Request Locks - Issue I/Os - Process Returned Data - Integrity Checks - Security Checks - Logging for Recovery - Release Locks - Dequeue Trans. High-Level Processing - Enqueue Trans. - Request Locks - Release Locks -Dequeue Trans. Errors Response to User I/O Request Post-Processing - Collection of Results - Aggregation Operations - Security Checks Errors Results Results Disk I/O Recovery
76
Typical Database Processing
Pre-Processing Actions Taken Upon Receipt of a Query from User SQL Query via Query Tool or JDBC Call “Compilation” of DB Query Check Syntax, Semantics, Optimize, Develop Run-Time Strategy (Similar to PL Compilation) Query is Translated to DB Transaction A Transaction Contains Multiple DB Operations Transaction has Explicit Order of Operations Database Transaction Must Succeed or Fail There is no Intermediate State Completely Executed and Committed or Aborts at any Point and Undone New State or Previous State of DB
77
Typical Database Processing
High-Level Processing Enqueue Transaction from Pre-Processing Transaction Must Wait for “Earlier” Transactions Remember - Shared DB State! Request Locks from Concurrency Control All Locks Before Proceeding vs. Locks as Needed Avoid Deadlock and Livelock Release Locks As Use of Data Completes to Increase Availability What Happens if Failure of Later Step in Transaction Dequeue Transaction Completes Transaction Processing Return “Result” to Post-Processing
78
Typical Database Processing
Low-Level Processing Enqueue Transaction - Do Actual DB Operations Request Locks - Lower Granularity Level Issue I/Os - Based on Operations to Access “Correct” and “Relevant” DB Records Process Returned Data - Aggregation, Sorting Integrity Checks: Do I/D/U Satisfy Constraints? Security Checks: Is DB R/I/D/U Allowed? Logging for Recovery - Commit the Transaction Release Locks - Available to Others Dequeue Transaction - Return Results to High-Level Processing Note: The Multiple Operations of Each DB Transaction All Must be Successful
79
Typical Database Processing
Post Processing Collection of Results May be Passed Portions of Results as they Complete For Example, Sorted Blocks of Data that are then Merged in a Final Step Aggregation Operations May be Passed Aggregate Intermediate Results Sum for Different Departments to be Totaled Security Checks Last Step Filtering to Insure Only Allowed Data is Returned May Execute Query but Only see Aggregate Result Send Results to User
80
Typical Database Processing
Concurrency Control Control Access to Information Data and Metadata Prevent Simultaneous Updates Ensure Database Always Correct and Consistent Serial Schedule vs. Serializable Transaction Two Types Pessimistic - Locking-Based - Assume Collisions Will Occur - e.g., Peoplesoft Course Registration Optimistic - Time-Based - Fix Problems After the Fact - e.g., ATM Machines Example CC Manages Locks at Different Granularity Levels (Table, Attribute, View, Tuple, Metadata, etc.)
81
Typical Database Processing
Disk I/O Performs the Actual Disk I/O for Read/Writes Block Oriented Activity Maintain Queue of All I/O Requests Ordering is Critical Related to Concurrency Control and Consistency Single DB Transactions can have Multiple DB Operations Disk I/Os for Different Operations at Different Times High and Low Level Processing will Determine What Operations Needed When Disk I/O - Relatively “Dumb”
82
Typical Database Processing
Recovery Tightly Tied to DB Transaction Concept Transactions Must be: Atomic - Happens or Doesn’t Durable - Once Committed, Results Survive Failure Consistent - Follows Protocol/Correct DB State When Failure Occurs, Can we: Recover to a Correct “Earlier” State Reconcile all “Active” Transactions that were Executing at Failure Time Involves Logging of Database Actions Objective: High Availability and Reliability
83
Query Optimization Not Really Optimizing, but Planning to Avoid Bad Execution Strategies Models Heuristics-Based Apply Transformation Rules According to a General Strategy Focus on Relational Algebra that Underlies Each Query Improve the “Order” of Relational Operations Cost-Based Minimize a Cost Function I/O Cost + CPU Cost Subject to a Set of Constraints
84
Query Processing Methodology
High-level Calculus-based Query EXTERNAL SCHEMA Query Preprocessing LOGICAL SCHEMA Algebraic Query (a tree structure) Query Optimization INTERNAL SCHEMA Execution Schedule (file access plan)
85
Refute Incorrect Queries
Example: E(ENAME, ENO), P(JNO,JNAME), W(ENO,PNO,DUR) SELECT ENAME, PNAME FROM E, P, W WHERE DUR > 27 AND DUR < 25 Incorrect Disjoint Components are Useless Multiple Relations, Missing Joins, may not be incorrect, but may indicate Cartesian product Contradictory Qualification can not be Satisfied by any Tuple DUR > 27 AND DUR < 25
86
Simplification Why Simplify?
The Simpler the Query, the Less Work there is and the Better the Performance How? Use transformation rules Elimination of Redundancy Idempotency Rules Application of Transitivity Use of Integrity Rules Example x > a and x > b DUR > 27 AND DUR > 25
87
Restructuring Convert Relational Calculus to Relational Algebra
Make use of Query Trees Example Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years. SELECT ENAME FROM E, W, P WHERE E.ENO=W.ENO AND W.JNO=P.JNO AND E.ENAME°"J. Doe" AND P.JNAME="CAD/CAM" AND (W.DUR=12 OR W.DUR=24) ENAME (DUR=12 OR DUR=24) AND JNAME=“CAD/CAM” AND ENAME°“J. DOE” JNO ENO P W E Project Select Join
88
Query Optimization Objectives
Improving Performance Arriving at a Query Plan of Execution Analyzing the Relational Algebra Query Replace Costly Operations Do Selections and Projections Early Optimization Heuristics for the Relational Algebra Performing Selection and Projection Before Join Combining Several Selections Over a Single Relation Into One Selection Find Common Subexpressions Algebraic Rewriting/transformation Rules General Transformation Rules for Relational Algebra (Equivalence-preserving Algebraic Rewriting Rules)
89
Query Optimization: An Example
Why is it important? SELECT ENAME FROM E,W WHERE E.ENO = W.ENO AND W.RESP = "Manager" Strategy 1 ENAME(RESP="Manager"E.ENO=G.ENO(E W)) Strategy 2 ENAME( E ENO(RESP="Manager"(W)))
90
Cost of Alternatives Assume : card(E) = 4,000; card(W)=10,000
10% of tuples in W satisfy RESP="Manager" (selection generates 1,000 tuples) Execution time Proportional to the Sum of the Cardinalities of the Temporary Relations Searching is Done by Sequential Scanning Strategy 1 Strategy 2 Cartesian prod. = 40,000,000 Selection over W = ,000 Search over all = 40,000,000 Join(4000*1000) = 4,000,000 80,000, ,010,000
91
General Query Optimization Strategy
Perform Selections Early Yields Smaller Intermediate Results Direct Impact on Subsequent Join/Cartesian Prod. Combine Selections with a Prior Cartesian Product into a Theta or Equi Join Join is a Cheaper Operation Combine (Cascade) Selections and Projections AB(B (R)) AB(R) p1 ( p2 (R)) p1 ^ p2 (R) This Results in One Pass Instead of Two over Table
92
General Query Optimization Strategy
Identify Common Subexpressions Compute Once and Store use Stored Version for Subsequent Times Often Useful When Views are Employed Preprocess Data via Sorts and Indexes Speeds up Searches and Joins by Limiting Scope Evaluate and Assess Different Options For Cartesian Product, Use Smaller Relation for Comparison Use System Catalog (Meta-data) to Effect Order in Query Execution Plan
93
Relational Algebra Transformations
Cascade of Selection p1 ^ p2 ^ …^ pn(R)p1(p2(...(pn(R))...)) Commutativity of Selection p1(p2(R))p2(p1(R)) p1 orp2(R )p1(R p2(R) Cascade of Projection A1,A2, … An(R)A1(A2(...(An(R))...)) A1(R) if A1 A2 ... An Commuting Selection with Projection A1,A2,...,An(p(R))p(A1,A2,...,An(R)
94
Relational Algebra Transformations
Commutativity of Theta Join and Cartesian Product R A SS A R R SS R Commuting Selection with Theta Join (Cartesian) p(A)(R S) p(A)(R)) S A defined on R only p(A)^p(B)(R S) p(A)(R)) (p(B)(S)) (A defined on R, B defined on S) Also Holds for Theta Join as Well Commuting Projection with Theta Join (Cartesian) C(R S) A(R) B(S) where AB=C A are Attributes in C for R and B are Attributes in C for S
95
Relational Algebra Transformations
Commutativity of Set Operations R S S R R S S R Associativity of Set Operations (R S) T R S T) (R S) T R (S T) (R S) S R (S T) (R S) S R (S T) Commuting Select with Set Operations p(Ai)(R T) p(Ai)(R) p(Ai)(T) where Ai is defined on both R and T p(Ai)(R T) p(Ai)(R) p(Ai)(T) where Ai is defined on both R and T
96
Relational Algebra Transformations
11. Commuting Projection with Union C(R q(Aj,Bk) S) A’(R) q(Aj,Bk) B’(S) C(R S) A’ (R) B’ (S) where R[A] and S[B] C = A' B' where A' A, B’ B 12. Converting Selection/Cartesian Into Theta Join C (R S) R S C
97
Heuristic Optimization: Example
Canonical query tree at the end of query preprocessing phase ENAME (DUR=12 OR DUR=24) AND JNAME=“CAD/CAM” AND ENAME= “J. DOE” E(ENAME, ENO) P(JNO,JNAME) W(ENO,PNO,DUR) JNO ENO P W E
98
Heuristic Optimization– Example
ENAME DUR=12 OR DUR=24 JNAME=“CAD/CAM” Use cascading of selections rule to decompose selections ENAME = “J. DOE” JNO ENO P W E
99
Heuristic Optimization– Example
ENAME DUR=12 OR DUR=24 JNAME=“CAD/CAM” Push selection down using commutativity of selection over join JNO ENO ENAME = "J. Doe" P W E
100
Heuristic Optimization–Example
ENAME DUR=12 OR DUR=24 Push selection down using commutativity of selection over join JNO ENO JNAME = "CAD/CAM" ENAME = "J. Doe" P W E
101
Heuristic Optimization–Example
ENAME JNO Push selection down ENO JNAME = "CAD/CAM" DUR =12 DUR=24 ENAME = "J. Doe" P W E
102
Heuristic Optimization–Example
ENAME JNO Do early projection JNO,ENAME ENO JNO JNO,ENO JNO,ENAME JNAME = "CAD/CAM" DUR =12 DUR=24 ENAME = "J. Doe" P W E
103
Heuristic Optimization–Example
ENAME Identify subtrees that can be implemented in one algorithm JNO JNO,ENAME ENO JNO JNO,ENO JNO,ENAME JNAME = "CAD/CAM" DUR =12 DUR=24 ENAME = "J. Doe" P W E
104
Heuristic Optimization: A Second Example
Title What is the Final Step? Combine Select and Cartesian Product Result: Equijoins! Borrower.Card_No = Loans.Card_No Loans.LC_No X Books.LC_No, Title Books Books.LC_No = Loans.LC_No Loans.LC_No, Loans.Card_No X Borr.Card_No Date 1/1/88 Borrower Loans
105
Cost-Based Optimization
Reduce Defined Cost of Executing Queries What is Involved in the Cost of Executing a Query? Access Cost to Secondary Storage Search for Data Block (Index) Read/Write Index and Data Blocks Storage Cost Index and Data Blocks Intermediate Files Computation Cost Query Planning - Optimization Effort Record Search, Sort, Merge Actual Transaction/Query Operations Communications Cost Transfer of Results to the User
106
Complexity of Relational Operations
Assuming Relations of Cardinality n Sequential Scan of Data in each Relation Complexity of Each Operation is Indicated Avoid Cartesian Product at All Costs! Operation Complexity Select Project O(n) (w/o duplicate elimination) Project (with duplicate elimination) O(nlog n) Group Join O(nlog n) Division Set Operators Cartesian Product O(n2)
107
Cost-Based Optimization
To Understand Cost-Based Operations, we Must Focus on Implementation Strategy of: Select Project Join For Select and Project - There is a Fixed Cost that we Must Live With For Join Implementation Strategy Different Join Strategies Objective: Minimize the Number of Blocks Involved Note that Cost-Based and Relational Algebra Heuristic Optimization Can Complement One Another
108
Optimization Summary Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any Query Optimizer is Limited Some Systems Reduce the Number of Strategies by Making a Heuristic Guess of Strategy for Each Query The Optimizer Considers Every Possible Strategy, but Terminates as Soon as it Determines the Cost is Greater than the Pre-chosen Strategy Thus Only a Few Competing Strategies Require Full Analysis of the Cost The Overhead of Query Optimization is Reduced Remember - Trade off in Optimization Time For PL - Optimization is Pre-Execution (Compile) For DB - Optimization is Part of Execution (Run)
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.