Tim Gorman Evergreen Database Technologies, Inc. http://www.EvDBT.com North Carolina Oracle Users Group Scaling to Infinity: Partitioning Data Warehouses in Oracle
Speaker Qualifications Tim Gorman (chief, cook, and bottle-washer – EvDBT.com) Director of RMOUG "Training Days 2007" conference Info online at http://www.RMOUG.org Co-author (with Gary Dodge) "Oracle8 Data Warehousing" 1998 John Wiley & Sons "Essential Oracle8i Data Warehousing" 2000 John Wiley & Sons Co-author (with Oak Table Network - http://www.OakTable.net) "Oracle Insights: Tales of the Oak Table" 2004 Apress Oracle data warehousing DBA since 1994 Technical manager at Oracle Consulting Independent consultant since 1998
Agenda I've seen the good, the bad, and the ugly… In the end, successful data warehouses are a combination of all three. But in general, I see three major errors that result in doom: Ignore basic requirements for DW and design what is familiar Fail to portray data changes over time Fail to utilize partitioning from the beginning
Ignoring the requirements Repeat after me -- reporting and analysis applications do not enforce business rules! Reporting and analysis applications are responsible for presenting data in the format that works best for end-users and their query/analysis tools Very often, what end-users seem to want is a simple spreadsheet, hundreds of columns wide GIVE THEM WHAT THEY WANT! Conceal from them what it takes to provide what they want Do NOT build a data model to enforce referential integrity and/or business rules
Third-normal form: 1. Eliminate repeating groups (every attribute is atomic and scalar) 2. Eliminate functional dependencies on composite key components (every attribute is functionally dependent on the whole key) 3. Eliminate functional dependencies on non-key components (every fact/attribute in the entity should rely on nothing but the key) 4. 4th, 5th, and 6th normal forms have been defined, but most entities that are in 3NF are also in 4th, 5th, and 6th NF. Intended for use in process-oriented operational systems: enforce data integrity according to business rules using referential-integrity constraint mechanisms in application code as well as databases. Ignoring the requirements
Data presented in a simplistic dimensional model versus the 3rd-normal-form (3NF) entity-relationship model used by most operational systems, as Ralph Kimball discusses in The Data Warehouse Toolkit (John Wiley & Sons, ISBN 0471153370). Dimensional models provide immediate, on-demand, and high-performance access to corporate or organizational subject data. They are composed of fact tables containing varying levels of summarized data and dimension tables representing important subject areas. It is a very simple representation of data: a spreadsheet with one degree of normalization for flexibility, also known as a star schema because diagrams generally represent the fact table as a hub and dimensions as spokes. Ignoring the requirements
[Diagram: a transactional/operational entity-relationship model (Customers, Orders, Order Lines, Products, Suppliers) contrasted with a dimensional model (Order Facts hub surrounded by Products, Suppliers, Customers, and Time dimensions)] Ignoring the requirements
Fact tables More volatile Contain columns for: Dimension keys Measures In a spreadsheet or tabular report Dimension keys don’t appear at all Measures appear in the “cells” of the report Dimension tables Usually more static Although the dimension for people is usually quite volatile Contain columns for: Dimension keys Attributes In a spreadsheet or tabular report Dimension keys don’t appear at all Attributes appear as “column headers” or “row headers”
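For illustration only (these table and column names are hypothetical, not part of the examples later in this presentation), a dimension table carries a key plus descriptive attributes, while a fact table carries dimension keys plus measures:

create table customers_dim
( cust_key    number        not null,   -- dimension key
  cust_name   varchar2(100),            -- attribute ("row header" in reports)
  city        varchar2(50),             -- attribute
  state       varchar2(2)               -- attribute
);

create table order_facts
( cust_key    number        not null,   -- dimension key
  prod_key    number        not null,   -- dimension key
  order_dt    date          not null,   -- dimension key (time)
  qty_sold    number,                   -- measure (report "cell")
  sale_amt    number(12,2)              -- measure (report "cell")
);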
Time-variant data, who cares? Two major types of queries from business intelligence applications to data warehouse databases Point in time What is the present situation? What do the numbers look like now? “Situational awareness” applications, also known as “dashboards” or “executive information systems” Usually uses the present point in time, but could also use any specific point in time in the past Trend analysis How do things look now versus 3 months ago? A year ago? How have things changed day-by-day over the past quarter? Week-by-week over the past year? Month-by-month over the past 7 years?
Time-variant data, who cares? Consider this… Dimension tables are usually designed to be point-in-time or type-1 People, items, products, etc. Locations, time, etc. Fact tables are almost always designed to be time-variant or type-2 Transactions What happens when you join transactions from years ago with dimensional attributes from the present? For example, when analyzing purchases by location, does it make sense to summarize all transactions by a person’s present location? Or should it reflect the person’s location at the time of the transaction?
Time-variant data, who cares? Every data warehouse has at least one slowly-changing dimension (SCD) Usually involving "people" (e.g. accounts, customers, employees, parties, etc.) Static dimensions do not need to be time-variant Identifying "static" dimensions: if a change is made to the dimension, should it be reflected across all time? SCDs should be represented as "type-2" "type-1" views of SCDs can be created as needed "type-1" views of fact tables can also be created, if necessary, to support point-in-time tactical reporting
Time-variant data, who cares? [Diagram: PERSON_DIM, the Type-2 (time-variant) dimension, with primary key (Person_key, Eff_dt) and attributes Last_name, First_name, Address_1, Address_2, City, …; CURR_PERSON_DIM, the Type-1 (point-in-time) view, with primary key Person_key and the same Eff_dt and attribute columns]
Time-variant data, who cares? Slowly-changing dimensions should always be "type-2" With "type-1" views constructed using the just-loaded "type-2" data So, with this in mind… Why do people so often treat time-variant tables as an after-thought? Why do "extraction-transformation-loading" (ETL) processes so often focus on "MERGE" logic ("if row doesn't exist then INSERT else UPDATE") on the current point-in-time tables, and then insert change data as an after-thought (a.k.a. "type-1" or "point-in-time" data)? Instead of… inserting change data into the time-variant "type-2" table, from which point-in-time "type-1" views (as materialized views?) can be built for any point in time? Think about it… If users should be using "type-2" data for SCDs, who usually utilizes the "type-1" views of the SCDs? What are they good for?
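As a minimal sketch (using the PERSON_DIM and CURR_PERSON_DIM names from the earlier diagram; the column list is abbreviated), a point-in-time "type-1" view can be derived from the time-variant "type-2" table, and the same query can just as easily populate a materialized view or a swap table:

create or replace view curr_person_dim as
select person_key, eff_dt, last_name, first_name,
       address_1, address_2, city
from  (select d.*,
              row_number() over (partition by person_key
                                 order by eff_dt desc) rn   -- latest row per person
       from   person_dim d)
where  rn = 1;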
Four characteristics of a DW Non-volatile, time-variant, subject-oriented, integrated Bill Inmon "Building the Data Warehouse" 3rd Ed 2002 (Wiley) Think about what these mean. Consider the converse of these characteristics: Volatile? Static-image? Process-oriented? Application-specific? A time-variant, non-volatile database implies: Insert, index, and analyze each row of data only once From an implementation perspective, this is vital to remember! And often ignored completely!!! Consider an extreme situation: an analytical database for quantum research in physics, with 50 Tbytes of data to load every day
The Virtuous Cycle Insert-only processing enables… Tables and indexes partitioned by time Optionally sub-partitioned by other key values Partitioned tables/indexes enable… Partition pruning during queries Direct-path loads using EXCHANGE PARTITION Time-variant tables/indexes and tablespaces Purging using DROP or TRUNCATE PARTITION instead of DELETE Partition pruning enables… Infinite scalability for queries, regardless of how large the database becomes Direct-path (a.k.a. append) loads enable… The ability to load more data, faster, more efficiently Table compression
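A sketch of what this looks like, using the TXN fact table assumed in the EXCHANGE PARTITION example later (the partition names and abbreviated column list are illustrative):

create table txn
( txn_date   date          not null,
  acct_key   number        not null,
  txn_amt    number(12,2)
)
partition by range (txn_date)
subpartition by hash (acct_key) subpartitions 8
( partition p20040224 values less than (to_date('25-FEB-2004','DD-MON-YYYY')),
  partition p20040225 values less than (to_date('26-FEB-2004','DD-MON-YYYY'))
);

-- partition pruning: only partition P20040225 is touched by this query
select sum(txn_amt)
from   txn
where  txn_date >= to_date('25-FEB-2004','DD-MON-YYYY')
and    txn_date <  to_date('26-FEB-2004','DD-MON-YYYY');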
Time-variant partitioned tables/indexes enable… Time-variant tablespaces Time-variant tablespaces enable… READ ONLY tablespaces for older, less-volatile data READ ONLY tablespaces enable… Near-line storage (e.g. NAS, SAMFS/HFS, etc.) "Right-sizing" of storage to the need, classified by IOPS Backup efficiencies READ WRITE tablespaces scheduled for backup every day or week READ ONLY tablespaces scheduled for backup every quarter or year The Virtuous Cycle
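For example (the tablespace name here is hypothetical), once a period's partitions have stopped changing:

alter tablespace txn_2003q4 read only;
-- back it up once now; it does not need to rejoin the daily or weekly
-- backup rotation unless it is ever set back to READ WRITE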
Using EXCHANGE PARTITION for loads enables… Elimination of the ETL "load window" and 24x7 availability for queries Direct-path loads Bitmap indexes and bitmap-join indexes Bitmap indexes enable… Star transformations on "star" (dimensional) schemas Star transformations enable… Bitmap-join indexes SUCCESS! The optimal query-execution plan for dimensional data models!
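A sketch (the index name is illustrative; the TXN table and ACCT_KEY column come from the later example): bitmap indexes go on the fact table's dimension-key columns, and star transformation is enabled at the session level (or instance-wide via the init parameter):

create bitmap index txn_acct_key_bix
  on txn (acct_key)
  local nologging compute statistics;

alter session set star_transformation_enabled = true;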
The Death Spiral Volatile data presented in a static-image according to process-oriented concepts leads to… ETL using "conventional-path" INSERT, UPDATE, and DELETE operations (including MERGE and multi-table INSERT) Conventional-path operations cause trouble with: Bitmap indexes and bitmap-join indexes (forcing frequent complete rebuilds until they get too big) Contention in the Shared Pool, Buffer Cache, and global structures Mixing of queries and loads simultaneously on tables and indexes Periodic rebuilds/reorgs of tables if deletions occur Full redo logging and undo transaction tracking ETL will dominate the workload in the database Queries will consist mainly of "dumps" or extracts to downstream systems Query performance will be abysmal and worsening…
The Death Spiral Without partitioning Query performance worsens as tables/indexes grow larger Loads must be performed into “live” tables Users must be locked out during “load cycle” In-flight queries must be killed during “load cycle” Bitmap indexes must be dropped/rebuilt during “load cycle” Entire tables must be re-analyzed during “load cycle” Entire database must be backed up frequently Data cannot be “right-sized” to storage options according to IOPS Everything just gets harder and harder to do… …and that stupid Oracle software is to blame… BRING ON TERADATA OR
Exchange Partition The basic technique of bulk-loading new data into a temporary "load table", which is then indexed, analyzed, and then "published" all at once to end-users using the EXCHANGE PARTITION operation, should be the default load technique for all large tables in a data warehouse: fact tables and slowly-changing or quickly-changing dimensions. Assumptions for this example: Composite-partitioned fact table named TXN Range partitioned on DATE column TXN_DATE Hash sub-partitioned on NUMBER column ACCT_KEY Data to be loaded into partition P20040225 of TXN
Exchange Partition 1. Create temporary table TXN_TEMP as a hash-partitioned table 2. Perform a parallel, direct-path load of the new data into TXN_TEMP 3. Create indexes on the temporary hash-partitioned table TXN_TEMP corresponding to the local indexes on TXN, using the PARALLEL, NOLOGGING, and COMPUTE STATISTICS options 4. Gather CBO statistics on table TXN_TEMP (table and column stats only -- leave the index stats computed in step 3) 5. alter table TXN exchange partition P20040225 with table TXN_TEMP including indexes without validation update global indexes;
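Sketched as raw SQL (the staging-table and index names are illustrative; TXN, TXN_TEMP, ACCT_KEY, and partition P20040225 come from the assumptions above):

-- 1. temporary hash-partitioned table matching TXN's subpartitioning
create table txn_temp
partition by hash (acct_key) partitions 8
as select * from txn where 1=0;

-- 2. parallel, direct-path load of the new data
alter session enable parallel dml;
insert /*+ append parallel(t,4) */ into txn_temp t
select /*+ full(s) parallel(s,4) */ * from stage_txn s;
commit;

-- 3. indexes corresponding to the local indexes on TXN
create bitmap index txn_temp_acct_bix on txn_temp (acct_key)
  local parallel nologging compute statistics;

-- 4. table and column statistics only (index stats were computed in step 3)
begin
  dbms_stats.gather_table_stats(user, 'TXN_TEMP', cascade => false);
end;
/

-- 5. publish: swap the loaded table in as partition P20040225
alter table txn exchange partition P20040225 with table txn_temp
  including indexes without validation update global indexes;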
Exchange Partition It is a good idea to encapsulate this logic inside PL/SQL packaged or stored procedures:

SQL> execute exchpart.prepare('TXN_FACT', 'TMP_', -
  2          '25-FEB-2004');
SQL> alter session enable parallel dml;
SQL> insert /*+ append parallel(n,4) */
  2  into tmp_txn_fact n
  3  select /*+ full(x) parallel(x,4) */ *
  4  from stage_txn_fact x
  5  where load_date >= '25-FEB-2004'
  6  and load_date < '28-FEB-2004';
SQL> commit;
SQL> execute exchpart.finish('TXN_FACT', 'TMP_');

DDL for "exchpart.sql" posted at http://www.EvDBT.com/tools.htm
Exchange Partition Loading time-variant fact and dimension tables is not the only load activity in most data warehouses. Often, some tables contain current or point-in-time data Example: type-1 dimension "snowflaked" from type-2 dimension This is often an excellent situation for materialized views But, as is often the case, the refresh mechanisms built in with materialized views might not be the most efficient With each load cycle, the current images need to be updated Instead of performing transactional MERGE (i.e. update-or-insert) logic directly on the table, rebuild the table into a temporary table, then "swap" it in using EXCHANGE PARTITION
Exchange Partition [Diagram: the previous cycle's current-image data resides in composite-partitioned table CURR_ACCOUNT_DIM, with a single partition named PZERO; the new current-image data is built in hash-partitioned table CURR_ACCT_DIM_TEMP; EXCHANGE PARTITION swaps the two]
INSERT /*+ append parallel(t, 8) */
INTO TMP_CURR_ACCOUNT_DIM t
SELECT /*+ full(x) parallel(x, 8) */
       0 partkey, acctkey, effdt, …(and so on for all columns)…
FROM  (SELECT acctkey, effdt, …(and so on for all columns)…,
              row_number() over (partition by acctkey
                                 order by effdt desc) ranking
       FROM  (-- previous cycle's current image...
              SELECT acctkey, effdt, …(and so on for all columns)…
              FROM   CURR_ACCOUNT_DIM
              UNION ALL
              -- ...plus the newly loaded partition of the type-2 dimension
              SELECT acctkey, effdt, …(and so on for all columns)…
              FROM   ACCOUNT_DIM partition (P20040225)
             )
      ) x
WHERE  ranking = 1;
Exchange Partition
ALTER TABLE CURR_ACCOUNT_DIM
  exchange partition PZERO with table TMP_CURR_ACCOUNT_DIM
  including indexes
  [ with | without ] validation
  update global indexes;
Choosing partition keys The most important decision when partitioning is… Choosing the partition key columns All the benefits of partitioning hinge upon this choice!!! Which columns to partition upon? If the table contains time-variant data Choose the RANGE partition key DATE column to optimize: ETL according to load cycles End-user access through partition pruning Choose the HASH or LIST sub-partition key column to optimize: End-user access through partition pruning If the table does NOT contain time-variant data Choose the RANGE, HASH, or LIST partition key column to optimize: End-user access through partition pruning
When choosing columns to optimize ETL Choose a column which distinguishes different load cycles Should be a DATE column When choosing columns to optimize end-user access Gather hard facts about usage – don't guess! Oracle STATSPACK and Oracle10g AWR Data dictionary table SYS.COL_USAGE$ Populated automatically by the cost-based optimizer in Oracle9i and above DDL script "dba_column_usage.sql" can be downloaded from http://www.EvDBT.com/tools.htm Ambeo Usage Tracker (http://www.ambeo.com) Teleran iSight (http://www.teleran.com) Choosing partition keys
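The idea behind dba_column_usage.sql, in rough sketch form (the downloadable script is more complete; this must be run as a suitably privileged user): SYS.COL_USAGE$ counts how often each column has appeared in equality, join, range, and LIKE predicates, and joining it back to the object and column names shows which columns end-users actually filter on:

select o.name  table_name,
       c.name  column_name,
       u.equality_preds, u.equijoin_preds, u.range_preds, u.like_preds
from   sys.col_usage$ u, sys.obj$ o, sys.col$ c
where  u.obj#    = o.obj#
and    u.obj#    = c.obj#
and    u.intcol# = c.intcol#
order  by o.name, c.name;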
Example: fact table for credit-card processing Fact table is time-variant Use range partitioning on a DATE datatype to optimize ETL and queries Use hash- or list-subpartitioning to optimize queries Fact table has four DATE columns TXN_DT (date on which transaction occurred) POST_DT (date on which transaction was posted by merchant) PAID_DT (date on which transaction was paid to merchant) LOAD_DT (date on which transaction was loaded to DW)
Choosing partition keys Which should be chosen? And why? LOAD_DT Optimizes ETL perfectly, but does not benefit queries in any way… Data is loaded by LOAD_DT End-users don't query on LOAD_DT TXN_DT, POST_DT, and PAID_DT Each benefits a different set of end-user queries Presents some problems for ETL processing Each date loads mostly into the latest partition, then a little into each partition for the previous 2-4 days This situation can be handled by iterating through the five steps of the basic EXCHANGE PARTITION algorithm, where each iteration processes a different LOAD_DT value
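One way to organize that iteration, sketched under stated assumptions: the exchpart procedures from the earlier slide (assumed to create and clean up the TMP_ work table), a TXN_FACT table range-partitioned on TXN_DT, and a staging table STAGE_TXN_FACT with the same columns as TXN_FACT (including TXN_DT and LOAD_DT); all of these names and column details are illustrative:

begin
  -- one pass per transaction date touched by today's load: rebuild
  -- that date's partition in the work table, then swap it in
  for r in (select distinct trunc(txn_dt) txn_dt
            from   stage_txn_fact
            where  load_dt >= trunc(sysdate)
            order  by 1)
  loop
    exchpart.prepare('TXN_FACT', 'TMP_', to_char(r.txn_dt, 'DD-MON-YYYY'));
    -- dynamic SQL: the TMP_ work table is assumed to be created by exchpart.prepare
    execute immediate
      'insert /*+ append parallel(n,4) */ into tmp_txn_fact n
       select * from txn_fact                   -- rows already published
       where  txn_dt >= :dt1 and txn_dt < :dt2 + 1
       union all
       select * from stage_txn_fact             -- plus the newly staged rows
       where  txn_dt >= :dt3 and txn_dt < :dt4 + 1
       and    load_dt >= trunc(sysdate)'
      using r.txn_dt, r.txn_dt, r.txn_dt, r.txn_dt;
    commit;
    exchpart.finish('TXN_FACT', 'TMP_');
  end loop;
end;
/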
Summary recommendations 1. Use dimensional data models for the "presentation" to end-users Don't "freelance" and confuse the end-users Understand the purpose of facts and dimensions 2. Base the database design on time-variant data structures Don't join "type-2" fact data to "type-1" dimension data Load "type-2" data first, then rebuild "type-1" data from that 3. Use partitioning Enable the "virtuous cycle" of Oracle features that cascade from using partitioning intelligently…