Data Warehouse Design Enrico Franconi CS 636
CS 3362 Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,...
CS 3363 Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … How to get data out? Replication tool Dump file Create report ODBC or third-party “wrappers”
CS 3364 Monitoring Techniques Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring
CS 3365 Monitoring Issues Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes,... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways
CS 3366 Wrapper Converts data and queries from one data model to another Extends query capabilities for sources with limited capabilities Data Model B Data Model A Queries Data Queries Source Wrapper
CS 3367 Wrapper Generation Solution 1: Hard code for each source Solution 2: Automatic wrapper generation Wrapper Generator Definition
CS 3368 Integration Data Cleaning Data Loading Derived Data Client Warehouse Source Query & Analysis Integration Metadata
CS 3369 Data Integration Receive data (changes) from multiple wrappers/monitors and integrate into warehouse Rule-based Actions Resolve inconsistencies Eliminate duplicates Integrate into warehouse (may not be empty) Summarize data Fetch more data from sources (wh updates) etc.
CS Data Cleaning Find (& remove) duplicate tuples e.g., Jane Doe vs. Jane Q. Doe Detect inconsistent, wrong data Attribute values that don’t match Patch missing, unreadable data Insert default values Notify sources of errors found
CS Data Cleaning Migration (e.g., yen to dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)
CS Loading Data in the Warehouse Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load
CS Warehouse Maintenance Warehouse data materialized view Initial loading View maintenance Derived Warehouse Data indexes aggregates materialized views View maintenance
CS Materialized Views Define new warehouse relations using SQL expressions does not exist at any source
CS Differs from Conventional View Maintenance... Warehouses may be highly aggregated and summarized Warehouse views may be over history of base data Process large batch updates Schema may evolve
CS Differs from Conventional View Maintenance... Base data doesn’t participate in view maintenance Simply reports changes Loosely coupled Absence of locking, global transactions May not be queriable
CS Warehouse Maintenance Anomalies Materialized view maintenance in loosely coupled, non-transactional environment Simple example SalesComp. Integrator Data Warehouse Sale(item,clerk)Emp(clerk,age) Sold (item,clerk,age) Sold = Sale Emp
CS Warehouse Maintenance Anomalies 1. Insert into Emp(Mary,25), notify integrator 2. Insert into Sale (Computer,Mary), notify integrator 3. (1) integrator adds Sale (Mary,25) 4. (2) integrator adds (Computer,Mary) Emp 5. View incorrect (duplicate tuple) SalesComp. Integrator Data Warehouse Sale(item,clerk)Emp(clerk,age) Sold (item,clerk,age)
CS Maintenance Anomaly - Solutions Incremental update algorithms (ECA, Strobe, etc.) Research issues: Self-maintainable views What views are self-maintainable Store auxiliary views so original + auxiliary views are self-maintainable
CS Self-Maintainability: Examples Sold(item,clerk,age) = Sale(item,clerk) Emp(clerk,age) Inserts into Emp If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect Inserts into Sale Maintain auxiliary view: Emp- clerk,age (Sold) Deletes from Emp Delete from Sold based on clerk
CS Self-Maintainability: Examples Deletes from Sale Delete from Sold based on {item,clerk} Unless age at time of sale is relevant Auxiliary views for self-maintainability Must themselves be self-maintainable One solution: all source data But want minimal set
CS Partial Self-Maintainability Avoid (but don’t prohibit) going to sources Sold=Sale(item,clerk) Emp(clerk,age) Inserts into Sale Check if clerk already in Sold, go to source if not Or replicate all clerks over age 30 Or...
CS Warehouse Specification (ideally) Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor Integrator Warehouse... Metadata Warehouse Configuration Module View Definitions Integration rules Change Detection Requirements
CS Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Warehouse Source Query & Analysis Integration Metadata
CS ROLAP Server Relational OLAP Server relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”
CS MOLAP Server Multi-Dimensional OLAP Server multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales
CS Index Structures (sketch) Traditional Access Methods B-trees, hash tables, R-trees, grids, … Popular in Warehouses inverted lists bit map indexes join indexes text indexes
CS What to Materialize? Store in warehouse results useful for common queries Example: day 2 day total sales materialize
CS Materialization Factors Type/frequency of queries Query response time Storage cost Update cost
CS Cube Aggregates Lattice city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day use greedy algorithm to decide what to materialize
CS Dimension Hierarchies all state city
CS Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...
CS Interesting Hierarchy all years quarters months days weeks conceptual dimension table
CS Managing Metadata Warehouse Design Tools Client Warehouse Source Query & Analysis Integration Metadata
CS Metadata Administrative definition of sources, tools,... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control,...
CS Metadata Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails
CS Design Summary What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?
CS Tools Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data
CS Current State of Industry Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything
CS State of Commercial Practice... Connectivity to sources Apertus Information Builders Informix Enterprise Gateway Oracle Open Connect CA-Ingres gateway MS ODBC Platinum InfoHub Data extract, clean, transform, refresh CA-Ingres Replicator ETI-Extract IBM Data Joiner, Data Propagator Prism Warehouse manager SAS Access Sybase Replication Server Trinzic InfoPump
CS … State of Commercial Practice... Multidimensional Database Engines Arbor Essbase Oracle RIR Express Comshare Commader SAS System Warehouse Data Servers CA-Ingres Oracle 8 RedBrick Sybase IQ Informix Dynamic Server IBM DB2 ROLAP Servers HP Intelligent Warehouse Informix Metacube MicroStrategy DSS Server Information Advantage Asxys
CS … State of Commercial Practice Query/Reporting Environments IBM DataGuide SAS Access CA Visual Express Platinum Forest&Trees Informix ViewPoint Multidimensional Analysis Kenan Systems Acumate Microsoft Excel Arbor Essbase Analysis server Cognos PowerPlay IQ Software IQ/Vision Lotus 123 SAS OLAP++ Business Objects Lots and lots of consulting!!
CS Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on?
CS Research (1) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush)
CS Research (2) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data Conceptual Modelling
CS Conclusions Massive amounts of data and complexity of queries will push limits of current warehouses Need better systems: easier to use provide quality information