1 Data Warehouse Design Enrico Franconi CS 636

2 CS 336: Implementing a Warehouse
  - Monitoring: Sending data from sources
  - Integrating: Loading, cleansing, ...
  - Processing: Query processing, indexing, ...
  - Managing: Metadata, Design, ...

3 Monitoring
  - Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
  - How to get data out?
    - Replication tool
    - Dump file
    - Create report
    - ODBC or third-party “wrappers”

4 Monitoring Techniques
  - Periodic snapshots
  - Database triggers
  - Log shipping
  - Data shipping (replication service)
  - Transaction shipping
  - Polling (queries to source)
  - Screen scraping
  - Application level monitoring
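As an illustration of the "periodic snapshots" technique, here is a minimal sketch that compares two snapshots of a source table and derives the changes to ship to the warehouse. The function name `snapshot_diff`, the dictionary-keyed-by-primary-key representation, and the sample rows are assumptions made for this example, not part of the original slides.

```python
from typing import Any, Dict, Tuple

Row = Tuple[Any, ...]


def snapshot_diff(old: Dict[Any, Row], new: Dict[Any, Row]):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates); updates map key -> (old_row, new_row).
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Hypothetical snapshots of an Emp table taken on two consecutive days.
monday = {1: ("Mary", 25), 2: ("Joe", 40)}
tuesday = {1: ("Mary", 26), 3: ("Ann", 31)}
ins, dele, upd = snapshot_diff(monday, tuesday)
```

Only the three change sets need to be sent to the integrator, which is the main appeal of snapshot comparison when the source offers no triggers or logs.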

5 Monitoring Issues
  - Frequency
    - periodic: daily, weekly, …
    - triggered: on “big” change, lots of changes, ...
  - Data transformation
    - convert data to uniform format
    - remove & add fields (e.g., add date to get history)
  - Standards (e.g., ODBC)
  - Gateways

6 Wrapper
  - Converts data and queries from one data model to another
  - Extends query capabilities for sources with limited capabilities
  [Diagram: a Wrapper in front of a Source, translating Queries and Data between Data Model A and Data Model B]
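A wrapper can be sketched as a small adapter that gives a limited source a richer query interface. The example below assumes a flat file in CSV form that can only be scanned; the wrapper exposes its records as relational-style rows and adds equality selection. The class name `FlatFileWrapper`, its methods, and the sample data are hypothetical.

```python
import csv
import io


class FlatFileWrapper:
    """Presents a CSV flat file as rows with named columns and adds selection."""

    def __init__(self, text, columns):
        self.text = text        # raw flat-file contents
        self.columns = columns  # column names supplied by the wrapper author

    def scan(self):
        """The only operation the source supports: read every record."""
        for record in csv.reader(io.StringIO(self.text)):
            yield dict(zip(self.columns, record))

    def select(self, **conditions):
        """Extended capability: equality selection evaluated in the wrapper."""
        return [
            row for row in self.scan()
            if all(row[col] == val for col, val in conditions.items())
        ]


# Hypothetical flat file holding Sale(item, clerk) records.
sale_file = "Computer,Mary\nPrinter,Joe\n"
wrapper = FlatFileWrapper(sale_file, ["item", "clerk"])
marys_sales = wrapper.select(clerk="Mary")
```

All values stay strings here; a real wrapper would also convert types to match the target data model.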

7 Wrapper Generation
  - Solution 1: Hard code for each source
  - Solution 2: Automatic wrapper generation
  [Diagram labels: Wrapper Generator, Definition]

8 Integration
  - Data Cleaning
  - Data Loading
  - Derived Data
  [Diagram labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

9 Data Integration
  - Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
  - Rule-based
  - Actions
    - Resolve inconsistencies
    - Eliminate duplicates
    - Integrate into warehouse (may not be empty)
    - Summarize data
    - Fetch more data from sources (wh updates)
    - etc.

10 Data Cleaning
  - Find (& remove) duplicate tuples
    - e.g., Jane Doe vs. Jane Q. Doe
  - Detect inconsistent, wrong data
    - Attribute values that don’t match
  - Patch missing, unreadable data
    - Insert default values
  - Notify sources of errors found
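The "Jane Doe vs. Jane Q. Doe" case can be approached by grouping names under a normalized key. The sketch below is one simple heuristic (lowercase, strip punctuation, drop single-letter middle initials); the function names are assumptions for this example, and real duplicate detection usually needs more evidence than the name alone.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Normalize a name: lowercase, no punctuation, no single-letter middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


def find_duplicates(names):
    """Return groups of names that share the same normalized key."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


candidates = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

The groups returned are candidates only; whether to remove a tuple or notify the source is a separate rule.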

11 Data Cleaning
  - Migration (e.g., yen to dollars)
  - Scrubbing: use domain-specific knowledge (e.g., social security numbers)
  - Fusion (e.g., mail list, customer merging)
  [Diagram: customer1(Joe) in the billing DB and customer2(Joe) in the service DB are fused into merged_customer(Joe)]
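Migration and fusion can be sketched in a few lines. The exchange rate, field names, and the "billing wins on conflict" rule below are all assumptions chosen for illustration; the slides do not specify how conflicts between the two customer records are resolved.

```python
def yen_to_dollars(amount_yen: float, yen_per_dollar: float) -> float:
    """Migration: convert an amount to the warehouse's uniform currency."""
    return round(amount_yen / yen_per_dollar, 2)


def fuse_customers(billing: dict, service: dict) -> dict:
    """Fusion: merge two records for the same customer.

    Assumed rule: start from the service record, then let any
    non-empty billing field override it.
    """
    merged = dict(service)
    for field, value in billing.items():
        if value not in (None, ""):
            merged[field] = value
    return merged


# Hypothetical records for the same customer in two source databases.
customer1 = {"name": "Joe", "address": "12 Main St", "phone": None}
customer2 = {"name": "Joe", "phone": "555-0100", "plan": "gold"}
merged_customer = fuse_customers(customer1, customer2)
```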

12 Loading Data in the Warehouse
  - Incremental vs. refresh
  - Off-line vs. on-line
  - Frequency of loading
    - At night, 1x a week/month, continuously
  - Parallel/Partitioned load
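The difference between refresh and incremental loading can be shown with a toy warehouse table held in a dictionary. This is only a sketch under that assumption; the function names and sample rows are hypothetical.

```python
def refresh_load(warehouse: dict, source_rows: dict) -> None:
    """Refresh: discard the warehouse copy and reload everything."""
    warehouse.clear()
    warehouse.update(source_rows)


def incremental_load(warehouse: dict, inserts: dict, deletes: set, updates: dict) -> None:
    """Incremental: apply only the changes reported since the last load."""
    for key in deletes:
        warehouse.pop(key, None)
    warehouse.update(inserts)
    warehouse.update(updates)


# Two copies of the same warehouse table, loaded in different ways.
wh_refresh = {1: ("Mary", 25), 2: ("Joe", 40)}
wh_incremental = dict(wh_refresh)
source_now = {1: ("Mary", 26), 3: ("Ann", 31)}

refresh_load(wh_refresh, source_now)
incremental_load(
    wh_incremental,
    inserts={3: ("Ann", 31)},
    deletes={2},
    updates={1: ("Mary", 26)},
)
```

Both end in the same state; the incremental path touches three rows instead of reloading the table, at the price of needing reliable change detection at the source.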

13 Warehouse Maintenance
  - Warehouse data
    - materialized view
    - Initial loading
    - View maintenance
  - Derived Warehouse Data
    - indexes
    - aggregates
    - materialized views
    - View maintenance

14 Materialized Views
  - Define new warehouse relations using SQL expressions
  - The resulting relation does not exist at any source

15 Differs from Conventional View Maintenance...
  - Warehouses may be highly aggregated and summarized
  - Warehouse views may be over history of base data
  - Process large batch updates
  - Schema may evolve

16 Differs from Conventional View Maintenance...
  - Base data doesn’t participate in view maintenance
    - Simply reports changes
    - Loosely coupled
    - Absence of locking, global transactions
    - May not be queriable

17 Warehouse Maintenance Anomalies
  - Materialized view maintenance in loosely coupled, non-transactional environment
  - Simple example
    - Source "Sales": Sale(item,clerk)
    - Source "Comp.": Emp(clerk,age)
    - Data Warehouse (fed by the Integrator): Sold(item,clerk,age)
    - Sold = Sale ⋈ Emp

18 Warehouse Maintenance Anomalies
  1. Insert into Emp(Mary,25), notify integrator
  2. Insert into Sale(Computer,Mary), notify integrator
  3. (1) → integrator adds Sale ⋈ (Mary,25)
  4. (2) → integrator adds (Computer,Mary) ⋈ Emp
  5. View incorrect (duplicate tuple)
  [Diagram: sources "Sales" with Sale(item,clerk) and "Comp." with Emp(clerk,age) feed the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
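The five steps can be replayed in a short simulation. The sketch below assumes the integrator answers each notification by joining the reported tuple against the current contents of the other source, and that the view is kept as a bag (a list) so the duplicate is visible. The `join` helper and variable names are made up for this example.

```python
def join(sale, emp):
    """Natural join of Sale(item, clerk) and Emp(clerk, age) on clerk."""
    return [
        (item, clerk, age)
        for (item, clerk) in sale
        for (c, age) in emp
        if c == clerk
    ]


# Source relations and the warehouse view, all initially empty.
sale, emp, sold = [], [], []

# Steps 1 and 2: both source inserts happen before the integrator reacts.
emp.append(("Mary", 25))
sale.append(("Computer", "Mary"))

# Step 3: notification (1) is handled against the *current* Sale,
# which already contains (Computer, Mary).
sold += join(sale, [("Mary", 25)])

# Step 4: notification (2) is handled against the *current* Emp,
# which already contains (Mary, 25).
sold += join([("Computer", "Mary")], emp)

# Step 5: the view now holds the same tuple twice.
correct_view = join(sale, emp)
```

The correct view has one tuple, while `sold` has two, because each notification was evaluated against a source state that already reflected the other update.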

19 Maintenance Anomaly - Solutions
  - Incremental update algorithms (ECA, Strobe, etc.)
  - Research issues: Self-maintainable views
    - What views are self-maintainable
    - Store auxiliary views so original + auxiliary views are self-maintainable

20 Self-Maintainability: Examples
  Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
  - Inserts into Emp
    - If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
  - Inserts into Sale
    - Maintain auxiliary view: Emp − π_{clerk,age}(Sold)
  - Deletes from Emp
    - Delete from Sold based on clerk

21 Self-Maintainability: Examples
  - Deletes from Sale
    - Delete from Sold based on {item,clerk}
    - Unless age at time of sale is relevant
  - Auxiliary views for self-maintainability
    - Must themselves be self-maintainable
    - One solution: all source data
    - But want minimal set
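The four cases from the two examples slides can be put together in one sketch that maintains Sold without ever querying the sources. It assumes set semantics for Sale, that Emp.clerk is a key, and that Sale.clerk is a foreign key; the auxiliary view holds Emp tuples whose clerk does not yet appear in Sold. The class and method names are hypothetical.

```python
class SelfMaintainedSold:
    """Sold = Sale JOIN Emp, maintained from reported changes only."""

    def __init__(self):
        self.sold = set()  # tuples (item, clerk, age)
        self.aux = {}      # auxiliary view: clerk -> age for clerks not in Sold

    def _age_of(self, clerk):
        """Find a clerk's age in Sold first, then in the auxiliary view."""
        for (_, c, age) in self.sold:
            if c == clerk:
                return age
        return self.aux.get(clerk)

    def insert_emp(self, clerk, age):
        # A new clerk cannot have sales yet, so Sold is unaffected.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        age = self._age_of(clerk)
        if age is None:
            raise KeyError(f"unknown clerk {clerk!r}: referential integrity violated")
        self.sold.add((item, clerk, age))
        self.aux.pop(clerk, None)  # clerk now appears in Sold

    def delete_emp(self, clerk):
        # Delete from Sold based on clerk.
        self.sold = {t for t in self.sold if t[1] != clerk}
        self.aux.pop(clerk, None)

    def delete_sale(self, item, clerk):
        # Delete from Sold based on {item, clerk}.
        age = self._age_of(clerk)
        self.sold = {t for t in self.sold if (t[0], t[1]) != (item, clerk)}
        # Keep the clerk's age available if no Sold tuple carries it any more.
        if age is not None and not any(t[1] == clerk for t in self.sold):
            self.aux[clerk] = age
```

A short usage example: `v = SelfMaintainedSold(); v.insert_emp("Mary", 25); v.insert_sale("Computer", "Mary")` leaves `v.sold` with the single tuple `("Computer", "Mary", 25)` and an empty auxiliary view. The sketch ignores the "age at time of sale" caveat noted on the slide.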

22 Partial Self-Maintainability
  - Avoid (but don’t prohibit) going to sources
  Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
  - Inserts into Sale
    - Check if clerk already in Sold, go to source if not
    - Or replicate all clerks over age 30
    - Or...

23 Warehouse Specification (ideally)
  [Diagram labels: Extractor/Monitor (three instances), Integrator, Warehouse, Metadata, Warehouse Configuration Module, View Definitions, Integration rules, Change Detection Requirements]

24 Processing
  - ROLAP servers vs. MOLAP servers
  - Index Structures
  - What to Materialize?
  - Algorithms
  [Diagram labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

25 ROLAP Server
  - Relational OLAP Server
  - Special indices, tuning; Schema is “denormalized”
  [Diagram labels: tools, utilities, ROLAP server, relational DBMS]

26 MOLAP Server
  - Multi-Dimensional OLAP Server
  - Could also sit on relational DBMS
  [Diagram labels: multi-dimensional server, M.D. tools, utilities; a Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4)]

27 Index Structures (sketch)
  - Traditional Access Methods
    - B-trees, hash tables, R-trees, grids, …
  - Popular in Warehouses
    - inverted lists
    - bit map indexes
    - join indexes
    - text indexes
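A bit map index keeps one bit vector per distinct column value, so multi-attribute selections reduce to bitwise AND and OR. The sketch below uses Python integers as bit vectors; the class name `BitmapIndex` and the sample column values are assumptions for illustration.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap per distinct value; bit i is set if row i has that value."""

    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = defaultdict(int)
        for row_id, value in enumerate(values):
            self.bitmaps[value] |= 1 << row_id

    def lookup(self, value):
        """Bitmap for a value (0 if the value never occurs)."""
        return self.bitmaps.get(value, 0)

    def rows(self, bitmap):
        """Row ids whose bit is set in the given bitmap."""
        return [i for i in range(self.n) if (bitmap >> i) & 1]


# Hypothetical fact-table columns, one entry per row.
product = ["milk", "soda", "milk", "eggs"]
city = ["A", "A", "B", "A"]
product_idx = BitmapIndex(product)
city_idx = BitmapIndex(city)

# product = 'milk' AND city = 'A'
milk_in_a = product_idx.rows(product_idx.lookup("milk") & city_idx.lookup("A"))
# product = 'milk' OR product = 'eggs'
milk_or_eggs = product_idx.rows(product_idx.lookup("milk") | product_idx.lookup("eggs"))
```

Bitmaps work best on low-cardinality columns, which is typical of warehouse dimension attributes.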

28 What to Materialize?
  - Store in warehouse results useful for common queries
  - Example:
  [Figure: per-day sales tables (day 1, day 2, ...) and a "total sales" summary marked "materialize"]

29 Materialization Factors
  - Type/frequency of queries
  - Query response time
  - Storage cost
  - Update cost

30 Cube Aggregates Lattice
  Lattice levels, from finest to coarsest:
    city, product, date
    city, product | city, date | product, date
    city | product | date
    all
  - Use greedy algorithm to decide what to materialize
  [Figure: per-day sales tables (day 1, day 2, ...)]
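The slide does not say which greedy algorithm is meant. One common formulation picks, at each step, the view whose materialization most reduces the total cost of answering every view in the lattice, where the cost of a query is the size of the smallest materialized view that contains its attributes. The sketch below follows that formulation; the view sizes are invented for illustration.

```python
def query_cost(view, materialized, size):
    """Cost of answering `view`: size of its smallest materialized superset."""
    return min(size[m] for m in materialized if view <= m)


def benefit(candidate, materialized, size):
    """Total cost saved across all views answerable from `candidate`."""
    return sum(
        max(0, query_cost(w, materialized, size) - size[candidate])
        for w in size
        if w <= candidate
    )


def greedy_select(size, k):
    """Pick up to k views to materialize in addition to the top view."""
    top = max(size, key=len)  # the finest view is always materialized
    chosen = [top]
    for _ in range(k):
        candidates = [v for v in size if v not in chosen]
        best = max(candidates, key=lambda v: benefit(v, chosen, size))
        if benefit(best, chosen, size) <= 0:
            break
        chosen.append(best)
    return chosen


def view(*attrs):
    return frozenset(attrs)


# Hypothetical row counts for each group-by in the lattice.
SIZE = {
    view("city", "product", "date"): 1000,
    view("city", "product"): 600,
    view("city", "date"): 900,
    view("product", "date"): 500,
    view("city"): 50,
    view("product"): 100,
    view("date"): 300,
    view(): 1,  # "all": a single grand-total row
}

picked = greedy_select(SIZE, k=2)
```

With these sizes the first pick is (product, date), which halves the cost of four views, and the second is (city), which is cheap and also serves the grand total.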

31 Dimension Hierarchies
  [Diagram: hierarchy with levels city, state, all]

32 Dimension Hierarchies
  Lattice nodes:
    city, product, date
    city, product | city, date | product, date
    city | product | date
    state, product, date
    state, product | state, date
    state
    all
  (not all arcs shown...)
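A dimension hierarchy lets a coarser aggregate be computed from a finer one. The sketch below rolls (city, product) totals up to (state, product) totals; the city-to-state mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy: each city belongs to one state.
CITY_TO_STATE = {"San Jose": "CA", "Los Angeles": "CA", "Seattle": "WA"}


def roll_up(sales_by_city, hierarchy):
    """Aggregate (city, product) totals into (state, product) totals."""
    totals = defaultdict(int)
    for (city, product), amount in sales_by_city.items():
        totals[(hierarchy[city], product)] += amount
    return dict(totals)


sales_by_city = {
    ("San Jose", "milk"): 12,
    ("Los Angeles", "milk"): 9,
    ("Seattle", "milk"): 4,
    ("Seattle", "soda"): 7,
}
sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
```

This is why the (state, product) node sits below (city, product) in the lattice: it can be answered from the finer view without going back to the base data.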

33 Interesting Hierarchy
  [Diagram: time hierarchy over days, weeks, months, quarters, years, all; labeled "conceptual dimension table"]

34 Managing
  - Metadata
  - Warehouse Design
  - Tools
  [Diagram labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

35 Metadata
  - Administrative
    - definition of sources, tools, ...
    - schemas, dimension hierarchies, …
    - rules for extraction, cleaning, …
    - refresh, purging policies
    - user profiles, access control, ...

36 Metadata
  - Business
    - business terms & definition
    - data ownership, charging
  - Operational
    - data lineage
    - data currency (e.g., active, archived, purged)
    - use stats, error reports, audit trails

37 Design Summary
  - What data is needed?
  - Where does it come from?
  - How to clean data?
  - How to represent in warehouse (schema)?
  - What to summarize?
  - What to materialize?
  - What to index?

38 Tools
  - Development
    - design & edit: schemas, views, scripts, rules, queries, reports
  - Planning & Analysis
    - what-if scenarios (schema changes, refresh rates), capacity planning
  - Warehouse Management
    - performance monitoring, usage patterns, exception reporting
  - System & Network Management
    - measure traffic (sources, warehouse, clients)
  - Workflow Management
    - “reliable scripts” for cleaning & analyzing data

39 Current State of Industry
  - Extraction and integration done off-line
    - Usually in large, time-consuming batches
  - Everything copied at warehouse
    - Not selective about what is stored
    - Query benefit vs storage & update cost
  - Query optimization aimed at OLTP
    - High throughput instead of fast response
    - Process whole query before displaying anything

40 State of Commercial Practice...
  - Connectivity to sources
    - Apertus
    - Information Builders
    - Informix Enterprise Gateway
    - Oracle Open Connect
    - CA-Ingres gateway
    - MS ODBC
    - Platinum InfoHub
  - Data extract, clean, transform, refresh
    - CA-Ingres Replicator
    - ETI-Extract
    - IBM Data Joiner, Data Propagator
    - Prism Warehouse manager
    - SAS Access
    - Sybase Replication Server
    - Trinzic InfoPump

41 … State of Commercial Practice...
  - Multidimensional Database Engines
    - Arbor Essbase
    - Oracle IRI Express
    - Comshare Commander
    - SAS System
  - Warehouse Data Servers
    - CA-Ingres
    - Oracle 8
    - RedBrick
    - Sybase IQ
    - Informix Dynamic Server
    - IBM DB2
  - ROLAP Servers
    - HP Intelligent Warehouse
    - Informix Metacube
    - MicroStrategy DSS Server
    - Information Advantage Axsys

42 … State of Commercial Practice
  - Query/Reporting Environments
    - IBM DataGuide
    - SAS Access
    - CA Visual Express
    - Platinum Forest&Trees
    - Informix ViewPoint
  - Multidimensional Analysis
    - Kenan Systems Acumate
    - Microsoft Excel
    - Arbor Essbase Analysis server
    - Cognos PowerPlay
    - IQ Software IQ/Vision
    - Lotus 123
    - SAS OLAP++
    - Business Objects
  - Lots and lots of consulting!!

43 Future Directions
  - Better performance
  - Larger warehouses
  - Easier to use
  - What are companies & research labs working on?

44 Research (1)
  - Incremental Maintenance
  - Data Consistency
  - Data Expiration
  - Recovery
  - Data Quality
  - Error Handling (Back Flush)

45 Research (2)
  - Rapid Monitor Construction
  - Temporal Warehouses
  - Materialization & Index Selection
  - Data Fusion
  - Data Mining
  - Integration of Text & Relational Data
  - Conceptual Modelling

46 Conclusions
  - Massive amounts of data and complexity of queries will push limits of current warehouses
  - Need better systems:
    - easier to use
    - provide quality information

