Data Warehouse Design Enrico Franconi CS 636

Implementing a Warehouse
- Monitoring: Sending data from sources
- Integrating: Loading, cleansing, ...
- Processing: Query processing, indexing, ...
- Managing: Metadata, Design, ...

Monitoring
- Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
- How to get data out?
  - Replication tool
  - Dump file
  - Create report
  - ODBC or third-party “wrappers”

Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application level monitoring
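
As an illustration of the periodic-snapshot technique, the following is a minimal sketch that compares two snapshots of a source table and derives the changes to send to the warehouse. The function name `diff_snapshots` and the dictionary-per-snapshot layout (primary key mapped to row) are assumptions made for this example, not part of the slides.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates); updates map key -> (old_row, new_row).
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Hypothetical snapshots of an employee table taken on two consecutive days.
yesterday = {1: ("Mary", 25), 2: ("Joe", 31)}
today = {1: ("Mary", 26), 3: ("Ann", 40)}
inserts, deletes, updates = diff_snapshots(yesterday, today)
```

Snapshot comparison needs no cooperation from the source beyond a dump, which is why it suits legacy systems, but it only sees the net change between two snapshots.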

Monitoring Issues
- Frequency
  - periodic: daily, weekly, …
  - triggered: on “big” change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways

Wrapper
- Converts data and queries from one data model to another
- Extends query capabilities for sources with limited capabilities
[Diagram: a Wrapper sits on top of a Source and translates Queries and Data between Data Model A and Data Model B]
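
A small sketch of both wrapper roles, under stated assumptions: the source is a hypothetical flat file that can only be scanned line by line, and the wrapper presents it as rows with named columns and adds equality selection that the source itself lacks. The class names `FlatFileSource` and `RelationalWrapper` are invented for this example.

```python
class FlatFileSource:
    """Hypothetical source: can only return every record as a delimited line."""

    def __init__(self, lines):
        self._lines = list(lines)

    def scan(self):
        return iter(self._lines)


class RelationalWrapper:
    """Presents the flat file as rows (dicts) and adds equality selection."""

    def __init__(self, source, columns, sep="|"):
        self.source = source
        self.columns = columns
        self.sep = sep

    def select(self, **equals):
        for line in self.source.scan():
            # Data model conversion: delimited text -> named attributes.
            row = dict(zip(self.columns, line.split(self.sep)))
            # Extended query capability: filtering done in the wrapper.
            if all(row.get(col) == val for col, val in equals.items()):
                yield row


source = FlatFileSource(["Computer|Mary", "Printer|Joe"])
wrapper = RelationalWrapper(source, ["item", "clerk"])
rows = list(wrapper.select(clerk="Mary"))
```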

Wrapper Generation
- Solution 1: Hard code for each source
- Solution 2: Automatic wrapper generation
[Diagram: a Definition is fed to a Wrapper Generator, which produces the Wrapper]

Integration
- Data Cleaning
- Data Loading
- Derived Data
[Diagram: warehouse architecture with Source, Integration, Warehouse, Metadata, Query & Analysis, and Client]

Data Integration
- Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
- Rule-based
- Actions
  - Resolve inconsistencies
  - Eliminate duplicates
  - Integrate into warehouse (may not be empty)
  - Summarize data
  - Fetch more data from sources (wh updates)
  - etc.

Data Cleaning
- Find (& remove) duplicate tuples
  - e.g., Jane Doe vs. Jane Q. Doe
- Detect inconsistent, wrong data
  - Attribute values that don’t match
- Patch missing, unreadable data
  - Insert default values
- Notify sources of errors found
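
The duplicate-detection and default-value steps can be sketched as follows. This is only one simple heuristic, assumed for illustration: names are matched after lowercasing and dropping single-letter tokens such as middle initials, so "Jane Doe" and "Jane Q. Doe" collide. The helpers `name_key` and `clean` and the `UNKNOWN` default are hypothetical.

```python
import re


def name_key(name):
    """Normalize a name: lowercase, letters only, drop one-letter initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return tuple(t for t in tokens if len(t) > 1)


def clean(rows, defaults):
    """Drop rows whose normalized name was already seen; patch missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = name_key(row["name"])
        if key in seen:
            continue  # duplicate tuple: keep the first occurrence only
        seen.add(key)
        present = {k: v for k, v in row.items() if v not in (None, "")}
        cleaned.append({**defaults, **present})
    return cleaned


rows = [
    {"name": "Jane Doe", "city": "Boston"},
    {"name": "Jane Q. Doe", "city": ""},
    {"name": "Joe Roe", "city": None},
]
result = clean(rows, {"city": "UNKNOWN"})
```

A real cleaning pipeline would usually log the dropped and patched rows as well, so that the sources can be notified of the errors found.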

Data Cleaning
- Migration (e.g., yen to dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
[Diagram: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]

Loading Data in the Warehouse
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading
  - At night, 1x a week/month, continuously
- Parallel/Partitioned load
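
To make the incremental vs. refresh distinction concrete, here is a minimal sketch in which the warehouse table is modeled as a dictionary from key to row. The function names and the change-set layout are assumptions for this example only.

```python
def full_refresh(source_rows):
    """Refresh: discard the old warehouse copy and reload everything."""
    return dict(source_rows)


def incremental_load(warehouse, inserts=(), updates=(), deletes=()):
    """Incremental: apply only the reported changes to the existing copy."""
    for key in deletes:
        warehouse.pop(key, None)
    warehouse.update(inserts)
    warehouse.update(updates)
    return warehouse


source = {1: "a", 2: "b"}
warehouse = full_refresh(source)
incremental_load(warehouse, inserts={3: "c"}, updates={1: "A"}, deletes=[2])
```

A refresh costs time proportional to the whole source, while an incremental load costs time proportional to the change set, which matters when loads run nightly or continuously.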

Warehouse Maintenance
- Warehouse data
  - materialized view
  - Initial loading
  - View maintenance
- Derived Warehouse Data
  - indexes
  - aggregates
  - materialized views
  - View maintenance

Materialized Views
- Define new warehouse relations using SQL expressions
[Diagram callout: the new relation does not exist at any source]
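
A minimal sketch of defining a warehouse relation with a SQL expression, using Python's built-in `sqlite3` module and the Sale/Emp/Sold schema from this deck's maintenance example. SQLite has no native materialized views, so `CREATE TABLE ... AS SELECT` stands in for materialization here; that substitution is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Copies of the two source relations.
    CREATE TABLE Sale(item TEXT, clerk TEXT);
    CREATE TABLE Emp(clerk TEXT PRIMARY KEY, age INTEGER);
    INSERT INTO Emp VALUES ('Mary', 25);
    INSERT INTO Sale VALUES ('Computer', 'Mary');

    -- Sold exists at no source; it is defined by a SQL expression
    -- and stored (materialized) in the warehouse.
    CREATE TABLE Sold AS
        SELECT s.item, s.clerk, e.age
        FROM Sale AS s JOIN Emp AS e ON s.clerk = e.clerk;
    """
)
sold = conn.execute("SELECT item, clerk, age FROM Sold").fetchall()
```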

Differs from Conventional View Maintenance...
- Warehouses may be highly aggregated and summarized
- Warehouse views may be over history of base data
- Process large batch updates
- Schema may evolve

Differs from Conventional View Maintenance...
- Base data doesn’t participate in view maintenance
  - Simply reports changes
  - Loosely coupled
  - Absence of locking, global transactions
  - May not be queryable

Warehouse Maintenance Anomalies
- Materialized view maintenance in loosely coupled, non-transactional environment
- Simple example
[Diagram: source Sales holds Sale(item,clerk) and source Comp. holds Emp(clerk,age); both report to the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
Sold = Sale ⋈ Emp

Warehouse Maintenance Anomalies
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale(Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
[Diagram: sources Sales with Sale(item,clerk) and Comp. with Emp(clerk,age), the Integrator, and the Data Warehouse with Sold(item,clerk,age)]
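
The five steps can be replayed in a few lines. This is a sketch under one assumption: the integrator evaluates each join against the current state of the sources, and both source inserts have already happened by the time it handles the first notification. The handler names are hypothetical.

```python
emp = set()    # source relation Emp(clerk, age)
sale = set()   # source relation Sale(item, clerk)
sold = []      # warehouse view Sold(item, clerk, age), kept as a bag


def on_emp_insert(clerk, age):
    """Integrator reaction to (1): add Sale JOIN (clerk, age)."""
    sold.extend((item, c, age) for item, c in sale if c == clerk)


def on_sale_insert(item, clerk):
    """Integrator reaction to (2): add (item, clerk) JOIN Emp."""
    sold.extend((item, clerk, age) for c, age in emp if c == clerk)


# Steps 1 and 2: both sources change before any notification is handled.
emp.add(("Mary", 25))
sale.add(("Computer", "Mary"))

# Steps 3 and 4: each handler joins against sources that already
# contain the other update, so the same view tuple is produced twice.
on_emp_insert("Mary", 25)
on_sale_insert("Computer", "Mary")
```

After these calls `sold` holds `("Computer", "Mary", 25)` twice, which is the duplicate tuple of step 5. The anomaly comes from the missing coordination between sources and integrator, not from either handler being wrong in isolation.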

Maintenance Anomaly - Solutions
- Incremental update algorithms (ECA, Strobe, etc.)
- Research issues: Self-maintainable views
  - What views are self-maintainable
  - Store auxiliary views so original + auxiliary views are self-maintainable

Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
- Inserts into Emp
  - If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
- Inserts into Sale
  - Maintain auxiliary view: Emp-Δ = π_{clerk,age}(Sold)
- Deletes from Emp
  - Delete from Sold based on clerk

Self-Maintainability: Examples
- Deletes from Sale
  - Delete from Sold based on {item,clerk}
  - Unless age at time of sale is relevant
- Auxiliary views for self-maintainability
  - Must themselves be self-maintainable
  - One solution: all source data
  - But want minimal set
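
The two delete cases need nothing but the view itself, which a short sketch makes visible. Sold is modeled as a list of (item, clerk, age) tuples; the function names are assumptions for this example.

```python
def delete_emp(sold, clerk):
    """Delete from Emp: remove every Sold tuple for that clerk."""
    return [t for t in sold if t[1] != clerk]


def delete_sale(sold, item, clerk):
    """Delete from Sale: remove Sold tuples matching {item, clerk}."""
    return [t for t in sold if (t[0], t[1]) != (item, clerk)]


sold = [
    ("Computer", "Mary", 25),
    ("Printer", "Mary", 25),
    ("Printer", "Joe", 31),
]
after_sale_delete = delete_sale(sold, "Printer", "Mary")
after_emp_delete = delete_emp(sold, "Mary")
```

Neither function consults Sale or Emp, so the view is self-maintainable with respect to these deletes.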

Partial Self-Maintainability
- Avoid (but don’t prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
- Inserts into Sale
  - Check if clerk already in Sold, go to source if not
  - Or replicate all clerks over age 30
  - Or ...
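
The first option for inserts into Sale can be sketched as follows: look the clerk up in the projection of Sold onto (clerk, age), and fall back to the source only on a miss. The callback `fetch_age_from_source` is a hypothetical stand-in for a query against the Emp source.

```python
def insert_sale(sold, item, clerk, fetch_age_from_source):
    """Maintain Sold on an insert into Sale.

    Returns True if the source had to be contacted.
    """
    known_ages = {c: age for _, c, age in sold}  # projection of Sold on (clerk, age)
    went_to_source = clerk not in known_ages
    if went_to_source:
        age = fetch_age_from_source(clerk)
    else:
        age = known_ages[clerk]
    sold.append((item, clerk, age))
    return went_to_source


emp_source = {"Mary": 25, "Joe": 31}  # hypothetical contents of the Emp source
sold = [("Computer", "Mary", 25)]
first = insert_sale(sold, "Printer", "Mary", emp_source.__getitem__)
second = insert_sale(sold, "Printer", "Joe", emp_source.__getitem__)
```

Here the insert for Mary is handled locally, while the insert for Joe requires one trip to the source because Joe has no earlier sale in the view.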

Warehouse Specification (ideally)
[Diagram components: several Extractor/Monitor modules, Integrator, Warehouse, Metadata, Warehouse Configuration Module, View Definitions, Integration rules, Change Detection Requirements]

Processing
- ROLAP servers vs. MOLAP servers
- Index Structures
- What to Materialize?
- Algorithms
[Diagram: warehouse architecture with Source, Integration, Warehouse, Metadata, Query & Analysis, and Client]

ROLAP Server
- Relational OLAP Server
- Special indices, tuning; schema is “denormalized”
[Diagram layers: tools, utilities, ROLAP server, relational DBMS]

MOLAP Server
- Multi-Dimensional OLAP Server
- Could also sit on relational DBMS
[Diagram layers: tools, utilities, M.D. (multi-dimensional) server; a Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date]

Index Structures (sketch)
- Traditional Access Methods
  - B-trees, hash tables, R-trees, grids, …
- Popular in Warehouses
  - inverted lists
  - bit map indexes
  - join indexes
  - text indexes
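
A bit map index keeps one bit vector per distinct column value, so conjunctive predicates reduce to bitwise AND. The sketch below uses Python integers as bit vectors; the column contents and function names are made up for illustration.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Decode a bit vector back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two columns of a hypothetical four-row fact table.
city = ["NYC", "SF", "NYC", "LA"]
product = ["milk", "milk", "soda", "milk"]

city_index = build_bitmap_index(city)
product_index = build_bitmap_index(product)

# city = 'NYC' AND product = 'milk' becomes a single bitwise AND.
hits = rows_of(city_index["NYC"] & product_index["milk"])
```

Bit map indexes work best on low-cardinality columns such as the dimensions of a star schema, where each vector is dense enough to be worth storing.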

What to Materialize?
- Store in warehouse results useful for common queries
- Example: [Figure: daily sales tables and the total sales table that is materialized]

Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost

Cube Aggregates Lattice
[Lattice, top to bottom:]
- city, product, date
- city, product; city, date; product, date
- city; product; date
- all
Use greedy algorithm to decide what to materialize
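
The slide does not say which greedy algorithm is meant. The sketch below shows one common benefit-based heuristic, offered as an assumption: start with the top view materialized, then repeatedly add the view that most reduces the total cost of answering every view in the lattice, where a view can be answered from any materialized view that contains its dimensions and the cost is that view's row count. All sizes are invented for illustration.

```python
def greedy_select(sizes, top, k):
    """Pick up to k extra views to materialize, greedily by total benefit."""
    materialized = {top}

    def cost(view):
        # Cheapest materialized ancestor (superset of dimensions).
        return min(sizes[m] for m in materialized if view <= m)

    for _ in range(k):
        best, best_benefit = None, 0
        for cand in sizes:
            if cand in materialized:
                continue
            # Benefit: savings over all views answerable from the candidate.
            benefit = sum(
                max(0, cost(w) - sizes[cand]) for w in sizes if w <= cand
            )
            if benefit > best_benefit:
                best, best_benefit = cand, benefit
        if best is None:
            break
        materialized.add(best)
    return materialized


V = frozenset
sizes = {  # hypothetical row counts per view
    V({"city", "product", "date"}): 100,
    V({"city", "product"}): 50,
    V({"city", "date"}): 100,
    V({"product", "date"}): 80,
    V({"city"}): 10,
    V({"product"}): 20,
    V({"date"}): 30,
    V(): 1,  # the "all" view
}
chosen = greedy_select(sizes, V({"city", "product", "date"}), k=2)
```

With these sizes the first pick is (city, product), which saves 50 rows on each of four views, and the second is (date). The (city, date) view is never chosen because it is as large as the top view and so yields no benefit.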

Dimension Hierarchies
[Hierarchy levels, coarsest to finest: all, state, city]

Dimension Hierarchies
[Lattice nodes:]
- city, product, date
- city, product; city, date; product, date
- city; product; date
- all
- state, product, date
- state, product; state, date
- state
(not all arcs shown...)
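
Moving from a city-level node to the corresponding state-level node in this lattice is a roll-up along the city-to-state hierarchy. A minimal sketch follows, with made-up cities, states, and sales figures, and a hypothetical `roll_up` helper.

```python
from collections import defaultdict


def roll_up(facts, city_to_state):
    """Aggregate (city, product) totals into (state, product) totals."""
    totals = defaultdict(int)
    for (city, product), amount in facts.items():
        totals[(city_to_state[city], product)] += amount
    return dict(totals)


city_to_state = {"San Jose": "CA", "Palo Alto": "CA", "Seattle": "WA"}
by_city = {
    ("San Jose", "milk"): 3,
    ("Palo Alto", "milk"): 2,
    ("Seattle", "soda"): 4,
}
by_state = roll_up(by_city, city_to_state)
```

Because every city belongs to exactly one state, the state-level aggregate can be computed from the city-level one without going back to the base data.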

Interesting Hierarchy
[Diagram: time hierarchy with levels all, years, quarters, months, weeks, days; labeled “conceptual dimension table”]

Managing
- Metadata
- Warehouse Design
- Tools
[Diagram: warehouse architecture with Source, Integration, Warehouse, Metadata, Query & Analysis, and Client]

Metadata
- Administrative
  - definition of sources, tools, ...
  - schemas, dimension hierarchies, …
  - rules for extraction, cleaning, …
  - refresh, purging policies
  - user profiles, access control, ...

Metadata
- Business
  - business terms & definition
  - data ownership, charging
- Operational
  - data lineage
  - data currency (e.g., active, archived, purged)
  - use stats, error reports, audit trails

Design Summary
- What data is needed?
- Where does it come from?
- How to clean data?
- How to represent in warehouse (schema)?
- What to summarize?
- What to materialize?
- What to index?

Tools
- Development
  - design & edit: schemas, views, scripts, rules, queries, reports
- Planning & Analysis
  - what-if scenarios (schema changes, refresh rates), capacity planning
- Warehouse Management
  - performance monitoring, usage patterns, exception reporting
- System & Network Management
  - measure traffic (sources, warehouse, clients)
- Workflow Management
  - “reliable scripts” for cleaning & analyzing data

Current State of Industry
- Extraction and integration done off-line
  - Usually in large, time-consuming batches
- Everything copied at warehouse
  - Not selective about what is stored
  - Query benefit vs storage & update cost
- Query optimization aimed at OLTP
  - High throughput instead of fast response
  - Process whole query before displaying anything

State of Commercial Practice...
- Connectivity to sources
  - Apertus
  - Information Builders
  - Informix Enterprise Gateway
  - Oracle Open Connect
  - CA-Ingres gateway
  - MS ODBC
  - Platinum InfoHub
- Data extract, clean, transform, refresh
  - CA-Ingres Replicator
  - ETI-Extract
  - IBM Data Joiner, Data Propagator
  - Prism Warehouse manager
  - SAS Access
  - Sybase Replication Server
  - Trinzic InfoPump

… State of Commercial Practice...
- Multidimensional Database Engines
  - Arbor Essbase
  - Oracle IRI Express
  - Comshare Commander
  - SAS System
- Warehouse Data Servers
  - CA-Ingres
  - Oracle 8
  - RedBrick
  - Sybase IQ
  - Informix Dynamic Server
  - IBM DB2
- ROLAP Servers
  - HP Intelligent Warehouse
  - Informix Metacube
  - MicroStrategy DSS Server
  - Information Advantage Axsys

… State of Commercial Practice
- Query/Reporting Environments
  - IBM DataGuide
  - SAS Access
  - CA Visual Express
  - Platinum Forest & Trees
  - Informix ViewPoint
- Multidimensional Analysis
  - Kenan Systems Acumate
  - Microsoft Excel
  - Arbor Essbase Analysis server
  - Cognos PowerPlay
  - IQ Software IQ/Vision
  - Lotus 1-2-3
  - SAS OLAP++
  - Business Objects
- Lots and lots of consulting!!

Future Directions
- Better performance
- Larger warehouses
- Easier to use
- What are companies & research labs working on?

Research (1)
- Incremental Maintenance
- Data Consistency
- Data Expiration
- Recovery
- Data Quality
- Error Handling (Back Flush)

Research (2)
- Rapid Monitor Construction
- Temporal Warehouses
- Materialization & Index Selection
- Data Fusion
- Data Mining
- Integration of Text & Relational Data
- Conceptual Modelling

Conclusions
- Massive amounts of data and complexity of queries will push limits of current warehouses
- Need better systems:
  - easier to use
  - provide quality information