Joachim Hammer 1 Data Warehousing Overview, Terminology, and Research Issues Joachim Hammer.

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

Data Warehouse Design Enrico Franconi CS 636. CS 3362 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Jennifer Widom On-Line Analytical Processing (OLAP) Introduction.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #15.
Data Management for Decision Support Session - 1 Prof. Bharat Bhasker.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
CSE6011 Data Warehouse and OLAP  Why data warehouse  What’s data warehouse  What’s multi-dimensional data model  What’s difference between OLAP and.
Distributed DBMSs A distributed database is a single logical database that is physically distributed to computers on a network. Homogeneous DDBMS has the.
SLIDE 1IS 257 – Spring 2004 Data Warehousing University of California, Berkeley School of Information Management and Systems SIMS 257: Database.
SLIDE 1IS 257 – Fall 2011 Data Warehousing University of California, Berkeley School of Information IS 257: Database Management.
11/1/2001Database Management -- R. Larson Data Warehouses, Decision Support and Data Mining University of California, Berkeley School of Information Management.
Data Warehousing and OLAP
SLIDE 1IS Fall 2002 Data Warehouses, Decision Support and Data Mining University of California, Berkeley School of Information Management.
10/30/2001Database Management -- R. Larson Data Warehousing University of California, Berkeley School of Information Management and Systems SIMS 257: Database.
Introduction to Data Warehousing Enrico Franconi CS 636.
SLIDE 1IS 257 – Spring 2004 Data Warehouses, Decision Support and Data Mining University of California, Berkeley School of Information Management.
Chapter 13 The Data Warehouse
11/2/2000Database Management -- R. Larson Data Warehouses, Decision Support and Data Mining University of California, Berkeley School of Information Management.
DATA WAREHOUSE (Muscat, Oman).
CS 345: Topics in Data Warehousing Tuesday, September 28, 2004.
A Comparsion of Databases and Data Warehouses Name: Liliana Livorová Subject: Distributed Data Processing.
M ODULE 5 Metadata, Tools, and Data Warehousing Section 4 Data Warehouse Administration 1 ITEC 450.
1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,
Basic Concepts of Datawarehousing An Overview Prasanth Gurram.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
Data Management for Decision Support Session-2 Prof. Bharat Bhasker.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
DW-1: Introduction to Data Warehousing. Overview What is Database What Is Data Warehousing Data Marts and Data Warehouses The Data Warehousing Process.
AN OVERVIEW OF DATA WAREHOUSING
Intro. to Data Warehouse
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
SLIDE 1IS 257 – Fall 2014 Data Warehousing University of California, Berkeley School of Information IS 257: Database Management.
1 Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing Fundamentals of Data Warehousing Dr. Akhtar Ali School of Computing,
Data Warehousing and OLAP. Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata.
1 Topics about Data Warehouses What is a data warehouse? How does a data warehouse differ from a transaction processing database? What are the characteristics.
Data Management for Decision Support Session-3 Prof. Bharat Bhasker.
Data Warehouses and OLAP Data Management Dennis Volemi D61/70384/2009 Judy Mwangoe D61/73260/2009 Jeremy Ndirangu D61/75216/2009.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
1 On-Line Analytic Processing Warehousing Data Cubes.
Business Intelligence Transparencies 1. ©Pearson Education 2009 Objectives What business intelligence (BI) represents. The technologies associated with.
Advanced Database Concepts
Copyright© 2014, Sira Yongchareon Department of Computing, Faculty of Creative Industries and Business Lecturer : Dr. Sira Yongchareon ISCG 6425 Data Warehousing.
Data Warehousing/Mining 1 Data Warehousing/Mining Introduction.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Data Warehouse and OLAP
Data Warehousing University of California, Berkeley
Paper Presentation Prepared by Dindar Öz
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data Warehouse.
Data Warehouse Design Enrico Franconi CS 636.
Instructor: Dan Hebert
Data Warehousing and OLAP
Introduction to Data Warehousing
Data Warehouse and OLAP
Introduction to Data Warehousing
Data Warehousing Concepts
Data Warehouse and OLAP Technology
Presentation transcript:

Joachim Hammer 1 Data Warehousing Overview, Terminology, and Research Issues Joachim Hammer

2 Heterogeneous Database Integration Integration System Collects and combines information Provides integrated view, uniform user interface Supports sharing World Wide Web Digital LibrariesScientific Databases Personal Databases

Joachim Hammer 3 The Traditional Research Approach Source... Integration System... Metadata Clients Wrapper Query-driven (lazy, on-demand)

Joachim Hammer 4 Disadvantages of Query-Driven Approach Delay in query processing –Slow or unavailable information sources –Complex filtering and integration Inefficient and potentially expensive for frequent queries Competes with local processing at sources Hasn’t caught on in industry

Joachim Hammer 5 The Warehousing ApproachDataWarehouse Clients Source... Extractor/ Monitor Integration System... Metadata Extractor/ Monitor Extractor/ Monitor Information integrated in advance Stored in wh for direct querying and analysis

Joachim Hammer 6 Advantages of Warehousing Approach High query performance –But not necessarily most current information Doesn’t interfere with local processing at sources –Complex queries at warehouse –OLTP at information sources Information copied at warehouse –Can modify, annotate, summarize, restructure, etc. –Can store historical information –Security, no auditing Has caught on in industry

Joachim Hammer 7 Not Either-Or Decision Query-driven approach still better for –Rapidly changing information –Rapidly changing information sources –Truly vast amounts of data from large numbers of sources –Clients with unpredictable needs

Joachim Hammer 8 Data Warehousing: Two Distinct Issues (1) How to get information into warehouse “Data warehousing” (2) What to do with data once it’s in warehouse “Warehouse DBMS” Terms coined by Jennifer Widom (WHIPS) Both rich research areas Industry has focused on (2)

Joachim Hammer 9 What is a Warehouse? Stored collection of diverse data –A solution to data integration problem –Single repository of information Subject-oriented –Organized by subject, not by application –Used for analysis, data mining, etc. Optimized differently from transaction-oriented db User interface aimed at executive

Joachim Hammer 10 What is a Warehouse (2)? Large volume of data (Gb, Tb) Non-volatile –Historical –Time attributes are important Updates infrequent May be append-only Examples –All transactions ever at WalMart –Complete client histories at insurance firm –Stockbroker financial information and portfolios

Joachim Hammer 11 Warehouse is a Specialized DB Standard DB Mostly updates Many small transactions Mb - Gb of data Current snapshot Index/hash on p.k. Raw data Thousands of users (e.g., clerical users) Warehouse Mostly reads Queries are long and complex Gb - Tb of data History Lots of scans Summarized, consol. data Hundreds of users (e.g., decision-makers, analysts)

Joachim Hammer 12 Warehousing and Industry Warehousing is big business –$2 billion in 1995 –$3.5 billion in early 1997 –Predicted: $8 billion in 1998 [Metagroup] WalMart has largest warehouse –4Tb – Mb per day

Joachim Hammer 13 Warehouse DBMS—Buzzwords Used primarily for decision support (DSS) –A.K.A. On-Line Analytical Processing (OLAP) –Complex queries, substantial aggregation –TPC-D benchmark May support multidimensional database (MDDB) –A.K.A. Data Cube –View of relational data: all possible groupings and aggregations

Joachim Hammer 14 Warehouse DBMS — Buzzwords (2) ROLAP vs. MOLAP Special purpose OLAP servers that directly implement multidimensional data and operations –Roll-up = aggregate on some dimension –Drill-down = deaggregate on some dimension ROLAP: Oracle, Sybase IQ, RedBrick MOLAP: Pilot, Essbase, Gentia

Joachim Hammer 15 Warehouse DBMS - Buzzwords (3) Clients: –Query and reporting tools –Analysis tools –Data mining: discovering patterns of various forms Poses many new research issues in: –Query processing and optimization –Database design –View management

Joachim Hammer 16 Data Warehousing: Basic Architecture DataWarehouse Clients Source... Extractor/ Monitor Integration System... Metadata Extractor/ Monitor Extractor/ Monitor Clients

Joachim Hammer 17 Data Warehousing: Current Practice Warehouse is standard or specialized DBMS Everything is relational Extraction and integration are done in batch, usually off-line, often “by hand” Integrator never queries sources –Everything is replicated at warehouse Generalizing introduce many research issues

Joachim Hammer 18 Research Issues in Data Warehousing Extraction Integration Warehousing specification Optimizations Miscellaneous

Joachim Hammer 19 Data Extraction Translation –Translate information to warehouse data model –Similar to wrapper in query-driven approach [Stanford Tsimmis project] Change detection –For on-line data propagation and incremental warehouse refresh –Different classes of information sources Cooperative Logged Queryable Snapshot

Joachim Hammer 20 Data Extraction (2) Efficient change detection algorithms for snapshot sources –Record-oriented –Hierarchical data Data scrubbing (or cleansing) –Discard or correct erroneous data –Insert default values –Eliminate duplicates and inconsistencies –Aggregate, summarize, sample Implementing extractors –Specification-based or toolkit approach

Joachim Hammer 21 Data Integration Warehouse data  materialized view –Initial loading –View maintenance But: differs from conventional materialized view maintenance –Warehouse views may be highly aggregated –Warehouse views may be over history of base data –Schema may evolve –Base data doesn’t participate in view maintenance Simply reports changes Loosely coupled Absence of locking, global transactions May not be queriable

Joachim Hammer 22 Warehouse Maintenance Anomalies Materialized view maintenance in loosely coupled, anon-transactional environment Simple example SalesComp. Integrator Data Warehouse Sale(item,clerk)Emp(clerk,age) Sold (item,clerk,age) Sold = Sale Emp

Joachim Hammer 23 Warehouse Maintenance Anomalies 1. Insert into Emp(Mary,25), notify integrator 2. Insert into Sale (Computer,Mary), notify integrator 3. (1)  integrator adds Sale (Mary,25) 4. (2)  integrator adds (Computer,Mary) Emp 5. View incorrect (duplicate tuple) SalesComp. Integrator Data Warehouse Sale(item,clerk)Emp(clerk,age) Sold (item,clerk,age)

Joachim Hammer 24 Warehouse Self-Maintainability Goal: No queries back to sources Research issues: –What views are self-maintainable –Store auxiliary views so original + auxiliary views are self-maintainable Examples: –Keep keys –Inserts into Sale, maintain auxiliary view: Emp -  clerk,age (Sold) Lots of issues...

Joachim Hammer 25 Warehouse Specification Ideal scenario: Info Source Info Source Integrator Data Warehouse Extractor Info Source... Metadata Warehousing Configuration Module Integration rules Change detection requirements View definitions

Joachim Hammer 26 Optimizations Update filtering at extractor Multiple view optimization –If warehouse contains several views –Exploit shared sub-views Others?

Joachim Hammer 27 Additional Research Issues Warehouse management –Schema design –Initial loading –Metadata management