Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.

Outline
- What is a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- Further development of data cube technology
- From data warehousing to data mining

Efficient Data Cube Computation
- Data cube can be viewed as a lattice of cuboids
  - The bottom-most cuboid is the base cuboid
  - The top-most cuboid (apex) contains only one cell
  - How many cuboids in an n-dimensional cube with L levels?
- Materialization of data cube
  - Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
  - Selection of which cuboids to materialize
    - Based on size, sharing, access frequency, etc.
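
The slide leaves the cuboid-count question open. The usual answer in the data cube literature is T = (L_1 + 1) x (L_2 + 1) x ... x (L_n + 1), where L_i is the number of hierarchy levels of dimension i and the extra 1 stands for the virtual top level "all". A minimal sketch of that count follows; the function name and the sample level counts are illustrative assumptions, not part of the slides.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """Total number of cuboids in the lattice.

    levels_per_dimension[i] is the number of hierarchy levels of dimension i,
    not counting the virtual top level "all" (hence the +1 per dimension).
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Three dimensions without hierarchies (one level each): 2^3 cuboids.
no_hierarchies = count_cuboids([1, 1, 1])

# Ten dimensions with four levels each: 5^10 cuboids.
ten_dims = count_cuboids([4] * 10)
```

The rapid growth of this product with the number of dimensions is one reason partial materialization is often considered instead of full materialization.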

Cube Operation
- Cube definition and computation in DMQL
    define cube sales[item, city, year]: sum(sales_in_dollars)
    compute cube sales
- Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al.’96)
    SELECT item, city, year, SUM (amount)
    FROM SALES
    CUBE BY item, city, year
- Need to compute the following group-bys:
    (item, city, year),
    (item, city), (item, year), (city, year),
    (item), (city), (year),
    ()
[Figure: lattice of cuboids from () at the top, through (item), (city), (year) and (city, item), (city, year), (item, year), down to (city, item, year)]
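
As a rough illustration of what CUBE BY has to produce, the sketch below enumerates all 2^3 group-bys over item, city, and year and sums the amount for each. The sample rows and the function name cube_by are hypothetical, and the code recomputes every group-by from the base rows, which is the naive approach rather than an efficient cubing algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (item, city, year, amount).
SALES = [
    ("tv", "Vancouver", 2003, 400),
    ("tv", "Toronto", 2003, 300),
    ("phone", "Vancouver", 2004, 150),
    ("tv", "Vancouver", 2004, 250),
]
DIMS = ("item", "city", "year")


def cube_by(rows, dims):
    """Return {group-by dimensions: {group key: SUM(amount)}} for all 2^n group-bys."""
    result = {}
    for k in range(len(dims), -1, -1):
        for idx in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                totals[key] += row[-1]  # amount is the last field
            result[tuple(dims[i] for i in idx)] = dict(totals)
    return result


cube = cube_by(SALES, DIMS)
```

The empty tuple () is the apex cuboid: a single cell holding the grand total.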

Cube Computation: ROLAP-Based Method
- Efficient cube computation methods
  - ROLAP-based cubing algorithms (Agarwal et al’96)
  - Array-based cubing algorithm (Zhao et al’97)
  - Bottom-up computation method (Beyer & Ramakrishnan’99)
- ROLAP-based cubing algorithms
  - Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
  - Grouping is performed on some sub-aggregates as a “partial grouping step”
  - Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Cube Computation: ROLAP-Based Method (2)
- This is not in the textbook but in a research paper
- Hash/sort-based methods (Agarwal et al., VLDB’96)
  - Smallest-parent: computing a cuboid from the smallest previously computed cuboid
  - Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
  - Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
  - Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
  - Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
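
The smallest-parent idea can be sketched as follows. This is only an illustration under stated assumptions: cuboids are identified by tuples of dimension names, their sizes are given as cell counts, and the measure is a distributive aggregate such as SUM so that a cuboid can be derived from any parent. The names smallest_parent and roll_up, and all sample numbers, are hypothetical.

```python
def smallest_parent(target, computed_sizes):
    """Pick the smallest already-computed cuboid that contains all dimensions of target."""
    candidates = [dims for dims in computed_sizes if set(target) < set(dims)]
    if not candidates:
        raise ValueError(f"no computed parent for {target}")
    return min(candidates, key=computed_sizes.get)


def roll_up(parent_dims, parent_cells, target):
    """Aggregate a parent cuboid's cells down to the target dimensions (SUM only)."""
    idx = [parent_dims.index(d) for d in target]
    out = {}
    for key, value in parent_cells.items():
        group = tuple(key[i] for i in idx)
        out[group] = out.get(group, 0) + value
    return out


# Hypothetical sizes (number of cells) of cuboids computed so far.
computed_sizes = {
    ("item", "city", "year"): 5000,
    ("item", "city"): 1000,
    ("item", "year"): 40,
}
parent = smallest_parent(("item",), computed_sizes)

# Hypothetical contents of the (item, year) cuboid.
item_year = {("tv", 2003): 700, ("tv", 2004): 250, ("phone", 2004): 150}
by_item = roll_up(("item", "year"), item_year, ("item",))
```

Here (item) is derived from (item, year) with 40 cells rather than from the 5000-cell base cuboid, which is the saving the heuristic aims for.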

Multi-way Array Aggregation for Cube Computation
- Partition arrays into chunks (a small sub-cube which fits in memory)
- Compressed sparse array addressing: (chunk_id, offset)
- Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost
- What is the best traversing order to do multi-way aggregation?
[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into chunks]

Multi-Way Array Aggregation for Cube Computation (2)
- Method: the planes should be sorted and computed according to their size in ascending order
  - Study an example
  - Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
- Limitation of the method: it performs well only for a small number of dimensions
  - If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored
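
A simplified sketch of the multi-way idea: each chunk of a 3-D array is read exactly once, and every cell in it contributes to the AB, AC, and BC planes in the same pass. The function name, the 4x4x4 array, and the cell values are assumptions for illustration. The sketch keeps all three planes in memory as dictionaries, so it shows the simultaneous aggregation and the chunk visiting order but does not model the memory savings discussed on the slide.

```python
from itertools import product


def multiway_aggregate(cells, shape, chunk):
    """Aggregate a 3-D array into its AB, AC, and BC planes in one pass over the chunks.

    cells: {(a, b, c): value}; shape and chunk: sizes per dimension (A, B, C).
    """
    ab, ac, bc = {}, {}, {}
    chunks_read = 0
    a_starts = range(0, shape[0], chunk[0])
    b_starts = range(0, shape[1], chunk[1])
    c_starts = range(0, shape[2], chunk[2])
    # product() varies its last argument fastest, so chunks are visited with
    # A changing fastest, then B, then C.
    for c0, b0, a0 in product(c_starts, b_starts, a_starts):
        chunks_read += 1
        for a, b, c in product(
            range(a0, min(a0 + chunk[0], shape[0])),
            range(b0, min(b0 + chunk[1], shape[1])),
            range(c0, min(c0 + chunk[2], shape[2])),
        ):
            value = cells.get((a, b, c))
            if value is None:  # sparse array: empty cell
                continue
            ab[(a, b)] = ab.get((a, b), 0) + value
            ac[(a, c)] = ac.get((a, c), 0) + value
            bc[(b, c)] = bc.get((b, c), 0) + value
    return ab, ac, bc, chunks_read


SHAPE = (4, 4, 4)
CELLS = {(a, b, c): a + 10 * b + 100 * c for a, b, c in product(range(4), repeat=3)}
ab, ac, bc, chunks_read = multiway_aggregate(CELLS, SHAPE, chunk=(2, 2, 2))
```

With A varying fastest, a BC cell is complete as soon as one run of chunks along A has been read, while an AB cell is complete only after the whole array has been scanned. That asymmetry is why the visiting order matters for how much of each plane must stay in memory.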

Indexing OLAP Data: Bitmap Index
- Index on a particular column
- Each value in the column has a bit vector: bit-op is fast
- The length of the bit vector: # of records in the base table
- The i-th bit is set if the i-th row of the base table has the value for the indexed column
- Not suitable for high cardinality domains
[Tables: base table, index on Region, index on Type]
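
A minimal sketch of a bitmap index, using Python integers as bit vectors (bit i corresponds to row i of the base table). The base table contents and the helper names are hypothetical; the Region and Type columns follow the table titles on the slide.

```python
# Hypothetical base table; the row position is the record number.
BASE_TABLE = [
    {"region": "Asia", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
    {"region": "Asia", "type": "Dealer"},
    {"region": "America", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
]


def build_bitmap_index(rows, column):
    """Map each distinct value of column to an int whose i-th bit is set if row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(bitmap):
    """Decode a bit vector back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, "region")
type_index = build_bitmap_index(BASE_TABLE, "type")

# Region = 'Asia' AND Type = 'Dealer' becomes a single bitwise AND.
asia_dealers = matching_rows(region_index["Asia"] & type_index["Dealer"])

# Region = 'Asia' OR Region = 'Europe' becomes a bitwise OR.
asia_or_europe = matching_rows(region_index["Asia"] | region_index["Europe"])
```

The index holds one bit vector per distinct value, which is why a column with many distinct values makes the index large.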

Indexing OLAP Data: Join Indices
- Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
- Traditional indices map the values to a list of record ids
  - A join index materializes the relational join in a JI file and speeds up the relational join, a rather costly operation
- In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
  - E.g. fact table: Sales and two dimensions city and product
    - A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
  - Join indices can span multiple dimensions
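
The Sales example can be sketched with a dictionary from dimension values to lists of fact-table R-IDs. The fact rows, the R-ID values, and the function name build_join_index are made up for illustration; a real join index would be stored on disk rather than built in memory.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (R-ID, city, product, amount).
SALES = [
    ("T57", "Toronto", "tv", 300),
    ("T238", "Vancouver", "tv", 400),
    ("T459", "Vancouver", "phone", 150),
    ("T884", "Toronto", "tv", 120),
]


def build_join_index(fact_rows, *positions):
    """Map each combination of dimension values to the R-IDs of matching fact rows."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[p] for p in positions)
        index[key].append(row[0])
    return dict(index)


# Join index on city alone.
by_city = build_join_index(SALES, 1)

# Join index spanning two dimensions: city and product.
by_city_product = build_join_index(SALES, 1, 2)
```

Looking up ("Vancouver",) in by_city returns the fact rows for that city directly, without scanning or joining the tables at query time.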

Efficient Processing of OLAP Queries
- Determine which operations should be performed on the available cuboids:
  - transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
- Determine to which materialized cuboid(s) the relevant operations should be applied
- Explore indexing structures and compressed vs. dense array structures in MOLAP
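
One way to read "dice = selection + projection" is sketched below: the selection keeps the cells whose dimension values fall inside the requested sets, and the projection keeps only the requested columns. The rows, the function name dice, and its parameters are hypothetical and only illustrate the decomposition, not how any particular OLAP engine implements it.

```python
# Hypothetical cells of a materialized (item, city, year) cuboid.
ROWS = [
    {"item": "tv", "city": "Vancouver", "year": 2003, "amount": 400},
    {"item": "tv", "city": "Toronto", "year": 2003, "amount": 300},
    {"item": "phone", "city": "Vancouver", "year": 2004, "amount": 150},
    {"item": "tv", "city": "Montreal", "year": 2004, "amount": 250},
]


def dice(rows, selection, keep):
    """Dice as selection followed by projection.

    selection: {dimension: set of allowed values}; keep: columns to retain.
    """
    # Selection: keep rows that satisfy every per-dimension condition.
    selected = [
        row for row in rows
        if all(row[dim] in allowed for dim, allowed in selection.items())
    ]
    # Projection: keep only the requested columns.
    return [{col: row[col] for col in keep} for row in selected]


sub_cube = dice(
    ROWS,
    selection={"city": {"Vancouver", "Toronto"}, "year": {2004}},
    keep=("item", "city", "amount"),
)
```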

Metadata Repository
- Meta data is the data defining warehouse objects. It has the following kinds:
  - Description of the structure of the warehouse
    - schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
  - Operational meta-data
    - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  - The algorithms used for summarization
  - The mapping from operational environment to the data warehouse
  - Data related to system performance
    - warehouse schema, view and derived data definitions
  - Business data
    - business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities
- Data extraction: get data from multiple, heterogeneous, and external sources
- Data cleaning: detect errors in the data and rectify them when possible
- Data transformation: convert data from legacy or host format to warehouse format
- Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
- Refresh: propagate the updates from the data sources to the warehouse
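
The five back-end steps can be pictured as a small pipeline. Everything in the sketch is an assumption made for illustration: the source names, the record layout, the cleaning rule (drop records whose amount is not numeric), and the "warehouse" being a plain dictionary of per-city totals.

```python
def extract(sources):
    """Extraction: gather records from several sources, tagging each with its origin."""
    return [dict(rec, source=name) for name, recs in sources.items() for rec in recs]


def clean(records):
    """Cleaning: drop records whose amount cannot be parsed as a number."""
    cleaned = []
    for rec in records:
        try:
            amount = float(str(rec.get("amount", "")).strip())
        except ValueError:
            continue
        cleaned.append({**rec, "amount": amount})
    return cleaned


def transform(records):
    """Transformation: convert to the warehouse format (normalized city, numeric amount)."""
    return [{"city": r["city"].strip().title(), "amount": r["amount"]} for r in records]


def load(warehouse, records):
    """Load: sort and summarize the records into per-city totals."""
    for rec in sorted(records, key=lambda r: r["city"]):
        warehouse[rec["city"]] = warehouse.get(rec["city"], 0.0) + rec["amount"]
    return warehouse


def refresh(warehouse, new_sources):
    """Refresh: propagate new source records through the same steps."""
    return load(warehouse, transform(clean(extract(new_sources))))


sources = {
    "pos": [{"city": " toronto", "amount": "10.5"}, {"city": "Vancouver", "amount": "n/a"}],
    "web": [{"city": "TORONTO", "amount": 4}],
}
warehouse = refresh({}, sources)
warehouse = refresh(warehouse, {"web": [{"city": "vancouver", "amount": "2"}]})
```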