Chapter 4 - Data Warehouses

Chapter 4 - Data Warehouses
4.1 Data Warehouse Introduction 4.2 Conceptual Structure & Architecture 4.3 Data Modelling for Data Warehouses 4.4 Functionality of a Data Warehouse 4.5 Views and Decision Support 4.6 Building a Data Warehouse 4.7 Implementation Techniques

4.1 Data Warehouse Introduction (1)
Traditional database usage: databases are used by organisations for maintaining data that documents their everyday operations transactions typically make small changes to this operational data example: adding a reservation logging sales of goods these systems are called online transaction processing (OLTP) applications DBMS have been optimised extensively to perform well in such applications Recently, there has been an increased emphasis on analysis and exploration of data CA306 Data Warehouses

Data Warehouse Introduction (2)
The key functionality of data warehouses is data analysis. They provide storage of data query functionality for queries and analyses beyond the capabilities of traditional transaction-oriented databases (OLTP) When these queries and analyses become recurrent and predictable, then the development of special read-only databases with efficient data access becomes sensible. These data warehouses are mainly used as decision support systems in organisations. CA306 Data Warehouses

Terminology and Definitions
Data warehouses provide access to data for complex analysis knowledge discovery decision making Types of applications: OLAP (on-line analytical processing) - extraction, processing, and presentation of complex data for analytic and decision-making purposes. DSS (decision support systems) - support for an organisation’s decision makers with higher-level data. data mining - knowledge discovery, search for unexpected knowledge. CA306 Data Warehouses

Characteristics of Data Warehouses
Data sources: Data warehouses usually store integrated data from multiple sources - multidatabases on the other hand provide uniform access to disjoint and heterogeneous databases. Time and historical data: Data warehouses support time-series and trend analysis - transactional databases usually do not offer specific support for historical data. Updates: Data warehouses are updated incrementally - in transactional databases single transactions are the unit of change. A data warehouse is a collection of decision support technologies, aimed at enabling the knowledge worker to make better and faster decisions. (Chaudhuri and Dayal, in ACM SIGMOD Record 26(1), 1997) CA306 Data Warehouses

Characteristics of Data Warehouses
Multi-dimensional conceptual view generic dimensionality unlimited dimensions and aggregation levels unrestricted cross-dimensional operations dynamic sparse matrix handling client-server architecture Multi-user support accessibility transparency intuitive data manipulation Consistent reporting performance Flexible reporting after E. Codd et.al. (1993): Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. White paper, 1993. see CA306 Data Warehouses

Data Analysis Inadequacies of SQL for data analysis:
queries with extensive use of AND and in particular OR do not perform well in some DBMS implementations of SQL extensive use of statistical functions is not well supported time series analysis (e.g. aggregating values over time) is poorly supported Systems/tools that support these specific forms of queries are categorised as OLAP-tools Data mining tools and other forms of exploratory data analysis are concerned with finding trends or patterns, rather than complex queries. CA306 Data Warehouses

Summary Data Warehousing: A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process. Subject-Oriented: The warehouse is organised around the major subjects of the enterprise (customer, products, sales) rather than the major application areas (invoicing, stock control). This is as a result of the need to store decision-support data rather than application-oriented data. Integrated: The integration of source data from different enterprise-wide application systems. As source data will reside in different formats, it must be represented in a canonical formal for users. CA306 Data Warehouses

Time-Variant: Data in the warehouse in only accurate and valid at some point in time, or over a specified time-interval. The time-invariance of the D.W. is due to the implicit or explicit association of time with all data, and the fact that data represents a series of snapshots. Non-Volatile: Data is not updated in real-time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database (not a replacement). The D.W. continually absorbs new data, integrating it with previous data. CA306 Data Warehouses

Sections Covered 4.1 Data Warehouse Introduction
4.2 Conceptual Structure & Architecture 4.3 Data Modelling for Data Warehouses 4.4 Functionality of a Data Warehouse 4.5 Views and Decision Support 4.6 Building a Data Warehouse 4.7 Implementation Techniques CA306 Data Warehouses

4.2 Conceptual Structure and Architecture of a Data Warehouse
The overall process of data warehousing. data acquisition: populating and updating the data warehouse analysis of warehouse data feedback of analysis results CA306 Data Warehouses

Data Warehouse Architecture
EXTERNAL DATA SOURCES Data Warehouse Architecture EXTRACT TRANSFORM LOAD REFRESH External data sources provide input for data warehouse The data warehouse stores data and metadata Data warehouses underlie data analyses such as OLAP data mining DATA WAREHOUSE Metadata Repository SUPPORTS OLAP DATA MINING CA306 Data Warehouses

OLTP v Data Warehousing
A DBMS built for OnLine Transaction Processing (OLTP) is generally regarded as unsuitable for data mining, as both have different requirements. OLTP are designed to maximise the transaction processing capacity, while data warehouses are designed to support ad hoc query processing. Organisations have a large number of OLTP systems which are detailed, current, and subject to change. These systems are optimised for a high volume number of transactions, that are predictable, repetitive and update intensive. An organisation will have a single data warehouse, with data that is historical, detailed, summarised to various levels, and rarely subject to change. It supports a relatively low number of transactions that are unpredictable in nature, unstructured, and heuristic. CA306 Data Warehouses

Comparison OLTP SYSTEMS Holds Current Data Stores Detailed Data
Data is dynamic Repetitive Processing High volume of transactions Predictable pattern of usage Transaction-driven Application-oriented Supports day-to-day decisions Serves a large number of users DATA WAREHOUSING SYSTEMS Holds historical data Stores detailed and summarised data Data is generally static Ad hoc, unstructured, heuristic proc. Medium to low volume of transactions Unpredictable pattern of usage Analysis Driver Subject-oriented Supports strategic decisions Serves low number of managerial users CA306 Data Warehouses

4.3 Data Modelling for Data Warehouses
The most typical OLAP-queries are complex ad-hoc queries. In SQL-terms these are queries with group-by and aggregation operators. A multidimensional data model is the most natural model for OLAP. Multidimensional modelling can enhance query performance. Principle: In the multidimensional data model, data is described as a collection of numeric measures. Each measure depends on a set of dimensions CA306 Data Warehouses

Data Warehouses - Data Modelling
Example: sales revenues in a corporate data warehouse is a measure that depends on several dimensions (e.g. product, region, fiscal period, etc) two-dimensional: a two-dimensional matrix based on product and region CA306 Data Warehouses

3D Modelling three-dimensional: a three-dimensional matrix (data cube) based on product, region, and fiscal period for a given product, region and period there is at most one sales value higher dimensions: hypercubes …difficult to present graphically Not all data lends itself to be represented in a multidimensional format. How did product P125 do in The 3rd quarter for region 2 ? CA306 Data Warehouses

Storage of Multidimensional Models
The storage of multidimensional models involves two types of tables: dimension tables and fact tables. A dimension table is a relation that contains tuples of attributes of a dimension. example: a product dimension table with tuples containing values for ProductNo, ProductName, etc. A fact table is where each tuple in the table contains a fact, which is collection of measured/observed variables. It represents a multi-dimensional array as a relation - relating one value for each dimension with the measured value for the input dimensions. It might also contain pointers to dimension tables. example: a SALES fact table with attributes regionID, productID, periodID, and sales. A dimension table can be regarded as metadata CA306 Data Warehouses

Dimension and Fact Tables
Dimension table: Fact table: Data cube: periodID productID In the Fact table, only products 11, 12 and 13 may have associated facts. All columns in the dimension should be unique or you may have a problem. For example, if weight=22 occurs twice, how do we know which product the fact in the weight table is for? Unless it is for all weight = 22. Then it’s ok. regionID (ID = 1 displayed) CA306 Data Warehouses

Dimension Hierarchies
For each dimension, the set of values can be organized in a hierarchy: PRODUCT PERIOD REGION year quarter country category week month state pname date city CA306 Data Warehouses

Multidimensional Schemas: Star
Star Schema: one fact table with a single table for each dimension. The fact table is the central entity; it represents the facts on which the analysis is based Example: a central sales-table with dimension tables product, region and period. Give example in class. Tell students to generate their own examples for possible exam questions. CA306 Data Warehouses

Multidimensional Schemas: Snowflake
Snowflake Schema: special form of a star schema, where dimension tables are organised into a hierarchy (normalisation). Example: a period attribute is structured into beginDate and endDate. As Star schema but both dimension and fact tables are properly normalised. CA306 Data Warehouses

Star Schema (Fact table) periodID date week month quarter year
holiday_flag (Fact table) productID periodID regionID sales SALES PRODUCTS REGION productID pname category price regionID city state country Fact table in BCNF; dimension tables not normalized. Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than good query performance. This kind of schema is very common in OLAP applications, and is called a star schema. Computing the join of all these relations is called a star join. Students: explain how dimension table is not normalised. CA306 Data Warehouses

Storage Strategies In OLAP applications, data can be represented in multidimensional arrays (similar to programming languages) These OLAP systems are called multidimensional OLAP (MOLAP) systems Other OLAP systems represent data in the form of fact and dimension tables, i.e. as relations. These OLAP systems are called relational OLAP (ROLAP) systems CA306 Data Warehouses

4.4 Functionality of Data Warehouses
Queries, Aggregation Pivoting Comparison with SQL Queries Hierarchical Views Cube Operator CA306 Data Warehouses

Functionality Typical functionality of data warehouses is to facilitate complex, data-intensive and frequent ad-hoc queries enhanced spreadsheet functionality efficient query processing structured queries ad-hoc queries data mining CA306 Data Warehouses

OLAP Queries Operations supported by OLAP systems are influenced by spreadsheet programs. Some examples on how multidimensional data can be queried: find the total sales (for each city, for each state) find the top five products ranked by total sales The first query can be expressed using SQL (query fact and dimension tables), the second can’t (only a complete ordered list can be produced). Aggregation is an important construct: measures on one or more dimensions can be aggregated. Example: total sales by city depends only on region, whereas the original sales measure depends on product, region and period. CA306 Data Warehouses

Pre-programmed Functionality
roll-up - data is summarised with increasing generalisation (higher level of abstraction) drill-down - increasing levels of data details are revealed pivoting - cross-tabulation (rotation) slice and dice - perform projection operation on the dimensions sorting - data is sorted by ordinal value selection - data is available by value or range derived attributes - attribute values are computed Users: knowledge workers decision makers Sorted by ordinal value means everything is given a set position in the list. CA306 Data Warehouses

Pivoting Data can be queried in any combination of dimensions. Tools exist for viewing data according to the user’s choice of dimensions. Given a fact table - such as the sales table - the user can pivot on specific dimensions Pivoting is also called rotation. (see next slide & compare with 4.18) Pivoting can be combined with aggregation on selected dimensions. Example: Pivoting on Region and Period yields a cross-tabulation WI CA Total 1995 1996 1997 Total CA306 Data Warehouses

CA306 Data Warehouses

Comparison with SQL Queries
The cross-tabulation obtained by pivoting can also be computed using a collection of SQL-queries: SELECT SUM(S.sales) FROM Sales S, Period P, Region R WHERE S.periodID=P.periodID AND S.regionID=R.regionID GROUP BY P.year, R.state SELECT SUM(S.sales) FROM Sales S, Period P WHERE S.periodID=P.periodID GROUP BY P.year SELECT SUM(S.sales) FROM Sales S, Region R WHERE S.regionID=R.regionID GROUP BY R.state CA306 Data Warehouses

Hierarchical Views - Roll-up
Roll-up display: elements of one dimension are grouped into classes/categories, e.g. creating product groups or grouping weeks into months. Only look a aggregation for all category 1 products, and then all category 2 products, and then all category 3 etc. Move from individual products to (the coarser grained of) product categories. CA306 Data Warehouses

Hierarchical Views - Drill-Down
Drill-down display: an element of one dimension is split up into several elements, e.g. a region is split into sub-regions. The drill-down operation offers a finer-grained view: dis-aggregating country sales by region, and subsequently by sub-region, and breaking products into styles. CA306 Data Warehouses

The CUBE Operator SELECT SUM(S.sales) FROM Sales S
Generalizing the previous example, if there are k dimensions, we have 2k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions. CUBE productID, regionID, periodID BY SUM Sales Equivalent to rolling up Sales on all eight subsets of the set {productID, regionID, periodID} each roll-up corresponds to an SQL query of the form: SELECT SUM(S.sales) FROM Sales S GROUP BY grouping-list CUBE offers abstraction. To calculate 8 subsets, think of binary values 000 to 111 CA306 Data Warehouses

Grouping List The eight SQL-queries only differ on the grouping list (GROUP BY) They can be arranged in form of a lattice (subset-based) (productID, periodID, regionID) (productID, periodID) (productID, regionID) (periodID, regionID) (productID) (periodID) (regionID) ( ) Or think in binary terms: {0,0,0}, {0,0,1}, {0,1,0}, {0,1,1}, {1,0,0}, {1,0,1} etc CA306 Data Warehouses

4.5 Views and Decision Support
OLAP queries are typically aggregate queries. Precomputation is essential for reduced interactive response times. The CUBE is in fact a collection of aggregate queries, and precomputation is especially important: what is best to precompute given a limited amount of space to store precomputed results? The idea is to choose a subset of aggregate queries for (view) materialisation (precomputation) Warehouses can be thought of as a collection of asynchronously replicated tables and periodically maintained materialised views. CA306 Data Warehouses

Query Modification (Evaluate On Demand)
View: sales of products by category and state and query: total sales for each category by state Query modification: replace occurrence of RegionalSales by its definition CREATE VIEW RegionalSales(category,sales,state) AS SELECT P.category, S.sales, R.state FROM Product P, Sales S, Region R WHERE P.productID=S.productID AND S.regionID=R.regionID SELECT RS.category, RS.state, SUM(RS.sales) FROM RegionalSales AS RS GROUP BY RS.category, RS.state Final part is simply a modification of the original view (query). SELECT RS.category, RS.state, SUM(RS.sales) FROM (SELECT P.category, S.sales, R.state FROM Product P, Sales S, Region R WHERE P.productID=S.productID AND S.regionID=R.regionID) AS RS GROUP BY RS.category, RS.state CA306 Data Warehouses

View Materialisation (Precomputation)
Computing on demand can cause unacceptable response times in case of complex view definitions. Suppose we precompute the view RegionalSales and store it with a clustered B+ tree index on search key [category,state,sales]. Then, the previous query can be answered by an index-only scan. The disadvantage is that these views have to be maintained. SELECT RS.state, SUM(RS.sales) FROM RegionalSales RS WHERE RS.category=“Laptop” GROUP BY RS.state SELECT RS.state, SUM(RS.sales) FROM RegionalSales RS WHERE RS. state=“Wisconsin” GROUP BY RS.category Index on precomputed view is useful: locate first index leaf entry for “Laptop”, then scan leaf level Index is less useful: must scan entire leaf level. CA306 Data Warehouses

Issues in View Materialisation
What views should we materialise, and what indexes should we build on the pre-computed results? The previous example illustrates this problem. A small set of views should be chosen that support the most important queries. Given a query and a set of materialised views, can we use the materialised views to answer the query? It needs to be made sure that the materialised views are consistent with underlying tables How frequently should we refresh materialised views to make them consistent with the underlying tables? (And how can we do this incrementally?) A view maintenance policy determines the refreshment approach (e.g. immediate vs. deferred) CA306 Data Warehouses

Interactive Queries: Beyond Materialisation
Top N Queries: If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest. The user wants to see answers quickly (compare Internet queries) Idea: Guess at a cost c such that the 10 cheapest all cost less than c, and that not too many more cost less - guess is possible if distribution of sales values is known. Then add the selection cost<c and evaluate the query. If the guess is right, we avoid computation for cars that cost more than c. If the guess is wrong, we need to reset the selection and recompute the original query. CA306 Data Warehouses

Top N Queries SELECT P.productID, P.productName, S.sales
FROM Sales S, Products P WHERE S.productID=P.productID AND S.regionID=1 AND S.periodID=3 ORDER BY S.sales DESC OPTIMIZE FOR 10 ROWS OPTIMIZE FOR construct is not in SQL:1999 Cut-off value c is chosen by optimizer. SELECT P.productID, P.productName, S.sales FROM Sales S, Products P WHERE S.productID=P.productID AND S.regionID=1 AND S.periodID=3 AND S.sales > c ORDER BY S.sales DESC For students: Notes demonstrate the D.W. = D.B. + tools + materialised views. Exam Q: Describe three operators or procedures/functions that transform a DB into a DW CA306 Data Warehouses

Interactive Queries: Online Aggregation
Online Aggregation: Consider an aggregate query, e.g. finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states? Can show the current “running average” for each state as the computation proceeds. Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Wisconsin is 2000±102 with 95% probability. Non-blocking algorithms should be used for implementation - an algorithm is said to block if it does not produce output tuples until it has consumed all input tuples. CA306 Data Warehouses

4.6 Building a Data Warehouse
Constructing a data warehouse analysis of requirements design of data model acquisition of data Design criteria: an appropriate schema should reflect the anticipated usage - might change from organisation to organisation the design of a data warehouse should specifically support ad-hoc, complex querying (OLAP) of the fact and dimension tables Not all possible queries and analyses can be anticipated. CA306 Data Warehouses

Data Acquisition Acquisition of data:
Sources: data is usually extracted from multiple, heterogeneous sources - databases and other data sources Consistency: data must be formatted for consistency - reconciliation of names, meanings and domains Validity: data must be cleaned to ensure validity - identification of erroneous and incomplete data Data model: data must be fitted (transformed) into the data model of the data warehouse - conversion from relational/object-oriented/legacy data model into multidimensional data model Loading: data must be loaded into the warehouse - the huge volume of data can cause problems; a refresh policy is needed CA306 Data Warehouses

Requirements How up-to-date must the warehouse data be?
Can the warehouse go offline for loading? For how long? What is the loading time (including cleaning, formatting, etc.)? What are the data interdependencies? What is the storage availability? What are the distribution requirements (consider replication and partitioning)? CA306 Data Warehouses

Building Process Data storage in a warehouse involves the following processes: storing the data according to the data model of the warehouse creating and maintaining the required data storage structures creating and maintaining appropriate access paths providing for time-variant data as new data are added supporting the updating of warehouse data refreshing the data periodically purging the data (perhaps onto archival media) CA306 Data Warehouses

Specific Architectures (1)
Metadata repository includes both technical and business metadata technical: details of storage structures, acquisition processing, etc business: business rules, organisational details that support the warehouse due to its size metadata might be stored in a separate database information that is stored in addition to table descriptions includes the sources of each data warehouse table the last time the warehouse data was refreshed CA306 Data Warehouses

Specific Architectures (2)
Distributed architectures: distributed warehouse: problems: replication, partitioning, communications, consistency benefits: load balancing, scalability of performance, higher availability federated warehouse: decentralised confederation of autonomous data warehouses each with its own metadata repository Note that the problem of refreshing data warehouse tables and asynchronously maintaining replicas of tables in a distributed DBMS is similar. CA306 Data Warehouses

Operational Difficulties
Operational issues: construction, administration and quality control Construction: project management (design, construction, implementation) is important Administration: even though a read-only system, a data warehouse is not static and evolves over time Quality control: quality and consistency must remain after acquisition and refreshing CA306 Data Warehouses

4.7 Implementation Techniques
Response time in data warehouses is essential Maintenance costs of indexes can be neglected/forgotten Indexing: B+ trees from the basis of supporting the execution of complex queries Other forms of support (e.g. in systems such as Oracle 8) include bitmap indexes join indexes CA306 Data Warehouses

Indexing (1) Bitmap indexing CAR-table Name Size Golf midsize
constructs a bit vector for each value in a domain (column) being indexed 1-bit placed in the j-th position in the vector if the j-th row contains the value example: car size classifications (compact, midsize, etc) as vectors CAR-table Name Size Golf midsize Polo compact Astra midsize Corsa compact compact-Vector midsize-Vector 0 1 1 0 0 1 CA306 Data Warehouses

Indexing (2) Join indexing
the efficiency of executing join queries can be improved by index structures a join index is a traditional index relating primary and foreign key values - a fact table is likely to be joined with one or more dimension tables a value of a dimension is related to a row in the fact table example: a sales table with dimensions city and period support queries involving a city: a join index on city maintains the list of tuple ids t1, .. , tn of the fact table that contain the city c => index contains <c, t1>, ... , <c, tn> support queries involving a city c and a period p => the index contains <c,p, t1>, ... , <c,p, tn> CA306 Data Warehouses

Chapter 4 - Data Warehouses

Similar presentations

Presentation on theme: "Chapter 4 - Data Warehouses"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chapter 4 - Data Warehouses

Similar presentations

Presentation on theme: "Chapter 4 - Data Warehouses"— Presentation transcript:

Similar presentations

About project

Feedback