Presentation on theme: "1 Data Warehousing Denis Manley Enterprise Systems FT228/3."— Presentation transcript:
1 Data Warehousing Denis Manley Enterprise Systems FT228/3
2 Data Warehousing A data warehouse is a repository for all an organisations data. It is designed to present data in very user-friendly formats to end-users. Why is Data Warehousing necessary? –think of a typical company that uses computers for order processing, invoicing, etc. –Data is gathered for every transaction that occurs. Different departments probably use different systems, each designed for a specific task.
3 Data Warehousing (contd) –There will be lots of different database Tables, in different formats, on different machines. There will be tie-ups between data items, but those tie-ups may not be obvious outside the applications that manipulate the tables. –Also, the amount of data is constantly increasing. Some estimates state that the size of a typical data processing system doubles every two years. –Both standard, and ad-hoc, reports must be catered for, and they must run in a reasonable time-frame.
4 Data Warehousing (contd) Some solutions to the issues listed can be implemented in the applications that manage specific data. –E.G. if you want to ask How much does this customer owe? then the original package is probably the one to use. –But if you want to ask Was this ad campaign more successful than that one?, you require data from more disparate sources, and one application may not provide all of it.
5 Data Warehousing (contd) The new alternative is a Data Warehouse: –a D.W. is a way of organising data from a wide range of formats, possibly from different locations, so its all available in a single location. –Once this stage is complete, the collection of data is usually frequently replicated around multiple locations. –This means users have a local copy of the data they need to inspect. This improves query run-times, and reduces communications overheads.
6 Data Warehousing (contd) –The data is not only collected and joined, however. What the data means is described in a way that the end-users can easily understand, e.g. using the name Customer Account Number instead of order>cacno for a field. –The data must be quickly accessible, so that users can find data quickly and easily.
7 Data Warehousing process Operational data in legacy systems e.g. OLTP apps. Data fusion: Assembles diverse data Data Cleansing: fixing data Meta Data: shows transformations of data,origins of data, etc. Data Migration: load data and meta-data into warehouse periodically Data Warehouse Decision-support analyst queries
8 Data Warehousing Architecture A data warewhouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores. It comprises a number of components illustrated in figure 1
9 Data Warehousing Architecture Figure 1 : A datawarehouse architecture
10 Data Warehousing Architecture Data Warehouse database –This is a cornerstone of the data warehouse (item 2 in figure 1) –Almost always implemented on a relational database management system –However very large databases, ad hoc processing and the need for flexible user views e.g. aggregates and drill downs are calling for different technological approaches: Parallel relational database designs requiring the use of symmetric multiprocessors, Massively parallel processors and clusters
11 Data Warehousing Architecture Speeding up traditional RDBMS The use of Multidimensional databases (MDDBs) which are tightly coupled to on-line analytical processing
12 Data Warehousing Architecture Sourcing, acquisation, cleanup and transformation of data –Implementing data warehouses involves extracting data from operational systems including legacy systems and putting it into a suitable format. –The various tools are illustrated as item 1 in figure 1 –These tools perform all the conversions, summarisations, key changes, structural changes, and condensations needed to transform disparate data into information can be used by decision support tools –
13 Data Warehousing Architecture –It produces programs and control statements required to move data into the data warehouse form multiple operational systems –Maintains the metadata –Remove unwanted data –Converts to common data names and definitions –Calculates summaries –Establish defaults for missing data –Keep track of source data definition changes
14 Data Warehousing Architecture The tools have to deal with: –Database hetergeneity: DBMS can be very different in data models, data access language etc. –Data hetergeneity: the difference in the way data is defined and used e.g. synonyms and different attributes for the same entity etc.
15 Data Warehousing Architecture Metadata (item 3 figure1): data about data that describes the datawarehouse –Technical metadata: contains information for warehouse designers and administrators Information about data sources Transformation descriptions Rules used to perform data clean up Access authorisation, information delivery history, data acquisation history, data access etc –Business metadata: information that gives users an understanding of the information stored in the data warehouse. Queries, reports images Data warehouse operational information e.g. data history and ownership
16 Meta Data Versioning In the operational environment, there tends to be a single instance of data and meta data at any one moment in time. In the data warehouse environment, there are multiple definitions of data and meta data over an historically long period of time. Therefore, versioning of data warehouse meta data is crucial to the success of the data warehouse vis-a- vis the end users ability to access and understand the data in the data warehouse.
17 Guidelines for Metadata Management Develop an Information Directory that integrates technical and business metadata. Keep the metadata current and accurate! Maintain the time variant history of the metadata! Provide meaningful descriptions and definitions (use business definitions, not technical definitions, where possible). The end users must be educated about what metadata is, how to access it, how to use it, etc.
18 Meta Data Answers Questions for Users of the Data Warehouse How do I find the data I need? What is the original source of the data? How was this summarization created? What queries are available to access the data? ?? n How have business definitions and terms changed over time? n How do product lines vary across organizations? n What business assumptions have been made?
19 The Role of Meta Data in the Data Warehouse Architecture Know what data you have and You can trust it! Meta Data enables data to become information, because with it you
20 Data marts (item 4 figure 1) A data mart is a data store that is subsidary to a data warehouse of intergrated data. The data mart is directed at a partition of data (subject area) that is created for the use of a dedicated group of users and is sometimes termed a subject warehouse The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or more often placed on a separate physical store. Data marts can be dependent data marts when the data is sourced from the data warehouse. Independent data marts represent fragmented solutions to a range of business problems in the enterprise, however, such a concept should not be deployed as it doesnt have the data intergration concept thats associated with data warehouses.
21 Independent Data Marts Systems of Record Extract, Transform, Clean, Integrate, Summarize, etc.... Data Mart
22 Independent Data Marts Three Times! Extract, Transform, Clean, Integrate, Summarize, etc....
23 Independent Data Marts $ Significant and expensive duplication of effort and data.
24 Independent Data Marts $ Maintenance of proliferating unarchitected marts expensive and cumbersome. $
25 Unarchitected Data Marts ? There may be metadata for some marts, but what about consistency & history? md
26 md Contrast Architected Data Warehouse Data Warehouse Deptl Atomic Systems of Record Data Access Dependent (Architected) Departmental Marts with the Appropriate Subset of Metadata Information Directory md
27 Independent Data Marts vs. The Real Thing sArchitected to meet organizational as well as departmental requirements sData and results consistent sRedundancy is managed sDetailed history available for drill-down sMetadata is consistent! Unarchitected Data Marts *Easy to do, but... ?Are the extracts, transformations, integrations & loads consistent? ?Is the redundancy managed? ?What is the impact on the sources? Data Warehouse Atomic Deptl md
28 Independent Data marts However, such marts are not necessarly all bad. Often a valid solution to a pressing business problem: –Extremely urgent user requirements –The absence of a budget for a full data warehouse –The decentralisation of business units
29 Data Warehousing Architecture Access Tools (item 5 figure 1) –The principal purpose of the data warehouse is to provide information for strategic decision making. –The main tools used to achieve this objective are: Data query and reporting tools Executive information system tools On-line analytical processing tools Data mining tools
30 Data Warehousing Architecture Query and reporting tools –Reporting tools Production reporting tools e.g. generate operational reports Report writers: inexpensive desktop tools –Managed Query tools Shield users from the complexities of SQL and database structures by inserting a metalayer between the users and the database
31 A Few Definitions OLAP –On-Line Analytical Processing –A set of functionality that attempts to facilitate multidimensional analysis Multidimensional Analysis –The ability to manipulate information by a variety of relevant categories or dimensions to facilitate analysis and an understanding of that data –Has also been called as drill-down, drill- across and slice and dice
32 A Few Definitions Hypercube –A means of visually representing multidimensional data Star Schema –A means of aggregating data based on a set of known dimensions, attempting to store data multidimensionally in a two dimensional RDBMS Snowflake Schema –An extension of the star schema by means of applying additional dimensions to the dimensions of a star schema in a relational environment
33 A Few Definitions Multidimensional Database –Also known as MDDB or MDDBS –A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems OLAP Tools –A set of software products that attempt to facilitate multidimensional analysis –Can incorporate data acquisition, data access, data manipulation, or any combination thereof
34 A Few Definitions ROLAP –Relational OLAP –Using an RDBMS to implement an OLAP environment –Typically involves a star schema to provide the multidimensional capabilities –OLAP tool manipulates RDBMS star schema data
35 A Few Definitions MOLAP (1) –Multidimensional OLAP –Using an MDDBS to store and access data –MDDBS directly manages data multidimensionally –Usually requires proprietary (non-SQL) access tools MOLAP (2) –OLAP tool facilitates multidimensional capabilities without the need for a star schema –Often utilizes a 3-Tier environment, where middle tier server preprocesses data from an RDBMS –Some OLAP tools access an RDBMS directly and build cubes as a fat client
36 How Can OLAP Be Accomplished? Use the Data Warehouse as the architected foundation for the organizations informational processing requirements Use the appropriate design techniques to ensure that the data required is at the appropriate degree of granularity at the atomic level, and the appropriate degree of summarization at the departmental level Use the appropriate tools to either access relational data in a multidimensional manner, or manage multidimensional data
37 Data mining A CSF for any business is its ability to effectively use information. This strategic use can occur from discovering hidden, undected and frequently valuable information about customers suppliers retailers etc. This information can be used to formulate effective business, marketing and sales strategies. data miningA relatively new technology that can be used to achieve this strategic advantage is data mining
38 Data visualisation Data visualisation is a method of presenting the output of the previously mentioned methods in such a way that the problem or solution is clearly visible to domain experts and even casual observers. It goes way beyond simple bar and pie charts It is a collection of complex techniques that focus on determining the best way to display complex patterns on a 2-D computer monitor. Such techniques involve experimenting with various colours, shapes 3-d images, sound and vittual reality to help users really see and feel the problem and solution.
39 DW administration and management (Item 6 figure 1) Managing data warehouses involves: –Security and priority management –Monitoring updates from multiple sources –Data quality checks –Managing and updataing metadata –Replicating, subsetting and distributing data –Backup and recovery –Data warehouse storage management e.g. capacity planning, hierarchical storage management, purging of aged data.
40 Information delivery system (Item 7 figure 1) The IDS is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm IDS may be based on time of day or on completion of an external event. IDS can be achieved by Client/Server architecture and now by Interner/intranet and World Wide Web.
41 Top 10 Donts of Data Warehousing
Pre-selecting Your Technical Environment This is a very common trap in which many organizations find themselves. It is traditional to select the hardware, software and other physical, technical components of a system as one of the earliest activities. However, a data warehouse is an unstable environment from a sizing perspective. How do you know the hardware/RDBMS/end user tool is appropriate for your data warehouse before conducting even the first round of analysis? If at all possible, wait to select your technical environment until after you have analyzed the business requirements for information, data, and potential systems of record.
43 9. Allowing Access Tool to Determine Data Architecture This is an extension of #10, but is important enough to list by itself. If you select an end user tool before developing your data architecture, it is very likely that that architecture will suffer at the hand of design requirements delivered by the tool. If you have to sacrifice design requirements in order to meet functional requirements of a tool, it is probably time to put that tool aside and select another one.
44 8. Unarchitected Data Marts OK. Data marts are good; they are an essential part of the data warehouse architecture. But to build only a data mart and to ignore the rest of the data warehouse (specifically the atomic level data and centralized meta data) will lead you down a path that will be more expensive and deliver less quality of data than the alternative. The alternative is to architect and build the data warehouse incrementally, iteratively. Include data marts as departmental instances of the architecture, and populate them from the atomic level data. This will ensure accuracy across the architecture, and reduce costs by eliminating unnecessary population of stand-alone data marts.
45 7. Boiling the Ocean It is more efficient to implement the data warehouse in small, achievable and palatable chunks than to try to implement it all at once. When I say boil the ocean, I mean trying to do too many things for too many people all at the same time. There is an old adage: You can have everything; where would you put it all? The same holds true for a data warehouse. If you try to design, develop and implement a data warehouse that is all-encompassing as your first iteration, how will the users be able to use all that you delivered? And in the mean time, while youve been trying to meet all of their needs, you have failed to meet any needs. And users wont forget that for a long time.
46 6. If you build it they will come If you design, develop and implement an operational system, such as an order processing system, that new system is typically going to replace an existing system. In other words, the old system goes away and the users have no choice but to use the new one. Not so with the d/w. If you build it… implies an analysis that includes only bottom-up activities. It is crucial to the success of a data warehouse that a top-down analysis of user requirements for information be conducted. After that, users must be tutored, mentored and otherwise have their hands held as part of the implementation of the data warehouse. Existence does not guarantee utilization and, therefore, value.
47 5. Putting ROI before RFI (Requirements for Information) It is very difficult to quantify the intangible benefits that a data warehouse can provide to an organization. How can you put a price on increased customer loyalty. Somewhere, sometime, someone has probably made this calculation. In most cases, however, the determination of how beneficial the data warehouse will be is based on criteria that was developed for operational systems. Just as you cannot use operational data to meet your strategic informational requirements, it is difficult to calculate the return on investment (ROI) of a data warehouse. In terms of benefits to the organization, it is more appropriate to concentrate on how well the data warehouse addresses the target users requirements for information.
48 4. No Committed User Involvement Write this down: The success of any data warehouse is directly proportional to the amount of end user participation! A data warehouse cannot be successful without active participate on the part of the target users. Period. If you do not have user participation, you will find yourself in a situation where you will build it and hope that they will come. If there is no serious user participation in a data warehouse project, you have to seriously question whether or not the organization truly needs a data warehouse.
49 3. No Dedicated DBA In many situations the lack of a dedicated database administrator (DBA) has prevented a data warehouse project to be complete 1) on time, or 2) successfully. Borrowing a DBA from the operational pool will only result in questions about the nature of the data warehouse data models and database design. Its too flat, not normalized properly, too much redundancy, and other criticisms are well suited for an operational systems database design, but not a data warehouse. Considering that data is the first word in data warehouse, be sure you have a dedicated database administration resource committed to this important project.
50 2. No Meta Data Meta data is like documentation and training: Everyone knows it is necessary, but it usually gets dropped somewhere along the route to implementation. For the data warehouse, meta data is more important than just your typical documentation. Remember, in order to turn data into information you have to have the data, know that you have it, be able to access it, and trust it. Meta data is the means by which the users will be able to understand and trust the data. A time-variant record of where data came from, what happened to it along the way, where it is in the data warehouse, and what vehicles exist to access it will spell the difference between success and frustration.
51 1. Analysis Paralysis The inability to proceed past a sticking question. Wanting to boil the ocean and model/design everything before proceeding with development. Having to resolve political issues surrounding a standard or common definition. All of these things (and more!) will result in analysis paralysis. The 80/20 rule is very applicable to the development of a data warehouse. Execute 20% effort to get 80% of the total outcome, then move on to the next set of challenges and opportunities. Many data warehouse failures started when the development team stopped. Get your hands around an idea, understand what the users requirements for information are, and build something that produces something that can be evaluated. Dont just stand there…do something!
52 Parallel Data Management A topic thats closely linked to Data Warehousing is that of Parallel Data Management. The argument goes: –if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (Parallel Processing). –Oracle uses this strategy in its warehousing products. There are two types of parallel processing - Symmetric Multiprocessing (SMP), and Massively Parallel Processing (MPP)
53 Parallel Data Management (contd) SMP - means the O.S. runs and schedules tasks on more than one processor without distinction. –in other words, all processors are treated equally in an effort to get the list of jobs done. MPP - more varied in its design, but essentially consists of multiple processors, each running their own program. –the problem with MPP is to harness all these processors to solve a single problem.
54 Parallel Data Management (contd) Regardless of the architecture used, there are still alternatives regarding the use of the parallel processing capabilities. –In normal Transaction processing, each transaction runs on a separate processor, because transactions are small units of work that run in a reasonable time-span. –However, the type of analysis carried out in data warehouse applications isnt like that. Typically you want to run a query that looks at all the data in a set of tables. The problem is splitting that into chunks that can be assigned to the multiple processors.
55 Parallel Data Management (contd) There are two possible solutions to this problem: Static and Dynamic Partitioning. –In Static Partitioning you break up the data into a number of sections. Each section is placed on a different processor with its own data storage and memory. The query is then run on each of the processors, and the results combined at the end to give the entire picture. –This is like joining a queue in a supermarket. You stay with it until you reach the check-out.
56 Parallel Data Management (contd) –The main problem with Static Partitioning is that you cant tell how much processing the various sections need. If most of the relevant data is processed by one processor you could end up waiting almost as long as if you didnt use parallel processing at all. –In Dynamic Partitioning the data is stored in one place, and the data server takes care of splitting the query into multiple tasks, which are allocated to processors as they become available. –This is like the single queue in a bank. As a counter position becomes free the person at the head of the queue takes that position
57 Parallel Data Management (contd) With Dynamic Partitioning the performance improvement can be dramatic, but the partitioning is out of the users hands.
58 2 Tiered Architecture s Shared Global Data Storage s Shared Application Logic s End User Functionality s Data Display s Personal Data Storage s Personal Application Logic Tier 1 Enterprise Server Tier 2 Client This slide defines the fundamentals of a two-tiered architecture
59 2 Tiered Architecture in the DW Environment s Atomic Data Acquisition s Organizational Level Data s Secondary Data Acquisition s Departmental Level Data s Data Access Processing s Individual Level Data s Data Manipulation Tier 1 Enterprise Server Tier 2 Client This slide depicts where the various components of the data warehouse architecture would fall in a two- tiered technical architecture.
60 2 Tiered Architecture in the DW Environment Tier 1 Enterprise Server Tier 2 Client Deptl Atomic Indl Information Directory Indl This is a graphical representation of the two-tiered data warehouse architecture. The atomic level, metadata and departmental levels would reside on the Enterprise Server, while any individual level data would be stored on the client. In this scenario the client is a thin one, with little processing beyond query and simple data manipulation.
61 2 Tiered Architecture in the DW Environment Deptl Atomic Indl Information Directory Tier 1 Enterprise Server Tier 2 Fat Client In this variation, a fat client at the second tier facilitates a considerable amount of data manipulation, freeing up capacity on the Enterprise Server for other, more global activities.
62 2 Tiered Architecture in the DW Environment Tier 1 Enterprise Server Tier 2 Thin Client Deptl Atomic Indl Information Directory Indl Multidimensional Preprocessing In this variation the second tier is a thin client, and volumes of data are pre-processed on the Enterprise Server. This enables access to sophisticated analyses by more than one thin client.
63 3 Tiered Architecture s Shared Global Data Storage s Shared Application Logic s Shared Local Data s Shared Application Logic Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Client s End User Functionality s Personal Data Storage s Personal Application Logic In a three-tiered technical environment a middle tier has been added for shared data and application logic.
64 3 Tiered Architecture in the DW Environment s Atomic Data Acquisition s Organizational Level Data s Secondary Data Acquisition s Departmental Level Data s Data Access Processing s Individual Data s Data Manipulation Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Client In the data warehouse, this middle tier can take on the responsibility for departmental level (data mart) processing. There is little or no impact on the third tier, while the first tier is relieved of some mid-level functionality.
65 3 Tiered Architecture in the DW Environment Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Client Deptl Atomic Indl Local m/d Information Directory Indl In this graphical representation of the three-tiered technical architecture, the departmental data has been moved to the middle tier. In reality, departmental instances can still exist on the first tier. In cases where the middle tier is geographically distant from the first tier, it is advisable to provide a subset of local metadata for the users of the middle tiers data.
66 3 Tiered Architecture in the DW Environment Deptl Atomic Indl Local m/d Information Directory Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Fat Client This scenario also supports a fat client at the third tier, where the individual level data is maintained and manipulated.
67 3 Tiered OLAP Arch. in the DW Environment s Secondary Data Acquisition s Departmental Level Data s Shared Data Access Processing s Data Access s Individual Data s Data Manipulation Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Thin Client s Atomic Data Acquisition s Organizational Level Data One of the biggest benefits of a three-tiered technical architecture for a data warehouse is the ability to facilitate tremendous amounts of pre-processing of data on the middle tier.
68 3 Tiered OLAP Arch. in the DW Environment Deptl Atomic Multidimensional Preprocessing Information Directory Local m/d Indl Tier 1 Enterprise Server Tier 2 Departmental Server Tier 3 Thin Client In this case, the middle tier pre-processes data for use by many thin clients, rather than requiring the individual levels on the third tier to repetitively processes the same data with complex algorithms.
69 Product Code ProdRef Eff. Date ProdRef End Date Product Name Unit Price Product Category Product Type Product Sub Type Product Ref Customer ID Activity Date Product Code Product Name Sales Rep ID Qty Purchased Total Dollars Promotion Flag Cust PurchasesCustomer Customer ID Status Date Cust Addr State Cust ZIP Code Customer Type Customer Status... Sales Rep ID Sales Person Name Store ID Sales Rep Ref Store ID Store Name Store Location Distribution Channel Outlet Reference Cust Averages Customer ID Cust Average Date Cust Avg. End Date Cust Avg. Rev. Cust Longevity The Atomic Schema
71 Dimension Table 1 Dimension Key 1 Description 1 Aggregatn Lvl 1.1 Aggregatn Lvl 1.2 Aggregatn Lvl 1.n Dimension Table Describes the data that has been organized in the Fact Table Key should either be the most detailed aggregation level necessary (e.g. country vs. county), if possible, or... Surrogate keys may be necessary, but will decrease the natural value of the key Manageable number of aggregation levels
72 Dimension Key 1 Dimension Key 2 Dimension Key 3 Dimension Key 4 Fact 1 Fact 2 Fact 3 Fact 4. Fact n Fact Table Quantifies the data that has been described by the Dimension Tables Key made up of unique combination of values of dimension keys –ALWAYS contains date or date dimension Fact values should be additive –Aggregations of quantities or amounts from atomic level –No percentages or ratios –May be non-additive, time-variant data
73 Purchases 1 Days of Activity Unit Price Total Quantity Total Dollars Returned Qty Returned Dollars Promotion Qty Sales Rep ID Product Code Cust ZIP Code Customer Type Week Ending Date Cust ZIP Code City State/Province Country Customer Location Product Product Code Product Name Prod. Category Product Type Prod Sub Type Week Ending Date Month Quarter Year Date Information Customer Type Cust Type Desc Customer Type Selling Responsibility Sales Rep ID Sales Rep Name Store ID Store Name Store Location Sales Channel For Example:
74 Select E.Month, B.Customer_Type, C.Product_Type, D.Store_Location, sum(A.Total_Quantity) FromPurchases_1 A, Customer_Type B, Product C, Selling_Responsibility D, Date_Information E WhereB.Customer_Type = A.Customer_Type and C.Product_Code = A.Product_Code and D.Sales_Rep_ID = A.Sales_Rep_ID and E.Week_Ending_Date = A.Week_Ending_Dateand E.Year = 1996 and C.Product_Category = V Group byE.Month, B.Customer_Type, C.Product_Type, D.Store_Location; Star Schema Query
75 Caution!: Overly Complex Dimension Number of aggregation levels within the dimension becomes unmanageable Logically or functionally incorrect combination of aggregation levels within a dimension
76 Split overly complex dimension into several logical, more manageable dimensions, based on business function of aggregation levels Answer #1: Split Dimension
77 Answer #2: The Snowflake Identify hierarchies of aggregation levels and dimensionalize the primary dimension Secondary dimensions descriptive of the primary dimension
78 Weekly Date D1 D2 D3 D4 Monthly Date D1 D2 D3 D4 Answer: Distinct Time Period Fact Tables Create separate fact tables to account for different time periods Date still part of each fact table key Same dimension tables used by both fact tables Improves overall performance (loading and accessing) for each time period Will not increase amount of managed redundancy