Data Warehousing Concepts
Contents
- Introduction to Data Warehousing
- Data Warehouse Architecture
- Dimensional Modeling
- Data Warehouse Techniques
  - Data Mining
  - OLAP
1. Introduction to Data Warehousing
- Data Warehousing Definition
- Online Transaction Processing (OLTP) System
- Data Warehousing System
- Difference between OLTP and DW System
- Reasons for Building a Data Warehouse
- Benefits of Data Warehousing
1.1 What is a Data Warehouse?
A data warehouse:
- is primarily a centralized repository of an organization’s data.
- holds a large amount of data, including historical information.
- is designed to support efficient data analysis and reporting.
1.2 OLTP Systems
Focus:
- Designed to get data in quickly and to analyze current events.
- Transaction oriented.
- Organized around business processes such as Order Entry, Purchasing, Campaign Management, Trading, etc.
- Design goals include avoidance of data duplication and maintainability.
Characteristics:
- Process oriented.
- Normalized data.
- Current data.
- Volatile data.
- Real-time updates.
1.3 Data Warehousing Systems
Focus:
- Designed to get data out and analyze it quickly.
- Concerned with subjects such as customer and product rather than processes such as order entry and campaign management.
- Focus on easy data access.
- Contains slices of data across different periods of time.
- Historical data supports trending, forecasting, and time-based performance reporting.
Characteristics:
- Subject oriented rather than process oriented.
- Integrated across subjects and the entire enterprise.
- De-normalized data.
- Time-variant.
- Historical data.
- Non-volatile.
- Atomic and summary data.
1.4 OLTP vs. Data Warehouse
OLTP Systems                                Data Warehouse
Normalized data                             De-normalized data
Used to run the business                    Used to analyze the business
Real-time data update                       Updated on a predefined schedule
Volatile data                               Non-volatile data
Current data                                Historical data
Wider audience; transaction throughput      Limited audience; fast query response
Small to large database                     Large to very large database
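The normalized versus de-normalized contrast above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (customer, orders, sales_wide), columns, and sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP style: normalized tables, each customer attribute is stored once.
# All names and rows here are hypothetical sample data.
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Acme', 'Pune'), (2, 'Globex', 'Delhi');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);
""")

# Warehouse style: a de-normalized copy that repeats customer attributes
# on every row so that reports can read a single table.
cur.execute("""
CREATE TABLE sales_wide AS
SELECT o.order_id, c.name AS customer_name, c.city AS customer_city, o.amount
FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""")

# The same report needs a join on the normalized schema ...
oltp_totals = cur.execute("""
SELECT c.city, SUM(o.amount)
FROM orders o JOIN customer c ON c.customer_id = o.customer_id
GROUP BY c.city ORDER BY c.city
""").fetchall()

# ... and no join on the de-normalized table.
dw_totals = cur.execute("""
SELECT customer_city, SUM(amount)
FROM sales_wide GROUP BY customer_city ORDER BY customer_city
""").fetchall()
```

Both queries return the same totals per city; the de-normalized table trades duplicated data for simpler, join-free reads.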
1.5 Why Build a Data Warehouse?
- No single version of the truth.
- Lack of standardized data across the enterprise for easy understanding and further decision making.
- Absence of historical data for the purpose of analysis and decision making.
1.6 Benefits of Data Warehousing Rapid Access to data. Integrated data. Reliable Reporting. Better Decision making.
2. Data Warehouse Architecture
- Logical Architecture
- Elements of a Data Warehouse
Data Warehouse - Logical Architecture
[Figure: logical architecture diagram. Labels shown: ETL, Staging Area, DQ, Datamarts, BI Tools, Query Tools, OLAP Tools, Data Mining, Data Visualization]
Elements of a Data Warehouse
[Figure: three numbered stages of a data warehouse solution, with a metadata layer. Labels shown: Operational Data Storage (SAP, CRM, Inventory, Manufacturing); ETL & Staging (ETL Tool or Process, Staging); Data Storage (Data Warehouse; Quality, Accounts, Inventory); Reporting Layer (BI Tools, Portals; Quality, Finance, Marketing; Secured Access); Metadata]
- ETL & Staging: The ETL tool interfaces with all the sources in the enterprise and extracts data in a batch cycle or in real time. 70% of the effort in a data warehousing solution is in developing a successful ETL strategy.
- Data Storage: Enterprise information is stored in the warehouse structure.
- Reporting Layer: BI tools interface with the databases to generate reports.
Extracting
- The extract step is the first step involved in getting data into the data warehouse environment.
- Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.
- Extracting data needs to be done carefully so as not to affect production environments.
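The extract step described above (read the source, copy only the parts that are needed to staging) can be sketched as follows. The source is modeled as an in-memory CSV export; the column names, the sample rows, and the extract function are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical export from a source system (not from the slides).
SOURCE_CSV = """customer_id,name,city,postal_code,internal_notes
1,Acme,Pune,411001,call back
2,Globex,Delhi,110001,
"""

# Only these columns are needed downstream; the rest stay in the source.
NEEDED_COLUMNS = ["customer_id", "name", "city", "postal_code"]

def extract(source_text, needed):
    """Read the source once, sequentially, and copy only the needed columns."""
    reader = csv.DictReader(io.StringIO(source_text))
    return [{col: row[col] for col in needed} for row in reader]

# The returned list stands in for the data staging area.
staging_rows = extract(SOURCE_CSV, NEEDED_COLUMNS)
```

Reading the source in a single sequential pass and leaving all further work to the staging area is one simple way to limit the load placed on a production system.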
Staging Area
Transforming
Once the data is extracted into the data staging area, there are many possible transformation steps, including cleaning the data by:
- correcting misspellings
- resolving domain conflicts (such as a city name that is incompatible with a postal code)
- dealing with missing data elements
- parsing into standard formats
Staging Area
- A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse.
- The data staging area is everything in between the source system and the presentation server.
- The data staging area is not part of the physical data warehouse.
- The staging area is dominated by the simple activities of sorting and sequential processing.
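The cleaning steps listed under Transforming can be sketched as a single row-level function. Everything specific here is an assumption made for illustration: the lookup tables CITY_FIXES and POSTAL_PREFIX_BY_CITY, the field names, and the two accepted date formats are hypothetical.

```python
from datetime import datetime

# Hypothetical lookup tables, for illustration only.
CITY_FIXES = {"pune": "Pune", "puna": "Pune", "dehli": "Delhi", "delhi": "Delhi"}
POSTAL_PREFIX_BY_CITY = {"Pune": "411", "Delhi": "110"}

def clean_row(row):
    """Return a cleaned copy of one staged row."""
    cleaned = dict(row)

    # Correct misspellings, and give missing cities an explicit marker.
    raw_city = (row.get("city") or "").strip()
    cleaned["city"] = CITY_FIXES.get(raw_city.lower(), raw_city) or "UNKNOWN"

    # Parse dates into one standard format (ISO 8601); None if unparseable.
    raw_date = (row.get("order_date") or "").strip()
    cleaned["order_date"] = None
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            cleaned["order_date"] = datetime.strptime(raw_date, fmt).date().isoformat()
            break
        except ValueError:
            continue

    # Flag a domain conflict: a city that is incompatible with the postal code.
    prefix = POSTAL_PREFIX_BY_CITY.get(cleaned["city"])
    postal = (row.get("postal_code") or "").strip()
    cleaned["postal_conflict"] = bool(prefix) and not postal.startswith(prefix)
    return cleaned

example = clean_row({"city": "puna", "postal_code": "411001", "order_date": "05/03/2024"})
```

The sketch flags conflicts instead of silently repairing them, since which of the two fields is wrong usually cannot be decided from the row alone.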
Loading Data
- At the end of the transformation process, the data is ready to be loaded into the target warehouse.
- A first-time bulk load gets the historical data into the data warehouse.
- Periodic incremental loads bring in modified data.
- Loading in the data warehouse environment usually takes the form of inserting data into dimension tables and fact tables. These are the tables that users and tools typically query when executing reports.
- Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower and can push load times into the 10+ hour range.
- It may be necessary to drop and recreate indexes on the target warehouse structure each time data loading occurs.
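The loading pattern above (initial bulk load, periodic incremental loads, indexes dropped and recreated around the load) can be sketched with Python's built-in sqlite3 module. The sales_fact table, the idx_sales_product index, and the sample rows are hypothetical, and a production warehouse would normally use its own bulk-loading utility instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table and index, for illustration only.
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_id)")

def bulk_load(conn, rows):
    """First-time load: drop the index, insert all rows in one batch, rebuild."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP INDEX IF EXISTS idx_sales_product")
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_id)")

def incremental_load(conn, changed_rows):
    """Periodic load: insert new rows and overwrite rows that were modified."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?)", changed_rows)

bulk_load(conn, [(1, 1, 10.0), (2, 2, 20.0), (3, 1, 30.0)])   # historical data
incremental_load(conn, [(2, 2, 25.0), (4, 3, 40.0)])          # one change, one new row

row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales_fact"
).fetchone()
```

Passing the whole batch to executemany inside one transaction is the point of the sketch: it avoids the per-record commit overhead that makes record-at-a-time loading slow.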
Data Warehouse Techniques
- Data Mining
- OLAP
Data Mining
Data mining access of a database differs from traditional access in several ways:
- Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see.
- Data: The data accessed is usually a different version from that of the original operational database. The data have been cleaned and modified to better support the mining process.
- Output: The output of the data mining query probably is not a subset of the database. Instead, it is the output of some analysis of the contents of the database.
Data mining algorithms can be characterized as consisting of three parts:
- Model: The purpose of the algorithm is to fit a model to the data.
- Preference: Some criteria must be used to fit one model over another.
- Search: All algorithms require some technique to search the data.
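The three parts (model, preference, search) can be made concrete with a deliberately tiny example. The choices below are assumptions made for illustration and are not from the slides: the model is a straight line, the preference criterion is the sum of squared errors, and the search is an exhaustive scan over a small grid of integer parameters.

```python
from itertools import product

# Hypothetical data points (x, y), for illustration only.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]

def model(a, b, x):
    """Model: a straight line y = a * x + b."""
    return a * x + b

def preference(a, b):
    """Preference: sum of squared errors; lower means a better fit."""
    return sum((model(a, b, x) - y) ** 2 for x, y in points)

def search():
    """Search: try every candidate (a, b) on a small grid and keep the best."""
    candidates = product(range(0, 5), range(-2, 3))
    return min(candidates, key=lambda params: preference(*params))

best_a, best_b = search()
```

Real data mining algorithms differ mainly in how they fill these three roles, for example a decision tree as the model or a greedy heuristic as the search.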
4. OLAP
[Figure: SALES CUBE. Dimensions: Customer (A to D), Product (1 to 5), Time (Q1, Q2). Measure: Sales]
- The general activity of querying and presenting text and number data from data warehouses in a dimensional format is known as OLAP.
- The OLAP vendors’ technology is non-relational and is almost always based on an explicit multidimensional cube of data.
- OLAP databases are also known as multidimensional databases, or MDDBs.
- OLAP installations would be classified as small, individual data marts when viewed against the full range of data warehouse applications.
SALES CUBE
Visible face of the cube, Customer by Product (Time, with values Q1 and Q2, is the third axis):
              Product
Customer    1    2    3    4    5
A          11   43   12   49   71
B          33   15   65   94   45
C          59   77   37   78   12
D          09   53   20   73   32
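A sales cube like the one above can be sketched as a mapping from (customer, product, time) coordinates to a sales value, with slice and roll-up as the two basic operations. The cell values, the DIMENSIONS tuple, and the function names are hypothetical; actual OLAP products use their own multidimensional storage and query languages.

```python
from collections import defaultdict

# Hypothetical cube cells: (customer, product, quarter) -> sales.
cube = {
    ("A", 1, "Q1"): 10, ("A", 2, "Q1"): 20,
    ("B", 1, "Q1"): 5,  ("B", 2, "Q1"): 15,
    ("A", 1, "Q2"): 12, ("A", 2, "Q2"): 18,
    ("B", 1, "Q2"): 7,  ("B", 2, "Q2"): 13,
}
DIMENSIONS = ("customer", "product", "time")

def slice_cube(cube, dimension, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    axis = DIMENSIONS.index(dimension)
    return {key: sales for key, sales in cube.items() if key[axis] == value}

def roll_up(cube, keep):
    """Roll-up: sum sales over every dimension not named in `keep`."""
    axes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, sales in cube.items():
        totals[tuple(key[a] for a in axes)] += sales
    return dict(totals)

q1_slice = slice_cube(cube, "time", "Q1")
sales_by_customer = roll_up(cube, ["customer"])
sales_by_customer_quarter = roll_up(cube, ["customer", "time"])
```

Here sales_by_customer answers "total sales per customer across all products and quarters", while q1_slice is the single Customer by Product face for Q1.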
Thank You