Data Warehousing and Business Intelligence


1 Data Warehousing and Business Intelligence
Schedule: Lecture 45 minutes, Practice 10 minutes, Total 55 minutes

2 Oracle 10g: Data Warehousing Fundamentals 1 - 2
Introductions Tell us about yourself: What is your name and company? What is your role in the organization? What is your level of Oracle expertise? Why are you considering building a data warehouse? What are your expectations for this class? Instructor Note Ask each student to introduce himself or herself and answer the following questions: What is your name and company? What is your role in your organization? What is your level of Oracle expertise? Why are you building a data warehouse or data mart? What do you hope to get out of this class? As the students introduce themselves, assess their level of database design background and data warehouse knowledge. Ensure that they understand that the course is an introductory survey course, which covers the fundamentals, and briefly touches upon the various Data Warehousing and Business Intelligence tools from Oracle. Oracle 10g: Data Warehousing Fundamentals

3 Oracle 10g: Data Warehousing Fundamentals 1 - 3
Course Objectives After completing this course, you should be able to do the following: Describe the role of data warehousing and business intelligence (BI) in today’s marketplace Define the terminology and explain the basic concepts of data warehousing Define the decision support purpose and end goal of a data warehouse Develop familiarity with the various technologies required to implement a data warehouse Identify the technology and tools from Oracle to implement a successful data warehouse Identify data warehouse modeling concepts Oracle 10g: Data Warehousing Fundamentals

4 Oracle 10g: Data Warehousing Fundamentals 1 - 4
Course Objectives Describe methods and tools for extracting, transforming, and loading data Identify the tools for accessing and analyzing warehouse data Identify the features of Oracle Database 10g that aid in implementing the data warehouse Describe the OLAP and data mining techniques and tools Explain the implementation and organizational issues surrounding a data warehouse project Oracle 10g: Data Warehousing Fundamentals

5 Oracle 10g: Data Warehousing Fundamentals 1 - 5
Lessons Data Warehousing and Business Intelligence Defining Data Warehouse Concepts and Terminology Business, Logical, and Dimensional Modeling Physical Modeling: Sizing, Storage, Performance, and Security Considerations The ETL Process: Extracting Data The ETL Process: Transforming Data The ETL Process: Loading Data Oracle 10g: Data Warehousing Fundamentals

6 Oracle 10g: Data Warehousing Fundamentals 1 - 6
Lessons Refreshing Warehouse Data Summary Management Leaving a Metadata Trail OLAP and Data Mining Data Warehouse Implementation Considerations Oracle 10g: Data Warehousing Fundamentals

7 Oracle 10g: Data Warehousing Fundamentals 1 - 7
Let’s Get Started Lesson 1 Oracle 10g: Data Warehousing Fundamentals

8 Oracle 10g: Data Warehousing Fundamentals 1 - 8
Lesson 1 Objectives After completing this lesson, you should be able to do the following: Describe the evolution of data warehouses from management information systems (MIS) Describe why an online transaction processing (OLTP) system is not suitable for analytical reporting Describe how extract processing for decision support querying led to data warehouse solutions Identify the role of business intelligence (BI) in today’s market Identify the BI tools and technology from Oracle Identify the business drivers for data warehouses Explain why businesses are driven to employ data warehouse technology Identify the components of Oracle E-Business Intelligence Lesson Aim This lesson examines how data warehousing has evolved from early management information systems to becoming the cornerstone for business intelligence processes. The lesson also explores the primary motivating factors for implementing a data warehouse. Oracle 10g: Data Warehousing Fundamentals

9 Oracle 10g: Data Warehousing Fundamentals 1 - 9
Evolution of BI Executive information systems (EIS) Decision support systems (DSS) Data warehousing (DW) and business intelligence (BI) DW&BI DSS EIS Evolution of Data Warehousing and Business Intelligence The drive to enterprisewide information and data analytics started about twenty years ago. The associated technology was referred to as decision support systems (DSS) and executive information systems (EIS). In the last decade, technology has evolved from homegrown applications for EIS to packaged applications that allow the paradigm of data exploration, analysis, and mining to shift from information systems (IS) to the individual end user. EIS applications were generally developed by the IS team and written in 3GL, 4GL, C++, or some other structured programming language. These were predefined, somewhat restrictive queries that were delivered in tabular or chart form. Generally, information provided was limited to sales totals, units produced, and so on. DSS applications were the first generation of packaged software that provided dynamically generated SQL enabling users to extract data from relational databases. This data was relevant to their business needs and focus. For the past decade, there has been a transition from decision support systems to data warehouses. BI, the next generation of DSS, provides the capability to create and format reports easily. Additionally, multiple sources and multiple subject matters can be used simultaneously to provide an accurate assessment of the business. BI is enabling a rapid evolution in customer relationship management (CRM), supply chain analysis, sales force automation, technology forecasting, and so on. Oracle 10g: Data Warehousing Fundamentals

10 Early Management Information Systems
MIS systems provided business data. Reports were developed on request. Reports provided little analysis capability. Decision support tools gave personal ad hoc access to data. Ad hoc access Production platforms Early Management Information Systems Early management information systems (MIS) provided management with reports to assess the performance of the business. Report requirements were submitted as a request to the MIS development team, who developed the report and made it available to the user some time afterward—days, weeks, or even months later. The data in the reports was made available in a way that was difficult to use for analysis and forecasting. With the advent of personal computing and 4GL programming techniques, MIS became known as decision support (decision support systems or DSS). DSS was judged to support business users better, by giving them direct access to the operational data for additional ad hoc querying, which provided more flexible reporting as the information was needed. Operational reports Decision makers Oracle 10g: Data Warehousing Fundamentals

11 Analyzing Data from Operational Systems
Data structures are complex. Systems are designed for high performance and throughput. Data is not meaningfully represented. Data is dispersed. OLTP systems may be unsuitable for intensive queries. Production platforms Analyzing Data from Operational Systems Although decision support tools are friendly, intuitive, and easy to use, often the structure of data in the online transaction processing systems does not support the user’s real analytical requirements. The structure of the operational data is often complex and highly normalized (3NF). The system was designed for high-performance, high-throughput online transaction processing, rather than CPU-intensive analysis of information. The data is not always meaningfully presented to the end-user query tool. The same data elements may be defined differently for each operational system. For example, a customer record may hold the customer telephone number. On one system, this number is stored as a 15-digit number, and on another as a 20-character alphanumeric value. Data is dispersed on multiple and diverse systems, leading to data redundancy and the inability to coordinate data between systems to provide a global picture of the business. Running online transaction processing and decision support concurrently on one machine degrades performance of the operational system, response time to users, and performance of networks. Operational reports Oracle 10g: Data Warehousing Fundamentals

12 Why OLTP Is Not Suitable for Analytical Reporting
Analytical Reporting: database design denormalized (star schema); data needs to be integrated; historical information to analyze. OLTP: database design normalized; data stored at transaction level; information to support day-to-day service. Why OLTP Is Not Suitable for Analytical Reporting Operational systems largely exist to support transactions—for example, the booking of an airline ticket. Decision support, which is a type of complex analysis, is very different from OLTP. Most OLTP transactions require a single record in a database to be located and updated, or the addition of one or more new records. Even a simple decision support query such as “How many luxury cars did we sell in Boston for January 2001?” requires very different operations at the database level than an OLTP transaction. A potentially large number of records must be located, and there are no update operations at all. OLTP databases are fully normalized and are designed to consistently store operational data, one transaction at a time. Analytical reporting, on the other hand, requires database design that even business users find directly usable. To achieve this, different database design techniques are required (for example, the use of dimensional and star schemas with highly denormalized dimension tables). Oracle 10g: Data Warehousing Fundamentals
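
To make the contrast concrete, here is a minimal sketch of how the luxury-car question above might be posed against a dimensional design. The schema and names (sales_fact, product_dim, store_dim, time_dim) are illustrative assumptions, not objects defined in this course.

    -- A decision-support query touches and aggregates many rows across joined
    -- dimension tables; there are no update operations at all.
    SELECT SUM(f.units_sold) AS luxury_cars_sold
    FROM   sales_fact  f
    JOIN   product_dim p ON p.product_key = f.product_key
    JOIN   store_dim   s ON s.store_key   = f.store_key
    JOIN   time_dim    t ON t.time_key    = f.time_key
    WHERE  p.product_class  = 'LUXURY'
    AND    s.city           = 'Boston'
    AND    t.calendar_month = '2001-01';

The OLTP side of the same business event, by contrast, is a single-row INSERT or UPDATE against fully normalized tables, which is exactly what that design is optimized for.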

13 Data Extract Processing
End-user computing offloaded from the operational environment User’s own data Operational systems Extracts Decision makers Data Extract Processing DSS and Degradation The problem of performance degradation was partially solved by using extract processing techniques that select data from one environment and transport it to another environment for user access (a data extract). Data Extract Program The data extract program searches through files and databases, gathering data according to specific criteria. The data is then placed into a separate set of files, which may reside on another environment, for use by analysts for decision support activities. Extract processing was a logical progression from decision support systems. It was seen as a way to move the data from the high-performance, high-throughput online transaction processing systems onto client machines that are dedicated to analysis. Extract processing also gave the user ownership of the data. Oracle 10g: Data Warehousing Fundamentals
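
As a rough, hypothetical illustration of such an extract program (the table and column names are invented), the whole exercise often amounted to copying a filtered slice of operational data into an analyst-owned area:

    -- Pull one region's January orders out of the operational system
    -- into a private table that the analyst can query freely.
    CREATE TABLE analyst_jan_orders AS
    SELECT order_id, customer_id, order_date, order_total
    FROM   orders
    WHERE  region = 'NORTHEAST'
    AND    order_date BETWEEN DATE '2001-01-01' AND DATE '2001-01-31';

Each private copy of this kind is one more extract to schedule, document, and reconcile, which is how the extract explosion described on the next slide begins.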

14 Issues with Data Extract Programs
Operational systems Extracts Decision makers Issues with Data Extract Programs Although the principle of extracts appears logical, and to some degree represents a model similar to the way a data warehouse works, there are problems with processing extracts. Extract programs may become the source for other extracts, and extract management can become a full-time task for information systems departments. In some companies, hundreds of extract programs are run at any time. Extract explosion Oracle 10g: Data Warehousing Fundamentals

15 Productivity Issues with Extract Processing
Duplicated effort Multiple technologies Obsolete reports No common metadata Productivity Issues with Extract Processing Following are the productivity issues in an extract processing environment: Extract effort is duplicated because multiple extracts access the same data and use mainframe resources unnecessarily. The program designed to access the extracted data must encompass all technologies that are employed by the source data. A report cannot always be reused because business structures change. There are different types of metadata, but no common metadata providing a standard way of extracting, integrating, and using the data. Oracle 10g: Data Warehousing Fundamentals

16 Data Quality Issues with Extract Processing
No common time basis Different calculation algorithms Different levels of extraction Different levels of granularity Different data field names Different data field meanings Missing information No data correction rules No drill-down capability Data Quality Issues with Extract Processing Following are the data quality issues in an extract processing environment: The data has no time basis and users cannot compare query results with confidence. The data extracts may have been taken at different points in time. Each data extract may use a different algorithm for calculating derived and computed values. This makes the data difficult for managers to evaluate, compare, and communicate, because they may not know the methods or algorithms that were used to create the data extract or reports. Data extract programs may use different levels of extraction. Access to external data may not be consistent, and the granularity of the external data may not be well defined. Data sources may be difficult to identify, and data elements may be repeated on many extracts. The data field names and values may have different meanings in the various systems in the enterprise (lack of semantic integrity). There are no data correction rules to ensure that the extracted data is correct and clean. The reports provide data rather than information, and offer no drill-down capability. Oracle 10g: Data Warehousing Fundamentals

17 Data Warehousing and Business Intelligence
Enterprise Data Warehouse Legacy data Operations data Analytical reporting Data Warehousing and Business Intelligence A data warehouse is a strategic collection of all types of data in support of the decision-making process at all levels of an enterprise. It is a single data store created for two primary reasons: analytical reporting and decision support. Companies require business intelligence to direct business process improvement and monitor time, cost, quality, and control. External data Data marts Oracle 10g: Data Warehousing Fundamentals

18 Technological Advances Enabling Data Warehousing
Hardware Operating system Database Query tools Applications Large databases 64-bit architectures Indexing techniques Affordable, cost-effective open systems Robust warehouse tools Sophisticated end-user tools Technology Needed to Support the Business Needs Today’s information technology climate provides you with cost-effective computing resources in the hardware and software arena, Internet and intranet solutions, and databases that can hold very large volumes of data for analysis, using a multitude of data access technologies. Technological Advances Enabling Data Warehousing Technology (specifically, open systems technology) is making it affordable to analyze vast amounts of data, and hardware solutions are now more cost effective. Parallelism Recent advances in parallelism have benefited all aspects of computing: Hardware environment Operating system environment Database management systems and all associated database operations Query tools and techniques Applications Oracle 10g: Data Warehousing Fundamentals
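
As one small, hedged example of how these advances surface to the query writer (the table names are illustrative and the degree of parallelism is arbitrary), Oracle lets a statement request a parallel scan of a large table:

    -- Ask the optimizer to scan the fact table with a degree of parallelism of 8.
    SELECT /*+ PARALLEL(f, 8) */
           t.calendar_year,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f
    JOIN   time_dim   t ON t.time_key = f.time_key
    GROUP BY t.calendar_year;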

19 Advantages of Warehouse Processing Environments
Controlled Reliable Quality information Single source of data Internal and external systems Data warehouse Decision makers Advantages of Warehouse Processing Environments The data warehouse environment is more controlled and therefore more reliable for decision support than an extract environment. The data warehouse environment supports your entire decision support requirements by providing high-quality information, which is made available by accurate and effective cleansing routines, and using consistent and valid data transformation rules and documented presummarization of data values. It contains a single source of accurate, reliable information that can be used for analysis. Oracle 10g: Data Warehousing Fundamentals

20 Advantages of Warehouse Processing Environments
No duplication of effort No need for tools to support many technologies No disparity in data, meaning, or representation No time period conflict No algorithm confusion No drill-down restrictions Advantages of Warehouse Processing Environments (continued) Other advantages of the warehousing processing environment are: No duplication of effort No need to consider using a query and reporting tool that supports more than one technology No disparity with the data and its meaning No disparity with the way data is represented No conflict over the time periods employed No contention over the algorithms that have been used No restriction on drill-down capabilities Oracle 10g: Data Warehousing Fundamentals

21 Business Intelligence: Definition and Purpose
“Business intelligence is the process of transforming data into information and through discovery transforming that information into knowledge.” – Gartner Group The purpose of business intelligence is to convert the volume of data into business value through analytical reporting. Value Decision Knowledge Information Business Intelligence: Definition and Purpose Definition Howard Dresner, analyst with the Gartner Group, defines business intelligence as a process of turning data into information and through iterative discoveries turning that information into business intelligence. The key is that business intelligence is a process—cross-functional, in line with current management thinking, and not presented in IT terms. From an information systems standpoint, BI provides users with online analytical processing or data analysis capabilities to predict trends, evaluate business questions, and so on. From a BI analyst viewpoint, it is the process of gathering high-quality, meaningful information about a subject, which enables the analyst to draw conclusions. Data warehousing creates the infrastructure for providing successful enterprise-level business intelligence. Purpose The purpose of business intelligence is to turn large volumes of data into information, linking bits of information together within a decision context that turns it into knowledge that can be used to aid decision making. Data Volume Oracle 10g: Data Warehousing Fundamentals

22 Success Factors for a Dynamic Business Environment
Know the business. Reinvent to face new challenges. Invest in products. Invest in customers. Retain customers. Invest in technology. Improve access to business information. Provide superior services and products. Be profitable. Success Factors for a Dynamic Business Environment To succeed in an ever-changing business environment, a company must: Know both the market they are in and their business (internally and externally) Reinvent themselves to face new challenges. This may be changing product requirements, diverse and effective services, or even changes in internal organizational structures. Invest in research and development of new product channels Invest in high-value customers who contribute greater returns to the business Retain existing customers and attract new customers Invest in new technology to support business needs Improve access to information so that they can make rapid decisions, based on an accurate picture of the business Provide superior services and products to keep market share and maintain income Be profitable—at the same time, they must be able to invest in resources for the future, such as technology and people Oracle 10g: Data Warehousing Fundamentals

23 Business Drivers for Data Warehouses
Provide supporting information systems. Get quality information: Reduce costs. Streamline the business. Improve margins. Business Drivers for Data Warehouses Businesses today face challenges such as regulatory control, competition, market maturity, product differentiation, customer behavior, and accelerated product life cycles, all of which require businesses to develop market awareness, responsiveness, adaptability, innovation, efficiency, and quality. To meet these challenges, a business needs to have: Access to consistent and high-quality information about the behaviors of the business and the external markets, so that they can constantly monitor the state of the business Information that can help to reduce costs, streamline the business, and improve margins Oracle 10g: Data Warehousing Fundamentals

24 Business Intelligence: Requirements
Efficient design of data warehouses Enterprise reporting Ad hoc query and analysis (relational and multidimensional) Advanced analytics Integration with portals Easy administration Integrated environment and/or tools Business Intelligence: Requirements To address the changing requirements of today’s business economy, business intelligence systems require that the following business requirements be addressed: Efficient design of data warehouses Enterprise reporting Ad hoc query and analysis Advanced analytics Integration with portals Easy administration Integrated environment or tools Oracle 10g: Data Warehousing Fundamentals

25 Problem: Multivendor, Unintegrated Environment
ETL tool Lineage OLAP engine Analytic apps. Transformation engine Mining engine Portal Query & analysis ETL tool Database Transformation engine Reporting engine Name/address scrubbing Enterprise reporting Problem: Multivendor, Unintegrated Environment Now that you have understood the requirements to design and implement a successful business intelligence (BI) solution, consider a common architecture found in businesses today. Most BI solutions are incomplete, requiring you to integrate disparate BI systems from multiple vendors for capabilities such as ad hoc query, Web analysis, reporting, personalized recommendations, custom application development, portal, online analytical processing (OLAP), data mining, and extraction, transformation, and loading (ETL). The result is high costs, both for the initial purchases from multiple vendors and for the integration, administration, and maintenance of the BI solutions. Performance is another factor as data is moved from one system to the next for their respective operations. For example, to calculate year-to-year growth and rank the results in ascending order, multiple copies of the same data are required to perform each sequential operation. This may lead to data latency, which hinders the accuracy of your information and may result in conflicting reports. As the complexity of the business intelligence system increases with more vendors and integration points, maintenance costs escalate as the business changes and mandates new BI requirements. Therefore, a complete BI solution built today may not be as complete tomorrow. Oracle 10g: Data Warehousing Fundamentals

26 Oracle Business Intelligence
Oracle Business Intelligence tools and applications Build DW Ad hoc query Analytics BI Beans Oracle Application Server Integration Portal HTTP server, J2EE, Web services Oracle Database with OLAP, data mining, and ETL features Oracle Business Intelligence Oracle provides the technology framework that is needed to build a complete and integrated solution for business intelligence and data warehousing. The three well-integrated product suites—Oracle Database 10g, Oracle Application Server 10g, and Oracle Business Intelligence 10g—enable users to rapidly develop and deploy data warehouses and data marts with a complete array of reporting, querying, and analytic capabilities. Oracle Database 10g is an analysis-ready database with ETL, OLAP, and data mining features built right into the data server. Oracle Application Server 10g comes with built-in Portal and other Web services that allow easy development and delivery of customized intelligent information to all. Oracle Business Intelligence 10g offers a complete set of applications and tools to build and manage the data warehouse, to perform ad hoc queries and advanced analytics, to build custom BI applications, and to provide easy access and distribution of information across the enterprise. Note: A detailed discussion about BI Applications and Tools is provided on the following page. Wireless Business intelligence Oracle 10g: Data Warehousing Fundamentals

27 Introduction to Oracle Business Intelligence Tools and Applications
Oracle Warehouse Builder 10g (OWB) OracleBI Discoverer OracleBI Spreadsheet Add-In OracleBI Beans Oracle Reports 10g Analytical Workspace Manager 10g (AWM) Oracle Data Miner 10g (ODM) Introduction to Oracle Business Intelligence Tools and Applications Oracle Corporation provides a complete BI solution by offering the following BI products: Oracle Warehouse Builder: Build and Manage the Data Warehouse Developers design and build the data warehouse by using Oracle Warehouse Builder (OWB). OWB enables the complete extraction, transformation, and loading (ETL) cycle, and data preparation of both relational and multidimensional data sources. That is, OWB is also used to cleanse and transform data from relational, nonrelational (flat files, mainframes, and legacy structures), and multidimensional data sources, ensuring high quality of data during the data loading process. Note: The current release of OWB can also be used to generate the End User Layer (the metadata repository for relational data source) and an analytic workspace (metadata repository for multidimensional analysis). These repositories enable the data access for OracleBI Discoverer, OracleBI Spreadsheet Add-In, and OracleBI Beans. OracleBI Discoverer: Ad Hoc Query and Analysis It is an intuitive tool for ad hoc query, reporting, analysis, and Web publishing that empowers business users to gain immediate access to information from data marts, data warehouses, online transaction processing (OLTP), online analytical processing (OLAP) systems, and also Oracle E-Business Suite. Oracle 10g: Data Warehousing Fundamentals

28 Oracle’s Complete and Integrated Solution
Publish BI content on portal OracleAS Portal Ad hoc query/OLAP analysis OracleBI Discoverer Oracle Reports 10g Enterprise reporting OracleBI Beans Develop custom BI applications Oracle Warehouse Builder/ Analytical Workspace Manager Manage metadata Oracle Warehouse Builder Analytical Workspace Manager ETL/ Design EUL, AW Oracle’s Complete and Integrated Solution Oracle Database 10g has integrated relational, OLAP, ETL, and data mining capabilities to provide a complete solution to simple as well as complex analysis problems. Oracle Application Server 10g provides all the middle-tier services that you need to deploy and manage applications and Web services. It also has components such as OracleAS Portal, which helps you to develop personalized applications through enterprise portals. Oracle Application Server 10g provides business intelligence with a set of integrated applications and automates business processes. Oracle Business Intelligence tools and applications are well integrated, and are interoperable across the entire analytical spectrum of BI requirements. These tools and applications provide solutions for every possible business requirement, such as: How to cleanse and transform data, design a data warehouse, and perform ETL tasks? Oracle Warehouse Builder How to create an analytical workspace for multidimensional analysis? Oracle Warehouse Builder, Oracle Analytical Workspace Manager 10g What is driving the increase in North American sales? (Ad hoc query and analysis, OLAP/relational) OracleBI Discoverer Database Oracle Database Oracle 10g: Data Warehousing Fundamentals

29 Oracle E-Business Intelligence
Oracle Daily Business Intelligence (DBI) Oracle Corporate Performance Management (CPM) Oracle Enterprise Reporting and Delivery Oracle XML Publisher PeopleSoft Enterprise Performance Management (EPM) Develop Contracts Market E-Business Suite Projects Sell HR Order Finance Customers, Suppliers, Products, … Plan Maintain Procure Service Make Fulfill Daily Business Intelligence Oracle E-Business Intelligence Oracle Applications leverage Oracle BI technology, and are optimized for use with Oracle transactional applications. OracleBI Applications include Oracle Daily Business Intelligence (DBI), Oracle Corporate Performance Management (CPM), Oracle Enterprise Reporting and Delivery, PeopleSoft Enterprise Performance Management (EPM), and so on. Oracle Daily Business Intelligence (DBI) DBI is a set of reporting and analysis applications that deliver accurate, timely, actionable information to executives, managers, and frontline workers. Oracle Daily Business Intelligence applications are embedded into Oracle Financials, Human Resources, Supply Chain Management (SCM), Customer Relationship Management (CRM), Project Management, and so on. These DBI applications are ready to run, requiring minimal setup. With a radically simplified architecture and a single data model for a single source of truth, Oracle Daily Business Intelligence delivers greater business insight to end users faster and at a cost much lower than any other E-Business reporting solution. Oracle Corporate Performance Management (CPM) Oracle’s comprehensive Corporate Performance Management (CPM) products deliver business intelligence, planning and budgeting (Enterprise Planning and Budgeting or EPB), consolidation, profitability management, and analysis and reporting capabilities across the enterprise. Oracle 10g: Data Warehousing Fundamentals

30 OracleBI Suite Enterprise Edition – Based on Siebel Analytics
Siebel Analytics has the following two sets of products: Siebel Business Analytics Platform, the application server, administration and query tools (now the components of Oracle BI Suite Enterprise Edition) Siebel Analytics Applications (a set of pre-packaged applications built using the Analytics Platform tools) OracleBI Suite Enterprise Edition – Based on Siebel Analytics In early 2006, Oracle acquired Siebel and decided to launch OracleBI Suite Enterprise Edition. The technology and products that make up the OracleBI Suite Enterprise Edition are based on Siebel Analytics, a product Oracle acquired when it took over Siebel. Siebel Analytics is based on technology Siebel acquired from a company called nQuire in 1999, and is generally seen in the industry as a next-generation, highly capable business intelligence platform. Siebel Analytics 7.8 (the latest version) consists of two sets of products: Siebel Business Analytics Platform, the application server, administration and query tools that will now become the components of Oracle BI Suite Enterprise Edition Siebel Analytics Applications, a set of pre-packaged vertical applications built using the Analytics Platform tools, that will continue to be the components of Siebel Business Analytics Applications. At the time of the Oracle acquisition, Siebel had seven applications within the Siebel Analytic Applications product family, each of which has been renamed as an Oracle product. These products include Sales Analytics, Service and Contact Center Analytics, Marketing Analytics, Financial Analytics, Supply Chain and Supplier Analytics, HR or Workforce Analytics, and Real-Time Decision Solutions. Oracle 10g: Data Warehousing Fundamentals

31 Oracle 10g: Data Warehousing Fundamentals 1 - 38
Summary In this lesson, you should have learned how to: Describe the evolution of data warehouses from MIS Describe why an online transaction processing system (OLTP) is not suitable for analytical reporting Describe how extract processing for decision support querying led to data warehouse solutions Identify the role of business intelligence in today’s market Identify the BI tools and technology from Oracle Identify the business drivers for data warehouses Explain why businesses are driven to employ data warehouse technology Identify the components of Oracle E-Business Intelligence Oracle 10g: Data Warehousing Fundamentals

32 Oracle 10g: Data Warehousing Fundamentals 1 - 39
Practice 1-1 Overview This practice covers the following topics: Answering questions about data warehousing Identifying Oracle’s BI technology and tools Discussing how data warehousing meets business needs Oracle 10g: Data Warehousing Fundamentals

33 Defining Data Warehouse Concepts and Terminology
Schedule: Lecture 50 minutes, Practice 20 minutes, Total 70 minutes

34 Oracle 10g: Data Warehousing Fundamentals 1 - 44
Objectives After completing this lesson, you should be able to do the following: Identify a common, broadly accepted definition of a data warehouse Describe the differences between dependent and independent data marts Identify some of the main warehouse development approaches Define some of the operational properties and common terminology of a data warehouse Lesson Aim The previous lesson covered how BI has evolved from early management information systems to today’s enterprisewide data warehousing and decision support systems. This lesson defines data warehouse concepts and terminology. Specifically, this lesson introduces the most common definitions of a data warehouse. The lesson offers a general description of the properties of a data warehouse. The standard components and tools required to build, operate, and use a data warehouse are identified. The differences between dependent and independent data marts in relation to the data warehouse are discussed. Oracle 10g: Data Warehousing Fundamentals

35 Data Warehouse: Definition
“A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions.” — W.H. Inmon “An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data.” — Oracle’s definition of a data warehouse Data Warehouse: Definition There are a number of definitions of a data warehouse. Among the most famous and widely recognized ones is the one proposed by W. H. Inmon. A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions. Oracle’s Data Warehouse Definition The definition of a data warehouse from the Oracle Data Warehouse method describes many of the most common characteristics of a data warehouse. Subject Oriented The data in an OLTP system is stored to support a specific business process (for example, order entry, campaign management, and so on) as efficiently as possible, whereas the data in a data warehouse is stored based on common subject areas (for example, customer, product, and so on) for ease of access. That is because the complete set of questions to be posed to a data warehouse is never known. Every question the data warehouse answers spawns new questions. Thus, the focus of the design of a data warehouse is providing users easy access to the data so that current and future questions can be answered. Oracle 10g: Data Warehousing Fundamentals

36 Data Warehouse Properties
Integrated Subject oriented Nonvolatile Data Warehouse Time variant Data Warehouse Properties Bill Inmon’s definition of a data warehouse makes reference to the main properties of a data warehouse: Subject oriented Integrated Nonvolatile Time variant Oracle 10g: Data Warehousing Fundamentals

37 Oracle 10g: Data Warehousing Fundamentals 1 - 48
Subject Oriented Data is categorized and stored by business subject rather than by application. OLTP applications Equity plans Shares Insurance Loans Savings Data warehouse subject Subject Oriented Subject-oriented data is organized around major subject areas of an enterprise, and is useful for an enterprisewide understanding of those subjects. For example, a banking operational system keeps independent records of customer savings, loans, and other transactions. A warehouse pulls this independent data together to provide financial information. You can access subject-oriented data related to any major subject area of an enterprise: Customer financial information Toll calls made in the telecommunications industry Airline passenger booking information Insurance claim data The data is transformed so that it is consistent and meaningful for the warehouse. Customer financial information Oracle 10g: Data Warehousing Fundamentals

38 Oracle 10g: Data Warehousing Fundamentals 1 - 49
Integrated Data on a given subject is defined and stored once. Savings Current accounts Loans Customer Integrated In many organizations, data resides in diverse independent systems, making it difficult to integrate into one set of meaningful information for analysis. A key characteristic of a warehouse is that data is completely integrated. Data is stored in a globally acceptable manner, even when the underlying source data is stored differently. The transformation and integration process can be time consuming and costly. It requires commitment from every part of the organization, particularly top-level managers who make the decisions and allocate resources and funds. Data Consistency You must deal with data inconsistencies and anomalies before the data is loaded into the warehouse. Consistency is applied to naming conventions, measurements, encoding structures, and physical attributes of the data. OLTP applications Data Warehouse Oracle 10g: Data Warehousing Fundamentals
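
A tiny, hypothetical example of the consistency work described here (the column names and codes are invented): integration usually means mapping each source system's local encoding onto a single warehouse standard before the data is loaded.

    -- One source system stores gender as 'M'/'F', another as '1'/'2';
    -- the warehouse keeps a single agreed encoding.
    SELECT customer_id,
           CASE source_gender
                WHEN 'M' THEN 'MALE'
                WHEN '1' THEN 'MALE'
                WHEN 'F' THEN 'FEMALE'
                WHEN '2' THEN 'FEMALE'
                ELSE          'UNKNOWN'
           END AS gender
    FROM   staged_customers;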

39 Oracle 10g: Data Warehousing Fundamentals 1 - 51
Time Variant Data is stored as a series of snapshots, each representing a period of time. Time Variant Warehouse data is by nature historical; it does not usually contain real-time transactional data. Data is represented over a long time horizon, from two to ten years, compared with one to three months of data for a typical operational system. The data allows for analysis of past and present trends, and for forecasting using “what-if” scenarios. Time Element The data warehouse always contains a key element of time, such as quarter, month, week, or day, which determines when the data was loaded. The date may be a single snapshot date, such as 10-JAN-02, or a range, such as 01-JAN-02 to 31-JAN-02. Snapshots by Time Period Warehouse data is essentially a series of snapshots by time periods that do not change. Special Dates A time dimension usually contains all the dates required for analysis, including special dates such as holidays and events. Data warehouse Oracle 10g: Data Warehousing Fundamentals
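
A minimal sketch of the time element, using invented names: every warehouse row carries a snapshot date, and analysis is a comparison of those dated snapshots over the chosen horizon.

    -- Each row is tied to the period it represents.
    CREATE TABLE account_balance_snapshot (
        snapshot_date DATE          NOT NULL,  -- the key element of time
        account_id    NUMBER        NOT NULL,
        balance       NUMBER(12,2)
    );

    -- Trend analysis compares snapshots across the time horizon.
    SELECT snapshot_date, SUM(balance) AS total_balance
    FROM   account_balance_snapshot
    WHERE  snapshot_date BETWEEN DATE '2002-01-01' AND DATE '2002-01-31'
    GROUP BY snapshot_date
    ORDER BY snapshot_date;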

40 Oracle 10g: Data Warehousing Fundamentals 1 - 52
Nonvolatile Typically, data in the data warehouse is not updated or deleted. Operational Warehouse Load Nonvolatile Typically, data in the data warehouse is read-only. Data is loaded into the data warehouse for the first-time load, and then refreshed regularly. Warehouse data is accessed by business users. Warehouse operations typically involve: Loading the initial set of warehouse data (often called the first-time load) Refreshing the data regularly (called the refresh cycle) Accessing the Data After a snapshot of data is loaded into the warehouse, it rarely changes. Therefore, data manipulation is not a consideration at the physical design level. The physical warehouse is optimized for data retrieval and analysis. Refresh Cycle The data in the warehouse is refreshed—that is, snapshots are added. The refresh cycle is determined by the business users. A refresh cycle need not be the same as the grain (level at which the data is stored) of the data for that cycle. For example, you may choose to refresh the warehouse weekly, but the grain of the data may be daily. Insert, update, delete, or read Read Oracle 10g: Data Warehousing Fundamentals

41 Changing Warehouse Data
Operational databases Warehouse database First-time load Refresh Refresh Purge or archive Changing Warehouse Data The following operations are typical of a data warehouse: The initial set of data is loaded into the warehouse, often called the first-time load. This is the data by which you measure the business, and the data containing the criteria by which you analyze the business. Frequent snapshots of core data warehouse data are added (more occurrences) according to the refresh cycle and using data from the multiple source systems. Warehouse data may need to be changed for a number of reasons: The data that you are using to analyze the business may change; the data warehouse must be kept up-to-date to keep it accurate. The business determines how much historical data is needed for analysis, for example, five years’ worth. Older data is either archived or purged. Inappropriate or inaccurate data values may be deleted from or migrated out of the data warehouse. Refresh Oracle 10g: Data Warehousing Fundamentals
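
Continuing the illustrative snapshot table from the earlier sketch, the refresh and purge operations named here might look roughly like this; the dates and the five-year history window are assumptions for the example only.

    -- Refresh: append the latest snapshot from the staging area.
    INSERT INTO account_balance_snapshot (snapshot_date, account_id, balance)
    SELECT DATE '2002-02-04', account_id, balance
    FROM   staged_account_balances;

    -- Purge or archive: remove data that has aged out of the agreed history window.
    DELETE FROM account_balance_snapshot
    WHERE  snapshot_date < ADD_MONTHS(SYSDATE, -60);

Existing snapshots are otherwise never updated, which is what keeps the warehouse nonvolatile.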

42 Data Warehouse Versus OLTP
Property: OLTP / Data Warehouse
Response time: subseconds to seconds / seconds to hours
Operations: DML / primarily read-only
Nature of data: 30–60 days of detail / snapshots over time
Data organization: application / subject, time
Size: small to large / large to very large
Data sources: operational, internal / operational, internal, external
Activities: processes / analysis
Data Warehouse Versus Online Transaction Processing (OLTP) Response Time and Data Operations Data warehouses are constructed for very different reasons than online transaction processing (OLTP) systems. OLTP systems are optimized for getting data in—for storing data as a transaction occurs. Data warehouses are optimized for getting data out—for providing quick responses for analysis purposes. Because there tends to be a high volume of activity in the OLTP environment, rapid response is critical. Data warehouse applications, however, are analytical rather than operational, so response time, while still important, is less critical. Nature of Data The data stored in each database varies in nature: the data warehouse contains snapshots of data over time to support time-series analysis, whereas the OLTP system stores very detailed data for a short time, such as 30 to 60 days. Oracle 10g: Data Warehousing Fundamentals

43 Enterprisewide Data Warehouse
Large scale implementation Scopes the entire business Data from all subject areas Developed incrementally Single source of enterprisewide data Synchronized enterprisewide data Single distribution point to dependent data marts Enterprisewide Data Warehouse To summarize, an enterprisewide warehouse stores data from all subject areas within the business for analysis by end users. The scope of the warehouse is the entire business and all operational aspects within the business. An enterprisewide warehouse is normally (and should be) created through a series of incrementally developed solutions. Never create an enterprisewide data warehouse under one project umbrella; it will not work. With an enterprisewide data warehouse, all users access the warehouse, which provides: A single source of corporate enterprisewide data A single source of synchronized data in the enterprisewide warehouse for each subject area A single point for distribution of data to dependent data marts Exponential Growth and Use After they are implemented, data warehouses continue to grow in size. Each time the warehouse is refreshed, more data is added, deleted, or archived. The refresh happens on a regular cycle. Successful data warehouses grow very quickly, perhaps to a magnitude of gigabytes a month and terabytes over time. When the success of the warehouse is proven, its use increases dramatically and it often grows faster than expected. Oracle 10g: Data Warehousing Fundamentals

44 Data Warehouses Versus Data Marts
Property: Data Warehouse / Data Mart
Scope: enterprise / department
Subjects: multiple / single subject, line of business (LOB)
Data sources: many / few
Implementation time: months to years / months
Data Warehouse Versus Data Mart Definition A data mart is a subset of data warehouse fact and summary data that provides users with information specific to their departmental requirements. It can be a subject-oriented data warehouse for functional or departmental information needs, it can be a geographical subset for local analysis of local material, or it can be a mini enterprisewide data warehouse combining data from multiple subject areas and acting as a kernel to feed the enterprise warehouse. Scope A data warehouse deals with multiple subject areas and is typically implemented and controlled by a central organizational unit such as the Corporate Information Technology group. It is often called a central or enterprise data warehouse. Subjects A data mart is a departmental form of a data warehouse designed for a single line of business (LOB) or functional area, such as sales, finance, or marketing. Oracle 10g: Data Warehousing Fundamentals

45 Oracle 10g: Data Warehousing Fundamentals 1 - 59
Dependent Data Mart Data marts Operational systems Flat files Data Warehouse Legacy data Marketing Operations data Sales Marketing Sales Finance HR Dependent Data Marts Data marts can be categorized into two types: dependent and independent. The categorization is based primarily on the data source that feeds the data mart. Dependent Data Mart Dependent data marts have the following characteristics: The source is the warehouse. Dependent data marts rely on the data warehouse for content. The extraction, transformation, and loading (ETL) process is easy. Dependent data marts draw data from a central data warehouse that has already been created. Thus, the main effort in building a mart, the data cleansing and extraction, has already been performed. The dependent data mart simply requires data to be moved from one database to another. The data mart is part of the enterprise plan. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access to data relevant to a specific department. External data External data Finance Oracle 10g: Data Warehousing Fundamentals
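
Because the cleansing and integration have already happened in the warehouse, populating a dependent mart can be little more than copying a relevant slice of data. The sketch below is purely illustrative (the names are invented), not a prescribed loading method.

    -- Build a departmental sales mart from already-cleansed warehouse tables.
    CREATE TABLE sales_mart_monthly AS
    SELECT t.calendar_month,
           p.product_class,
           SUM(f.sales_amount) AS sales_amount
    FROM   sales_fact  f
    JOIN   time_dim    t ON t.time_key    = f.time_key
    JOIN   product_dim p ON p.product_key = f.product_key
    GROUP BY t.calendar_month, p.product_class;

An independent mart, by contrast, has to repeat the full extraction and cleansing against the operational sources itself.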

46 Oracle 10g: Data Warehousing Fundamentals 1 - 60
Independent Data Mart Operational systems Flat files Legacy data Sales or marketing Operations data Independent Data Marts Independent data marts are stand-alone systems built from scratch that draw data directly from operational or external sources of data. Independent data marts have the following characteristics: The sources are operational systems and external sources. The ETL process is difficult. Because independent data marts draw data from unclean or inconsistent data sources, efforts are directed toward error processing and integration of data. The data mart is built to satisfy analytical needs. The creation of independent data marts is often driven by the need for a quick solution to analysis demands. Instructor Note Many dependent data marts still obtain some of their internal and external data outside of the data warehouse. Ask students what might be the danger of having too many independent data marts. Mention that independent data marts are often not seen as a good solution and should probably be avoided for a number of reasons (for example, different answers to the same business question from multiple data marts, duplication of ETL, and so on). External data External data Oracle 10g: Data Warehousing Fundamentals

47 Typical Data Warehouse Components
Source systems Staging area Presentation area Access tools Legacy Data Warehouse External ODS Operational Data marts Typical Data Warehouse Components Source Systems Source systems may be in the form of data existing in: Production operation systems Archives Internal files not directly associated with company operational systems, such as individual spreadsheets and workbooks External data from outside the company Data Staging Area The data staging area is analogous to a construction site. This is where much of the data cleansing and preparation take place before data is loaded to the data warehouse. It is both a storage area and a set of processes commonly known as extraction, transformation, and loading (ETL). It is off limits to business users and it is not suitable for querying and reporting. A staging area is a typical requirement of warehouse implementation. It may be an operational data store (ODS) environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools. Metadata repository Oracle 10g: Data Warehousing Fundamentals

48 Warehouse Development Approaches
“Big bang” approach Incremental approach: Top-down incremental approach Bottom-up incremental approach Warehouse Development Approaches The most challenging aspect of data warehousing lies not in its technical difficulty, but in choosing the best approach to data warehousing for your company’s structure and culture, and dealing with the organizational and political issues that will inevitably arise during implementation. Among the different approaches to developing a data warehouse are: “Big bang” approach Incremental approach Top-down incremental approach Bottom-up incremental approach Oracle 10g: Data Warehousing Fundamentals

49 Oracle 10g: Data Warehousing Fundamentals 1 - 64
“Big Bang” Approach Analyze enterprise requirements. Build enterprise data warehouse. Report in subsets or store in data marts. “Big Bang” Approach Historically, IT departments attempted to provide enterprisewide data warehouse implementations in a single project approach. Data warehouse development is a huge task, and it is a mistake to assume that the solution can be built all at once. The time required to develop the warehouse often means that user requirements and technologies change before the project is completed. In this approach, you perform the following: Analyze the entire information requirement for the organization. Build the enterprise data warehouse to support these requirements. Build access, as required, either directly or by subsetting to data marts. Oracle 10g: Data Warehousing Fundamentals

50 Oracle 10g: Data Warehousing Fundamentals 1 - 66
Top-Down Approach Analyze requirements at the enterprise level. Develop conceptual information model. Identify and prioritize subject areas. Complete a model of selected subject area. Map to available data. Perform a source system analysis. Implement base technical architecture. Establish metadata, extraction, and load processes for the initial subject area. Create and populate the initial subject area data mart within the overall warehouse framework. Top-Down Incremental Approach Advantages This approach has the following advantages: Provides a relatively quick implementation and payback. Typically, the scoping, definition study, and initial implementation are scaled down so that they can be completed in six to seven months. Offers significantly lower risk because it avoids being as analysis heavy as the “big bang” approach Emphasizes high-level business needs Achieves synergy among subject areas. Maximum information leverage is achieved as cross-functional reporting and a single version of the truth are made possible. Disadvantages This approach has the following disadvantages: Requires an increase in up-front costs before the business sees any return on their investment Is difficult to define the boundaries of the scoping exercise if the business is global May not be suitable unless the client needs cross-functional reporting Oracle 10g: Data Warehousing Fundamentals

51 Oracle 10g: Data Warehousing Fundamentals 1 - 67
Bottom-Up Approach Define the scope and coverage of the data warehouse and analyze the source systems within this scope. Define the initial increment based on the political pressure, assumed business benefit, and data volume. Implement base technical architecture and establish metadata, extraction, and load processes as required by increment. Create and populate the initial subject areas within the overall warehouse framework. Bottom-Up Incremental Approach This approach is similar to the top-down approach but the emphasis is on the data rather than the business benefit. Here, IT is in charge of the project either because IT wants to be in charge or the business has deferred the project to IT. Advantages This approach has the following advantages: This is a “proof of concept” type of approach; therefore, it is often appealing to IT. It is easier to get IT to choose this approach because it is focused on IT. Disadvantages This approach has the following disadvantages: Because the solution model is typically developed from source systems and these source systems will have encapsulated within them the current business processes, the overall extensibility of the model will be compromised. IT staff is often the last to know about business changes—IT could be designing something that will be out-of-date before they complete its delivery. As the framework of definition in this approach tends to be much narrower, often a significant amount of reengineering work is required for each increment. Oracle 10g: Data Warehousing Fundamentals

52 Incremental Approach to Warehouse Development
Multiple iterations Shorter implementations Validation of each phase Increment 1 Strategy Definition Analysis Design Build Iterative Incremental Approach The incremental approach manages the growth of the data warehouse by developing incremental solutions that comply with the full-scale data warehouse architecture. Rather than starting by building an entire enterprisewide data warehouse as a first deliverable, start with just one or two subject areas, implement them as scalable data marts, and roll them out to your end users. Then, after observing how users are actually using the warehouse, add the next subject area or the next increment of functionality to the system. This is also an iterative process. It is this iteration that keeps the data warehouse in line with the needs of the organization. Benefits Delivers a strategic data warehouse solution through incremental development efforts Provides extensible, scalable architecture Supports the information needs of the enterprise organization Quickly provides business benefit and ensures a much earlier return on investment Allows the data warehouse to be built one subject or application area at a time Allows the construction of an integrated data mart environment Production Oracle 10g: Data Warehousing Fundamentals

53 Data Warehousing Process Components
Methodology Architecture Extraction, transformation, and loading (ETL) Implementation Operation and support Components of Data Warehouse Design and Implementation Each of the components listed below are discussed in the following pages: Methodology Architecture Extraction, transformation, and loading (ETL) Implementation Operation and support Oracle 10g: Data Warehousing Fundamentals

54 Oracle 10g: Data Warehousing Fundamentals 1 - 71
Methodology Ensures a successful data warehouse Encourages incremental development Provides a staged approach to an enterprisewide warehouse that is: Safe Manageable Proven Recommended Methodology A methodology is a set of detailed steps or procedures to accomplish a defined goal. Employing a methodology for the development of any system is always important. In a warehouse environment, it is even more so. The warehouse is such a big investment in every resource you can think of that its success is essential. To avoid failure of the warehouse implementation, you must employ a methodology and keep to it. Failure is generally caused in two ways. The first cause of failure is that the warehouse is not delivered on time, and the second is that the warehouse fails to deliver what the business users need. A good method helps to manage expectations by identifying clear deliverables. However, do not become a slave to the steps of a methodology. Practice methodology with focus on results, not on activities. This achieves consistency of deliverables while recognizing differences in individual working styles. Oracle 10g: Data Warehousing Fundamentals

55 Oracle 10g: Data Warehousing Fundamentals 1 - 72
Architecture “Provides the planning, structure, and standardization needed to ensure integration of multiple components, projects, and processes across time.” “Establishes the framework, standards, and procedures for the data warehouse at an enterprise level.” — The Data Warehousing Institute Architecture From a business and technology view, an architecture defines a collection of components and specifies their relationships. The goal of the architecture activities is a single, integrated data warehouse meeting business information needs. Some of the components of a data warehousing architecture are: Data sources Data acquisition Data management Data distribution Information directory Data access tools Oracle 10g: Data Warehousing Fundamentals

56 Extraction, Transformation, and Loading (ETL)
“Effective data extract, transform, and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.” — DM Review Extraction, Transformation, and Loading (ETL) These processes are fundamental to the creation of quality information in the data warehouse. You take data from source systems; clean, verify, validate, and convert it into a consistent state; and then move it into the warehouse. Extraction: The process of selecting specific operational attributes from the various operational systems Transformation: The process of integrating, verifying, validating, cleaning, and time stamping the selected data into a consistent and uniform format for the target databases. Rejected data is returned to the data owner for correction and reprocessing. Loading: The process of moving data from an intermediate storage area into the target warehouse database ETL Tools Specialized tools make these tasks comparatively easy to set up, maintain, and manage. Specialized tools can be an expensive option, which motivates many warehouses to employ customized ETL programs written in COBOL, C++, PL/SQL, or other programming languages or application development tools. Oracle Warehouse Builder (OWB) is Oracle’s ETL tool. More discussion about OWB can be found in subsequent lessons. Source Staging area Target Oracle 10g: Data Warehousing Fundamentals

57 Implementation Data Warehouse Architecture
e.g., Incremental Implementation Implementation Increment 1 Increment 2 . Implementation Implementation deliverables: Analysis: Confirm and refine requirements. Design: Gather specifications and prepare the blueprint for the data warehouse or data mart. Construction: Put in place and test the data warehouse or data mart and all required support tools. Deployment: The data warehouse or data mart is accepted for use in the business. Increment n Oracle 10g: Data Warehousing Fundamentals

58 Oracle 10g: Data Warehousing Fundamentals 1 - 75
Operation and Support Data access and reporting Refreshing warehouse data Monitoring Responding to change Operation and Support Present warehouse data to the end user in a meaningful and business-specific manner, and select query tools that are tailored to the users’ requirements for information. Periodically refresh the warehouse data. Respond to changing data sources, requirements, and technology. Monitor, manage, and tune the warehouse environment. Oracle 10g: Data Warehousing Fundamentals

59 Phases of the Incremental Approach
Strategy Definition Analysis Design Build Production Strategy Definition Analysis Design Build Production Increment 1 Phases of the Incremental Approach Effective and efficient data warehouse project management involves the use of project phases. Project phases identify the tasks to be completed, the resources required, the directing and reporting efforts, and the quality assurance required before moving on to the next phase. Project phasing is a management technique used to focus project teams toward a short-term goal and to communicate progress to senior management. Strategy Define the business objectives and purpose of the data warehouse. Define the data warehouse team and executive sponsor. Define success measurements. Definition Define the scope and objectives for the incremental development effort. Identify the technical and data warehouse architecture. Outline data access methods. Oracle 10g: Data Warehousing Fundamentals

60 Strategy Phase Deliverables
Business goals and objectives Data warehouse purpose, objectives, and scope Enterprise data warehouse logical model Incremental milestones Source systems data flows Subject area gap analysis Identifying Warehouse Strategy Phase Deliverables For each of the data warehouse project phases, there are deliverables. The deliverables for the strategy phase focus on defining the business objectives and purpose of the data warehouse solution. The purpose and objectives for the total data warehouse solution are essential to setting and managing expectations. The strategy phase also clearly defines the data warehouse team and the executive sponsor. Business goals and objectives: Documents the strategic business goals and objectives Data warehouse purpose, objectives, and scope: Documents the purpose and objectives of the enterprise data warehouse, its scope, and how it is intended to be used Enterprise data warehouse logical model: High-level, logical information model that diagrams the major entities and relationships for the enterprise Incremental milestones: Documents a realistic scope of the data warehouse, acceptable delivery milestones for each increment, and source data availability Oracle 10g: Data Warehousing Fundamentals

61 Strategy Phase Deliverables
Data acquisition strategy Data quality strategy Metadata strategy Data access environment Training strategy Identifying Warehouse Strategy Phase Deliverables (continued) Source system data flows: Outlines the flow of the source system data, where it originates, the flow of data between business functions and source systems, degree of reliability, and data volatility Subject area gap analysis: Documents the variance between the information requirements and the ability of the data sources to provide that information Data acquisition strategy: Documents the approach to extracting, transforming, and loading data from the source systems to the target environments for the initial load and subsequent refreshes Data quality strategy: Outlines the approach for data management, error and exception handling, data cleansing, and the audit and control of the data Metadata strategy: Documents the strategy of capturing, integrating, and accessing metadata for all components of the warehouse environment Data access environment: Documents the identification, selection, and design of tools that support end-user access to the warehouse data Training strategy: Outlines the development and end-user training requirements, identifies the technical and business personnel requiring training, and establishes time frames for executing the training plans Oracle 10g: Data Warehousing Fundamentals

62 Sales History (SH) Schema
CUSTOMERS 55500 rows CHANNELS 5 rows TIMES 1826 rows PRODUCTS 72 rows SALES rows COUNTRIES 23 rows COSTS rows PROMOTIONS rows Sales History (SH) Schema Most of the demonstrations and the interactive viewlets provided with this course are based on the Sales History (SH) sample schema. The SH schema is shipped with the Oracle database and is designed to allow demonstrations with larger amounts of data. The SH schema contains data about a global electronics retailer that sells several categories and subcategories of products, such as computer hardware, peripherals, cameras, camcorders, and software. The products are sold through several channels, including the Internet. The data model of the SH schema is shown here. Oracle 10g: Data Warehousing Fundamentals
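As a quick illustration, a typical analytical query against the SH schema might look like the following (a minimal sketch; verify the column names against the schema shipped with your database):

-- Total sales by product category and calendar year.
SELECT p.prod_category,
       t.calendar_year,
       SUM(s.amount_sold) AS total_sales
FROM   sales s
JOIN   products p ON s.prod_id = p.prod_id
JOIN   times    t ON s.time_id = t.time_id
GROUP  BY p.prod_category, t.calendar_year
ORDER  BY p.prod_category, t.calendar_year;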

63 Introducing the Case Study: Roy Independent School District (RISD)
In January 2000, RISD and Oracle representatives met to discuss the details about the RISD Data Warehouse (RISD DW) project: Oracle to develop technical architecture for development, test, and production instance of RISD Data Warehouse Oracle to develop the logical and physical data models RISD responsible for data cleansing ETL process to be designed RISD DW project to support Student Information System (SIS) Reports to be created based on subject areas, and to be integrated with Portal Data access security to be implemented based on user roles Introducing the Case Study: Roy Independent School District (RISD) In January 2000, the RISD and Oracle representatives met to discuss the details about the Data Warehouse project. The RISD Data Warehouse environment will provide decision makers throughout the District with information to help improve student achievement. Users will access the system via Windows-based Internet-capable computers, accommodating the needs of both computer novices and experts. Discoverer report data will be available online and can be downloaded into local applications where appropriate (for example, spreadsheets and PC databases) to perform additional analysis or for integration with local data. User groups will be restricted to accessing information associated with their responsibilities, needs, and skills. The majority of end users will view simple reports. Dashboard reports will provide summarized tabular and/or graphical information to various user groups. The system will be designed so that RISD can develop relevant reports as new data becomes available. Some power users will create and run new reports directly, whereas some end users will view the standard and parameterized reports. This project will use Oracle Data Warehousing Methodology (DWM) and Project Management Methodology (PJM). The goal of the Data Warehouse is to supply useful, consistent information to decision makers at various levels of the District via a Web-based interface in order to work toward improving student achievement, projecting trends in student achievement, and implementing or refining instructional interventions and programs in a timely manner. Oracle 10g: Data Warehousing Fundamentals

64 Oracle 10g: Data Warehousing Fundamentals 1 - 83
Summary In this lesson, you should have learned how to: Identify a common, broadly accepted definition of a data warehouse Describe the differences between dependent and independent data marts Identify some of the main warehouse development approaches Recognize some of the operational properties and common terminology of a data warehouse Oracle 10g: Data Warehousing Fundamentals

65 Oracle 10g: Data Warehousing Fundamentals 1 - 84
Practice 2-1: Overview This practice covers the following topics: Answering questions regarding the data warehousing concept and terminology Discussing some of the data warehouse concepts and terminology Discussing the case study to understand the requirements of the system Oracle 10g: Data Warehousing Fundamentals

66 Business, Logical, and Dimensional Modeling
Schedule: Timing Topic 60 minutes Lecture 30 minutes Practice 90 minutes Total

67 Oracle 10g: Data Warehousing Fundamentals 1 - 87
Objectives After completing this lesson, you should be able to do the following: Discuss data warehouse environment data structures Discuss data warehouse database design phases: Defining the business model Defining the logical model Defining the dimensional model Overview This lesson examines the role of data modeling in a data warehousing environment. The lesson presents a very high-level overview of warehouse modeling steps. You consider the different types of models that can be employed, such as the star schema. Tools that are available for warehouse modeling are introduced. Performance tuning techniques and Oracle’s data access tools are discussed. Oracle 10g: Data Warehousing Fundamentals

68 Data Warehouse Modeling Issues
Among the main issues that data warehouse data modelers face are: Different data types Many ways to use warehouse data Many ways to structure the data Multiple modeling techniques Planned replication Large volumes of data Data Warehouse Modeling Issues Warehouse data modeling is more complex than application (OLTP) data modeling. Some of the main differences between data warehousing modeling and application modeling are: Different data types The data warehouse includes many different types of data that need analysis and design work. Source data, target data, and metadata are components of the data warehouse. Modeling techniques may vary between these different types of data. Many ways to use warehouse data Warehouse data may be used in many ways from simple SQL-based query access to multidimensional online analytical processing (OLAP) uses. A warehouse database must be designed to accommodate such planned types of uses. Many ways to organize the data Warehouse data can be structured by using entity relationship modeling (ERM) or dimensional modeling (DM), or a combination of the two. To choose the right structure, you must know both the intended uses and the characteristics of the warehouse data. Oracle 10g: Data Warehousing Fundamentals

69 Data Warehouse Environment Data Structures
The data modeling structures that are commonly found in a data warehouse environment are: Third normal form (3NF) Star schema Snowflake schema Data Warehouse Environment Data Structures Warehouse environment table structures can take on a number of forms. The data modeling structures that are commonly encountered in a data warehouse environment are: Third normal form (3NF) Star schema Snowflake schema Note: Today, most of the very large data warehouse schemas are neither star schema nor 3NF schemas, but instead share characteristics of both schemas; these are referred to as hybrid schema models. Normalized structures store the greatest amount of data in the least amount of space. Entity relationship modeling (ERM) also seeks to eliminate data redundancy. This is immensely beneficial to transaction processing, OLTP systems. Dimensional modeling (DM) is a design that presents the data in an intuitive manner and allows for high-performance access. For these two reasons, dimensional modeling, such as star and snowflake schemas, has become the standard design for data marts and data warehouses. Oracle 10g: Data Warehousing Fundamentals

70 Oracle 10g: Data Warehousing Fundamentals 1 - 91
Star Schema Model Product Table Product_id Product_desc,... Store Table Store_id District_id,... Sales Fact Table Product_id Store_id Item_id Day_id Sales_amount Sales_units, ... Central fact table Denormalized dimensions Time Table Day_id Month_id Year_id,... Item Table Item_id Item_desc,... Star Schema Model A star schema model can be depicted as a simple star: a central table contains fact data, and multiple tables radiate out from it, connected by database primary and foreign keys. Unlike other database structures, a star schema has denormalized dimensions. A star model: Is easy for users to understand because the structure is so simple and straightforward Provides fast response to queries with optimization and reductions in the physical number of joins required between fact and dimension tables Contains simple metadata Is supported by many front-end tools Is slow to build because of the level of denormalization The star schema is emerging as the predominant model for data warehouses or data marts. Oracle 10g: Data Warehousing Fundamentals
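A star query against such a model joins the central fact table to the dimensions it needs, for example (a sketch using simplified, hypothetical table names based on the slide):

-- Sales by district and month: one fact table joined to two dimension tables.
SELECT st.district_id,
       tm.month_id,
       SUM(f.sales_amount) AS sales_amount,
       SUM(f.sales_units)  AS sales_units
FROM   sales_fact f
JOIN   stores st ON f.store_id = st.store_id
JOIN   times  tm ON f.day_id   = tm.day_id
GROUP  BY st.district_id, tm.month_id;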

71 Snowflake Schema Model
Product Table Product_id Product_desc Store Table Store_id Store_desc District_id District Table District_id District_desc Sales Fact Table Item_id Store_id Product_id Week_id Sales_amount Sales_units Time Table Week_id Period_id Year_id Item Table Item_id Item_desc Dept_id Dept Table Dept_id Dept_desc Mgr_id Mgr Table Dept_id Mgr_id Mgr_name Snowflake Schema Model According to Ralph Kimball, “A dimension is said to be snowflaked when the low cardinality fields in the dimension have been moved to separate tables and linked back into the original table with artificial keys.” A snowflake model is closer to an entity relationship diagram than the classic star model because the dimension data is more normalized. Developing a snowflake model means building class hierarchies out of each dimension (normalizing the data). One of the major reasons why the star schema model has become more predominant than the snowflake model is its query performance advantage. In a warehouse environment, the snowflake’s quicker load performance is much less important than its slower query performance. Oracle 10g: Data Warehousing Fundamentals

72 Snowflake Schema Model
Can be used directly by some tools Is more flexible to change Provides for speedier data loading Can become large and unmanageable Degrades query performance Has more complex metadata Country State County City Snowflake Schema Model (continued) A snowflake model: Results in severe performance degradation because of its greater number of table joins Provides a structure that is easier to change as requirements change Is quicker at loading data into its smaller normalized tables, compared to loading into a star schema’s larger denormalized tables Allows using history tables for changing data, rather than level fields (indicators) Has a complex metadata structure that is harder for end-user tools to support Besides the star and snowflake schemas, there are other models that can be considered. Constellation A constellation model (also called galaxy model) simply comprises a series of star models. Constellations are a useful design feature if you have a primary fact table and summary tables of a different dimensionality. It can simplify design by allowing you to share dimensions among many fact tables. Third Normal Form Warehouse Some data warehouses consist of a set of relational tables that have been normalized to third normal form (3NF). Their data can be directly accessed by using SQL code. They may have more efficient data storage at the price of slower query performance due to extensive table joins. Some large companies build a 3NF central data warehouse feeding dependent star data marts for specific lines of business. Oracle 10g: Data Warehousing Fundamentals

73 Data Warehouse Design Phases
Phase 1: Defining the business model Phase 2: Defining the logical model Phase 3: Defining the dimensional model Phase 4: Defining the physical model Data Warehouse Design Phases Several methods for designing a data warehouse have been published over the past years. Although these methods define certain terms differently, all include the same general tasks. These tasks have been grouped into four phases: Defining the business model: A strategic analysis is performed to identify business processes for implementation in the warehouse. Then, a business requirements analysis is performed, where the business measures and business dimensions for each business process are identified and documented. Defining the logical model: In the logical design, you look at the business processes and identify the logical relationships among the objects. The logical design is more conceptual and abstract than the physical design. Various methods of data modeling exist, each using a variety of diagrammatic conventions and tools. The most popular approach is called the entity-relationship (ER) approach, developed by Peter Chen in 1976. Defining the dimensional model: The business model is transformed into a dimensional model. Warehouse schema tables and table elements are defined, relationships between schema tables are established, and sources for warehouse data elements are recorded. Oracle 10g: Data Warehousing Fundamentals

74 Phase 1: Defining the Business Model
Performing strategic analysis Creating the business model Documenting metadata Phase 1: Defining the Business Model The first phase, business modeling, includes at least three tasks, each with associated deliverables. These tasks include strategic analysis, business model creation, and metadata document creation. Strategic Analysis: The primary business process (or processes) is selected for implementation in the warehouse. Business Model Creation: The business (conceptual) model is developed by uncovering detailed business requirements for a specific business process and verifying the existence of source data needed to support the business analysis requirements. Metadata Creation: The metadata is created in this first phase of the design process. The results of the business model are summarized in the metadata tool, and this information serves as the essential resource for subsequent phases in the design process. Oracle 10g: Data Warehousing Fundamentals

75 Performing Strategic Analysis
Identify crucial business processes. Understand business processes. Prioritize and select the business processes to implement. High Business benefit Low Performing Strategic Analysis Performed at the enterprise level, strategic analysis identifies, prioritizes, and selects the major business processes (also called business events or subject areas) that are most important to the overall corporate strategy. Strategic analysis includes the following steps: Identify the business processes that are most important to the overall corporate strategy. Examples of business processes are orders, invoices, shipments, inventory, sales, account administration, and the general ledger. Understand the business processes by drilling down on the dimensions that characterize each business process. The creation of a business process matrix can aid in this effort. Prioritize and select the business process to implement in the warehouse, based on which one will provide the quickest and largest return on investment (ROI). Low Feasibility High Oracle 10g: Data Warehousing Fundamentals

76 Creating the Business Model
Defining business requirements Determining granularity Documenting metadata Creating the Business Model Now that the strategic business process or processes have been identified for implementation in the warehouse, a business model is created. Defining Business Requirements: The business model is created by defining business analysis requirements for each selected process. Here, you meet with business managers and business analysts who are directly responsible for the specific business processes in order to: Define and document examples of their business measures Create a detailed listing of the analytic parameters for each measure Identify the granularity required to satisfy the analysis requirements Clarify business definitions and document business rules Verifying Data Sources: Concurrently, you must perform an information technology (IT) data audit, a systematic exploration of the underlying source systems to verify that the data required to support the business requirements is available. Oracle 10g: Data Warehousing Fundamentals

77 Business Requirements Drive the Design Process
Primary input Secondary input Interviews to collect business requirements Business Requirements Drive the Design Process The entire scope of the data warehouse initiative must be driven by business requirements. Business requirements determine: What data must be available in the warehouse How data is to be organized How often data is updated End-user application templates Maintenance and growth Primary Input The primary input for business requirements is interviews with the business users and analysts who are responsible for driving, measuring, and analyzing the business process. Prioritize the business requirements to narrow the focus. A primary goal of strategic interviews is to identify the organization’s crucial business processes. Strategic analysis interviews reveal the core business processes that are potential candidates for the warehouse. Information requirements as defined by the business people—the end users—will lay the foundation for the data warehouse design and content. Existing metadata Production ERD model Research Oracle 10g: Data Warehousing Fundamentals

78 Using a Business Process Matrix
Promotions Channels Products Times (Date) Inventory Customers Returns Sales Business Processes Business Dimensions Using a Business Process Matrix A useful tool to understand and quantify business processes is the business process matrix (also called the process/dimension matrix). This matrix establishes a blueprint for the data warehouse database design to ensure that the design is extensible over time. The business process matrix aids in the strategic analysis task in two ways: Helps identify high-level analytical information that is required to satisfy the analytical needs for each business process, and serves as a method of cross-checking whether you have all of the required business dimensions for each business process. Helps identify common business dimensions shared by different business processes. Business dimensions that are shared by more than one business process should be modeled with particular rigor, so that the analytical requirements of all processes that depend on them are supported. This is true even if one or more of the potential business processes are not selected for the first increment of the warehouse. Model the shared business dimensions to support all processes, so that later increments of the warehouse will not require a redesign of these crucial dimensions. A sample business process matrix is developed and shown in the slide, with business processes across the top and dimensions down the column on the very left side. Sample of business process matrix Oracle 10g: Data Warehousing Fundamentals

79 Identifying Business Measures and Dimensions
The attribute varies continuously: Sales Quantity sold Units sold Cost Measures The attribute is perceived as constant or discrete: Products Promotions Customers Countries Channels Times Dimensions Identifying Business Measures and Dimensions Measures Business measures are the success metrics of a business process, and are the core data elements that must be tracked in the warehouse. A measure (or fact) contains a numeric value—typical examples are gross sales, total cost, profit, margin, or quantity sold. A measure can be additive or partially additive across dimensions. Dimensions Business dimensions are the analytic parameters that categorize business processes for analysis purposes. That is, a dimension is an attribute by which measures can be characterized or analyzed. Business dimensions provide the metadata definitions for the data warehouse. Some examples of dimensions are customers, countries, products, channels, times, and so on. Several distinct dimensions, combined with facts, enable you to answer business questions. For example, a business manager would like to analyze the sales by customer, by channel, by year, and so on. An example is shown in the slide for a customer sales process. Ultimately, the business requirements document should contain a list of the business measures and a detailed list of all dimensions, down to the lowest level of detail for each dimension. Oracle 10g: Data Warehousing Fundamentals

80 Determining Granularity
TIMES PRODUCTS YEAR? Product category? QUARTER? Product subcategory? MONTH? Product name? WEEK? Product desc? DAY? Product item? Determining Granularity When gathering more specific information about dimensions, it is also important to understand the level of detail that is required for analysis and business decisions. Granularity is defined as the level of summarization (or detail) that will be maintained by your warehouse. The greater the level of detail, the finer is the level of granularity. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies, which are extremely helpful in aggregating the data at various levels. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level, and to the year level. During your interviews, you should discern the level of detail that users need for near-term future analysis. After that is determined, identify whether there is a lower level of grain available in the source data. If so, you should design for at least one grain finer, and perhaps even to the lowest level of grain. Remember that you can always aggregate upward, but you cannot decompose the aggregate lower than the data that is stored in the warehouse. Oracle 10g: Data Warehousing Fundamentals

81 Identifying Business Definitions and Rules
Customer Credit Rating | Meaning
A+ | 0 bad checks or bank credit failures
A | 1 bad check or bank credit failure
B | 2 bad checks or bank credit failures
C | 3 or more bad checks or bank credit failures
Order Rule 1: A customer with a credit rating of A or above will receive a 10% discount on any order totaling $500 (U.S.) or more.
Order Rule 2: A customer with a credit rating of A or above will receive a 5% discount on any order totaling $250 (U.S.) or more but less than $500.
Order Rule 5: A customer with a credit rating of C will not receive any discounts on purchases.
Identifying Business Definitions and Rules Business model elements should also be documented with agreed-upon business rules and definitions. One example is depicted in the slide. When you are conducting interviews, pay attention to vocabulary. Often, the same term has multiple meanings, or different terms are used to describe the same thing. Business vocabulary standardization issues become especially apparent as you conduct interviews horizontally across the organization. You should: Record business definitions during your interviews, making notes of any inconsistencies in vocabulary. These inconsistencies must be resolved in follow-up sessions with cross-department decision makers. Require exact definitions for business dimension terminology because this has a direct impact on the grain and structure of the eventual dimensional model Document time retention requirements for online, near-line (disk), and archive (tape) storage for measures Listen for and document implicit business rules during your interviews. The business user interview is one of the best forums for gathering these rules, which serve as direct inputs for the development of ETL processes. Oracle 10g: Data Warehousing Fundamentals

82 Oracle 10g: Data Warehousing Fundamentals 1 - 106
Documenting Metadata Documenting metadata should include: Documenting the design process Documenting the development process Providing a record of changes Recording enhancements over time Documenting Metadata In general, metadata is described as “data about data.” Warehouse metadata is descriptive data about warehouse data and the processes that are used in creating the warehouse. It contains information used to map the data between the source systems and the warehouse, and additionally, contains transformation rules. The metadata repository (or document) should be created in the business modeling phase and used to record the first layer of business metadata. These business modeling results are summarized within the metadata and serve as the essential resource for subsequent phases in the design process. The metadata repository eventually contains detailed descriptions about the sources, content, structure, and physical attributes of the data warehouse. It is important to identify the business users who are the stewards or caretakers of the metadata. This keeps the business involved in the process while providing a clear, coherent understanding of metadata usage and definitions. Oracle 10g: Data Warehousing Fundamentals

83 Business Metadata Elements
Name of the measure Business dimensions Dimension attributes Sample data Business definition and rules Business Metadata Elements The first layer of business metadata is recorded during the business modeling phase. Each business process implemented in the warehouse should be documented with the following entries: Measure: The name of the measure (such as dollar sales, units sold, and so on) Business dimension (analytic parameter): The name of the high-level business dimension (such as product, time, customer, and so on) Dimension (parameter) attribute: The name of the business dimension attribute Sample data: An example of the source data for the dimension attribute or measure Business definition and rules: For business dimension attributes, a definition of the attribute in business terms, along with any business rules directly associated with that parameter (later, business rules are recorded in the ETL metadata for data transformation purposes) For measures, a definition of the measure and time retention requirements for online, near-line (disk), and archive (tape) storage Source system verification: Verify the source system file or database for each measure and dimension attribute. If the source system is a database, attempt to document the database table and column for each attribute. Source data expert: The name and contact information of the data source expert Oracle 10g: Data Warehousing Fundamentals

84 Metadata Documentation Approaches
Automated Data modeling tools ETL tools End-user tools Manual Metadata Documentation Approaches Regardless of the tools that you use to create a data warehouse, metadata must play a central role in the design, development, and ongoing evolution of the warehouse. Automated: There are three types of tools that automatically create and store metadata: Data modeling tools record metadata information as you perform modeling activities with the tool. ETL tools can also generate metadata. These tools also use the metadata repository as a resource to generate build and load scripts for the warehouse. End-user tools generally require the administrator to create a metadata layer that describes the structure and content of the data warehouse for that specific tool. Each of the tools used in your warehouse environment might generate its own set of metadata. The management and integration of different metadata repositories is one of the biggest challenges for the warehouse administrator. Manual: You can also create and manage your own metadata repository, using a tool that does not dynamically interface with the warehouse, such as a spreadsheet, word processor document, or custom database. The manual approach provides flexibility; however, it is severely hampered by the labor-intensive, ongoing maintenance of the metadata content. Oracle 10g: Data Warehousing Fundamentals

85 Phase 2: Designing the Logical Model
Entity Relationship Modeling (ERM) uses entity relationship diagram (ERD): Each CUSTOMER belongs to one COUNTRY. Each COUNTRY can have many CUSTOMERS. Countries Country_id Name Region ISO_Code … Customers Cust_name Country_Id Cust_Addr Belongs to have Designing the Logical Model One of the popular techniques you can use to model your organization’s logical information requirements is entity-relationship modeling (ERM). An entity relationship diagram (ERD) is a visual representation of the information requirements of an ERM system. This diagram depicts the entities, their attributes, and the relationships between the entities. The three components of an ERD are defined as follows: Entity is a thing of significance about which information needs to be known or held. Relationship is a significant way in which two things are associated. Attribute is a piece of information that serves to qualify, identify, classify, quantify, or express the state of an entity. ERD communicates the data requirements, describes data, shows links between data, and serves as a basis for database design. Relationships between entities: A relationship signifies the association between two entities. Each end of the relationship represents how the entities are related. Name of the relationship: Each end of the relationship has a name that enables you to understand the relationship easily—for example, a CUSTOMER belongs to a COUNTRY. The degree of a relationship: This indicates how many entity instances may exist at one end of the relationship for each entity instance at the other end. A crow’s foot signifies a relationship end of degree many, whereas a single point signifies a relationship end of degree one. For example, you can read the relationship between Countries and Customers as a “one to many” relationship. Entity Attributes Relationship Oracle 10g: Data Warehousing Fundamentals
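Translated into tables, the one-to-many relationship shown above could be declared as follows (a minimal DDL sketch; the data types and the customer identifier are assumptions, not taken from the slide):

CREATE TABLE countries (
  country_id  NUMBER        PRIMARY KEY,
  name        VARCHAR2(50)  NOT NULL,
  region      VARCHAR2(30),
  iso_code    VARCHAR2(3)
);

CREATE TABLE customers (
  cust_id     NUMBER        PRIMARY KEY,   -- assumed identifier, not shown on the slide
  cust_name   VARCHAR2(100) NOT NULL,
  cust_addr   VARCHAR2(200),
  country_id  NUMBER        NOT NULL
              REFERENCES countries(country_id)  -- each customer belongs to one country
);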

86 Phase 3: Defining the Dimensional Model
Identify fact tables: Translate business measures into fact tables. Analyze source system information for additional measures. Identify dimension tables. Link fact tables to the dimension tables. Model the time dimension. Phase 3: Creating the Dimensional Model The database design process begins with the enterprise view of the business and the specific subject areas that are to be implemented. The work you do on the business and logical models sets the stage for the next phase in the design process, developing the dimensional model. Although entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. The potential hierarchies of the dimensions are also identified. The outcome of the dimensional model is usually the star or snowflake model. The dimensional design is the foundation of the database design for the data warehouse. Note: In this course, you closely observe the star dimensional model taking the example of the SH schema. Oracle 10g: Data Warehousing Fundamentals

87 Star Dimensional Modeling
Star dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and provides high performance. Every dimensional model is composed of one table called the fact table, and a set of smaller tables called dimension tables. This characteristic (denormalized, star-like structure) is commonly known as a star model. Within this star model, redundant data is posted from one object to another for performance considerations. A fact table has a multipart primary key composed of two or more foreign keys and expresses a many-to-many relationship. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. The slide depicts the star dimensional model of the SH schema, where SALES is the fact table, joined with dimensions such as CHANNELS, COUNTRIES, TIMES, PROMOTIONS, PRODUCTS, and so on. Oracle 10g: Data Warehousing Fundamentals

88 Advantages of Using a Star Dimensional Model
Supports multidimensional analysis Creates a design that improves performance Enables optimizers to yield better execution plans Parallels end-user perceptions Provides an extensible design Broadens the choices for data access tools Advantages of Using a Star Dimensional Schema Provides rapid analysis across different dimensions for drilling down, rotation, and analytical calculations for the multidimensional cube Creates a database design that improves performance Enables database optimizers to work with a more simple database design to yield better execution plans Parallels how end users usually think of and use the data Provides an extensible design which supports changing business requirements Broadens the choices for data access tools, because some products require a star schema design Note: The definitions of star and snowflake models vary among practitioners. Here, the assumption is that the star model contains a fact table with one level of related dimensions. An example is Sales Fact and Product Dimension. The snowflake, on the other hand, has more than one level of dimension—that is, a hierarchy (for example, Sales Fact, Product Dimension, and Product Group). Instructor Note The slides list the high-level advantages of using a star dimensional model. More specific advantages and the characteristics of the star model are listed later in this lesson. Oracle 10g: Data Warehousing Fundamentals

89 Fact Table Characteristics
Fact tables: Contain numerical metrics of the business Hold large volumes of data Grow quickly Can contain base, derived, and summarized data Are typically additive Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables What are factless fact tables? Sales (Fact Table) PROD_ID CUST_ID TIME_ID CHANNEL_ID PROMO_ID QUANTITY_SOLD AMOUNT_SOLD ... Fact Table Characteristics Facts are the numerical measures of the business. The fact table is the largest table in the star schema and is composed of large volumes of data, usually making up 90% or more of the total database size. It can be viewed in two parts: Multipart primary key Business metrics Numeric Additive (usually) Often, a measure may be required in the warehouse, but it may not appear to be additive. These are known as semiadditive facts. Inventory and room temperature are two such numerical measurements. It does not make sense to add these numerical measurements over time, but they can be aggregated by using a SQL function other than sum (for example, average). Although a star schema typically contains one fact table, other schemas can contain multiple fact tables. Oracle 10g: Data Warehousing Fundamentals
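A minimal DDL sketch of such a fact table follows; the column list mirrors the slide, but the data types and constraints are assumptions rather than the shipped SH definitions:

CREATE TABLE sales (
  prod_id        NUMBER        NOT NULL REFERENCES products(prod_id),
  cust_id        NUMBER        NOT NULL REFERENCES customers(cust_id),
  time_id        DATE          NOT NULL REFERENCES times(time_id),
  channel_id     NUMBER        NOT NULL REFERENCES channels(channel_id),
  promo_id       NUMBER        NOT NULL REFERENCES promotions(promo_id),
  quantity_sold  NUMBER(10,2)  NOT NULL,   -- additive business metric
  amount_sold    NUMBER(10,2)  NOT NULL    -- additive business metric
);
-- The foreign keys together form the multipart key that joins the fact table
-- to its dimension tables.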

90 More on Factless Fact Tables
Employee dimension Emp_PK Grade dimension Grade_PK Emp_FK Sal_FK Age_FK Ed_FK Grade_FK Salary dimension Sal_PK Education dimension Ed_PK Age dimension Age_PK More on Factless Fact Tables The factless fact table represents the many-to-many relationships between the dimensions so that the characteristics of the event can be analyzed. Materialized views are generally built on factless fact tables to create summaries. Materialized views are discussed in the lesson titled “The ETL Process: Loading Warehouse Data.” Examples Human resources: Studies of the labor force composition are often conducted for reporting and planning purposes. Analysis of employees with different characteristics can be conducted using the illustrated star. Most of the resulting information from this kind of table is a series of counts. In the example illustrated, selecting COUNT(EMP_FK) gives the number of employees, whereas selecting COUNT(SAL_FK) gives the number of employees on a specified salary grade. Retail store: Promotions are typical within the retail environment. An upscale retail chain wants to compare its customers who do not respond to direct mail promotion to those who make a purchase. A factless fact table supports the relationship between the customer, product, promotion, and time dimensions. Student attendance: Factless fact tables can be used to record student class attendance in a college or school system. There is no fact associated with this; it is a matter of whether the students attended. Note: FK = foreign key and PK = primary key Oracle 10g: Data Warehousing Fundamentals
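Queries against a factless fact table are mostly counts, for example (hypothetical table and column names following the diagram):

-- How many employees fall into each salary grade?
SELECT g.grade_name,                 -- assumed descriptive attribute of the grade dimension
       COUNT(f.emp_fk) AS employee_count
FROM   emp_profile_fact f            -- factless fact table: foreign keys only, no measures
JOIN   grade_dim g ON f.grade_fk = g.grade_pk
GROUP  BY g.grade_name;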

91 Identifying Base and Derived Measures
Business measures Facts (Base, Derived) Business Measures Quantity sold Amount sold Profit Sales Fact Table Quantity sold Base Amount sold Base Profit Derived Identifying Base and Derived Measures The fields in your fact tables are not just source data columns. There may be many more columns that the business wants to analyze, such as year-to-date sales or the percentage difference in sales from a year ago, or profits. You must keep track of the derived facts as well as the base facts. Derived facts are the data that is calculated or created from two or more sources of data. A derived value can be more efficiently stored for access, rather than calculating the value at execution time. For example, Profits is a derived measure, whereas Sales is a base measure. Similarly, Salary, and Commission are the base measures, whereas monthly compensation of the employee is a derived measure. In OLTP systems, the derived measures are not stored, but derived data in the warehouse is very important because of its inherent support for queries. When you store derived data in the database, values are available immediately for analysis through queries. The slide shows the translation of some of the business measures in a business model into a Sales fact table. In addition, each measure in the fact table is identified as either a base or derived measure—for example, Profit = Sales_amount – Cost. Derived values can be created during the extraction or transformation processes. Oracle 10g: Data Warehousing Fundamentals
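Derived measures are typically computed once during extraction or transformation and stored, rather than recalculated at query time, for example (a sketch with hypothetical staging tables):

-- Store profit as a derived measure alongside the base measures.
INSERT INTO sales_fact (prod_id, time_id, quantity_sold, amount_sold, profit)
SELECT s.prod_id,
       s.time_id,
       s.qty,
       s.amount,
       s.amount - (c.unit_cost * s.qty)   -- derived: profit = sales amount minus cost
FROM   stg_sales s
JOIN   stg_costs c ON c.prod_id = s.prod_id
                  AND c.time_id = s.time_id;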

92 Fact Table Measures Fact table measures can be:
Additive: Added across all dimensions Semiadditive: Added along some dimensions Nonadditive: Cannot be added along any dimension Fact Table Measures Fact table measures can be: Additive: Additive measures can be added across all of the dimensions to provide an answer to your query. Additive facts are fundamental components of most queries. Typically, when you query the database, you are requesting that data be summed to a particular level to meet the constraints of your query. Additive facts are numeric and can be logically used in calculations across all dimensions. Some examples of additive measures are units sold, sales amount, and shipping charge. Note: You should choose the facts in a fact table to be numeric and additive (more useful). Semiadditive: Semiadditive measures can be added along some but not all of the dimensions, such as a bank account balance. The bank records balances at the end of each banking day for customers, by account, over time. This allows the bank to study deposits, as well as individual customers. In some cases, the account balance measure is additive. If a customer holds both checking and savings accounts, you can add together the balances of each account at the end of the day and get a meaningful combined balance. You can also add balances together across accounts at the same branch for a picture of the total deposits at each location; however, you cannot add together account balances for a single customer over multiple days. Nonadditive: Nonadditive measures cannot logically be added between records. Nonadditive data can be numeric and is usually combined in a computation with other facts (to make it additive) before being added across records. Margin percent is an example of a nonadditive measure. A fact table that contains no measures at all is a factless fact table, as discussed earlier in this lesson. Oracle 10g: Data Warehousing Fundamentals
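The bank balance example can be made concrete with two queries (hypothetical table names); summing is valid across accounts on one day, but across days the balance must be averaged instead:

-- Total deposits per branch at the end of one banking day: adding is meaningful.
SELECT branch_id, SUM(balance) AS total_deposits
FROM   account_balance_fact
WHERE  balance_date = DATE '2005-01-31'
GROUP  BY branch_id;

-- Across days, balances cannot simply be summed; an average is used instead.
SELECT account_id, AVG(balance) AS avg_daily_balance
FROM   account_balance_fact
WHERE  balance_date BETWEEN DATE '2005-01-01' AND DATE '2005-01-31'
GROUP  BY account_id;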

93 Dimension Table Characteristics
Dimension tables: Contain textual information that represents the attributes of the business Contain relatively static data Are joined to a fact table through a foreign key reference Dimension Table Characteristics Dimensions are the textual descriptions of the business attributes. Dimension tables are typically smaller than fact tables and the data changes less frequently. Dimension tables give perspective regarding the whys and hows of the business and element transactions. Although dimensions generally contain relatively static data, customer dimensions are updated more frequently. Dimensions Are Essential for Analysis The key to a powerful dimensional model lies in the richness of the dimension attributes because they determine how facts can be analyzed. Dimensions can be considered as the entry point into “fact space.” Always name attributes in the users’ vocabulary. That way, the dimension will document itself and its expressive power will be apparent. Oracle 10g: Data Warehousing Fundamentals

94 Translating Business Dimensions into Dimension Tables
Products Products Dimension Table Prod_Id Channels Prod_name Prod_desc Prod_Category Customers Prod_Subcategory Product_status Prod_List_price Countries Prod_Category_Desc Prod_Status Prod_Weight _class Translating Business Dimensions into Dimension Tables Translating a list of business dimension attributes into a dimension table is not a simple one-to-one mapping process. You must also understand the source data structure to identify all of the source data elements that need to be included in the data warehouse in order to support users’ analytical requirements. For example, the slide shows the translation of a Products business dimension into a dimension table: A list of attributes for the Products business dimension are uncovered during the business-modeling phase. All of the product-related source data information is gathered and used to help translate the business dimension requirements into dimension table attributes. In the example, “ID” fields (codes that uniquely identify product attributes) and description fields are available in the source data, and should be included in the warehouse so that users familiar with the production system can cross-reference the source system ID fields with warehouse data elements. Promotions Prod_category_id Prod_Pack_Size … Oracle 10g: Data Warehousing Fundamentals

95 Slowly Changing Dimensions
Dimension data does not change as dynamically as fact data. Over a given number of years, the dimension data may change. For example, the product size and product package may change in the Products dimension over a period of time. Similarly, Country, State, and City names may change in the Countries dimension. These dimensions are referred to as slowly changing dimensions (SCDs). The data warehouse should be able to store both the current and historical data very effectively for SCDs. There are three ways to manage slowly changing dimensions: Overwrite the existing dimensional attribute with the change. This does not affect the keys nor does it insert records. Add a new record each time the dimension data changes. This preserves the history of the old record and accurately partitions history across time; however, a significant increase to the database’s size is incurred. Examine the grain of the data very closely when first designing the data warehouse; otherwise, partitioning of the data will not occur properly. Caution should be used when implementing this change. If an ID is used as the primary key, a potential integrity violation could occur. Preserve the current record information, but include some critical fields when initially designing the data warehouse. These fields retain previous and current information, and should include a time attribute to signify when the change occurred thus allowing a history of the change to be preserved. It should be evident that this method increases complexity and the component size of the table. Note: Oracle Warehouse Builder supports all these three options to manage SCDs. Oracle 10g: Data Warehousing Fundamentals

96 Slowly Changing Dimension (SCD): An Example
Products Dimension Table Business Dimension Product Prod_Id (Natural) Id (Surrogate) Prod_name Prod_desc Prod_Category Products: SCD Product weight, Product package size varying over time Prod_Subcategory Product_status Prod_List_price Prod_Category_Desc Prod_Status Prod_Weight_class Prod_category_id Prod_Pack_Size … Slowly Changing Dimensions: An Example A typical example of a slowly changing dimension (Products) is shown in the slide. The example uses the third option as described on the previous page. Note that new attributes Prod_Eff_From and Prod_Eff_To are added. A new surrogate key ID is also added so that the changes to the Products dimension can be tracked by adding a new record. The Prod_Id may be the same for the two records, but the Id value differs. Original record: Product_PK Product_Id Product_Name MiniDVCamcorder with 3.5" Swivel LCD Prod_pack_size Prod_Eff_From Prod_Eff_To P Jan-1998 Changed record: MiniDVCamcorder with 3.5" Swivel LCD Prod_pack_size Prod_Eff_From Prod_Eff_To P Jan Jan-2005 Note: A surrogate key is a system-generated key in a warehouse environment. In some instances, the methods used to maintain historical data could potentially allow duplicate keys. To ensure uniqueness, a surrogate key is generated by the applications that maintain data (for example, OWB). More discussion about surrogate keys is included on the following page. Prod_Eff_From Prod_Eff_To Where Product_key is a calculated number stored within the database Oracle 10g: Data Warehousing Fundamentals
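The add-a-new-record option can be sketched in SQL as follows; the key values, dates, and pack size are hypothetical, and in practice a tool such as OWB generates this logic:

-- Close out the current version of the changed product.
UPDATE products_dim
SET    prod_eff_to = DATE '2005-01-01'
WHERE  prod_id = 14                    -- hypothetical natural key value
AND    prod_eff_to IS NULL;

-- Insert the new version with a fresh surrogate key and an open-ended effective date.
INSERT INTO products_dim
  (id, prod_id, prod_name, prod_pack_size, prod_eff_from, prod_eff_to)
VALUES
  (products_dim_seq.NEXTVAL, 14, 'MiniDV Camcorder with 3.5" Swivel LCD', 'P',
   DATE '2005-01-01', NULL);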

97 Oracle 10g: Data Warehousing Fundamentals 1 - 124
Types of Database Keys Primary keys (PKs) Foreign keys (FKs) Composite keys Surrogate keys Types of Database Keys Primary key: A logical or natural primary key is the column or columns that uniquely identify a row in a table. A primary key constraint is used to enforce uniqueness for all rows in that table. It is also critical to performance, granting quick access compared to the unacceptable amount of time that would be required if the RDBMS had to scan the entire table every time a row was queried. The choice of primary key is an important design consideration because it forces data integrity and eliminates data duplication within the table. Foreign key: This key column in a table references a primary key for another table, establishing relationships between tables. Composite key: The composite key consists of a number of columns. In the case of a concatenated primary key, combinations of values between the columns must be unique. These keys are sometimes referred to as concatenated or segmented keys. All keys identified here are important for the efficiency of all systems, whether the systems are operational or warehouse. Composite keys are commonly used in the warehouse. Surrogate key (Warehouse, Synthetic, or Generalized keys): A surrogate key is a system-generated key. The key itself has no meaning, and therefore, you cannot ascertain anything about the underlying dimension with which it is associated. Generally speaking, a four-byte integer is sufficient (containing 2^32 values, or more than two billion positive integers) when assigning attributes for the key. An example of a system-generated key is the ROWID value that is generated by the Oracle server for every row inserted into the database. Oracle 10g: Data Warehousing Fundamentals
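In an Oracle warehouse, surrogate keys are commonly drawn from a sequence, for example (a minimal sketch with hypothetical object names):

CREATE SEQUENCE customers_dim_seq START WITH 1 INCREMENT BY 1 CACHE 1000;

INSERT INTO customers_dim (cust_key, cust_id, cust_name)
VALUES (customers_dim_seq.NEXTVAL,   -- system-generated surrogate key
        'C-1001',                    -- natural key carried from the source system
        'Example Customer');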

98 Using Time in the Data Warehouse
Defining standards for time is critical. Aggregation based on time is complex. Using Time in the Data Warehouse Though it may seem obvious, real-life aggregations based on time can be quite complex. Which weeks roll up to which quarters? Is the first quarter the calendar months of January, February, and March, or the first 13 weeks of the year that begin on Monday? Some causes for nonstandardization are: Some countries start the workweek on Monday, others on Sunday. Weeks do not cleanly roll up to years because a calendar year is one day longer than 52 weeks (two days longer in leap years). There are differences between calendar and fiscal periods. Consider a warehouse that includes data from multiple organizations, each with its own calendars. Holidays are not the same for all organizations and all locations. You should consider all of the above points when designing the Time dimension. Oracle 10g: Data Warehousing Fundamentals

99 Time Dimension Time dimension is critical to the data warehouse.
Choose the right granularity for the Time dimension. Fiscal year Sales fact Time dimension Fiscal quarter Where should the element of time be stored? Fiscal month Fiscal week Current dimension grain Time Dimension The dimension of time is most critical to the data warehouse. A consistent representation of time is required for extensibility. Because online transaction data, typically the source data for the warehouse, may not have a time element, you apply an element of time in the extraction, transformation, and transportation process. For example, you might assign a week identifier to all the airline tickets that sold within that week. Defining Time Granularity The grain you choose for the time dimension can have a significant impact on the size of your database. It is important to know how your data has been defined with regard to time to accurately determine the grain. This is particularly true with external data that is commonly aggregated data. Even when you think there is no gap in the granularity of your systems, you may find that basic definitions between systems differ. The primary consideration here is to keep a low enough level of data in your warehouse to be able to aggregate and match values with other data that has a predetermined aggregate. Day Future dimension grain Oracle 10g: Data Warehousing Fundamentals
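A time dimension is usually pre-populated from a date generator rather than loaded from source data, for example (a sketch at day grain; the table and column names are illustrative):

-- Generate ten years of day-grain rows for the time dimension.
INSERT INTO times_dim (time_id, day_name, calendar_month, calendar_quarter, calendar_year)
SELECT d,
       TO_CHAR(d, 'Day'),
       TO_CHAR(d, 'YYYY-MM'),
       TO_CHAR(d, 'YYYY') || '-Q' || TO_CHAR(d, 'Q'),
       EXTRACT(YEAR FROM d)
FROM  (SELECT DATE '1998-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 3653);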

100 Identify Hierarchies for Dimensions
Multiple time hierarchies Geography hierarchy Fiscal time Calendar time Fiscal year Calendar year Region Fiscal quarter Calendar quarter Country Fiscal month State/Province Calendar month Identify Hierarchies for Dimensions Dimension tables contain hierarchical data. There can be a single hierarchy or multiple hierarchies for a dimension. The example in the slide shows the Geography hierarchy associated with the Customers dimension, and the Time dimension described by multiple hierarchies to support both calendar and fiscal year. Representation of time is critical in the data warehouse. You may decide to store multiple hierarchies in the data warehouse to satisfy the varied definitions of units of time. If you are using external data, you may find that you create a hierarchy or translation table simply to be able to integrate the data. A simple time hierarchy corresponds to a calendar approach: days, months, quarters, and years. You can also have a hierarchy based on fiscal approach: days, months, quarters, and years. A hierarchy based on weeks seems fairly simple as well: weeks, four-week period. What is the definition of a week? Does the week start on Sunday or Monday? Internally, you may define it one way; however, when you try to integrate external data that is defined in a different way, you may get unexpected or misleading results. Are there not 13 weeks in a quarter? Why can I not map 13-week periods to a quarter? Typically, the start and stop date of a quarter corresponds to a calendar date—the first of the month, the last day of a month. Thirteen-week periods may start at any time but are usually consistent with the start day of the week definition. City Fiscal date Calendar date Oracle 10g: Data Warehousing Fundamentals

101 Oracle 10g: Data Warehousing Fundamentals 1 - 130
Data Drilling Market Hierarchy (Diagram: Group > Region 1, Region 2 > Country1, Country2, Country3, Country4 > State1 through State6 > City1, City2.) Data Drilling Hierarchies in dimensions enable you to perform data drilling. Drilling refers to the investigation of data to greater or lesser detail from a starting point. Typically, in an analytical environment, you start with less detail, at a higher level within a hierarchy, and investigate down through a hierarchy to greater detail. This process is drilling down (to more detailed data). Drilling down means retrieving data at a finer level of granularity. For example, you may want to analyze the sales revenue or profits at the Region level, and further drill down to the Country, State, and City levels. Drilling up is the reverse of this process. Consider the Geography hierarchy example. If your starting point were an analysis of data at the City level, drilling up would mean looking at a lesser level of detail, such as the Country, Region, or Group level. Oracle 10g: Data Warehousing Fundamentals

102 Using Data Modeling Tools
Tools with a GUI enable definition, modeling, and reporting. Avoid a mix of modeling techniques caused by the following: Development pressures Developers who lack modeling knowledge No strategy Determine a strategy for data modeling. Write and publish data models formally. Make the data models available electronically. Using Data Modeling Tools Your logical design should result in: 1. A set of entities and attributes corresponding to fact tables and dimension tables 2. A mapping of operational data from your sources into subject-oriented information in your target data warehouse schema You can create the logical design by using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general-purpose modeling tool). You can generally model the warehouse database by using tools that provide a GUI for: Entering metadata definitions of facts, dimensions, hierarchies, and relationships Drawing diagrams of star schemas containing the facts and dimensions Documenting business requirements Defining integrity rules and constraints Generating reports about your metadata definitions Oracle 10g: Data Warehousing Fundamentals

103 Oracle 10g: Data Warehousing Fundamentals 1 - 133
Summary In this lesson, you should have learned about: Data warehouse environment data structures Data warehouse database design phases: Defining the business model Defining the logical model Defining the dimensional model Oracle 10g: Data Warehousing Fundamentals

104 Oracle 10g: Data Warehousing Fundamentals 1 - 134
Practice 3-1: Overview This practice covers the following topics: Identifying the facts, measures, hierarchies, and slowly changing dimensions based on the RISD scenario given Exploring viewlet-based demonstrations provided for modeling concepts, and answering the questions in these interactive viewlets Oracle 10g: Data Warehousing Fundamentals

105 Physical Modeling: Sizing, Storage, Performance, and Security Considerations
ILT Schedule: Timing Topic 90 minutes Lecture 15 minutes Practice 105 minutes Total

106 Oracle 10g: Data Warehousing Fundamentals 1 - 138
Objectives After completing this lesson, you should be able to do the following: Describe how to translate the dimensional model to a physical model Explain data warehouse sizing techniques and test load sampling Describe data warehouse partitioning methods Describe indexing types and strategies Explain parallelism in data warehouse operations Explain the importance of security in data warehouses Identify the tools and technologies provided by Oracle Lesson Overview In the previous lesson, you learned about business, logical, and dimensional modeling. In this lesson, you learn how to translate the dimensional model into a physical model. You learn about the various factors to be considered for implementing a physical model, such as database sizing, storage, performance, and security. You learn about the ways to size your data warehouse, and enhance query performance through partitioning and indexing. One of the main challenges within data warehousing is to recognize that fact and detail tables will grow incredibly large and to manage that growth successfully. Query performance continues to present challenges as these fact tables grow. Partitioning, indexing, and offloading data that is no longer required are essential to sustaining a healthy data warehouse. This lesson focuses on different methods that assist in balancing the warehouse, and is broken down into three primary components: sizing, partitioning, and indexing. Oracle 10g: Data Warehousing Fundamentals

107 Phase 4: Defining the Physical Model
Translate the dimensional design to a physical model for implementation. Update metadata document with physical model information. Determine hardware architecture. Define storage strategy for tables and indexes. Perform database sizing. Define partitioning strategy. Define initial indexing strategy. Define the security strategy. Phase 4: Defining the Physical Model A good physical model is often the difference between data warehouse success and failure. The design of the physical model builds on the logical model, adding indexes, referential integrity, physical storage, and other characteristics. There are other considerations that you should bear in mind for performance, such as data partitioning, security, and so on. The physical model is translated into data that resides in the database server. Defining the physical model is accomplished by performing the following tasks: Translate the dimensional design to a physical model. Define storage strategy for tables and indexes. Perform database sizing. Define the initial indexing strategy. Define partitioning strategy. Update metadata document. Oracle 10g: Data Warehousing Fundamentals

108 Translating Dimensional Model to Physical Model
Develop object naming conventions. Apply the naming standards to the tables and attributes of the dimensional model. (Diagram: physical dimension tables with surrogate keys such as Product_PK, Channel_PK, and Promotion_PK, and the PRODUCT attributes with data types: PRODUCT_ID v(11), PROD_DESC v(125), PROD_NAME v(35), PROD_CATEGORY_ID v(20), PROD_CATEGORY_DESC v(50), PROD_SUBCATEGORY v(25), SUPPLIER_ID v(20), PROD_STATUS v(10), PROD_LIST_PRICE n, PROD_MIN_PRICE v(20), PROD_PACK_SIZE v(20), PROD_WEIGHT_CLASS v(10), PROMOTION_CODE v(10), WHSE_LOCATION v(10), PROD_EFF_FROM date, PROD_EFF_TO date.) Translating Dimensional Model to Physical Model The starting point for the physical data model is the dimensional model. Although similar, the major difference between these two models is that the physical model is a thorough and detailed specification of physical database characteristics such as data types, lengths, database sizing, indexing strategy, and partitioning strategy. Develop Object Naming Conventions: It is very important to have database object naming standards and to follow them. In general, it is recommended that logical and physical names be identical and as descriptive as possible. The following are some of the conventions: Capitalize table and attribute names. Keep table names in plural form and attribute names in singular form. Use underscores rather than spaces to delineate separate words in an object’s name. Use a suffix of _PK to indicate primary keys. For example, use PRODUCT_PK to indicate the primary (surrogate) key for the PRODUCT dimension. Use a suffix of _ID to indicate production keys. For example, the PRODUCT dimension has PRODUCT_ID as an attribute. This is probably the key used in the production systems. Find a good balance between being too specific (such as COMPANY_CUSTOMER_ID_LIST) and far too vague (such as C_ID). Note: This is an overview; not all conventions are mentioned. Oracle 10g: Data Warehousing Fundamentals

109 Architectural Requirements
Scalability Manageability Availability Extensibility Flexibility Integration (Diagram: these architectural requirements are balanced against User needs, Business requirements, Budget, and Technology.) Architecture Requirements The data warehouse tenets that are described in the slide are perceived to be the primary tenets in a data warehouse environment—that is, the architecture must be scalable, manageable, available, extensible, flexible, and integrated. This list can be extended to include tunable, reliable, robust, supportable, and recoverable. Making Compromises Compromises may affect the task of balancing user needs and business requirements if budgetary constraints restrain your choices or if technical difficulties are too challenging. The architecture requirements definition must be considered at an early stage, in parallel with the user requirements. Successful choices can be made only at this time. Strategy for Architecture Definition You must have a definitive strategy that employs identified and proven technology. Consider some of the tasks you need to perform in the early stages when planning the hardware architecture and surrounding environment. Obtain existing plans and outlines of the current technical architecture for the environments that will supply the warehouse. Obtain existing capacity plans for the current environments. Oracle 10g: Data Warehousing Fundamentals

110 Hardware Requirements
Symmetric multiprocessing (SMP) Nonuniform memory access (NUMA) Clusters Massively Parallel Processing (MPP) (Diagram: an nCUBE MPP system.) Hardware Requirements Today, hardware architectures support a number of different configurations that are useful for data warehousing and are more cost-effective than hardware architectures previously available. Symmetric multiprocessing (SMP): Symmetric multiprocessing architectures are the oldest of the technologies and have a proven track record. A symmetric multiprocessing (SMP) machine comprises a set of CPUs that share memory. Each CPU has full access to the shared memory through a common bus, and communication between the CPUs uses the shared memory. Benefits: High concurrency Workload balancing Moderate scalability (not as scalable as MPP or NUMA) Easier to administer than a cluster environment, with proven tools Limitations: Available memory may be limited (this can be enhanced by clustering) Limited bandwidth for CPU-to-CPU, I/O, and bus communication Oracle 10g: Data Warehousing Fundamentals

111 Making the Right Choice
Requirements differ from operational systems. Benchmarks: Available from vendors Develop your own Use realistic queries Scalability is important. Making the Right Choice How do you know which architecture to choose? Operational environments do not map directly to the way the warehouse operates, with its unpredictable workloads and scalability requirements. The only realistic way to determine the interaction between your data warehouse database and the hardware configuration is to perform full-scale testing. You may not be able to achieve this. When benchmarking, use real user queries against volumes of data that mimic the volumes anticipated in the warehouse. If you are unhappy with vendor benchmarks, consider developing your own. This is going to add to the cost of development. However, costs are high for a warehouse implementation and you may find the amount spent on your own benchmark worthwhile in the long term. Oracle 10g: Data Warehousing Fundamentals

112 Storage and Performance Considerations
Database sizing Test load sampling Data partitioning Partitioning methods Benefits of partitioning Indexing B-Tree indexes Bitmap indexes Bitmap-join indexes Star query optimization Parallelism Storage and Performance Considerations One of the main challenges within data warehousing is to recognize that fact and detail tables will grow incredibly large and to manage that growth successfully. Query performance continues to present challenges as these fact tables grow. Partitioning, indexing, and offloading data that is no longer required are essential to sustaining a healthy data warehouse. The points listed in the slide are discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

113 Oracle 10g: Data Warehousing Fundamentals 1 - 149
Database Sizing Sizing influences capacity planning and systems environment management. Sizing is required for: The database Other storage areas Sizing is not an exact science. Techniques vary from implementation to implementation. Sizing the Database and Other Storage Requirements A major factor in capacity and space planning is the physical size of the data warehouse database. The considerations for sizing include the amount of physical disk space that is required for the data warehouse database (for example, for the tables, views, and indexes). Determine the amount of physical disk space required for: The architecture of the environment Backup and recovery tasks Mirroring techniques Temporary space and loading techniques Sizing the database is not an exact science. Techniques vary from implementation to implementation with many possible approaches; you should identify one that meets the requirements of your implementation. Oracle 10g: Data Warehousing Fundamentals

114 Estimating the Database Size
Estimate the size of each row in the fact table. Determine the grain of each dimension and estimate the number of entries in the finest level. Multiply the number of rows of all dimensions and multiply the result by the fact table row size. Determine whether the fact table is sparse or dense and estimate the reduction or increase in size. Estimating the Database Size Although the Oracle Database 10g Administrator’s Guide provides important information about sizing a database, additional steps are required to preclude fragmentation, which could ultimately result in performance degradation. The methodology that you use involves the following steps: Examine the size of the Oracle data block because this affects database performance and storage efficiency. The data block must account for header space, PCTFREE, and growth (future inserts). Examine the space required for an average row by determining the average column length for each column within the row. Because most Oracle data types are stored in variable-length formats, review the test or quality assurance database to use as a benchmark estimate. Use AVG(VSIZE(column_name)) to assist with this evaluation (see the example that follows). Repeat this for each table within the star design. Examine the space required for indexes and summary tables by using the same method described above, except that you estimate the number of rows to be stored in each summary by using the GROUP BY clause. Plan approximately four times this space for rollback segments, temporary tablespace, and system tablespace. A good practice is to provide three to four times the space of actual detail data for indexing, summarization, and other system requirements. Oracle 10g: Data Warehousing Fundamentals
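Note: The following query is a minimal sketch of the row-size estimation step, assuming the SH sample schema's SALES fact table (the same schema used in later examples in this lesson); it sums the average VSIZE of each column to approximate the average stored row length before block header and PCTFREE overhead.
SELECT AVG(VSIZE(prod_id)) + AVG(VSIZE(cust_id)) + AVG(VSIZE(time_id)) +
       AVG(VSIZE(channel_id)) + AVG(VSIZE(promo_id)) +
       AVG(VSIZE(quantity_sold)) + AVG(VSIZE(amount_sold)) AS avg_row_bytes
FROM   sales;
Multiplying AVG_ROW_BYTES by the projected number of fact rows, and then allowing the recommended three to four times that figure for indexes, summaries, and other system areas, gives a first approximation of the space required.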

115 Validating Database Size Assumptions
Description / Estimation:
Estimate the size of one row of the fact table: 52 bytes (assumed for this example).
Estimate the entries in the lowest level within each dimension: Channels: 5 channels; Customers: located in 23 countries; Products: 72 items; History: 48 months.
Multiply the number of entries for each dimension and multiply the result by the fact table row size: (5 x 23 x 72 x 48) x 52 = 20,666,880 bytes.
Sparsity is low; adjust by 10%: 20,666,880 * 0.1 = 2,066,688; 20,666,880 – 2,066,688 = 18,600,192 bytes.
Estimated database size: 18.6 MB (approximately).
Validating Database Size Assumptions After you estimate the size of the database, you can validate your assumptions by doing the following: Extract sample files. Load data into the database. Compute exact expected row lengths. Add overhead for indexing, rollback and temporary tablespaces, aggregates, views, and a file system staging area for flat files. Every time Oracle reads from or writes to the database, it does so using database blocks. The database block size must be a multiple of the underlying operating system block size. Although the default block size for Oracle is set to 2 KB, a minimum of 8 KB is recommended with 16 KB preferred. In a large data warehouse, the more data that can be read at one time, the more performance is enhanced. Note: An example of estimating and validating the database size is shown in the slide. However, the actual database sizes for a data warehouse are much larger than the one shown in the example. Oracle 10g: Data Warehousing Fundamentals

116 Oracle 10g: Data Warehousing Fundamentals 1 - 152
Testing Load Sampling Analyze a representative sample of the data chosen using proven statistical methods. Ensure that the sample reflects: Test loads for different periods Day-to-day operations Seasonal data and worst-case scenarios Indexes and summaries Testing Load Sampling A good approach to sizing is based on the analysis of a representative sample of the data chosen using proven statistical methods. Test loads can be performed on data from a day, week, month, or any other period of time. You must ensure that the sample periods reflect the true day-to-day operations of your company, and that the results include any seasonal issues or other factors, such as worst-case scenarios that may prejudice the results. After you have determined the number of transactions based on the sample, you calculate the size. You must also consider the following factors that can have an impact: Indexing, because the amount of indexing can significantly impact the size of the database Summary tables that can be as large as the primary fact table, depending on the number of dimensions and the number of levels of the hierarchies associated with those dimensions Oracle 10g: Data Warehousing Fundamentals

117 Oracle Database 10g: Architectural Advantages
New and improved technologies: Real Application Clusters and Cache Fusion Self-managing in critical areas Flashback Query Data Guard and Recovery Manager Oracle Database 10g: Architectural Advantages Applications run concurrently across nodes without any modifications, yielding a highly scalable structure for e-business. Real Application Clusters employs Cache Fusion to combine multiple caches virtually across nodes providing high availability of new hardware. Additionally, the database workload dynamically shifts and self-tunes to accommodate the database workload and to satisfy query requests from local or other caches. Oracle Database is capable of managing its own undo segments by allocating the rollback space in a single UNDO tablespace. You can grow or shrink the System Global Area (SGA) dynamically and resize a buffer cache or shared pool. Flashback Query allows for human error correction, whereas Data Guard and Recovery Manager promote enhanced disaster recovery and automation. Oracle 10g: Data Warehousing Fundamentals

118 Why Data Partitioning Is Needed
Data warehouses may: Grow to become very large databases (VLDB) Have very large fact tables and lots of historical data Table availability: Large tables are more vulnerable to disk failure. It is too costly to have a large table inaccessible for hours due to recovery. Large table manageability: Indexes take too long to be built. Partial deletes take hours, even days. Performance considerations: Large table and large index scans are costly. Scanning a subset improves performance. Why Data Partitioning Is Needed Typically, data warehouses may grow to become very large databases (VLDB), or some data warehouses may have very large fact tables with lots of historical data even in the implementation phase itself. A VLDB is a very large database that contains hundreds of gigabytes or even terabytes of data. VLDBs typically owe their size to a few very large tables and indexes rather than a very large number of objects. The following are some typical situations that make it hard to work with VLDBs: A disk failure renders a big table inaccessible. The table may be striped over many disks. Users may still need to access the subset of rows unaffected by disk failure. Reloading or rebuilding large tables and indexes can greatly exceed any of the company’s downtime allowances. In a data warehouse environment, users might query the most recent data more intensely than older data. It would be advantageous to tune the database to meet this pattern of behavior. The above problems can be addressed by using data partitioning. Note: Data partitioning techniques are discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

119 Oracle 10g: Data Warehousing Fundamentals 1 - 155
Data Partitioning Breaking up of data into smaller units that can be handled independently Data partitioning provides ease of: Restructuring Reorganization Removal Recovery Monitoring Management Archiving Indexing Data Partitioning Data partitioning enables you to break tables down into smaller, more manageable units, thus addressing the problems of supporting large tables (which are inherent in data warehouses). Indexes can also be partitioned in similar fashion. Each partition can be managed individually, and can function independently of the other partitions, thus providing a structure that can be better tuned for availability and performance. Partitioning is transparent to existing applications—that is, standard data manipulation language (DML) statements run against partitioned tables as they do with the normal tables, but with a better performance. The data can be partitioned horizontally or vertically. Partitioning helps in the following ways: Improves the speed of access and data management by eliminating the need to visit both vertical and horizontal partitions during query and backup tasks Increases the availability by reducing the time to perform all the warehouse management tasks (such as load) and the ability to take one area of the database offline and keep others active Note: Oracle database supports various horizontal partitioning techniques, and these are briefly discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

120 Benefits of Partitioning
Oracle database offers easy administration of partitions. The optimizer eliminates (prunes) partitions that do not need to be scanned. Partitions can be scanned, updated, inserted, or deleted in parallel. Join operations can be optimized to join “by the partition.” Partitions can be load-balanced across physical devices. Large tables within Real Application Clusters environments can be partitioned. Benefits of Partitioning Easy Administration: Oracle database offers easy administration of partitions and provides a set of SQL commands for creating and managing partitions. For example: ALTER TABLE ADD PARTITION ALTER TABLE DROP PARTITION ALTER TABLE TRUNCATE PARTITION ALTER TABLE MOVE PARTITION ALTER TABLE SPLIT PARTITION ALTER TABLE EXCHANGE PARTITION Commands for partitioned indexes, and so on Improved Performance: Oracle database offers an intelligent optimizer, which is aware of the following points when accessing a partitioned table or index: If WHERE clauses are specified in a SQL statement, the optimizer can evaluate the statement and, based on values, prune partitions that do not need to be accessed. Queries and DML operations are narrowed down to the partition level instead of requiring a full table or index scan. Oracle 10g: Data Warehousing Fundamentals

121 Partitioning Methods Oracle database provides the following partitioning methods: Range partitioning Hash partitioning Composite partitioning List partitioning Partitioning Methods Range partitioning: Range partitioning maps data to partitions based on a range of partition key values that you establish for each partition. Range partitions are ordered and this ordering is used to define the lower and upper boundary of a specific partition. It is the most common type of partitioning and is often used with dates. For example, the partitions can be based on the joining dates of the employees (employees joined in the years 2001–2003 and so on) or the transaction dates for the Sales fact table, and so on. It is a convenient method for partitioning historical data. The boundaries of range partitions define the ordering of the partitions in the tables or indexes. Range partitioning is also ideal when you periodically load new data and purge old data. It is easy to add or drop partitions. However, it is not always possible to know beforehand how much data will map into a given range, and in some cases, sizes of partitions may differ quite substantially, resulting in suboptimal performance for certain operations such as parallel DML. Range partitioning, and partitioning in general, is available in Oracle8 and later versions. Hash partitioning: This method uses a hash function on the partitioning columns to stripe data into partitions. It controls the physical placement of data across a fixed number of partitions and gives you a highly tunable method of data placement. Hash partitioning is available in Oracle8i and later versions. Oracle 10g: Data Warehousing Fundamentals
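Note: As a minimal sketch of range partitioning by date (the table and partition names here are illustrative, not part of the course scenario), a fact table can be partitioned by quarter; the periodic load-and-purge pattern described above is then maintained simply by adding a new partition and dropping the oldest one:
CREATE TABLE sales_fact
( time_id     DATE,
  cust_id     NUMBER,
  amount_sold NUMBER(10,2) )
PARTITION BY RANGE (time_id)
( PARTITION sales_q1_2000 VALUES LESS THAN (TO_DATE('01-APR-2000','DD-MON-YYYY')),
  PARTITION sales_q2_2000 VALUES LESS THAN (TO_DATE('01-JUL-2000','DD-MON-YYYY')),
  PARTITION sales_q3_2000 VALUES LESS THAN (TO_DATE('01-OCT-2000','DD-MON-YYYY')) );
-- Rolling window: add the next quarter and purge the oldest one
ALTER TABLE sales_fact ADD PARTITION sales_q4_2000
  VALUES LESS THAN (TO_DATE('01-JAN-2001','DD-MON-YYYY'));
ALTER TABLE sales_fact DROP PARTITION sales_q1_2000;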

122 Partition Pruning: An Example
Partition pruning: Only the relevant partitions are accessed. (Diagram: the Sales table is range partitioned by month, 01-Jan through 01-Jun; only the 01-Mar, 01-Apr, and 01-May partitions are accessed by the query.)
SELECT SUM(amount_sold)
FROM sales
WHERE time_id BETWEEN TO_DATE('01-MAR-2000', 'DD-MON-YYYY')
                  AND TO_DATE('31-MAY-2000', 'DD-MON-YYYY');
Partition Pruning Partition pruning is an essential performance feature for data warehouses. In partition pruning, the cost-based optimizer analyzes FROM and WHERE clauses in SQL statements to eliminate unneeded partitions when building the partition access list. This enables Oracle server to perform operations only on those partitions that are relevant to the SQL statement. Oracle server prunes partitions when you use range, LIKE, equality, and IN-list predicates on the range or list partitioning columns, and when you use equality and IN-list predicates on the hash partitioning columns. This allows pruning for conjunctive predicates such as c > 10 and c < 20 but not for disjunctive predicates such as c in (10,30) or (c > 10 and c < :B1) or (c > :B2 and c < 1000). If you partition the index and table on different columns (with a global, partitioned index), then partition pruning also eliminates index partitions even when the partitions of the underlying table cannot be eliminated. On composite partitioned objects, Oracle server can prune at both the range partition level and at the hash or list subpartition level using the relevant predicates. Partition pruning dramatically improves query performance and resource utilization. However, the optimizer cannot prune partitions if the SQL statement applies a function to the partitioning column. Oracle 10g: Data Warehousing Fundamentals

123 Oracle 10g: Data Warehousing Fundamentals 1 - 162
Indexing Indexing is used for the following reasons: It saves cost and greatly improves performance and scalability. It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed. Indexing Data By intelligently indexing data in your data warehouse, you can increase both the performance and scalability of your warehouse solution. Using indexes, you can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed. The types of indexes that are supported by Oracle Database 10g are described in the following slides. Oracle 10g: Data Warehousing Fundamentals

124 Oracle 10g: Data Warehousing Fundamentals 1 - 163
B-Tree Index Most common type of indexing Used for high cardinality columns Designed for few rows returned B-Tree Indexes This is the most common type of indexing, used for high cardinality columns, and designed for few rows returned. Rather than scanning an entire table to find rows where a certain column satisfies a WHERE clause predicate, you create a separate index structure on that column. The index structure consists of sorted values, and is represented as a balanced tree (B-tree) in order to allow the database engine to quickly find any element in the sorted list. The upper blocks (branch blocks) of a B-tree contain index data that points to lower-level index blocks. They provide a road map to get to the right block at the leaf level. The lowest level index blocks (leaves in the tree) contain the discrete column value, and a corresponding ROWID that is used to locate the actual row. This is similar to the way an index in a book has a page number associated with each index entry. Note: Cardinality is defined as the number of distinct key values in a column. It is often expressed as a percentage of the number of rows in the table. For example, a million-row index with five distinct values has a low cardinality, whereas a 100-row table with 80 distinct values has a high cardinality. Oracle 10g: Data Warehousing Fundamentals
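Note: As a brief illustration (assuming the SH sample schema used elsewhere in this course; the index name is illustrative), a B-tree index on a reasonably high cardinality column is created with a plain CREATE INDEX statement:
CREATE INDEX customers_last_name_idx ON customers (cust_last_name);
A query with a selective predicate on CUST_LAST_NAME can then traverse the index and fetch only the blocks that contain matching rows instead of scanning the entire table.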

125 Oracle 10g: Data Warehousing Fundamentals 1 - 164
Bitmap Indexes Provide performance benefits and storage savings Store values as 1s and 0s Can be used instead of B-tree indexes when: Tables are large Columns have low cardinality Bitmap Indexes Bitmap indexes provide substantial performance benefits and storage savings. When a bitmap index is created on a column, a bit stream (ones and zeros) is created for each distinct value in the indexed column. They are useful on low cardinality data. Scanning 1s and 0s is much more efficient than scanning data values. Bitmap indexes are more advantageous than B-tree indexes in certain situations: When a table has millions of rows and the key columns have low cardinality—that is, there are very few distinct values for the column. For example, bitmap indexes might be preferable to B-tree indexes for the region, gender, and marital status columns of a table containing customer records. When queries often use a combination of multiple WHERE conditions involving the OR operator When there is read-only or low update activity on the key columns. Oracle 10g: Data Warehousing Fundamentals

126 Bitmap Index: Example
Customer table:
CUSTOMER_NO  MARITAL_STATUS  REGION   GENDER  INCOME_LEVEL
101          single          east     male    bracket_1
102          married         central  female  bracket_4
103          married         west     female  bracket_2
104          divorced        west     male    bracket_4
105          single          central  female  bracket_2
106          married         central  female  bracket_3
CREATE BITMAP INDEX REGION_IDX ON CUSTOMER(REGION);
Sample bitmap index on the REGION column:
REGION = 'east'     1 0 0 0 0 0
REGION = 'central'  0 1 0 0 1 1
REGION = 'west'     0 0 1 1 0 0
Bitmap Index: Example The Customer table in the slide shows a portion of a company’s customer data. Because MARITAL_STATUS, REGION, GENDER, and INCOME_LEVEL are all low cardinality columns (there are only three possible values for marital status and region, two possible values for gender, and four income levels), it is appropriate to create bitmap indexes on these columns. You should not create a bitmap index on CUSTOMER_NO because this is a high cardinality column. Instead, a unique B-tree index on this column would provide the most efficient representation and retrieval. Each entry or bit in the bitmap corresponds to a single row of the Customer table. The value of each bit depends upon the values of the corresponding row in the table. For instance, the bitmap REGION = 'east' contains a 1 as its first bit because the region is “east” in the first row of the table. The bitmap REGION = 'east' has a 0 for its other bits because none of the other rows of the table contains “east” as their value for REGION. Oracle 10g: Data Warehousing Fundamentals

127 Oracle 10g: Data Warehousing Fundamentals 1 - 166
Bitmap Index: Example
SELECT COUNT(*) FROM CUSTOMER
WHERE MARITAL_STATUS = 'married'
AND REGION IN ('central', 'west');
MARITAL_STATUS = 'married'              0 1 1 0 0 1
REGION = 'central'                      0 1 0 0 1 1
REGION = 'west'                         0 0 1 1 0 0
REGION = 'central' OR REGION = 'west'   0 1 1 1 1 1
Query result: 'married' AND ('central' OR 'west') = 0 1 1 0 0 1
Bitmap Index: Example (continued) An analyst investigating demographic trends of the company’s customers might ask, “How many of our married customers live in the central or west regions?” This corresponds to the following SQL query: SELECT COUNT(*) FROM CUSTOMER WHERE MARITAL_STATUS = 'married' AND REGION IN ('central', 'west'); Bitmap indexes can process this query with great efficiency by merely counting the number of 1s in the resulting bitmap, as shown in the slide. To identify the specific customers who satisfy the criteria, you use the resulting bitmap to access the table. In the example, rows 2, 3, and 6 satisfy the query and are therefore accessed from the Customer table. Oracle 10g: Data Warehousing Fundamentals

128 Comparing B-Tree and Bitmap Indexes
B-tree: Suitable for high-cardinality columns; updates on keys relatively inexpensive; inefficient for queries using OR predicates; useful for OLTP. Bitmap: Suitable for low-cardinality columns; updates to key columns very expensive; efficient for queries using OR predicates; useful for data warehousing. Comparing B-Tree and Bitmap Indexes Bitmap indexes are more compact than B-tree indexes when used with low-cardinality columns. Updates to key columns in a bitmap index are more expensive because bitmaps use bitmap-segment-level locking, whereas in a B-tree index, locks are on entries corresponding to individual rows of the table. Bitmap indexes can be used to perform bitwise Boolean operations: the Oracle server can combine two bitmap segments with a bitwise Boolean operation and get a resulting bitmap. This allows efficient use of bitmaps in queries that use Boolean predicates. In summary, B-tree indexes may be more suitable in an OLTP environment for indexing dynamic tables, whereas bitmap indexes may be useful in data warehouse environments where complex queries are used on large, static tables. Oracle 10g: Data Warehousing Fundamentals

129 Oracle 10g: Data Warehousing Fundamentals 1 - 168
Other Types of Indexes Bitmap join indexes Function-based indexes Domain indexes Partitioned indexes Other Types of Indexes Bitmap Join Indexes Oracle database also has the ability to build a bitmap index on a table based on columns of another table. This index type is called a bitmap join index, and it can be a single or multicolumn index and can combine columns of different tables. Bitmap join indexes materialize precomputed join results in a very efficient way. Typical usage in a data warehouse would be to create bitmap join indexes on a fact table in a star or snowflake schema on one or more columns of one or more dimension tables. This could improve star query processing times dramatically, especially when a star query has filter predicates on low cardinality attributes of different dimension tables and the combination of these attributes is highly selective on the fact table. Function-Based Indexes A function-based index is created when using functions or expressions that involve one or more columns in the table that is being indexed. A function-based index precomputes the value of the function or expression and stores it in the index. Function-based indexes can be created as either a B-tree or a bitmap index. Oracle 10g: Data Warehousing Fundamentals
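Note: The following statements are a sketch of these two index types, assuming the SH sample schema (a partitioned SALES fact table joined to the CUSTOMERS dimension); the index names are illustrative. The bitmap join index precomputes the join to a dimension attribute, and the function-based index precomputes an expression on a column:
CREATE BITMAP INDEX sales_cust_state_bjix
ON sales (customers.cust_state_province)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING;
CREATE INDEX customers_upper_name_idx
ON customers (UPPER(cust_last_name));
Because SALES is partitioned, the bitmap join index must be created LOCAL; a star query that filters on CUST_STATE_PROVINCE can then probe the fact table through the join index without visiting the CUSTOMERS table at all.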

130 Star Query Optimization
Star query optimization requires the following: Tuning star queries A bitmap index should be built on each of the foreign key columns of the fact table. The STAR_TRANSFORMATION_ENABLED initialization parameter should be set to TRUE. The cost-based optimizer should be used. Using star transformation Star Query Optimization A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table by using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer (Oracle’s statistical mechanism that analyzes where and how to retrieve data from the server in the fastest manner) recognizes star queries and generates efficient execution plans for them. To get the best possible performance for star queries, it is important to follow some basic guidelines: A bitmap index should be built on each of the foreign key columns of the fact table or tables. The STAR_TRANSFORMATION_ENABLED initialization parameter should be set to TRUE. This enables an important optimizer feature for star queries. It is set to false by default for backward compatibility. The cost-based optimizer should be used. This does not apply solely to star schemas; all data warehouses should always use the cost-based optimizer. When a data warehouse satisfies these conditions, the majority of the star queries running in the data warehouse will use a query execution strategy known as the star transformation. The star transformation provides very efficient query performance for star queries. Oracle 10g: Data Warehousing Fundamentals

131 Oracle 10g: Data Warehousing Fundamentals 1 - 171
Star Transformation A cost-based query transformation aimed at executing star queries efficiently Works well for schemas with a small number of dimensions and dense fact tables Automatically chosen by Oracle’s cost-based optimizer (CBO) Star Transformation The star transformation is a powerful optimization technique that relies upon implicitly rewriting (or transforming) the SQL of the original star query. The end user never needs to know any of the details about the star transformation. Oracle’s cost-based optimizer automatically chooses the star transformation where appropriate. The star transformation is a cost-based query transformation aimed at executing star queries efficiently. Oracle processes a star query by using two basic phases: The first phase retrieves exactly the necessary rows from the fact table (the result set). Because this retrieval uses bitmap indexes, it is very efficient. The second phase joins this result set to the dimension tables. Oracle 10g: Data Warehousing Fundamentals

132 Oracle 10g: Data Warehousing Fundamentals 1 - 173
Star Query: Example
SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc,
       SUM(s.amount_sold) sales_amount
FROM   sales s, times t, customers c, channels ch
WHERE  s.time_id = t.time_id
AND    s.cust_id = c.cust_id
AND    s.channel_id = ch.channel_id
AND    c.cust_state_province = 'CA'
AND    ch.channel_desc IN ('Internet','Catalog')
AND    t.calendar_quarter_desc IN ('1999-Q1','1999-Q2')
GROUP BY ch.channel_class, c.cust_city, t.calendar_quarter_desc;
Star Query: Example Consider the star query in the slide. In order for the star transformation to operate, it is assumed that the Sales table of the Sales History schema has bitmap indexes on the time_id, channel_id, and cust_id columns. Oracle 10g: Data Warehousing Fundamentals

133 Star Transformation Hints
STAR_TRANSFORMATION hint: Use the best plan containing a star transformation, if there is one. FACT(<table_name>) hint: The hinted table should be considered as the fact table in the context of a star transformation. NO_FACT (<table_name>) hint: The hinted table should not be considered as the fact table in the context of a star transformation. FACT and NO_FACT hints are useful for star queries containing more than one fact table. Star Transformation Hints You can also use the following hints for optimizing the star queries further: The STAR_TRANSFORMATION hint makes the optimizer use the best plan in which the transformation has been used. Without the hint, the optimizer could make a cost-based decision to use the best plan generated without the transformation, instead of the best plan for the transformed query. Even if the hint is given, there is no guarantee that the transformation will take place. The optimizer generates the subqueries if it seems reasonable to do so. If no subqueries are generated, then there is no transformed query, and the best plan for the untransformed query is used, regardless of the hint. The FACT hint is used in the context of the star transformation to indicate to the transformation that the hinted table should be considered as a fact table. The NO_FACT hint is used in the context of the star transformation to indicate to the transformation that the hinted table should not be considered as a fact table. Note: The FACT and NO_FACT hints might be useful only if there is more than one fact table accessed in the star query. Oracle 10g: Data Warehousing Fundamentals
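Note: As an illustrative sketch (reusing tables from the star query example earlier in this lesson), the hints are placed in a comment immediately after the SELECT keyword; here the fact table alias s is explicitly marked as the fact table:
SELECT /*+ STAR_TRANSFORMATION FACT(s) */
       ch.channel_class, c.cust_city, SUM(s.amount_sold) sales_amount
FROM   sales s, customers c, channels ch
WHERE  s.cust_id = c.cust_id
AND    s.channel_id = ch.channel_id
AND    c.cust_state_province = 'CA'
GROUP BY ch.channel_class, c.cust_city;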

134 Oracle 10g: Data Warehousing Fundamentals 1 - 175
Parallelism (Diagram: parallel execution servers P1, P2, and P3 working on the Sales and Customers tables.) Parallelism Parallelism is the ability to apply multiple CPU and I/O resources to the execution of a single SQL command. Simply expressed, parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when four processes handle four different quarters in a year instead of one process handling all four quarters by itself. The improvement in performance can be quite high. In this case, each quarter will be a partition, a smaller and more manageable unit of an index or table. Oracle server’s unique parallel architecture allows any query to execute with any degree of parallelism. Oracle server intelligently chooses the degree of parallelism for each query, based upon the complexity of the query, the size of the tables in the query, the hardware configuration, and the current level of activity on the system. Parallelism is a fundamental performance feature for executing queries over large volumes of data. Note: For more information, refer to the Oracle Database Data Warehousing Guide 10g Release 2 (10.2). Oracle 10g: Data Warehousing Fundamentals
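Note: As a simple sketch, a scan of a large fact table can be parallelized with a statement-level hint; this example requests a degree of parallelism of 4 on the SALES table of the SH sample schema:
SELECT /*+ PARALLEL(s, 4) */ SUM(s.amount_sold)
FROM   sales s;
Without the hint, the default or table-level degree of parallelism described on the following pages applies.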

135 Degree of Parallelism (DOP)
DOP is the number of parallel execution servers used by one parallel operation. This applies only to intraoperation parallelism. If interoperation parallelism is used, then the number of parallel execution servers can be twice the DOP. No more than two sets of parallel execution servers can be used for one parallelized statement. When using partition granules, use a relatively high number of partitions. Degree of Parallelism (DOP) The number of parallel execution servers associated with a single (parallel) operation is known as the degree of parallelism. The parallel execution coordinator may enlist two or more of the instance’s parallel execution servers to process a SQL statement. Note that the degree of parallelism applies directly only to intraoperation parallelism. If interoperation parallelism is possible, the total number of parallel execution servers for a statement can be twice the specified degree of parallelism. No more than two sets of parallel execution servers can execute simultaneously. Each set of parallel execution servers may process multiple operations. Only two sets of parallel execution servers need to be active to guarantee optimal interoperation parallelism. You can specify the DOP as a hint in the operation. The default DOP is used when you want to parallelize an operation but you do not specify a DOP in a hint or within the definition of a table or index. The default DOP is appropriate for most applications. The default DOP for a SQL statement is determined by the factors such as the number of CPUs, some parallelism initialization parameters, number of partitions, and so on. Oracle 10g: Data Warehousing Fundamentals

136 Operations That Can Be Parallelized
Access methods: Table scans, index lookups Partitioned index range scans Various SQL operations Joins: Nested loop, sort merge Hash, star transformation, partitionwise join DDL statements: CREATE TABLE AS SELECT (CTAS), CREATE INDEX, REBUILD INDEX [PARTITION] MOVE, SPLIT, COALESCE PARTITION DML statements: INSERT SELECT, UPDATE, DELETE, MERGE SQL*Loader Operations That Can Be Parallelized The Oracle server can use parallel execution for any of the following: Access Methods: Table scans, index full scans, and partitioned index range scans Joins: Nested loop, sort merge, hash, and star transformation DDL Statements: CREATE TABLE AS SELECT, CREATE INDEX, REBUILD INDEX, REBUILD INDEX PARTITION, and MOVE SPLIT COALESCE PARTITION DML Statements: INSERT SELECT, UPDATE, MERGE, and DELETE Miscellaneous SQL Operations: Such as GROUP BY, ORDER BY, NOT IN, EXISTS, IN, SELECT DISTINCT, UNION, UNION ALL, MINUS, INTERSECT, CUBE, and ROLLUP, as well as aggregate and table functions. Oracle 10g: Data Warehousing Fundamentals

137 Parallel Execution Server Pool
A pool of servers is created at instance startup. Minimum pool size is determined by PARALLEL_MIN_SERVERS. Pool size can increase based on demand. Maximum pool size is determined by PARALLEL_MAX_SERVERS. If a parallel execution server is idle for more than a threshold period of time, then it is terminated. Parallel Execution Server Pool When an instance starts up, Oracle database creates a pool of parallel execution servers that are available for any parallel operation. The PARALLEL_MIN_SERVERS initialization parameter specifies the number of parallel execution servers that are created at instance startup. When executing a parallel operation, the parallel execution coordinator obtains parallel execution servers from the pool and assigns them to the operation. If required, Oracle database can create additional parallel execution servers for the operation. However, Oracle database never creates more parallel execution servers for an instance than the value specified by the PARALLEL_MAX_SERVERS initialization parameter. These parallel execution servers remain with the operation throughout job execution, and then become available for other operations. After the statement has been processed completely, the parallel execution servers return to the pool. If the number of parallel operations decreases, Oracle database terminates any parallel execution servers that have been idle for a threshold period of time. Oracle database does not reduce the size of the pool below the value of PARALLEL_MIN_SERVERS, no matter how long the parallel execution servers have been idle. Note: The recommended value for PARALLEL_MAX_SERVERS is 2*DOP*Number of concurrent users. The recommended value for PARALLEL_MIN_SERVERS is 0. Oracle 10g: Data Warehousing Fundamentals
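Note: These pool limits are ordinary initialization parameters. As an illustrative example only (the values shown are arbitrary, and whether a change takes effect immediately depends on the release and on whether an SPFILE is in use), a DBA might record new limits as follows:
ALTER SYSTEM SET parallel_min_servers = 0 SCOPE = SPFILE;
ALTER SYSTEM SET parallel_max_servers = 64 SCOPE = SPFILE;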

138 PARALLEL Clause: Examples
CREATE INDEX ord_customer_ix ON orders (customer_id) NOLOGGING PARALLEL;
ALTER TABLE customers PARALLEL 5;
ALTER TABLE sales SPLIT PARTITION sales_q4_2000 AT ('15-NOV-2000')
INTO (PARTITION sales_q4_1, PARTITION sales_q4_2) PARALLEL 2;
PARALLEL Clause: Examples In the first example, if the sample ORDERS table had been created using a fast parallel load, you might issue the first statement to quickly create an index in parallel (Oracle server chooses the appropriate degree of parallelism). Note that in this case, the default degree of parallelism used to create the index will also be stored as the dictionary DOP. The second example changes the default DOP of the Customers table. The last example splits the SALES_Q4_2000 partition into two new partitions. This operation is done in parallel with a DOP explicitly set to two. Note: The TO_DATE() function for the last SQL example was omitted for simplicity reasons. Oracle 10g: Data Warehousing Fundamentals

139 Oracle 10g: Data Warehousing Fundamentals 1 - 180
Using Summary Data Designing summaries offers the following benefits: Provides fast access to precomputed data Reduces use of I/O, CPU, and memory Using Summary Data Another technique employed in data warehouses to improve performance is the creation of summaries. Summaries contain preaggregated and prejoined data, and are created in Oracle database by using a schema object called a materialized view. Materialized Views for Data Warehouses A materialized view eliminates the overhead that is associated with expensive joins and aggregations for a large or important class of queries. Having direct access to a summary containing precomputed data reduces the disk I/O, CPU time, sort area, and memory swapping requirements. Materialized views within the data warehouse are transparent to the end user or to the database application. The database administrator creates materialized views. The end user queries the tables and views at the detail data level. The query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. Note: A more detailed discussion of summary management and materialized views is presented in the lesson titled “Summary Management.” Oracle 10g: Data Warehousing Fundamentals
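Note: As a minimal sketch (assuming the SH sample schema; the view name is illustrative), a summary of sales by product can be stored as a materialized view that is eligible for query rewrite. A fast (incremental) refresh would additionally require materialized view logs on the base table, so a complete refresh is used here:
CREATE MATERIALIZED VIEW sales_by_prod_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prod_id, SUM(amount_sold) AS total_sold, COUNT(*) AS sales_count
FROM   sales
GROUP BY prod_id;
Queries that aggregate SALES by product can then be rewritten transparently to read this much smaller summary.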

140 Security in Data Warehouses
Why is security important in data warehouses? To prevent unauthorized access To avoid data theft by hackers To provide the right data to the right set of users To keep a record of the user activities Security in Data Warehouses Data warehousing poses its own set of challenges for security: enterprise data warehouses are often very large systems serving many user communities with varying data access and security needs. Security must be built into the core of a data warehouse; therefore, considering the security aspects while finalizing the physical model becomes very important. Why Is Security Important for a Data Warehouse? Many of the basic requirements for security are well-known, and apply equally to a data warehouse as to any other system: the application must prevent unauthorized users from accessing or modifying data, the applications and underlying data must not be susceptible to data theft by hackers, the data must be available to the right users at the right time, and the system must keep a record of activities performed by its users. These requirements are perhaps even more important in a data warehouse because, by definition, a data warehouse contains data consolidated from multiple sources, and thus from the perspective of a malicious individual trying to steal information, a data warehouse can be one of the most lucrative targets in an enterprise. Oracle 10g: Data Warehousing Fundamentals

141 Oracle’s Strategy for Data Warehouse Security
Role-based access control Roles are collections of privileges: system privileges and object privileges Virtual Private Database (VPD) Fine grain access control Application contexts Row-level and column-level security Oracle Label Security (augments VPD) Oracle’s Strategy for Data Warehouse Security Among the best ways to mitigate security risk is to provide multiple layers of security mechanisms so that the failure of a single mechanism does not result in the compromise of critical information. Oracle Database 10g addresses data warehouse security, performance, and scalability objectives by offering the following features or techniques: Role-based access control: Database privileges and roles ensure that a user can perform an operation on a database object only if he or she has been authorized to perform that operation. A privilege is an authorization to perform a particular operation; without explicitly granted privileges, a user cannot access any information in the database. System privileges authorize a user to perform a specific operation, such as the CREATE TABLE privilege, which allows a user to create a database table. Object privileges authorize a user to perform a specific operation on a particular object. An example of object privileges is SELECT ON SH.SALES to allow a particular user to query the Sales fact table. To address the complexity of privilege management, database roles encapsulate one or more privileges that can be granted to and revoked from users. For example, you can create the SALES_MANAGER role and grant it the required privileges, and then this role can in turn be granted to all the sales managers (see the example that follows). Roles enforce object-level security, which can be enhanced with the enforcement of security at the level of individual rows (or columns) within a database object, which is the next layer of security discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals
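Note: A minimal sketch of this role-based approach (the grantee JSMITH is illustrative only):
CREATE ROLE sales_manager;
GRANT SELECT ON sh.sales TO sales_manager;
GRANT SELECT ON sh.customers TO sales_manager;
GRANT sales_manager TO jsmith;
Revoking the role from a user, or a privilege from the role, takes effect for all grantees without touching individual grants.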

142 Oracle’s Strategy for Data Warehouse Security
What is Virtual Private Database (VPD)? Server-enforced fine grain access control (FGAC) Is implemented by associating security policies with tables, views, and materialized views. Provides secure access to critical data. Within a single central database instance, every user sees a private database (virtually). What is Oracle Label Security? Oracle Label Security augments VPD with sensitive labeled data management. Oracle’s Strategy for Data Warehouse Security What Is VPD? Oracle Database 10g has set the standard in data warehouse security with Virtual Private Database (VPD), offering support for both row-level security (RLS) and column-level security. VPD supports server-enforced fine grain access control, together with the secure application context, the features that are unique to Oracle databases. Virtual Private Database is implemented by associating one or more security policies (defined in packages and functions) with tables, views, or materialized views. This enables multiple customers and partners to have secure access to critical data. Thus, within a single database, VPD enables per-user or per-customer data access. Fine Grain Access Control Fine grain access control relies on “dynamic query modification” to enforce security policies on the objects with which the policies are associated. Here, “query” refers to any selection from a table or view, including data access through a query-for-update, INSERT, UPDATE, or DELETE statements, or a subquery, not just statements that begin with SELECT. A user directly or indirectly accessing a table or view that has a security policy associated with it causes the server to dynamically modify the statement based on a WHERE condition (known as a predicate) returned by a function that implements the security policy. Oracle 10g: Data Warehousing Fundamentals

143 Oracle-Supplied Technology and Tools for Implementing VPD
PL/SQL packages and functions: DBMS_RLS, DBMS_SESSION SYS_CONTEXT Tool: Oracle Policy Manager Oracle-Supplied Technology and Tools for Implementing VPD Oracle database supplies many PL/SQL packages, such as DBMS_RLS and DBMS_SESSION, and functions, such as SYS_CONTEXT, which aid the implementation of VPD. DBMS_RLS: Using this package, you can administer security policies. The DBMS_RLS package contains procedures and functions (such as ADD_POLICY, DROP_POLICY, and so on), which enable you to add a policy, drop a policy, or refresh, enable, or disable an existing policy. DBMS_SESSION: This package provides access to session information, which can be altered from PL/SQL statements. You can also use this package to set preferences and security levels. Use the programs in the DBMS_SESSION package to maintain and list application contexts, and maintain identifiers used with global contexts. Note: Application developers or DBAs can use the CREATE CONTEXT command to create application contexts. SYS_CONTEXT: This function returns the values of application context attributes, including built-in attributes that contain session properties, and user-defined attributes from user-defined contexts. Oracle 10g: Data Warehousing Fundamentals
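Note: The following is a hedged sketch of how these pieces fit together; the policy function, application context name, schema names, and table are illustrative only and are not part of the course scenario. The policy function returns a WHERE predicate, and DBMS_RLS.ADD_POLICY attaches it to a table:
CREATE OR REPLACE FUNCTION region_policy
  (p_schema IN VARCHAR2, p_object IN VARCHAR2)
  RETURN VARCHAR2
IS
BEGIN
  -- Restrict each user to rows for the region held in their application context
  RETURN 'region = SYS_CONTEXT(''sales_ctx'', ''user_region'')';
END;
/
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'SH',
    object_name     => 'SALES_BY_REGION',
    policy_name     => 'region_rls_policy',
    function_schema => 'SECADM',
    policy_function => 'REGION_POLICY',
    statement_types => 'SELECT');
END;
/
Any SELECT against SH.SALES_BY_REGION is then dynamically modified by the server to include the returned predicate.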

144 Oracle 10g: Data Warehousing Fundamentals 1 - 188
Practice 4-1: Overview This practice covers the following: Identifying the probable attributes for the dimensions and facts for the RISD scenario given Identifying the suitable indexes for the attributes Answering the questions related to indexing and parallelism Identifying the best strategy to implement security for the RISD scenario Reviewing the other concepts explained in the lesson Oracle 10g: Data Warehousing Fundamentals

145 The ETL Process: Extracting Data
Schedule: Timing Topic 45 minutes Lecture 20 minutes Practice 65 minutes Total

146 Oracle 10g: Data Warehousing Fundamentals 1 - 192
Objectives After completing this lesson, you should be able to do the following: Outline the extraction, transformation, and loading (ETL) processes for building a data warehouse Identify the ETL tasks, importance, and cost Explain how to examine data sources Identify extraction techniques and methods Identify analysis issues and design options for extraction processes List the selection criteria for the ETL tools Identify Oracle’s solution for the ETL process Lesson Aim This lesson introduces extraction, transformation, and loading (ETL) processes. It also focuses on extraction issues, and explores the sources of data for the data warehouses. You consider various extraction techniques, methods, and tools. Instructor Note Oracle offers Oracle Warehouse Builder (OWB), which is a powerful tool that enables the design and deployment of enterprise data warehouses, data marts, and e-business intelligence applications. Coupled with the Common Warehouse Metamodel (CWM) standard, OWB provides the framework for performing the ETL tasks, integrating the components of Oracle’s Business Intelligence solution, including Oracle E-Business Suite, as well as industry specific data warehousing solutions. Note that the Extraction, Transformation, and Loading phases are broadly classified, and often the ETL tasks may be conducted as a single process. Oracle 10g: Data Warehousing Fundamentals

147 Extraction, Transformation, and Loading (ETL) Process
Extract source data. Transform and clean data. Load data into warehouse. Operational systems Programs Tools Gateways Extraction, Transformation, and Loading (ETL) Process In order to load the data warehouse regularly, data from one or more operational systems needs to be extracted and copied into the data warehouse. The process of extracting data from source systems and bringing it into the data warehouse is known as extraction, transformation, and loading (or ETL in short). Before considering this lesson’s focus on extraction, you should be aware of what happens during each of these three major phases of the ETL process: Extraction: During extraction, the desired data has to be identified and extracted from many different sources, including database systems and applications. Transformation: After the data has been extracted, the required transformations are done to the data. For example, if you have extracted data from a nonrelational table, then some transformations need to be performed in order to insert this data into a relational table in the data warehouse. Loading: Loading refers to the operation of loading the data into the target warehouse database. Before loading the data, the transformed data has to be physically transported to the target system or an intermediate system for further processing. Data warehouse ETL Oracle 10g: Data Warehousing Fundamentals

148 ETL: Tasks, Importance, and Cost
Extract Clean up Consolidate Restructure Load Maintain Refresh Data warehouse Operational systems ETL Relevant Useful Quality Accurate Accessible ETL: Tasks, Importance, and Cost ETL Tasks: Extraction, transformation, and loading (ETL) involves a series of tasks that: Extract data from source systems Transform and clean up the data Index the data Summarize the data Load data into the warehouse Track the changes made to source data required for the warehouse Restructure keys Maintain the metadata Refresh the warehouse with updated data ETL Importance: The extraction, transformation, and loading processes are absolutely fundamental in ensuring that the data resident in the warehouse is: Relevant and useful to the business users Of high quality Accurate Easy to access so that the warehouse is used efficiently and effectively by the business users Oracle 10g: Data Warehousing Fundamentals

149 Oracle 10g: Data Warehousing Fundamentals 1 - 197
Extracting Data Source systems: Data from various data sources in various formats Extraction routines: Are developed to select data fields from sources Consist of business rules, audit trails, and error correction facilities Data mapping Transform Extracting Data Extraction is the operation of extracting data from one or more source systems for further use in a data warehouse environment. The data may come from a variety of data source systems, and the data may exist in a variety of formats. The extraction routines are specifically developed to account for the variety of systems from which data is taken. The routines contain data or business rules, audit trails, and error correction facilities. The routines take into account the frequency with which data is to be extracted. Extraction is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse. Note: Data staging area is where much of the data transformation and cleansing takes place. More details about data staging area can be found in the lesson titled “The ETL Process: Transforming Data.” Operational databases Warehouse database Data staging area Oracle 10g: Data Warehousing Fundamentals

150 Examining Data Sources
Production Archive Internal External Examining Data Sources The data source systems may comprise data existing in: Production operational systems Archives Internal files, such as individual spreadsheets and workbooks, which are not directly associated with company operational systems External data from sources outside the company Oracle 10g: Data Warehousing Fundamentals

151 Oracle 10g: Data Warehousing Fundamentals 1 - 199
Production Data Operating system platforms File systems Database systems and vertical applications IMS DB2 Oracle Sybase Informix VSAM SAP Shared Medical Systems Dun and Bradstreet Financials Hogan Financials Oracle Financials Production Data Production data may come from a multitude of different sources: Operating system platforms File systems (flat files, Virtual Storage Access System (VSAM), Indexed Sequential Access Method (ISAM), and so on.) Database systems—for example, Oracle, DB2, dBase, Informix, and so on Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun and Bradstreet Financials Oracle 10g: Data Warehousing Fundamentals

152 Oracle 10g: Data Warehousing Fundamentals 1 - 200
Archive Data Historical data Useful for analysis over long periods of time Useful for first-time load May require unique transformations Archive Data Archive data may be useful to the enterprise in supplying historical data. Historical data is needed if analysis over long periods of time is to be achieved. Archive data is not used consistently as a source for the warehouse; for example, it would not be used for regular data refreshes. However, for the initial implementation of a data warehouse (and for the first-time load), archived data is an important source of historical data. You need to consider this carefully when planning the data warehouse. How much historical data do you have available for the data warehouse? How much effort is necessary to transform it into an acceptable format? The archive data may need some careful and unique transformations, and clear details about the changes must be maintained in metadata. Operational databases Warehouse database Oracle 10g: Data Warehousing Fundamentals

153 Oracle 10g: Data Warehousing Fundamentals 1 - 201
Internal Data Planning, sales, and marketing organization data Maintained in the form of: Spreadsheets (structured) Documents (unstructured) Treated like any other source data Planning Accounting Marketing Internal Data Internal data may be information prepared by planning, sales, or marketing organizations that contains data such as budgets, forecasts, or sales quotas. The data contains figures (numbers) that are used across the enterprise for comparison purposes. The data is maintained using software packages such as spreadsheets and word processors, and uploaded into the warehouse. Internal data is treated like any other source system data. It must be transformed, documented in metadata, and mapped between the source and target databases. Warehouse database Oracle 10g: Data Warehousing Fundamentals

154 External Data Information from outside the organization
Issues of frequency, format, and predictability Described and tracked using metadata A.C. Nielsen, IRI, IMS, Walsh America Purchased databases Competitive information Dun and Bradstreet Economic forecasts External Data External data is important if you want to compare the performance of your business against others’. There are many sources for external data: Periodicals and reports External syndicated data feeds (Some warehouses rely regularly on this as a source.) Competitive analysis information Newspapers Purchased marketing, competitive, and customer-related data Free data from the Web Wall Street Journal Barron’s Warehousing databases Oracle 10g: Data Warehousing Fundamentals

155 Oracle 10g: Data Warehousing Fundamentals 1 - 204
Mapping Data Mapping data defines: Which operational attributes to use How to transform the attributes for the warehouse Where the attributes exist in the warehouse File A F1 123 F2 Bloggs F3 10/12/56 Staging File One Number USA123 Name Mr. Bloggs DOB Dec-56 Metadata Mapping Data After you have determined your business subjects for the warehouse, you need to determine the required attributes from the source systems. On an attribute-by-attribute basis, you must determine how the source data maps into the data warehouse, and what, if any, transformation rules to apply. This is known as mapping. There are mapping tools available. Mapping information should be maintained in the metadata that is server (RDBMS) resident, for ease of access, maintenance, and clarity. File A F1 Staging File One Number F2 F3 Name DOB Oracle 10g: Data Warehousing Fundamentals

156 Oracle 10g: Data Warehousing Fundamentals 1 - 205
Extraction Methods Logical extraction methods: Full extraction Incremental extraction Physical extraction methods: Online extraction Offline extraction Your logical choice influences the way the data is physically extracted. Extraction Methods The extraction method that you choose is highly dependent on the source system and the business needs in the target data warehouse environment. In addition, the estimated amount of the data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also impact the decision of how to extract, from a logical and a physical perspective. Basically, you have to decide how to extract data logically and physically. Your logical choice influences the way the data is physically extracted. Logical Extraction Methods Full extraction: The data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the data source from the time of the last successful extraction. The source data will be provided as is and no additional logical information (for example, timestamps) is necessary on the source site. Incremental extraction: At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event such as the last booking day of a fiscal period. To identify this delta change, it must be possible to identify all the information that has changed since that event. Oracle 10g: Data Warehousing Fundamentals
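A minimal SQL sketch of the two logical methods, assuming a hypothetical ORDERS source table and, for the incremental case, assuming the source carries a LAST_UPDATED timestamp column (many legacy sources do not, which is one reason the Change Data Capture mechanism described next matters). The EXTRACT_LOG bookkeeping table is also hypothetical.

-- Full extraction: copy everything into a staging table; no change tracking needed
CREATE TABLE stg_orders AS
SELECT * FROM orders;

-- Incremental extraction: copy only rows changed since the last successful run
INSERT INTO stg_orders
SELECT o.*
FROM   orders o
WHERE  o.last_updated > (SELECT MAX(last_extract_time) FROM extract_log);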

157 Change Data Capture Mechanism in Oracle Database 10g
Facilitates incremental extraction Captures all the INSERTs, UPDATEs, and DELETEs Allows changes to data to be stored in change tables Is based on Publish and Subscribe model Allows synchronous and asynchronous data capture Change data Data Publisher Change Data Capture Mechanism in Oracle Database 10g An important consideration for extraction is incremental extraction, also called Change Data Capture. If a data warehouse extracts data from an operational system every night, then the data warehouse requires only the data that has changed since the last extraction (that is, the data that has been modified in the past 24 hours). When it is possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process including refresh) can be much more efficient because the volume of data extracted is small. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the system. Change Data Capture is typically the most challenging technical issue in data extraction. Oracle Change Data Capture (CDC) is an in-built feature of the Oracle Database server that is used in data warehouses (introduced in Oracle9i database). Change Data Capture captures all the INSERTs, UPDATEs, and DELETEs (DML operations) done on the tables. These changes are stored in a new database object called a change table, and the change data is made available to applications in a controlled way using views. Change table Source table Oracle 10g: Data Warehousing Fundamentals
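Oracle CDC creates and maintains change tables for you (through its publish and subscribe PL/SQL packages), so no hand coding is required. Purely as an illustration of what a change table records, a simplified, hand-rolled trigger-based sketch on a hypothetical ORDERS table might look like this; it is not the CDC API itself.

-- Illustrative only: a change table and a trigger that logs every DML operation
CREATE TABLE orders_ct (
  operation   VARCHAR2(1),             -- 'I', 'U', or 'D'
  change_time DATE DEFAULT SYSDATE,
  order_id    NUMBER,
  amount      NUMBER);

CREATE OR REPLACE TRIGGER orders_capture
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO orders_ct (operation, order_id, amount)
    VALUES ('D', :OLD.order_id, :OLD.amount);
  ELSE
    INSERT INTO orders_ct (operation, order_id, amount)
    VALUES (CASE WHEN INSERTING THEN 'I' ELSE 'U' END,
            :NEW.order_id, :NEW.amount);
  END IF;
END;
/

A nightly extraction can then read only ORDERS_CT instead of scanning the whole source table; with Oracle CDC, subscribers read the equivalent information through their subscription views.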

158 Change Data Capture Mechanism in Oracle Database 10g
Subscriber view 1 Subscriber view 2 Subscriber 1 Subscriber 2 Change table Change Data Capture Mechanism in Oracle Database 10g (continued) Publish and Subscribe Model The Change Data Capture system is based on the Publish and Subscribe model. Change Data Capture provides PL/SQL packages to accomplish the publish and subscribe tasks. The publisher (usually the DBA) is in charge of capturing and publishing change data for any number of Oracle source tables. The publisher determines which source tables the data warehouse application is interested in capturing the changes from. For each source table on the OLTP system for which the changes are to be captured, the publisher creates a change table on the staging system. As the data manipulation language (DML) operations are performed, the change data is captured and published to corresponding change tables. The publisher allows subscribers to access these change tables in a controlled way. The publisher controls the access to the change tables by using SQL GRANT and REVOKE statements. The subscribers (usually applications) are consumers of the published change data. Subscribers subscribe to one or more sets of columns in source tables. Each subscriber has its own view of change data (often called “subscription view”), so there can be multiple subscribers accessing the same change data without interfering with one another. Oracle 10g: Data Warehousing Fundamentals

159 Extraction Techniques
Programs: C, C++, PL/SQL, or Java Gateways: Transparent database access ETL tools: Oracle Warehouse Builder Extraction Techniques You can extract data from different source systems to the warehouse in different ways: Programmatically, using procedural languages such as C, C++, PL/SQL, or Java Using a gateway to access data sources. This method is acceptable only for small amounts of data; otherwise, the network traffic becomes unacceptably high. Using an ETL tool such as Oracle Warehouse Builder, which is discussed later in this lesson Oracle 10g: Data Warehousing Fundamentals
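For example, with a gateway or database link in place, the warehouse side can pull source rows with plain SQL. The link, connection details, and table names below are hypothetical.

-- Database link pointing at the operational system (connection details illustrative)
CREATE DATABASE LINK oltp_link
  CONNECT TO extract_user IDENTIFIED BY extract_pwd
  USING 'oltp_tns_alias';

-- Pull the required columns across the link into a staging table
CREATE TABLE stg_customers AS
SELECT customer_id, customer_name, region
FROM   customers@oltp_link;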

160 Designing Extraction Processes
Analysis: Sources, technologies Data types, quality, owners Design options: Manual, custom, gateway, tools Replication, full, or delta refresh Design issues: Volume and consistency of data Automation, skills needed, resources Designing Extraction Processes When designing your extraction processes, consider the analysis issues, the design options that are available to you, and the design issues. Analysis The sources and technologies used Existing data feeds and redo logs Data types (EBCDIC or ASCII) Data quality and ownership Data volumes Operational schedule in the source environment Spare processing capacity in the source environment Design Options Manual data entry Custom programs Gateway technologies Replication techniques Tools Full refresh or delta changes Oracle 10g: Data Warehousing Fundamentals

161 Maintaining Extraction Metadata
Source location, type, structure Access method Privilege information Temporary storage Failure procedures Validity checks Handlers for missing data Maintaining Extraction Metadata It is essential to maintain a metadata trail of information about all ETL processes, including the extraction process. This information is important for warehouse enhancement and performance improvements. The quality of metadata is critical for every aspect of the warehouse; attention must be paid to its control, management, and change. Extraction metadata includes: The source location, type, contact, and structure information The access method The privilege information The extraction temporary storage information The extraction failure and validity check procedures information Information about how to handle missing data Extraction metadata also contains information about the frequency of program execution and maps the source data to the target database. Oracle 10g: Data Warehousing Fundamentals

162 Oracle 10g: Data Warehousing Fundamentals 1 - 215
Possible ETL Failures A missing source file A system failure Inadequate metadata Poor mapping information Inadequate storage planning A source structural change No contingency plan Inadequate data validation Possible ETL Failures ETL processes are vital to the warehouse, and they must succeed. ETL may fail for any of the following reasons: Extraction routines must specify the name and location of the source data. A missing file may cause the extraction to fail. You must therefore ensure that exception and error handling routines are included. If there is a system or media failure during the process, the process may fail entirely. You must start again or you may, depending upon system settings, be able to continue from the point of failure. Metadata that inadequately describes the source to destination mapping and rules will cause ETL to fail (for example, when an unexpected value is found). Without the space for temporary data, staging data, and sorting operations, ETL fails. Any changes to the source systems that are not documented in metadata will cause extraction to fail. Oracle 10g: Data Warehousing Fundamentals

163 Maintaining ETL Quality
ETL must be: Tested Documented Monitored and reviewed Disparate metadata must be coordinated. Maintaining ETL Quality Any failure of the ETL processes affects data quality, the importance of which cannot be overstated. Inaccurate data leads to inaccurate analysis results, which lead to bad business decisions. The result of poor data quality is a lack of confidence in the system to deliver the solution. Testing the process: You should test the proposed ETL techniques to ensure that volumes can be physically moved within the load window constraints and network capabilities. Documenting the process: You must communicate and document the proposed load processes with the operations organization to ensure their agreement and commitment to this important process. Monitoring and reviewing the process: You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision. The granularity of the data also affects query capabilities and the warehouse size. Note: For more details about data granularity, see the lesson titled “The ETL Process: Loading Data.” Oracle 10g: Data Warehousing Fundamentals

164 Oracle’s ETL Tool: Oracle Warehouse Builder
Offers extensible framework for the design and deployment of data warehouses Generates the following types of code: SQL DDL scripts PL/SQL programs SQL*Loader control files TCL scripts Oracle Warehouse Builder Oracle Warehouse Builder (OWB) offers an extensible framework for designing and deploying enterprise data warehouses, data marts, and e-business intelligence applications. OWB leverages Oracle Database 10g ETL features, provides the framework for integrating all the components of an Oracle warehouse, and is the most comprehensive solution for data warehousing and e-business intelligence applications. OWB’s graphical user interface (GUI) facilitates fast, efficient design and deployment of data warehouses. There are wizard-driven processes that guide users through all the implementation phases of building a data warehouse: The metadata source definition (wizard-driven) process supports source metadata import into the repository of OWB. OWB supports both 3NF (third normal form) and star schema designs. OWB also features wizards and graphical editors for tables, fact tables, dimensions, views, and materialized views. OWB provides automated code generation with a validation process for error-free code. Different types of code are generated based on deployment requirements: SQL DDL scripts that create staging tables and the target schema PL/SQL programs for source-to-target mapping and transformations SQL*Loader control files for loading data from flat files TCL scripts to schedule generated PL/SQL programs as jobs in Oracle Enterprise Manager Oracle 10g: Data Warehousing Fundamentals

165 Oracle-Supported Features for ETL
Oracle’s utilities: SQL*Loader Export/Import Oracle Database 10g has many enhanced features for ETL. Oracle’s support for the current trends in data warehousing Oracle’s Solution for ETL Oracle offers utilities such as SQL*Loader and Export/Import to aid the ETL processes. SQL*Loader loads data from external or flat files into tables of an Oracle database. When you run the Export utility, the objects (such as tables) are extracted, followed by their related objects (such as indexes, comments, and grants), and the extracted data is written to an export (dump) file. The Import utility reads the object definitions and table data from an export (dump) file. It inserts these data objects into an Oracle database. Oracle database also provides a rich set of features and capabilities that can be used by both ETL tools and customized ETL solutions. Oracle offers techniques for extracting and transporting data between databases, for transforming large volumes of data, and for quickly loading new data into a data warehouse. For example, Oracle database offers many enhanced features such as the Change Data Capture mechanism and transportable tablespaces to aid faster and incremental data extraction. External tables, multitable INSERTs, and table functions are some of the other Oracle database features that aid ETL processes. These features of the Oracle database are often referred to as the ETL toolkit. These ETL features are covered in more detail in later lessons. Oracle 10g: Data Warehousing Fundamentals
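As a small example of the SQL*Loader piece, a minimal control file for loading a comma-delimited extract into a hypothetical staging table might look like the following; the table, file, and column names are illustrative.

-- sales.ctl: load a comma-delimited flat-file extract into a staging table
LOAD DATA
INFILE 'sales_extract.dat'
APPEND
INTO TABLE stg_sales
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(sale_date      DATE "DD-MON-YYYY",
 product_id,
 store_id,
 sales_units,
 sales_dollars)

It would typically be run with a command along the lines of sqlldr userid=dwh/password control=sales.ctl log=sales.log.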

166 Oracle 10g: Data Warehousing Fundamentals 1 - 221
Oracle’s Solution for ETL: Oracle Streams, Replication, and Message Queuing Oracle Streams: Is the key information-sharing technology Provides greater functionality and flexibility Can be used for replication, message queuing, and data warehouse loading Oracle’s Solution for ETL: Oracle Streams, Replication, and Message Queuing An important feature of any database management system is the ability to share information among multiple databases and applications. Examples of information sharing include message queuing, database replication, data warehouse loading, and so on. Oracle Streams has features that can meet all these information-sharing needs. Oracle Streams Oracle Streams is the key information-sharing technology (introduced in Oracle9i). Oracle Streams enables the propagation of data, transactions, and events in a data stream, either within a database or from one database to another. Customers can use Streams to replicate data, implement message queuing and management, load changed data into data warehouses, send notifications of database events to subscribers, and provide high availability solutions to protect data. Streams provides greater functionality and flexibility for sharing information with other databases and applications. It satisfies the information-sharing needs of most customers with a single integrated solution. This integrated solution allows customers to break the cycle of trading off one solution for another. They can use all the capabilities of Streams (such as message queuing, replication, and so on) at the same time. Oracle 10g: Data Warehousing Fundamentals

167 Oracle 10g: Data Warehousing Fundamentals 1 - 223
Summary In this lesson, you should have learned how to: Outline the extraction, transformation, and loading (ETL) processes for building a data warehouse Identify ETL tasks, importance, and cost Explain how to examine data sources Identify extraction techniques and methods Identify analysis issues and design options for extraction processes List the selection criteria for the ETL tools Identify Oracle’s solution for the ETL process Oracle 10g: Data Warehousing Fundamentals

168 Oracle 10g: Data Warehousing Fundamentals 1 - 224
Practice 5-1: Overview This practice covers the following topics: Identifying the tools and techniques that will aid the ETL process for RISD Answering questions based on the scenario given Exploring viewlet-based demonstrations on OWB Oracle 10g: Data Warehousing Fundamentals

169 The ETL Process: Transforming Data
Schedule: Timing Topic 90 minutes Lecture 30 minutes Practice 120 minutes Total

170 Oracle 10g: Data Warehousing Fundamentals 1 - 227
Objectives After completing this lesson, you should be able to do the following: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data Describe techniques for transforming data Design transformation process List Oracle’s enhanced features and tools that can be used to transform data Lesson Aim In this lesson, you explore how the transformation process transforms data from source systems into data that is suitable for end-user query and analysis. The lesson also describes typical data anomalies and looks at ways to eliminate them. This lesson does not seek to illustrate all the typical transformations that would be encountered in a data warehouse, but to demonstrate the types of fundamental technology that can be applied to implement these transformations and to provide guidance on how to choose the best techniques. Oracle 10g: Data Warehousing Fundamentals

171 Oracle 10g: Data Warehousing Fundamentals 1 - 228
Transformation Transformation eliminates anomalies from operational data: Cleans and standardizes Presents subject-oriented data Transform: Clean up Consolidate Restructure Extract Warehouse Operational systems Load Transformation Transformation involves a number of tasks, the most important being to eliminate all anomalies. Transformation also includes eliminating formatting differences, assigning data types, defining consistent units of measure, and determining encoded structures. Along with these tasks, another objective is to ensure that the data is presented in a subject-oriented fashion. Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data clean-up techniques. Typically, data warehouse implementations require a data staging area, where much of the data transformation and cleansing takes place. It may be an operational data store environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools. You may employ multitier staging that reconciles data before and after the transformation process and before data is loaded into the warehouse. As many as three tiers are possible, from the operational server to the staging area and then to the warehouse server. Data staging area Oracle 10g: Data Warehousing Fundamentals

172 Possible Staging Models
Remote staging model Onsite staging model Possible Staging Models Out of the possible staging models listed in the slide, the model you choose depends on operational and warehouse requirements, system availability, connectivity bandwidth, gateway access, and volume of data to be moved or transformed. The description of each of these models is given in the next pages. Note: Oracle uses a model, which does not require intermediate staging, to implement enhanced transform-while-loading. Oracle’s table functions provide the support for this kind of pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. For more information, refer to the section titled “Oracle’s Enhanced Features for Transformation.” Oracle 10g: Data Warehousing Fundamentals

173 Oracle 10g: Data Warehousing Fundamentals 1 - 230
Remote Staging Model Data staging area within the warehouse environment Data staging area in its own environment Transform Warehouse Extract Load Operational system Staging area Transform Warehouse Remote Staging Model In this model, the staging area is not part of the operational environment. You may choose to extract the data from the operational environment and transport it into the warehouse environment for transformation processing. You may optionally execute some transformation processing during the extraction and transportation from the operational system to the warehouse environment. You would then execute the bulk of transformation processing in the warehouse environment. The other option is to have a separate data staging area, which is neither a part of the operational system nor a part of the warehouse environment, which eliminates the negative impact on the performance of the warehouse. Extract Load Operational system Staging area Oracle 10g: Data Warehousing Fundamentals

174 Oracle 10g: Data Warehousing Fundamentals 1 - 231
Onsite Staging Model Data staging area within the operational environment, possibly affecting the operational system Transform Extract Load Operational system Staging area Warehouse Onsite Staging Model Alternatively, you may choose to perform the cleansing, transformation, and summarization processes locally in the operational environment and then transport or load to the warehouse. This model may conflict with the day-to-day working of the operational system. If chosen, this model’s process should be executed when the operational system is idle or less heavily used. Instructor Note Make the point that some vendors employ slightly different approaches to ETL. The traditional scenario is that the data is transformed on the operational side. Other vendors are moving the staging file (containing nontransformed data) to the warehouse server and transforming it there, commonly called extraction, transportation, and transformation. In the diagrams in the slide, explain that the bulk of transformation typically occurs in the staging platform, but some transformation may optionally occur earlier during the extraction process. Oracle 10g: Data Warehousing Fundamentals

175 Oracle 10g: Data Warehousing Fundamentals 1 - 232
Data Anomalies No unique key Data naming and coding anomalies Data meaning anomalies between groups Spelling and text inconsistencies CUSNUM NAME ADDRESS Oracle Limited 100 N.E. 1st St. Oracle Computing 15 Main Road, Ft. Lauderdale Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA Oracle Corp UK Ltd 181 North Street, Key West, FLA Data Anomalies Reasons for Data Anomalies One of the causes of inconsistencies within internal data is that in-house system development takes place over many years, often with different software and development standards for each implementation. There may be no consistent policy for the software used in the corporate environment. Systems may be upgraded or changed over the years. Each system may represent data in different ways. Source Data Anomalies Many potential problems can exist with source data: No unique key for individual records Anomalies within data fields, such as differences between naming and coding (data type) conventions Differences in the interpreted meaning of the data by different user groups Spelling errors and other textual inconsistencies (This is particularly relevant in the area of customer names and addresses.) Oracle 10g: Data Warehousing Fundamentals

176 Transformation Routines
Cleaning data Eliminating inconsistencies Adding elements Merging data Integrating data Transforming data before load Transformation Routines The transformation process uses many transformation routines to eliminate the inconsistencies and anomalies from the extracted data. These transformation routines are designed to perform the following tasks: Cleaning the data, also referred to as data cleansing or scrubbing Adding an element of time (timestamp) to the data, if it does not already exist Translating the formats of external and purchased data into something meaningful for the warehouse Merging rows or records in files Integrating all the data into files and formats to be loaded into the warehouse Transformation can be performed: Before the data is loaded into the warehouse In parallel (On larger databases, there is not enough time to perform this process as a single threaded process.) The transformation process should be self-documenting, generate summary statistics, and process exceptions. Note: The terms scrubbing, cleaning, cleansing, and data reengineering are used interchangeably. Oracle 10g: Data Warehousing Fundamentals

177 Transforming Data: Problems and Solutions
Multipart keys Multiple local standards Multiple files Missing values Duplicate values Element names Element meanings Input formats Referential integrity constraints Name and address Transforming Data: Problems and Solutions The factors listed in the slide can potentially cause problems in the transformation process. Each of these problems and the probable solutions are discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

178 Multipart Keys Problem
Country code Sales territory Product number Salesperson Product code = M Multipart Keys Problem Many older operational systems used record key structures that had a built-in meaning. To allow for decision-support reporting, these keys must be broken down into atomic values. In the example, the key contains four atomic values: 12 is the country code, M is the sales territory, the middle digits are the product number, and 45 is the salesperson code. Oracle 10g: Data Warehousing Fundamentals
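A typical fix is to split the key with SUBSTR during the transform. The key layout below (a hypothetical value such as 12M30045, with positions 1-2 country, 3 territory, 4-6 product, 7-8 salesperson) and the staging table name are illustrative only.

-- Break a hypothetical multipart legacy key into atomic warehouse columns
SELECT SUBSTR(legacy_key, 1, 2) AS country_code,     -- e.g. 12
       SUBSTR(legacy_key, 3, 1) AS sales_territory,  -- e.g. M
       SUBSTR(legacy_key, 4, 3) AS product_number,   -- e.g. 300
       SUBSTR(legacy_key, 7, 2) AS salesperson_code  -- e.g. 45
FROM   stg_orders_raw;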

179 Multiple Local Standards Problem
Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 Multiple Local Standards Problem This is particularly relevant for values entered in different countries. For example, some countries use imperial measurements and others metric: currencies and date formats may differ; currency values and character sets may vary; and numeric precision values may differ. Currency values are often stored in two formats: a local currency (such as British pounds, European euros, Indian rupees, or Australian dollars) and a global currency (such as U.S. dollars). Solution Typically, you use tools or filters to preprocess this data into a suitable format for the database, with the logic needed to interpret and reconstitute a value. You might employ steps that are similar to those identified for multiple encoding. You may consider revising source applications to eliminate these inconsistencies early on. Oracle 10g: Data Warehousing Fundamentals
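A sketch of the kind of preprocessing involved, assuming hypothetical raw staging tables for a European feed and a US feed with the column names shown; the currency rate would normally come from a rate table rather than a literal.

-- European feed: DD/MM/YYYY dates, centimeters, local currency
INSERT INTO stg_orders (order_date, length_inches, amount_usd)
SELECT TO_DATE(order_date_txt, 'DD/MM/YYYY'),
       length_cm / 2.54,
       amount_gbp * 1.50          -- illustrative GBP-to-USD rate only
FROM   stg_orders_eu_raw;

-- US feed: MM/DD/YYYY dates, already in inches and US dollars
INSERT INTO stg_orders (order_date, length_inches, amount_usd)
SELECT TO_DATE(order_date_txt, 'MM/DD/YYYY'),
       length_inches,
       amount_usd
FROM   stg_orders_us_raw;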

180 Multiple Files Problem
Added complexity of multiple source files Start simple Multiple source files Logic to detect correct source Transformed data Multiple Files Problem The source of information may be one file for one condition, and a set of files for another. Logic (normally procedural) must be in place to detect the right source. The complexity of integrating data is greatly increased according to the number of data sources being integrated. For example, if you are integrating data from two sources, there is a single point of integration where conflicts must be resolved. Integrate from three sources, and there are three points of conflict. Four sources provide six conflict points. The number of conflict points grows combinatorially, as n(n-1)/2 for n sources. Solution This is a complex problem that requires the use of tools or well-documented transformation mechanisms. Try not to integrate all the sources in the first instance. Start with two or three and then enhance the program to incorporate more sources. Build on your learning experiences. Oracle 10g: Data Warehousing Fundamentals

181 Missing Values Problem
Solution: Ignore Wait Mark rows Extract when timestamped If NULL, then field = “A” A Missing Values Problem Null, missing, and default values are always an issue. NULL values may be valid entries where NULLs are allowed; otherwise, NULLs indicate missing values. Solution You must examine each occurrence of the condition to determine validity and decide whether these occurrences must be transformed—that is, identify whether a NULL is valid or invalid (missing data). You may choose to: Ignore the missing data. If the volume of records is relatively small, it may have little impact overall. Wait to extract the data until you are sure that missing values are entered from the operational system Mark rows when extracted so that, on the next extract, you can select only those rows not previously extracted. It does involve the overhead of SELECT and UPDATE, and if the extracted data forms the basis of a summary table, these need re-creating. Extract data only when it is timestamped Oracle 10g: Data Warehousing Fundamentals
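During the transform, the rule chosen for each column can be made explicit in SQL. The column names and substitution values below are illustrative only.

-- Apply the agreed rules for missing values while moving data out of staging
SELECT order_id,
       NVL(discount_amt, 0)              AS discount_amt,      -- missing discount means no discount
       NVL(customer_segment, 'UNKNOWN')  AS customer_segment,  -- flag unknowns explicitly
       CASE WHEN ship_date IS NULL THEN 'PENDING' ELSE 'SHIPPED' END AS ship_status
FROM   stg_orders_raw;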

182 Duplicate Values Problem
Solution: SQL self-join techniques RDBMS constraints SQL> SELECT ... 2 FROM table_a, table_b 3 WHERE table_a.key (+)= table_b.key 4 UNION 5 SELECT ... 6 FROM table_a, table_b 7 WHERE table_a.key = table_b.key (+); Duplicate Values Problem You must eliminate duplicate values, which invariably exist. This can be time consuming, although it is a simple task to perform. Solution You can use standard SQL self-join techniques or RDBMS constraints to eliminate duplicates. Oracle database offers different types of integrity constraints: primary key, foreign key, unique, not null, and check constraints. For example, you can enforce unique constraint on the columns that should not be taking duplicate values. Oracle 10g: Data Warehousing Fundamentals
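Two common Oracle idioms for this, shown against a hypothetical staging table keyed on CUSTOMER_ID.

-- Keep one row per business key and delete the rest (classic ROWID technique)
DELETE FROM stg_customers s
WHERE  s.rowid NOT IN (SELECT MIN(c.rowid)
                       FROM   stg_customers c
                       GROUP  BY c.customer_id);

-- Or let the database reject duplicates as they arrive
ALTER TABLE stg_customers ADD CONSTRAINT stg_customers_uk UNIQUE (customer_id);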

183 Oracle 10g: Data Warehousing Fundamentals 1 - 241
Element Names Problem Solution: Common naming conventions Customer Client Customer Contact Element Names Problem Individual attributes, columns, or fields may vary in their naming conventions from one source to another. These need to be eliminated to ensure that one naming convention is applied to the value in the warehouse. If you are employing independent data marts, then you should ensure that the ETL solution is mirrored; should you plan to employ the data marts dependently in the future, they will all refer to the same object. Solution You must obtain agreement from all relevant user groups on renaming conventions, and rename the elements accordingly. Document the changes in metadata. The programs you use determine the solution. For example, if you are using the SQL CREATE TABLE AS SELECT (CTAS) statement, the new column name is used in that statement. If you use SQL*Loader as an intermediary mechanism before load, you create your destination object with the agreed naming convention applied. Agreement on the name change and the meaning of the data can become a political issue between groups and departments in the organization. Name Oracle 10g: Data Warehousing Fundamentals

184 Element Meaning Problem
Avoid misinterpretation Complex solution Document meaning in metadata Customer’s name All customer details All details except name Element Meaning Problem Like the name of an element, the meaning is often interpreted differently by different user groups. The variations in naming conventions typically drive this misinterpretation. You need to keep your model independent of naming conventions that may be popular today, but subject to change. Solution It is a difficult problem, often political, but you must ensure that the meaning is clear. By documenting the meaning in metadata you can solve this problem, especially if the meaning is composed of several elements and algorithms have been used. In order to take information from the operational system into the warehouse, you must know the meaning of the data. This may involve rebuilding the transaction from its component parts (which are likely in a normalized state). You must know the: Business rules Processes executed for a type of transaction, such as the tables that are updated This is a complex task, which may involve merging or separating data components, extracting values from multipart keys, and much more. Customer_detail Oracle 10g: Data Warehousing Fundamentals

185 Oracle 10g: Data Warehousing Fundamentals 1 - 243
Input Format Problem EBCDIC ASCII “123-73” 12373 ACME Co. ברוכים הבאים Beer (Pack of 8) Input Format Problem Input formats vary considerably. For example, one entry may accept alphanumeric data, so the format may be “123-73”. Another entry may accept numeric data only, so the format may be “12373”. You may also need to convert from ASCII to EBCDIC, or even convert complex character sets such as Hebrew, Arabic, or Japanese. Solution First, ensure that you document the original and the resulting formats. Your program (or tool) must then convert those data types either dynamically or through a series of transforms into one acceptable format. You can use Oracle SQL*Loader to perform certain transformations (such as EBCDIC to ASCII conversions) and assign default values where fields are missing or NULL. Oracle 10g: Data Warehousing Fundamentals
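Simple format conversions of this kind can often be expressed directly in SQL; the column names and character set names below are illustrative only.

-- Strip punctuation and convert character data to the agreed numeric format
SELECT TO_NUMBER(REPLACE(part_code_txt, '-', '')) AS part_code   -- "123-73" becomes 12373
FROM   stg_parts_raw;

-- Character set conversion can also be performed in the database
SELECT CONVERT(description, 'US7ASCII', 'WE8EBCDIC500') AS description_ascii
FROM   stg_parts_raw;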

186 Referential Integrity Problem
Solution: SQL antijoin Server constraints Dedicated tools Department 10 20 30 40 Emp Name Department 1099 Smith 10 1289 Jones 20 1234 Doe 50 6786 Harris 60 Referential Integrity Problem If the constraints at the application or database level have in the past been less than accurate, child and parent record relationships can suffer; orphaned records can exist. You must understand data relationships built into legacy systems. The biggest problem encountered here is that they are often undocumented. You must gain the support of users and technicians to help you with analysis and documentation of the source data. Solution This cleaning task is time consuming and requires business experience to resolve the inconsistencies. You can use SQL antijoin query techniques, server constraint utilities, or dedicated tools to eliminate these inconsistencies. Oracle 10g: Data Warehousing Fundamentals
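For the employee and department example on the slide, an antijoin that isolates the orphaned rows might look like this; the staging table names are hypothetical.

-- Find employee rows whose department has no matching parent row
SELECT e.emp_no, e.emp_name, e.department_id
FROM   stg_emp e
WHERE  NOT EXISTS (SELECT 1
                   FROM   stg_department d
                   WHERE  d.department_id = e.department_id);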

187 Name and Address Problem
Single-field format Multiple-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565 Database 1 NAME LOCATION DIANNE ZIEFELD N100 HARRY H. ENFIELD M300 Name Mr. J. Smith Street 100 Main St. Town Bigtown Country County Luth Code 23565 Database 2 NAME LOCATION ZIEFELD, DIANNE 100 ENFIELD, HARRY H 300 Name and Address Problem One of the largest areas of concern, with regard to data quality, is how name and address information is held, and how to transform it. Name and address information has historically suffered from a lack of legacy standards. This information has been stored in many different formats, sometimes dependent upon the software or even the data processing center used. Some of the following data inconsistencies may appear: No unique key Missing data values (NULLs) Personal and commercial names mixed Different addresses for same member Different names and spelling for same member Many names on one line One name on two lines The data may be in a single field of no fixed format (example shown in the slide) Each component of an address may be in a specific field (example shown in the slide) Oracle 10g: Data Warehousing Fundamentals

188 Name and Address Processing in Oracle Warehouse Builder
Name and address mapping operator supports: Parsing Standardization Postal matching and geocoding Name and Address Processing in Oracle Warehouse Builder Oracle Warehouse Builder features a name and address mapping operator to support parsing, standardization, postal matching, and geocoding of name and address data. Name and address parsing is the breakdown of nondiscrete input into discrete name or address components. For example, the following input address would be parsed into an abbreviated list of address components: Input address: Mr. Joe A. Smith Sr. 8500 Normandale Lake Blvd Suite 710 Bloomington MN 55438 Parsed address: Pre name: MR First name: JOE First name standardized: JOSEPH Post name: SR Street name: NORMANDALE LAKE Oracle 10g: Data Warehousing Fundamentals

189 Quality Data: Importance and Benefits
Key to a successful warehouse implementation Quality data helps you in: Targeting right customers Determining buying patterns Identifying householders: private and commercial Matching customers Identifying historical data Data Quality: Importance and Benefits Importance of Quality Data Quality data is the key to a successful warehouse implementation. Data anomalies are bound to exist in source systems. However, if they are allowed to get into the data warehouse, this leads to inaccurate information, which further leads to inaccurate reports and bad business decisions. The overall result is a lack of confidence in the system to deliver the solution and a data warehouse that either is not used or requires substantial improvement and management buy-in. Benefits of Quality Data Quality data (after the dirty data is eliminated from the staging area) helps you query the warehouse to: Target the right audience for marketing communication Determine that a particular customer buys related products Determine that a group of people form a family, each of whom is a potential customer (house holding) Identify that an organization is part of a larger enterprise (commercial house holding) Oracle 10g: Data Warehousing Fundamentals

190 Quality: Standards and Improvements
Setting standards: Define a quality strategy. Decide on optimal data-quality level. Improving operational data quality: Consider modifying rules for operational data. Document the sources. Create a data stewardship program. Design the cleanup process carefully. Initial cleanup and refresh routines may differ. Quality: Standards and Improvements Setting Standards A data-quality strategy must be defined in the early stages of the development cycle. The strategy should define the optimal level of data quality that provides the value required for the business. For example, there is little point in seeking a low data inconsistency rate at great expense if the benefit to the business is not tangible. Improving Operational Data Quality You may need to consider making changes over time to the operational system in order to improve the quality of data for the warehouse: Some of the validation and integrity rules that are applied to current operational data may need to be modified or enhanced. You may need to document previously undocumented sources, enlist the help of users who know the business data, and consider creating a “data stewardship” program. You should carefully examine the cleanup processes that you employ in transforming the extracted data. Oracle 10g: Data Warehousing Fundamentals

191 Data Quality Guidelines
Operational data: Should not be used directly in the warehouse Must be cleaned for each increment Is not fixed by modifying applications Data Quality Guidelines Do not assume that because the data in the operational system suits you at the operational level, it is going to be appropriate, suitable, and of a sufficiently high quality for the data warehouse. Operational data should never be used directly in the warehouse because: The operational system contains no aging information There are many examples of disparity in the data There are many different meanings applied to data Good operational data when merged may become poor data warehouse data Do not assume that it is acceptable to clean up data after the pilot run of the first increment or implementation; operational data must be cleaned for each increment, failing which will lead to the following problems: The credibility of the data warehouse or data mart will suffer. Post-implementation cleanups are more costly and the risk is higher than that during the pilot run. The programs needed to handle the multitude of problems are very complex and would need to be rewritten after cleanup. Oracle 10g: Data Warehousing Fundamentals

192 Data Quality: Solutions and Management
COBOL, Java, 4GL Specialized tools Customized data conversion process: Investigation Conditioning and standardization Integration Management: Take responsibility. Resolve problems. Appoint a data quality manager. Data Quality: Solutions and Management Solutions Use COBOL, Java, or 4GL programs or purchase specialized tools to capture and eradicate anomalies before data load. It is often very difficult to predict all possible variants. You may consider designing a process in-house to assure the quality of the data entering the data warehouse. The process must involve: Data investigation: Parsing, lexical analysis, and pattern investigation Data conditioning and standardization: Moving the data into fixed fields, standardizing names and addresses Data integration: Building unique keys and integrating the data You should involve the business experts in the entire warehouse ETL process. Data Quality Management You must manage the quality of the data, processes, and rules, and put people in place to manage them. Someone must own, be directly responsible for, and resolve the issue of poor data quality. This person is often known as the data quality manager. Oracle 10g: Data Warehousing Fundamentals

193 Transformation Techniques
Merging data Adding a date stamp Adding keys to data Transformation Techniques Each of the transformation techniques listed in the slide is discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

194 Oracle 10g: Data Warehousing Fundamentals 1 - 257
Merging Data Operational transactions do not usually map one-to-one with warehouse data. Data for the warehouse is merged to provide information for analysis. Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:04 Sausage Pizza $11.00 Return 1/2/02 12:00:03 Anchovy Pizza – $12.00 Sale 1/2/02 12:00:02 Anchovy Pizza $12.00 Sale 1/2/02 12:00:01 Ham Pizza $10.00 Pizza sales/returns by day, hour, seconds Merging Data An operational transaction does not usually have a one-to-one mapping with data in the warehouse, even if the data in the warehouse is maintained at the transaction level. For example, consider a sales transaction in a store. The logical transaction comprises a number of components such as date of sale, charge amount, number of items, discount amount, and payment method. The transaction may even be a return. A customer purchase and a customer return are very different types of sales transactions, and different business rules must apply. For each different transaction, a different process occurs. A purchase depletes inventory and a return adds stock back into inventory. The result is, for the warehouse, that the data you are keeping is held for purely reporting purposes and these transactions become merged into data that is useful for that purpose. The data will not, in the end, map strictly to sales or returns. An example of merging data is shown in the following slide. Oracle 10g: Data Warehousing Fundamentals

195 Oracle 10g: Data Warehousing Fundamentals 1 - 258
Merging Data Pizza sales/returns by day, hour, seconds Sale 1/2/02 12:00:01 Ham Pizza $10.00 Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:02 Anchovy Pizza $12.00 Return 1/2/02 12:00:03 Anchovy Pizza – $12.00 Sale 1/2/02 12:00:04 Sausage Pizza $11.00 Pizza sales Sale 1/2/02 12:00:01 Ham Pizza $10.00 Sale 1/2/02 12:00:02 Cheese Pizza $15.00 Sale 1/2/02 12:00:04 Sausage Pizza $11.00 Oracle 10g: Data Warehousing Fundamentals
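One simplistic way to produce this kind of merged result is to aggregate to the product-and-day grain during the transform, assuming a hypothetical staging table in which returns carry negative amounts.

-- Net sales against returns for each pizza by day; offsetting rows cancel out
SELECT TRUNC(txn_time)  AS sale_date,
       product_name,
       SUM(amount)      AS net_sales_dollars,
       SUM(CASE WHEN txn_type = 'SALE' THEN 1 ELSE -1 END) AS net_units
FROM   stg_pizza_txns
GROUP  BY TRUNC(txn_time), product_name;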

196 Oracle 10g: Data Warehousing Fundamentals 1 - 259
Adding a Date Stamp Time element can be represented as a: Single point in time Time span Add time element to: Fact tables Dimension data Adding a Date Stamp Time is important within the data warehouse. You have already looked at the time dimension, which is always created in the warehouse in order to provide reporting by time periods. Extracted source data often does not contain explicit time information, because operational systems do not typically maintain timestamps (unless of course they too are maintaining history, or time is a critical component). It is more likely that the record in the operational system has a value associated with it, such as Order_date, Ship_date, or Call_date. Therefore it is important to consider how you are going to add a time element to your warehouse data. The time element can be represented as: A single point-in-time date A date range (start and end date) Oracle 10g: Data Warehousing Fundamentals

197 Adding a Date Stamp: Fact Tables and Dimensions
Channels Table Channel_id Channel_name Time_key Customers Table Cust_id Cust_first_name Sales Item_id Store_id Sales_dollars Sales_units Times Table Week_id Period_id Year_id Products Table Product_id Product_desc Adding a Date Stamp: Fact Tables and Dimensions Fact Table Data Assume that you need to add the next set of records from the source systems to your fact table. You need to determine which records are to be moved into the fact table. You have added data for March 2002. Now you need to add data for April 2002. You need to find a mechanism to stamp records so that you pick up only April 2002 records for the next refresh. You might choose from a number of techniques to timestamp data: Code application or database triggers at the operational level, which can then be extracted using date selection criteria. Perform a comparison of tables, original and new, to identify differences. Maintain a table containing copies of changed records to be loaded. You must decide which techniques are best for you, according to your current system implementations. These are discussed in greater detail later in the course. Oracle 10g: Data Warehousing Fundamentals
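If the timestamping mechanism yields a usable date on the staged rows, the monthly pickup itself is simple. The staging table, the SALE_DATE column, and the TIME_KEY column on the Sales fact are assumptions made for illustration.

-- Pick up only April 2002 rows for this refresh of the Sales fact table
INSERT INTO sales (item_id, store_id, time_key, sales_dollars, sales_units)
SELECT s.item_id, s.store_id, s.time_key, s.sales_dollars, s.sales_units
FROM   stg_sales s
WHERE  s.sale_date >= TO_DATE('01-04-2002', 'DD-MM-YYYY')
AND    s.sale_date <  TO_DATE('01-05-2002', 'DD-MM-YYYY');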

198 Oracle 10g: Data Warehousing Fundamentals 1 - 263
Adding Keys to Data #1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza –$12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 Adding Keys to Data You are moving the data from one structure, with its keys defining relationships, into another structure, which may be totally different. You must ensure that this new structure also has keys, defining the same relationships. The transformation of this data includes adding keys (generalized or artificial) or creating keys from existing data values. Note: Creating keys is discussed in more detail later in the course. #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Oracle 10g: Data Warehousing Fundamentals
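A common way to generate artificial (surrogate) keys during the load is with an Oracle sequence; all object and column names below are illustrative.

-- Sequence that supplies warehouse surrogate keys
CREATE SEQUENCE sales_txn_seq START WITH 1 INCREMENT BY 1;

-- Assign a new surrogate key to each staged transaction as it is inserted
INSERT INTO sales_txn (txn_key, source_txn_id, txn_type, txn_time, product_name, amount)
SELECT sales_txn_seq.NEXTVAL, s.txn_id, s.txn_type, s.txn_time, s.product_name, s.amount
FROM   stg_pizza_txns s;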

199 Oracle 10g: Data Warehousing Fundamentals 1 - 264
Summarizing Data During extraction on staging area After loading to the warehouse server Summarizing Data Creating summary data is essential for a data warehouse to perform well. Here, summarization is classified under transformation because you are changing (transforming) the way the data exists in the source system to be used with the data warehouse. You can summarize the data: At the time of extraction in batch routines: This reduces the amount of work performed by the data warehouse server because all the effort is concentrated on the source systems. However, summarizing at this time increases the complexity and time taken to perform the extract, the number of files created, the number of load routines, and the complexity of the scheduling process. After the data is loaded into the warehouse server: In this process the fact data is queried, summarized, and placed into the requisite summary fact table. This method reduces the complexity and time taken for the extract tasks. However, it places all the CPU and I/O intensive work on the warehouse server, thus increasing the time that the warehouse is unavailable to the users. Operational databases Staging area Warehouse database Oracle 10g: Data Warehousing Fundamentals
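A sketch of the second approach, summarizing detail facts inside the warehouse server. The TIME_KEY join column and the summary table name are assumptions for illustration; in practice this is often done with materialized views, covered in the lesson on summary management.

-- Build a monthly summary fact from the detail Sales fact
CREATE TABLE sales_month_summary AS
SELECT t.period_id,
       s.item_id,
       SUM(s.sales_dollars) AS sales_dollars,
       SUM(s.sales_units)   AS sales_units
FROM   sales s, times t
WHERE  s.time_key = t.time_key
GROUP  BY t.period_id, s.item_id;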

200 Maintaining Transformation Metadata
Transformation metadata contains: Transformation rules Algorithms and routines Sources Stage Rules Publish Maintaining Transformation Metadata As with the extraction process, metadata must be maintained for the transformation process. Transformation metadata contains: Information about how to perform key restructuring Logic to eliminate different coding methods, data values, and parsing rules Logic to detect multiple source files Logic and exception rules to handle NULL, negative values, and default values and to eliminate and consolidate duplicate values Element renaming conventions Granularity conversions, input or language formats, conversion algorithms, and data standardization rules Referential integrity fixes Logic and program names used to create summary data Transformation frequency, program name, location, failure procedures, and validation Temporary extraction storage location, name, and source Extract Transform Load Query Oracle 10g: Data Warehousing Fundamentals

201 Maintaining Transformation Metadata
Restructure keys. Identify and resolve coding differences. Validate data from multiple sources. Handle exception rules. Identify and resolve format differences. Fix referential integrity inconsistencies. Identify summary data. Maintaining Transformation Metadata (continued) The metadata also contains information about the frequency of program execution. Data repair usually involves using simple algorithms or more complex artificial intelligence programs to correct data. When the warehouse is available, the users should know the exact business meaning of a field, what data it contains, and how that data is produced (derivation, calculation, or summary algorithms). Metadata maintains this information and presents the information to the user through the query tool. The slide lists a small set of tasks that a programmer must consider when creating a transformation program. You can see why this process is possibly the longest, most complex, and most time-consuming part of the data warehouse implementation. Oracle 10g: Data Warehousing Fundamentals

202 Data Ownership and Responsibilities
Data ownership and responsibilities should be shared by the: Operational team Data warehouse team Business benefit gained with the “work together” approach Data Ownership and Responsibilities The data extracted from the source systems is often under the control and ownership of application development teams who have been working with the operational data since its inception. The loading of the data into the warehouse is usually under the control of the data warehousing development team. This raises the question of who is responsible for the transformation of the data: the process between extracting the data and loading it into the warehouse. These two teams must work together—those responsible for operational data and those responsible for warehouse data. This brings all the required knowledge together and produces the best solution. Working together enhances understanding, knowledge, teamwork, and a leveling of roles within the groups. The operational team may be critical to ensuring the success of the data extraction and providing the data warehouse team with extract files in requisite formats (for example, C, COBOL, and PL/SQL). The data warehouse team can then take on the task of making sure the extracted data is accurate and of sufficiently high quality for the warehouse. Oracle 10g: Data Warehousing Fundamentals

203 Transformation Timing and Location
Transformation is performed: Before load In parallel Can be initiated at different points: On the operational platform In a separate staging area Transformation Timing and Location You need to consider carefully when and where you perform transformation. You must perform transformation before the data is loaded into the warehouse, and in parallel; on larger databases, there is not enough time to perform this process as a single threaded process. Consider the different places and points in time where transformation may take place. The following are different transformation points: Transformation Points On the operational platform: This approach transforms the data on the operational platform, where the source data resides. The negative impact of this approach is that the transformation operation conflicts with the day-to-day working of the operational system. If it is chosen, the process should be executed when the operational system is idle or less utilized. The impact of this approach is so great that it is very unlikely to be employed. In a separate staging area: This approach transforms data on a separate computing environment, the staging area, where summary data may also be created. This is a common approach because it does not affect either the operational or warehouse environment. Oracle 10g: Data Warehousing Fundamentals

204 Choosing a Transformation Point
Workload Impact on environment CPU usage Disk space Network bandwidth Parallel execution Load window time User information needs Choosing a Transformation Point The approach you choose depends upon operational requirements. You must balance many factors in order to determine the best solution. You must consider the following factors: The actual workload (time to complete) of the transformations needed to provide the data for the warehouse The physical impact on each of the environments you might choose (This is particularly relevant if you choose to use the operational platform.) The available disk space (for temporary and intermediate data and file store) and CPU usage in each environment The available network and bandwidth between environments, affecting transfer volumes Whether the environment is capable of working in a parallel manner The load window time constraints The information needs of the business user (When do they need this data? How often do refreshes occur?) Oracle 10g: Data Warehousing Fundamentals

205 Monitoring and Tracking
Transformations should: Be self-documenting Provide summary statistics Handle process exceptions Monitoring and Tracking The transformations should be self-documenting, should generate summary statistics, and should be able to process exceptions. Oracle 10g: Data Warehousing Fundamentals

206 Designing Transformation Processes
Analysis: Sources and target mappings, business rules Key users, metadata, grain Design options: Tools (OWB) Custom 3GL programs 4GLs such as SQL or PL/SQL Replication Design issues: Performance Size of the staging area Exception handling, integrity maintenance Designing Transformation Processes When designing your transformation processes, consider the analysis issues, the design options that are available to you, and the design issues. Analysis Source and target mappings Business rules Key users Metadata Granularity of the fact data and summaries Design Options Tools (such as OWB) Custom 3GL programs 4GLs such as SQL or PL/SQL Replication Design Issues Performance and throughput Sizing the staging areas to hold the data to be loaded into the warehouse Exception handling Integrity maintenance Oracle 10g: Data Warehousing Fundamentals

207 Oracle 10g: Data Warehousing Fundamentals 1 - 275
Transformation Tools SQL*Loader Oracle Warehouse Builder (OWB) supports Predefined transformations Custom transformations Transformation Tools Transformations can be performed by: SQL*Loader: This is an Oracle product that is commonly used to transport large volumes of data into the warehouse tables. It can also provide you with simple data transformations, such as multiple records becoming a single record, or conversely a single record at source becoming multiple records. Data type conversions and simple NULL handling can also be automatically resolved during the data load. Warehouse Builder: OWB enables you to perform common transformations quickly and easily by providing a set of predefined transformations. These predefined transformations are provided as a part of the Oracle Library in OWB, and consist of Oracle-supplied functions and procedures. You can directly use these predefined transformations to transform your data. In addition, Warehouse Builder enables you to define custom transformations. Custom transformations include procedures, functions, packages, and table functions. Warehouse Builder provides wizards to create each type of custom transformation. Custom transformations can belong to the Global Shared Library or to a particular project. Custom transformations in the Global Shared Library can be used across all projects of the repository in which they are defined. Oracle 10g: Data Warehousing Fundamentals

208 Oracle’s Enhanced Features for Transformation
Transformation methods: Load into staging tables. Staging table 1 Flat files Transform data. Validate data. Staging table 2 Merge into warehouse tables. Staging table 2 Oracle’s Enhanced Features for Transformation Transformation Methods Multistage transformation: The data transformation logic for most data warehouses consists of multiple steps. For example, in transforming new records to be inserted into a sales table, there may be separate logical transformation steps to validate each dimension key. Oracle supports this usual type of “multistage data transformation.” Multistage transformation Data warehouse Oracle 10g: Data Warehousing Fundamentals

209 Oracle’s Enhanced Features for Transformation
Transformation methods: External tables Table functions External table Transform data. Validate data. Flat files Warehouse tables Merge into warehouse tables. Pipelined transformation Oracle’s Enhanced Features for Transformation (continued) Transformation Methods (continued) Pipelined data transformation: In addition to the multistage transformation, Oracle Database 10g has enhanced features to support “pipelined data transformation.” Pipelined data transformation can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various transformation steps. This technique aids fast and efficient transformations. Data can be extracted from flat files using external tables. The extracted data can be transformed while loading it into the data warehouse. (The transformations can be done using table functions, which are discussed in the following pages.) The advantage of pipelined data transformations is that you can achieve better performance by combining many simple logical transformations into a single SQL statement or a single PL/SQL procedure, rather than performing each step independently. The illustration in the slide shows the steps in a pipelined data transformation. Oracle 10g: Data Warehousing Fundamentals

210 Oracle’s Enhanced Features for Transformation
Transformation mechanisms using SQL: CREATE TABLE AS SELECT (CTAS) UPDATE MERGE Multitable INSERT Cust Customer 50 130 50 60 80 130 Existing row updated MERGE Oracle’s Enhanced Features for Transformation (continued) Transformation Mechanisms Transformation using SQL: Oracle supports data transformations using CTAS, UPDATE, MERGE, and multitable INSERT. The CREATE TABLE AS SELECT statement is a powerful tool for manipulating large sets of data. Another technique for implementing a data substitution is to use an UPDATE statement. However, if the data substitution transformations require that a very large percentage of the rows (or all the rows) be modified, then it may be more efficient to use a CTAS statement than an UPDATE. Oracle Database 10g has a merge functionality that extends SQL by introducing the SQL keyword MERGE in order to provide the ability to update or insert a row conditionally into a table. Using the MERGE statement, you can select rows from one table and either update or insert these rows into another table. It is very common in data warehouse environments to fan out the same source data into several target objects. Multitable inserts provide a new SQL statement for these kinds of transformations, where data can end up in several targets or in exactly one target, depending on the business transformation rules. This insertion can be done conditionally based on business rules or unconditionally. That is, multitable INSERT allows you to insert rows into multiple tables as part of a single DML statement. Multitable INSERT is depicted in the next slide. New row inserted Oracle 10g: Data Warehousing Fundamentals
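For illustration only, a CTAS-based data substitution might look like the following sketch; the staging and target table names, and the default channel code, are assumptions rather than part of the course schema:
-- Rewrite every row while substituting a default value for missing codes;
-- when a large percentage of rows change, this can be cheaper than an UPDATE
CREATE TABLE sales_clean
PARALLEL NOLOGGING
AS
SELECT prod_id,
       cust_id,
       time_id,
       NVL(channel_id, 9) AS channel_id,
       amount_sold
FROM   sales_staging;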

211 Application of the MERGE Statement in Data Warehousing
An example:
MERGE INTO customers C
USING cust_src S
ON (c.cust_id = s.src_cust_id)
WHEN MATCHED THEN
  UPDATE SET c.cust_address = s.cust_address
WHEN NOT MATCHED THEN
  INSERT (cust_id, cust_first_name,…)
  VALUES (src_cust_id, src_first_name,…);
Application of the MERGE Statement in Data Warehousing The slide shows an example of using the MERGE statement in data warehousing. Customers (C) is a large table and cust_src is a smaller “delta” table whose rows need to be inserted into customers conditionally. This MERGE statement indicates that the customers table has to be merged with the rows returned from the evaluation of the ON clause of the MERGE. The USING clause in this case is the table cust_src (S), but it can be an arbitrary query. Each row from S is checked for a match to any row in C by satisfying the join condition specified by the ON clause. If a match is found, the row in C is updated using the UPDATE SET clause of the MERGE statement. If no such row exists in C, the row is inserted into C by the WHEN NOT MATCHED THEN INSERT clause. Note: You can use the APPEND hint with the MERGE statement. Applications of MERGE: A MERGE statement uses a single SQL statement to complete an UPDATE, an INSERT, or both, which makes it very useful for the ETL process. (An UPDATE is performed when the row exists; otherwise, the row is inserted.) The statement can be parallelized transparently. It is useful for bulk DML, as performance is improved because fewer statements require fewer scans of the source tables. Oracle 10g: Data Warehousing Fundamentals

212 Multitable INSERT Statements
Types: Unconditional INSERT Pivoting INSERT Conditional ALL INSERT Conditional FIRST INSERT Source table Condition Multitable INSERT Statements The INSERT … AS SELECT statement with the new syntax can be parallelized and used to insert rows into multiple tables as part of a single DML statement. The multitable INSERT statement inserts computed rows derived from the rows returned from the evaluation of a subquery, and can be used in data warehousing systems to transfer data from one or more operational sources to a set of target tables. Note: Multitable inserts can be used in direct loading. Loading is discussed in the lesson titled “The ETL Process: Loading Warehouse Data.” However, the multitable INSERT statement can be performed with or without direct load, and with or without parallelization for faster performance. The following are the types of the multitable inserts with examples: Unconditional INSERT (ALL into_clause) Specify ALL followed by multiple insert_into_clauses to perform an unconditional multitable INSERT. Oracle server executes each insert_into_clause once for each row returned by the subquery. Target table 1 Target table 2 Target table 3 Oracle 10g: Data Warehousing Fundamentals

213 Advantages of Multitable INSERTs
Eliminates the need for multiple INSERT…AS SELECT statements to populate multiple tables Eliminates the need for a procedure to perform multiple INSERTs using IF…THEN…ELSE syntax Significant performance improvement over the preceding two methods due to the elimination of the cost of repeated scans on the source data Oracle 10g: Data Warehousing Fundamentals
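A minimal sketch of a conditional multitable INSERT is shown below; the staging and target table names and the routing rule are hypothetical:
-- Route each source row to one or more targets in a single statement
INSERT ALL
  WHEN order_total < 1000  THEN
    INTO small_orders VALUES (order_id, cust_id, order_total)
  WHEN order_total >= 1000 THEN
    INTO large_orders VALUES (order_id, cust_id, order_total)
  WHEN cust_id IS NULL     THEN
    INTO orders_errors VALUES (order_id, cust_id, order_total)
SELECT order_id, cust_id, order_total
FROM   orders_staging;
Replacing ALL with FIRST would insert each row into only the first target whose condition evaluates to true.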

214 Oracle’s Enhanced Features for Transformation
Transformation mechanisms Using PL/SQL: Used for complex transformations Using table functions. Table functions can: Return multiple rows from a function Accept results of multiple row SQL subqueries as input Take cursors as input Be parallelized Support incremental pipelining Oracle’s Enhanced Features for Transformation Transformation Mechanisms Transformation using PL/SQL: In a data warehouse environment, you can use procedural languages such as PL/SQL to implement complex transformations in the Oracle database. That is, PL/SQL procedures can be used for complex transformations, which are difficult or impossible to express using a sequence of standard SQL statements. Transformation using table functions: A table function is defined as a function that can take a set of rows as input and produce a set of rows as output. Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java without requiring the use of intermediate staging tables, which aids fast and efficient data flow through the various transformation steps. This enables the ETL tasks to shift from a serial transform-then-load process (with most of the tasks done outside the database) or a load-then-transform process to an enhanced transform-while-loading process. Oracle 10g: Data Warehousing Fundamentals

215 Advantages of PL/SQL Table Functions
Table functions “pipeline” the results to the consuming process as soon as they are produced. Table functions can return multiple rows during each invocation (pipelining of data). Pipelining eliminates the need for buffering the produced rows. Oracle 10g: Data Warehousing Fundamentals
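The following is a minimal sketch of a pipelined table function; the object type, function, and table names are illustrative assumptions, not objects defined in the course environment:
-- Object and collection types describing the rows the function returns
CREATE TYPE sale_row_t AS OBJECT (prod_id NUMBER, amount NUMBER);
/
CREATE TYPE sale_tab_t AS TABLE OF sale_row_t;
/
-- Pipelined function: rows are piped to the consumer as they are produced
CREATE OR REPLACE FUNCTION clean_sales (src SYS_REFCURSOR)
  RETURN sale_tab_t PIPELINED
IS
  v_prod   NUMBER;
  v_amount NUMBER;
BEGIN
  LOOP
    FETCH src INTO v_prod, v_amount;
    EXIT WHEN src%NOTFOUND;
    IF v_amount IS NOT NULL THEN          -- a simple cleansing rule
      PIPE ROW (sale_row_t(v_prod, v_amount));
    END IF;
  END LOOP;
  CLOSE src;
  RETURN;
END;
/
-- The function can feed a load directly, with no intermediate staging table
INSERT INTO sales_clean
SELECT *
FROM   TABLE(clean_sales(CURSOR(SELECT prod_id, amount_sold
                                FROM sales_staging)));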

216 Oracle 10g: Data Warehousing Fundamentals 1 - 287
Summary In this lesson, you should have learned how to: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data Describe techniques for transforming data Design transformation processes Describe Oracle’s enhanced features and tools that can be used to transform data Oracle 10g: Data Warehousing Fundamentals

217 Oracle 10g: Data Warehousing Fundamentals 1 - 288
Practice 6-1: Overview This practice covers the following topics: Identifying the suitable staging model for the RISD data warehouse Identifying the problems and the best-suited transformation techniques for the RISD data based on the given scenario Exploring the viewlet-based demonstrations on the ETL features of Oracle Warehouse Builder Oracle 10g: Data Warehousing Fundamentals

218 The ETL Process: Loading Data
Schedule: Timing Topic 60 minutes Lecture 25 minutes Practice 85 minutes Total

219 Oracle 10g: Data Warehousing Fundamentals 1 - 292
Objectives After completing this lesson, you should be able to do the following: Explain key concepts in loading warehouse data Outline how to build the loading process for the initial load Identify loading techniques Describe the loading techniques provided by Oracle Identify the tasks that take place after data is loaded Explain the issues involved in designing the transportation, loading, and scheduling processes Lesson Aim In the last two lessons, you examined extraction and transformation issues. In this lesson, you examine how the extracted and transformed data is transported and loaded into the warehouse as the first-time loading of data. Instructor Note The focus of this lesson is on initial load of data into the warehouse. Data refresh is mentioned here but is covered in the lesson titled “Refreshing Warehouse Data.” Oracle 10g: Data Warehousing Fundamentals

220 Loading Data into the Warehouse
Loading moves the data into the warehouse. Loading can be time consuming: Consider the load window. Schedule and automate the loading. Initial load moves large volumes of data. Subsequent refresh moves smaller volumes of data. Transform Extract Transport, load Loading Data into the Warehouse The acronym ETL for “extraction, transformation, and loading” is perhaps too simplistic because it omits the transportation phase and implies that each of these processes is essentially distinct. This may not be true always; sometimes, the entire process, including data loading, is referred to as ETL. The transportation process involves moving the data from source data stores or an intermediate staging area to the data warehouse database. The loading process loads the data into the target warehouse database in the target system server. Transportation is often one of the simpler portions of the ETL process, and is often integrated with the loading process. These processes comprise a series of actions, such as moving the data and loading the data into tables. There may also be some processing of objects after the load, often referred to as postload processing. Moving and loading of the data can be a time-consuming task, depending upon the volumes of data, the hardware, the connectivity setup, and whether parallel operations are in place. The time period within which the warehouse system can perform the load is called the load window. Loading should be scheduled and prioritized. You should also ensure that the loading process is automated as much as possible. Operational databases Staging area Warehouse database Oracle 10g: Data Warehousing Fundamentals

221 Transportation in a Data Warehouse
Three basic choices in transportation: Transportation using flat files Transportation through distributed operations Transportation using transportable tablespaces Transportation in a Data Warehouse Flat Files The most common method for transporting data is by the transfer of flat files, using mechanisms such as FTP or other remote file system access protocols. Data is unloaded or exported and then transported to the target platform using FTP or similar mechanisms. Because source systems and data warehouses often use different operating systems and database systems, using flat files is often the simplest way to exchange data between heterogeneous systems with minimal transformations. However, even when transporting data between homogeneous systems, flat files are often the most efficient and most easy-to-manage mechanism for data transfer. Distributed Operations Distributed queries, either with or without gateways, can be an effective mechanism for extracting data. These mechanisms also transport the data directly to the target systems, thus providing both extraction and transformation in a single step. As opposed to flat file transportation, the success or failure of the transportation is recognized immediately with the result of the distributed query or transaction. Oracle 10g: Data Warehousing Fundamentals

222 Transportable Tablespaces
Are the fastest way for moving data between two Oracle databases Bypass the unload and reload steps Provide a mechanism for transporting data along with metadata Are useful for moving data from: Staging database to a data warehouse Data warehouse to data marts Transportable Tablespaces Oracle databases (Oracle8i and above) have an important mechanism for transporting data: transportable tablespaces. This feature is the fastest way for moving large volumes of data between two Oracle databases. Before Oracle8i, the most scalable data transportation mechanisms required that data be unloaded or exported into files from the source database, and then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass these unload and reload steps. Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported and loaded from one Oracle database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data. Transportable tablespaces have some limitations. The source and target systems must be running Oracle8i (or higher), must be running the same operating system (prior to Oracle10g), must use the same character set, and must use the same data block size (prior to Oracle9i). Despite these limitations, transportable tablespaces can be an invaluable data transportation technique in Oracle warehouse environments. The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart. Oracle 10g: Data Warehousing Fundamentals

223 Example: Transportable Tablespace
Identify the tablespace whose data is to be transported (for example, TEMP_SALES), and make it READ ONLY.
SQL> CONNECT / AS SYSDBA
SQL> ALTER TABLESPACE TEMP_SALES READ ONLY;
Export the metadata of the tablespace.
exp USERID=\'/ AS SYSDBA\' TABLESPACES=TEMP_SALES TRANSPORT_TABLESPACE=Y
    FILE=temp_sales.dmp LOG=temp_sales.log
Copy the data files and the export file to the target. Import the metadata on the target database.
imp USERID=\'/ AS SYSDBA\' TABLESPACES=TEMP_SALES TRANSPORT_TABLESPACE=Y
    DATAFILES='/u02/DATA/temp_sales01.dbf' TTS_OWNERS=SH FILE=temp_sales.dmp
Example: Transportable Tablespace The steps for moving the data using transportable tablespaces are shown in the slide. Transportable tablespaces require the use of EXPORT and IMPORT for the data movement. On the source database, the tablespace being transported must be made READ ONLY before using export; this is because a tablespace cannot be transported unless there are no active transactions modifying the tablespace. When the tablespace is READ ONLY, the metadata for the tablespace can be exported. Before any changes are made to the data files (that is, before making the tablespace READ WRITE or performing a backup) belonging to the tablespace, copy the data files to the target database. After the data files have been moved, the metadata can be imported into the target database. Note: Objects can be moved from one user to another when using transportable tablespaces. You can also import objects only for a specific user. Oracle 10g: Data Warehousing Fundamentals

224 Initial Load and Refresh
Is a single event that populates the database with historical data Involves large volumes of data Employs distinct ETL tasks Involves large amounts of processing after load Refresh: Performed according to a business cycle Less data to load than first-time load Less complex ETL tasks Smaller amounts of postload processing Initial Load and Refresh Initial Load: The initial load (also called first-time load) is a single event that occurs before implementation. It populates the data warehouse database with as much data as needed or available. The first-time load moves data in the same way as the regular refresh. However, the complexity of the task is made greater due to: Data volumes that may be very large (Your company decides to load the last five years of data, which may comprise millions of rows. The time taken to load the data may be in days rather than hours.) Distinct extraction and transformation tasks that are applicable only to this old data The task of populating all fact tables, all dimension tables, and any other ancillary tables that you may have created (such as reference tables) Postprocessing of loaded data, with tasks that must work on the large data volumes, such as indexing and key generation Postload processing on large volumes of data, such as creating summary tables With all the issues surrounding the initial load, it is a task not to be considered lightly. You must plan, prepare, and have recovery capabilities built in to your processing routines to ensure success. Oracle 10g: Data Warehousing Fundamentals

225 Data Refresh Models: Extract Processing Environment
After each time interval, build a new snapshot of the database. Purge old snapshots. Operational databases Data Refresh Models: Extract Processing Environment First, to ensure that you understand how the warehouse data presentation differs from nonwarehouse data presentation, consider how up-to-date data is presented to users in two different decision-support environments: a simple extract processing environment and a data warehouse environment. Extract Processing Environment A snapshot of operational data is taken at regular time intervals: T1, T2, and T3. At each interval, a new snapshot of the database is created and presented to the user; the old snapshot is purged. T1 T2 T3 Oracle 10g: Data Warehousing Fundamentals

226 Data Refresh Models: Warehouse Processing Environment
Build a new database. After each time interval, add changes to the database. Archive or purge the oldest data. Operational databases Data Refresh Models: Warehouse Processing Environment Warehouse Environment An initial snapshot is taken and the database is loaded with data. At regular time intervals, T1, T2, and T3, a delta database or file is created and the warehouse is refreshed. A delta contains only the changes made to operational data that need to be reflected in the data warehouse. The warehouse fact data is refreshed according to the refresh cycle that is determined by user requirements analysis. The warehouse dimension data is updated to reflect the current state of the business, only when changes are detected in the source systems. The older snapshot of data is not removed, ensuring that the warehouse contains the historical data that is needed for analysis. The oldest snapshots are archived or purged only when the data is not required any longer. T1 T2 T3 Oracle 10g: Data Warehousing Fundamentals

227 Building the Loading Process
Techniques and tools File transfer methods The load window Time window for other tasks First-time and refresh volumes Frequency of the refresh cycle Connectivity bandwidth Building the Loading Process Specifying the Process You must identify early in the development process how you are going to move the data from the source systems and load it into the data warehouse. You must identify: The data movement techniques and tools available File transfer methods and transfer models available The time available to load the data into the warehouse—the load window Determine whether the time window is sufficient for other tasks such as backup, preventative maintenance, and recovery, given expected performance metrics The volumes of data involved in the initial load and subsequent refreshes The frequency of the refresh cycle for the data Connectivity bandwidth Note that transportation is inherent in the above process specification. Oracle 10g: Data Warehousing Fundamentals

228 Building the Loading Process
Test the proposed technique. Document the proposed load. Monitor, review, and revise. Building the Loading Process (continued) Testing the Process You should test the proposed technique to ensure that volumes can be physically moved within the load window constraints and network capabilities. Documenting the Process You must communicate and document the proposed load with the operations organization to ensure their agreement and commitment to this important process. Monitoring, Reviewing, and Revising the Process You should ensure that the load is constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision. Oracle 10g: Data Warehousing Fundamentals

229 Oracle 10g: Data Warehousing Fundamentals 1 - 304
Data Granularity Important design and operational issue Low-level grain: Expensive, high level of processing, more disk space, more details High-level grain: Cheaper, less processing, less disk space, little details Data Granularity Data granularity has been discussed in the context of modeling the warehouse. Data granularity also plays an important role in the loading of warehouse data. The higher the level of granularity (that is, the more summarized the data), the less data there is to load and the shorter the load takes. That is, data granularity affects the amount of time taken to load the data into the warehouse. Low-Level Grain: Low-level grain data can be expensive to build and maintain. It requires a large amount of processing power to process the details and provide answers to business queries. It takes up more disk space and could create response time problems. However, it provides the detailed information that is needed at a low level to give sophisticated business analysis. High-Level Grain: High-level grain data is easier to build and maintain than low-level grain data. It requires less processing power and disk space, allows a higher number of concurrent users to access data, and performs well. However, the lack of detail and drill-down capability hinders definitive answers to business questions. Note: The level of granularity affects not only the amount of direct access storage devices (DASD) required for warehouse data, but also the amount of space required for backup, recovery, and partitioning. Oracle 10g: Data Warehousing Fundamentals

230 Oracle 10g: Data Warehousing Fundamentals 1 - 305
Loading Techniques ETL tools Utilities such as SQL*Loader Gateways Customized copy programs Replication FTP Manual Loading Techniques Now that you have seen how to capture the data that is needed for the refresh, consider how to physically move and load the data to the warehouse server. The following are the common techniques used to load data into the warehouse: ETL tools (such as Oracle Warehouse Builder) Proprietary data movement utilities such as Oracle SQL*Loader Note: The fastest way to load large amounts of data into the warehouse is to use utilities such as SQL*Loader that can access the database directly, use networks efficiently, and run in parallel environments. Gateways, which may be vendor-specific or programmable, such as the Oracle Transparent Gateways Customized copy programs that may employ C, C++, PL/SQL, and FTP Oracle 10g: Data Warehousing Fundamentals

231 Loading Technique Considerations
Tools are comprehensive. Data-movement utilities are fast and powerful. Gateways are suitable for specific instances: Access other databases. Supply dependent data marts. Support a distributed environment. Provide real-time access if needed. Use customized programs as a last resort. Replication is limited by data-transfer rates. Loading Technique Considerations ETL Tools: ETL tools such as OWB offer support for all the phases of the ETL process, transporting your data as well as extracting and transforming it. Note: However, if you feel that ETL tools are expensive for early implementations, you can choose copy utilities as the logical alternative. Data-Movement Utilities: Oracle provides the SQL*Loader utility, which is capable of executing in parallel environments, running in a mode where server intervention is minimized, and performing limited transformations, such as merging rows and changing data types. SQL*Loader is capable of loading very large volumes of data in a relatively short time, and you can use it for first-time load and refreshes successfully. Gateways: A gateway is a middleware component that presents a unified view of data coming from different data sources. Oracle Transparent Gateways (or Procedural Gateways) and Open Database Connectivity (ODBC) tools present a uniform view of a non-Oracle database, or of a file on a specific file system. Oracle gateways are mostly read-only, whereas other gateways are read-write. Oracle 10g: Data Warehousing Fundamentals

232 Loading Techniques Provided by Oracle: SQL*Loader
Input files Control file Log files SQL*Loader Bad files Discard files Loading Techniques Provided by Oracle SQL*Loader: SQL*Loader is used to move data from flat files into an Oracle data warehouse. During this data load, SQL*Loader can also be used to implement basic data transformations. A typical SQL*Loader session takes as input a control file, which controls the behavior of SQL*Loader, and one or more data files. The output of SQL*Loader is the data that is loaded into an Oracle database. In addition, a bad file, which contains records that were rejected, and (optionally) a discard file may also get created. SQL*Loader provides the following methods to load data: Direct Path Load: Direct path load is optimized for maximum data loading capability. Instead of filling a bind array buffer and creating INSERT commands, direct path loads create data blocks in the Oracle database block format. The blocks are then written directly to the database. Calls are made to the Oracle database, but they are quick and handled at the start and end of the load process. When using direct path load, basic data manipulation, such as data type conversion and simple NULL handling, can be automatically resolved during the data load. Most data warehouses use direct path loading for performance reasons. Oracle 10g: Data Warehousing Fundamentals
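As a sketch only, a direct path load might be driven by a control file like the one below; the file, table, and column names are hypothetical:
-- sales.ctl: hypothetical SQL*Loader control file
LOAD DATA
INFILE 'sales_delta.dat'
BADFILE 'sales_delta.bad'
DISCARDFILE 'sales_delta.dsc'
APPEND
INTO TABLE sales
FIELDS TERMINATED BY '|'
(prod_id, cust_id, time_id DATE "YYYY-MM-DD", amount_sold)
Invoked, for example, as: sqlldr userid=sh/sh control=sales.ctl direct=true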

233 Loading Techniques Provided by Oracle
Oracle Call Interface (OCI) and direct path APIs Export/Import Data Pump Export/Import (Oracle Database 10g only) Load utility Loading Techniques Provided by Oracle (continued) Oracle Call Interface (OCI) and Direct Path API: Oracle provides application programming interface (API) to the direct path load mechanism in the Oracle server. This provides a way for independent software vendors and system management tool partners to create easy-to-use and high-performance customized data-loading tools. Access to all load functionality is available through the API. With the help of this API, performance of any third-party data-loading tool can therefore be comparable to SQL*Loader. OCI and direct path APIs are frequently used when the transformation and computation are done outside the database and there is no need for flat file staging. Instructor Note For more information about OCI and direct path APIs, refer to the Oracle Call Interface Programmer’s Guide, Release 2 (10.2). For more information about Export/Import utilities and external table loads, refer to Oracle Database Utilities 10g, Release 2 (10.2). Oracle 10g: Data Warehousing Fundamentals

234 More Loading Techniques: External Tables
External tables are read-only tables where the data is stored outside the database in flat files. The data can be queried like a virtual table, using any supported language such as SQL. No DML is allowed and no indexes can be created. The metadata for an external table is created using a CREATE TABLE…ORGANIZATION EXTERNAL… statement. Access rights are controlled via SELECT TABLE and READ/WRITE DIRECTORY privilege. External Tables External tables (introduced in Oracle 9i) are read-only tables whose data is stored in flat files outside the database. They provide a way to access data in external sources as if it were in a table in the database. That is, external table features enable you to use external data as a virtual table that can be queried, and joined directly in parallel using SQL without requiring the external data to be first loaded in the database. The main difference between external tables and regular tables is that externally organized tables are read-only tables. No DML operations (UPDATE, INSERT, or DELETE) are possible and no indexes can be created on them. The metadata for the external table is created using the CREATE TABLE … ORGANIZATION EXTERNAL statement. This statement involves the creation of only metadata in the Oracle Dictionary because the data already exists outside (external to) the database. Because the actual data is external to the database, the access rights to the external tables are controlled by using the SELECT privilege on the table, and READ/WRITE DIRECTORY privileges. Note: The DIRECTORY object in the Oracle database is created to map to a physical directory in the file system, where the external files reside. The DIRECTORY object was introduced in Oracle8i. Oracle 10g: Data Warehousing Fundamentals

235 Applications of External Tables
Complement the SQL*Loader functionality Are useful in environments where the complete external source has to be joined with existing database objects Enable the pipelining of the loading and transformation phases Are useful when the external data is large and not queried frequently Are used by the Oracle Data Pump Applications of External Tables External tables are very useful in data warehouse implementations: External tables are a complement to the existing SQL*Loader functionality. An external table load creates an external table (which acts as view) for data in a data file and executes INSERT statements to insert the data from the data file into the target table. External tables are especially useful in environments where the complete external source has to be joined with existing database objects, or where the external data volume is large and used only once. External tables also enable the pipelining of the loading phase with the transformation phase. It is not necessary to stage the data inside the database for further processing, such as comparison or transformation. On the other hand, SQL*Loader can be used for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing. The Data Pump Export/Import utilities also use external tables. For example, when you unload the data using Data Pump, the server-side infrastructure will use external tables as the unload mechanism. Oracle 10g: Data Warehousing Fundamentals

236 Example of Defining External Tables
CREATE TABLE sales_delta_xt
( prod_id  NUMBER(6),
  cust_id  NUMBER,
  time_id  DATE,
  ...)
ORGANIZATION external                 -- External table
( TYPE oracle_loader                  -- Access driver
  DEFAULT DIRECTORY data_dir          -- Files directory
  ACCESS PARAMETERS                   -- Similar to SQL*Loader
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    BADFILE log_dir:'sh_sales_%p.bad'
    LOGFILE log_dir:'sh_sales_%p.log_xt'
    FIELDS TERMINATED BY "|" LDRTRIM )
  location ( 'sales_delta.dat', data_dir2:'sales_delta2.dat' ))
PARALLEL 5                            -- Independent of the number of files
REJECT LIMIT UNLIMITED;
Example of Defining External Tables An external table can be created with a single CREATE TABLE command. This creates the metadata that is necessary to access the external data seamlessly from inside the database. The following information must be provided: Columns and data types for access in the database Where to find the external data Access driver. It is the responsibility of the access driver and the external table layer to do the necessary transformations required on the data in the data file so that it matches the external table definition. There is one access driver for every implementation of an external table type. Right now, only one access driver is provided. It is called oracle_loader. Format of the external data (similar to SQL*Loader) Degree of parallelism. Note that the degree is not dependent on the number of external data files. The example depicts the syntax of creating an external table named sales_delta_xt. data_dir is the default directory where the external flat file sales_delta.dat resides. Another flat file called sales_delta2.dat is used, and resides in the data_dir2 directory. The access parameters control the extraction of data from the flat files using record and file formatting information. Oracle 10g: Data Warehousing Fundamentals
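Once defined, the external table can be queried and joined like any other table; for example, a pipelined load-and-transform into a hypothetical target table might look like this:
-- No intermediate staging table is needed; validation happens during the load
INSERT /*+ APPEND */ INTO sales (prod_id, cust_id, time_id)
SELECT prod_id, cust_id, time_id
FROM   sales_delta_xt
WHERE  cust_id IS NOT NULL;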

237 Defining External Tables Using SQL*Loader
After creating a control file, SQL*Loader can generate a log file with SQL commands to: Create the metadata for the external table Insert the data into the target table Drop the metadata for the external table sqlldr sh/sh control=sales_dec00.ctl EXTERNAL_TABLE=GENERATE_ONLY LOG=sales_dec00.sql Defining External Tables Using SQL*Loader To save time in creating the CREATE TABLE statement for the external table, SQL*Loader can generate the statement for you. Using the EXTERNAL_TABLE parameter and setting its value to GENERATE_ONLY, SQL*Loader places the SQL statements in the log file for creating the external table definition within the database, inserting the data from the external table to the target table, and for deleting the external table definition. Note: Not all SQL*Loader functionality is currently supported with external tables. Oracle 10g: Data Warehousing Fundamentals

238 Postprocessing of Loaded Data
Transform Extract Load Staging area Warehouse Create indexes Generate keys Postprocessing of loaded data Summarize Filter Postprocessing of Loaded Data You have seen how to extract data to an intermediate file store or staging area, where it is: Transformed into acceptable warehouse data Loaded to the warehouse server You have also seen how the ETL process is slightly different for: First-time load, which requires all data to be loaded once Refreshing, which requires only changed data to be loaded You now need to consider the different tasks that might take place after the data is loaded. There are various terms used for these tasks. In this course, the choice of terms is postprocessing. The postprocessing tasks are not definitive; you may or may not have to perform them, depending on the volumes of data moved, the complexity of transformations, and the loading mechanism. For example, it is possible to load data by using SQL*Loader in a manner that excludes database trigger processing. However, at the warehouse server, you want to ensure the triggers are executed so that the integrity and validity of data are retained. This is referred to as postprocessing. Four categories of postprocessing tasks are explored in the following pages: Creating indexes Creating keys Creating summary tables Filtering Oracle 10g: Data Warehousing Fundamentals

239 Oracle 10g: Data Warehousing Fundamentals 1 - 319
Indexing Data Before load: Enables indexes at server During load: Adds time to load window, row-by-row approach After load: Adds time to load window, but faster than row-by-row approach Index Indexing Data Before Load Indexing of data may occur before load. You can index the data values for the warehouse after data cleansing and before transportation and load. You can retrieve the data from a presorted list of values much more rapidly by reading the index, rather than performing a full-table scan. This makes it easier to enable indexes at the server level. However, this is not done very often. During Load It is possible to create the indexes at the same time as loading the data, using the usual techniques employed by the server. However, this action is a row-by-row approach to index creation, which lengthens the time to load data. In most cases, the time taken is too long, and for this reason, the next option is preferable. After Load It is common to index after the data has been loaded into the warehouse. This adds time to the load window, but it is much faster than row-by-row processing, and you can speed up the index creation process by indexing in parallel in a parallel environment. Operational databases Staging area Warehouse database Oracle 10g: Data Warehousing Fundamentals

240 Oracle 10g: Data Warehousing Fundamentals 1 - 320
Unique Indexes Disable constraints before load. Enable constraints after load. Re-create index if necessary. Disable constraints. Enable constraints. Catch errors. Unique Indexes If the index you are creating is a unique index (that forces unique values in key columns without any duplicates), then it is usual to load the data with the database constraints disabled, and then enable the constraints after the load process. Then, you build the index, which may find duplicate values and fail. Ensure that the action catches the errors so that you can correct and reindex. Note: Oracle offers many types of indexes, which are useful in the data warehousing environment: bitmap indexes, bitmap join indexes, and B-tree indexes. Oracle also supports bitmap indexing on partitioned tables. These topics were already discussed in the lesson titled “Physical Modeling: Sizing, Storage, Performance, and Security Considerations.” Load data. Create index. Reprocess. Oracle 10g: Data Warehousing Fundamentals
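A minimal sketch of this sequence is shown below; the table, constraint, and exceptions table names are hypothetical (the EXCEPTIONS table is typically created with the utlexcpt.sql script):
-- Disable the unique or primary key constraint before the load
ALTER TABLE sales DISABLE CONSTRAINT sales_pk;

-- ... load the data (SQL*Loader, external tables, and so on) ...

-- Re-enable the constraint; duplicate rows are written to the exceptions
-- table so that they can be corrected and the index rebuilt
ALTER TABLE sales ENABLE CONSTRAINT sales_pk
  EXCEPTIONS INTO exceptions;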

241 Oracle 10g: Data Warehousing Fundamentals 1 - 321
Creating Derived Keys The use of derived or generalized keys is recommended to maintain the uniqueness of a row. Methods: Concatenate operational key with a number. Assign a number sequentially from a list. 109908 100 Creating Derived Keys A derived key (sometimes referred to as a generalized or artificial key) may be used to guarantee that every row in the table is unique. The warehouse data may likely be a combination of many transformed records, for which there are no natural data keys to use as unique identifiers. Concatenating Operational Key with a Number Your postprocessing program executes the create index commands and allocates the key values, which may be a concatenation of the primary key and version digit or characters. For example, if a customer record key value contains six digits, such as 109908, the derived key may be 10990801, where the last two digits are the sequential number generated automatically. The advantage of this method is that it is relatively easy to maintain and set up the necessary programs to manage number allocation. Oracle 10g: Data Warehousing Fundamentals
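In an Oracle warehouse, the second method (assigning a number sequentially from a list) is commonly implemented with a sequence; the sequence, dimension, and staging table names below are illustrative assumptions:
-- Generate surrogate keys during postprocessing
CREATE SEQUENCE customer_key_seq START WITH 1 CACHE 1000;

INSERT INTO customer_dim (customer_key, source_cust_id, cust_name)
SELECT customer_key_seq.NEXTVAL, src_cust_id, src_name
FROM   customer_staging;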

242 Oracle 10g: Data Warehousing Fundamentals 1 - 323
Summary Management Summary tables Materialized views Summary data Summary Management Summarization is the process of consolidating data values into a single value. For example, sales data could be collected on a daily basis and then be aggregated to the week level, the week data could be aggregated to the month level, and so on. The data can then be referred to as aggregate data. Aggregation is synonymous with summarization, and aggregate data is synonymous with summary data. A summary table or materialized view is a precomputed table comprising aggregated or joined data from fact and (possibly) dimension tables. Summary management includes defining, analyzing, and managing summaries. After you perform initial user requirements analysis, you determine the summaries needed by the user. However, you must constantly monitor summaries to determine which new summaries should be created and which summaries are no longer needed. Summaries play an important role in improving data warehouse performance because: They improve query performance. They eliminate the overhead associated with expensive joins and aggregations. Note: The term materialized view is more specific to Oracle databases. Materialized views were introduced in Oracle8i. Summary management is discussed in greater detail in the lesson titled “Summary Management.” Oracle 10g: Data Warehousing Fundamentals
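A minimal sketch of a materialized view that precomputes a common aggregation follows; the fact table and column names are assumptions used only for illustration:
-- Monthly sales by product, refreshed on demand after each load and
-- usable transparently through query rewrite
CREATE MATERIALIZED VIEW sales_prod_month_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT prod_id,
       TRUNC(time_id, 'MM') AS month_id,
       SUM(amount_sold)     AS total_amount
FROM   sales
GROUP BY prod_id, TRUNC(time_id, 'MM');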

243 Oracle 10g: Data Warehousing Fundamentals 1 - 324
Filtering Data From warehouse to data marts CTAS pCTAS Summary data Warehouse Filtering Data You may filter out specific information to supply subject-specific data for dependent data marts. The filtering uses simple SQL to create new objects by using existing objects. The new objects are then moved into the data mart, similar to the way data is moved into the warehouse. You can perform this filtering task by using either of the following: CREATE TABLE AS SELECT (CTAS) CREATE TABLE AS SELECT... PARALLEL (pCTAS) Data marts Oracle 10g: Data Warehousing Fundamentals
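A sketch of the pCTAS approach is shown below; the warehouse fact table, the region column, and the data mart table name are hypothetical:
-- Filter one subject area out of the warehouse for a dependent data mart,
-- creating the table in parallel and without redo logging
CREATE TABLE sales_west_mart
PARALLEL 4 NOLOGGING
AS
SELECT *
FROM   sales
WHERE  region_id = 'WEST';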

244 Verifying Data Integrity
Load data into intermediate file. Compare target flash totals with totals before load. Counts & amounts = Flash totals Load Intermediate file Target Preserve, inspect, fix, and then load Verifying Data Integrity It is important at all stages of ETL that errors be detected, flagged, and resolved. How you verify data integrity depends upon whether you have a customized approach to ETL or whether you employ an ETL tool, which will probably deal with these issues automatically, and allow you to visibly access the data only when available in the warehouse. It is important to ensure that each load, whether first time or a refresh, executes successfully. You need to create jobs that track: The status of the warehouse load, whether it has started, is in progress, or complete The time of process completion Statistics to show load start and complete time, and records processed in order to monitor and ensure continuing efficiency Comparison of load control counts and amounts: You must be aware of the amounts of data that are to be loaded so that you can perform an accurate validation of completeness. You can load the detail and summary records into intermediate files to compare counts and amounts created before loading with counts and amounts (flash totals) derived on the target data warehouse. Data reconciliation issues Referential integrity violations Any failures that require reprocessing Counts & amounts = Flash totals Oracle 10g: Data Warehousing Fundamentals

245 Steps for Verifying Data Integrity
(Diagram: steps 1–7 showing data extracted from source files and loaded with SQL*Loader under a control process into the target, with .log and .bad files produced along the way.) Steps for Verifying Data Integrity You may find it useful to load the detail and summary records into intermediate files so that you can compare record counts and sample totals before loading on the target data warehouse. If the counts and totals do not match, you must preserve and inspect the intermediate files without loading and without compromising on data integrity. The steps depicted in the slide are discussed on the following page. Oracle 10g: Data Warehousing Fundamentals

246 Standard Quality Assurance Checks
Load status Completion of the process Completeness of the data Data reconciliation Referential integrity violations and reprocessing Comparison of counts and amounts 1 + 1 = 3 Standard Quality Assurance Checks The following are standard quality assurance checks for the data loaded into the warehouse: Status of the warehouse load Completion of the load process Completeness of the data Data reconciliation Referential integrity violations and reprocessing Comparison of load control counts and amounts Oracle 10g: Data Warehousing Fundamentals

247 Oracle 10g: Data Warehousing Fundamentals 1 - 329
Summary In this lesson, you should have learned how to: Explain key concepts in loading data into the warehouse Outline how to build the loading process for the initial load Identify loading techniques Describe the loading techniques provided by Oracle Identify the tasks that take place after data is loaded Explain the issues involved in designing the transportation, loading, and scheduling processes Oracle 10g: Data Warehousing Fundamentals

248 Oracle 10g: Data Warehousing Fundamentals 1 - 330
Practice 7-1: Overview This practice covers the following topics: Identifying the fastest way to move the metadata between the staging area and the warehouse for the RISD DW Identifying the issues and suitable loading techniques based on the RISD scenario Exploring the viewlet-based demonstrations on the ETL features of OWB Oracle 10g: Data Warehousing Fundamentals

249 Refreshing Warehouse Data
Schedule: Timing Topic 60 minutes Lecture 15 minutes Practice 75 minutes Total

250 Oracle 10g: Data Warehousing Fundamentals 1 - 333
Objectives After completing this lesson, you should be able to do the following: Describe methods for capturing changed data Explain techniques for applying the changes Describe refresh mechanisms supported in Oracle Database 10g Describe techniques for purging and archiving data and outline the techniques supported by Oracle Outline final tasks, such as publishing the data and automating processes Lesson Aim In the last lesson, you examined the first-time load of the warehouse. In this lesson, you examine methods for updating the warehouse with changed data, after the first-time load. Oracle 10g: Data Warehousing Fundamentals

251 Developing a Refresh Strategy for Capturing Changed Data
Consider load window. Identify data volumes. Identify cycle. Know the technical infrastructure. Plan a staging area. Determine how to detect changes. Developing a Refresh Strategy for Capturing Changed Data You must have a strategy for maintaining changes to the data warehouse, including changes to facts, dimension data, and summary data. There are no concrete rules about when the data warehouse should be refreshed, but there are several factors to consider: What is the total load window available? What is the volume of data to be transferred? How often does the warehouse data need to be updated? When are you going to move the data? Will you refresh monthly, weekly, or at another time interval? Will you use continuous refresh for nearly real-time data? The connectivity gear available for moving the data into the data warehouse. How are you going to move the data? Will you move data in batch mode, which is feasible for less time-critical applications? Are you going to move data from operational systems to an intermediate area? Is this area an operational data store? Is it a flat file? Is it an Oracle database? Or is it something completely unique to your implementation? How are changes in data to be detected? Are you going to push the changes through when detected? Are you going to capture the changes? Where are you going to store the changes? Could you use triggers to force changes into an alternative store? Operational databases T1 T2 T3 Oracle 10g: Data Warehousing Fundamentals

252 User Requirements and Assistance
Users define the refresh cycle. IT balances requirements against technical issues. Document all tasks and processes. Employ user skills. User Requirements and Assistance The strategy is primarily defined by user requirements, but they must be balanced against the available technology and windows for loads. All must be documented and understood by everyone involved in the project. The users can also provide expertise for load verification, validation, run-to-run, and load controls. Operational databases T1 T2 T3 Oracle 10g: Data Warehousing Fundamentals

253 Load Window Requirements
Time available for the entire ETL process Plan Test Prove Monitor a.m p.m User access period Load window Load Window Requirements The load window is the amount of time available to extract, transform, load, postload process data, and make the data warehouse available to the user. The load performs many sequential tasks that take time to execute. You must ensure that every event that occurs during the load window is planned, tested, proven, and constantly monitored. Poor planning will extend the load time and prevent users from accessing the data when it is needed. Careful planning, defining, testing, and scheduling are critical. Oracle 10g: Data Warehousing Fundamentals

254 Planning the Load Window
Plan and build processes according to a strategy. Consider volumes of data. Identify technical infrastructure. Ensure currency of data. Consider user access requirements first. High availability requirements may mean a small load window. a.m p.m User access period Planning the Load Window Load Window Strategy The load time is dependent upon a number of factors, such as data volumes, network capacity, and load utility capabilities. You must not forget that the aim is to ensure the currency of data for the users, who require access to the data for analysis. To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point. Determining the Load Window It is usual to define the user access requirements first and work the load schedule backward from that point. After the user access time is defined, you can establish the load cycles. Some of the processes overlap to enable all processes to run within the window. More realistically, an almost 24-hour access is required. This means the load window is significantly smaller than the example shown here. In that event, you need to consider how to process the refresh and keep users presented with current realistic data. This is where you can use partitioning strategies. Oracle 10g: Data Warehousing Fundamentals

255 Scheduling the Load Window
1 2 Requirements Load cycle Control File File names File types Number of files Number of loads First-time load or refresh Date of file Date range Records in file (counts) Totals (amounts) Receive data Open and read files to verify and analyze FTP 3 Scheduling the Load Window From the example, you can see that the transportation of data (that is, moving the data to the server and loading into the warehouse tables) is a complex task involving many steps. To work out an effective load window strategy, consider the user requirements first, and then work out the load schedule backward from that point. The steps involved in scheduling the load window, according to the user requirements, are given in the example on the next page. 4 Control process 3 a.m. Oracle 10g: Data Warehousing Fundamentals

256 Scheduling the Load Window
5 Load into warehouse 6 Verify, analyze, reapply 7 Index data 8 Create summaries 9 Update metadata Parallel load Scheduling the Load Window (continued) Example of Scheduling the Load Window (continued) 5. The data is then loaded into the warehouse. 6. Each load requires verification and analysis (and perhaps reanalysis, after any load exceptions are reapplied). You need to ensure that the data is successfully loaded by performing checks against the row counts and amounts available in the control files. Any loading errors yielding potentially bad data need to be reapplied. This adds time to the load, and contingency should be built into the cycle to cope with this. If you are using SQL*Loader to move the data, the bad data resides in a file called <filename>.bad. 7. Indexes are constructed. 8. Summarization takes place. 9. Metadata is updated to ensure it contains information about the current load. 3 a.m. 6 a.m. 9 a.m. Oracle 10g: Data Warehousing Fundamentals

257 Scheduling the Load Window
10 Backup warehouse 11 Create views for specialized tools 12 Users access summary data 13 Publish User access Scheduling the Load Window (continued) Example of Scheduling the Load Window (continued) 10. The warehouse is backed up. With many database servers today, there are typically two mechanisms for backup: hot, with users online, and cold, with users offline. You should consider cold backups before user access. The backup should include: All warehouse data Summary tables Database schema Metadata Note: If the information is supplied to the warehouse on tape, a full cold backup may not be necessary. The summaries created at the target server may be all that you need to back up. 11. Create the views, if required by specialized user tools. 12. Give users access to the summary data. 13. Publish information to the users, specifying the changes to the data warehouse and allowing them access. Note: These steps identify one solution and assume that summarization and indexing occur after load, and that the job is executed from a batch file. 6 a.m. 9 a.m. Oracle 10g: Data Warehousing Fundamentals

258 Capturing Changed Data for Refresh
Capture new fact data. Capture changed dimension data. Determine method of capture in each case. Methods: Wholesale data replacement Comparison of database instances Time-stamping Database triggers Database log Capturing Changed Data for Refresh There are two major categories of changed data: New fact data Changed dimension data For each, a different capture mechanism will be discussed. In addition, consider how you will process the load. The fact data might easily be loaded by adding another partition of data, a relatively straightforward process (for a database administrator). Changes to dimension data need more selective update. You need to evaluate whether the change is to replace or add to an existing record, or whether you want to maintain history (keeping old and new records). For example, the description of a product may change over its lifetime, even if its primary (and unique) part number remains the same. It is important to see that the change is reflected. Another common example is sales districts in a sales organization that reorganizes. Oracle 10g: Data Warehousing Fundamentals

259 Wholesale Data Replacement
It is expensive. It is useful for data marts with less data. Limited historical data analysis is possible. Time period often exceeds load window. Mirroring techniques can be used to provide access to the users. Wholesale Data Replacement This method refreshes the entire warehouse in every business cycle. This method is understandably very expensive. Every refresh needs to extract, transform, and load the entire warehouse. In fact, this method is similar to using a first-time load on a regular basis. Some data mart and online analytical processing server implementations use this method because they hold less data (a subset of the data warehouse); in that case, wholesale replacement is less complex than programming mirroring and update procedures. On the other hand, historical data analysis is limited because you are restricted by the sheer volume of data loaded each time. The time window required for wholesale replacement can often exceed the time that the data is contracted to be offline (and unavailable to the users). However, with mirroring strategies, users can be directed to an image copy of the data warehouse while maintenance is being performed. The changes that occur during the maintenance cycle must be applied to the current online image (production version). The production version should then be backed up or mirrored. Oracle 10g: Data Warehousing Fundamentals

260 Comparison of Database Instances
Delta file: Contains changes to operational data since the last refresh Is used to update the warehouse Simple to perform, but expensive in terms of time and processing Efficient for smaller volumes of data Yesterday’s operational database Database comparison Today’s operational database Comparison of Database Instances In this method, you capture the differences between two instances of the same database to find out the changes that have occurred since the last time the data warehouse was refreshed. The changes are held in an intermediate (or delta) file and are used to update the warehouse. Delta File or Database The delta database (or file) contains only the changes that have been made to the operational system since the last refresh. An operational application may need to be modified to create the delta file structure and contain the new logic that captures changes and adds the rows to the delta file. This method is a simple but expensive way to determine changes. It works more efficiently and effectively if the volumes of data are small, as with wholesale replacement. Delta file holds changed data. Oracle 10g: Data Warehousing Fundamentals
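One minimal sketch of such a comparison in SQL, assuming yesterday's copy of the operational CUSTOMERS table has been retained as a hypothetical CUSTOMERS_PREV table:

-- New or changed rows since the previous snapshot become the delta
CREATE TABLE customers_delta AS
SELECT * FROM customers
MINUS
SELECT * FROM customers_prev;

Reversing the MINUS (previous snapshot minus current data) would identify rows that were deleted from the operational system.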

261 Time- and Date-Stamping
Fast scanning for records changed since last refresh cycle Useful for data with updated date field No detection of deleted data Operational data Delta file holds changed data based on the time stamp. Time- and Date-Stamping A time and date stamp on changed data quickly shows you the data that has been changed since the last refresh cycle. The time and date stamp is normally made part of a key value, making it an efficient way to search and find the changed data. The advantage of this approach is that the process that creates the delta database needs to look only at the time key and identify the records with the required time and date stamp. Depending upon the frequency of refresh and the mechanism chosen for time- and date-stamping, the search for the time value may be a specific date (for example, all rows with Time_Key = '01-Jan-97'), a date range such as Time_Key BETWEEN '01-Jan-97' AND '07-Jan-97', or a pattern such as Time_Key LIKE '%Jan-97'. You can use this method only if the database contains an updated date field, which may not be the case in many operational systems. This is one issue that may be resolved by reengineering source system applications or database server code. (You might add database triggers to perform the required date field updates. Database triggers are discussed on the next page.) Note: Time and date stamps do not catch deleted data. Oracle 10g: Data Warehousing Fundamentals
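As a sketch, the delta extraction for one refresh cycle might be written as follows; the ORDERS table, its LAST_UPDATED column, and the ORDERS_DELTA staging table are hypothetical:

-- Capture rows stamped within the current refresh window
INSERT INTO orders_delta
SELECT *
FROM   orders
WHERE  last_updated BETWEEN TO_DATE('01-JAN-1997', 'DD-MON-YYYY')
                        AND TO_DATE('07-JAN-1997', 'DD-MON-YYYY');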

262 Database Triggers Changed data intersected at the server level
Extra I/O required Maintenance overhead Operational server (RDBMS) Operational data Delta file holds changed data. Triggers on server Database Triggers Procedural code in the form of database triggers captures and identifies changed data at the database level. Extra I/O is required while the system is online to track changes as they occur and maintain a delta file if needed. You must modify the database to add server (database level) triggers that capture before and after images of the records. The triggers and associated code (code in PL/SQL, if using Oracle) write the changes to a delta database or file. Of course, to use this method, the server must support database triggers. Oracle 10g: Data Warehousing Fundamentals
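A minimal PL/SQL sketch of this approach follows; the CUSTOMERS source table and the CUSTOMERS_DELTA target are hypothetical, and a production trigger would usually capture full before and after images rather than just the key:

CREATE OR REPLACE TRIGGER customers_capture
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW
BEGIN
  -- Record which row changed, how, and when
  INSERT INTO customers_delta (cust_id, change_type, change_date)
  VALUES (NVL(:NEW.cust_id, :OLD.cust_id),
          CASE WHEN INSERTING THEN 'I'
               WHEN UPDATING  THEN 'U'
               ELSE 'D' END,
          SYSDATE);
END;
/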

263 Oracle 10g: Data Warehousing Fundamentals 1 - 348
Using a Database Log Contains before and after images Requires system checkpoint Is a common technique Operational server (DBMS) Operational data Log Log analysis and data extraction Using a Database Log The database log file contains information from which you can extract changed data; it logs "before" and "after" images of the data. You may analyze the log file in batch mode to identify the differences that become the delta file. The following are some of the issues to be kept in mind: The format of the log file may be difficult to interpret and use. The log tape is not really intended for use by the warehouse, and often contains a lot of data not required by the warehouse. The system must wait for a checkpoint in order to get a stable log. This is a process that many ETL tools use, but this method can be used only on databases that provide a log, such as Oracle and DB2. Note: Oracle snapshot and replication facilities log changes into another table. Delta file holds changed data. Oracle 10g: Data Warehousing Fundamentals
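On Oracle, one way to perform this kind of log analysis is the LogMiner interface (the DBMS_LOGMNR package). The sketch below is illustrative only: the archived log file name is hypothetical, and real use requires appropriate privileges and supplemental logging on the source database.

BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(
    logfilename => '/u01/arch/arch_0001.arc',   -- hypothetical archived log
    options     => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(
    options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/
-- Changed rows can then be examined through V$LOGMNR_CONTENTS
SELECT operation, seg_name, sql_redo
FROM   v$logmnr_contents
WHERE  seg_name = 'SALES';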

264 Choosing a Method for Change Data Capture
Consider each method on merit. Consider a hybrid approach if one approach is not suitable. Consider current technical, operational, and application issues. Choosing a Method for Change Data Capture Each of the methods discussed has its own advantages and disadvantages. In reality, your data warehousing environment might actually use a combination of these mechanisms. For example, you might time-stamp changed dimension data, and simply extract data that exists within a database partition for the new fact data, but use wholesale replacement to supply your dependent data marts with updated data. The choice you make is based on the many factors identified earlier in this lesson. Note: Oracle Database 10g offers both synchronous and asynchronous Change Data Capture. Oracle CDC was already discussed in the lesson titled “The ETL Process: Extracting Data.” Oracle 10g: Data Warehousing Fundamentals

265 Refresh Mechanisms in Oracle Database 10g
Refresh modes for materialized views: ON COMMIT ON DEMAND: DBMS_MVIEW.REFRESH DBMS_MVIEW.REFRESH_ALL_MVIEWS DBMS_MVIEW.REFRESH_DEPENDENT Refresh options for materialized views: COMPLETE FAST FORCE Use partitioning for refresh. Optimize DML during refresh. Refresh Mechanisms in Oracle Database 10g Refresh modes for materialized views: When creating a materialized view, you have the option of specifying whether the refresh occurs ON DEMAND or ON COMMIT. In the case of ON COMMIT, the materialized view is refreshed every time a transaction that changes data used by the materialized view commits, thus ensuring that the materialized view always contains the latest data. Alternatively, you can control the time when refresh of the materialized views occurs by specifying ON DEMAND. In this case, the materialized view can be refreshed only by calling one of the procedures in the DBMS_MVIEW package. DBMS_MVIEW.REFRESH: Refresh one or more materialized views. DBMS_MVIEW.REFRESH_ALL_MVIEWS: Refresh all materialized views. DBMS_MVIEW.REFRESH_DEPENDENT: Refresh all table-based materialized views that depend on a specified detail table or list of detail tables. Refresh options for materialized views: You can also specify the options for materialized views, based on how you want the materialized views to be refreshed from the detail tables, by selecting one of these options: COMPLETE, FAST, and FORCE. Note: Materialized views are discussed in more detail in the lesson titled “Summary Management.” Oracle 10g: Data Warehousing Fundamentals
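As a sketch of ON DEMAND refresh, the DBMS_MVIEW procedures listed above can be called as shown below; the materialized view and detail table names are hypothetical:

-- Fast-refresh a single materialized view ('F' = FAST, 'C' = COMPLETE, '?' = FORCE)
EXECUTE DBMS_MVIEW.REFRESH('cust_sales_mv', method => 'F');

-- Refresh every materialized view that depends on the SALES detail table
DECLARE
  failures BINARY_INTEGER;
BEGIN
  DBMS_MVIEW.REFRESH_DEPENDENT(failures, 'SALES');
END;
/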

266 Applying the Changes to Data
You have a choice of techniques: Overwrite a record. Add a record. Add a field. Maintain history. Add version numbers. Applying the Changes to Data There are a number of methods for applying changes to existing data in dimension tables: Overwrite a record. Add a new record. Add a current field. Maintain history records. Add version numbers. These are discussed in detail in the following pages. Oracle 10g: Data Warehousing Fundamentals

267 Oracle 10g: Data Warehousing Fundamentals 1 - 353
Overwriting a Record Easy to implement Loses all history Not recommended 42135 John Doe Married 42135 John Doe Single Overwriting a Record This method is easy to implement, but it is useful only if you are not interested in maintaining the history of data. If the data you are changing is critical to the context of information and analysis of the business, then overwriting a record is to be avoided at all costs. For example, by overwriting dimension data, you lose all track of history—you can never see that John Doe was single if the value “Single” is overwritten with the value “Married” from the operational system. The Customer_Id for John Doe remains constant throughout the life of the warehouse because only one record for John Doe is stored. Oracle 10g: Data Warehousing Fundamentals

268 Oracle 10g: Data Warehousing Fundamentals 1 - 354
Adding a New Record History is preserved; dimensions grow. Time constraints are not required. Generalized key is created. Metadata tracks usage of keys. 42135 John Doe Single 42135_01 John Doe Married Adding a New Record Using this method, you add another dimension record for John Doe. One record shows that he was "single" until December 31, 2000, and another that he has been "married" from January 1, 2001. Using this method, history is accurately preserved, but the dimension tables get bigger. A generalized (or derived) key is created for the second John Doe record. The generalized key is a derived value that ensures that a record remains unique. However, you now have more than one key to manage. You also need to ensure that the record keeps track of when the change has occurred. The Customer_Id for John Doe does not remain constant throughout the life of the warehouse because each record added for John Doe contains a unique key. The key value is usually a combination of the operational system identifier with characters or digits appended to it. Consider using real data keys. The example here shows a method that is commonly identified in warehouse reference material. Oracle 10g: Data Warehousing Fundamentals

269 Oracle 10g: Data Warehousing Fundamentals 1 - 355
Adding a Current Field Maintains some history Loses intermediate values Is enhanced by adding an Effective Date field 42135 John Doe Single 42135 John Doe Single Married 1-Jan-01 Adding a Current Field In this method, you add a new field to the dimension table to hold the current value of the attribute. Using this method, you can keep some track of history. You know that John Doe was “single” before he was “married.” Each time John’s marital status changes, the two status attributes are updated and a new Effective Date is entered. However, what you cannot see from this method is what changes have taken place between the two records you are storing for John Doe—intermediate values are lost. Consider using an Effective Date attribute to show when the status changed. Partitioning of data can then be performed by effective date. The method you choose is again determined by the business requirements. If you want to maintain history, this method is a logical choice that can be enhanced by using a generalized key. Oracle 10g: Data Warehousing Fundamentals

270 Limitations of Methods for Applying Changes
Difficult to maintain history Dimensions may grow large Maintenance overhead Limitations of Methods for Applying Changes Consider the following illustration to understand the limitations. Assume a customer record for a company whose current address is 1 Main Street, and suppose the company later moves. If you overwrite the record, history is lost, and there is no record of this company ever existing at 1 Main Street. Oracle 10g: Data Warehousing Fundamentals

271 Maintaining History: Techniques
History tables One-to-many relationships Versioning Preservation of complete history Maintaining History: Techniques As it has been discussed earlier, historical data is very important for the time-based data analysis of data warehouses. This slide lists the techniques, which are useful for maintaining history. These techniques are discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

272 Maintaining History: Techniques
History tables: Normalize dimensions Hold current and historical data One-to-many relationships: One current record and many history records Products Times Sales HIST_CUST Customers History Tables and One-to-Many Relationships History Tables One of the solutions is to use history tables, which involve normalizing the dimensions to hold current and historical data. This method is a comprehensive, effective, and easily managed solution. One-to-Many Relationships Using this method, you keep one current record of the customer and many history records in the customer history table (a one-to-many relationship between the tables), thus maintaining history in a more normalized data model. In the Customer table, the customer operational unique identifier is retained in the CUSTOMER.Id column. In the HIST_CUST table, the operational key is maintained in the HIST_CUST.Id column and the generalized key in the HIST_CUST.G_Id column. You can use this to keep all the keys needed and multiple records for the customer. The table on the next page shows you how the data might appear. Oracle 10g: Data Warehousing Fundamentals
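A minimal DDL sketch of this one-to-many design, with hypothetical column lists, might look like the following:

-- Current image of each customer (operational key retained)
CREATE TABLE customers (
  id             NUMBER PRIMARY KEY,
  name           VARCHAR2(50),
  marital_status VARCHAR2(10));

-- One row per historical version, keyed by a generalized key
CREATE TABLE hist_cust (
  g_id           VARCHAR2(12) PRIMARY KEY,   -- generalized key
  id             NUMBER REFERENCES customers(id),
  marital_status VARCHAR2(10),
  effective_date DATE);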

273 Oracle 10g: Data Warehousing Fundamentals 1 - 361
Versioning Avoid double counting. Facts hold version number. Times Customers Sales Customer.CustId Version Customer Name 1234 1 Comer 2 Sales.CustId Sales Facts $11,000 $12,000 Products Versioning You can also maintain a version number for the customer in the Customer dimension: You must ensure that the measures in the fact table, such as sales figures, also contain the customer version number to avoid any double counting: For Comer Version 1, the sales total is $36,000. For Comer Version 2, the sales total is $87,000. Oracle 10g: Data Warehousing Fundamentals

274 Preserving Complete History
Enables realistic historical analysis Retains context of data Model must be able to: Reflect business changes Maintain context between fact and dimension data Retain sufficient data to relate old to new Preserving Complete History This method completely preserves history and is therefore very effective for performing analysis over time where data has changed substantially. The context of information is still preserved. A good example of where this applies is in a sales organization. Assume that you have a model containing a sales fact and dimensions such as Customer, Sales Region, and Product. Your warehouse contains sales figures for sales region Europe for the years 1992 and 1993. In 1994, the European region reorganizes and splits into East Europe and West Europe. The warehouse is now maintaining data for each region from 1994 onward. In 1997, users are asked to put together some projections based on the last five years’ sales in Europe. The data held for 1992 and 1993 is not split into East and West Europe. That is not an issue because you still have the ability to roll up East and West regions into a total for Europe, and perform analysis over a five-year period. If you reverse the scenario, two regions become one and the solution is the same. The issue with retaining history and context is building a model that is able to: Reflect changes as the business changes Keep the context of information accurate between dimension and fact data Retain sufficient data to be able to relate old and new records where needed Oracle 10g: Data Warehousing Fundamentals

275 Purging and Archiving Data
As data ages, its value depreciates. Remove old data from the warehouse: Archive for later use (if needed) Purge without copy Purging and Archiving Data Data may reside in the warehouse for many more years than it would in an operational system; however, it does not remain forever. The value of data to the business diminishes over time. During analysis, the analysts determine the useful life span of the data. In addition, old data may simply be summarized; the details are not needed. What Is Purging? If there is no chance of ever needing the data again, even for summaries, then you can purge it. This removes the data entirely; no copy is retained. What Is Archiving? If you feel you may need the data in the future to build summaries (for example), then archive the data to low-cost storage devices that are not associated with the data warehouse. You need to ensure that you have the strategies in place that meet determined business requirements for purge and archive. Oracle 10g: Data Warehousing Fundamentals

276 Oracle-Supported Techniques for Purging Data
Using SQL: TRUNCATE TABLE: Retains no redo data DELETE: Retains redo data ALTER TABLE ... DROP PARTITION: Removes a partition Using PL/SQL: Database triggers Oracle-Supported Techniques for Purging Data Using SQL TRUNCATE TABLE Command: The TRUNCATE TABLE command is the fastest way to purge data from a table. It does not retain redo data and rollback is impossible. It is also useful for emptying a temporary table that is used repeatedly as part of a regular load or summary process. Indexes on the table are also truncated. DELETE Command: The DELETE command is used generally when the data has not been partitioned. DELETE retains redo information, so you need to size the rollback segments carefully. NOLOGGING does not apply to DELETE or UPDATE. DELETE works only in parallel on partitioned tables. When you delete rows from a table, the corresponding entries in every index on the table must also be deleted. This has a performance impact. Oracle 10g: Data Warehousing Fundamentals
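The commands named above might be used as in this sketch; the table, partition, and predicate are hypothetical:

-- Remove all rows and their index entries; no redo is kept, so no rollback
TRUNCATE TABLE stage_sales;

-- Selective purge; redo is generated, so size undo/rollback segments accordingly
DELETE FROM sales
WHERE  time_key < TO_DATE('01-JAN-1995', 'DD-MON-YYYY');

-- Drop an entire partition of aged data in a single operation
ALTER TABLE sales DROP PARTITION sales_1994;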

277 Oracle-Supported Techniques for Archiving Data
Export and import utilities ALTER TABLE ... EXCHANGE PARTITION EXP IMP Database .dmp Techniques for Archiving Data Import and Export Utilities You can use the export utility to move data from tables to a dump file (called filename.dmp). This dump file can be stored on a storage device. The import utility can then read that dump file and load data back into the same or another user. ALTER TABLE ... EXCHANGE PARTITION Using this SQL command, you can switch a partition of data with an empty table, drop the empty partition, and export the table. Archive the exported table when you have time. The methods you employ depend upon your individual business requirement, although the history model is a popular choice in the current warehousing environment. You must ensure that someone in the data warehouse administration is responsible for managing and tracking these changes. Oracle 10g: Data Warehousing Fundamentals
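A sketch of the partition-exchange-then-export approach follows; the table, partition, and file names are hypothetical, and the exact export syntax depends on the utility and release you use:

-- Create an empty table with the same structure as the partition to be archived
CREATE TABLE sales_1994_arch AS
SELECT * FROM sales WHERE 1 = 0;

-- Swap the aged partition's data into the standalone table
ALTER TABLE sales EXCHANGE PARTITION sales_1994 WITH TABLE sales_1994_arch;

-- Export the standalone table to a dump file (run from the operating system),
-- then drop the now-empty partition
--   exp userid=dwadmin/password tables=sales_1994_arch file=sales_1994.dmp
ALTER TABLE sales DROP PARTITION sales_1994;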

278 Oracle 10g: Data Warehousing Fundamentals 1 - 367
Final Tasks Update metadata. Publish data. Use database roles to control access to the warehouse. Final Tasks Update the Metadata After your data has been loaded successfully, ensure that the metadata is updated. You need to consider many aspects, including information about the processes themselves. The most important aspect at this time is to ensure that the metadata reflects the new information available. Users must be made aware of the changes—for example, of the validity of data, date of data, any new data available, revised summaries, removed summaries, new algorithms, and the new meaning of values. Publish Data Publish data, so that users are presented with a consistent view of the data. Ensure that user access is denied while the ETL processes are executing. You should allow access only when all tasks are complete, validation has occurred, and metadata is updated. Note: You may choose to do this on a subject area basis, user basis, or for the entire warehouse. Again, like many other tasks, this is dependent upon your individual data warehouse or data mart implementation. Sources Stage Rules Publish Extract Transform Load Query Oracle 10g: Data Warehousing Fundamentals

279 Oracle 10g: Data Warehousing Fundamentals 1 - 369
Publishing Data Control access using database roles. Compromise between load action and user access. Consider: Staggering updates Using temporary tables Using separate tables Publishing Data The term “publishing data” is used to describe when the data is loaded and made available to the users. As a rule, you prevent access to the data while the load process is active to ensure that the users are presented with an accurate view of data and summaries. If Service Level Agreements state that users require access virtually 24 hours a day, then revoking and granting access as discussed is not appropriate. You need to consider how you can perform the load action while still allowing access, and ensuring that the data is as consistent as possible. There are different techniques depending upon the availability needs of the users: Stagger the updates to the different subject areas. Update on different nights of the week (for example, Tuesday and Wednesday) even though the revised source data might be made available days earlier. Use temporary tables (that the users cannot access) for load, filtering, and summarizing. Make the database unavailable only for the short time it takes to instantiate these as permanent objects. Load the data into a separate table and perform all the processing required. These actions are invisible to the user. Then when all tasks are complete, swap the contents of the temporary table into a database partition. The same technique is employed for the indexes. Oracle 10g: Data Warehousing Fundamentals

280 Oracle 10g: Data Warehousing Fundamentals 1 - 370
Summary In this lesson, you should have learned how to: Describe methods for capturing changed data Explain techniques for applying the changes Describe refresh mechanisms supported in Oracle Database 10g Describe techniques for purging and archiving data and outline the techniques supported by Oracle Outline final tasks, such as publishing the data and automating processes Oracle 10g: Data Warehousing Fundamentals

281 Oracle 10g: Data Warehousing Fundamentals 1 - 371
Practice 8-1: Overview This practice covers the following topics: Identifying the possible refresh strategy for the RISD data warehouse Identifying the strategy for archiving and purging data for the RISD data warehouse Answering a series of questions based on the RISD scenario Oracle 10g: Data Warehousing Fundamentals

282 Summary Management ILT Schedule: Timing Topic 40 minutes Lecture
15 minutes Practice 55 minutes Total

283 Oracle 10g: Data Warehousing Fundamentals 1 - 375
Objectives After completing this lesson, you should be able to do the following: Discuss summary management and Oracle implementation of summaries Describe materialized views Identify the types, build modes, and refresh methods for materialized views Explain the query rewrite mechanism in Oracle Describe the significance of Oracle dimensions Lesson Aim This lesson introduces the concepts of aggregate data and the use of materialized views. Additionally, the cost-based optimizer and dimensions are explored. Oracle 10g: Data Warehousing Fundamentals

284 Summary Management Need
How can you improve query response time? Using indexes Partitioning your data What about precomputing query results? Create summaries: Use normal tables before Oracle8i: Need to rewrite applications Need to manually maintain data Use materialized views beginning with Oracle8i: Automatically rewrite SQL applications. Automatically refresh data. Summary Management Need There are many well-known techniques one can use to improve query performance. For example, you can create additional indexes, or you can partition your data. Many data warehouses also use a technique called summaries. The basic idea is to precompute the result of a long-running query and store the result in a database table called a summary table (something comparable to a CREATE TABLE AS SELECT statement). Thus, instead of recomputing the same query many times, one can directly access the summary table. Nevertheless, although this approach has the benefit of improving query response time, it also has many drawbacks. Indeed, the user needs to be aware of the summary table’s existence in order to rewrite queries to use that table instead. Also, the data contained in a summary table is frozen, and must be manually refreshed whenever modifications occur on the real tables. With the introduction of summary management in Oracle databases (beginning with Oracle8i), the end user no longer has to be aware of the summaries that have been defined. The DBA is responsible for creating materialized views, which are automatically used by the system to rewrite incoming SQL queries if possible. Using materialized views offers another big advantage over manually creating summary tables: the ability to refresh data automatically. Note: The name summary comes from the fact that most of the time, users in data warehouse environments are computing expensive joins with aggregations. Nevertheless, when creating summaries using materialized views in Oracle, you are not restricted to having joins or aggregations. Oracle 10g: Data Warehousing Fundamentals

285 Oracle 10g: Data Warehousing Fundamentals 1 - 377
Summary Management Summary management: Improves query response time Is the key to the performance of data warehouses A summary is a table that: Stores preaggregated and prejoined data Is based on user query requirements Summary Management Summary management is one of the most important considerations for data warehouse implementations. When used correctly, it can improve query-response time significantly, resulting in queries that take seconds rather than hours. Managing summaries is the key to good performance of the data warehouse implementations. A summary (or aggregate) is a table that stores preaggregated and prejoined data from the fact and dimension tables. The aggregate stores computed results such as sums, counts, and averages. Summary tables (an extension of the physical design) are important in the design of the data warehouse because they: Improve service to end users by providing better response time to analytical queries Provide improved and optimized use of resources, storage, and CPU Enhance the analysis process that allows you to drill down from higher levels of detail, and drill up from lower levels of detail Implementing Summaries The summary management process often begins with identifying the dimensions and hierarchies that describe the business relationships, and common query patterns in the database. The need to define summaries is often linked to the overriding need to contain costs and enhance performance. With costs increasing due to storage requirements for grain data (growing algebraically over a period of five to ten years), some form of capacity planning must be realized. Scalability as a result of aggregation becomes more feasible. Oracle 10g: Data Warehousing Fundamentals

286 Oracle 10g: Data Warehousing Fundamentals 1 - 378
Summary Navigation Effective use of summary tables requires summary table awareness. Methods for summary navigation: Warehouse database engine Proprietary summary-aware products Open summary-aware middleware 3GL and metadata solutions Which summaries? select total_sales... Summary Navigation Having developed summary tables, you are now challenged with using them appropriately. The tool or the query mechanism must be summary table–aware. That is, the existence of the summary tables must be known to the query. Summary (aggregate) navigators: Summary navigators are software components that intercept the end user’s SQL and transform it so that it uses the best available summary (usually the smallest available table that can answer the user’s request). The summary navigator maintains special metadata, which describes the current profile of summary tables stored in the data warehouse. It also should maintain statistics on queries, which shows which aggregates are being used and which should be built to help slow-running queries. Methods for summary navigation: Summary navigators can be located in: The warehouse database engine. This is the best-case scenario because this approach makes summary redirection (or query rewrite) accessible to all applications. For example, Oracle database contains its own built-in summary navigator. End-user query tools. In recent years, this approach has been the most common, primarily due to the fact that few database engines incorporated their own navigator. However, this approach requires that all query tools maintain their own summary navigation facility and metadata layer. For example, OracleBI Discoverer, an ad hoc query tool by Oracle, has this capability of creating and managing summaries. Middleware tools that facilitate the navigation to the summary tables (older systems) Custom-built summary navigation techniques using 3GL code and metadata (older systems) Oracle 10g: Data Warehousing Fundamentals

287 Managing Historical Summary Data in the Warehouse
Yearly summary data Quarterly summary data Monthly summary data Last 12 months daily detail Managing Historical Summary Data in the Warehouse Developing a strategy to manage summary tables in the warehouse is another major design consideration. Because use of the data is the most important factor in determining which summaries should be created, you may not be able to determine this strategy immediately. Summaries do not have to be consistently applied across the warehouse. For example, you may want to examine more recent data in greater detail than older data. Therefore, you might keep daily data for the last twelve months, along with summarized monthly data. Older data might be summarized to the month, quarter, or year. 1997/1998 1999 2000 2001 Oracle 10g: Data Warehousing Fundamentals

288 Summary Management in Oracle Database 10g
Materialized views: Store precomputed aggregates and joins. Results are stored in the database. Use query rewrite. Improve query performance. Summary Advisor: Is a collection of functions and procedures (the DBMS_OLAP package) Helps in defining and analyzing materialized views DBMS_OLAP Summary Management in Oracle Database 10g Materialized Views Oracle data warehouses use materialized views to store summarized data. (Materialized views were introduced in the Oracle8i database.) Queries to large databases in data warehouses often involve joins between tables or aggregations such as SUM, or both. Materialized views improve query performance by precomputing expensive join and aggregation operations on the database before execution of the query and storing the results in the database. The query optimizer automatically recognizes when an existing materialized view can and should be used to satisfy a request (query). It then transparently rewrites the request to use the materialized view. Materialized views can be created by using the CREATE MATERIALIZED VIEW SQL statement. Oracle 10g: Data Warehousing Fundamentals

289 Before Materialized Views
SELECT c.cust_id, SUM(amount_sold) FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id; CREATE TABLE cust_sales_sum AS SELECT c.cust_id, SUM(amount_sold) AS amount FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id; Before Materialized Views Before the introduction of materialized views, organizations using summaries spent a significant amount of time creating summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. For example, in order to enhance performance for the application's SQL query shown above, the DBA can create a summary table called cust_sales_sum, and inform the users of its existence. Users then query this summary table instead of executing the original query. Obviously, the time required to query the summary table is minimal compared to that of the original SQL query. On the other hand, users must be aware of summary tables and need to rewrite their applications to use them. Also, the DBA must manually refresh the summary tables in order to keep them up-to-date with the corresponding original tables. Such a system can quickly become difficult for the DBA to maintain. SELECT * FROM cust_sales_sum; Oracle 10g: Data Warehousing Fundamentals

290 After Materialized Views
CREATE MATERIALIZED VIEW cust_sales_mv ENABLE QUERY REWRITE AS SELECT c.cust_id, SUM(amount_sold) AS amount FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id; SELECT c.cust_id, SUM(amount_sold) FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id; After Materialized Views The introduction of summary management in Oracle databases eased the workload of the database administrator: the end user no longer needs to be aware of the summaries that have been defined. The database administrator creates one or more materialized views that are the equivalent of summary tables. The advantage offered by creating a materialized view instead of a CREATE TABLE AS SELECT (CTAS) is that a materialized view not only materializes the result of a query into a database table, but also generates metadata information used by the query rewrite engine to automatically rewrite the SQL query to use the summary tables. Materialized views within the data warehouse are transparent to the end user or to the database application. Also, a materialized view optionally offers another important possibility: refreshing data automatically. This slide shows that using materialized views is transparent to the user. If your application wants to execute the same SQL query as in the slide, all the DBA needs to do is to create the preceding materialized view called cust_sales_mv. Then, whenever the application executes the SQL query, Oracle silently rewrites it to use the materialized view instead. Compared to the CTAS approach, the query response time is the same, but the big difference is that the application need not be rewritten. The rewrite phase is automatically handled by the system. Also, the SQL statement that defines the materialized view does not have to match the SQL statement of the query itself. SELECT STATEMENT TABLE ACCESS (FULL) OF cust_sales_mv Oracle 10g: Data Warehousing Fundamentals

291 Types of Materialized Views
Materialized views with aggregates Materialized views containing only joins CREATE MATERIALIZED VIEW cust_sales_mv AS SELECT c.cust_id, s.channel_id, SUM(amount_sold) AS amount FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id, s.channel_id; CREATE MATERIALIZED VIEW sales_products_mv AS SELECT s.time_id, p.prod_name FROM sales s, products p WHERE s.prod_id = p.prod_id(+); Types of Materialized Views The SELECT clause in the materialized view creation statement defines the data that the materialized view is to contain. Only a few restrictions limit what can be specified. Any number of tables can be joined together. However, they cannot be remote tables if you want to take advantage of query rewrite. Besides tables, other elements such as views, inline views (subqueries in the FROM clause of a SELECT statement), subqueries in the WHERE clause, and materialized views can all be joined or referenced in the SELECT clause. In data warehouses, materialized views normally contain aggregates. This is the origin of the term “summaries” in data warehouses. Some materialized views contain only joins and no aggregates. The advantage of creating this type of materialized view is that expensive joins will be precalculated. As explained later in this course, the distinction made between the types of materialized views is useful for describing capabilities and restrictions of materialized views regarding refresh and query rewrite. Note: The CREATE statements in the slide should include the ENABLE QUERY REWRITE clause; otherwise, rewrite will not occur and these MVs will be treated as replication MVs. Oracle 10g: Data Warehousing Fundamentals

292 Nested Materialized Views
A materialized view whose definition is based on another materialized view Definition that can also reference normal tables Common data warehouse situation: sales_time_prod_mv sales_time_prod_mv sales_prod_time_mv sales products times sales_prod_time_mv sales_prod_time_join Nested Materialized Views A nested materialized view is a materialized view whose definition is based on another materialized view. A nested materialized view can reference other relations in the database in addition to referencing materialized views. In a data warehouse, you typically create many aggregate views on a single join. The left part of the slide illustrates this situation, where two different materialized views are created on the same join but with different grouping columns. Incrementally maintaining these distinct materialized aggregate views can take a long time because the same underlying join has to be performed many times. By using nested materialized views, as shown on the right, the join is performed just once. Incremental maintenance of single-table aggregate materialized views is very fast due to various refresh optimizations on this class of views. Note: In some cases, it might be advantageous to use materialized views containing aggregation groups for OLAP (see next lessons). sales sales products times products times Oracle 10g: Data Warehousing Fundamentals
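A sketch of the nested approach, using hypothetical object names, first materializes the shared join and then defines the aggregate view on top of it; fast refresh of such views additionally requires materialized view logs, which are omitted here:

-- Materialize the common join once
CREATE MATERIALIZED VIEW sales_prod_time_join AS
SELECT s.rowid srid, p.rowid prid, t.rowid trid,
       p.prod_name, t.calendar_year, s.amount_sold
FROM   sales s, products p, times t
WHERE  s.prod_id = p.prod_id AND s.time_id = t.time_id;

-- Aggregate materialized views are then built on the join materialized view
CREATE MATERIALIZED VIEW sales_prod_time_mv AS
SELECT prod_name, calendar_year, SUM(amount_sold) AS amount
FROM   sales_prod_time_join
GROUP BY prod_name, calendar_year;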

293 Materialized View: Example
CREATE MATERIALIZED VIEW cust_sales_mv PCTFREE 0 TABLESPACE summ STORAGE (initial 1M next 1M pctincrease 0) BUILD DEFERRED REFRESH COMPLETE ENABLE QUERY REWRITE USING NO INDEX AS SELECT c.cust_id, s.channel_id, SUM(amount_sold) AS amount FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id, s.channel_id ORDER BY c.cust_id, s.channel_id; Name Storage options When to build it How to refresh the data Use this for query rewrite Do not create an index Detail query Detail tables Materialized View: Example For a complete description of the CREATE MATERIALIZED VIEW statement, refer to the Oracle Database 10g SQL Reference Guide. Unless the materialized view is based on a user-defined prebuilt table, it requires and occupies storage space in the database. When you create a materialized view, Oracle may create at least one additional index if the materialized view is fast refreshable (for example). The USING INDEX clause can be specified to establish the value of the INITRANS, MAXTRANS, and STORAGE parameters for this index. It is also possible to use the USING NO INDEX clause in order to avoid index creation. You must specify the ENABLE QUERY REWRITE clause if the materialized view is to be considered available for rewriting queries. This can be altered later using the ALTER MATERIALIZED VIEW statement. Note that you can enable query rewrite only if all user-defined functions in the materialized view are DETERMINISTIC. Note: Since release 8.1.6, an ORDER BY clause is allowed in the CREATE MATERIALIZED VIEW statement. It is used only during the initial creation of the materialized view and is not considered part of the materialized view definition. Storing rows in a specified order may help query performance because it provides physical clustering of the data, which is very useful when using indexes. MV keys Oracle 10g: Data Warehousing Fundamentals

294 Materialized Views: Build Modes
BUILD DEFERRED: MV created but not populated BUILD IMMEDIATE: MV created and populated ON PREBUILT TABLE: Existing table is converted to MV: Same name and same schema. Table is retained after MV is dropped. Column aliases in detail query must correspond. Data types of columns in the detail tables and the prebuilt table must match exactly by default. WITH[OUT] REDUCED PRECISION clause It is possible to have unmanaged columns in the table. MV’s STALENESS is set to UNKNOWN. BUILD_MODE in DBA_MVIEWS Materialized Views: Build Modes BUILD DEFERRED: Create the materialized view definition but do not populate it with data. The materialized view is not eligible for query rewrite until it is populated during the next complete refresh. Until then, its status is UNUSABLE. BUILD IMMEDIATE: Create the materialized view and then populate it with data. ON PREBUILT TABLE: The goal of this functionality is to make an existing table a materialized view. Many existing database installations make extensive use of materialized views. In some cases, the existing materialized views are very large, making it desirable to register an existing materialized view table rather than force the user to regenerate the materialized view table from the beginning. After it is registered, the materialized view can be used for query rewrites and maintained by the refresh methods. When you drop the materialized view, the table retains its identity and is not dropped. Each column alias in the subquery must correspond to a column name in the prebuilt table, and corresponding columns in the detail tables must have exact matching data types unless the WITH REDUCED PRECISION clause is specified to allow the precision of columns in the detail tables to be compatible with that of the prebuilt table columns. The prebuilt table and the materialized view must have the same name, and be in the same schema. However, the prebuilt table can contain unmanaged columns that are not referenced in the detail query of the materialized view. During a refresh operation, each unmanaged column is set to its default value. Therefore, the unmanaged columns cannot have NOT NULL constraints unless they also have default values. Oracle 10g: Data Warehousing Fundamentals

295 Materialized Views: Refresh Methods
COMPLETE FAST FORCE NEVER Materialized Views: Refresh Methods COMPLETE: Refreshes by recalculating the detail query of the materialized view. This can be accomplished by deleting the rows of the materialized view or by truncating it. Complete refresh reexecutes the materialized view query, thereby completely recomputing the contents of the materialized view from the detail tables. FAST: Refreshes by incrementally adding the new data that has been inserted or updated in the tables FORCE: Applies fast refresh if possible; otherwise, applies COMPLETE refresh NEVER: Prevents the materialized view from being refreshed with any Oracle refresh mechanism or procedure Oracle 10g: Data Warehousing Fundamentals

296 Materialized Views: Refresh Modes
ON DEMAND: Manual ON COMMIT: Refresh is done at transaction commit. It is possible only for fast-refreshable materialized views. In case of failure, subsequent refreshes are manual. Schedule: At regular intervals Materialized Views: Refresh Modes ON COMMIT: Refresh occurs automatically when a transaction that modified one of the detail tables of the materialized view commits. This can be specified as long as the materialized view is fast refreshable. If a materialized view fails during refresh at commit time, the user must explicitly invoke the refresh procedure using the DBMS_MVIEW package after addressing the errors specified in the trace files. Until this is done, the view is no longer refreshed automatically at commit time. ON DEMAND (the default): Refresh occurs at user demand by using the DBMS_MVIEW package. This package provides a number of procedures and functions to manage materialized views, including the REFRESH, REFRESH_DEPENDENT, and REFRESH_ALL_MVIEWS procedures. At a specified time: Refresh of a materialized view can be scheduled to occur at a specified time. For example, it can be refreshed every Monday at 9:00 a.m. by using the START WITH and NEXT clauses. In order for such refreshes to occur, the instance must initiate job processes with the JOB_QUEUE_PROCESSES parameter. Note: If you specify ON COMMIT or ON DEMAND, you cannot also specify START WITH or NEXT. Oracle 10g: Data Warehousing Fundamentals
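A sketch of a scheduled refresh, with a hypothetical view definition, might use the START WITH and NEXT clauses as follows (job queue processes must be enabled for the refresh jobs to run):

CREATE MATERIALIZED VIEW weekly_sales_mv
REFRESH FORCE
START WITH SYSDATE
NEXT NEXT_DAY(TRUNC(SYSDATE), 'MONDAY') + 9/24   -- next Monday at 9:00 a.m.
AS SELECT t.calendar_year, SUM(s.amount_sold) AS amount
   FROM   sales s, times t
   WHERE  s.time_id = t.time_id
   GROUP BY t.calendar_year;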

297 Query Rewrite Mechanism in Oracle Database
Generate plan Generate plan Choose (based on cost) Execute Query Rewrite Mechanism in Oracle Database The slide illustrates how query rewrite works; it involves the following steps: 1. You enter a query. 2. Oracle generates the plan for your query using the base tables. 3. Oracle rewrites the statement to direct it against the corresponding materialized view, and generates an alternative plan. 4. Oracle compares the cost of the two execution plans. 5. Oracle chooses the best execution plan and uses it to execute the query. A query is rewritten only when certain conditions are met: Query rewrite must be enabled for the session. Either all or part of the results requested by the query must be obtainable from the precomputed result stored in the materialized view. The rewrite integrity level should allow the use of the materialized view. For example, if a materialized view is not fresh and query rewrite integrity is set to enforced, then the materialized view will not be used. To determine this, the optimizer may depend on some of the data relationships that are declared by the user using constraints and dimensions. Such data relationships include hierarchies, referential integrity, uniqueness of key data, and so on. Oracle 10g: Data Warehousing Fundamentals

298 Oracle 10g: Data Warehousing Fundamentals 1 - 391
Query Rewrite Execute query: SELECT c.cust_id, s.channel_id, SUM(amount_sold) AS amount FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_id, s.channel_id Observe the execution plan: OPERATION NAME SELECT STATEMENT TABLE ACCESS FULL cust_sales_mv Query Rewrite Accessing a materialized view can be significantly faster than accessing the underlying base tables, so the cost-based optimizer will rewrite a query to access the view when the query allows it. Query rewrite is the primary benefit enabled by materialized views. The query rewrite activity is transparent to applications. In this respect, its use is similar to the use of an index. Users do not need explicit privileges on materialized views to use them. Queries executed by any user with privileges on the underlying tables can be rewritten to access the materialized view. A materialized view can be enabled or disabled. A materialized view that is enabled is available for query rewrites. Example In the example, the optimizer is able to perform a query rewrite and use the summary created earlier to satisfy the query instead of the base table. Oracle 10g: Data Warehousing Fundamentals

299 Guidelines for Creating Materialized Views
Define a single materialized view including all measures. Include COUNT(x) when using the aggregating measure AVG(x). Guidelines for Creating Materialized Views Create materialized views that satisfy the largest number of queries. For example, if you identify 20 queries that are commonly applied to the detail or fact tables, then you might be able to satisfy them with five or six well-defined materialized views. A materialized view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX). It can also include any number of joins. Define a single materialized view that includes all measures instead of defining multiple materialized views on the same tables with the same GROUP BY columns but with different measures. Include COUNT(x) when using the aggregating measure AVG(x) to support incremental refresh. Similarly, if VARIANCE(x) or STDDEV(x) is present, then always include COUNT(x) and SUM(x) to support incremental refresh. Oracle 10g: Data Warehousing Fundamentals
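To illustrate the second guideline, a hypothetical materialized view that exposes AVG(amount_sold) would also carry COUNT(amount_sold) and SUM(amount_sold) so that incremental (fast) refresh remains possible:

CREATE MATERIALIZED VIEW prod_sales_stats_mv
ENABLE QUERY REWRITE AS
SELECT prod_id,
       SUM(amount_sold)   AS sum_amt,
       COUNT(amount_sold) AS cnt_amt,   -- needed to maintain AVG incrementally
       AVG(amount_sold)   AS avg_amt,
       COUNT(*)           AS cnt_all
FROM   sales
GROUP BY prod_id;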

300 How to Find the Best Materialized Views?
One materialized view can be used to satisfy multiple queries. Multiple materialized views can satisfy the same query. A balance between performance and space usage must be found. Which one is the best? Analyze your workload. Use Summary Advisor. Use EXPLAIN_REWRITE to see why a materialized view is used or ignored. How to Find the Best Materialized Views? A query rewrite can use the same materialized view to satisfy different queries. Conversely, different materialized views can satisfy one particular query. Thus, it is a difficult task to identify the best materialized view to create, given a set of queries. As a DBA, you must find a balance between query performance and the disk space used to store the materialized views. In order to solve this dilemma, you should analyze your workload manually. If you are unsure of which materialized views to create, Oracle provides a set of advisory procedures and functions in the DBMS_OLAP package to help in designing and analyzing materialized views for query rewrite. This collection of procedures and functions is also known as the Summary Advisor. The DBMS_MVIEW.EXPLAIN_REWRITE procedure can also be used to learn more about what materialized views will be used or ignored in order to resolve a SQL query. Oracle database provides many enhanced features to facilitate summary management. For example, an EXPLAIN_MVIEW procedure has been added to the DBMS_MVIEW package that enables you to analyze materialized views. The Summary Advisor in Oracle Database 10g has procedures such as GENERATE_MVIEW_REPORT (to generate an HTML report about the recommendations from Summary Advisor) and GENERATE_MVIEW_SCRIPT (to generate the SQL statements you need to use based on the recommendations). Oracle 10g: Data Warehousing Fundamentals
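As a sketch of checking why a given query does or does not use a materialized view, DBMS_MVIEW.EXPLAIN_REWRITE can be called as shown below; the query text and view name are hypothetical, and the output table (REWRITE_TABLE) must already exist, for example by running the utlxrw.sql script shipped with the database:

BEGIN
  DBMS_MVIEW.EXPLAIN_REWRITE(
    query        => 'SELECT c.cust_id, SUM(amount_sold)
                     FROM sales s, customers c
                     WHERE s.cust_id = c.cust_id
                     GROUP BY c.cust_id',
    mv           => 'CUST_SALES_MV',
    statement_id => 'chk1');
END;
/
SELECT message FROM rewrite_table WHERE statement_id = 'chk1';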

301 Why Are Dimensions Important?
Dimensions are data dictionary structures that have zero or more hierarchies based on existing columns. Create more hierarchies in dimensions for the following reasons: They enable additional query rewrites without the use of constraints. They help document hierarchies. They can be used by online analytical processing (OLAP) tools. Why Are Dimensions Important? Dimensions can have zero or more hierarchies defined based on existing columns of a table. They are data dictionary structures that define hierarchies based on columns in existing database tables. Create as many hierarchies as possible for the following reasons: They enable additional rewrite possibilities without the use of constraints. Implementation of constraints may not be desirable in a data warehouse for performance reasons. They help document dimensions and hierarchies explicitly. They can be used by online analytical processing (OLAP) tools. Oracle 10g: Data Warehousing Fundamentals

302 Dimensions and Hierarchies
ALL Year_Key Calendar hierarchy Level keys Quarter_Key Attribute Month_Key Month_Desc Dimensions and Hierarchies Dimensions: Dimensions describe analytic business entities such as products, departments, and time in a hierarchical, categorized manner. A dimension can consist of one or more hierarchies. In the example shown, the time dimension consists of a calendar hierarchy. Hierarchies: A hierarchy consists of multiple levels. Each value at a lower level in the hierarchy is the child of one and only one parent at the next higher level. A hierarchy consists of a 1:n relationship between levels, with the parent level representing a level of aggregation of the child level. In the example, the calendar hierarchy consists of sales date, month, quarter, and year. The arrows indicate the direction of traversing a hierarchy to roll up data at one level to get aggregate information at the next level. For example, rolling up daily data yields monthly data, rolling up monthly data yields quarterly data, and so on. Level keys and attributes: A level key is used to identify one level in a hierarchy. The use of surrogate keys to identify hierarchical elements during the dimensional design phase further leverages the performance advantage provided by level keys. There may be additional attributes for a level, which can be determined given the level key. Attributes can be used as aliases for a level. In the example, Month_Key (defined as two digits) is the level key that identifies a month, and Month_Desc is an attribute that can be used as an alias for a month. Sales_Date Oracle 10g: Data Warehousing Fundamentals

303 Oracle 10g: Data Warehousing Fundamentals 1 - 396
Dimension Example
Table TIME:         YEAR_KEY, QUARTER_KEY, MONTH_KEY, MONTH_DESC, SALES_DATE
Dimension TIME_DIM: YR, QTR, MON (attribute MONTH_DESC), SDATE
Dimension Example A dimension, and the hierarchical relationships established between its levels, can be based on columns in a single table (or columns from several tables in the case of normalized or snowflake schemas). In the example, the TIME_DIM dimension is based on the TIME table (sketched below) and has four levels: The highest level in the hierarchy consists of the YEAR_KEY column. The next level is derived from the QUARTER_KEY column. The third level has the MONTH_KEY column as the key and MONTH_DESC as an attribute. The lowest level is based on the SALES_DATE column. Oracle 10g: Data Warehousing Fundamentals
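A minimal sketch of a TIME table that could back this dimension; the column data types are assumptions for illustration only and are not given in the course material.

-- Illustrative only: one row per sales date, denormalized up to the year level.
CREATE TABLE time (
  sales_date   DATE         NOT NULL,  -- lowest level (day)
  month_key    VARCHAR2(2)  NOT NULL,  -- level key for month (two digits)
  month_desc   VARCHAR2(20),           -- attribute determined by month_key
  quarter_key  VARCHAR2(2)  NOT NULL,  -- level key for quarter
  year_key     NUMBER(4)    NOT NULL,  -- level key for year
  CONSTRAINT time_pk PRIMARY KEY (sales_date)
);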

304 Defining Dimensions and Hierarchies
[Diagram: hierarchy Year > Quarter > Month > Sales date]
CREATE DIMENSION time_dim
  LEVEL sdate IS time.sales_date
  LEVEL mon   IS time.month_key
  LEVEL qtr   IS time.quarter_key
  LEVEL yr    IS time.year_key
  HIERARCHY calendar_rollup (
    sdate CHILD OF mon CHILD OF qtr CHILD OF yr )
  ATTRIBUTE mon DETERMINES month_desc;
Defining Dimensions and Hierarchies The CREATE DIMENSION system privilege is required to create a dimension in one's own schema based on tables that are within the same schema. Another privilege, CREATE ANY DIMENSION, allows a user to create dimensions in any schema. In the example shown, the TIME_DIM dimension is based on the TIME table. Oracle 10g: Data Warehousing Fundamentals
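As a hedged follow-up, the dimension's declared relationships can be checked against the actual data; the sketch below assumes the DBMS_DIMENSION package shipped with Oracle Database 10g and the DIMENSION_EXCEPTIONS table created by the utldim.sql script, so treat the exact names as assumptions to verify against your release.

-- Illustrative only: report rows that violate the declared 1:n level relationships.
BEGIN
  DBMS_DIMENSION.VALIDATE_DIMENSION(
    dimension    => 'TIME_DIM',
    incremental  => FALSE,      -- check all rows, not just newly loaded ones
    check_nulls  => TRUE,       -- also flag NULL level key values
    statement_id => 'dim_check');
END;
/

SELECT * FROM dimension_exceptions WHERE statement_id = 'dim_check';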

305 Dimensions with Multiple Hierarchies
[Diagram: the Calendar hierarchy (YR > QTR > MON > DT) and the Week hierarchy (YR > WK > DT) defined within the same dimension]
Dimensions with Multiple Hierarchies The previous example showed a single hierarchy within the TIME dimension, but it is possible to have multiple hierarchies. For example, the pair of hierarchies shown in the slide can be created within a single dimension. The statement to do this is as follows:
CREATE DIMENSION time_dim
  LEVEL dt  IS time.sales_date
  LEVEL wk  IS time.week_key
  LEVEL mon IS time.month_key
  LEVEL qtr IS time.quarter_key
  LEVEL yr  IS time.year_key
  HIERARCHY cal  ( dt CHILD OF mon CHILD OF qtr CHILD OF yr )
  HIERARCHY week ( dt CHILD OF wk CHILD OF yr );
Oracle 10g: Data Warehousing Fundamentals

306 Rewrites Using Dimensions
SELECT t.year, p.brand, c.city_name, SUM(s.amt)
FROM   sales s, city c, time t, product p
WHERE  s.sales_date = t.sdate
AND    s.city_name  = c.city_name
AND    s.state_code = c.state_code
AND    s.prod_code  = p.prod_code
GROUP BY t.year, p.brand, c.city_name;

SELECT v.year, s.brand, s.city_name, SUM(s.tot_sales)
FROM   sales_sumry s,
       (SELECT DISTINCT t.month, t.year FROM time t) v
WHERE  s.month = v.month
GROUP BY v.year, s.brand, s.city_name;

Rewrites Using Dimensions The example in this slide shows a rewrite that is enabled by the TIME_DIM dimension. The relationship between month and year is inferred from the definition of the dimension and is used to roll up the sales summary data to obtain yearly sales. Oracle Database 10g supports the American National Standards Institute (ANSI) join syntax with complete support for one-sided and full outer joins. Oracle 10g: Data Warehousing Fundamentals
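The rewritten query presumes a month-level summary; a possible shape for the SALES_SUMRY materialized view is sketched below, with column names chosen only to match the rewritten query above (the actual definition is not given in the course material).

-- Illustrative only: a month/brand/city summary eligible for query rewrite.
CREATE MATERIALIZED VIEW sales_sumry
  ENABLE QUERY REWRITE
AS
  SELECT t.month      AS month,
         p.brand      AS brand,
         c.city_name  AS city_name,
         SUM(s.amt)   AS tot_sales
  FROM   sales s, city c, time t, product p
  WHERE  s.sales_date = t.sdate
  AND    s.city_name  = c.city_name
  AND    s.state_code = c.state_code
  AND    s.prod_code  = p.prod_code
  GROUP BY t.month, p.brand, c.city_name;

Because TIME_DIM declares that month rolls up to year, the optimizer can join this summary back to the TIME table to answer the year-level query without touching the detail SALES table.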

307 Oracle 10g: Data Warehousing Fundamentals 1 - 400
Summary In this lesson, you should have learned how to: Discuss summary management and Oracle implementation of summaries Describe materialized views Identify the types, build modes, and refresh methods for materialized views Explain the query rewrite mechanism in Oracle Describe the significance of Oracle dimensions Oracle 10g: Data Warehousing Fundamentals

308 Oracle 10g: Data Warehousing Fundamentals 1 - 401
Practice 9-1: Overview This practice covers the following topics: Identifying the importance of summary management for the RISD data warehouse Identifying the refresh strategy for the RISD materialized views Examining materialized views and other concepts discussed in the lesson Oracle 10g: Data Warehousing Fundamentals

309 Leaving a Metadata Trail
Schedule: Timing Topic 40 minutes Lecture 30 minutes Practice 70 minutes Total

310 Oracle 10g: Data Warehousing Fundamentals 1 - 405
Objectives After completing this lesson, you should be able to do the following: Define warehouse metadata, its types, and its role in a warehouse environment Examine each type of warehouse metadata Develop a metadata strategy Outline the Common Warehouse Metamodel (CWM) Describe Oracle Warehouse Builder’s compliance with Object Management Group’s Common Warehouse Metamodel (OMG-CWM) Lesson Aim Metadata has already been referenced a number of times in this course. It is critical to every phase of warehouse design and development. This lesson examines the role of warehouse metadata in greater detail. Oracle 10g: Data Warehousing Fundamentals

311 Defining Warehouse Metadata
Descriptive data about warehouse data and processes Vital component of data warehouse Used by everyone The key to understanding warehouse information Defining Warehouse Metadata Data About Data Metadata is "data about data." Warehouse metadata is descriptive data about warehouse data and the processes that are used in creating the warehouse. Warehouse metadata contains detailed descriptions about the location, structure, and meaning of data. It describes keys and indexes of the data. It contains mapping information, and it documents the algorithms and business rules that are used to transform and summarize data. Metadata is used throughout the warehouse, from the extraction stage through the access stage. Vital Component of the Data Warehouse Metadata plays a vital role in the successful implementation of a data warehouse. A warehouse with poor metadata is analogous to a filing cabinet filled with folders stored in no particular order. It is very difficult to find your information in the cabinet. Oracle 10g: Data Warehousing Fundamentals

312 Oracle 10g: Data Warehousing Fundamentals 1 - 408
Metadata Users [Diagram: end users, developers, and IT professionals all drawing on the metadata repository] Metadata Users In the warehouse, metadata is employed directly or indirectly by all warehouse users for many different tasks. End Users The decision-support analyst (or user) uses metadata directly. The user does not have the high degree of knowledge that the IT professional has, and metadata is the map to the warehouse information. One measure of a successful warehouse is the strength and ease of use of end-user metadata. Developers and IT Professionals For the developer (or an IT professional), metadata contains information about the location, structure, and meaning of data, information about mappings, and a guide to the algorithms used for summarization between detail and summary data. Oracle 10g: Data Warehousing Fundamentals

313 Oracle 10g: Data Warehousing Fundamentals 1 - 409
Types of Metadata End-user metadata: Key to a good warehouse Navigation aid Information provider Context ETL metadata: Maps structure Source and target information Transformations Operational metadata: Load, management, scheduling processes Performance Types of Metadata End-User Metadata End-user metadata describes the location and structure of data for user access. It describes data volumes and algorithms. Essentially, this is the floor plan that the knowledge worker uses to navigate through and around the data. End-user metadata is sometimes referred to as business metadata. ETL Metadata Extraction, transformation, and loading metadata (sometimes called warehouse metadata or ETL metadata) maps the structure of source systems and how the data is to be transformed into its new format for the warehouse. It contains all the rules for extracting, scrubbing, summarizing, transporting, and loading data. This is often the most difficult metadata model to construct. Operational Metadata Operational metadata is used by the load, management, and access processes for scheduling data loads or end-user access. It contains information about housekeeping activities, statistics of table usage, and information about every aspect of performance. That is, operational metadata is considered in each phase of the ETL process. Oracle 10g: Data Warehousing Fundamentals

314 Examining Types of Metadata
[Diagram: external sources and operational data sources feed the warehouse through ETL; ETL metadata and end-user metadata reside in the metadata repository used by the end user] Examining Types of Metadata Now you examine more closely the primary types of warehouse metadata. This includes ETL metadata generated during warehouse development considering the external and operational data, as well as end-user metadata. Oracle 10g: Data Warehousing Fundamentals

315 Examining Metadata: ETL Metadata
Business rules Source tables, fields, and key values Ownership Field conversions Encoding and reference table Name changes Key value changes Default values Logic to handle multiple sources Algorithms Time stamp Examining Metadata: ETL Metadata The following pages examine the different types of warehouse metadata more closely, especially the ETL metadata generated during warehouse development and the end-user metadata. ETL metadata defines how data from the physical level in the source system maps to the physical level in the data warehouse. ETL metadata holds: The business rules that are applied to the warehouse data Names of the source tables, source fields, and source key values Information about the owner of the source data The rules that are applied to field conversions on a field-by-field basis Encoding and reference table conversions Field name and key value changes Default values assigned to NULL fields Logic to extract records from multiple source systems and create records (or a single record) for the load process Algorithms that create derived data (for example, Selling_Price = Total_Sales / Units_Sold) Time stamp details Oracle 10g: Data Warehousing Fundamentals

316 Oracle 10g: Data Warehousing Fundamentals 1 - 413
Extraction Metadata Space and storage requirements Source location information Diverse source data Access information Security Contacts Program names Frequency details Failure procedures Validity checking information Metadata repository Extraction Extraction Metadata Extraction metadata contains: Space requirement information Storage frequency and duration details Source location information such as hardware platform information, gateway information, operating system, file system, database, origin and destination information, and loading rules Diverse system information with details about the source type such as whether the data is production, internal, external, or archive; structure information such as file type, name, field type, and data granularity Access information such as alias names, versions, relationships, and data volatility Security information, table owners, data owners, authorization levels, and audit trail information Source data contact and owner details (for example, their names, telephone numbers, and identifiers) Extraction program names Temporary storage details, names of storage files, and procedure for removing storage files Oracle 10g: Data Warehousing Fundamentals

317 Transformation Metadata
Duplication routines Exception handling Key restructuring Grain conversions Program names Frequency Summarization Metadata repository Transformation Transformation Metadata Transformation metadata contains: Duplication routines for elimination, consolidation, ordering, and summarization of data Exception handling and validation procedures Key restructuring rules Granularity conversions Transformation program names and locations Frequency of the transformation Summarization procedures Oracle 10g: Data Warehousing Fundamentals

318 Oracle 10g: Data Warehousing Fundamentals 1 - 416
Loading Metadata Method of transfer Frequency Validation procedures Failure procedures Deployment rules Contact information Metadata repository Loading Loading Metadata Loading metadata contains: Data-transfer methods Frequency of transportation Validation procedures Failure procedures Rules for deployment Contact information, in case of any issue with the data or the movement of data Oracle 10g: Data Warehousing Fundamentals

319 Examining Metadata: End-User Metadata
Location of facts and dimensions Availability Description of contents and algorithms used for derived and summary data Data ownership details Metadata repository Examining Metadata: End-User Metadata The user never accesses end-user metadata directly; it is viewed from the end user's tool and used to navigate around the data. Using this metadata, users can see the data available in the warehouse environment and establish the meaning of elements within the warehouse. User metadata describes: The physical location of fact and dimension data The availability of the data. Not all data components of the warehouse are available to every user. Some facts may be sensitive to specific user groups. The exact description of the contents and business algorithms used to create summary data. Users should never be in a position where they are guessing how a summary has been calculated. How derived data has been created, the source data, and any algorithms used Data ownership details, so that if there are any problems with the data content, the user can ask the appropriate person questions about the data and identify or rectify the problems found. This information must include a telephone number, fax number, or e-mail address. Data ownership details are possibly the most important aspect of end-user metadata. If there is an issue with the data, it must be resolved quickly and appropriately. Oracle 10g: Data Warehousing Fundamentals

320 End-User Metadata: Context
Table Name: Products
Column Name         Data         Meaning
Prod_ID             739516       Unique identifier for the product
Prod_valid          A            Whether the product is Available (A) or not
Supplier_Id         1            The ID of the supplier who supplies the product
Prod_Eff_from       1-Jan-1998   The effective date on which the product is made available
Prod_category_Id    201          The category of the product (for example, Electronics)
Prod_Weight_Class                Depends on packed shipping weight in kilograms
End-User Metadata: Context If the following data is warehouse data, how much can you deduce? 739516  A  1  1-Jan-1998  201 You can deduce nothing tangible from this data other than a series of numbers and codes. It could represent product codes, map coordinates, or employee salaries. The only way to deduce information from this data is to know the context of the table you are querying, and to associate the data with its metadata description to understand the exact meaning. For example, if you are querying the Products table, the metadata may look like the table given in the slide. When you associate the data with its metadata description, the data becomes information. Oracle 10g: Data Warehousing Fundamentals

321 Historic Context of Data
Supports change history Maintains the context of information Metadata repository End users Historic Context of Data Historic data often has business rules and algorithms applied that are different from those applied to current data. In the operational environment, there is only one definition of the database structure at any time. In the warehouse environment, data definitions change over a period of time. It is important to record the date when data changes, names, key values, default values, and algorithms to allow knowledge workers to analyze the data in the correct context. This ensures that you can understand and identify the differences in the context of the data in historical files. For example, you may store data for 2002–2005 offline. Suppose you want to store 2006 data online. The default value for an amount field changed from a series of 9s to 0s at some point during that period. You can run a query to identify amounts between 2003 and 2005, but if you do not understand when and how default amounts were recorded, you may not be able to explain or understand why both 9s and 0s are stored, or realize the impact that the change has on calculations or reports. Another example arises with products such as personal computers that had very few components when they were first available. Consider the changes they have gone through and the many components they contain today. There is a rapid and voluminous history of change. Oracle 10g: Data Warehousing Fundamentals

322 Oracle 10g: Data Warehousing Fundamentals 1 - 420
Types of Context Simple: Data structures Naming conventions Metrics Complex: Product definitions Markets Pricing External: Economic Political Metadata repository End users Types of Context The context of data in the warehouse may be: Simple contextual information such as data structures, data coding, naming conventions, and data metrics Complex contextual information such as product definitions, market territories, pricing, packaging, and rule changes External contextual information such as economic forecasts, political information, and competitive information Oracle 10g: Data Warehousing Fundamentals

323 Developing a Metadata Strategy
Define a strategy to ensure high-quality metadata useful to users and developers. Primary strategy considerations: Define goals and intended use. Identify target users. Choose tools and techniques. Choose the metadata location. Manage the metadata. Manage access to the metadata. Integrate metadata from multiple tools. Manage change. Developing a Metadata Strategy Like every other aspect of the data warehouse implementation, metadata should be the subject of a well-considered, well-planned strategy. You must ensure that the metadata is of a high quality, provides the right information to users and developers, and is able to take into account the various tools that employ metalayers. Integrating these layers is critical. Primary Considerations Among many other considerations, you need to resolve these key issues for the strategy: Define the goals and intended use of the warehouse metadata. Identify the target users of warehouse metadata. Choose tools and techniques for creating and managing metadata. Choose the metadata location. Manage the metadata. Manage access to the metadata. Integrate multiple sets of metadata from different tools. Manage changes to metadata. Oracle 10g: Data Warehousing Fundamentals

324 Defining Metadata Goals and Intended Usage
Define clear goals. Identify requirements. Identify intended usage. Defining Metadata Goals and Intended Usage Define clear goals and identify the intention of the metadata that you develop. Outline main requirements such as maintaining history, context, and algorithms. Metadata Oracle 10g: Data Warehousing Fundamentals

325 Identifying Target Metadata Users
Who are the metadata users? Developers Data warehouse or BI analysts Data warehouse architects Database administrators (DBAs) Report specialists End users Identifying Target Metadata Users Consider who, among developers, data warehouse or business intelligence (BI) analysts, architects, DBAs, report specialists, and end users, will access metadata, and what type of metadata (information) each group needs. For example, a DBA may access the data dictionary frequently, architects and developers may access ETL metadata frequently, and so on. Determine the means by which they will access the metadata. For example, report specialists may access metadata through reporting tools such as Discoverer, and architects may access ETL metadata through an ETL tool such as Warehouse Builder. Oracle 10g: Data Warehousing Fundamentals

326 Choosing Metadata Tools and Techniques
Data modeling ETL End user (query and analysis) Database schema definitions Middleware tools Choosing Tools and Techniques Data Modeling Tools These tools are also known as computer-aided software engineering (CASE) tools. Some of these tools are better than others at physically modeling metadata. Consider using a tool that either is specifically designed to model warehouse features or is extensible. For example, can the tool model a star or a snowflake schema? ETL Tools These tools are used for extracting, transforming, and loading data into a warehouse, and they also generate metadata. Generally, these tools are expensive purchases and often cannot be justified for the first iteration of development. However, these tools have the advantage of being able to create and maintain a metadata layer. The tool must have all the information to take source data to the warehouse, so it is logical that the tool itself contains this layer. Note: Oracle Warehouse Builder (OWB) is used for modeling as well as for ETL tasks. Oracle 10g: Data Warehousing Fundamentals

327 Choosing the Metadata Location
Usually the warehouse server Possibly on operational platforms Desktop tool with metalayer Choosing the Metadata Location Metadata exists for every process and product that is employed in the data warehouse environment. The storage of metadata is product specific. The location of metadata is often determined by the tool you use to create it. If you are using a relational database management system, then by default the metadata resides in the database and usually on the warehouse server. This is the preferred method. You may locate the metadata on a separate database on another machine. Some ETL and query tools have their own metalayer. When this is the case, you need to ensure that each metalayer can communicate with others. Metadata Oracle 10g: Data Warehousing Fundamentals

328 Oracle 10g: Data Warehousing Fundamentals 1 - 427
Managing the Metadata Managed by the metadata manager Maintained by the metadata architect Must follow standards Managing the Metadata Management: Given the critical importance of metadata within the warehouse environment, it must be subject to strict control and management. Metadata is such a vital component in your warehouse implementation that someone should be responsible for managing and maintaining it. It is also important to ensure that creation of or changes to metadata are agreed upon with a formal sign-off. Maintenance: A metadata architect is usually responsible for defining the strategy and implementing metadata. This person is primarily responsible for ensuring that metadata remains up-to-date and consistently reflects any changes within the business infrastructure. If there are different metalayers, the architect must control integration of the metadata among products and tools. Standards: As with any development project, standards are critical. Determine standards for every aspect of metadata from simple naming conventions to versioning requirements to documenting complex algorithms. Standards for metadata are emerging within the industry. It is worth monitoring the changes that vendors are considering, as well as the collaborative exercises between large computing companies who are looking to define standards. Oracle 10g: Data Warehousing Fundamentals

329 Integrating Multiple Sets of Metadata
Multiple tools may generate their own metadata. These metalayers should be properly integrated. Metadata exchangeability is desirable. Integrating Multiple Sets of Metadata Each of the tools that you use in your warehouse environment might generate its own set of metadata. One of the biggest problems with metadata is integrating all of the different layers. Some vendors provide tools that can exchange metadata. Later in this lesson, you learn how the Common Warehouse Metamodel (CWM) addresses the sharing of metadata among Oracle tools. Oracle 10g: Data Warehousing Fundamentals

330 Managing Changes to Metadata
Different types of metadata have different rates of change. Consider metadata changes resulting from refresh cycles. Managing Changes to Metadata Metadata changes at different rates according to the type of data stored. For example, models of operational and warehouse databases might remain static for a substantial period of time; however, metadata that maintains information about the warehouse data changes frequently. Each refresh cycle brings in more data, and with it, summaries may change, dimensions may change, and more. Oracle 10g: Data Warehousing Fundamentals

331 Additional Metadata Content and Considerations
Summarization algorithms Relationships Stewardship Permissions Pattern analysis Reference tables Additional Metadata Content and Considerations The following sources of metadata cannot be ignored while managing the metadata: Summarization Algorithms You have seen that the warehouse contains fully detailed fact records and summary records that are created according to predefined algorithms. The meaning of the summaries is maintained in the metadata. Relationships Relationships show how tables are related, their constraints and rules, and the cardinality of data. This relationship information is maintained in the metadata. This information is documented along with ownership information and text descriptions of tables and keys. Stewardship Metadata must identify the originator of data. Remember that the data in the warehouse has come from many different source systems, with different suppliers, different owners, and different transformation issues. Oracle 10g: Data Warehousing Fundamentals

332 Common Warehouse Metamodel
[Diagram: a CWM metadata repository at the center of design and administration, analytic applications, any source (ERP, operational, external), data integration, the warehouse and marts, information delivery, and any access (reporting, ad hoc query and analysis, data mining)] Common Warehouse Metamodel Object Management Group's (OMG's) Common Warehouse Metamodel (CWM) is an open standard that Oracle and other vendors originally submitted to the Object Management Group and that has since become the de facto metadata standard for data warehousing and business intelligence. This open standard enables metadata integration between various applications such as query analysis and reporting tools, data mining tools, ERP applications, and so on. OMG-CWM enables tight integration of metadata among Oracle's products as well as industry-leading tools from Oracle partners, resulting in reduced implementation complexity and greater user productivity. Instructor Note There are many metadata management tools available in the market: generic repository tools such as Data Shopper from Platinum Technology and Manager Link from Manager Software Products; and also tools specifically used for data warehouses and data marts, such as Prism Directory Manager from Prism Solutions, Meta Agent from Information Advantage, and so on. Oracle 10g: Data Warehousing Fundamentals

333 Oracle Warehouse Builder: Compliance with OMG-CWM
Is compliant with OMG-CWM standard Supports metadata management Integrates with other Oracle products Provides a graphical user interface and a repository Provides bridges to exchange metadata Oracle Warehouse Builder: Compliance with OMG-CWM As described earlier, Oracle Warehouse Builder 10g (OWB) is an enterprise business intelligence integration design tool that manages the full life cycle of design, deployment, and management for BI solutions on Oracle Database 10g. It provides an easy-to-use graphical environment to rapidly develop business intelligence systems. The Common Warehouse Metamodel (CWM) standard, in conjunction with supporting tools developed by Oracle, addresses the metadata integration and management challenge. CWM provides a standard for warehouse metadata so that disparate vendor tools can interoperate at the metadata level. CWM is based on other open standards with XML for Metadata Interchange (XMI) and Extensible Markup Language (XML) for interchange, and Unified Modeling Language (UML) as the modeling language. CWM is defined in UML as a set of core classes. These classes are divided into packages (or submodels), each representing a specific domain of data warehousing—for example, Relational, OLAP and Transformation. CWM provides a powerful object model that spans the spectrum of metadata relating to the extraction, transformation, loading, integration, and analysis phases within data warehousing. No single model can ever meet the diverse needs of all application and tool developers, but CWM will provide extensibility for tool-specific extensions. It is designed to support rapidly evolving metadata requirements, enabling customers to extend the model to meet their specific needs. Oracle 10g: Data Warehousing Fundamentals

334 Oracle 10g: Data Warehousing Fundamentals 1 - 435
Summary In this lesson, you should have learned how to: Define warehouse metadata, its types, and its role in a warehouse environment Examine each type of warehouse metadata Develop a metadata strategy Outline the Common Warehouse Metamodel (CWM) Describe Oracle Warehouse Builder’s compliance with OMG-CWM Oracle 10g: Data Warehousing Fundamentals

335 Oracle 10g: Data Warehousing Fundamentals 1 - 436
Practice 10-1: Overview This practice covers the following topics: Answering the questions based on the scenario Exploring the viewlets demonstrating the metadata management features of Warehouse Builder Oracle 10g: Data Warehousing Fundamentals

336 OLAP and Data Mining Schedule: Timing Topic 30 minutes Lecture
15 minutes Practice 45 minutes Total

337 Oracle 10g: Data Warehousing Fundamentals 1 - 439
Objectives After completing this lesson, you should be able to do the following: Define online analytical processing and the Oracle Database 10g OLAP option Compare ROLAP and MOLAP List the benefits of OLAP and RDBMS integration List the benefits of using OLAP for end users and IT Describe the data mining concepts Describe the tools and technology offered by Oracle for OLAP and data mining Objectives In large data warehouse environments, many different types of analysis can occur. In addition to SQL queries, you may also apply more advanced analytical operations to your data. Two major types of such analysis are online analytical processing (OLAP) and data mining. The goal of this lesson is to describe the fundamental concepts of OLAP and data mining. This lesson introduces the concepts of OLAP. You learn about the Oracle Database 10g OLAP option that is integrated into Oracle Database 10g. A comparison of ROLAP versus MOLAP is presented, and also the benefits of OLAP are discussed. The tools and technology offered by Oracle for OLAP and data mining are briefly discussed. Oracle 10g: Data Warehousing Fundamentals

338 Oracle 10g: Data Warehousing Fundamentals 1 - 440
OLAP: Overview OLAP stands for online analytical processing. Online: You have access to live data (rather than static data). Analytical processing: You can analyze your data for reporting. You can create reports that are: Multidimensional Calculation intensive Supported by time-based analysis Ideal for applications with unpredictable, ad hoc query requirements OLAP: Overview Online analytical processing (OLAP) is a term that has been used since the early 1990s to describe a class of computer systems that are designed and optimized for complex analysis and better query performance. By using this term, you can differentiate the requirements of the data analyst from the requirements of the users of OLTP. In the context of business intelligence today, the emphasis is more on “online” and “analytical.” Online: Although most OLAP tools and applications enable development of reports that can be saved and printed when not connected to live data, OLAP emphasizes live access to data rather than static reporting. Analytic queries are submitted against the database in real time, and the results are returned to your computer screen. Analytical processing: This is the key concept with OLAP. End users can: Easily navigate multidimensional data to perform unpredictable ad hoc queries and to display the results in a variety of interesting layouts Drill through levels of detail to uncover significant aspects of the data Rapidly and efficiently obtain the results of sophisticated data calculation and selection across multiple dimensions of the data Oracle 10g: Data Warehousing Fundamentals

339 Typical Example of an OLAP Query
An OLAP question is a multidimensional query, as in the following: “What was the percentage change in revenue for a grouping of our top 20% products from one year ago over a rolling three-month time period compared to this year for each region of the world?” This is a simple business question, but the actual query can be quite complex. Typical Example of an OLAP Query The questions that business users tend to ask are naturally multidimensional. They use a multidimensional language to express the business questions that they ask (such as the one shown in the slide). OLAP Questions Are Multidimensional Queries The OLAP question shown in the slide is a common example of a multidimensional query. It describes both the data that the user wants to examine and the structural form of that data. Business users typically want to answer questions that include terms such as what, where, who, and when. For example, you find the following essential questions embedded in the sample question: What products are selling best? (“…top 20%…”) Where are they selling? (“…each region of the world…”) When have they performed the best? (“…percentage change in revenue…”) Oracle 10g: Data Warehousing Fundamentals

340 An Overview of Multidimensional Model
The multidimensional logical model has the following elements: Measures Dimensions Hierarchies Levels Attributes [Diagram: a Sales cube dimensioned by Region, Time, Customer, and Product] An Overview of Multidimensional Model A multidimensional model is essentially made up of measures and dimensions. Measures contain or calculate the data, and dimensions organize the data. Typically, you may want to perform analyses of sales (a measure) by product, region, customer, and time (the dimensions). Note: The fundamental concepts of measures and dimensions are explained in the lesson titled "Business, Logical, and Dimensional Modeling." Dimensions are mandatory in a dimensional model. If you do not have dimensions, you cannot have measures. Dimensions are what describe measures; they are fundamental to dimensional analysis. Dimensions may contain the following elements: Dimensions optionally have hierarchies, which are logical structures that group like members of a dimension together for the purposes of analysis, aggregation, or allocation. Hierarchies may or may not have levels because some hierarchies are not level based. Dimensions may also have attributes, which are used to provide more information about members of the dimension, and are useful when filtering that dimension for analysis. Oracle 10g: Data Warehousing Fundamentals

341 OLAP Implementation Before Oracle Database 10g
Deploy a specialized database that physically stores data in multidimensional form. e.g., Oracle Express, Hyperion Essbase, Microsoft Analysis Services Implement the logical dimensional model using a "star" or "snowflake" schema in a relational database. e.g., Oracle8i, IBM DB2, Microsoft SQL Server [Diagram: OLAP tools accessing a multidimensional database through the OLAP API; SQL tools accessing a relational schema through SQL] OLAP Implementation Before Oracle Database 10g Historically, multidimensional technology has been compelling because business users think dimensionally, and multidimensional technology presents data in a way that reflects the users' picture of their business data. However, organizations have always been forced to make technical and architectural choices, and each of these has its own advantages and disadvantages, forcing compromises in the choice of OLAP solution. Deploying a Stand-Alone Multidimensional Database For many years, the most common solution was to purchase and deploy a separate specialized multidimensional database that is tuned for the dimensional model, and transfer data to it from the source systems (most of which are running on relational database management systems, or flat file legacy systems). Implementing a Dimensional Schema in an RDBMS Other organizations, while recognizing that specialized multidimensional OLAP servers had certain advantages, would prefer to standardize on their relational database of choice, and deliver the dimensional model via a star or snowflake schema. Oracle 10g: Data Warehousing Fundamentals

342 OLAP Implementation with Oracle Database 10g
[Diagram: data and business rules held once in the database, with relational and multidimensional data types accessed by tools through SQL and the OLAP API] OLAP Implementation with Oracle Database 10g With Oracle Database 10g OLAP, these choices and compromises are no longer necessary. Oracle OLAP uniquely combines relational and multidimensional database technology into a single database: Oracle Database 10g. Oracle's objective in developing the Oracle 10g OLAP option was to continue to offer not only the leading relational platform for business intelligence, but also to incorporate highly advanced and powerful multidimensional technology inside the Oracle database, where it can be managed, secured, and accessed just like other data in the database, thus removing most, if not all, of the fundamental drawbacks of using multidimensional database technology in the past. With the Oracle OLAP option, you also benefit from the following: There is no need to copy and transfer data into separate specialized databases. All the data is in one place. All the business and calculation rules are defined and stored in one place. The data is secured and managed by the Oracle database. OLAP systems, including those based upon multidimensional data types, benefit from the scalability and availability features of the Oracle database. All data, relational and multidimensional, can be queried with regular SQL. Oracle 10g: Data Warehousing Fundamentals

343 Oracle 10g: Data Warehousing Fundamentals 1 - 447
ROLAP Versus MOLAP How to store the data for OLAP? In relational database tables to be used by OLAP metadata (relational OLAP or ROLAP) In analytical workspaces (multidimensional OLAP or MOLAP) ROLAP Versus MOLAP The types of analyses performed by your application determine the best choice of a data repository. You must examine the benefits of each storage method with regard to your application and decide which one most closely matches your requirements. You can choose to store the data for your business analysis applications from these alternatives: Entirely in the relational database. Data is stored entirely in relational tables in a data warehouse and made available to applications by OLAP metadata. During user sessions, data is selected and manipulated in the relational database. This method is typically called Relational OLAP or ROLAP. Entirely in the multidimensional analytical workspace. As a routine maintenance task, data is loaded into dimensions and variables in the workspace from one or more sources (including the relational database and flat files) and saved for use by all sessions. During user sessions, data is selected and manipulated in the analytical workspace. This method is typically called Multidimensional OLAP or MOLAP. (Typically, an analytical workspace is stored as a binary large object (BLOB) data type in the database.) Note: The data can also be distributed between the relational database and the analytical workspace using a combination of ROLAP and MOLAP. A distributed solution may be desirable when an application requires the advanced calculation capabilities and speed of a MOLAP solution combined with the efficient storage of a relational database. This method is typically called Hybrid OLAP or HOLAP. Oracle 10g: Data Warehousing Fundamentals

344 Benefits of RDBMS Integration with OLAP in Oracle Database 10g
Scalability Availability Manageability Backup and Recovery Security Benefits of RDBMS Integration with OLAP in Oracle Database 10g Basing an OLAP system on the Oracle server offers the following benefits: Scalability Oracle Database 10g is highly scalable. In today's environment, there is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multimillion row dimension tables and very large volumes of fact data. Oracle Database can handle very large data sets using parallel execution and partitioning (discussed previously in the lesson titled "Physical Modeling: Sizing, Storage, Performance, and Security Considerations"), as well as offering support for advanced hardware and clustering. Availability Oracle Database 10g includes many features that support high availability. One of the most significant is partitioning, which allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time that the data is unavailable. Another feature supporting high availability is transportable tablespaces. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates. Oracle 10g: Data Warehousing Fundamentals

345 Oracle OLAP 10g: Components
The OLAP option of the Oracle Database 10g includes the following components: The OLAP analytic engine Analytical workspaces (AWs) Analytical Workspace Manager (AWM) OLAP Worksheet OLAP Catalog Interfaces for developing OLAP applications in SQL and Java Oracle OLAP 10g: Components The OLAP option of Oracle Database 10g includes the following components: The OLAP analytic engine, which supports the selection and rapid calculation of multidimensional data within Oracle Database Analytical workspaces, which store data in a multidimensional format where it can be manipulated by the OLAP engine. That is, an analytical workspace is a container for multidimensional data types. Analytical Workspace Manager, a graphical user interface for creating and maintaining analytical workspaces OLAP Worksheet, an interactive environment for executing OLAP DML, the data definition and manipulation language for interacting with analytical workspaces OLAP Catalog, the metadata repository that represents a star schema as a logical cube. The OLAP Catalog enables OLAP applications to access relational data. Interfaces for developing OLAP applications in SQL and Java Note: Analytical workspaces can also be created by using Warehouse Builder by selecting the “Multidimensional Storage” option while designing the cubes and dimensions. Oracle 10g: Data Warehousing Fundamentals

346 Oracle Tools for AW Administration and Querying
Tools for analytical workspace (AW) administration Analytical Workspace Manager (AWM) Oracle Warehouse Builder (OWB) Tools for querying OLAP data sources OracleBI Discoverer Plus OLAP OracleBI Spreadsheet Add-In Oracle Tools for AW Administration and Querying Tools for Administration Creating and managing the analytical workspaces is one of the primary administrative tasks of the Oracle OLAP 10g option. You can use either of the following tools for this task: Analytical Workspace Manager: AWM provides a user interface for extracting data from a relational schema and creating an analytical workspace in the database standard form. This form enables the analytical workspace to be used with various tools for modifying the logical model, loading new data, aggregating the data, and making the data accessible to OLAP applications. Oracle Warehouse Builder: OWB can extract data from many different sources, transform it into a relational schema, and create a standard form analytical workspace. To further define the contents of the workspace, you can use Analytical Workspace Manager. Note: AWM mainly enables you to build analytical workspaces from already cleansed data, whereas OWB is a complete ETL tool that also helps in performing data cleansing. The OLAP query tools are discussed in the following pages. Oracle 10g: Data Warehousing Fundamentals

347 OracleBI Spreadsheet Add-In
OracleBI Spreadsheet Add-In makes it easy to access OLAP data through the familiar spreadsheet environment of Microsoft Excel. After the installation of the Add-In, “OracleBI” appears as a new menu item in Excel. By using Spreadsheet Add-In, you can establish a secure connection to the OLAP data source and use Excel as the front-end access tool to the data in the database. Spreadsheet Add-In combines the flexibility and familiarity of Excel and the power, scalability, and security of the Oracle OLAP option. Here are some of the features of the Spreadsheet Add-In: OracleBI Query and Calc Builders: After the connection is established, you can use the wizard-driven interface to drill, pivot, navigate through large cubes, and create reports. Excel features: You can use all the powerful data-formatting features of Excel, combine Oracle OLAP data with other Excel data, and write Excel macros that leverage all your data. You can also create formulas and graphs in Excel, thereby combining the powerful analytic capabilities of Oracle OLAP with standard Excel functions that you know and use each day. Oracle multidimensional data types queried in OracleBI Spreadsheet Add-In Oracle 10g: Data Warehousing Fundamentals

348 OracleBI Discoverer Plus OLAP
OracleBI Discoverer Plus OLAP is another Oracle Business Intelligence tool that can directly access Oracle OLAP data. Discoverer Plus OLAP is an ad hoc query, reporting, analysis, and Web-publishing tool. It enables you to: Perform OLAP query, reporting, and analysis on both multidimensional data models (analytical workspaces), and relational OLAP data models (star or snowflake schemas) Access and analyze multidimensional data from your company’s database without having to understand complex database concepts. The wizards and menus of Discoverer Plus OLAP guide you through the steps to retrieve and analyze multidimensional data. Because Discoverer Plus OLAP understands the dimensional data model, you formulate your queries in the language of business—you use dimensions, hierarchies, levels, and measures through a simple interface. You can also exploit the rich features of OLAP through dimensionally aware query and calculation builders, thereby simplifying the tasks of defining queries and calculations. Worksheets that are authored in Discoverer Plus OLAP can also be published on the Web through OracleAS Portal so that enterprisewide users can access them. Oracle 10g: Data Warehousing Fundamentals

349 Oracle 10g: Data Warehousing Fundamentals 1 - 456
OracleBI Beans OracleBI Beans OracleBI Beans is used by business intelligence and OLAP developers. This technology is used for developing applications such as Enterprise Planning and Budgeting and tools such as Discoverer and Spreadsheet Add-In. BI Beans is also available to third-party software developers to accelerate development of custom OLAP applications. JDeveloper Integration BI Beans is a set of standards-based Java beans that is integrated into Oracle JDeveloper. It provides analysis-aware application building blocks designed for the OLAP option of Oracle Database. Using BI Beans, you can create customized business intelligence applications that take advantage of the robust analytic capabilities of Oracle OLAP. Applications can include advanced features such as interactive user interfaces, drill-to-detail reports, forecasting, and what-if analysis. BI Beans includes Java beans for acquiring the data from the Oracle database, presenting the data in a variety of cross-tab and graph formats, and saving report definitions, custom measures, and data selections. Using BI Beans, you can develop business intelligence applications from Oracle JDeveloper, or any Java application development environment, and deploy them through any application server as a thin or thick client. Oracle 10g: Data Warehousing Fundamentals

350 Oracle Data Mining: An Overview
What is data mining? Creating models to find hidden patterns in large volumes of data Oracle Data Mining: Offers embedded mining algorithms in Oracle Database Provides context-specific recommendations and predictive monitoring of critical processes Supports supervised and unsupervised data mining Offers better data security and up-to-date data Oracle Data Mining: An Overview What Is Data Mining? Data mining creates models to find hidden patterns in large, complex collections of data. These patterns sometimes elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of patterns, or the difficulty in performing the analysis. The following are a few scenarios where data mining can be helpful: A retailer wants to increase revenues by identifying all potentially high-value customers in order to offer incentives to them. The retailer also wants guidance in store layout by determining the products most likely to be purchased together. A government agency wants faster and more accurate methods of highlighting possible fraudulent activity for further investigation, and so on. Oracle Data Mining (ODM) Oracle Database 10g has embedded data mining algorithms to sift through the large volumes of data generated by businesses to produce, evaluate, and deploy predictive and descriptive models. It enriches mission critical applications in CRM, manufacturing, inventory management, customer service and support, Web portals, wireless devices, and other fields with context-specific recommendations and predictive monitoring of critical processes. Oracle 10g: Data Warehousing Fundamentals

351 Oracle Data Mining: Interfaces
Programmatic interface: DBMS_DATA_MINING, DBMS_DATA_MINING_TRANSFORM, and DBMS_PREDICTIVE_ANALYTICS PL/SQL packages Data mining functions in Oracle SQL for high-performance scoring of data PL/SQL interface for model creation, description, analysis, and deployment Java interface based on the Java Data Mining standard for model creation, description, analysis, and deployment Graphical user interface (GUI): Oracle Data Miner Oracle Data Mining: Interfaces In addition to supporting programmatic interfaces through a rich set of PL/SQL packages, SQL functions, and a Java interface (a sketch of the PL/SQL interface follows below), Oracle Data Mining is also supported through a graphical user interface (GUI). Oracle Data Miner is a GUI provided with the Oracle Data Mining option of the database (release 10.1 and later) that enables data analysts to mine their Oracle data to find valuable hidden information, patterns, and new insights. Oracle Data Miner finds valuable information that can help users better understand customers or clients and anticipate customer behavior. Its insights can be revealing, significant, and valuable. For example, it can be used to: Predict the customers who are likely to change their service providers Discover the factors involved with a disease Identify fraudulent behavior of taxpayers, and so on Oracle 10g: Data Warehousing Fundamentals
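As a hedged illustration of the PL/SQL interface, the sketch below uses DBMS_PREDICTIVE_ANALYTICS.EXPLAIN to rank how strongly each attribute of a hypothetical CUSTOMERS table relates to a churn indicator; the table and column names are assumptions for illustration and are not part of the course schema.

-- Illustrative only: rank the attributes of CUSTOMERS by their influence
-- on the CHURN_FLAG column; results are written to a new table CHURN_EXPLAIN.
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
    data_table_name     => 'CUSTOMERS',
    explain_column_name => 'CHURN_FLAG',
    result_table_name   => 'CHURN_EXPLAIN');
END;
/

-- Inspect the result table; attributes with the strongest relationship rank first.
SELECT * FROM churn_explain;

For full control over the mining process (choice of algorithm, settings, and scoring), the DBMS_DATA_MINING package is used instead to build, test, and apply named models.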

352 Oracle 10g: Data Warehousing Fundamentals 1 - 460
Summary In this lesson, you should have learned how to: Define online analytical processing and the Oracle Database 10g OLAP option Compare ROLAP and MOLAP List the benefits of OLAP and RDBMS integration List the benefits of using OLAP for end users and IT Describe the data mining concepts Describe the tools and technology offered by Oracle for OLAP and data mining Oracle 10g: Data Warehousing Fundamentals

353 Oracle 10g: Data Warehousing Fundamentals 1 - 461
Practice 11-1: Overview This practice involves answering the questions based on the concepts of Oracle OLAP and data mining covered in this lesson. Oracle 10g: Data Warehousing Fundamentals

354 Data Warehouse Implementation Considerations
Schedule: Timing Topic 30 minutes Lecture 30 minutes Total

355 Oracle 10g: Data Warehousing Fundamentals 1 - 464
Objectives After completing this lesson, you should be able to do the following: Describe the project management plan Specify the requirements for the implementation Describe the metadata repository, technical architecture, and other considerations Describe post-implementation change management considerations Objectives This lesson provides the essence of data warehouse implementation considerations. Most of these have already been discussed in the individual lessons. (You are given some handouts based on the RISD case study, which can serve as templates for a data warehouse implementation.) Note that there is no one solution that fits all data warehousing scenarios of various sizes and requirements. Oracle 10g: Data Warehousing Fundamentals

356 Oracle 10g: Data Warehousing Fundamentals 1 - 465
Project Management Develop a project management plan (PMP) with: Purpose Scope Technical architecture Data warehouse design Data quality and data acquisition Data access and security Objectives Goals and success factors Approach Implementation methods Administration Control and reporting Project tasks, work products, and milestones Project Management When you agree on the proposal for a data warehouse implementation project, develop a project management plan (PMP). The purpose of the PMP is to confirm the scope of work to be performed as agreed upon in the proposal and the corresponding contract. Keep the following points in mind while developing the PMP: Clearly state the purpose of the project as agreed upon in the proposal document and the contract document. Define the scope of the project in terms of: Technical architecture to be developed by your team (to do with the creation of the database required, and so on) Data warehouse design (creating logical model and converting them to physical models) Data quality and data acquisition (defining who is responsible for data cleansing, and what kind of ETL tasks will be performed by the project team) Data access and security scope (defining how the data is accessed, and what kind of security will be provided—for example, a role-based security or a virtual private database, or both) Define the objectives of the project clearly stating the goals and the critical success factors of the project. Oracle 10g: Data Warehousing Fundamentals

357 Requirements Specification or Definition
Conform to the detailed business and system objectives. Review the initial data model. Define requirement standards and guidelines. Define data access requirements. Define reporting requirements. Define security requirements. Data access Reporting Portal Requirements Specification or Definition Prepare the requirements specification document (sometimes, there may be one for each of the requirements) conforming to the detailed business and system objectives defined in the PMP. Analyze and define the requirements, considering the following points: Review the initial data model and define the requirements of the business model. Define the requirement standards and guidelines for the project. Define data access requirements. Define common reporting requirements such as what reports are required to be generated by the system. Define the security requirements at each of these levels: database, reporting, and portal. (For example, the data and reports are stored in the database, you may be using Discoverer as the reporting tool, and these reports are published on OracleAS Portal. So what is the security that you will be implementing at each level?) Note: Refer to the template of the security requirements document for the RISD case study, M3_RISD_Security_Require_3.0.doc. Oracle 10g: Data Warehousing Fundamentals

358 Logical, Dimensional, and Physical Data Models
Define the logical model. Identify entities, relationships between them, and cardinality. Define the dimensional model. Identify facts, dimensions, and hierarchies. Design the time dimension. Define the physical model. Document all the models. Logical, Dimensional, and Physical Models The data model is a structured representation of the information that supports the RISD assessment reporting requirements. It is based on the requirements gathered during the definition phase of the data warehouse initiative. The functional or business data model typically avoids technological aspects and depicts only the business data. By identifying information by its business names and providing definitions and descriptions, it provides a common, consistent point of view that should be shared in an organization. The base of the data model is the entity relationship diagram (ERD) because it identifies the major pieces of information and their relationships. In summary, you identify the entities, the relationships between them, and the cardinality. For example, the slide depicts the relationship between a student and the student access facts. STUDENT is an example of an entity for RISD. In the diagram, entities are represented as boxes with rounded corners. A line that joins two entity boxes together illustrates a relationship between the entities. There are two relationship aspects that are depicted by the line: mandatory or optional associations and cardinality. Mandatory or optional associations: A solid line from an entity box means that an occurrence of that entity must be associated with an occurrence of the entity at the other end of the line. A broken line from an entity box means that an occurrence of the entity may be associated with an occurrence of the entity at the other end of the line. Each end of the line is given a title that describes the nature of the relationship for the entity at that end of the line. Oracle 10g: Data Warehousing Fundamentals

359 Data Warehouse Architecture
Define the technical architecture. Define the functional architecture. Decide which tools to use: For data modeling For database design For ETL For reporting, and so on Decide which Oracle database features to leverage: Partitioning Bitmap indexes, bitmap join indexes Materialized views Parallelism, and so on Data Warehouse Architecture Database read-operation requirements drive the technical architecture in a data warehouse environment. Write operations, which usually occur only during the load process, must also be considered, and some data access compromises might have to be made to accommodate database availability and data integrity during the update window. After the technical architecture is decided, determine the major components of the proposed business intelligence system architecture needed to support the business needs. That is, in addition to the database, decide how the reporting tools should be installed, how load balancing should be done between portal instances, and (in general) how many Application Server instances to use, and so on. The final selection should be based on several factors, including total cost of ownership, the ability to meet growth projections (scalability), and the ability to satisfy system requirements and constraints. Also, decide which tools to use for data modeling, database design, ETL, reporting, and so on. For example, RISD uses Designer for modeling and database design, Warehouse Builder for ETL, Discoverer for reporting, and Portal for publishing the reports. Another important consideration for data warehouse architecture is to identify which Oracle database features to leverage to gain maximum benefit from the data warehouse implementation. Note: Refer to the template M6_RISD_DW_Architecture_V3.0.doc for the architecture planning of RISD. Oracle 10g: Data Warehousing Fundamentals
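As an illustration of the database features listed above, the following sketch shows a range-partitioned fact table with a local bitmap index. The table, partition, and index names are hypothetical; the appropriate partitioning key and partition boundaries depend on the actual data volumes and load windows.

CREATE TABLE sales_fact (
  time_key     NUMBER NOT NULL,             -- for example, a YYYYMMDD key
  product_key  NUMBER NOT NULL,
  amount_sold  NUMBER(12,2)
)
PARTITION BY RANGE (time_key) (
  PARTITION p_2004 VALUES LESS THAN (20050101),
  PARTITION p_2005 VALUES LESS THAN (20060101),
  PARTITION p_max  VALUES LESS THAN (MAXVALUE)
);

CREATE BITMAP INDEX sales_fact_prod_bix
  ON sales_fact (product_key) LOCAL;        -- one index segment per partition

Partition pruning lets queries that filter on the partitioning key scan only the relevant partitions, and local bitmap indexes keep index maintenance confined to the partitions touched by each load.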

360 Oracle 10g: Data Warehousing Fundamentals 1 - 471
ETL Considerations Select the right ETL tools. Identify the extraction techniques to be used. Identify the types of data transformations needed, and also identify the techniques. Identify the tables to be loaded for the initial load. Load the dimension and fact tables. ETL Considerations ETL tools are the backbone of the data warehouse, moving data from a source to a transaction repository and on to data marts. They must deal with load performance for large volumes and with complex transformation of data in a repeatable, scheduled environment. These tools build the interfaces between components in the architecture and often work with data cleansing elements to ensure that the most accurate data is available. The need for a standard approach to ETL design within a project is paramount. Developers often create an intricate and complicated solution where a simple one, requiring little compromise, would do. Any compromise in the deliverable is usually accepted by the business once they understand that these simple approaches save a great deal of money in the time taken to design, develop, test, and ultimately support the solution. It is equally important to identify the extraction techniques and methods that are most suitable for your implementation. Similarly, it is important to decide which transformation techniques suit your data to provide better data quality. For example, pipelined data transformation is useful when you do not want an intermediate staging area, thereby facilitating quick transformations. Similarly, the MERGE statement is useful when you want to update existing rows and insert new rows in a target table from a source table in a single operation. Also, identify the fact and dimension tables to be loaded in the initial load. Note: ETL processes, techniques, and tools are discussed in detail in the lessons titled “The ETL Process: Extracting Data,” “The ETL Process: Transforming Data,” and “The ETL Process: Loading Warehouse Data” of this course. Oracle 10g: Data Warehousing Fundamentals
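As a concrete example of the MERGE technique mentioned above, the following is a minimal upsert sketch. The staging table, dimension table, and sequence names are hypothetical and used only for illustration.

MERGE INTO student_dim d
USING stg_students s
  ON (d.student_id = s.student_id)
WHEN MATCHED THEN
  UPDATE SET d.grade_level = s.grade_level,
             d.school_name = s.school_name
WHEN NOT MATCHED THEN
  INSERT (student_key, student_id, grade_level, school_name)
  VALUES (student_dim_seq.NEXTVAL, s.student_id, s.grade_level, s.school_name);

Rows from the staging table that already exist in the dimension are updated in place, and new rows are inserted, all in a single pass over the source data.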

361 Reporting Considerations
Identify the common reports to be generated by the system. Identify the reporting tool according to system requirements. How to improve query performance: Indexing Star query optimization Parallel query Summaries or materialized views Reporting Considerations The analysis and reporting tools are the main interface through which users interact with the system. Depending on the business querying requirements, you can classify these tools as simple reporting tools, complex ad hoc query tools, statistical and data mining packages, and tools that provide complex analytical capabilities. Oracle offers tools for the entire spectrum of business requirements. Identify the reporting tool that best fits the business query requirements of the system. Further, to improve query performance, you can employ Oracle database–supported techniques such as indexing, star query optimization, parallelism in queries through hints, materialized views for precomputed results, and so on. Note: A discussion of the various OracleBI tools and applications is presented in the lesson titled “Data Warehousing and Business Intelligence” of this course. Also, note that all the query performance techniques mentioned here are discussed in the respective lessons with examples. Oracle 10g: Data Warehousing Fundamentals
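The sketch below illustrates two of the techniques listed above: a summary materialized view with query rewrite enabled, and a parallel hint on a report query. The table names and the degree of parallelism are illustrative assumptions, not part of the RISD design.

CREATE MATERIALIZED VIEW sales_by_product_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT product_key,
       SUM(amount_sold) AS total_sold,
       COUNT(*)         AS row_count
FROM   sales_fact
GROUP BY product_key;

-- With query rewrite enabled, matching report queries can be answered
-- transparently from the precomputed summary instead of the detail table.

SELECT /*+ PARALLEL(f, 4) */ product_key, SUM(amount_sold)
FROM   sales_fact f
GROUP BY product_key;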

362 Security Considerations
How many levels of security need to be implemented? Database level: Role-based access Row-level and column-level security Fine-grained access Application level Portal level Security Considerations Because security plays a crucial role in data warehouses, it is very important to consider the layers of security that you need to implement. Decide the levels of security that you want to implement, and decide the database security technique: role-based access control, Virtual Private Database (VPD), or a combination of both. For example, the RISD system can be implemented with user role–based access combined with VPD. Some of the user roles can be: Public – achievement aggregate data only; School Board – achievement aggregate data only; District Level Assessment Team personnel – access to all data in the data warehouse with no restrictions, plus Discoverer Plus access; Central Staff (Superintendent, Central Administrators, and Researchers) – can see all aggregate and all detailed student data (current and former students), but only data that has been marked as “final”; and so on. Oracle 10g: Data Warehousing Fundamentals
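The following is a minimal VPD sketch that complements the role-based access described above. The schema, table, policy, and application-context names are hypothetical; a real implementation would also create and populate the application context at login time.

CREATE OR REPLACE FUNCTION school_filter (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
  -- Predicate appended to every SELECT against the protected table
  RETURN 'school_id = SYS_CONTEXT(''risd_ctx'', ''school_id'')';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'DW',
    object_name     => 'ASSESSMENT_FACT',
    policy_name     => 'school_row_filter',
    function_schema => 'DW',
    policy_function => 'SCHOOL_FILTER',
    statement_types => 'SELECT');
END;
/

With this policy in place, a school-level user querying the fact table sees only the rows for his or her own school, regardless of which reporting tool issues the query.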

363 Oracle 10g: Data Warehousing Fundamentals 1 - 474
Metadata Management Metadata capture Metadata sharing Metadata update Metadata publishing Metadata Management This is one of the important tasks of a data warehouse implementation and consists of the following subtasks: Metadata Capture In the data warehouse, data is stored in different components such as the staging area, the operational data store (ODS), and the target data warehouse. The models are designed by using Oracle Warehouse Builder (OWB) or other modeling tools. OWB can use those models as input and generate transformations based on these database objects. OWB can import the metadata from Designer through a metadata bridge. OWB can also capture metadata directly from the Oracle data dictionary and from other non-Oracle environments, if necessary. The ability to obtain metadata directly from non-Oracle applications may be useful for future extensibility. Metadata Update Changes to the data warehouse will happen almost as soon as it is built. OWB is designed to allow changes to the data warehouse structure. If an ETL process needs to change, the metadata for this process is updated in the OWB tool first, and then OWB generates new data definition language (DDL) and load scripts, or PL/SQL packages, as required. These replacement objects are then propagated to the data warehouse. Oracle 10g: Data Warehousing Fundamentals

364 Testing the Implementation
Develop test plans and strategies for: Unit testing System testing Integration (user acceptance testing) Develop test cases covering all important phases: For ETL (especially load) For reports For data access and security Testing the Implementation Typically, there are three phases to the testing process: unit test, system test, and integration (user acceptance) test. It is important to develop test cases for all the important phases of the data warehouse implementation, such as ETL, reporting, data access, and security. For example, for ETL in RISD, unit testing can start with a snapshot of data provided in the operational data store (ODS). A snapshot of the ODS data is copied to the data warehouse machine and used for unit testing. Some conditions will need to be added to the “live” data to ensure that all scenarios are addressed. For example, some records can be changed to have invalid RISD student IDs to make sure that the record causes the process to abort (a referential integrity check). Similarly, for testing the role-based VPD security for RISD, you need to develop test cases to verify that high-privileged and low-privileged users can each access only the data that pertains to them. You should also design a test case that logs in as a low-privileged user and attempts to access restricted data, to confirm that such access is denied. Note: Refer to M7_RISD_Final_ETL_Test_Results_V1.doc and M8_RISD_Reports_Final_Test_Results_V1.1.doc of the RISD case study. Oracle 10g: Data Warehousing Fundamentals
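As an example of a simple, repeatable test case for the referential integrity check mentioned above, the following query (with hypothetical table names) should return zero after a successful load; a nonzero count indicates orphan fact rows that reference student keys missing from the dimension.

SELECT COUNT(*) AS orphan_rows
FROM   assessment_fact f
WHERE  NOT EXISTS (SELECT 1
                   FROM   student_dim d
                   WHERE  d.student_key = f.student_key);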

365 Post Implementation Change Management
Requests for data warehouse enhancements are very frequent. Such user requests require the following: Make changes to the logical model. Propagate the metadata into OWB. Make corresponding changes to the physical database (add or drop objects). OWB integrates with the Change Management Pack from Oracle Enterprise Manager (OEM). It enables you to propagate incremental changes to the database (without dropping objects). Post Implementation Change Management Almost as soon as the data warehouse goes into production, enhancements will be requested by the users or necessitated by changes dictated by the state. This section walks you through the change management process for the data warehouse architecture and introduces the data warehouse upgrade features of OWB. Based on a change request made by the users, the warehouse development team may need to drop, reconfigure, rename, upgrade, or otherwise modify the data objects within the data warehouse. Changes usually begin with the logical data model. The metadata changes are then propagated into the OWB repository by using the respective metadata bridge. Next, the changes in the logical design have to be applied to the physical database instances. There are several ways to do this: In the traditional way, the DBA can manually create the scripts to CREATE, DROP, or ALTER objects, identify the affected objects, recompile the packages, grant the privileges, and re-create the synonyms. OWB 10g provides the data warehouse upgrade or drop functionality by integrating with the Change Management Pack from OEM. OWB also enables you to directly propagate incremental changes in your logical warehouse design to your physical instances, without having to drop objects or lose existing data. Oracle 10g: Data Warehousing Fundamentals

366 Some Useful Resources and White Papers
Many Oracle white papers and handouts are provided along with the course. For more information, refer to: /index.html For the entire range of BI product–based tutorials and hands-on exercises, refer to: Oracle 10g: Data Warehousing Fundamentals

367 Oracle 10g: Data Warehousing Fundamentals 1 - 480
Summary In this lesson, you should have learned how to: Describe the project management plan Specify the requirements for the implementation Describe the metadata repository, technical architecture, and other considerations Describe post implementation change management considerations Oracle 10g: Data Warehousing Fundamentals

368 Practice Solutions

369 Self-Guided Practices on AWM and OWB
Schedule: Timing Topic 50 minutes Practice 50 minutes Total

370 Oracle 10g: Data Warehousing Fundamentals 1 - 506
Practice Overview This practice consists of two parts based on: Analytical Workspace Manager and OracleBI Spreadsheet Add-In Start AWM and create a database connection. Explore the Sales_AW workspace. Identify the dimensions and cubes in the workspace. Start OracleBI Spreadsheet Add-In and create a simple worksheet. Oracle Warehouse Builder Create a dimension. Create a cube. Create mappings (in the SH_HANDSON project). Instructor Note The practices are based on the prebuilt COMPLETED_BI_DEMO and SH_HANDSON projects in OWB, and a prebuilt SH_AW_Tutor analytical workspace in AWM. Because this is a fundamentals course, the practices are designed to be simple and self-driven. More elaborate practices are included in the courses on OWB and AWM titled “Oracle Database 10g: Using OLAP” and “Oracle Warehouse Builder 10g R2: Implementation Part I and Part II”. Students can take these as follow-up courses. The practices do not include exercises on deploying in OWB because deployment may be a slightly complex process for the participants of a fundamentals course. However, the COMPLETED_BI_DEMO project is complete in all respects, and students can explore the project and view the data. Similarly, the SH_HANDSON project also has the cubes and dimensions already deployed so that students can view the data. Also, when you want to view data, OWB may display a warning saying “Connection Failed to location <Loc>” and will prompt you to edit the details for the database location. Click Yes, enter the appropriate password, and test the connection. This is an additional security mechanism introduced in Oracle Warehouse Builder 10g, Release 2. Oracle 10g: Data Warehousing Fundamentals

