1 Data Warehousing Michael, Ashley, Joshua, Nic

2  Brief History
 Data Warehouse Definitions: Bill Inmon vs. Ralph Kimball, Data Warehouse, Data Mart
 Data Warehouse Design: Conceptual Design, Logical Design, Physical Design
 Data Integrity: Database Level, ETL Level, Access Level
 Dimensions: Slowly Changing Dimensions, Dimensional Data Model
 Functions: OLAP, MOLAP, ROLAP, HOLAP
 DW Architecture: Factless Fact Table, Junk Dimension, Conformed Dimension
 Major DWs: Teradata, Oracle

3 Data Warehousing was conceptualized by Bill Inmon in the 1970s.
He founded the “Prism” company in the 1990s.
He developed the concept of the Corporate Information Factory.
He developed a model for centralizing the data.
Ralph Kimball developed the OLAP concepts in the 1990s.
He developed a more decentralized star schema model.
Data Warehousing today: the Internet plays a major role in data warehousing, and the government is very interested in monitoring this data.
http://www.dataversity.net/a-short-history-of-data-warehousing/

4 Bill Inmon defines it as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Ralph Kimball defines it as a copy of transaction data specifically structured for query and analysis.

5  A data mart is a repository of data that is designed to serve a particular community of knowledge workers.

6 Independent or stand-alone data marts are of marginal use. All enterprises require a means to store, analyze and interpret the data they generate.

7 The enterprise has one data warehouse, and data marts source their information from the data warehouse. - Bill Inmon
The data warehouse is the conglomerate of all data marts within the enterprise. - Ralph Kimball
https://www.1keydata.com/datawarehousing/what-is-olap.html

8 What is Conceptual Design?
A concise description of the users' data requirements without taking into account implementation details.
Advantages: conceptual models facilitate communication between users and designers, since they do not require knowledge about specific features of the underlying implementation platform.
A schema is composed of a set of dimensions and a set of facts.
A dimension is composed of either one level or one or more hierarchies.
file:///C:/Users/Michael%20Davis%20Jr/Downloads/9783642546549-c1.pdf

10 What is Logical Design?
 Conceptual and abstract.
 Snowflake schema: an arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape.
 Attribute: a component of an entity that helps define the uniqueness of the entity.
 Unique Identifier: something you add to tables so that you can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key.
https://docs.oracle.com/cd/B10501_01/server.920/a96520/logical.htm

12 What is Physical Design?
 It is the key to maximizing and speeding the return on investment from your data warehouse implementation.
 Implementation: results in an operational database.
 Prototypes: plan to build prototypes of the physical data model at regular intervals during the data warehouse design process.
 Test Environment: must have a database that reflects the production database.
 Star Schema: the most widely used schema for developing data warehouses and dimensional data marts.
file:///C:/Users/Michael%20Davis%20Jr/Downloads/DB2BP_Warehouse_Design_0912.pdf

14  Conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another)
 Logical data model (so we understand the details of our data without worrying about how they will actually be implemented)
 Physical data model (so we know exactly how to implement our data model in the database of choice)
https://www.1keydata.com/datawarehousing/data-modeling-levels.html

15  Data integrity refers to the validity of data, meaning data is consistent and correct.
 "Garbage in, garbage out."
 Without data integrity in the data warehouse, any resulting report and analysis will not be useful.

16  In a data warehouse or a data mart, there are three areas where data integrity should be enforced:
1. Database Level
2. ETL Process
3. Access Level

17  Common ways of enforcing data integrity include:
1. Referential integrity
 The relationship between the primary key of one table and the foreign key of another table must always be maintained.
 For example, a primary key cannot be deleted if there is still a foreign key that refers to this primary key.

18 2. Primary Key / Unique Constraint
 Primary keys and unique constraints are used to make sure every row in a table can be uniquely identified.
3. NOT NULL vs. NULL-able
 Columns identified as NOT NULL may not contain a NULL value.

19 4. Valid Values
 Only allowed values are permitted in the database.
 For example, if a column can only have positive integers, a value of '-1' cannot be allowed.
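
As a rough illustration of these four database-level checks, here is a minimal sketch using Python's built-in sqlite3 module. The dim_customer and fact_sales tables and their columns are made-up examples, not part of the original slides.

```python
import sqlite3

# In-memory database; the dim_customer / fact_sales tables are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when this is on

conn.execute("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,   -- 2. primary key / unique constraint
        customer_name TEXT NOT NULL          -- 3. NOT NULL column
    )""")
conn.execute("""
    CREATE TABLE fact_sales (
        customer_key INTEGER NOT NULL
            REFERENCES dim_customer(customer_key),             -- 1. referential integrity
        sale_amount  INTEGER NOT NULL CHECK (sale_amount > 0)  -- 4. valid values only
    )""")

conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO fact_sales VALUES (1, 500)")

# Each statement below violates one rule and raises sqlite3.IntegrityError.
bad_rows = [
    "INSERT INTO fact_sales VALUES (99, 500)",         # unknown customer_key (FK violation)
    "INSERT INTO dim_customer VALUES (1, 'Dup Inc')",  # duplicate primary key
    "INSERT INTO dim_customer VALUES (2, NULL)",       # NULL in a NOT NULL column
    "INSERT INTO fact_sales VALUES (1, -1)",           # fails the CHECK (valid values) rule
]
for stmt in bad_rows:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as err:
        print(f"rejected: {stmt!r} -> {err}")
```

The point of pushing these rules into the database layer is that bad rows are rejected at load time rather than discovered later in reports.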

20  The Extract, Transform and Load (ETL) layer.
 The most common checks include record counts or record sums.
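
A small sketch of the record count and record sum checks at the ETL layer; the source_rows / loaded_rows data and the reconcile helper are hypothetical.

```python
# Hypothetical example: reconcile row counts and an amount checksum
# between the extracted source rows and the rows loaded into the warehouse.
source_rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 75.5},
    {"order_id": 3, "amount": 310.0},
]
loaded_rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 75.5},
]  # one record was dropped somewhere in the ETL pipeline

def reconcile(source, target, sum_column):
    """Compare record counts and a column sum between two stages of the load."""
    checks = {
        "record_count": (len(source), len(target)),
        "record_sum":   (sum(r[sum_column] for r in source),
                         sum(r[sum_column] for r in target)),
    }
    for name, (expected, actual) in checks.items():
        status = "OK" if expected == actual else "MISMATCH"
        print(f"{name}: source={expected} target={actual} [{status}]")
    return all(expected == actual for expected, actual in checks.values())

if not reconcile(source_rows, loaded_rows, "amount"):
    print("Load failed reconciliation -- investigate before publishing the data.")
```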

21  Cleanse Data
 Filter Records
 Standardize Values
 Decode Values
 Apply Business Rules
 Householding
 Merge Records
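
A minimal sketch of a few of these cleansing steps (filtering, standardizing, decoding, and merging duplicate records); the customer rows and decode table are invented for illustration.

```python
# Hypothetical raw records pulled from a source system.
raw = [
    {"cust_id": "001", "state": "ny ", "gender": "M", "email": "a@x.com"},
    {"cust_id": "001", "state": "NY",  "gender": "M", "email": "a@x.com"},  # duplicate
    {"cust_id": "002", "state": "ca",  "gender": "F", "email": "b@y.com"},
]

GENDER_DECODE = {"M": "Male", "F": "Female"}  # lookup used to decode coded values

def cleanse(records):
    seen, out = set(), []
    for r in records:
        r = dict(r)
        r["state"] = r["state"].strip().upper()                   # standardize values
        r["gender"] = GENDER_DECODE.get(r["gender"], "Unknown")   # decode values
        if not r["email"]:                                        # filter records (business rule)
            continue
        if r["cust_id"] in seen:                                  # merge / de-duplicate records
            continue
        seen.add(r["cust_id"])
        out.append(r)
    return out

print(cleanse(raw))
```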

22  Ensures that data is not altered by any unauthorized means either during the ETL process or in the data warehouse  Data integrity can only be ensured if there is no unauthorized access to the data

23 Slowly Changing Dimension
 A common problem particular to data warehousing.
 Applies to cases where the attribute for a record varies over time.

24  There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.

25  Type 1
Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history.
Usage: About 50% of the time.
When to use Type 1: Type 1 slowly changing dimensions should be used when it is not necessary for the data warehouse to keep track of historical changes.

26  Type 2
Advantages: This allows us to accurately keep all historical information.
Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. This necessarily complicates the ETL process.
Usage: About 50% of the time.
When to use Type 2: Type 2 slowly changing dimensions should be used when it is necessary for the data warehouse to track historical changes.

27  Type 3
Advantages: This does not increase the size of the table, since new information is updated in place. This allows us to keep some part of history.
Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once.
Usage: Type 3 is rarely used in actual practice.
When to use Type 3: Type 3 slowly changing dimensions should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
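
To make the Type 2 behaviour concrete, here is a minimal sketch in Python; the customer dimension rows, column names, and the scd_type2_update helper are all hypothetical.

```python
from datetime import date

# Hypothetical customer dimension with Type 2 tracking columns.
dim_customer = [
    {"customer_key": 1, "customer_id": "C100", "city": "Chicago",
     "effective_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd_type2_update(dim, customer_id, changes, change_date):
    """Expire the current row and add a new row carrying the changed attributes."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["is_current"])
    current["end_date"] = change_date      # close out the old version
    current["is_current"] = False
    new_row = {**current, **changes,
               "customer_key": max(r["customer_key"] for r in dim) + 1,  # new surrogate key
               "effective_date": change_date, "end_date": None, "is_current": True}
    dim.append(new_row)

# The customer moves; history is preserved as two rows ("two people").
scd_type2_update(dim_customer, "C100", {"city": "Denver"}, date(2023, 6, 1))
for row in dim_customer:
    print(row)
```

This is exactly why the slide notes that Type 2 grows the table and complicates the ETL process: every tracked change adds a row and requires managing effective dates and surrogate keys.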

28  The dimensional data model is most often used in data warehousing systems.
 It is different from the 3rd normal form, commonly used for transactional (OLTP) type systems.
 The same data would be stored differently in a dimensional model than in a 3rd normal form model.

29  A logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access.
 Adheres to a discipline that uses the relational model with some important restrictions.
 Composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables.
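
A sketch of that structure as a small star schema, created through Python's sqlite3 for concreteness; the fact_sales table, its dimensions, and the measures are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
ddl = """
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date    TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, store_name   TEXT);

-- Fact table: a multi-part key made of foreign keys to the dimension tables,
-- plus numeric, additive measures.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    sales_amount REAL,
    PRIMARY KEY (date_key, product_key, store_key)
);
"""
conn.executescript(ddl)
print([row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```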

30  Four-Step Dimensional Design Process:
Step 1: Select the business process to model. The first step in converting an ER diagram to a set of DM diagrams is to separate the ER diagram into its discrete business processes and to model each one separately.
Step 2: Choose the grain of the business process. The grain is the fundamental atomic level of data to be represented in the fact table.

31  Four-Step Dimensional Design Process:
Step 3: Designate the fact tables. The third step is to select those many-to-many relationships in the ER model containing numeric and additive non-key facts and to designate them as fact tables.
Step 4: Choose the dimensions that will apply to each fact table record.

32 OLAP is used for Business Performance Management: planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting.

33 OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions.
It is a powerful technology for data discovery, with capabilities for limitless report viewing, complex analytical calculations, and predictive "what if" scenario (budget, forecast) planning.

34 OLAP performs multidimensional analysis of business data, provides the capability for complex calculations and trend analysis, and performs sophisticated data modelling.
https://www.1keydata.com/datawarehousing/what-is-olap.html

35 MOLAP: the traditional way of OLAP analysis.
Data is stored in a multidimensional cube.
The storage is not in the relational database, but in proprietary formats.

36 MOLAP advantages:
Excellent performance: fast data retrieval, optimal for slicing and dicing operations.
Can perform complex calculations: complex calculations return quickly.

37 ROLAP: manipulates the data stored in the relational database to give the appearance of OLAP's slicing and dicing functionality.
Each slicing and dicing action is equivalent to adding a "WHERE" clause to the SQL statement.
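
A sketch of that equivalence: the user's dimension filters are translated into a WHERE clause over a tiny, made-up star schema (the slice_and_dice helper and all table contents are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, region TEXT, sales_amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Electronics');
INSERT INTO fact_sales VALUES (1, 'East', 100), (2, 'East', 400), (2, 'West', 250);
""")

def slice_and_dice(filters):
    """Turn the user's dimension filters into a WHERE clause on the star schema."""
    where = " AND ".join(f"{col} = ?" for col in filters) or "1=1"
    sql = f"""
        SELECT d.category, SUM(f.sales_amount)
        FROM fact_sales f JOIN dim_product d ON f.product_key = d.product_key
        WHERE {where}
        GROUP BY d.category
    """
    return conn.execute(sql, list(filters.values())).fetchall()

# Slicing on the region dimension is just an extra predicate in the WHERE clause.
print(slice_and_dice({"f.region": "East"}))
print(slice_and_dice({"f.region": "East", "d.category": "Electronics"}))
```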

38 ROLAP advantages:
Can handle large amounts of data: there is no limitation on the amount of data.
Can leverage functionalities inherent in the relational database: often, the relational database already comes with a host of functionalities.

39 HOLAP technologies combine the advantages of MOLAP and ROLAP. It relies on cube technology for faster performance

40 The Factless Fact Table (Data Warehouse Architecture)
A factless fact table:
o Does not have any measures.
o Captures events that occur only at the information level and are not included at the calculation level.
o Captures many-to-many relationships between dimensions.
There are two types of factless fact tables:
o One is for capturing an event.
o The other is for describing conditions.
http://www.jamesserra.com/archive/2011/12/factless-fact-table/
http://dwhlaureate.blogspot.com/2012/08/factless-fact-table.html

41 Event Factless Fact Table
An event establishes the relationships among the dimension members from various dimensions. The existence of the relationship itself is the fact.
This type of fact table generates reports: you can count the number of occurrences with various criteria. All the queries are based on COUNT() with GROUP BY.
http://www.jamesserra.com/archive/2011/12/factless-fact-table/

42 Event Factless Fact Table example: a Student_Attendance fact table with the keys Student_ID, Course_ID, Instructor_ID, and Date, joined to Dim_Student, Dim_Course, Dim_Instructor, and Dim_Date.
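
A sketch of that attendance example and the COUNT()/GROUP BY reporting it supports, built in SQLite from Python; all table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE dim_course  (course_id  INTEGER PRIMARY KEY, course_name  TEXT);

-- Event factless fact table: only foreign keys, no measures.
CREATE TABLE fact_student_attendance (
    student_id INTEGER REFERENCES dim_student(student_id),
    course_id  INTEGER REFERENCES dim_course(course_id),
    class_date TEXT
);

INSERT INTO dim_student VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO dim_course  VALUES (10, 'Databases'), (20, 'Statistics');
INSERT INTO fact_student_attendance VALUES
    (1, 10, '2024-01-08'), (1, 10, '2024-01-15'),
    (2, 10, '2024-01-08'), (2, 20, '2024-01-09');
""")

# The fact is the existence of the event, so reporting is COUNT() with GROUP BY.
query = """
    SELECT c.course_name, COUNT(*) AS attendances
    FROM fact_student_attendance f
    JOIN dim_course c ON f.course_id = c.course_id
    GROUP BY c.course_name
"""
for course, attendances in conn.execute(query):
    print(course, attendances)
```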

43 Coverage Factless Fact Table
The “coverage table”:
o The term was coined by Ralph Kimball and is meant to describe conditions.
o It is used to support negative analysis reports.
http://www.jamesserra.com/archive/2011/12/factless-fact-table/

44 Coverage Factless Fact Table
This table shows us three things:
o Which products have promotions.
o Which products with promotions sold.
o Which products with promotions did not sell.
http://dwhlaureate.blogspot.com/2012/08/factless-fact-table.html
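
A sketch of the negative analysis such a coverage table enables, assuming a hypothetical fact_promotion_coverage table alongside a regular sales fact table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Coverage factless fact table: which products were on promotion in which store.
CREATE TABLE fact_promotion_coverage (product_id INTEGER, store_id INTEGER, promo_id INTEGER);
-- Regular sales fact table, which only has rows for products that actually sold.
CREATE TABLE fact_sales (product_id INTEGER, store_id INTEGER, sales_amount REAL);

INSERT INTO fact_promotion_coverage VALUES (1, 100, 7), (2, 100, 7), (3, 100, 7);
INSERT INTO fact_sales VALUES (1, 100, 50.0), (2, 100, 20.0);  -- product 3 never sold
""")

# Negative analysis: products that were on promotion but did not sell.
query = """
    SELECT p.product_id
    FROM fact_promotion_coverage p
    LEFT JOIN fact_sales s
           ON s.product_id = p.product_id AND s.store_id = p.store_id
    WHERE s.product_id IS NULL
"""
print(conn.execute(query).fetchall())   # -> [(3,)]
```

Without the coverage table there would be no row at all for the unsold promoted product, so the "did not sell" question could not be answered from the sales fact table alone.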

45 Junk Dimension: Introducing OLTP
OLTP (On-line Transaction Processing):
o It is characterized by a large number of short on-line transactions.
o The main emphasis is put on:
 Very fast query processing.
 Maintaining data integrity in multi-access environments.
 Measuring effectiveness based on the number of transactions per second.
http://dwhlaureate.blogspot.com/2012/08/junk-dimension.html

46 Junk Dimension
The junk dimension:
o It is a collection of random things that are unrelated to any particular dimension. OLTP tables contain many of these random things.
o This random data does not integrate easily into conventional variables.
o A couple of bad alternatives exist when trying to put the more relevant junk data into conventional variables.
http://dwhlaureate.blogspot.com/2012/08/junk-dimension.html

47 Junk Dimension—Bad Alternatives
Create a new dimension for each of the remaining attributes.
o Why it is a bad idea: it could be necessary to create multitudes of new dimensions, resulting in a fact table with a very large number of foreign keys.
Leave the remaining attributes in the fact table.
o Why this is a bad idea: it could make the row length of the table unnecessarily large if, for example, the attribute is a long text string.
https://en.wikipedia.org/wiki/Dimension_(data_warehouse)

48 Junk Dimension—The Solution
Identify all the attributes and then put them into one or several junk dimensions.
One junk dimension can hold several indicators that have no correlation with each other.
The designer can choose to build the dimension table so that it ends up holding all the combinations of indicators that occur.
This solution is appropriate in situations where the designer expects to encounter a lot of different combinations.
https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
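
A minimal sketch of building one junk dimension by pre-populating every combination of a few uncorrelated indicators; the flag names and values are made up for illustration.

```python
from itertools import product

# Hypothetical low-cardinality indicators that would otherwise clutter the fact table.
indicators = {
    "payment_type": ["cash", "card"],
    "is_gift_wrapped": ["Y", "N"],
    "order_channel": ["web", "store", "phone"],
}

# Build the junk dimension: one row per combination, keyed by a surrogate key.
dim_junk = []
for key, combo in enumerate(product(*indicators.values()), start=1):
    row = dict(zip(indicators.keys(), combo))
    row["junk_key"] = key
    dim_junk.append(row)

# Lookup used by the ETL to replace the raw flags with a single foreign key.
lookup = {tuple(r[col] for col in indicators): r["junk_key"] for r in dim_junk}
print(len(dim_junk))                 # 2 * 2 * 3 = 12 rows
print(lookup[("card", "N", "web")])  # the junk_key stored on the fact row
```

The fact table then carries one junk_key column instead of three separate flag columns or three extra dimensions.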

49 Junk Dimension—Closing Notes
Junk dimensions are also appropriate for placing attributes like non-generic comments from the fact table.
The junk dimension should contain a single row representing the blanks, with a surrogate key that will be used in the fact table for every row that has a blank comment field.
https://en.wikipedia.org/wiki/Dimension_(data_warehouse)

50 Conformed Dimension
A conformed dimension is a dimension that has the same values for all areas of the business.
The dimension that lends itself most readily to this concept is the date dimension.
http://smdbi.blogspot.com/2009/05/what-is-conformed-dimension.html
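
A sketch of generating such a shared date dimension once, so that every data mart joins to the same rows; the column choices are illustrative assumptions.

```python
from datetime import date, timedelta

def build_dim_date(start, end):
    """One row per calendar day, keyed by a YYYYMMDD surrogate, shared by all marts."""
    rows, current = [], start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.strftime("%A"),
            "is_weekend": current.weekday() >= 5,
        })
        current += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 1, 7))
print(dim_date[0])
```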

51 Conformed Dimension—Nonconformist
There are two courses of action to take when non-conformed dimensions interact with conformed dimensions:
o Create a custom dimension, which results in a non-conformed dimension.
o Add the additional information into the conformed dimension.
http://smdbi.blogspot.com/2009/05/what-is-conformed-dimension.html

52 Conformed Dimension—Nonconformist
Customer dimensions are a little more difficult to conform. This is why a good business analyst needs to work with an experienced data modeler.
http://smdbi.blogspot.com/2009/05/what-is-conformed-dimension.html

53 Conformed Dimension—Nonconformist
Creating the integration processes presents the largest challenge.
Some dimension values remain static; other dimension values change over time. Some changes are slow, while others are rapid.
http://smdbi.blogspot.com/2009/05/what-is-conformed-dimension.html

54  Teradata  Oracle

55 Teradata
Teradata is a data warehousing company that has been around for 35 years. Its warehousing benefits include:
o Providing a complete view of an organization's business in support of decision-making processes.
o Providing one detailed view of the customer from multiple data sources.
o Providing advanced in-database analytics and intelligent in-memory processing.
o Working with Teradata QueryGrid to give organizations a gateway to actionable big data insights.
o Eliminating the time required to reconcile conflicting data between individuals and teams.
o Possessing an optimized business analytic engine that provides great business value at the lowest cost.
http://www.teradata.com/products-and-services/Data-Warehouse-Overview/?LangType=1033&LangSelect=true
http://www.teradata.com/about-us/?LangType=1033&LangSelect=true

56 Oracle
Product areas: Cloud Solutions, Database, Middleware, Applications, Industries, Engineered Systems, Servers, Storage, Services, and Oracle Accelerate for Midsize Companies.
http://www.oracle.com/us/corporate/oracle-fact-sheet-079219.pdf

57 Any Questions?

