CHAPTER OBJECTIVE: NORMALIZATION AND THE SNOWFLAKE SCHEMA
Normalization
In creating a database, normalization is the process of organizing it into tables in such a way that the results of using the database are always unambiguous and as intended; it usually means dividing large tables into smaller ones that are easier to maintain. The process of making your data and tables match these standards is called normalizing data, or data normalization. Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: 1- eliminating redundant data and 2- ensuring that data dependencies make sense. Both are worthy goals, as they reduce the amount of space a database consumes and ensure that data is stored logically.
A simple example of normalizing data might consist of a table showing:

Customer   Item purchased   Purchase price
Thomas     Shirt            $40
Maria      Tennis shoes     $35
Evelyn     Shirt            $40
Pajaro     Trousers         $25

If this table is used to keep track of the prices of items and you delete one of the customers, you also delete a price. Normalizing the data means recognizing this and solving the problem by dividing the table into two tables: one with information about each customer and the product they bought, and a second with each product and its price.
Normalization degrees:
First normal form (1NF). This is the "basic" level of normalization and generally corresponds to the definition of any database: it contains two-dimensional tables with rows and columns. Each column corresponds to a sub-object or an attribute of the object represented by the entire table. Each row represents a unique instance of that sub-object or attribute and must differ in some way from every other row (that is, no duplicate rows are possible). All entries in any column must be of the same kind. For example, in the column labeled "Customer," only customer names or numbers are permitted.
Second normal form (2NF). At this level of normalization, every column that is not part of a key must be a function of the table's entire primary key, not just part of it. For example, in a table with three columns containing customer ID, product sold, and price of the product when sold, the price is a function of both the customer ID (a customer may be entitled to a discount) and the specific product.
Third normal form (3NF). At second normal form, anomalies are still possible, because a change to one row in a table may affect data that other rows depend on. For example, using the customer table just cited, removing a row describing a customer purchase (because of a return, perhaps) also removes the fact that the product has a certain price. In third normal form, such a table is divided into two tables, so that product pricing is tracked separately.
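The split described above can be sketched concretely. Here is a minimal example using Python's built-in sqlite3 module (table and column names are illustrative, not from any real system): the flat customer/item/price table is divided so that deleting a purchase no longer deletes a price.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized: the price travels with every purchase row, so deleting
# Pajaro's row would also delete the only record of the trouser price.
cur.execute("CREATE TABLE purchase_flat (customer TEXT, item TEXT, price INTEGER)")
cur.executemany("INSERT INTO purchase_flat VALUES (?, ?, ?)", [
    ("Thomas", "Shirt", 40), ("Maria", "Tennis shoes", 35),
    ("Evelyn", "Shirt", 40), ("Pajaro", "Trousers", 25),
])

# Normalized: one table tracks product prices, another tracks purchases.
cur.execute("CREATE TABLE product (item TEXT PRIMARY KEY, price INTEGER)")
cur.execute("CREATE TABLE purchase (customer TEXT, item TEXT REFERENCES product(item))")
cur.execute("INSERT INTO product SELECT DISTINCT item, price FROM purchase_flat")
cur.execute("INSERT INTO purchase SELECT customer, item FROM purchase_flat")

# Deleting Pajaro's purchase no longer loses the trouser price.
cur.execute("DELETE FROM purchase WHERE customer = 'Pajaro'")
price = cur.execute("SELECT price FROM product WHERE item = 'Trousers'").fetchone()[0]
print(price)  # → 25: the price survives in the product table
```

After the split, prices live in exactly one place, which is precisely the redundancy-elimination goal stated above.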
Snowflake Schema
The snowflake schema is an extension of the star schema in which each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema that dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy. A snowflake schema therefore consists of a fact table surrounded by multiple dimension tables, which can in turn be connected to other dimension tables via many-to-one relationships. Instead of a few big dimension tables connected to a fact table, we have groups of smaller dimension tables. The snowflake schema helps save storage; however, it increases the number of dimension tables.
Snowflake schema advantages:
A snowflake schema is designed from a star schema by further normalizing the dimension tables to eliminate data redundancy. Normalizing the dimension tables yields small savings in storage space, and the normalized structures are easier to update and maintain.
Snowflake schema disadvantages:
Normalizing the dimension tables increases the number of dimension and sub-dimension tables, which requires more foreign key joins when querying the data and therefore reduces query performance. Queries against a snowflake schema are also more complex than queries against a star schema because of the multiple joins from dimension tables to sub-dimension tables. It is harder for business users of a data warehouse built on a snowflake schema, because they have to work with more tables than in a star schema.
Snowflake schema example

Figure – Snowflake schema example
Let's examine the snowflake schema above in greater detail:
- The DIM_STORE dimension table is normalized, adding one more dimension table, DIM_GEOGRAPHY.
- The DIM_PRODUCT dimension table is normalized, adding two more dimension tables, DIM_BRAND and DIM_PRODUCT_CATEGORY.
- The DIM_DATE dimension table now connects to three other dimension tables: DIM_DAY_OF_WEEK, DIM_MONTH and DIM_QUARTER.
- The fact table remains the same as in the star schema.
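That layout can be written down as DDL. The following sketch uses Python's sqlite3 module; the dimension table names come from the example above, while the fact table name (FACT_SALES) and all column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_GEOGRAPHY (geography_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE DIM_STORE (
    store_id INTEGER PRIMARY KEY,
    store_name TEXT,
    geography_id INTEGER REFERENCES DIM_GEOGRAPHY(geography_id)
);
CREATE TABLE DIM_BRAND (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE DIM_PRODUCT_CATEGORY (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE DIM_PRODUCT (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id INTEGER REFERENCES DIM_BRAND(brand_id),
    category_id INTEGER REFERENCES DIM_PRODUCT_CATEGORY(category_id)
);
CREATE TABLE DIM_DAY_OF_WEEK (day_id INTEGER PRIMARY KEY, day_name TEXT);
CREATE TABLE DIM_MONTH (month_id INTEGER PRIMARY KEY, month_name TEXT);
CREATE TABLE DIM_QUARTER (quarter_id INTEGER PRIMARY KEY, quarter_name TEXT);
CREATE TABLE DIM_DATE (
    date_id INTEGER PRIMARY KEY,
    day_id INTEGER REFERENCES DIM_DAY_OF_WEEK(day_id),
    month_id INTEGER REFERENCES DIM_MONTH(month_id),
    quarter_id INTEGER REFERENCES DIM_QUARTER(quarter_id)
);
-- The fact table is unchanged from the star schema: it still references
-- only the first-level dimension tables.
CREATE TABLE FACT_SALES (
    store_id INTEGER REFERENCES DIM_STORE(store_id),
    product_id INTEGER REFERENCES DIM_PRODUCT(product_id),
    date_id INTEGER REFERENCES DIM_DATE(date_id),
    units_sold INTEGER,
    revenue REAL
);
""")
```

Note how the many-to-one relationships run outward from the fact table: DIM_STORE references DIM_GEOGRAPHY, DIM_PRODUCT references DIM_BRAND and DIM_PRODUCT_CATEGORY, and so on.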
Star Schema vs. Snowflake Schema

Understandability: The star schema is easier for business users and analysts to query. The snowflake schema may be more difficult because of the number of tables users have to deal with.

Dimension tables: The star schema has only one dimension table for each dimension, grouping related attributes; its dimension tables are not in third normal form. The snowflake schema may have more than one dimension table per dimension because each dimension table is further normalized.

Query complexity: Star schema queries are simple and easy to understand. Snowflake schema queries are more complex due to the multiple foreign key joins between dimension tables.

Query performance: The star schema performs well; the database engine can optimize and boost query performance based on its predictable structure. The snowflake schema has more foreign key joins and therefore longer query execution times.

When to use: Use a star schema when the dimension tables store a relatively small number of rows and space is not a big issue. Choose a snowflake schema when the dimension tables store a large number of rows with redundant data and space is an issue.

Foreign key joins: The star schema needs fewer joins; the snowflake schema needs a higher number of joins.

Data warehouse system: The star schema works well in any data warehouse or data mart. The snowflake schema is better suited to small data warehouses and data marts.
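The join-count difference in the comparison above can be demonstrated with a tiny example using Python's sqlite3 module (all table and column names are hypothetical): the star query reaches the category name in one join, while the snowflake query needs an extra join through the lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the denormalized dimension carries the category name directly.
CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY, category_name TEXT);
-- Snowflake: the category lives in its own lookup table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (product_id INTEGER PRIMARY KEY,
                               category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);
INSERT INTO dim_product_star VALUES (1, 'Apparel');
INSERT INTO dim_category VALUES (10, 'Apparel');
INSERT INTO dim_product_snow VALUES (1, 10);
INSERT INTO fact_sales VALUES (1, 40.0), (1, 35.0);
""")

# Star schema: one join from the fact table to the dimension.
star = conn.execute("""
    SELECT d.category_name, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product_star d USING (product_id)
    GROUP BY d.category_name
""").fetchone()

# Snowflake schema: an extra join to reach the category lookup table.
snow = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product_snow d USING (product_id)
    JOIN dim_category c USING (category_id)
    GROUP BY c.category_name
""").fetchone()

print(star, snow)  # both queries return ('Apparel', 75.0)
```

Both queries produce identical results; the snowflake version simply pays for one more join, which is the performance trade-off the table describes.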
1. Data optimization: The snowflake model uses normalized data; that is, the data is organized inside the database so as to eliminate redundancy, which helps reduce the amount of data stored. The business hierarchy and its dimensions are preserved in the data model through referential integrity.

Figure 1 – Snowflake model
The star model, on the other hand, uses de-normalized data. In the star model, dimensions refer directly to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.

Figure 2 – Star model
2. Business model: A primary key is a single unique key (data attribute) selected to identify a particular record. In the previous advertiser example, the Advertiser_ID would be the primary key (business key) of a dimension table. A foreign key (referential attribute) is simply a field in one table that matches the primary key of another dimension table. In our example, Advertiser_ID could be a foreign key in Account_dimension. In the snowflake model, the business hierarchy of the data model is represented by primary-key/foreign-key relationships between the various dimension tables. In the star model, the fact table holds the foreign keys to all required dimension tables.
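A minimal sketch of this primary-key/foreign-key relationship, again with sqlite3 (column names other than Advertiser_ID are assumed): with foreign key enforcement enabled, an account row cannot reference an advertiser that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Advertiser_dimension (
    Advertiser_ID INTEGER PRIMARY KEY,   -- primary (business) key
    Advertiser_Name TEXT
);
CREATE TABLE Account_dimension (
    Account_ID INTEGER PRIMARY KEY,
    Advertiser_ID INTEGER REFERENCES Advertiser_dimension(Advertiser_ID)
);
""")
conn.execute("INSERT INTO Advertiser_dimension VALUES (1, 'Acme Ads')")
conn.execute("INSERT INTO Account_dimension VALUES (100, 1)")   # valid reference

# Inserting an account for a nonexistent advertiser violates the foreign key.
fk_error = None
try:
    conn.execute("INSERT INTO Account_dimension VALUES (101, 99)")
except sqlite3.IntegrityError as exc:
    fk_error = exc
print("rejected:", fk_error)
```

This is exactly the referential integrity that, in the snowflake model, preserves the business hierarchy between dimension tables.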
3. Performance: The third differentiator in this star schema vs. snowflake schema face-off is the performance of the two models. The snowflake model has a higher number of joins between the dimension tables and then the fact table, so performance is slower. For instance, if you want the advertiser details, such as the advertiser name, ID and address, the advertiser and account tables must be joined with each other and then joined with the fact table. The star model, on the other hand, has fewer joins between the dimension tables and the fact table; in this model, if you need information on the advertiser, you just join the advertiser dimension table with the fact table.
4. ETL: The snowflake model loads the data marts with dependencies between dimension tables, and hence the ETL job is more complex in design and cannot be parallelized, because the dependency model restricts it. The star model loads each dimension table without dependencies between dimensions, so the ETL job is simpler and can achieve higher parallelism.

Extract, Transform, Load (ETL)
In managing databases, extract, transform, load (ETL) refers to three separate functions combined into a single programming tool. The extract function reads data from a specified source database and extracts a desired subset of data. The transform function works with the acquired data, using rules or lookup tables, or creating combinations with other data, to convert it to the desired state. The load function writes the resulting data (either all of the subset or just the changes) to a target database, which may or may not previously exist.
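The three-step definition above can be sketched in a few lines of Python with sqlite3. This is a toy example, not a production ETL tool: the source table (raw_sales), target table (sales), and the transformation rules are all assumptions for illustration.

```python
import sqlite3

# Hypothetical source and target databases (both in-memory for the sketch).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
CREATE TABLE raw_sales (customer TEXT, item TEXT, price_cents INTEGER);
INSERT INTO raw_sales VALUES ('thomas', 'Shirt', 4000), ('maria', 'Tennis shoes', 3500);
""")
target.execute("CREATE TABLE sales (customer TEXT, item TEXT, price_dollars REAL)")

# Extract: read the desired subset of data from the source database.
rows = source.execute("SELECT customer, item, price_cents FROM raw_sales").fetchall()

# Transform: apply rules to convert the data to the desired state
# (here: capitalize customer names and convert cents to dollars).
transformed = [(c.capitalize(), i, cents / 100.0) for c, i, cents in rows]

# Load: write the resulting data to the target database.
target.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
print(target.execute("SELECT * FROM sales").fetchall())
# → [('Thomas', 'Shirt', 40.0), ('Maria', 'Tennis shoes', 35.0)]
```

In a snowflake-model warehouse, the load step would additionally have to populate the lookup tables before the dimension tables that reference them, which is the dependency that limits parallelism.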