Normalization

The process of making your data and tables match these standards is called normalizing data, or data normalization. Normalization is the process of efficiently organizing data in a database.

There are two goals of the normalization process:
1- eliminating redundant data
2- ensuring data dependencies make sense.

Both of these are worthy goals, as they reduce the amount of space a database consumes and ensure that data is logically stored.

In creating a database, normalization is the process of organizing it into tables in such a way that the results of using the database are always unambiguous and as intended. This usually means dividing large tables into smaller ones that are easier to maintain.
A simple example of normalizing data might consist of a table showing:

Customer         Item purchased    Purchase price
Thomas           Shirt             $40
Maria            Tennis shoes      $35
Evelyn Pajaro    Trousers          $25

If this table is used for the purpose of keeping track of the price of items and you want to delete one of the customers, you will also delete a price.

Normalizing the data would mean understanding this and solving the problem by dividing this table into two tables: one with information about each customer and a product they bought, and a second with each product and its price.
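The deletion problem described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (purchases_flat, products, purchases) are hypothetical, and only the Thomas and Maria rows are used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized (hypothetical name): one table mixes customers, items, and prices.
cur.execute("CREATE TABLE purchases_flat (customer TEXT, item TEXT, price INTEGER)")
cur.executemany(
    "INSERT INTO purchases_flat VALUES (?, ?, ?)",
    [("Thomas", "Shirt", 40), ("Maria", "Tennis shoes", 35)],
)

# Deleting Maria also deletes the only record of the tennis shoe price.
cur.execute("DELETE FROM purchases_flat WHERE customer = 'Maria'")
flat_price = cur.execute(
    "SELECT price FROM purchases_flat WHERE item = 'Tennis shoes'"
).fetchone()

# Normalized (hypothetical names): prices live in their own table.
cur.execute("CREATE TABLE products (item TEXT PRIMARY KEY, price INTEGER)")
cur.execute("CREATE TABLE purchases (customer TEXT, item TEXT REFERENCES products(item))")
cur.executemany("INSERT INTO products VALUES (?, ?)", [("Shirt", 40), ("Tennis shoes", 35)])
cur.executemany("INSERT INTO purchases VALUES (?, ?)", [("Thomas", "Shirt"), ("Maria", "Tennis shoes")])

# Deleting Maria's purchase now leaves the product price intact.
cur.execute("DELETE FROM purchases WHERE customer = 'Maria'")
normalized_price = cur.execute(
    "SELECT price FROM products WHERE item = 'Tennis shoes'"
).fetchone()

print(flat_price)        # None: the price was lost with the customer row
print(normalized_price)  # (35,): the price survives
```

In the flat table the price disappears together with the customer, while in the two-table design the price remains available.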
Normalization degrees

First normal form (1NF). This is the "basic" level of normalization and generally corresponds to the definition of any database:
- It contains two-dimensional tables with rows and columns.
- Each column corresponds to a sub-object or an attribute of the object represented by the entire table.
- Each row represents a unique instance of that sub-object or attribute and must be different in some way from any other row (that is, no duplicate rows are possible).
- All entries in any column must be of the same kind. For example, in the column labeled "Customer," only customer names or numbers are permitted.
Second normal form (2NF). At this level of normalization, each column in a table that is not a determiner of the contents of another column must itself be a function of the other columns in the table. For example, in a table with three columns containing customer ID, product sold, and price of the product when sold, the price would be a function of the customer ID (entitled to a discount) and the specific product.
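The idea that one column "is a function of" other columns can be checked mechanically. The following is a minimal sketch, not a full 2NF test: the helper name is_function_of and the sample rows (including the discounted price) are hypothetical.

```python
def is_function_of(rows, determinant, dependent):
    """Return True if `dependent` has exactly one value per `determinant` combination."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sales rows: customer 2 is entitled to a discount on shirts.
sales = [
    {"customer_id": 1, "product": "Shirt", "price": 40},
    {"customer_id": 2, "product": "Shirt", "price": 36},
    {"customer_id": 1, "product": "Trousers", "price": 25},
]

price_by_both = is_function_of(sales, ["customer_id", "product"], "price")
price_by_product = is_function_of(sales, ["product"], "price")

print(price_by_both)     # True: customer and product together determine the price
print(price_by_product)  # False: "Shirt" appears with two different prices
```

A check like this only inspects the rows at hand; whether a dependency holds in general is a statement about the business rules, not about one data sample.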
Third normal form (3NF). At the second normal form, modification problems are still possible because a change to one row in a table may affect data that refers to this information from another table. For example, using the customer table just cited, removing a row describing a customer purchase (because of a return, perhaps) will also remove the fact that the product has a certain price. In the third normal form, this table would be divided into two tables so that product pricing would be tracked separately.
Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema, that dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

A snowflake schema consists of a fact table surrounded by multiple dimension tables, which can be connected to other dimension tables via many-to-one relationships.

Normalizing the dimension tables tends to increase the number of dimension tables or sub-dimension tables. These require more foreign key joins when querying the data and therefore reduce query performance. A query against a snowflake schema is more complex than a query against a star schema because of the multiple joins from dimension tables to sub-dimension tables.

Therefore, in a snowflake schema, instead of having big dimension tables connected to a fact table, we have a group of multiple dimension tables. The snowflake schema helps save storage, but it increases the number of dimension tables.
Snowflake schema advantages:
- The snowflake schema is designed from the star schema by further normalizing dimension tables to eliminate data redundancy.
- Normalizing the dimension tables gives small savings in storage space.
- Normalized structures are easier to update and maintain.

Snowflake schema disadvantages:
- Normalizing the dimension tables tends to increase the number of dimension tables or sub-dimension tables. These require more foreign key joins when querying the data and therefore reduce query performance.
- A query against a snowflake schema is more complex than a query against a star schema because of the multiple joins from dimension tables to sub-dimension tables.
- A data warehouse that uses a snowflake schema is more difficult for business users, because they have to work with more tables than in a star schema.
Let's examine the snowflake schema above in greater detail:
- The DIM_STORE dimension table is normalized to add one more dimension table called DIM_GEOGRAPHY.
- The DIM_PRODUCT dimension table is normalized to add 2 more dimension tables called DIM_BRAND and DIM_PRODUCT_CATEGORY.
- The DIM_DATE dimension table is now connected to three other dimension tables: DIM_DAY_OF_WEEK, DIM_MONTH and DIM_QUARTER.
- The fact table remains the same as in the star schema.
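Part of the schema described above can be sketched with Python's sqlite3 module. Only the table names DIM_STORE, DIM_GEOGRAPHY, DIM_PRODUCT and DIM_BRAND come from the slide; the fact table name FACT_SALES, every column name, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension tables produced by normalizing the dimensions (columns are assumed).
CREATE TABLE DIM_GEOGRAPHY (geography_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE DIM_BRAND (brand_id INTEGER PRIMARY KEY, brand_name TEXT);

-- Dimension tables now point to their sub-dimensions (many-to-one).
CREATE TABLE DIM_STORE (
    store_id INTEGER PRIMARY KEY,
    store_name TEXT,
    geography_id INTEGER REFERENCES DIM_GEOGRAPHY(geography_id)
);
CREATE TABLE DIM_PRODUCT (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id INTEGER REFERENCES DIM_BRAND(brand_id)
);

-- Hypothetical fact table: unchanged compared with a star schema.
CREATE TABLE FACT_SALES (
    store_id INTEGER REFERENCES DIM_STORE(store_id),
    product_id INTEGER REFERENCES DIM_PRODUCT(product_id),
    amount INTEGER
);

INSERT INTO DIM_GEOGRAPHY VALUES (1, 'Canada'), (2, 'France');
INSERT INTO DIM_BRAND VALUES (100, 'Brand X');
INSERT INTO DIM_STORE VALUES (10, 'Store A', 1), (11, 'Store B', 2);
INSERT INTO DIM_PRODUCT VALUES (1000, 'Shirt', 100);
INSERT INTO FACT_SALES VALUES (10, 1000, 40), (10, 1000, 40), (11, 1000, 40);
""")

# Reaching country and brand takes four joins: two to the dimensions,
# two more from the dimensions to their sub-dimensions.
sales_by_country = conn.execute("""
    SELECT g.country, b.brand_name, SUM(f.amount)
    FROM FACT_SALES f
    JOIN DIM_STORE s ON s.store_id = f.store_id
    JOIN DIM_GEOGRAPHY g ON g.geography_id = s.geography_id
    JOIN DIM_PRODUCT p ON p.product_id = f.product_id
    JOIN DIM_BRAND b ON b.brand_id = p.brand_id
    GROUP BY g.country, b.brand_name
    ORDER BY g.country
""").fetchall()

print(sales_by_country)  # [('Canada', 'Brand X', 80), ('France', 'Brand X', 40)]
```

In a star schema the country and brand would sit directly in DIM_STORE and DIM_PRODUCT, so the same question would need only two joins.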
Star schema vs. Snowflake schema

Understandability
- Star schema: Easier for business users and analysts to query data.
- Snowflake schema: May be more difficult for business users and analysts because of the number of tables they have to deal with.

Dimension table
- Star schema: Only one dimension table for each dimension, which groups related attributes. Dimension tables are not in third normal form.
- Snowflake schema: May have more than one dimension table for each dimension because of the further normalization of each dimension table.

Query complexity
- Star schema: The query is very simple and easy to understand.
- Snowflake schema: More complex query because of multiple foreign key joins between dimension tables.

Query performance
- Star schema: High performance. The database engine can optimize and boost query performance based on a predictable framework.
- Snowflake schema: More foreign key joins, and therefore longer query execution time compared with the star schema.

When to use
- Star schema: When dimension tables store a relatively small number of rows and space is not a big issue.
- Snowflake schema: When dimension tables store a large number of rows with redundant data and space is an issue, the snowflake schema can be chosen to save space.

Foreign key joins
- Star schema: Fewer joins.
- Snowflake schema: Higher number of joins.

Data warehouse system
- Star schema: Works best in any data warehouse / data mart.
- Snowflake schema: Better for a small data warehouse / data mart.
1. Data optimization:

The snowflake model uses normalized data, i.e. the data is organized inside the database in order to eliminate redundancy, which helps to reduce the amount of data. The hierarchy of the business and its dimensions are preserved in the data model through referential integrity.

Figure 1: Snowflake model
The star model, on the other hand, uses de-normalized data. In the star model, dimensions directly refer to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.

Figure 2: Star model
2. Business model:

A primary key is a single unique key (data attribute) that is selected to identify a particular piece of data. In the previous 'advertiser' example, the Advertiser_ID will be the primary key (business key) of a dimension table. A foreign key (referential attribute) is just a field in one table that matches the primary key of another dimension table. In our example, the Advertiser_ID could be a foreign key in Account_dimension.

In the snowflake model, the business hierarchy of the data model is represented by primary key / foreign key relationships between the various dimension tables.

In the star model, all required dimension tables are referenced only by foreign keys in the fact table.
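The primary key / foreign key relationship described above can be sketched with Python's sqlite3 module. Advertiser_ID and Account_dimension come from the slide; the table name Advertiser_dimension, the other columns, and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Advertiser_ID is the primary key of the (assumed) advertiser dimension.
conn.execute(
    "CREATE TABLE Advertiser_dimension ("
    "Advertiser_ID INTEGER PRIMARY KEY, Advertiser_Name TEXT)"
)
# In Account_dimension the same column acts as a foreign key.
conn.execute(
    "CREATE TABLE Account_dimension ("
    "Account_ID INTEGER PRIMARY KEY, "
    "Advertiser_ID INTEGER NOT NULL REFERENCES Advertiser_dimension(Advertiser_ID))"
)

conn.execute("INSERT INTO Advertiser_dimension VALUES (1, 'Acme')")
conn.execute("INSERT INTO Account_dimension VALUES (500, 1)")  # valid: advertiser 1 exists

try:
    # Advertiser 99 does not exist, so referential integrity rejects this row.
    conn.execute("INSERT INTO Account_dimension VALUES (501, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

account_count = conn.execute("SELECT COUNT(*) FROM Account_dimension").fetchone()[0]
print(rejected, account_count)  # True 1
```

This is the mechanism by which a snowflake model keeps the business hierarchy consistent between dimension tables.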
3. Performance:

The third differentiator in this star schema vs. snowflake schema face-off is the performance of these models.

The snowflake model has a higher number of joins between the dimension tables and then again with the fact table, and hence its performance is slower. For instance, if you want to know the advertiser details, this model needs information such as the advertiser name, ID and address, for which the advertiser and account tables need to be joined with each other and then joined with the fact table.

The star model, on the other hand, has fewer joins between the dimension tables and the fact table. In this model, if you need information on the advertiser, you just have to join the advertiser dimension table with the fact table.
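The difference in join count can be illustrated with a small sqlite3 sketch. All table names, the clicks measure, and the sample rows are hypothetical; the sketch compares query shape only and does not measure execution time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake style: the fact table reaches the advertiser through the account table.
CREATE TABLE sf_advertiser (Advertiser_ID INTEGER PRIMARY KEY, Advertiser_Name TEXT);
CREATE TABLE sf_account (Account_ID INTEGER PRIMARY KEY, Advertiser_ID INTEGER);
CREATE TABLE sf_fact (Account_ID INTEGER, clicks INTEGER);

-- Star style: the fact table carries the advertiser key directly.
CREATE TABLE st_advertiser (Advertiser_ID INTEGER PRIMARY KEY, Advertiser_Name TEXT);
CREATE TABLE st_fact (Advertiser_ID INTEGER, Account_ID INTEGER, clicks INTEGER);

INSERT INTO sf_advertiser VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO sf_account VALUES (500, 1), (501, 1), (502, 2);
INSERT INTO sf_fact VALUES (500, 10), (501, 5), (502, 7);

INSERT INTO st_advertiser VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO st_fact VALUES (1, 500, 10), (1, 501, 5), (2, 502, 7);
""")

# Snowflake: two joins (fact -> account -> advertiser).
snowflake_result = conn.execute("""
    SELECT a.Advertiser_Name, SUM(f.clicks)
    FROM sf_fact f
    JOIN sf_account c ON c.Account_ID = f.Account_ID
    JOIN sf_advertiser a ON a.Advertiser_ID = c.Advertiser_ID
    GROUP BY a.Advertiser_Name
    ORDER BY a.Advertiser_Name
""").fetchall()

# Star: one join (fact -> advertiser).
star_result = conn.execute("""
    SELECT a.Advertiser_Name, SUM(f.clicks)
    FROM st_fact f
    JOIN st_advertiser a ON a.Advertiser_ID = f.Advertiser_ID
    GROUP BY a.Advertiser_Name
    ORDER BY a.Advertiser_Name
""").fetchall()

print(snowflake_result)  # [('Acme', 15), ('Globex', 7)]
print(star_result)       # [('Acme', 15), ('Globex', 7)]
```

Both queries return the same totals; the star version simply gets there with one join fewer.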
4. ETL

The snowflake model loads the data marts, and hence the ETL job is more complex in design and cannot be parallelized, because the dependency model restricts it.

The star model loads dimension tables without dependencies between dimensions, and hence the ETL job is simpler and can achieve higher parallelism.

Extract, Transform, Load (ETL)

In managing databases, extract, transform, load (ETL) refers to three separate functions combined into a single programming tool.
- The extract function reads data from a specified source database and extracts a desired subset of data.
- The transform function works with the acquired data, using rules or lookup tables or creating combinations with other data, to convert it to the desired state.
- The load function is used to write the resulting data (either all of the subset or just the changes) to a target database, which may or may not previously exist.
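The three functions described above can be sketched in a few lines of Python with sqlite3. This is a toy illustration, not a real ETL tool: the table names (products, product_dim), the price threshold, and the category lookup are all assumptions.

```python
import sqlite3


def extract(source, min_price):
    """Read a desired subset of rows from the source database."""
    return source.execute(
        "SELECT item, price FROM products WHERE price >= ?", (min_price,)
    ).fetchall()


def transform(rows, category_lookup):
    """Apply a rule (upper-case names) and a lookup table (item -> category)."""
    return [
        (item.upper(), price, category_lookup.get(item, "unknown"))
        for item, price in rows
    ]


def load(target, rows):
    """Write the rows to the target database, creating the table if needed."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS product_dim (item TEXT, price INTEGER, category TEXT)"
    )
    target.executemany("INSERT INTO product_dim VALUES (?, ?, ?)", rows)
    target.commit()


# Hypothetical source database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE products (item TEXT, price INTEGER)")
source.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Shirt", 40), ("Tennis shoes", 35), ("Trousers", 25)],
)

# Separate target database that does not exist before the load.
target = sqlite3.connect(":memory:")
lookup = {"Shirt": "clothing", "Tennis shoes": "footwear"}
load(target, transform(extract(source, 30), lookup))

loaded = target.execute("SELECT * FROM product_dim ORDER BY item").fetchall()
print(loaded)  # [('SHIRT', 40, 'clothing'), ('TENNIS SHOES', 35, 'footwear')]
```

Keeping the three steps as separate functions mirrors the definition above and makes each step easy to replace or test on its own.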