Presentation is loading. Please wait.

Presentation is loading. Please wait.

ISQS 3358, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech University 1.

Similar presentations


Presentation on theme: "ISQS 3358, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech University 1."— Presentation transcript:

1 ISQS 3358, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech University 1

2 Term project 3-4 students form a team to fulfill a data mart development project. ◦ Stage 1 (10%): Project proposal. March 9 ◦ Stage 2 (20%): Data mart development. March 25 ◦ Stage 3 (20%): Data mart populating. Due April 6 ◦ Stage 4 (20%): Analysis report. Due April 20 ◦ Stage 5 (30%): The compilation of the previous deliverables with modifications and enhancements. Due May 1 Detailed instructions: http://zlin.ba.ttu.edu/3358/Projects09.html http://zlin.ba.ttu.edu/3358/Projects09.html

3 Merits of the project outcomes Carefully developed project proposal demonstrating the understanding of the business requirements, attractive analytics themes, and clearly defined project goal and objectives Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes Applications of advanced ETL model or techniques, such as slowly changing dimensions, the use of containers, etc. Advanced OLAP cube design, and/or optional MDX scripting by self-taught Rich data analysis outcomes Well-presented final report Demonstrating the creative ideas and skillful data warehousing ability

4 More about dimensional modeling - How to develop a more professional data mart 4

5 Slowly Changing Dimensions The attribute values in a dimension may change over time, which are critical to understand the dynamics of the business. The ability to track the changes of facts over time is critical to a DW/BI system. ◦ Employees changed their departments ◦ Home moving (16.8% American moved per year) – zip code changes possible. The dimensions that have changeable attribute values slowly changing dimensions (SCDs) ◦ Type 1 SCD overwrites the existing attribute value with a new value. You don’t care about keeping track of historical values ◦ Type 2 SCD change tracking – ETL process creates a new row in the dimension table to capture the new values of the changed item ◦ Type 3 SCD – Similar to Type 2 SCD but only track current state and the original state; two additional attribute: SCD Start Date, SCD Initial Value Ask business users how they will use the information to determine which SCDs are of type 1 and which are of type 2. 5

6 Surrogate Key A surrogate key has a unique value assigned to each row in the dimension. It becomes primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table Benefits of surrogate keys: ◦ Protect the DW/BI system from changes in the source system ◦ Allow the DW/BI system to integrate data from multiple source system ◦ Enable developers to add rows to dimensions that do not exist in the source system ◦ Provide the means for tacking changes in dimension ◦ Are efficient in the relational database and analysis services Surrogate keys are created when doing data warehousing. They are new from the keys in original database They are also called meaningless keys, substitute keys, non-natural keys, artificial keys Specifically, surrogate keys are used in slowly changed dimensions (SCD) management 6

7 Aggregate Dimensions Situation: data at different levels of granularities Two resolutions ◦ Removing a dimension ◦ Rolling up a dimension’s hierarchy and provide a new, shrunken dimension at the aggregate level 7

8 Many-to-many or Multivalued Dimensions Relationship between a dimension table and fact table is called one-to-many: one row in the dimension table may join to many rows in the fact table. Many-to-many or Multivalued Dimensions are referred to as there are more than one row in a dimension table joining to multiple rows in a fact table Bridge table supports many-to-many relationship: ◦ fact-dimension ◦ dimension-dimension. 8

9 Hierarchies Meaningful, standard ways to group the data within a dimension ◦ Variable-depth hierarchies ◦ Frequently changing hierarchies Examples of hierarchy in a dimension ◦ Address: street, city, state, country ◦ Organization: section, division, branch, region ◦ Time: year, quarter, month, date 9

10 Heterogeneous Products Several products with differentiated attributes Problem: sharing one dimension or use different dimension? Resolutions ◦ One family-oriented dimension with core fact and product tables plus specific information for each line of product 10

11 The Three Fact Table Types Transaction fact table Periodic snapshot fact table ◦ Cumulative performance over specific time intervals; valuable to combine data across several business processes in the value chain. Accumulating snapshot fact table ◦ Constantly updated over time. 11

12 Junk Dimensions Also called miscellaneous or mystery dimensions They are miscellaneous attributes that don’t belong to any existing dimension. Typically flags or indictors that describe or categorize the transaction in some way. Contents are often important Four alternatives for dealing with them ◦ Leave them in the fact table ◦ Create a separate dimension for each attribute ◦ Omit them ◦ Group them into a single junk dimension 12

13 Degenerate Dimensions No description of its own No joining to an actual dimension table No attributes Example: transaction ID 13

14 Data Warehousing Methodology - Implementing data warehouse systematically 14

15 Dimensional Modeling Process Preparation ◦ Identify roles and participants ◦ Understanding the data architecture strategy ◦ Setting up the modeling environment ◦ Establishing naming conventions Data profiling and research ◦ Data profiling and source system exploration ◦ Interacting with source system experts ◦ Identifying core business users ◦ Studying existing reporting systems Building Dimensional models ◦ High-level dimensional model design ◦ Identifying dimension and fact attributes Developing the detailed dimensional model Testing the model Reviewing and validating the model

16 Business Dimensional Lifecycle 16 Project Planning Business Req’ts definition Technical Arch. Design Product Selection & Installation Dimensional Modeling Physical Design BI Appl. Specification BI Application Development ETL design & Development Deployment Maintenance Growth Project Management

17 ETL Methodology Develop a high-level map Build a sandbox source system (optional) Detailed data profiling Make decisions ◦ The source-to-target mapping ◦ How often loading tables ◦ The strategy for partitioning the relational and Analysis Services fact table ◦ The strategy for extracting data from each source system De-duplicate key data from each source system (optional) Develop a strategy for distributing dimension tables across multiple database servers (optional) 17

18 Sandbox Source System Sandbox ◦ A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system. ◦ A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play. In the DW/BI context, sandbox source system is a subset of source database for analytic exploration tasks How to create ◦ Set up a static snapshot of the database ◦ By sampling 18

19 Data Profiling Data profiling is a methodology for learning about he characteristics of the data It is a hierarchical process that attempt to build an assessment of the metadata associated with a collection of data sets. Three levels ◦ Bottom – characterizing the values associated with individual attributes ◦ Middle – the assessment looking at relationships between multiple columns within a single table. ◦ Highest level – the profile describing relationships that exist between data attributes across different tables. Can run a program against the sandbox source system to obtain the needed information. 19

20 Decision Issues in ELT System Design Source-to-target mapping Load frequency How much history is needed 20

21 Strategies for Extracting Data Extracting data from packaged source systems –self- contained data sources ◦ May not be good to use their APIs ◦ May not be good to use their add-on analytic system Extracting directly from the source databases ◦ Strategies vary depending on the nature of the source database Extracting data from incremental loads ◦ How the source database records the changes of the rows Extracting historical data 21


Download ppt "ISQS 3358, Business Intelligence Supplemental Notes on the Term Project Zhangxi Lin Texas Tech University 1."

Similar presentations


Ads by Google