Architecture and Configuration


1 Architecture and Configuration
Every enterprise data platform has a starting point, and architecture and configuration is where organizations need to start. This foundation addresses infrastructure concerns such as hosting, network, storage and licensing, as well as strategic components. It also includes enterprise platforms such as SQL Server, analytic tools and visualization products. A solid architecture considers emerging and growing technologies, like cloud and non-traditional data. It focuses on value creation, capabilities and agility by delivering a platform the business, analysts and development organizations can depend on.

Availability and Continuity: The world we live in can be volatile and make our data unsafe. A mature data lifecycle faces minimal or no disruptions from the world around it. For years, highly valuable solutions were built when organizations only cared about databases and applications on premises. With analytics, non-traditional data analysis, third-party tools and cloud implications, these architectures are more challenging to design and implement successfully. A resolute enterprise "data-structure" provides for continuity of the entire business to ensure optimal performance in any situation.

Performance and Optimization: Performance is always a priority. Businesses are focused on forward movement, and that means more work for the data platform and the teams that support it. Understanding the right way to identify performance goals, assessing the current enterprise and implementing an optimization plan are imperative to making your data lifecycle the best in the business.

Business Intelligence: The backbone of a good analysis platform is enterprise business intelligence. This starts as a departmental effort and expands to become a true enterprise solution. Comprised of a mixture of data warehouses, ETL and OLAP platforms and reporting, these solutions are some of the most applauded, or criticized, in the organization. As they grow, they may become diluted, so they require regular checkups to ensure they are meeting the needs of the business and providing the promised value. A truly valuable enterprise business intelligence solution does not operate in a vacuum. It is tightly integrated with transactional or business systems, pacing their changes and release schedules. It also opens its data to more advanced, specialized analysis tasks such as the ones performed by data scientists or analysis developers in non-traditional platforms like Hadoop.

Big Data Architecture and Deployment: Companies are beginning to explore the options for expanding their platform capabilities to include less traditional forms of analysis. These efforts are largely focused on providing another platform option within their "data-structure" to scale out the batch workloads that statisticians, marketing, business analysts and operational teams are begging for. Technologies supporting these efforts are more accessible to organizations, and integrated solutions, like the cloud, provide unparalleled access to these tools and infrastructures.

Business and Predictive Analytics: The business of improving business is big business, and that is what analytics is all about for many of our customers. Whether it is researching more insight on customers, better product design and marketing, or operational efficiency, analytics are powering more and more of today's decisions. This is the pinnacle of a mature data lifecycle. Successful analytics ecosystems require all the other stages to align in supporting the right data availability, at the right speed and with the right integration to other sources of information. Getting this right often involves some "clean-up" or enhancement work on other stages, but it is worth the effort.

2 SQL Saturday, Cleveland
Data Modeling for BI: Talking Points for SQL Saturday, Cleveland
Delora J Bradish, Sr. Consultant
February 6, 2016

3 Agenda
BI Fundamentals Review
EDW Modeling
Additional Modeling Considerations
Migrating to MS BI

4 BI Fundamentals

5 Purpose of BI

6 Purpose of BI = Reporting & Analytics
Talking Points: If you have not defined a single report, how will your end result be optimized for reporting? How often do we sacrifice the permanent (R&A) on the altar of the immediate (a faster ETL)?

7 Reporting Components
Business Intelligence components, from the top of the stack down:
6. Reporting: Initiated by your development team and grown through self-service BI.
5. SharePoint Services: One-time configuration by your SharePoint Administrator.
4. SharePoint, O-365 & SSAS Security: Coordinated with Active Directory, roles and groups.
3. SQL Server Analysis Services (SSAS): Deployed multidimensional cubes or tabular models.
2. SQL Server Data Warehouse: Star schema EDW (Enterprise Data Warehouse).
1. Information Data Store (IDS): Consolidated and cleansed 3NF data warehouse.
Take time to build a strong foundation!

8 BI Blueprint
Reporting sources and data model considerations: today's focus is on modeling for pipes #5 and #7 (the red boxes in the blueprint diagram), understanding that the plan is for all analytics to source from pipes #7 and #8 (the green arrow).

9 Information Data Store vs Enterprise Data Warehouse
IDS vs EDW, side by side (IDS | EDW):
- OLTP | OLAP
- Production reporting | Analytics
- 3NF or snowflake | 2NF star schema
- Optimized for data integration | Optimized for data delivery
- MDM & DQS | NO data cleansing!
- Base data | Business logic / analytics
- Bill Inmon | Ralph Kimball

IDS notes: NIC (Normalization, Integration & Cleansing); 3NF (3rd normal form) or snowflake; MDM (master data management); DQS (data quality services); the "Bill Inmon" approach.
EDW notes: denormalization; NO cleansing; business logic / analytics; star schema; the "Ralph Kimball" approach.
ABC: ETL metadata; data verification results.
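To make the contrast concrete, here is a minimal T-SQL sketch (hypothetical schemas and column names, not from the presentation) of the same customer data held normalized in the IDS and flattened into a star-schema dimension in the EDW:

-- IDS: normalized (3NF) customer tables, optimized for data integration
CREATE TABLE ids.CustomerType (
    CustomerTypeID   int         NOT NULL PRIMARY KEY,
    CustomerTypeName varchar(50) NOT NULL
);
CREATE TABLE ids.Customer (
    CustomerID     int          NOT NULL PRIMARY KEY,   -- business key from the source system
    CustomerName   varchar(100) NOT NULL,
    CustomerTypeID int          NOT NULL REFERENCES ids.CustomerType (CustomerTypeID)
);

-- EDW: denormalized star-schema dimension, optimized for data delivery
CREATE TABLE edw.DimCustomer (
    CustomerSK       int          IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerID       int          NOT NULL,                            -- business key retained
    CustomerName     varchar(100) NOT NULL,
    CustomerTypeName varchar(50)  NOT NULL                             -- lookup value flattened in
);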

10 Reporting vs Analytics
Production reporting | Analysis of business processes:
- A table | A subject area
- Normalized | Denormalized
- Parent-child | Combined parent with child
Examples:
- PRODUCT, PRODUCT_SUBCATEGORY and PRODUCT_CATEGORY (combined) become DimProduct
- VISIT, VISIT_LINE and VISIT_LINE_DETAIL (combined) become FactVisitLineDetail and DimVisitLineDetail
(A T-SQL sketch of the DimProduct combination follows.)
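As referenced above, a short T-SQL sketch of how the three normalized product tables might be combined into the single DimProduct; the join and column names are assumptions, since only the table names appear on the slide:

-- Flatten PRODUCT -> PRODUCT_SUBCATEGORY -> PRODUCT_CATEGORY into one row per product
SELECT
    p.ProductID AS ProductBK,          -- business key carried into DimProduct
    p.ProductName,
    ps.ProductSubcategoryName,
    pc.ProductCategoryName
FROM dbo.PRODUCT p
LEFT JOIN dbo.PRODUCT_SUBCATEGORY ps ON ps.ProductSubcategoryID = p.ProductSubcategoryID
LEFT JOIN dbo.PRODUCT_CATEGORY pc    ON pc.ProductCategoryID    = ps.ProductCategoryID;
-- The result set is what the ETL loads into DimProduct: one flat, denormalized dimension table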

11 Return on Investment
Cube Design Scalability: As your business grows, the cubes should grow. This is not just about the number of rows, but growth in additional attributes and measures. The current cube design should be able to scale to many more measure groups and shared dimensions. A cube designed with 3NF dimensions is not scalable! A solid OLAP data model is a hill worth dying upon.
Optimized Query Performance: Modeling best practices have been established over time. They may not feel right because they are different from OLTP, but like the laws of gravity, they just "are". Choosing to ignore them does not make them go away. A good OLAP model will result in the best query performance from a processed cube. The model is your house foundation. Compromise here and you are compromising your house roof: reporting and analytics.
Uncluttered GUI: Keywords of OLAP design are 'combine', 'duplicate' and 'flatten'. This will give your users the cleanest GUI. A good OLAP model will not...
- Contain duplicated dimensions. It will use role playing dimensions.
- Contain duplicated attributes between dimensions. It will have combined related 3NF tables into a single flat table.
- Contain dimension attributes that are measure group specific. It will have planned for shared dimensions between multiple measure groups.
- Be in 3NF. It will be in 2NF.
- Look like an SAP BO Universe. It will look like an OLAP model optimized for reporting and analytics.

12 Modeling

13 Talking Points
Dimensions vs Facts
Slowly Changing Dimensions
Deleted Rows
Denormalization
Degenerate Dimensions
Many-to-Many
Predictive Analytics

14 Dimensions vs Facts
Dimension | Fact:
- A set of nouns | A set of verbs
- Strings | Numeric
- An entity | A process
- Attributes | Measures
- Group / slice & filter | Aggregate
- Primary keys & business keys* | Foreign keys only
- Types: regular, junk, degenerate, slowly changing, role playing | Types: transactional, accumulating snapshot, periodic snapshot

Important dimensional modeling themes (from the source cited on the original slide): "Your design goal is ease-of-use, not elegance. In the final step of preparing data for consumption by end users, we should be willing to stand on our heads to make our BI systems understandable and fast. That means 1) transferring work into the ETL back room, and 2) tolerating more storage overhead in order to simplify the final data presentation. In correctly designed models, there is never a meaningful difference in data content between two opposing approaches. Stop arguing that 'you CAN'T do the query if you model it that way'. That is almost never true. The issues that you should focus on are ease of application development, and understandability when presented through an end user interface."

* Business keys are often stored on disk in the fact table, but exposed in the cube as a dimension.

15 Slowly Changing Dimensions
Attribute "at time of fact"
Type 1 – no history
Type 2 – multiple rows
Type 3 – multiple columns

Incorporate slowly changing dimensions (SCD) only with full understanding of their impact on analytics and reporting (A&R). Users tend to like the concept of knowing dimension data "at time of fact". BI data administrators often find the knowledge of "when did this value change" to be extremely helpful when researching data issues. However, very few business users truly understand the impact of grouping data by [last year's] dimension data values. The majority of reporting requirements are often "current": current customer, current product, current bill of material, current policy... data.

To recap briefly, there are six types of SCDs, but the following two are most common:
Type 1 – overwrite changed information, so all of the dimension rows are "current"
Type 2 – track historical data by creating second and subsequent rows held together by a key column

The type of dimension table chosen heavily impacts the table design and each ETL package. Current dimension values can be obtained from a Type 2 SCD table; the SELECT statement may look something like this:

--get the current value
SELECT [columns]
FROM dbo.FactInvoiceLine fil
LEFT JOIN dbo.DimCustomer dc
       ON fil.cusID = dc.cusID
      AND dc.IsCurrent = 'Y'

SCDs are not governed by a parameter, and it is not easy to switch back and forth. The same SELECT statement that pulls the current value is not parameter driven to pull the historical value on demand; the historical ("at time of fact") value comes from a different join condition:

--get the historical value
SELECT [columns]
FROM dbo.FactInvoiceLine fil
LEFT JOIN dbo.DimCustomer dc
       ON fil.cusID = dc.cusID
      AND fil.InvoiceDate BETWEEN dc.Eff_date AND dc.Exp_date

Talking Points:
- If users generally want current dimension data, but occasionally would like to look back and evaluate facts by attributes "at time of fact", how will this functionality be provided? Will there be two sets of views for each dimension table? (See the sketch after these notes.)
- What type of dimension view will the cubes and tabular models use by default?
- How will a dbo.CurrentCustomerView be differentiated from a dbo.HistoricalCustomerView for self-service BI?
- How will a single version of the truth be maintained with users slicing from the dimension type of their choice?
- How will the automated SSRS reports communicate the type of dimension used in the USP data source?

When choosing a dimension type, there are several other factors to consider, with many considerations being unique to each client. Do not make the mistake of assuming that "reporting will figure that out later". Remember that the data model and the ETL process exist for the purpose of analytics and reporting (A&R), so well-thought-out A&R decisions need to be made prior to the data model and data integration steps.
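One possible answer to the "two sets of views" question above is a current view and a historical view over a single Type 2 table; a minimal sketch, assuming the column names used in the queries on this slide:

-- Type 2 dimension: one row per version of a customer
CREATE TABLE dbo.DimCustomer (
    CustomerSK   int          IDENTITY(1,1) NOT NULL PRIMARY KEY,
    cusID        int          NOT NULL,          -- business key, repeated across versions
    CustomerName varchar(100) NOT NULL,
    Eff_date     date         NOT NULL,
    Exp_date     date         NOT NULL,          -- e.g. 9999-12-31 on the open-ended row
    IsCurrent    char(1)      NOT NULL           -- 'Y' on exactly one row per cusID
);
GO
-- Current view: most reports slice by today's attribute values
CREATE VIEW dbo.CurrentCustomerView AS
SELECT cusID, CustomerName
FROM dbo.DimCustomer
WHERE IsCurrent = 'Y';
GO
-- Historical view: exposes the date range so facts can join "at time of fact"
CREATE VIEW dbo.HistoricalCustomerView AS
SELECT CustomerSK, cusID, CustomerName, Eff_date, Exp_date
FROM dbo.DimCustomer;
GO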

16 Deleted Rows
"Is Deleted" Flag
Deleted Schema

Model deleted source records with thought for reporting performance. This best practice suggestion follows #5 because a delete strategy often falls victim to development time constraints. How the tables are designed has a significant impact on the ETL process. Two popular methods are described below; in either case, deleted rows can be referenced at a future time to help explain a data anomaly.

Option #1: Add an 'IsDeleted' flag to each table. This solution is most appropriate for clients who know they will want to include deleted source system data in their daily reporting and analytics.
Pros:
- The ETL process does not need to INSERT to another location and DELETE, but only UPDATE.
- Queries can ignore the IsDeleted flag and select all rows as they were at any point in time.
Cons:
- This column will need to be indexed, because every query that wants only existing source system records will now need to filter "WHERE TableName.IsDeletedFlag = 'N'" or something similar. Even if a set of views is built for each table that filters out deleted rows, there is still an index consideration.
- There is overhead associated with keeping two views per table in sync with DML changes.
- Without an extra set of views, all of the queries that want the current state of data will have to filter out deleted rows.

Here is an actual example of a query from a data warehouse system that elected to keep deleted rows with an indicator flag; every SELECT and every JOIN had to continually filter on the delete indicator column (bracketed placeholders stand in for columns omitted on the original slide):

SELECT [columns]
FROM dbo.BILL_ENTY_DIM bill_enty WITH (NOLOCK)
INNER JOIN dbo.CMSN_REL_FACT cmsn_rel WITH (NOLOCK)
        ON bill_enty.BILL_ENTY_SK = cmsn_rel.BILL_ENTY_SK
       AND cmsn_rel.LOGIC_DEL_IND = 'N'
INNER JOIN dbo.CMSN_ENTY_DIM cmsn WITH (NOLOCK)
        ON cmsn_rel.CMSN_ENTY_SK = cmsn.CMSN_ENTY_SK
       AND cmsn.LOGIC_DEL_IND = 'N'
       AND [date key] BETWEEN cmsn.CMSN_ENTY_EDW_EFF_DT_SK AND cmsn.CMSN_ENTY_EDW_TRMN_DT_SK
       AND [period] >= LEFT(CONVERT(int, CONVERT(char(8), cmsn.CMSN_ENTY_EFF_DT_SK, 112)), 6)
       AND [period] <= LEFT(CONVERT(int, CONVERT(char(8), cmsn.CMSN_ENTY_TRMN_DT_SK, 112)), 6)
WHERE bill_enty.LOGIC_DEL_IND = 'N'
  AND [date key] BETWEEN bill_enty.BILL_ENTY_EDW_EFF_DT_SK AND bill_enty.BILL_ENTY_EDW_TRMN_DT_SK

Option #2: Move the deleted records to a 'deleted' schema. This method is most appropriate for clients who want to be able to look up what was deleted from their source system, but 99% of the time are not interested in including deleted data in their analytics.
- The index logic and query scripts do not have to continually be aware of an IsDeleted flag column.
- Deleted source rows are omitted from all queries by default, which is highly beneficial when opening up the data warehouse to self-service BI.
- This method is more work for the ETL process: the ETL has to identify the deleted row, figure out how many tables in the CRF, IDS and EDW it impacts, copy the deleted data into a different table, then finally delete it from the database.
- Depending on how this is implemented, the CRF, IDS and/or EDW databases may end up with two identically structured tables, one named "dbo.Customer" and one named "deleted.Customer", for example. If the user wants to include deleted source rows in a report, they will have to UNION the two tables together. (A minimal sketch follows these notes.)

Note: a "schema" in this context refers to a container for grouping tables. It is the leading prefix of a table name in many data warehouse systems. "ods.Customer", "edw.Customer", "etl.Customer" and "deleted.Customer" may or may not have different columns, but they all belong to separate containers, or schemas, within a single database.
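A minimal T-SQL sketch of option #2; dbo.Customer and the @DeletedCustomerID value are hypothetical, and the point is simply that the row is parked in the deleted schema before it leaves the main table:

-- One-time setup: a "deleted" schema and a structural copy of the main table
CREATE SCHEMA deleted;
GO
SELECT TOP (0) * INTO deleted.Customer FROM dbo.Customer;   -- same columns, no rows
GO
-- During ETL, when a row has disappeared from the source system
DECLARE @DeletedCustomerID int = 42;   -- hypothetical business key identified by the ETL

BEGIN TRANSACTION;
    INSERT INTO deleted.Customer
    SELECT * FROM dbo.Customer WHERE CustomerID = @DeletedCustomerID;

    DELETE FROM dbo.Customer WHERE CustomerID = @DeletedCustomerID;
COMMIT TRANSACTION;
GO
-- The occasional "include deleted rows" report simply UNIONs the two tables back together
SELECT * FROM dbo.Customer
UNION ALL
SELECT * FROM deleted.Customer;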

17 Denormalization
Snowflake (parent-child related) tables
Role Playing Dimensions
Degenerate Dimensions

Parent-child relationships: think "subject area" instead.
- Product with Product Subgroup with Product Group
- Visit header with Visit Line with Visit Line Item
- Patient with Patient Demographic Universe
Role playing queries:
- Geography: city & state of... mailing, billing, actual, etc.
- Dates: date of... visit, payment, procedure, follow-up
Business keys (aka natural keys):
- Example: Visit Number, Patient Number, Line Number
- Use a degenerate dimension
(A T-SQL sketch of a role-playing date dimension and a degenerate business key follows.)
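As referenced above, a short T-SQL sketch (assumed table and column names) showing one physical date dimension playing several roles on a fact table, with the visit number kept on the fact as a degenerate dimension:

-- One physical date dimension ...
CREATE TABLE edw.DimDate (
    DateSK  int  NOT NULL PRIMARY KEY,    -- e.g. 20160206
    [Date]  date NOT NULL,
    [Year]  int  NOT NULL,
    [Month] int  NOT NULL
);
-- ... referenced three times by the fact; SSAS exposes it as Visit Date, Payment Date and Follow-up Date
CREATE TABLE edw.FactVisit (
    VisitSK        int         IDENTITY(1,1) NOT NULL PRIMARY KEY,
    VisitDateSK    int         NOT NULL REFERENCES edw.DimDate (DateSK),
    PaymentDateSK  int         NOT NULL REFERENCES edw.DimDate (DateSK),
    FollowUpDateSK int         NOT NULL REFERENCES edw.DimDate (DateSK),
    VisitNumber    varchar(20) NOT NULL,   -- business key, exposed in the cube as a degenerate dimension
    VisitAmount    money       NOT NULL
);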

18 Denormalization Illustrated

19 Denormalization Illustrated (YES / NO comparison)

20 Degenerate Dimensions
1-1 with a fact; natural keys; 'fact' cube relationship
- Has a one-to-one relationship with a fact table, sharing the same PK.
- Consequently, grows in proportion to the fact table.
- Consequently, an identity seed PK in the fact is highly beneficial vs a composite unique key.
- When based on a fact table, can take advantage of the 'fact' relationship type in the cube.
- Degenerate dimensions are not 'junk dimensions', which are "a convenient grouping of flags and indicators".
(A small DSV query sketch follows.)
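A small sketch of the idea: because the degenerate dimension is one-to-one with the fact, its DSV named query can simply select the business keys straight off the fact table, reusing the fact's identity PK as the dimension key (table and column names are assumptions):

-- DSV named query for a degenerate dimension: one row per fact row, same PK as the fact
SELECT
    VisitLineSK,      -- identity PK of the fact table, reused as the dimension key
    VisitNumber,      -- business keys stored on the fact
    LineNumber
FROM edw.FactVisitLineDetail;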

21 Degenerate Dimensions Illustrated

22 Many-to-Many
Facts contain FKs only, aka a "factless fact table".
Requires an intermediate fact table, aka a bridge.
One or more dimensions must be shared.
Values are distinct sums.
Relationships must be defined in the DSV.
(A T-SQL sketch of a bridge table and a distinct sum follows.)
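As referenced above, a minimal T-SQL sketch (assumed names) of a bridge table containing only foreign keys, and of the "distinct sum" behaviour it implies when a visit maps to many procedures:

-- Bridge ("factless fact") table: foreign keys only
CREATE TABLE edw.FactVisitProcedure (
    VisitSK     int NOT NULL,    -- FK to edw.FactVisit
    ProcedureSK int NOT NULL,    -- FK to edw.DimProcedure (assumed dimension)
    CONSTRAINT PK_FactVisitProcedure PRIMARY KEY (VisitSK, ProcedureSK)
);

-- "Distinct sum": each visit's amount is counted once, even when the visit has many procedures
SELECT SUM(v.VisitAmount) AS TotalVisitAmount
FROM (
    SELECT DISTINCT fv.VisitSK, fv.VisitAmount
    FROM edw.FactVisit fv
    JOIN edw.FactVisitProcedure b ON b.VisitSK = fv.VisitSK
) AS v;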

23 Many-to-Many Illustrated

24 Many-to-Many Illustrated
A many-to-many relationship occurs when a [Fact Table] has many [Dimension Attribute Values]:
- A sales transaction (fact) has many reasons (dimension)
- A surgical procedure (fact) has many doctors (dimension)
- A surgical procedure (fact) uses many items (dimension)
- A patient visit (fact) covers many medical procedures (dimension)

25 Many-to-Many Illustrated
Rules (from the source cited on the original slide):
- The intermediate measure group must have at least one dimension in common with the selected measure group.
- The granularity of the relationship between the intermediate measure group and the common dimension must be greater than or equal to the granularity between the common dimension and the selected measure group.
Note: you must have an intermediate measure group already defined prior to creating a many-to-many relationship!

26 PA & Migration

27 Predictive Analytics in SSAS
Consuming unstructured data
Snapshot facts
Completely flat, Excel-type dataset

28 Additional Considerations
Grain
Agile Methodologies
Indexing
ABC (Audit, Balance & Control)

Grain (from the source cited on the original slide): "When developing fact tables, aggregated data is NOT the place to start. To avoid 'mixed granularity' woes, including bad and overlapping data, stick to rich, expressive, atomic-level data that's closely connected to the original source and collection process."

Model in step with agile development methodologies. As the BI project cycles through iterations, the IDS and EDW model plan needs to remain flexible. Each iteration tends to serve as a learning curve for end users, and requirements are consequently changed under established change control. If data modeling gets too far ahead of the data integration, information delivery, testing and production cycle, rework will be the result.

Manage the source system indexes. What is the first T-SQL script seen executing when a dimension is processed?

SELECT DISTINCT column1, column2, column3, column4, column...[n]
FROM dbo.DIMTableName

If the EDW table does not have a clustered index, SSAS has to add a distinct sort to the query plan, which adds time. Adding a columnstore index on the columns used in the dimension may improve performance; however, all objects have to work together. An index strategy for SSAS may not be the best solution for ad hoc end user queries being executed against the EDW for other purposes. (A sketch of both index options follows these notes.)

Model for ABC (Audit, Balance and Control). ABC handles metadata collection, restartability at the point of failure and data verification. Examples of metadata collection are the retention of 'date entered', 'last date updated', row counts, SSIS package IDs, variable values, package success or failure flags and, when needed, hash values. The individual table ABC columns do not need to result in heavy overhead, especially if a well-designed audit (ABC) database is kept. Minimally, the ID of the SSIS package run that last performed a DML statement on the table should be retained. From this ID, 'date added to the warehouse', 'date last updated in the warehouse' and 'row source system' can be determined. It is not advisable, but sometimes row-level source system metadata is retained in the IDS. Again, model the standard, and sometimes unique, ABC columns right up front, keeping each table's ABC columns as consistent as possible. Coordinate the ABC model with the ETL team. SSIS has built-in event handlers, and the ABC model for SQL Server will not be identical to the ABC model for another tool, such as Informatica.
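A hedged T-SQL sketch of the two index options mentioned above, plus minimal ABC columns, using the hypothetical DimProduct columns from the earlier sketch; treat it as an illustration rather than a recommended strategy:

-- A clustered index (if the table does not already have a clustered PK) lets SSAS's
-- SELECT DISTINCT avoid an extra distinct sort in the query plan
CREATE CLUSTERED INDEX CIX_DimProduct ON edw.DimProduct (ProductSK);

-- A nonclustered columnstore index over just the attribute columns the dimension reads
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_DimProduct_Attributes
    ON edw.DimProduct (ProductName, ProductSubcategoryName, ProductCategoryName);

-- Minimal ABC columns on each table, fed by the SSIS package run that last touched the row
ALTER TABLE edw.DimProduct ADD
    ETLRunID        int      NOT NULL CONSTRAINT DF_DimProduct_ETLRunID  DEFAULT (0),
    DateAdded       datetime NOT NULL CONSTRAINT DF_DimProduct_DateAdded DEFAULT (GETDATE()),
    DateLastUpdated datetime NULL;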

29 Migration
Modeling for MS BI
Tool Selection
Migration or Replacement?
User Expectations

30 “If you would hit the mark, you must aim a little above it;
Every arrow that flies feels the attraction of earth.” ~ Henry W. Longfellow

31 Supporting Material

32 Denormalization Illustrated
Source tables (snowflaked, parent-child):
- Group: (GR1, Group 1 Name), (GR2, Group 2 Name), (GR3, Group 3 Name)
- Category (FK to Group): (GR1, CAT1, Cat 1 Name), (GR1, CAT2, Cat 2 Name), (GR1, CAT3, Cat 3 Name), (GR2, CAT4, Cat 4 Name)
- Patient (FK to Category): (CAT1, Pat1, Pat 1 Name), (CAT1, Pat2, Pat 2 Name), (CAT2, Pat3, Pat 3 Name), (CAT3, Pat4, Pat 4 Name)
Denormalized Patient.DIM (one row per patient, with group and category flattened in):
- (GR1, Group 1 Name, CAT1, Cat 1 Name, Pat1, Pat 1 Name)
- (GR1, Group 1 Name, CAT1, Cat 1 Name, Pat2, Pat 2 Name)
- (GR1, Group 1 Name, CAT2, Cat 2 Name, Pat3, Pat 3 Name)
- (GR1, Group 1 Name, CAT3, Cat 3 Name, Pat4, Pat 4 Name)
- (GR2, Group 2 Name, CAT4, Cat 4 Name, NULL, NULL)
Fact rows keyed by patient: (Fact1, Pat1), (Fact2, Pat1), (Fact3, Pat2), (Fact4, Pat3)

33 Denormalization Illustrated
Source tables where the relationships are many-to-many (a category belongs to many groups, a patient belongs to many categories):
- Group: (GR1, Group 1 Name), (GR2, Group 2 Name), (GR3, Group 3 Name)
- Category-to-Group: (CAT1, GRP1, Cat 1 Name), (CAT1, GRP2, Cat 1 Name), (CAT1, GRP3, Cat 1 Name), (CAT2, GRP1, Cat 2 Name), (CAT2, GRP3, Cat 2 Name)
- Patient-to-Category: (Pat1, CAT1, Pat 1 Name), (Pat1, CAT2, Pat 1 Name), (Pat2, CAT1, Pat 2 Name), (Pat2, CAT2, Pat 2 Name)
Fact row: (Fact1, Pat1, $10)
Flattening this into Patient.DIM fans Pat1 out across every group/category combination, so the single $10 fact would be repeated against each of these dimension rows:
- (GR1, Group 1 Name, CAT1, Cat 1 Name, Pat1, Pat 1 Name)
- (GR2, Group 2 Name, CAT1, Cat 1 Name, Pat1, Pat 1 Name)
- (GR3, Group 3 Name, CAT1, Cat 1 Name, Pat1, Pat 1 Name)
- (GR1, Group 1 Name, CAT2, Cat 2 Name, Pat1, Pat 1 Name)
- (GR3, Group 3 Name, CAT2, Cat 2 Name, Pat1, Pat 1 Name)

34 Fundamentals of Cube Design
- Understand your data
- Use a star schema
- Design for SSAS
- Denormalize!
- Model many-to-many with a bridge
- Key using integer data types
- Remove NULL values
- Use role playing attributes (-1, etc.)
- Use role playing dimensions
- Create at least one hierarchy per dimension
- Push business logic back to the EDW, not the DSV or MDX
Summary from the 'SSAS Tips & Tricks' document, Fundamentals of Cube Design section.
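For the "Remove NULL values" bullet, one common pattern (a sketch with assumed staging and dimension names, not taken from the SSAS Tips & Tricks document) is to fall back to a -1 'Unknown' member during the fact load so no NULL foreign keys reach the cube:

-- During the fact load, business keys that find no dimension match fall back to the -1 Unknown member
SELECT
    ISNULL(dc.CustomerSK, -1) AS CustomerSK,   -- -1 row is pre-seeded in dbo.DimCustomer as 'Unknown'
    src.InvoiceNumber,
    src.InvoiceAmount
FROM stage.InvoiceLine AS src
LEFT JOIN dbo.DimCustomer AS dc
       ON dc.cusID = src.CustomerID
      AND dc.IsCurrent = 'Y';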

35 DIM Review Checklist – Required
Non-Negotiable:
- Dimensions are indicative of a complete subject area.
- The same PK that is defined in the DSV is used as the PK in the dimension.
- There are fewer than 25 or 30 dimensions in a cube.
- Dimensions are related to multiple measure groups.
- No dimension is a copy of another.
- Every dimension attribute name is unique between multiple dimensions.
- Dimension attribute names are not measure group specific.
- Role playing dimensions have been used for multiple date or geography keys found in a single measure group. (There is only one DATE.DIM and one GEOGRAPHY.DIM.)
- Attribute relationships are showing no warnings.
- Parent-child hierarchies have been avoided.
- Each dimension passes a BIDS Helper Dimension Health Check.

36 DIM Review Checklist – Good Idea
- Every dimension has multiple attributes.
- Every dimension has a hierarchy.
- Degenerate dimensions are used in the cube and generally do not contain a hierarchy (sometimes you will find an 'Order Number' + 'Order Line' hierarchy).
- Attributes used in a hierarchy are not exposed as independent dimension attributes.
- All PKs and FKs are hidden in the dimension attribute properties, not in a perspective.
- Naming conventions have been implemented for a cleaner user experience.
- Integer keys (attribute property) are in use whenever possible, with the NameColumn pointing to the varchar() value.

37 Reporting Phases [need notes here]

38 Waterfall vs. Agile

