Design for Flexibility and Performance - ETL Patterns with SSIS and Beyond


1 Design for Flexibility and Performance - ETL Patterns with SSIS and Beyond
And without further ado, here is Daniel with "Using SSIS to Prepare Data for Analytics". {speaker begins}
Daniel Cai, Principal Developer, KingswaySoft

2 Daniel Cai
Principal Developer, @kingswaysoft
At KingswaySoft, I have worked in a product management role, identifying and defining solutions that help solve some of the most challenging integration scenarios using SSIS as the ETL platform.
Disclaimer
  I use some premium components we have developed at KingswaySoft to show you much simplified design patterns.
  Speaking is not my greatest strength.

3 Survey What are your typical day-to-day ETL development challenges?
The sheer amount of data? Poor data quality, including duplicates? Constantly changing data schemas? Referential integrity? What else?

4 Load data incrementally

5 Incremental Changes - Reading
For the best performance, it is important to read only those records that have changed in the source system before pushing them into the ETL pipeline.
CDC
  Use the CDC component to capture changes in the source system.
  Requires change data capture support by the database platform.
  An enterprise feature, which means higher license cost.
  Depends on the database log reader.
Timestamp Fields
  Use a timestamp field in the source system to detect changes.
  Save the current timestamp value after each execution.
  SELECT * FROM Customer WHERE LastModifiedDate ... AND LastModifiedDate ...
Lookup + HashValues
  Generate a hash value for each input row and compare it with the hash values in the destination table to detect changes.
  Use the Lookup component (or an equivalent component) to detect new rows.
  Be mindful of hash collisions. A strong hash is preferable.
Diff Detector
  Use the third-party Diff Detector component to compare two inputs and find the differences between them.
  A total of 4 outputs are available: Added Rows, Deleted Rows, Changed Rows, Unchanged Rows.
  Comes with much greater flexibility and supports any data source type in the SSIS pipeline.
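The timestamp-field approach can be sketched outside SSIS as follows. This is only a minimal illustration using Python's built-in sqlite3 module: the `Id` and `Name` columns, the sample rows, and the watermark values are hypothetical, and the comparison operators (`>` and `<=`) are an assumption, since the query on the slide does not show them.

```python
import sqlite3

def read_changes(conn, last_watermark, current_watermark):
    """Return rows modified in the window (last_watermark, current_watermark]."""
    cur = conn.execute(
        "SELECT Id, Name, LastModifiedDate FROM Customer "
        "WHERE LastModifiedDate > ? AND LastModifiedDate <= ? "
        "ORDER BY Id",
        (last_watermark, current_watermark),
    )
    return cur.fetchall()

# Hypothetical source table; ISO-8601 strings sort chronologically as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, LastModifiedDate TEXT)"
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [
        (1, "Alice", "2024-01-01T08:00:00"),
        (2, "Bob", "2024-01-02T09:30:00"),
        (3, "Carol", "2024-01-03T10:15:00"),
    ],
)

# Watermark saved by the previous execution (hypothetical value).
last_watermark = "2024-01-01T12:00:00"
# Upper bound captured once at the start of this execution.
current_watermark = "2024-01-03T00:00:00"

changed = read_changes(conn, last_watermark, current_watermark)

# Save the upper bound as the next run's lower bound, only after a successful load.
next_watermark = current_watermark
```

Bounding the window on both sides means a row modified while the package is running falls into the next execution instead of being skipped or read twice.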

6 Demos - Incremental Reading
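As a rough stand-in for the hash-comparison idea from the previous slide, the sketch below hashes the tracked columns of each row with SHA-256 and sorts keys into the same four buckets that the Diff Detector exposes (added, deleted, changed, unchanged). It is not how the Lookup or Diff Detector components are implemented; the function names, column names, and sample rows are all hypothetical.

```python
import hashlib

def row_hash(row, columns):
    """SHA-256 over the tracked columns, joined with a separator so that
    ("ab", "c") and ("a", "bc") do not produce the same payload.
    Note: None and the empty string hash identically in this sketch."""
    payload = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_rows(source, destination, key, columns):
    """Classify keys by comparing per-row hashes of source and destination."""
    src = {r[key]: row_hash(r, columns) for r in source}
    dst = {r[key]: row_hash(r, columns) for r in destination}
    return {
        "added": sorted(k for k in src if k not in dst),
        "deleted": sorted(k for k in dst if k not in src),
        "changed": sorted(k for k in src if k in dst and src[k] != dst[k]),
        "unchanged": sorted(k for k in src if k in dst and src[k] == dst[k]),
    }

source = [
    {"id": 1, "name": "Alice", "city": "Toronto"},
    {"id": 2, "name": "Bob", "city": "Calgary"},
    {"id": 4, "name": "Dana", "city": "Halifax"},
]
destination = [
    {"id": 1, "name": "Alice", "city": "Toronto"},
    {"id": 2, "name": "Bob", "city": "Edmonton"},
    {"id": 3, "name": "Carol", "city": "Ottawa"},
]

result = diff_rows(source, destination, key="id", columns=["name", "city"])
```

A strong hash such as SHA-256 keeps the collision risk negligible, at the cost of a longer stored value than a weaker checksum.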

7 Incremental Changes - Writing
Upsert (Update/Insert) is typically the most efficient way of synchronizing databases. It writes new rows and updates existing rows in the same operation. There are 3 main methods for performing Upsert in SSIS.
Lookup Transform
  Requires developers to set up lookups to separate new rows, updated rows, and unchanged rows, then either use OLE DB Command to update records (performance can be very slow) or leverage staging tables with the MERGE command for better performance.
Custom Script
  Develop a custom script to implement your own upsert strategy.
  Use a staging database table.
  Leverage the SQL MERGE command (if supported) through an OLE DB connection.
  Can be time consuming to implement.
Upsert Write Action
  Available in third-party destination components.
  These components handle the work for you by checking whether the row already exists and performing the necessary write action, all within 1 component.
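The staging-table strategy can be illustrated with a small, self-contained sketch. It uses SQLite through Python's sqlite3 module purely so that it runs anywhere; the table names, columns, and sample rows are hypothetical. On a platform that supports MERGE, the UPDATE and INSERT below would typically collapse into a single MERGE statement.

```python
import sqlite3

def upsert_from_staging(conn):
    """Apply staged rows to Customer in one transaction: update matches, insert the rest."""
    with conn:  # commits on success, rolls back on error
        updated = conn.execute(
            """
            UPDATE Customer
            SET Name = (SELECT s.Name FROM Customer_Staging s WHERE s.Id = Customer.Id),
                City = (SELECT s.City FROM Customer_Staging s WHERE s.Id = Customer.Id)
            WHERE EXISTS (SELECT 1 FROM Customer_Staging s WHERE s.Id = Customer.Id)
            """
        ).rowcount
        inserted = conn.execute(
            """
            INSERT INTO Customer (Id, Name, City)
            SELECT s.Id, s.Name, s.City
            FROM Customer_Staging s
            WHERE NOT EXISTS (SELECT 1 FROM Customer c WHERE c.Id = s.Id)
            """
        ).rowcount
        # Empty the staging table so the next batch starts clean.
        conn.execute("DELETE FROM Customer_Staging")
    return updated, inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, City TEXT)")
conn.execute("CREATE TABLE Customer_Staging (Id INTEGER PRIMARY KEY, Name TEXT, City TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1, "Alice", "Toronto"), (2, "Bob", "Calgary")],
)
# The data flow bulk-loads incoming rows into staging (one existing key, one new key).
conn.executemany(
    "INSERT INTO Customer_Staging VALUES (?, ?, ?)",
    [(2, "Bob", "Edmonton"), (3, "Carol", "Ottawa")],
)

updated, inserted = upsert_from_staging(conn)
final_rows = conn.execute("SELECT Id, Name, City FROM Customer ORDER BY Id").fetchall()
```

The update runs before the insert so that freshly inserted rows are not touched a second time. Both statements are set-based, which is what makes this faster than issuing one OLE DB Command per row.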

8 Demos - Incremental Writing

9 Duplicate Detection

10 Duplicate Detection
Fuzzy Lookup/Grouping Component
  Use the Fuzzy Lookup component to find duplicates.
  Duplicates are determined based on key_in, key_out values.
  Cumbersome configuration.
  The Fuzzy Lookup component is limited to OLE DB connections.
3rd Party Component - Duplicate Detector
  The KingswaySoft Duplicate Detector component enables some advanced duplicate detection scenarios.
  Additional matching support, such as first name, phone number, zip code, address, etc.
  Much better usability.
  Supports anything that comes through the SSIS pipeline.
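To make the key_in/key_out idea concrete, here is a small sketch of similarity-based grouping. It is not the algorithm used by the SSIS fuzzy components or by the Duplicate Detector; it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and the 0.85 threshold, function names, and sample rows are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(value.lower().split())

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def assign_group_keys(rows, field, threshold=0.85):
    """Return (key_in, key_out) pairs.

    key_in is the row's own position; key_out is the key_in of the first
    earlier group representative that is similar enough, or the row itself
    when it starts a new group. Rows with key_in != key_out are duplicates.
    """
    representatives = []  # key_in values of rows that started a group
    result = []
    for key_in, row in enumerate(rows):
        key_out = key_in
        for rep in representatives:
            if similarity(rows[rep][field], row[field]) >= threshold:
                key_out = rep
                break
        else:
            # No representative matched, so this row starts a new group.
            representatives.append(key_in)
        result.append((key_in, key_out))
    return result

rows = [
    {"name": "Jon Smith"},
    {"name": "John Smith"},
    {"name": "Jane Doe"},
    {"name": "john  smith "},
]
keys = assign_group_keys(rows, "name")
duplicates = [key_in for key_in, key_out in keys if key_in != key_out]
```

This greedy pass compares each row only against group representatives, so the result depends on row order. Matching on several fields at once (phone number, zip code, address) would need a combined score.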

11 Demos – Duplicate Detection

12 Writing in Parallel

13 Writing in Parallel
Usage Scenarios
  When writing is the bottleneck.
What to watch for?
  There are only 5 active buffers available per data flow.
  Plan for proper buffer sizing.
  If a change is ever needed, you have to go through each destination component after each branch.
What to use?
  Balanced Data Distributor
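The general idea of splitting one stream of rows across several concurrent writers can be sketched as below. This is not a description of how the Balanced Data Distributor works internally; it is a minimal thread-pool illustration in which `InMemoryDestination`, the worker count, and the batch size are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import threading

def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

class InMemoryDestination:
    """Hypothetical stand-in for a slow destination; a lock guards the shared list."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = []

    def write(self, batch):
        with self._lock:
            self.rows.extend(batch)
        return len(batch)

def write_in_parallel(rows, destination, workers=4, batch_size=100):
    """Hand batches to a pool of writer threads and return the total rows written."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        written = pool.map(destination.write, batches(rows, batch_size))
        return sum(written)

destination = InMemoryDestination()
total = write_in_parallel(range(1000), destination, workers=4, batch_size=100)
```

Arrival order at the destination is not guaranteed, so this pattern suits loads where row order does not matter. Keeping the writer in one function also avoids the maintenance issue noted on the slide, where a change has to be repeated in every destination after each branch.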

14 Demos – Writing in Parallel

15

16 Resource Links on CDC CDC

