Implementing a Data Extraction Solution


1 Implementing a Data Extraction Solution
Module 9 The 20767A-MIA-SQL virtual machine used in the lab for this module includes many software services that can take a while to start. For the best experience, have students start the 20767A-MIA-DC and 20767A-MIA-SQL virtual machines at the beginning of the module so that the services are ready before they start the lab. Implementing a Data Extraction Solution

2 Module Overview
9: Implementing a Data Extraction Solution Temporal Tables

3 Lesson 1: Introduction to Incremental ETL
9: Implementing a Data Extraction Solution Slowly Changing Dimensions
Question: What are the missing words indicated by <xx xx xx> in the following statement about a data warehouse ETL flow? “For BI solutions that involve loading large volumes of data, a <xx xx xx> process is recommended. In this data flow architecture, the data is initially extracted to tables that closely match the source system schemas (often referred to as a landing zone).”
Answer: Three-stage ETL.
Question: Which of the following questions is NOT a consideration when planning extraction windows?
( ) Option 1: How much impact on business is likely during the extraction window?
( ) Option 2: How frequently is new data generated in the source systems, and for how long is it retained?
( ) Option 3: During what time periods are source systems least heavily used?
( ) Option 4: How long does data extraction take?
( ) Option 5: What latency between changes in source systems and reporting is tolerable?
(√) Answer: Option 1: How much impact on business is likely during the extraction window?
Question: Which three features does SQL Server offer for SCD control?
Answer: SQL Server offers three features for SCD version control: Change Data Capture, Change Tracking, and temporal (system-versioned) tables.

4 Overview of Data Warehouse Load Cycles
9: Implementing a Data Extraction Solution Extract changes from data sources Refresh the data warehouse based on changes Point out that a data warehousing solution might include multiple refresh cycles to handle updates from different data sources. For example, you might refresh sales and customer data daily to reflect each day’s transactions, but load new accounting data monthly after invoicing and payroll processes have been completed.

5 Considerations for Incremental ETL
9: Implementing a Data Extraction Solution Data modifications to be tracked Load order Dimension keys Updating dimension members Updating fact records Point out that data warehouses generally store data for historical analysis and reporting, so it is common for data deletions in source systems not to be propagated to the data warehouse. In some cases, however, you may want to identify deletions in source systems and perform a logical delete operation in the data warehouse to mark the record as obsolete. For example, to indicate a retired employee, an inactive customer account, or a discontinued product.

6 Common ETL Data Flow Architectures
9: Implementing a Data Extraction Solution Single-stage ETL: Data is transferred directly from source to data warehouse Transformations and validations occur in-flight or on extraction Two-stage ETL: Data is staged for a coordinated load Transformations and validations occur in-flight, or on staged data Three-stage ETL: Data is extracted quickly to a landing zone, and then staged prior to loading Transformations and validation can occur throughout the data flow Source DW Source Staging DW Source Landing Zone Staging DW

7 Planning Extraction Windows
9: Implementing a Data Extraction Solution How frequently is new data generated in the source systems, and for how long is it retained? What latency between changes in source system and reporting is tolerable? How long does data extraction take? During what time periods are source systems least heavily used?

8 Planning Transformations
9: Implementing a Data Extraction Solution On extraction: From source From landing zone From staging In data flow: Source to landing zone Landing zone to staging Staging to data warehouse In-place: In landing zone In staging Source Data Warehouse Staging Landing Zone Point out that, in some respects, the first two guidelines for where to perform transformations are contradictory. For example, you can use joins, ISNULL expressions, CAST and CONVERT functions, and concatenation expressions in the Transact-SQL code used to extract data from a source database—this would meet the goal of performing validation and transformations as soon as possible. However, doing this might add processing and memory overhead to the source system. Alternatively, using a JOIN to look up values in a related table during the extraction query might incur less overhead than performing a lookup later in the data flow. In a typical enterprise BI scenario, most transformations occur between the landing zone and the staging area, though some basic validation and lookups may be performed during extraction from source systems.
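As an illustration of the on-extraction approach described above, an extraction query can combine a JOIN lookup, NULL handling with ISNULL, type conversion with CAST, and concatenation, so that basic transformations happen before the data leaves the source system. This is a sketch only; the table and column names are hypothetical:

```sql
-- Hypothetical extraction query performing lightweight transformations at the source:
-- a JOIN lookup, NULL handling, type conversion, and string concatenation.
SELECT p.ProductID,
       ISNULL(p.ProductName, 'Unknown') AS ProductName,
       ISNULL(s.SubcategoryName, 'Uncategorized') AS Subcategory,
       CAST(p.ListPrice AS decimal(10, 2)) AS ListPrice,
       p.Size + ' ' + p.MeasureUnit AS SizeWithUnit   -- concatenation, as in the Size example
FROM dbo.Product AS p
LEFT JOIN dbo.Subcategory AS s
    ON p.SubcategoryID = s.SubcategoryID;             -- lookup during extraction
```

Remember the trade-off noted above: each of these expressions adds processing overhead on the source system, which may argue for deferring them to the landing-zone-to-staging data flow instead.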

9 Documenting Data Flows
9: Implementing a Data Extraction Solution ProductDB Product Subcategory Category The diagram on the slide is not based on any standard, but does include the source-to-target information that should be documented during the initial ETL design phase. More information about specific data flows, from source tables to target tables in the data warehouse, can be added as the design is refined. Show students the DimCustomer and FactInternetSales worksheets in D:\Demofiles\Mod09\Source to Target mapping.xlsx as an example of how source-to-target documentation can be created as a table showing the lineage of each destination column. Audit Start Filter on LastModified Concatenate Size (Size + ' ' + MeasureUnit) Lookup Subcategory Lookup Category Handle NULLs* Update SCD1 rows ProductName Update and insert SCD2 rows (generate surrogate key) Category, Subcategory, Size, Color Insert new rows (generate surrogate key) *NULL Handling Rules Change NULL Subcategory and Category to "Uncategorized". Redirect rows with null ProductName. Audit End DimProduct

10 Slowly Changing Dimensions
9: Implementing a Data Extraction Solution Types of change to a dimension member: Type 1: Changing attributes are updated in the dimension record Type 2: Historical attribute changes result in a new record Type 3: The original and current values of historical attributes are stored in the dimension record Key AltKey Name Phone City 101 C123 Mary New York Key AltKey Name Phone City 101 C123 Mary New York Point out that other types of change have been identified by data warehouse professionals, but that types 1, 2, and 3 are the most common. In practice, most updates to dimension attributes are implemented as type 1 or type 2 changes. For type 2 changes, you can indicate the current record by: Using a Boolean flag that is TRUE for the current record and FALSE for all historical versions of the record. Using a datetime column to indicate the effective start date for each version of the dimension member. Using both a Boolean flag and an effective date column. Later in this module, you will discuss the SCD transformation in SSIS. Note that it does not support type 3 changes. Key AltKey Name Phone City Current 101 C123 Mary New York True Key AltKey Name Phone City Current 101 C123 Mary New York False 102 Seattle True Key AltKey Name Phone OriginalCity CurrentCity EffectiveDate 101 C123 Mary New York 1/1/00 Key AltKey Name Phone OriginalCity CurrentCity EffectiveDate 101 C123 Mary New York Seattle 6/7/11
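For type 2 changes, a dimension table that uses both a Boolean flag and an effective date column, as described above, might be defined as follows. This is a sketch; the table and column names are illustrative assumptions, not taken from the course databases:

```sql
-- Hypothetical type 2 dimension table: each change to a historical attribute
-- inserts a new row with a new surrogate key; the old row is kept but flagged.
CREATE TABLE dw.DimCustomer
(
    CustomerKey   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, generated per version
    AltKey        nvarchar(10) NOT NULL,          -- business (alternate) key from the source
    Name          nvarchar(50) NULL,
    Phone         nvarchar(25) NULL,
    City          nvarchar(50) NULL,              -- historical attribute (type 2)
    EffectiveDate date NOT NULL,                  -- when this version became current
    IsCurrent     bit NOT NULL DEFAULT 1          -- 1 (TRUE) for the current version only
);
```

Applying a type 2 change then means setting IsCurrent = 0 on the existing row and inserting a new row with IsCurrent = 1 and a new EffectiveDate.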

11 Lesson 2: Extracting Modified Data
9: Implementing a Data Extraction Solution Extracting Data with Change Tracking At this point, you can refer to the Temporal Tables topic.
Question: What are the four steps used when extracting rows based on a datetime column?
Answer: The high-level steps your ETL process must perform to use the high water mark technique are:
1. Note the current time.
2. Retrieve the date and time of the previous extraction from a log table.
3. Extract records where the modified date column is later than the last extraction time, but before or equal to the current time you noted in step 1. This disregards any insert or update operations that have occurred since the start of the extraction process.
4. In the log, update the last extraction date and time with the time you noted in step 1.
Question: In the following code to enable CDC on a table, two system stored procedures have been omitted (<xxxxxx> and <yyyyyy>). Choose the correct stored procedures from the options below.

IF (SELECT is_cdc_enabled FROM sys.databases WHERE name = 'Customers') = 0
BEGIN
  EXEC <xxxxxx>
  EXEC <yyyyyy>
    @source_schema = N'dbo',
    @source_name = N'Customers',
    @role_name = NULL,
    @supports_net_changes = 1
END
GO

Answer: <xxxxxx> is sys.sp_cdc_enable_db and <yyyyyy> is sys.sp_cdc_enable_table.
(More notes on the next slide)

12 Options for Extracting Modified Data
9: Implementing a Data Extraction Solution Extract all records Store a primary key and checksum Use a datetime column as a “high water mark” Use Change Data Capture Use Change Tracking Point out the note in the student workbook, and discuss the considerations for propagating deletions in source databases to the data warehouse. Emphasize that, in most scenarios, data warehouses are used to store historical data, so it is common for deletions to not be propagated. In some cases, logical deletions are performed in the data warehouse by setting a “deleted” flag column.
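The "primary key and checksum" option can be sketched in T-SQL with the built-in CHECKSUM function. This is a minimal illustration; the stg.CustomerChecksums table holding the checksums recorded at the last extraction is a hypothetical name, not part of the course databases:

```sql
-- Identify new or modified rows by comparing a freshly computed checksum
-- against the checksum stored for each primary key at the last extraction.
SELECT c.CustomerID
FROM src.Customers AS c
LEFT JOIN stg.CustomerChecksums AS s
    ON c.CustomerID = s.CustomerID
WHERE s.Checksum IS NULL                                 -- no stored checksum: new row
   OR CHECKSUM(c.Name, c.Phone, c.City) <> s.Checksum;   -- checksum differs: modified row
```

Note that CHECKSUM can collide; where that matters, HASHBYTES with a stronger hash is a common alternative, at higher computational cost.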

13 Extracting Rows Based on a Datetime Column
9: Implementing a Data Extraction Solution Note the current time. Retrieve the last extraction time from an extraction log. Extract and transfer records that were modified between the last extraction and the current time. Replace the stored last extraction value with the current time. Point out that you can use an SSIS function such as GETDATE() to store the current time in a user variable. Or you could use the system StartTime variable.
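The four steps above can be sketched in T-SQL as follows. This is a minimal sketch: the stg.ExtractLog table and its SourceName and LastExtractTime columns are assumptions modeled on the module's demo database, not a definitive implementation:

```sql
DECLARE @CurrentTime datetime = GETDATE();      -- Step 1: note the current time
DECLARE @LastExtract datetime =
    (SELECT LastExtractTime FROM stg.ExtractLog
     WHERE SourceName = 'Products');            -- Step 2: retrieve the last extraction time

-- Step 3: extract rows modified after the last extraction, up to the noted time
SELECT *
FROM src.Products
WHERE LastModified > @LastExtract
  AND LastModified <= @CurrentTime;

-- Step 4: replace the stored last extraction value with the noted time
UPDATE stg.ExtractLog
SET LastExtractTime = @CurrentTime
WHERE SourceName = 'Products';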

14 Demonstration: Using a Datetime Column
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Use a Datetime Column to Extract Modified Data Preparation Steps Start the 20767A-MIA-DC and 20767A-MIA-SQL virtual machines. Demonstration Steps Use a Datetime Column to Extract Modified Data Ensure 20767A-MIA-DC and 20767A-MIA-SQL are started, and log onto 20767A-MIA-SQL as ADVENTUREWORKS\Student with the password Pa$$w0rd. In the D:\Demofiles\Mod09 folder, run Setup.cmd as Administrator. In the User Account Control dialog box, click Yes. Start SQL Server Management Studio and connect to the MIA-SQL instance of the database engine using Windows authentication. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Note that the database includes tables in three schemas (dw, src, and stg) to represent the data sources, staging database, and data warehouse in an ETL solution. Right-click each of the following tables and click Select Top 1000 Rows: stg.Products: this table is used for staging product records during the ETL process, and is currently empty. stg.ExtractLog: this table logs the last extraction date for each source system. src.Products: this table contains the source data for products, including a LastModified column that records when each row was last modified. Start Visual Studio and open the IncrementalETL.sln solution in the D:\Demofiles\Mod09 folder. In Solution Explorer, double-click the Extract Products.dtsx SSIS package. On the SSIS menu, click Variables, and note that the package contains two user variables named CurrentTime and LastExtractTime, and then close the window. On the control flow surface, double-click Get Current Time. In the Expression Builder dialog box, note that the expression in this task sets the CurrentTime user variable to the current date and time. Then click Cancel. (More notes on the next slide)

15 Enable Change Data Capture:
9: Implementing a Data Extraction Solution Enable Change Data Capture: Map start and end times to log sequence numbers: Handle null log sequence numbers: Extract changes between log sequence numbers:

EXEC sys.sp_cdc_enable_db

EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name = N'Customers',
  @role_name = NULL,
  @supports_net_changes = 1

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @start_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

IF (@from_lsn IS NULL) OR (@to_lsn IS NULL)
  -- There may have been no transactions in the timeframe

SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Customers(@from_lsn, @to_lsn, 'all');

Explain that failing to check for null log sequence numbers can result in the ambiguous error message: “An insufficient number of arguments were provided for the procedure or function cdc.fn_cdc_get_net_changes”. This generally occurs when no database activity has been logged within the specified time interval. The general error message is caused by the inability to raise an explicit error from within a table-valued function, and is a known limitation.

16 Demonstration: Using Change Data Capture
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Enable Change Data Capture Use Change Data Capture to Extract Modified Data Preparation Steps Complete the previous demonstration in this module. Demonstration Steps Enable CDC on the Customers Table Ensure you have completed the previous demonstration in this module. Maximize SQL Server Management Studio, and in Object Explorer, in the DemoDW database, right- click the src.Customers table and click Select Top 1000 Rows. This table contains source data for customers. Open Using CDC.sql in the D:\Demofiles\Mod09 folder, and in the code window, select the Transact-SQL code under the comment Enable CDC on src.Customers table, and then click Execute. This enables CDC in the DemoDW database, and starts logging modifications to data in the src.Customers table. Select the Transact-SQL code under the comment Select all changed customer records since the last extraction, and then click Execute. This code uses CDC functions to map dates to log sequence numbers, and retrieve records in the src.Customers table that have been modified between the last logged extraction in the stg.ExtractLog table, and the current time. There are no changed records because no modifications have been made since CDC was enabled. Use CDC to Extract Modified Data Select the Transact-SQL code under the comment Insert a new customer, and then click Execute. This code inserts a new customer record. Select the Transact-SQL code under the comment Make a change to a customer, and then click Execute. This code updates a customer record. Select the Transact-SQL code under the comment Now see the net changes, and then click Execute. This code uses CDC functions to map dates to log sequence numbers, and retrieve records in the src.Customers table that have been modified between the last logged extraction in the stg.ExtractLog table, and the current time. Two records are returned. Wait 10 seconds. 
Then select the Transact-SQL code under the comment Check for changes in an interval with no database activity, and then click Execute. (More notes on the next slide)

17 Extracting Data with Change Data Capture
9: Implementing a Data Extraction Solution Identify the endpoint for the extraction (LSN or DateTime). Retrieve the last extraction endpoint from an extraction log. Extract and transfer records that were modified during the LSN range defined by the previous extraction endpoint and the current endpoint. Replace the logged endpoint value with the current endpoint. The technique for extracting data from a CDC-enabled source is similar to that used to extract data with a datetime column. You store the time of the last extraction in the staging database and use it, together with the CDC functions, to identify rows that have changed in the meantime. Typically, you should create a stored procedure in the data source to encapsulate the extraction logic. There is no formal demonstration of this technique in the course workbook, but if you wish, you can show it by using the following steps: Perform the previous demonstrations so that the DemoDW database is in place and CDC has been enabled on the src.Customers table. Run the Create CDC SP.sql script to create a stored procedure that extracts modified customer records. Open the IncrementalETL.sln solution and examine and run the Extract Customers.dtsx package.
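A stored procedure encapsulating this CDC extraction logic might look like the following sketch. The procedure name, the stg.ExtractLog table, and its columns are assumptions modeled on the demonstrations; the capture instance name dbo_Customers reflects CDC being enabled on src-style dbo.Customers:

```sql
CREATE PROCEDURE dbo.ExtractChangedCustomers
AS
BEGIN
    DECLARE @end_time datetime = GETDATE();      -- current extraction endpoint
    DECLARE @start_time datetime =
        (SELECT LastExtractTime FROM stg.ExtractLog WHERE SourceName = 'Customers');

    -- Map the datetime endpoints to log sequence numbers
    DECLARE @from_lsn binary(10) =
        sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @start_time);
    DECLARE @to_lsn binary(10) =
        sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

    -- Guard against intervals in which no database activity was logged
    IF (@from_lsn IS NOT NULL) AND (@to_lsn IS NOT NULL)
        SELECT *
        FROM cdc.fn_cdc_get_net_changes_dbo_Customers(@from_lsn, @to_lsn, 'all');

    -- Replace the logged endpoint with the current endpoint
    UPDATE stg.ExtractLog
    SET LastExtractTime = @end_time
    WHERE SourceName = 'Customers';
END;
```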

18 The CDC Control Task and Data Flow Components
9: Implementing a Data Extraction Solution Initial Extraction Incremental Extraction CDC Control Mark Initial Load Start Source Staged Inserts Mark Initial Load End CDC State Table CDC State Variable Get Processing Range CDC Source Mark Processed Range CDC Splitter Staged Updates Staged Deletes Data Flow 1 2 3 4 CDC A CDC Control Task records the starting LSN. A data flow extracts all records. A CDC Control task records the ending LSN. CDC Control Task establishes the range of LSNs to be extracted. A CDC Source extracts records and CDC metadata. Optionally, a CDC Splitter splits the data flow into inserts, updates, and deletes. A CDC Control task records the ending LSN.

19 Demonstration: Using CDC Components
9: Implementing a Data Extraction Solution In this demonstration, you will see how to use the CDC Control Task to: Perform an Initial Extraction Extract Changes Preparation Steps Complete the previous demonstrations in this module. Demonstration Steps Perform an Initial Extraction Ensure you have completed the previous demonstrations in this module. Maximize SQL Server Management Studio and open the CDC Components.sql script file in the D:\Demofiles\Mod09 folder. Note that the script enables CDC for the src.Shippers table, and then click Execute. In Object Explorer, right-click each of the following tables in the DemoDW database, and click Select Top 1000 Rows to view their contents: src.Shippers: this table should contain four records. stg.ShipperDeletes: this table should be empty. stg.ShipperInserts: this table should be empty. stg.ShipperUpdates: this table should be empty. Maximize Visual Studio, in which the IncrementalETL.sln solution should be open, and in Solution Explorer, double-click the Extract Initial Shippers.dtsx SSIS package. Note that the CDC Control Tasks in the control flow contain errors, which you will resolve. Double-click the Mark Initial Load Start CDC Control Task, and in its editor, set the following properties. Then click OK: SQL Server CDC database ADO.NET connection manager: localhost DemoDW ADO NET. CDC control operation: Mark initial load start. Variable containing the CDC state: click New and create a new variable named CDC_State. Automatically store state in a database table: selected. Connection manager for the database where the state is stored: localhost DemoDW ADO NET. Table to use for storing state: click New, and then click Run to create the cdc_states table. State name: CDC_State. (More notes on the next slide)

20 Change Tracking Enable Change Tracking
9: Implementing a Data Extraction Solution Enable Change Tracking Record the current version and extract the initial data Extract changes since the last extracted version, and then update the last extracted version

ALTER DATABASE Sales
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON)

ALTER TABLE Salespeople
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF)

Point out that, until any changes are made, the CHANGETABLE function will return an empty rowset. To retrieve the initial data, you must perform a regular SELECT statement against the source table and store the initial version number which, before any changes, is 0. On subsequent occasions, you can use the CHANGETABLE function to identify rows that have changed.

SET @previous_version = CHANGE_TRACKING_CURRENT_VERSION();
SELECT * FROM Salespeople;

SET @current_version = CHANGE_TRACKING_CURRENT_VERSION();
SELECT s.*
FROM CHANGETABLE(CHANGES Salespeople, @previous_version) AS CT
INNER JOIN Salespeople s ON CT.SalespersonID = s.SalespersonID;

Tip: Use snapshot isolation to ensure consistency

21 Demonstration: Using Change Tracking
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Enable Change Tracking Use Change Tracking Preparation Steps Complete the previous demonstrations in this module. Demonstration Steps Enable Change Tracking Ensure you have completed the previous demonstrations in this module. Maximize SQL Server Management Studio, and in Object Explorer, in the DemoDW database, right- click the src.Salespeople table and click Select Top 1000 Rows. This table contains source data for sales employees. Open Using CT.sql in the D:\Demofiles\Mod09 folder. Then select the Transact-SQL code under the comment Enable Change Tracking, and click Execute. This enables CT in the DemoDW database, and starts logging changes to data in the src.Salespeople table. Select the Transact-SQL code under the comment Obtain the initial data and log the current version number, and then click Execute. This code uses the CHANGE_TRACKING_CURRENT_VERSION function to determine the current version, and retrieves all records in the src.Salespeople table. Select the Transact-SQL code under the comment Insert a new salesperson, and then click Execute. Select the Transact-SQL code under the comment Update a salesperson, and then click Execute. Select the Transact-SQL code under the comment Retrieve the changes between the last extracted and current versions, and then click Execute. In Object Explorer, in the DemoDW database, right-click the src.Salespeople table and click Select Top 1000 Rows. Close SQL Server Management Studio.

22 Extracting Data with Change Tracking
9: Implementing a Data Extraction Solution Retrieve the last version number that was extracted from an extraction log. Extract and transfer records that were modified since the last version, retrieving the current version number. Replace the logged version number with the current version number. The technique for extracting data from a CT-enabled source is similar to that used to extract data with a datetime column. You store the time of the last extraction in the staging database and use it, together with the CT functions, to identify rows that have changed in the meantime. Typically, you should create a stored procedure in the data source to encapsulate the extraction logic. There is no formal demonstration of this technique in the course workbook, but if you wish, you can show it by using the following steps: Perform the previous demonstrations so that the DemoDW database is in place and CT has been enabled on the src.Salespeople table. Run the Create CT SP.sql script to create a stored procedure that extracts modified customer records. Open the IncrementalETL.sln solution; examine and run the Extract Salesperson.dtsx package.
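A stored procedure encapsulating this Change Tracking extraction logic might look like the following sketch. The procedure name and the stg.ExtractLog table with a LastVersion column are assumptions modeled on the demonstrations, not a definitive implementation:

```sql
CREATE PROCEDURE dbo.ExtractChangedSalespeople
AS
BEGIN
    -- Note the current version before extracting
    DECLARE @current_version bigint = CHANGE_TRACKING_CURRENT_VERSION();
    DECLARE @last_version bigint =
        (SELECT LastVersion FROM stg.ExtractLog WHERE SourceName = 'Salespeople');

    -- Extract rows changed since the last extracted version;
    -- LEFT JOIN so that deleted rows (no longer in the source table) still appear
    SELECT CT.SalespersonID, CT.SYS_CHANGE_OPERATION, s.*
    FROM CHANGETABLE(CHANGES src.Salespeople, @last_version) AS CT
    LEFT JOIN src.Salespeople AS s ON CT.SalespersonID = s.SalespersonID;

    -- Replace the logged version number with the current version number
    UPDATE stg.ExtractLog
    SET LastVersion = @current_version
    WHERE SourceName = 'Salespeople';
END;
```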

23 Lab A: Extracting Modified Data
9: Implementing a Data Extraction Solution Exercise 4: Using Change Tracking Point out that: Lab instructions are deliberately designed to be high level so that students need to think carefully about what they are trying to accomplish and work out how best to proceed. Encourage students to read the scenario information carefully and collaborate with each other to meet the requirements. Remind students that, if they find a particular task or exercise too challenging, there are step-by-step instructions in the lab answer key. Like all other labs in this course, students must start by running a setup script to prepare the lab environment. Exercise 1: Using a Datetime Column to Incrementally Extract Data The InternetSales and ResellerSales databases contain source data for your data warehouse. The sales order records in these databases include a LastModified date column that is updated with the current date and time when a row is inserted or updated. You have decided to use this column to implement an incremental extraction solution that compares record modification times to a logged extraction date and time in the staging database. This restricts data extractions to rows that have been modified since the previous refresh cycle. Exercise 2: Using Change Data Capture The Internet Sales database contains a Customers table that does not include a column to indicate when records were inserted or modified. You plan to use the CDC feature of SQL Server Enterprise Edition to identify records that have changed between data warehouse refresh cycles, and restrict data extractions to include only modified rows. Exercise 3: Using the CDC Control Task The HumanResources database contains an Employee table in which employee data is stored. You plan to use the CDC feature of SQL Server Enterprise Edition to identify modified rows in this table. 
You also plan to use the CDC Control Task in SSIS to manage the extractions from this table by creating a package to perform the initial extraction of all rows, and a second package that uses the CDC data flow components to extract rows that have been modified since the previous extraction. Exercise 4: Using Change Tracking The ResellerSales database contains a Resellers table that does not include a column to indicate when records were inserted or modified. You plan to use the CT feature of SQL Server to identify records that have changed between data warehouse refresh cycles, and restrict data extractions to include only modified rows. Logon Information Virtual machine: 20767A-MIA-SQL User name: ADVENTUREWORKS\Student Password: Pa$$w0rd Estimated Time: 60 minutes.

24 20767A Lab Scenario 9: Implementing a Data Extraction Solution You have developed SSIS packages that extract data from various data sources and load it into a staging database. However, the current solution extracts all source records each time the ETL process is run. This results in unnecessary processing of records that have already been extracted and consumes a sizeable amount of network bandwidth to transfer a large volume of data. To resolve this problem, you must modify the SSIS packages to extract only data that has been added or modified since the previous extraction.

25 Lesson 3: Loading Modified Data
9: Implementing a Data Extraction Solution Demonstration: Partition Switching
Question: Which of these options is NOT used in a MERGE statement?
( ) Option 1: WHEN NOT MATCHED
( ) Option 2: USING
( ) Option 3: ON
( ) Option 4: WHEN MATCHED
( ) Option 5: OFF
(√) Answer: Option 5: OFF
Question: Is the following statement true or false? “The Lookup transformation uses an in-memory cache to optimize performance.”
(√) Answer: True

26 Options for Incrementally Loading Data
9: Implementing a Data Extraction Solution Insert, Update, or Delete from CDC Output Tables Use a Lookup transformation Use the Slowly Changing Dimension transformation Use the MERGE statement Use a checksum Considerations for Deleting Data Warehouse records: Use a logical deletion technique Technique depends on how deleted records are staged

27 Using CDC Output Tables
9: Implementing a Data Extraction Solution Staging DB Data Warehouse Execute SQL Task INSERT… FROM UPDATE… FROM JOIN ON BizKey DELETE WHERE BizKey IN or Source Staged Inserts Staged Updates Staged Deletes Destination Dimension Table OLE DB Command UPDATE… UPDATE… or DELETE… Staging and Data Warehouse Co-located Remote Data Warehouse Data Flow
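Where the staging database and data warehouse are co-located, the Execute SQL task statements outlined on the slide might look like the following sketch. Table names follow the module's demonstration; the exact column names and the Deleted flag column are assumptions:

```sql
-- Apply staged inserts
INSERT INTO dw.DimShipper (ShipperBizKey, ShipperName)
SELECT ShipperBizKey, ShipperName
FROM stg.ShipperInserts;

-- Apply staged updates by joining on the business key
UPDATE d
SET d.ShipperName = u.ShipperName
FROM dw.DimShipper AS d
JOIN stg.ShipperUpdates AS u ON d.ShipperBizKey = u.ShipperBizKey;

-- Apply staged deletes as logical deletes, preserving history in the warehouse
UPDATE dw.DimShipper
SET Deleted = 1
WHERE ShipperBizKey IN (SELECT ShipperBizKey FROM stg.ShipperDeletes);
```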

28 Demonstration: Using CDC Output Tables
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Load Data from CDC Output Tables Preparation Steps Start the 20767A-MIA-DC and 20767A-MIA-SQL virtual machines. Demonstration Steps Load Data from CDC Output Tables Ensure you have completed the previous demonstrations in this module. Start SQL Server Management Studio and connect to the MIA-SQL instance of the SQL Server database engine by using Windows authentication. In Object Explorer, expand Databases, expand DemoDW, and expand Tables. Then right-click each of the following tables and click Select Top 1000 Rows: dw.DimShipper. This is the dimension table in the data warehouse. stg.ShipperDeletes. This is the table of records that have been deleted in the source system. stg.ShipperInserts. This is the table of new records in the source system. stg.ShipperUpdates. This is the table of rows that have been updated in the source system. Start Visual Studio and open the IncrementalETL.sln solution in the D:\Demofiles\Mod09 folder. Then in Solution Explorer, double-click the Load Shippers.dtsx SSIS package. On the control flow surface, double-click the Load Inserted Shippers Execute SQL task. Note that the SQL Statement inserts data into dw.DimShippers from the stg.ShipperInserts table. Then click Cancel. On the control flow surface, double-click the Load Updated Shippers Execute SQL task. Note that the SQL Statement updates data in dw.DimShippers with new values from the stg.ShipperUpdates table. Then click Cancel. On the control flow surface, double-click the Load Deleted Shippers data flow task. On the data flow surface, note that the task extracts data from the stg.ShipperDeletes table, and then uses an OLE DB Command transformation to update the Deleted column in dw.DimShippers for the extracted rows. (More notes on the next slide)

29 The Lookup Transformation
9: Implementing a Data Extraction Solution Redirect non-matched rows to the no match output Look up extracted data in a dimension or fact table based on a business key or unique combination of keys If no match is found, insert a new record Optionally, if a match is found, update non-key columns to apply a type 1 change Point out that you can choose to discard rows that match existing records by not consuming the match output from the Lookup transformation. In effect, this design assumes that all dimension or fact attributes are fixed and cannot be changed in the data warehouse. Alternatively, you can connect the match output to an OLE DB Command transformation to update any changeable columns. For dimensions, this approach is an implementation of a type 1 change.

30 Demonstration: Using the Lookup Transformation
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Use a Lookup Transformation to Insert Rows Use a Lookup Transformation to Insert and Update Rows Preparation Steps Complete the previous demonstrations in this module. Demonstration Steps Use a Lookup Transformation to Insert Rows Ensure you have completed the previous demonstrations in this module. Maximize Visual Studio, and in Solution Explorer, double-click the Load Geography.dtsx SSIS package. On the control flow surface, double-click Load Geography Dimension to view the data flow surface. On the data flow surface, double-click Staged Geography Data. Note that the SQL command used by the OLE DB source extracts geography data from the stg.Customers and stg.Salespeople tables, and then click Cancel. On the data flow surface, double-click Lookup Existing Geographies and note the following configuration settings of the Lookup transformation. Then click Cancel: On the General tab, unmatched rows are redirected to the no match output. On the Connection tab, the data to be matched is retrieved from the dw.DimGeography table. On the Columns tab, the GeographyKey column is retrieved for rows where the input columns are matched. On the data flow surface, note that the data flow arrow connecting Lookup Existing Geographies to New Geographies represents the no match data flow. Double-click New Geographies, and note that the rows in the no match data flow are inserted into the dw.DimGeography table. Then click Cancel. On the Debug menu, click Start Debugging, and observe the data flow as it executes. Note that, while four rows are extracted from the staging tables, only one does not match an existing record. The new record is loaded into the data warehouse, and the rows that match existing records are discarded. When execution is complete, on the Debug menu, click Stop Debugging. (More notes on the next slide)

31 The Slowly Changing Dimension Transformation
9: Implementing a Data Extraction Solution (Diagram: outputs of the SCD transformation and their destinations) Changing Attributes (Type 1): OLE DB Command to update existing record. Inferred members: OLE DB Command to insert minimal record. New records: OLE DB Destination for new records. Historical Attributes (Type 2): Derived Column to add current row indicator column; OLE DB Command to set existing record's current row indicator to False; Union All; Derived Column sets current row indicator to True.

32 Matches source and target rows
The MERGE Statement 9: Implementing a Data Extraction Solution Matches source and target rows Performs insert, update, or delete operations based on row matching results Point out that the second MERGE example treats all changes as type 2.
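In outline, the pattern looks like the following sketch against the demonstration's fact table. Matching on OrderNo and ItemNo and updating Cost come from the demonstration steps; the remaining column names are illustrative assumptions:

```sql
-- Hedged sketch of a MERGE-based incremental load. Columns other than
-- OrderNo, ItemNo, and Cost are illustrative assumptions.
MERGE dw.FactSalesOrders AS tgt
USING (SELECT OrderNo, ItemNo, ProductKey, Quantity, Cost
       FROM stg.SalesOrders) AS src
    ON tgt.OrderNo = src.OrderNo
   AND tgt.ItemNo = src.ItemNo
WHEN MATCHED THEN
    -- Existing fact row: overwrite with the staged values (type 1 behavior).
    UPDATE SET tgt.Quantity = src.Quantity, tgt.Cost = src.Cost
WHEN NOT MATCHED BY TARGET THEN
    -- New fact row: insert it.
    INSERT (OrderNo, ItemNo, ProductKey, Quantity, Cost)
    VALUES (src.OrderNo, src.ItemNo, src.ProductKey, src.Quantity, src.Cost);
```

A WHEN NOT MATCHED BY SOURCE clause could additionally delete or flag target rows missing from the source, which is how MERGE covers all three operations from a single pass over the matched data.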

33 Demonstration: Using the Merge Statement
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Use the MERGE Statement Preparation Steps Complete the previous demonstrations in this module. Demonstration Steps Use the MERGE Statement In the D:\Demofiles\Mod09\LoadModifiedData folder, run SetupB.cmd as Administrator. In a User Account Control dialog box, click Yes. In SQL Server Management Studio, in Object Explorer, right-click the stg.SalesOrders table and click Select Top 1000 Rows. This table contains staged sales order data. Right-click the dw.FactSalesOrders table and click Select Top 1000 Rows. This table contains sales order fact data. Note that the staged data includes three order records that do not exist in the data warehouse fact table (with OrderNo and ItemNo values of 1005 and 1; 1006 and 1; and and 2 respectively), and one record that does exist but for which the Cost value has been modified (OrderNo 1004, ItemNo 1). Open the Merge Sales Orders.sql file in the D:\Demofiles\Mod09\LoadModifiedData folder and view the Transact-SQL code it contains, noting the following details: The MERGE statement specifies the DemoDW.dw.FactSalesOrders table as the target. A query that returns staged sales orders and uses joins to look up dimension keys in the data warehouse is specified as the source. The target and source tables are matched on the OrderNo and ItemNo columns. Matched rows are updated in the target. Unmatched rows are inserted into the target. Click Execute and note the number of rows affected. Right-click the dw.FactSalesOrders table and click Select Top 1000 Rows. Then compare the contents of the table with the results of the previous query you performed in step 3. Minimize SQL Server Management Studio.

34 Switch loaded tables into partitions Partition-align indexed views
Partition Switching 9: Implementing a Data Extraction Solution Switch loaded tables into partitions Partition-align indexed views (Diagram: a load table with a matching schema is switched into an empty partition of the partitioned fact table)

35 Demonstration: Partition Switching
9: Implementing a Data Extraction Solution In this demonstration, you will see how to: Split a Partition Create a Load Table Switch a Partition Preparation Steps Complete the previous demonstrations in this module. Demonstration Steps Split a Partition Ensure you have completed the previous demonstration in this module (Using the Merge Statement). Maximize SQL Server Management Studio and open the Load Partition.sql file in the D:\Demofiles\Mod09\LoadModifiedData folder. Select the code under the comment Create a partitioned table, and then click Execute. This creates a database with a partitioned fact table, on which a columnstore index has been created. Select the code under the comment View partition metadata, and then click Execute. This shows the partitions in the table with their starting and ending range values, and the number of rows they contain. Note that the partitions are shown once for each index (or for the heap if no clustered index exists). Note that the final partition (4) is for key values of or higher and currently contains no rows. Select the code under the comment Add a new filegroup and make it the next used, and then click Execute. This creates a filegroup, and configures the partition scheme to use it for the next partition to be created. Select the code under the comment Split the empty partition at the end, and then click Execute. This splits the partition function to create a new partition for keys with the value or higher. Select the code under the comment View partition metadata again, and then click Execute. This time the query is filtered to avoid including the same partition multiple times. Note that the table now has three empty partitions (1, 4 and 5). Create a Load Table Select the code under the comment Create a load table, and then click Execute. This creates a table on the same filegroup as partition 4, with the same schema as the partitioned table. Select the code under the comment Bulk load new data, and then click Execute. 
This inserts the data to be loaded into the load table (in a real solution, this would typically be bulk loaded from staging tables). (More notes on the next slide)
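The split-and-switch sequence in this demonstration follows a standard pattern. The sketch below outlines it; all object names (PF_FactDate, PS_FactDate, FG5, dw.FactTable, dw.FactTable_Load) and the boundary value 20170101 are illustrative assumptions, not the demonstration's actual names:

```sql
-- Hedged outline of the split-and-switch pattern from the steps above.
-- All names and the boundary value are illustrative assumptions.

-- 1. Nominate the filegroup that will hold the next partition created.
ALTER PARTITION SCHEME PS_FactDate NEXT USED FG5;

-- 2. Split the empty partition at the end of the range, creating a new
--    empty partition for the incoming keys.
ALTER PARTITION FUNCTION PF_FactDate() SPLIT RANGE (20170101);

-- 3. Switch the bulk-loaded table into the empty target partition.
--    The load table must live on the same filegroup as the partition,
--    have an identical schema and indexes, and carry a CHECK constraint
--    proving its rows fall within the partition's boundaries.
ALTER TABLE dw.FactTable_Load SWITCH TO dw.FactTable PARTITION 4;
```

Because SWITCH is a metadata-only operation, the load completes almost instantly regardless of row count, which is why the data is bulk loaded into a separate load table first.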

36 Lesson 4: Temporal Tables
9: Implementing a Data Extraction Solution Change Data Capture Compared to System-Versioned Tables Question When you create a System-Versioned table, what is the minimum SQL Server creates in terms of tables, columns and keys? Answer Current Data Table. Historical Data Table. SysStartTime Column. SysEndTime Column. Primary Key. Which of these options is not a sub-clause of FOR SYSTEM_TIME? ( )Option 1: AS OF <date_time> ( )Option 2: BEFORE <date_time> ( )Option 3: FROM <start_date_time> TO <end_date_time> ( )Option 4: BETWEEN <start_date_time> AND <end_date_time> ( )Option 5: CONTAINED IN (<start_date_time>, <end_date_time>) (√) Option 2: BEFORE <date_time>

37 About System-Versioned Tables
9: Implementing a Data Extraction Solution System-versioned tables summary: Enable a full history of data changes Current and historical tables operate as a pair Feature can be added to existing tables Historical table can be named, or take system name Current table must have a primary key System-versioned tables operation: Versioning is automatic SysStartTime and SysEndTime columns define the validity period for data versions History specific queries are used to obtain version information A primary key is established on the current table and used to provide a relationship with the historical table. It is this relationship that defines how versioning works. With many data changes within a data warehouse, it is essential that version control is complete and robust.

38 Considerations for System-Versioned Tables
9: Implementing a Data Extraction Solution Main considerations when using system-versioned tables: Current table must have a primary key SysStartTime and SysEndTime must be datetime2 Data in the historical table cannot be directly modified Current table cannot be truncated when SYSTEM_VERSIONING is ON FILETABLE or FILESTREAM are not supported History table is PAGE compressed by default
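System-versioning can also be added to an existing table, as the demonstration later in this lesson shows. The sketch below illustrates the pattern; the table name dbo.Department, the history-table name, and the constraint names are illustrative assumptions. Defaults are required on the new period columns so that existing rows can be populated:

```sql
-- Hedged sketch: adding system-versioning to an existing table.
-- Table, history-table, and constraint names are illustrative assumptions.
ALTER TABLE dbo.Department ADD
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL
        CONSTRAINT DF_Department_SysStart DEFAULT SYSUTCDATETIME(),
    SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL
        CONSTRAINT DF_Department_SysEnd
        DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime);

-- Turn versioning on, naming the history table explicitly.
ALTER TABLE dbo.Department
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.DepartmentHistory));
```

From this point on, every update or delete automatically writes the prior row version to the history table; no trigger or ETL logic is required.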

39 Creating System-Versioned Tables
9: Implementing a Data Extraction Solution Create a new system-versioned table: CREATE TABLE dbo.Employee ( EmployeeID int NOT NULL PRIMARY KEY CLUSTERED, ManagerID int NULL, FirstName varchar(50) NOT NULL, LastName varchar(50) NOT NULL, SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL, SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL, PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime) ) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

40 Querying System-Versioned Tables
9: Implementing a Data Extraction Solution System-versioned tables can be queried using the FOR SYSTEM_TIME clause and one of the following four sub-clauses: AS OF <date_time> FROM <start_date_time> TO <end_date_time> BETWEEN <start_date_time> AND <end_date_time> CONTAINED IN (<start_date_time>, <end_date_time>) Or use ALL to return everything
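Using the dbo.Employee table created earlier in this lesson, the sub-clauses can be sketched as follows; the date literals are illustrative assumptions:

```sql
-- Hedged examples against the dbo.Employee table defined earlier.
-- Date values are illustrative assumptions.

-- The state of the table as it was at a single point in time:
SELECT EmployeeID, ManagerID, FirstName, LastName
FROM dbo.Employee
FOR SYSTEM_TIME AS OF '2017-01-01T00:00:00';

-- Row versions that were active at any point during an interval:
SELECT EmployeeID, LastName, SysStartTime, SysEndTime
FROM dbo.Employee
FOR SYSTEM_TIME BETWEEN '2017-01-01' AND '2017-06-30';

-- Every current and historical row version:
SELECT EmployeeID, LastName, SysStartTime, SysEndTime
FROM dbo.Employee
FOR SYSTEM_TIME ALL
ORDER BY EmployeeID, SysStartTime;
```

The clause transparently unions the current and history tables, so no explicit join against dbo.EmployeeHistory is needed in any of these queries.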

41 Demonstration: Creating System-Versioned Tables
9: Implementing a Data Extraction Solution In this demonstration you will learn: How to create a system-versioned table How to update an existing table to make it system-versioned You will also look at the structure of the tables in SSMS Demonstration Steps Create System-Versioned Tables In the D:\Demofiles\Mod09\Temporal folder, run SetupC.cmd as Administrator. In a User Account Control dialog box, click Yes. When prompted, press any key to continue. Start SQL Server Management Studio and connect to the MIA-SQL instance of the database engine using Windows authentication. Open the Demo.ssmssln solution in the D:\Demofiles\Mod09\Temporal\Demo folder. If a Microsoft SQL Server Management Studio dialog box appears, click OK. In Solution Explorer, open the 6 – Temporal Tables.sql script file. Select the Transact-SQL code under the comment Step 1: Connect to AdventureWorks, and then click Execute. Select the Transact-SQL code under the comment Step 2: Create System-Versioned Table, and then click Execute. In Object Explorer, expand Databases, expand AdventureWorks, expand Tables, and point out that the object name has "(System-Versioned)" in it to help identify these tables. Expand the table to show the history table located under the main table node. Expand the columns to show the tables are the same, other than the PK. Select the Transact-SQL code under the comment Step 4 - Alter Existing Table, and then click Execute. In Object Explorer, point out that the object name has "(System-Versioned)" in it to help identify these tables. Expand the table to show the history table located under the main table node. Expand the columns to show the tables are the same, other than the PK. Select the Transact-SQL code under the comment Step 6 - Show Dates, and then click Execute. Close SQL Server Management Studio. If prompted, do not save changes.

42 Using System-Versioned Tables to Implement Slowly Changing Dimensions
9: Implementing a Data Extraction Solution Slowly Changing Dimensions (SCD): SCD manages incoming data: Changed, historical, fixed, disallowed, inferred SCD change types and transformation outputs: Changing Attributes Updates Historical Attribute Insert Output and New Fixed Attribute Inserts Inferred Member The SCD wizard: Choose data source and dimension table Configure mapping Set attribute options Review, run and update

43 Change Data Capture Compared to System-Versioned Tables
9: Implementing a Data Extraction Solution
Parameter: CDC | System-Versioning
Usage: Rapid data change, short retention; ETL, data checks, fact tables; Transact-SQL options, SSMS, and Visual Studio | Slower data change, long retention; SCD, audit, fixing corrupt data, period reporting, and comparison; Transact-SQL and SSMS
Deployment: Database and table levels | Table level
Methodology: Transaction logs many tables | One historical table per source table
Scope: Down to row/column level | Source table tracked at row level
Performance: Handles large data amounts quickly | Large data amounts slower
Dependencies: SQL Server Agent | None
Operations: Can be quickly enabled/disabled at database level | Requires each table to be enabled/disabled individually
Operational: Easy to enable/disable for maintenance, and so on | More difficult
Primary Tables: No extra columns in primary | Requires SysStartTime and SysEndTime period columns
Primary Keys: Not required | Required
The reasons for using CDC and/or System-Versioning are complex and require great thought. However, the overriding principle is how quickly the data changes and how long it will be retained for.

44 Lab B: Loading a Data Warehouse
9: Implementing a Data Extraction Solution Exercise 4: Using the MERGE Statement Point out that the lab instructions are deliberately designed to be high level so that students need to think carefully about what they are trying to accomplish and work out how best to proceed. Encourage students to read the scenario information carefully and collaborate with each other to meet the requirements. Remind students that, if they find a particular task or exercise too challenging, there are step-by-step instructions in the lab answer key. Point out that, like all other labs in this course, students must start by running a setup script to prepare the lab environment. Exercise 1: Loading Data from CDC Output Tables The staging database in your ETL solution includes tables named as follows: EmployeeInserts, containing employee records that have been inserted in the employee source system. EmployeeUpdates, containing records modified in the employee source system. EmployeeDeletes, containing records that have been deleted in the employee source system. You must use these tables to load and update the DimEmployee dimension table, which uses a Deleted flag to indicate records that have been deleted in the source system. Exercise 2: Using a Lookup Transformation to Insert or Update Dimension Data Another BI developer has partially implemented an SSIS package to load product data into a hierarchy of dimension tables. You must complete this package by creating a data flow that uses a Lookup transformation to determine whether a product dimension record already exists, and then insert or update a record in the dimension table accordingly. Exercise 3: Implementing a Slowly Changing Dimension You have an existing SSIS package that uses an SCD transformation to load reseller dimension records into a data warehouse. You want to examine this package and then create a new one that uses an SCD transformation to load customer dimension records into the data warehouse. 
Exercise 4: Using the MERGE Statement Your Staging database is located on the same server as the data warehouse and you want to take advantage of this colocation of data and use the MERGE statement to insert and update staged data into the Internet sales fact table. An existing package already uses this technique to load data into the reseller sales fact table. Logon Information Virtual machine: 20767A-MIA-SQL User name: ADVENTUREWORKS\Student Password: Pa$$w0rd Estimated Time: 60 minutes

45 20767A Lab Scenario
9: Implementing a Data Extraction Solution You are ready to start developing the SSIS packages that load data from the staging database into the data warehouse.

46 Module Review and Takeaways
9: Implementing a Data Extraction Solution Review Question(s) Review Question(s) Question What should you consider when choosing between Change Data Capture and Change Tracking? Answer CDC is available in all editions of SQL Server 2012, 2014, and 2016. CT is available in all editions of SQL Server 2008, 2008R2, 2012, 2014, and 2016. With CDC, you can record every change made to a record within a specified interval. With CT, you can retrieve only the most recent changes. What should you consider when deciding whether or not to use the MERGE statement to load staging data into a data warehouse? The MERGE statement requires access to both the source and target tables, so can only be used in the following scenarios: The staging tables and data warehouse are co-located in the same SQL Server database. The staging and data warehouse tables are located in multiple databases in the same SQL Server instance, and the credentials used to execute the MERGE statement have appropriate user rights in both databases. The staging tables are located in a different SQL Server instance than the data warehouse, but a linked server has been defined that enables the MERGE statement to access both databases. Additionally, you should consider whether you need to track historical changes, as this may require extremely complex Transact-SQL. In this scenario, an SCD transformation or a custom SSIS data flow may be easier to develop and maintain.

