
1 Data Warehouse ETL By Garrett Edmondson Thanks to our Gold Sponsors:

2 (c) 2011 Microsoft. All rights reserved.
Garrett Edmondson MCITP, MCSE – BLOG – more videos!

3 ETL = Move Data Over the Network
The network is the slowest part of any data warehouse!
Minimize transformations: load data from the source(s) as fast as possible
Incremental loads: pull the least amount of data possible

4 ETL Types
OLTP ETL-ish: Mirroring (or, in SQL Server 2012, AlwaysOn Availability Groups), Replication (Transactional, Merge, Peer-to-Peer), Log Shipping
Data Warehouse ETL: data-state loads, Change Data Capture / incremental loads, Integration Services (SSIS), compressed and BCP'd flat files

5 OLTP-Based ETL-ish
Easy to set up; DBAs are very familiar with replication technologies
3rd NF (typically non-dimensional)
Transactional consistency: replay transactions on the "Data Warehouse" server and support reporting queries
Scalability issues: hundreds of sources/instances?!
Can be used for "real-time" data warehousing, but be very careful (see above)

6 Data Warehouse ETL
Load pattern: typically daily ETL loads
Load the data's state: no need to replay DML
Change Data Capture (CDC): a transaction log reader suited to DW workloads
Convert LSNs to datetime stamps
Net changes since the last ETL run (i.e., row versions)
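The CDC flow above can be sketched in T-SQL. This is a minimal sketch, assuming a hypothetical dbo.Sales source table and a @LastRunTime watermark kept by the ETL process; the cdc.fn_cdc_get_net_changes_dbo_Sales name is generated by SQL Server from the capture instance.

```sql
-- Enable CDC on the database and on a (hypothetical) source table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Sales',
    @role_name     = NULL;

-- Pull only the net changes since the last ETL run,
-- converting between LSNs and datetime stamps for the DW load
DECLARE @LastRunTime datetime = '20111001';  -- watermark from the previous run
DECLARE @from_lsn binary(10) =
    sys.fn_cdc_map_time_to_lsn('smallest greater than', @LastRunTime);
DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();

SELECT sys.fn_cdc_map_lsn_to_time(__$start_lsn) AS ChangeTime,
       __$operation,                  -- 1 = delete, 2 = insert, 4 = update
       *
FROM cdc.fn_cdc_get_net_changes_dbo_Sales(@from_lsn, @to_lsn, N'all');
```

Because net changes collapse multiple DML operations per row into one, the warehouse load never has to replay individual transactions.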

7 Data Processing with SSIS - Transformations

8 Transforms: Row Based, Partially Blocking, Blocking
Row Based (synchronous): logically works row by row; no memory is copied; the buffer is reused
Partially Blocking (asynchronous): works with groups of rows; memory is copied; the shape of the buffer can change
Blocking (asynchronous): needs all input rows from all buffers before producing any output rows

9

10 Asynchronous Data Processing in SSIS
Aggregation Demo

11
!?!

12 Asynchronous Processing in SSIS = Linear Performance
Each fully blocking asynchronous component must spool all the rows
More rows = longer processing time; there is no way to process the rows faster
No DB-engine optimizations (query optimizer, statistics, compression, columnstore indexes)
Procedural processing ("do this, then that") versus relational/declarative ("give me that")
Good for VM/SAN solutions as long as processing times are acceptable

13
<rant> NEVER use the OLE DB Command for data warehouse batch load processes. It is pure evil because it issues DML commands on a row-by-row basis. Good luck loading a lot of rows with that! </rant>

14 Fact Data: ELT
ETL – Extract, Transform, Load: extract (SSIS), transform (SSIS), load to the DB engine
ELT – Extract, Load, Transform: transform with the DB engine
ELT advantages: asynchronous (blocking) transforms run much faster; join optimization from the SQL Server query engine; fastest loads come from flat files (PDW dwloader)

15 SSIS: The Right Tool for the Job
Good fit: multiple data sources and destinations; synchronous transformations; workflow management
Be careful with: trickle feed; real-time ETL; asynchronous transformations

Once your operations are defined, you need to figure out if SSIS really is the right tool for the job. First, consider what SSIS is best suited for. SSIS is a great choice if you're pulling in data from multiple sources, or splitting it up and sending it to a number of places. It's also good if your data needs to go through a series of transforms, or you're merging multiple sources of data. Finally, the package designer in BIDS lets you visually lay out your workflow, which for a lot of people is easier than doing everything directly inside of stored procedures with SQL.

You'll want to be careful about using SSIS if your design requires trickle-feed or real-time ETL operations. SSIS can do them, but it was really designed for bulk data loads. The data pipeline is very fast, but the runtime that loads and hosts it can be slow to start up. When you're moving large amounts of data you don't notice this startup cost, but you will if you're running your package every 15 seconds, or only moving a row or two at a time. On one of the first big customer issues I worked on, the customer had a single set of packages for both their bulk loads and their incremental feeds. I say incremental, but it was more like a trickle feed: a web process would kick off all of the packages for one to five rows of customer data. Their solution was big, too, something like 30 packages. These really complex data flows worked great when they moved the entire data set, but seemed to take forever to run through just a couple of rows.

Finally, there are a couple of reasons to use something other than SSIS. If your source and destination databases are on the same server, you'll probably want to do everything in SQL; otherwise you'll be copying all of the data out to SSIS and then pushing it all back in, when it's usually far more efficient to process it directly on the server. Another reason not to use SSIS is a straight file-to-database load (or sometimes even a database-to-database load) without any transformations or control-flow logic. You can do it with SSIS, and if you want the graphical design experience it'll still work, but you can get the same performance, or maybe even a little better, with a BULK INSERT statement or BCP.

Single source and destination server: BULK INSERT works just fine

16 Summary
90% of customers will hit their performance goals with the correct package design
Most tuning and optimization will be done at the database and environment level

17 FlatFile ELT
Data compression: the most efficient way to transfer data over the wire
FlatFile Demo

18 Partition Switching

19 Partition Switching Pattern 1
[Diagram] Source DBs feed SSIS (surrogate key lookup), which runs concurrent bulk inserts (one per core) into a partitioned heap staging fact table on Filegroup A; indexes are then created (SORT_IN_TEMPDB, MAXDOP = 1) and the staging table is SWITCHed into the partitioned fact table (columnstore index).

20 Partition Switching Demo
Table Partitioning Demo.sql
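The demo script itself isn't reproduced here, but the core of the pattern looks roughly like this (partition function, scheme, and table names are hypothetical; the staging table must match the target's schema, indexes, and filegroup, carry a CHECK constraint on the partitioning key, and the target partition must be empty):

```sql
-- Yearly partitions for the fact table
CREATE PARTITION FUNCTION pfSalesYear (int)
    AS RANGE RIGHT FOR VALUES (2002, 2003, 2004);
CREATE PARTITION SCHEME psSalesYear
    AS PARTITION pfSalesYear ALL TO ([PRIMARY]);

-- Metadata-only move of the loaded staging data into the fact table;
-- $PARTITION maps the year value to its partition number
ALTER TABLE dbo.stgFact_2001
    SWITCH TO dbo.Fact PARTITION $PARTITION.pfSalesYear(2001);
```

The switch itself moves no data pages, only metadata, which is why it completes in milliseconds regardless of partition size.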

21 Partition Switching Pattern - FlatFiles
[Diagram] Source data files are bulk inserted concurrently (one per core, the fastest path) into a partitioned heap staging fact table on Filegroup A; indexes are then created (SORT_IN_TEMPDB, MAXDOP = 1) and the staging table is SWITCHed into the partitioned fact table (clustered index).

22 Partitioning Fact Tables
See the Data Loading Performance Guide from the SQLCAT team
Minimally logged operations are key
Best practices: remove indexes (empty tables), or use the ORDER hint to load sorted data; use TABLOCK; insert in parallel (scales linearly to 16 streams if you're not IO bound)
SQL Server 2012 raises the limit to 15,000 partitions per table
Switching requires the same filegroup, and the target partition must be empty
Load the staging table with BULK INSERT with TABLOCK
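The best practices above combine into a simple two-step load, sketched here with hypothetical table, file, and column names:

```sql
-- Minimally logged load into an empty heap (no indexes + TABLOCK),
-- then build the clustered index once, sorting in tempdb and on a
-- single scheduler to keep memory pressure predictable
BULK INSERT dbo.stgFactSales
FROM 'C:\etl\FactSales.dat'
WITH (TABLOCK);

CREATE CLUSTERED INDEX CI_stgFactSales
    ON dbo.stgFactSales (SalesDateKey)
    WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1);
```

If the target already has a clustered index and the file is pre-sorted, the ORDER hint on BULK INSERT avoids the separate sort step instead.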

23 Don't run more than one CREATE CLUSTERED INDEX per filegroup, to avoid page splits
[Diagram: 8-core server, 8 source data files, target database with filegroups A-D holding partitions 1-2, 3-4, 5-6, and 7-8]
Step 1 "Base Load": 8 concurrent bulk inserts into a base heap stage table
Step 2 "Stage Insert": 8 concurrent inserts into heap stage tables with a constraint on the CI partition key
Step 3 "Transform": create clustered indexes with compression INTO the final destination, in 2 sets of 4 concurrent builds
Step 4 "Final Append": 8 concurrent partition switches into the destination partitioned CI table
Determine the number of filegroups and partitions per filegroup by examining available memory and CPU cores. Four concurrent clustered index builds at 40 GB each need 160 GB for sorts, out of 192 GB available. The general CPU rule is to run with half the number of physical cores if compression is used, and parity with physical cores if not. Since up to 4 partitions can be built at a time, 4 independent filegroups is ideal for a table of this size on a system with this much memory and CPU.

24 Bulk Insert
[Diagram] BULK INSERT loads staging tables stgFact_2001 through stgFact_2004 in parallel; each is then SWITCHed into the matching yearly partition (2001-2004) of the Fact table.

25 Large Updates
[Diagram] The affected partition of Fact is SWITCHed out to Fact_Old; the update records are bulk inserted into Fact_Delta; Fact_New is built by applying Fact_Delta to Fact_Old; Fact_New is then SWITCHed back into Fact.
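The large-update pattern above avoids row-by-row UPDATEs entirely. A sketch, assuming hypothetical table names, a partition number of 2 for the affected year, and an Amount column carrying the updated value:

```sql
-- 1. Metadata-only switch of the affected partition out of the fact table
ALTER TABLE dbo.Fact SWITCH PARTITION 2 TO dbo.Fact_Old;

-- 2. Rebuild the partition's data, applying the delta with one
--    set-based join instead of per-row DML
INSERT INTO dbo.Fact_New WITH (TABLOCK)
SELECT o.SalesKey,
       o.SalesYear,
       COALESCE(d.Amount, o.Amount) AS Amount   -- updated value if present
FROM dbo.Fact_Old AS o
LEFT JOIN dbo.Fact_Delta AS d
    ON d.SalesKey = o.SalesKey;

-- 3. Switch the corrected data back in (Fact_New needs a CHECK
--    constraint proving its rows belong to partition 2)
ALTER TABLE dbo.Fact_New SWITCH TO dbo.Fact PARTITION 2;
```

The fact table is only unavailable for the duration of the two metadata switches, not the rebuild itself.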

26 Large Deletes
[Diagram] The 2001 partition of Fact is SWITCHed out to Fact_Temp; only the filtered 2001 rows to keep are bulk inserted into a second staging table; the filtered rows are then SWITCHed back into Fact.
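The large-delete pattern can be sketched the same way (table names, the partition number, and the IsObsolete filter column are hypothetical):

```sql
-- 1. Switch the 2001 partition out to a staging table (metadata only)
ALTER TABLE dbo.Fact SWITCH PARTITION 1 TO dbo.Fact_Temp;

-- 2. Reload only the rows to keep, minimally logged
INSERT INTO dbo.Fact_Keep WITH (TABLOCK)
SELECT *
FROM dbo.Fact_Temp
WHERE SalesYear = 2001
  AND IsObsolete = 0;

-- 3. Switch the surviving rows back in (Fact_Keep needs a CHECK
--    constraint on the partition key), then discard the rest with a
--    TRUNCATE instead of row-by-row deletes
ALTER TABLE dbo.Fact_Keep SWITCH TO dbo.Fact PARTITION 1;
TRUNCATE TABLE dbo.Fact_Temp;
```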

