ISQS 3358, Business Intelligence: Extraction, Transformation, and Loading. Zhangxi Lin, Texas Tech University.

Presentation transcript:

ISQS 3358, Business Intelligence: Extraction, Transformation, and Loading
Zhangxi Lin, Texas Tech University

Outline
Data Integration
Introduction to SSIS
Learn by doing – Exercise 4
More about SSIS features – package development tools

Data Integration

Data integration comprises three major processes:
◦ data access,
◦ data federation, and
◦ change capture.
When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL tools, analysis tools, and data warehousing environments.
Video: ETL Tools (4'56")

Data Integration
Enterprise application integration (EAI): a technology that provides a vehicle for pushing data from source systems into a data warehouse, including application functionality integration. More recently, service-oriented architecture (SOA) has been applied.
Enterprise information integration (EII): an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Extraction, transformation, and load (ETL): a data warehousing process that consists of extraction (reading data from a database), transformation (converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a data warehouse or simply another database), and load (putting the data into the data warehouse).

Transformation Tools: To Purchase or to Build In-House
Issues that affect whether an organization will purchase data transformation tools or build the transformation process itself:
◦ Data transformation tools are expensive.
◦ Data transformation tools may have a long learning curve.
◦ It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools.
Important criteria in selecting an ETL tool:
◦ Ability to read from and write to an unlimited number of data source architectures
◦ Automatic capturing and delivery of metadata
◦ A history of conforming to open standards
◦ An easy-to-use interface for the developer and the functional user

Introduction to SSIS

ETL Topics
Dimension processing
◦ Extract changed rows from the operational database
◦ Handle slowly changing dimensions
◦ De-duplication and fuzzy transforms
Fact processing
◦ Extract fact data from the operational database
◦ Extract fact updates and deletes
◦ Clean fact data
◦ Check data quality and halt package execution when necessary
◦ Transform fact data
◦ Surrogate key pipeline (see the sketch after this list)
◦ Load fact data
◦ Analysis Services processing
Integrating all tasks
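A minimal T-SQL sketch of a surrogate key pipeline, assuming hypothetical staging and dimension tables (stg_Sales, DimProduct, DimDate, FactSales); in an SSIS data flow this is usually built with Lookup transforms rather than joins.

-- Resolve business keys in staged fact rows to dimension surrogate keys
-- before loading the fact table (hypothetical table and column names).
INSERT INTO FactSales (ProductKey, DateKey, SalesAmount)
SELECT dp.ProductKey,      -- surrogate key from the product dimension
       dd.DateKey,         -- surrogate key from the date dimension
       s.SalesAmount
FROM   stg_Sales  AS s
JOIN   DimProduct AS dp ON dp.ProductCode = s.ProductCode   -- business key match
JOIN   DimDate    AS dd ON dd.FullDate    = s.SaleDate;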

Different Types of ETL Control Flows
With data flows, e.g.:
◦ Importing data
◦ Database updates
◦ Loading slowly changing dimensions (SCD)
◦ Database cleansing
◦ Aggregating data
Without data flows, e.g.:
◦ Downloading zipped files
◦ Archiving downloaded files
◦ Reading application logs
◦ Mailing opportunities
◦ Consolidating workflow packages

Structure and Components of Business Intelligence
[Diagram: components include MS SQL Server 2008, BIDS, SSMS, SSIS, SSAS, SSRS, SAS EM, and SAS EG]

What can you do with SSIS?
Load the dimension and fact tables in the database. If the source data for a dimension table is stored in multiple data sources, the package can merge the data into one dataset and load the dimension table in a single process, instead of using a separate process for each data source.
Update data in data warehouses and data marts. The Slowly Changing Dimension Wizard automates support for slowly changing dimensions by dynamically creating the SQL statements that insert and update records, update related records, and add new columns to tables.
Process Analysis Services cubes and dimensions. When the package updates tables in the database that a cube is built on, you can use Integration Services tasks and transformations to automatically process the cube and its dimensions as well.
Compute functions before the data is loaded into its destination. If your data warehouses and data marts store aggregated information, the SSIS package can compute functions such as SUM, AVERAGE, and COUNT. An SSIS transformation can also pivot relational data and transform it into a less-normalized format that is more compatible with the table structure in the data warehouse. (A pre-aggregation sketch follows below.)
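As an illustration of pre-computing aggregates before loading, here is a minimal T-SQL sketch with hypothetical source and destination tables (SalesDetail, DailySalesSummary); in a package this role is usually played by an Aggregate transform in the data flow.

-- Aggregate detail rows before they reach the warehouse destination.
INSERT INTO DailySalesSummary (SaleDate, OrderCount, TotalAmount, AvgAmount)
SELECT SaleDate,
       COUNT(*)    AS OrderCount,
       SUM(Amount) AS TotalAmount,
       AVG(Amount) AS AvgAmount
FROM   SalesDetail
GROUP BY SaleDate;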

YouTube Videos
Introduction to SQL Server Integration Services
◦ Part 1 (10'19"), Part 2 (8'12")
Create a Basic SSIS Package with SSIS (7'55")
How to create a simple SSIS package
◦ Part 1 (3'19"), Part 2 (7'54"), Part 3 (6'41")
More videos
◦ An Overview of SSIS: Part 1 (6'12"), Part 2 (6'13"), Part 3 (8'20")
◦ ETL Demo (10'40")
◦ ETL Tools (4'56")
◦ SSIS 2008 Package Deployment: Part I (8'04"), Part II (5'09")
◦ Introduction to SSIS 2008: Part I (9'57"), Part II (9'57"), Part III (9'55"), Part IV (9'59"), Part V (6'11")
◦ ETL Strategies with SSIS (21'56")

SQL Server Integration Services
The hierarchy of SSIS:
◦ Project -> Package -> Control flow -> Data flow
Package structure:
◦ Control flow
◦ Data flow
◦ Event handlers
◦ Package explorer
◦ Connection tray
Features:
◦ Event driven
◦ Layered
◦ Drag-and-drop programming
◦ Data I/O definitions are done using connection managers

Control Flow
Bulk Insert task: performs a fast load of data from flat files into a target table. Good for loading clean data (see the T-SQL sketch after this list).
Execute SQL task: performs database operations, such as creating views, tables, or even databases. Good for querying data or metadata.
File Transfer Protocol and File System tasks: transfer files or sets of files.
Execute Package, Execute DTS 2000 Package, and Execute Process tasks: break a complex workflow into smaller ones, and define a parent or master package to execute them.
Send Mail task: sends an e-mail message.
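The Bulk Insert task corresponds roughly to the T-SQL BULK INSERT statement; a minimal sketch, assuming a hypothetical comma-delimited file and staging table.

-- Fast load of a flat file into a staging table
-- (hypothetical file path, table name, and delimiters).
BULK INSERT dbo.stg_D5
FROM 'C:\Datasets\D5.txt'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter in the flat file
    ROWTERMINATOR   = '\n',  -- row delimiter
    FIRSTROW        = 2      -- skip the header row
);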

Control Flow (cont'd)
Script and ActiveX Script tasks: perform an endless array of operations that are beyond the scope of the standard tasks.
Data Mining and Analysis Services Processing tasks: launch processing on SSAS dimensions and databases. Use the SSAS DDL task to create new Analysis Services partitions, or to perform any data definition language operation.
XML and Web Services tasks
Message Queue, WMI Data Reader, and WMI Event Watcher tasks: help to build an automated ETL system.
ForEach Loop, For Loop, and Sequence containers: execute a set of tasks multiple times.
Data Flow tasks

Data Flow Task
The Data Flow task is a pipeline in which data is picked up, processed, and written to a destination.
It avoids intermediate I/O, which provides excellent performance.
Concepts:
◦ Data sources
◦ Data destinations
◦ Data transformations
◦ Error flows

Frequently Used Data Transformation Steps
Sort and Aggregate transforms
Conditional Split and Multicast transforms
Union All, Merge Join, and Lookup transforms (a Lookup/Conditional Split sketch follows this list)
Slowly Changing Dimension transform
OLE DB Command transform
Row Count and Audit transforms
Pivot and Unpivot transforms
Data Mining Model Training and Data Mining Query transforms
Term Extraction and Term Lookup transforms
File Extractor and File Injector transforms
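A rough T-SQL analogue of a Lookup followed by a Conditional Split, assuming hypothetical tables (stg_Orders, DimCustomer); in a package these would be two separate transforms wired together in the data flow.

-- Lookup: match each staged order to the customer dimension.
-- Conditional Split: route matched and unmatched rows differently.
SELECT o.OrderID,
       o.CustomerCode,
       c.CustomerKey,
       CASE WHEN c.CustomerKey IS NULL
            THEN 'NoMatch'   -- would go to an error/review output
            ELSE 'Match'     -- would continue to the fact load
       END AS LookupResult
FROM   stg_Orders AS o
LEFT JOIN DimCustomer AS c
       ON c.CustomerCode = o.CustomerCode;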

Demo: Exploring Features of SSIS
Data set:
◦ Source: Commrex_2011, D5.txt (in the shared directory under ~\Datasets)
◦ Destination: flat file, Excel file, OLE DB destination
Data flow transformations (see the T-SQL sketch after this list):
◦ Aggregate (use D5.txt, and aggregate the data with regard to UserID)
◦ Derived Column (use Commrex_2011, and create a new column "NewID")
◦ Data Conversion (use Commrex_2011, and convert the data types of some columns, such as UserID and Prop_ID)
◦ Sort (use D5.txt, sorted ascending by ID, Date, Time)
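For reference, rough T-SQL equivalents of the four transforms used in the demo, with hypothetical staging tables (stg_D5, stg_Commrex_2011) standing in for the flat-file and Commrex_2011 sources.

-- Aggregate: summarize D5 rows per UserID (Aggregate transform).
SELECT UserID, COUNT(*) AS RowsPerUser
FROM   stg_D5
GROUP BY UserID;

-- Derived Column and Data Conversion: build a new column and
-- convert data types (hypothetical column list).
SELECT CAST(UserID  AS int) AS UserID,
       CAST(Prop_ID AS int) AS Prop_ID,
       'U' + CAST(UserID AS varchar(20)) AS NewID   -- derived column
FROM   stg_Commrex_2011;

-- Sort: order D5 rows ascending by ID, Date, Time (Sort transform).
SELECT *
FROM   stg_D5
ORDER BY ID, [Date], [Time];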

LEARN BY DOING – EXERCISE 4

Exercise 4: Populate Maximum Miniatures Manufacturing Data Mart Dimensions
Preparation: data source and destination definitions
◦ Source database: AccountingSystemDatabase
Loading dimensions:
◦ ProductType
◦ ProductSubType
◦ Product
◦ Country
◦ Plant (using SQL Command)
◦ Material (using SQL Command and an Aggregate item)
◦ MachineType (copied from the Material loading task)
◦ Machine (copied from the MachineType loading task)
Note: DimBatch and the fact table will be loaded in the next exercise.
Debugging:
◦ Step by step
◦ Understand the error messages
◦ Watch the database loading status
See the more detailed Guidelines for this exercise.
Submit screenshots of the "green" results of the ETL flow by February 20, before 5 p.m.

Review: A Primary Key
A primary key is a candidate key chosen to be the main key for the relation. If you know the value of the primary key, you will be able to uniquely identify a single row.
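A minimal T-SQL illustration, using a hypothetical DEPARTMENT table:

-- DeptID is the primary key: each value identifies exactly one row.
CREATE TABLE DEPARTMENT (
    DeptID   int         NOT NULL PRIMARY KEY,
    DeptName varchar(50) NOT NULL
);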

Review: A Foreign Key
To preserve relationships, you may need to create a foreign key. A foreign key is a primary key from one table placed into another table. The key is called a foreign key in the table that receives it.

Review: Referential Integrity
Referential integrity states that every value of a foreign key must match a value of an existing primary key.
Example (see previous slide):
◦ If EmpID = 4 in EMPLOYEE has DeptID = 7 (a foreign key), a department with DeptID = 7 must exist in DEPARTMENT.
◦ The primary key value must exist before the foreign key value is entered.
(A T-SQL illustration follows below.)
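Continuing the hypothetical DEPARTMENT example, a foreign key constraint is how the database enforces referential integrity:

-- DeptID in EMPLOYEE is a foreign key referencing DEPARTMENT.
-- The constraint rejects any EMPLOYEE row whose DeptID has no
-- matching DEPARTMENT row.
CREATE TABLE EMPLOYEE (
    EmpID   int         NOT NULL PRIMARY KEY,
    EmpName varchar(50) NOT NULL,
    DeptID  int         NOT NULL
        FOREIGN KEY REFERENCES DEPARTMENT (DeptID)
);

-- Fails unless DEPARTMENT already contains DeptID = 7:
INSERT INTO EMPLOYEE (EmpID, EmpName, DeptID) VALUES (4, 'Pat', 7);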

Snowflake Schema of the Data Mart
[Diagram: snowflake schema with ManufacturingFact in the center and the dimensions DimProduct, DimProductSubType, DimProductType, DimBatch, DimMachine, DimMachineType, DimMaterial, DimPlant, and DimCountry; two loads are annotated "Aggregate" and "SQL Coding". Why follow this order?]

Data Source Lin.AccountingSystemDatabase

Code for the Data Flows
The following code is used to selectively retrieve data from the source for the destination database.
Code for the DimPlant load:
SELECT LocationCode, LocationName, CountryCode
FROM Locations
WHERE LocationType = 'Plant Site'
Code for the DimMaterial load:
SELECT AssetCode, AssetName, AssetClass, LocationCode, Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'

More About SSIS Features

ETL System Debugging
Most frequently encountered errors:
◦ Data format error: the database table's data type does not match the input data's format.
  - Reason 1: a flat text file uses varchar(50), or string [DT_STR], format, while an Excel file uses nvarchar format.
  - Reason 2: you defined the database using different formats, which could be caused by the imported data set.
  - Solution: a Data Conversion transformation node can be used to change the format (see the sketch after this list).
◦ SQL Server system error: even though you did things correctly, you still cannot get through.
  - Solution: the easiest way to solve this problem is to redo the ETL flow.
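Where the mismatch can be handled in SQL rather than with a Data Conversion transform, an explicit CAST/CONVERT does the same job; a minimal sketch with hypothetical staging columns.

-- Convert string columns from a flat-file staging table to the
-- destination's data types before (or while) loading.
SELECT CAST(UserID AS int)            AS UserID,    -- DT_STR -> int
       CAST(Amount AS decimal(10,2))  AS Amount,    -- DT_STR -> decimal
       CONVERT(date, OrderDate, 101)  AS OrderDate  -- 'mm/dd/yyyy' text -> date
FROM   stg_FlatFileImport;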

ETL How-to Problems
How to use the Merge function of a data transformation to join datasets from two tables into one (see the sketch after this list)
How to split a dataset into two tables
How to remove duplicated rows in a table
How to detect changes to rows in the data sources and extract the updated rows into a table in the data warehouse
How to load multiple datasets with a similar structure into one table
Reference: SQL Server 2005 Integration Services, McGraw-Hill Osborne, 2007
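A rough T-SQL view of combining rows from two similarly structured tables into one dataset, which is the role the Merge and Union All transforms play in a data flow; table names are hypothetical.

-- Combine two yearly order tables into one destination table.
INSERT INTO dbo.AllOrders (OrderID, OrderDate, Amount)
SELECT OrderID, OrderDate, Amount FROM dbo.Orders_2010
UNION ALL
SELECT OrderID, OrderDate, Amount FROM dbo.Orders_2011;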

Strategies for Extracting Data
Extracting data from packaged source systems (self-contained data sources)
◦ It may not be good to use their APIs
◦ It may not be good to use their add-on analytic systems
Extracting directly from the source databases
◦ Strategies vary depending on the nature of the source database
Extracting data for incremental loads (see the sketch after this list)
◦ How the source database records changes to its rows
Extracting historical data
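One common incremental-extraction pattern is to pull only rows changed since the previous load, using a last-modified timestamp; a minimal sketch with hypothetical source tables and columns.

-- Extract only source rows modified since the previous successful load.
-- @LastLoadTime would normally come from an ETL control/audit table.
DECLARE @LastLoadTime datetime = '2012-02-01T00:00:00';

SELECT CustomerID, CustomerName, City, LastModified
FROM   src.Customers
WHERE  LastModified > @LastLoadTime;   -- new or changed rows only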

Decision Issues in ETL System Design
Source-to-target mapping
Load frequency
How much history is needed

De-Duplication
Two common situations: duplicate person records and duplicate organization records.
SSIS provides two general-purpose transforms that help address data quality and de-duplication (a sketch of exact de-duplication follows below):
◦ Fuzzy Lookup
◦ Fuzzy Grouping
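Fuzzy Lookup and Fuzzy Grouping handle approximate matches; for exact duplicates, a plain T-SQL approach such as the following also works (hypothetical table and matching columns).

-- Keep one row per (FirstName, LastName, Email) and delete the rest.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, Email   -- duplicate definition
               ORDER BY PersonID                         -- keep the lowest ID
           ) AS rn
    FROM dbo.stg_Person
)
DELETE FROM Ranked WHERE rn > 1;   -- remove every duplicate after the first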

Connection Managers
Excel Connection Manager
File Connection Manager
Flat File Connection Manager
FTP Connection Manager
HTTP Connection Manager
ODBC Connection Manager
OLE DB Connection Manager (see the example connection string after this list)
ADO Connection Manager – for legacy applications using earlier versions of programming languages, such as VB 6.0
ADO.NET Connection Manager – access to Microsoft SQL Server and to data sources exposed through OLE DB and XML by using a .NET provider
Microsoft .NET Data Provider for mySAP Business Suite – access to an SAP server; enables executing RFC/BAPI commands and SELECT queries against SAP tables
Design-time data source objects can be created in SSIS, SSAS, and SSRS projects
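For reference, an OLE DB connection manager ultimately resolves to a connection string; for SQL Server 2008 with Windows authentication it looks roughly like the line below (provider name, server, and database here are placeholder assumptions).

Provider=SQLNCLI10;Data Source=MyServer;Initial Catalog=AccountingSystemDatabase;Integrated Security=SSPI;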

Container Managers
Foreach Loop Container
For Loop Container
Sequence Container

Data Flow for Updating Database
[Diagram]

Data Flow for Loading Slowly Changing Dimension
[Diagram]
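The diagram on this slide is built around the SSIS Slowly Changing Dimension transform; as a rough T-SQL picture of what a Type 2 change amounts to, assuming a hypothetical customer dimension with effective-date columns:

-- Type 2 change for one customer: close the current row, insert a new one.
-- DimCustomer is assumed to have RowStartDate / RowEndDate / IsCurrent columns.
UPDATE DimCustomer
SET    RowEndDate = GETDATE(), IsCurrent = 0
WHERE  CustomerCode = 'C-1001' AND IsCurrent = 1;

INSERT INTO DimCustomer (CustomerCode, City, RowStartDate, RowEndDate, IsCurrent)
VALUES ('C-1001', 'Lubbock', GETDATE(), NULL, 1);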

Control Flow for Importing Expanded Files
[Diagram]

Automating Your Routine Information Processing Tasks
Your routine information processing tasks:
◦ Read online news at 8:00 a.m. and collect a few of the most important pieces
◦ Retrieve data from a database to draft a short daily report at 10 a.m.
◦ View and reply to e-mails and take some notes that are saved in a database
◦ View 10 companies' webpages to see the updates; input the summaries into a database
◦ Browse three popular magazines twice a week; input the summaries into a database
◦ Generate a few one-way and two-way frequency tables and put them on the web
◦ Merge datasets collected by other people into a main database
◦ Prepare a weekly report using the database at 4 p.m. every Monday, and publish it to the internal portal site
◦ Prepare a monthly report at 11 a.m. on the first day of each month, which must be converted into a PDF file and uploaded to the website
It seems there are many things going on. How can you handle them properly and at the right time?
◦ An organizer – yes
◦ How about the regular data processing tasks?

Package Items
Data Flow task – the main task
Control Flow items:
◦ For Loop Container, Foreach Loop Container, Sequence Container
Data preparation tasks:
◦ File System Task, FTP Task, Web Service Task, XML Task
Workflow tasks:
◦ Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher Task
SQL Server tasks:
◦ Bulk Insert Task, Execute SQL Task
Scripting tasks:
◦ ActiveX Script Task, Script Task
Analysis Services tasks:
◦ Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining Query Task
Transfer tasks:
◦ Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task, Transfer Objects Task, Transfer Stored Procedures Task
Maintenance tasks
Custom tasks