Presentation is loading. Please wait.

Presentation is loading. Please wait.

Extract, Transform, Load 1. Agenda  Review  Analysis (Bus Matrix, Info Package)  Logical Design(Dimensional Modeling)  Physical Design(Spreadsheet)

Similar presentations


Presentation on theme: "Extract, Transform, Load 1. Agenda  Review  Analysis (Bus Matrix, Info Package)  Logical Design(Dimensional Modeling)  Physical Design(Spreadsheet)"— Presentation transcript:

1 Extract, Transform, Load 1

2 Agenda  Review  Analysis (Bus Matrix, Info Package)  Logical Design(Dimensional Modeling)  Physical Design(Spreadsheet)  Implementation(Data Mart Relational Tables)  ETL Process Overview  ETL Components  Staging Area  Extraction  Transformation  Loading  Documenting High-Level ETL Requirements  Documenting Detailed ETL Flows  Example ETL 2

3 Review: Dimensional Modeling 3

4 Review: DM Implementation DimStudent FactEnrollment CREATE TABLE DimStudent( student_sk int identity(1,1), student_id varchar(9), firstname varchar(30), lastname varchar(30), city varchar(20), state varchar(2), major varchar(6), classification varchar(25), gpa numeric(3, 2), club_name varchar(25), undergrad_school varchar(25), gmat int, undergrad_or_gradvarchar(10), CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk)); GO CREATE TABLE FactEnrollment( student_sk int, class_sk int, date_sk int, professor_sk int, course_grade numeric(2, 1), CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk), CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent(student_sk), CONSTRAINT factenrollment_class_fk FOREIGN KEY(class_sk) REFERENCES dimclass (class_sk), CONSTRAINT factenrollment_date_fk FOREIGN KEY(date_sk) REFERENCES dimtime (date_sk), CONSTRAINT factenrollment_professor_fk FOREIGN KEY(professor_sk) REFERENCES dimprofessor (professor_sk)); GO 4

5 Review: Physical DW Design 5

6 ETL Overview  Reshaping relevant data from source systems into useful information stored in the DW  Extract  Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW  Transform  Cleaning and converting data to prepare it for loading into the DW  Load  Putting cleansed and converted data into the DW 6

7 ETL Process  Not Really New, BUT…  Much more data  Includes rearranging, summarizing  Data used for strategic decision-making  Characteristics:  Process AND technology  Detailed, highly-dependent tasks  Consumes average 75% of DW development  An on-going process for life of DW  Requirements:  Well-documented  Automated  Flexible 7

8 ETL Process 1. Determine target data 2. Determine data sources 3. Prepare data mapping 4. Organize data staging area 5. Establish data extraction rules 6. Establish data transformation rules 7. Plan aggregate tables 8. Establish data load procedures 9. Load dimension tables 10. Load fact tables 8

9 ETL Process Flow 9 1, Dim Model 2, Spreadsheet 3, Spreadsheet 4 5, SSIS 6, 7, Map & SSIS 8, 9, 10, SSIS

10 ETL Staging Area 10  Information hub, facilitating the enriching stages that data goes through to populate a DW  Advantages:  Separates source systems and DW  Minimizes ETL impact on source AND DW systems  Can consist of multiple “hubs”  “upload” area  “staging” area  “DW load images”

11 ETL Staging Area, cont… 11

12 High Level Design of ETL Process  Initial documentation of:  What data do we need and where is it coming from?  Physical DW Design Spreadsheet shown previously  What are the major transformation/cleansing needs?  “Extend” Physical DW Design Spreadsheet OR  ETL Map  What’s the sequence of activities for ETL?  ETL Map 12

13 Common Transformations  Format Revisions  Key Restructuring, Lookup  Handling of Null Values  Decoding fields  Calculated, Derived values  Merging of Data 13

14 Common Transformations, cont…  Splitting of single fields  Character set conversion  Units of measurement conversion  Date/time conversion  Summarization  Deduplication 14

15 Common Transformations, cont…  Other Data Quality Issues  Standardize values  Validate values  Identifying mismatches, misspellings  Etc…  Suggestions:  Appoint “Data Stewards”  Ensure ETL programs have control checks  Data Profiling… 15

16 Comparison of Models 16

17 Transformations Example DimTimeDimProfessorDimClassDimStudentFactEnrollment Create tableGenerate SK Add SKs: student, section, prof (join registration to student, time, and section dims; left join them to prof) Insert row w/SK = -1 Expand rank values (use SQL case) Get coursename & cred hrs from section tbl (join section to course) Expand classification values (use SQL case) Expand department values (join prof to departments) Expand state values (needs lookup table but use SQL case instead) Get gmat, undergrad school from grad table (join student to grad) Get club name from club (join student to undergrad; Left join them to club) Create undergrad_or_grad values (if stud_id in undergrad or stud_id in grad) 17

18 Data Profiling  Systematic analysis of the content of a data source  Goals:  Anticipate potential data quality issues upfront  Build quality corrections and controls into ETL process  Manual and/or Tool-assisted 18

19 Profiling Example: Manual CustID Account Number Customer TypeTitle First Name Last NameGender PhoneAddress Line1 Address Line2State Postal CodeCountry AW IMr.JonYangF works.com. 1(11) N. 14th St Queensland4700AU AW I EugeneHuangF works.com W St. Victoria3198AU AW I RubenTorresF works.com. 1(11) Linden Dr New South Wales7001AU AW I ChristyZhuF works.com. 1(11) Village Pl. Queensland AW IMrs.ElizabethJohnsonF works.com.(500) Harness Circle 2500AU AW I JulioRuizM works.com. 1(11) Humphrey Drive New South Wales4169OZ 19

20 Profiling Example: SSIS 20

21 Documenting ETL High Level Design  Add to existing DW Physical Design Spreadsheet 21

22 Documenting ETL High Level Design 22

23 Low Level Design of ETL Process  Detailed documentation of:  What data do we need and where is it coming from?  What are the major transformation/cleansing needs?  What’s the sequence of activities for ETL?  Can use tool like SSIS 23

24 Extracting Source Data  Two forms: 1. Static Data Capture  Point-in-time snapshot  Initial Loads and periodic refreshes 2. Revised Data Capture  Only data that has been added, updated, deleted since last load  Ongoing incremental loads  Two timeframes  Immediate  Deferred 24

25 Static Data Capture  (T)SQL Scripts  e.g., small number of tables/rows  Export/Import Tables  e.g., database or non-database sources  Backup/Restore Database  e.g., copying sqlserver source database for initial load ETL  Detach/Attach Database  e.g., copying older sqlserver version to newer sqlserver version for initial load ETL 25

26 Revised Data Capture  Immediate / Real-time  ETL side: procs get changed data from log real-time and update ETL staging tables  OLTP side: triggers update ETL staging tables  OLTP side: apps write to OLTP AND ETL staging tables  Deferred  ETL side: procs get changed data from OLTP tables based on timestamps  ETL side:procs do file comparison  OLTP side:changed data capture (SS 2008) 26

27 Documenting ETL Low Level Design: SSIS  Comes with SQL Server  Helps document and automate ETL process  Based on defining  Packages  Tasks  One approach  A package for each target table  A "master" package 27

28 SSIS Package Examples: Master 28

29 SSIS Package Examples: Extract All 29

30 SSIS Package Examples: Extract Changed using CDC 30 Eg, SELECT * from cdc- customer WHERE cdc_chg_date > etl_last_capture_date;

31 SSIS Package Examples: Transforms 31

32 SSIS Package Examples: Load 32

33 Class Performance DW Example  Create ClassPerformanceDW database  Using ClassPerformanceDW database…  Create ClassPerformanceDW tables using SQL Script  w_tables/create_class_performance_dw_tables.sql w_tables/create_class_performance_dw_tables.sql 33

34 ETL Example using SQL Scripts  One "Master Script"  Calls five "table" scripts 34

35 "Master" Script 35 --be sure to turn on Query, SQLCMD mode in order to run this script Use ClassPerformanceDW print 'loading dimclass table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql" print 'loading dimprofessor table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql" print 'loading dimstudent table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql" print 'loading dimtime table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql" print 'loading factenrollment table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql" Print 'class performance DW data transformation and loading is complete' Go

36 Load "DimProfessor" Script (pg. 1 of 3) 36 set nocount on print 'remove existing data from dimprofessor' delete from dimprofessor; go print 'reseeding SK identity value back to 1' dbcc checkident ('dimprofessor', reseed, 0); go print 'adding oltp prof data to dimprofessor' print 'professor_sk will be automatically inserted' insert into dimprofessor ( professor_id, firstname, lastname, rank, department) select prof_id, firstname, lastname, rank, dept from regnOLTP.dbo.prof ; go

37 Load "DimProfessor" Script (pg. 2 of 3) 37 print 'decoding rank field' UPDATE dimprofessor SET dimprofessor.rank = case dimprofessor.rank when 'asst' then 'assistant prof' when 'assc' then 'associate prof' when 'prof' then 'full prof' end ; Go print 'decoding department field using imported excel spreadsheet' UPDATE dimprofessor SET dimprofessor.department = regnOLTP.dbo.departments.department FROMdimprofessor, regnOLTP.dbo.departments WHEREdimprofessor.department = regnOLTP.dbo.departments.prefix ; Go

38 Load "DimProfessor" Script (pg. 3 of 3) 38 print 'adding SK -1 row' set identity_insert dimprofessor on Go insert into dimprofessor ( professor_sk, professor_id, firstname, lastname, rank, department) Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown'); GO set identity_insert dimprofessor off Go Set nocount off

39 Load "FactEnrollment" Script 39 print 'adding oltp registration data to fact_enrollment' INSERT INTO factenrollment ( student_sk, class_sk, date_sk, professor_sk, course_grade) SELECT student_sk, class_sk, datekey, professor_sk, final_grade FROM ((((regnOLTP.dbo.registration INNER JOIN dimstudent ON registration.stud_id = dimstudent.student_id) INNER JOIN dimclass ON regnOLTP.dbo.registration.callno = dimclass.crn) INNER JOIN dimtime ON CONVERT(varchar(10),regnOLTP.dbo.registration.regn_date,101) = actualdatekey) INNER JOIN regnOLTP.dbo.section ON dimclass.crn = regnOLTP.dbo.section.callno) LEFT JOIN dimprofessor ON regnOLTP.dbo.section.prof_id = dimprofessor.professor_id ; Go

40 Entire Transform/Load "Package" w_tables.zip 40


Download ppt "Extract, Transform, Load 1. Agenda  Review  Analysis (Bus Matrix, Info Package)  Logical Design(Dimensional Modeling)  Physical Design(Spreadsheet)"

Similar presentations


Ads by Google