Ahsan Abdullah 1 Data Warehousing Lecture-17 Issues of ETL Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.

Presentation transcript:

Ahsan Abdullah 1 Data Warehousing Lecture-17 Issues of ETL Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

Ahsan Abdullah 2 Issues of ETL

Ahsan Abdullah 3 Why ETL Issues?
- Data from different source systems will be different, poorly documented, and dirty, so a lot of analysis is required.
- Is it easy to collate addresses and names? Not really: there are no address or name standards.
- Software can be used for standardization, but it is very expensive, because any “standards” vary from country to country and no single market is large enough.

Ahsan Abdullah 4 Why ETL Issues?
- Things would have been simpler in the presence of operational systems, but that is not always the case.
- Manual data collection and entry: nothing wrong with that, but it can introduce lots of problems.
- Data is never perfect. The cost of perfection is extremely high compared with its value.

Ahsan Abdullah 5 “Some” Issues
- Complexity usually, if not always, underestimated
- Diversity in source systems and platforms
- Inconsistent data representations
- Complexity of transformations
- Rigidity and unavailability of legacy systems
- Volume of legacy data
- Web scraping

Ahsan Abdullah 6 Complexity of problem/work underestimated
- Work seems to be deceptively simple.
- People start manually building the DWH.
- Programmers underestimate the task.
- Impressions could be deceiving.
- Traditional DBMS rules and concepts break down for very large heterogeneous historical databases.

Ahsan Abdullah 7 Diversity in source systems and platforms

Platform      | OS     | DBMS      | MIS/ERP
Mainframe     | VMS    | Oracle    | SAP
Mini Computer | Unix   | Informix  | PeopleSoft
Desktop       | Win NT | Access    | JD Edwards
              | DOS    | Text file |

- Dozens of source systems across organizations
- Numerous source systems within an organization
- Need a specialist for each

Ahsan Abdullah 8 Inconsistent data representations
Same data, different representation:
- Date value representations. Examples: /14/ MAR-1997 March (Julian date format)
- Gender value representations. Examples: Male/Female, M/F, 0/1, PM/PF
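One way to picture the fix for inconsistent representations is a pair of normalization functions that map every source variant onto a single canonical form. The sketch below is illustrative only: the function names, the direction of the 0/1 gender coding, and the list of date layouts are assumptions, not part of the lecture. Codes such as PM/PF and Julian dates are deliberately left unmapped so that they surface as unknown values instead of being guessed.

```python
from datetime import date, datetime

# Canonical codes chosen for this sketch; the 0/1 direction is an assumption.
GENDER_CODES = {
    "male": "M", "m": "M", "0": "M",
    "female": "F", "f": "F", "1": "F",
}

# Candidate layouts tried in order; illustrative, not taken from the slide.
DATE_FORMATS = ("%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y", "%Y-%m-%d")


def normalize_gender(raw):
    """Return 'M', 'F', or None when the source code is not recognized."""
    return GENDER_CODES.get(str(raw).strip().lower())


def normalize_date(raw):
    """Return a date for the first layout that parses, else None."""
    text = str(raw).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate layout
    return None
```

Returning None for anything unrecognized keeps the decision about unknown values (reject, default, or route to manual review) outside the parsing code.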

Ahsan Abdullah 9 Multiple sources for same data element
- Need to rank source systems on a per data element basis.
- Take data element from source system with highest rank where element exists.
- “Guessing” gender from name. Something is better than nothing?
- Must sometimes establish “group ranking” rules to maintain data integrity.
- First, middle and family name from two systems of different rank.
- People using middle name as first name.
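The ranking idea can be sketched as a small merge routine. Everything named here (the `crm` and `billing` sources, the element names, the rule layout) is hypothetical. The sketch assumes a group rule means: take all elements of the group from the highest-ranked source that supplies at least one of them, so that name parts from different systems are never mixed.

```python
def is_present(value):
    """A value counts as present if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def merge_by_rank(records, rules):
    """Merge per-source records using ranked rules.

    records: {source_name: {element: value}}
    rules:   list of (elements, ranked_sources); a rule with several
             elements is a group that is always taken from one source.
    """
    merged = {}
    for elements, ranked_sources in rules:
        for source in ranked_sources:
            row = records.get(source, {})
            if any(is_present(row.get(e)) for e in elements):
                for e in elements:
                    merged[e] = row.get(e)
                break  # highest-ranked source with data wins
    return merged


# Hypothetical example: two systems disagree on how the name is split.
records = {
    "crm": {"first_name": "Ali", "middle_name": "", "family_name": "Khan", "gender": ""},
    "billing": {"first_name": "Muhammad", "middle_name": "Ali", "family_name": "Khan", "gender": "M"},
}
rules = [
    (("first_name", "middle_name", "family_name"), ["crm", "billing"]),
    (("gender",), ["crm", "billing"]),
]
merged = merge_by_rank(records, rules)
```

Ranking each name part separately would combine the first name from `crm` with the middle name from `billing` and produce "Ali Ali Khan"; treating the three parts as one group avoids that.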

Ahsan Abdullah 10 Complexity of required transformations
- Simple one-to-one scalar transformations: 0/1 → M/F
- One-to-many element transformations: 4 x 20 address field → House/Flat, Road/Street, Area/Sector, City.
- Many-to-many element transformations: house-holding (who live together) and individualization (who are same) and same lands.
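As a rough illustration of a one-to-many transformation, the sketch below splits a fixed-width 4 x 20 address field into four named parts. It assumes, optimistically, that each 20-character line holds exactly one part in a fixed order; real address data rarely behaves this well, and the field names are invented for the example.

```python
ADDRESS_PARTS = ("house_flat", "road_street", "area_sector", "city")
LINE_WIDTH = 20  # each of the four lines is 20 characters wide


def split_address(raw):
    """Split an 80-character address field into four named elements."""
    width = LINE_WIDTH * len(ADDRESS_PARTS)
    padded = raw.ljust(width)[:width]  # pad short input, truncate long input
    lines = [padded[i:i + LINE_WIDTH].strip() for i in range(0, width, LINE_WIDTH)]
    return dict(zip(ADDRESS_PARTS, lines))


# Hypothetical source value built from four 20-character lines.
raw_address = (
    "House 12".ljust(20) + "Street 5".ljust(20) + "Sector F-8".ljust(20) + "Islamabad"
)
parsed = split_address(raw_address)
```

The scalar case (0/1 → M/F) needs only a lookup table; the many-to-many cases need record matching across rows and are well beyond a sketch of this size.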

Ahsan Abdullah 11 Rigidity and unavailability of legacy systems
- Very difficult to add logic to or increase performance of legacy systems.
- Utilization of expensive legacy systems is optimized.
- Therefore, want to off-load transformation cycles to open systems environment.
- This often requires new skill sets.
- Need efficient and easy way to deal with incompatible mainframe data formats.

Ahsan Abdullah 12 Volume of legacy data
- The data in question is not weekly data, but data spread over years.
- Historical data is on tapes that are serial and very slow to mount etc.
- Need lots of processing and I/O to effectively handle large data volumes.
- Need efficient interconnect bandwidth to transfer large amounts of data from legacy sources to DWH.

Ahsan Abdullah 13 Web scraping
- A lot of data in a web page, but it is mixed with a lot of “junk”.
- Problems:
  - Limited query interfaces
    - Fill in forms
  - “Free text” fields
    - E.g. addresses
  - Inconsistent output
    - i.e., HTML tags which mark interesting fields might be different on different pages.
  - Rapid change without notice.
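Because the tags around an interesting field can differ from page to page, one defensive option is to ignore the markup and work on the visible text only. The sketch below uses Python's standard `html.parser` for that; the class name, helper name, and sample pages are made up for illustration, and it is a starting point under those assumptions rather than a complete scraper.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.text()


# Two hypothetical pages that mark the same field with different tags.
page_a = "<html><script>var x = 1;</script><b>Address:</b> 12 Main St</html>"
page_b = "<div><td class='addr'>Address: 12 Main St</td></div>"
```

Both sample pages reduce to the same text, so later extraction rules do not depend on which tags a page happens to use. Free-text fields such as addresses still need the kind of standardization discussed earlier.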

Ahsan Abdullah 14 Beware of data quality (or lack of it)
- Data quality is always worse than expected.
- Will have a couple of lectures on data quality and its management.
- It is not a matter of a few hundred rows.
- Data recorded for running operations is not usually good enough for decision support.
- Correct totals don’t guarantee data quality.
- Not knowing a customer's gender does not hurt point-of-sale (POS) operations.
- Centenarian customers (apparently aged 100 or more) popping up.
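A simple profiling check makes the last point concrete. The sketch below flags customers whose recorded birth date is missing, lies in the future, or implies an age of 100 or more. The threshold, the record layout, and the idea that such ages come from placeholder dates like 1900-01-01 are assumptions for illustration.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Whole years between birth_date and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)


def suspicious_customers(customers, as_of, max_age=100):
    """Return ids whose birth date is missing, in the future, or implausibly old.

    customers: iterable of (customer_id, birth_date or None)
    """
    return [
        cid
        for cid, dob in customers
        if dob is None or dob > as_of or age_on(dob, as_of) >= max_age
    ]


# Hypothetical sample: customer 2 carries a placeholder birth date.
sample = [(1, date(1980, 5, 1)), (2, date(1900, 1, 1)), (3, None)]
flagged = suspicious_customers(sample, as_of=date(2005, 1, 1))
```

Such rows do not disturb sales totals, which is why correct totals alone say little about data quality.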

Ahsan Abdullah 15 ETL vs. ELT
There are two fundamental approaches to data acquisition:
- ETL: Extract, Transform, Load, in which data transformation takes place on a separate transformation server.
- ELT: Extract, Load, Transform, in which data transformation takes place on the data warehouse server.
A combination of both is also possible.
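The difference between the two approaches is where the transformation runs. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse server; the table names, the sample rows, and the 0/1 → M/F coding are all assumptions. In `etl_load` the conversion happens in Python before loading, while in `elt_load` the raw rows are loaded into a staging table and converted with SQL inside the database.

```python
import sqlite3

# Hypothetical extracted rows: (customer id as text, gender code as text).
RAW_ROWS = [("1", "0"), ("2", "1")]
GENDER = {"0": "M", "1": "F"}  # assumed direction of the 0/1 coding


def etl_load(conn, rows):
    """ETL: transform outside the warehouse, then load the clean rows."""
    clean = [(int(cid), GENDER[code]) for cid, code in rows]
    conn.execute("CREATE TABLE customer (id INTEGER, gender TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", clean)


def elt_load(conn, rows):
    """ELT: load raw rows into staging, then transform with SQL in place."""
    conn.execute("CREATE TABLE staging (id TEXT, gender TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE customer AS
        SELECT CAST(id AS INTEGER) AS id,
               CASE gender WHEN '0' THEN 'M' WHEN '1' THEN 'F' END AS gender
        FROM staging
        """
    )


def customers(conn):
    return conn.execute("SELECT id, gender FROM customer ORDER BY id").fetchall()


etl_conn = sqlite3.connect(":memory:")
etl_load(etl_conn, RAW_ROWS)

elt_conn = sqlite3.connect(":memory:")
elt_load(elt_conn, RAW_ROWS)
```

Both paths end with the same `customer` table; what differs is which machine spends the processing cycles, which ties back to the earlier point about off-loading transformation work from legacy systems.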