Ahsan Abdullah 1 Data Warehousing Lecture-20 Data Duplication Elimination & BSN Method Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head.

Slides:



Advertisements
Similar presentations
De-Duplication A not so simple problem Covers Appendix Part 5.
Advertisements

ICDL Software Applications - Database Concepts. Unit 6 Data and Data Representation Database Concepts –File Structure –Relationships Database Design –Data.
Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.
Lecture-19 ETL Detail: Data Cleansing
Konstanz, Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital.
1 Database Design and Development: A Visual Approach © 2006 Prentice Hall Chapter 2 Relational Theory DATABASE DESIGN AND DEVELOPMENT: A VISUAL APPROACH.
3/5/2009Computer systems1 Analyzing System Using Data Dictionaries Computer System: 1. Data Dictionary 2. Data Dictionary Categories 3. Creating Data Dictionary.
Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-5 Types & Typical Applications of DWH Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
Managing Data Resources
Managing data Resources: An information system provides users with timely, accurate, and relevant information. The information is stored in computer files.
Multidimensional Database Structure
1 Haiguang Li 01. Dec Real-World Data Is Dirty Data Cleansing and the Merge/Purge Problem Hernandez & Stolfo: Columbia University Class Presentation.
The Relational Database Model:
Lecture-33 DWH Implementation: Goal Driven Approach (1)
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Agenda 02/21/2013 Discuss exercise Answer questions in task #1 Put up your sample databases for tasks #2 and #3 Define ETL in more depth by the activities.
DWH-Ahsan Abdullah 1 Data Warehousing Lab Lect-2 Lab Data Set Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
The Relational Database Model
Ahsan Abdullah 1 Data Warehousing Lecture-12 Relational OLAP (ROLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
1 DATABASE TECHNOLOGIES BUS Abdou Illia, Fall 2012 (September 5, 2012)
Database Design - Lecture 1
INTRODUCTION TO PROGRAMMING STRUCTURE Chapter 4 1.
Ahsan Abdullah 1 Data Warehousing Lecture-17 Issues of ETL Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Using Data Hygiene. Real World Issues Focus moves from strategic concepts in identifying individuals to operational concepts. Representative issues Multiple.
Detailed design – class design Domain Modeling SE-2030 Dr. Rob Hasker 1 Based on slides written by Dr. Mark L. Hornick Used with permission.
1 Data Warehousing Lecture-13 Dimensional Modeling (DM) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research.
Ahsan Abdullah 1 Data Warehousing Lecture-7De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Information Systems Today (©2006 Prentice Hall) 3-1 CS3754 Class Note 12 Summery of Relational Database.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-4 Introduction and Background Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.
Ahsan Abdullah 1 Data Warehousing Lecture-18 ETL Detail: Data Extraction & Transformation Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. &
Ahsan Abdullah 1 Data Warehousing Lecture-9 Issues of De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Lecture 12 Data Duplication Elimination & BSN Method by Adeel Ahmed Faculty of Computer Science 1.
Physical Database Design I, Ch. Eick 1 Physical Database Design I About 25% of Chapter 20 Simple queries:= no joins, no complex aggregate functions Focus.
Data Warehousing 1 Lecture-28 Need for Speed: Join Techniques Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Chapter 4 Data and Databases. Learning Objectives Upon successful completion of this chapter, you will be able to: Describe the differences between data,
Ahsan Abdullah 1 Data Warehousing Lecture-10 Online Analytical Processing (OLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Data Warehousing Lecture-31 Supervised vs. Unsupervised Learning Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Ahsan Abdullah 1 Data Warehousing Lecture-16 Extract Transform Load (ETL) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.
Query and Reasoning. Types of Queries Most GIS queries will select spatial features Query by Attribute (Select by Attribute) –Structured Query Language.
1 Data Warehousing Lecture-15 Issues of Dimensional Modeling Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Data Warehousing Lecture-30 What can Data Mining do? Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-29 Brief Intro. to Data Mining Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
Managing Data Resources. File Organization Terms and Concepts Bit: Smallest unit of data; binary digit (0,1) Byte: Group of bits that represents a single.
Programming Logic and Design Fourth Edition, Comprehensive Chapter 8 Arrays.
Real-World Data Is Dirty Data Cleansing and the Merge/Purge Problem Hernandez & Stolfo: Columbia University Class Presentation by Rhonda Kost, 06.April.
Lesson 2: Designing a Database and Creating Tables.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-22 DQM: Quantifying Data Quality Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
Ahsan Abdullah 1 Data Warehousing Lecture-6Normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.
Modeling Security-Relevant Data Semantics Xue Ying Chen Department of Computer Science.
CHAPTER 2 : RELATIONAL DATA MODEL Prepared by : nbs.
Ahsan Abdullah 1 Data Warehousing Lecture-8 De-normalization Techniques Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
1 Management Information Systems M Agung Ali Fikri, SE. MM.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-21 Introduction to Data Quality Management (DQM) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof.
Real-World Data Is Dirty Data Cleansing and the Merge/Purge Problem M. Hernandez & S. Stolfo: Columbia University Class Presentation by Jeff Maynard.
1 CS122A: Introduction to Data Management Lecture #4 (E-R  Relational Translation) Instructor: Chen Li.
CENG 351 File Structures and Data Management1 Relational Model Chapter 3.
IT 5433 LM4 Physical Design. Learning Objectives: Describe the physical database design process Explain how attributes transpose from the logical to physical.
COP Introduction to Database Structures
Lecture-3 Introduction and Background
Overview of MDM Site Hub
Lecture-32 DWH Lifecycle: Methodologies
Relational Algebra Chapter 4, Sections 4.1 – 4.2
Real-World Data Is Dirty
Lecture-35 DWH Implementation: Pitfalls, Mistakes, Keys
DATABASE TECHNOLOGIES
INSTRUCTOR: MRS T.G. ZHOU
Presentation transcript:

Ahsan Abdullah 1 Data Warehousing Lecture-20 Data Duplication Elimination & BSN Method Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

Ahsan Abdullah 2 Why data duplicated? A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schema/representation) of the same entity. The data coming from outside the organization owning the DWH, can have even lower quality data i.e. different representation for same entity, transcription or typographical errors.

Ahsan Abdullah 3 Problems due to data duplication Data duplication, can result in costly errors, such as:  False frequency distributions.  Incorrect aggregates due to double counting.  Difficulty with catching fabricated identities by credit card companies.

Ahsan Abdullah 4 Unable to determine customer relationships (CRM) Unable to analyze employee benefits trends NamePhone NumberCust. No. M. Ismail Siddiqi M. Ismail Siddiqi M. Ismail Siddiqi Bonus DateNameDepartmentEmp. No. Jan. 2000Khan Muhammad213 (MKT) Dec. 2001Khan Muhammad567 (SLS) Mar. 2002Khan Muhammad349 (HR) Duplicate Identification Numbers Multiple Customer Numbers Multiple Employee Numbers Data Duplication: Non-Unique PK

Ahsan Abdullah 5 Data Duplication: House Holding  Group together all records that belong to the same household. Why bother ? ……… S. Ahad 440, Munir Road, Lahore ……… ………….………………………………… ……… Shiekh Ahad No. 440, Munir Rd, Lhr ……… Shiekh Ahed House # 440, Munir Road, Lahore ……… ………….…………………………………

Ahsan Abdullah 6  Identify multiple records in each household which represent the same individual Address field is standardized. By coincidence ?? ……… M. Ahad 440, Munir Road, Lahore ……… ………….………………………………… ……… Maj Ahad 440, Munir Road, Lahore Data Duplication: Individualization

Ahsan Abdullah 7 Formal definition & Nomenclature  Problem statement:  “Given two databases, identify the potentially matched records Efficiently and Effectively”  Many names, such as:  Record linkage  Merge/purge  Entity reconciliation  List washing and data cleansing.  Current market and tools heavily centered towards customer lists.

Ahsan Abdullah 8 Need & Tool Support  Logical solution to dirty data is to clean it in some way.  Doing it manually is very slow and prone to errors.  Tools are required to do it “cost” effectively to achieve reasonable quality.  Tools are there, some for specific fields, others for specific cleaning phase.  Since application specific, so work very well, but need support from other tools for broad spectrum of cleaning problems.

Ahsan Abdullah 9 Overview of the Basic Concept  In its simplest form, there is an identifying attribute (or combination) per record for identification.  Records can be from single source or multiple sources sharing same PK or other common unique attributes.  Sorting performed on identifying attributes and neighboring records checked.  What if no common attributes or dirty data?  The degree of similarity measured numerically, different attributes may contribute differently.

Ahsan Abdullah 10 Basic Sorted Neighborhood (BSN) Method  Concatenate data into one sequential list of N records  Steps 1: Create Keys  Compute a key for each record in the list by extracting relevant fields or portions of fields  Effectiveness of the this method highly depends on a properly chosen key  Step 2: Sort Data  Sort the records in the data list using the key of step 1  Step 3: Merge  Move a fixed size window through the sequential list of records limiting the comparisons for matching records to those records in the window  If the size of the window is w records then every new record entering the window is compared with the previous w-1 records.

Ahsan Abdullah 11 BSN Method : Sliding Window Current window of records w Next window of records w

Ahsan Abdullah 12 BSN Method: Selection of Keys  Selection of Keys  Effectiveness highly dependent on the key selected to sort the records middle name vs. family name,  A key is a sequence of a subset of attributes or sub-strings within the attributes chosen from the record.  The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other. FirstMiddleAddressNIDKey MuhammedAhmad440 Munir Road AHM440MUN345 MuhammadAhmad440 Munir Road AHM440MUN345 MuhammedAhmed440 Munir Road AHM440MUN345 MuhammadAhmar440 Munawar Road AHM440MUN345

Ahsan Abdullah 13 BSN Method: Problem with keys  Since data is dirty, so keys WILL also be dirty, and matching records will not come together.  Data becomes dirty due to data entry errors or use of abbreviations. Some real examples are as follows:  Solution is to use external standard source files to validate the data and resolve any data conflicts. Technology Tech. Techno. Tchnlgy

Ahsan Abdullah 14 BSN Method: Problem with keys (e.g.) NoNameAddressGender 1Syed N Jaffri Chaklala No Rawalpindi StreetM 2Syed Noman420 4 Rwp SchemeM 3Saiam Noor5 Afshan Colony Flat Lahore Road SaidpurF NoNameAddressGender 1N. Jaffri, SyedNo. 420, Street 15, Chaklala 4, RawalpindiM 2S. Noman420, Scheme 4, RwpM 3Saiam NoorFlat 5, Afshan Colony, Saidpur Road, LahoreF If contents of fields are not properly ordered, similar records will NOT fall in the same window. Example: Records 1 and 2 are similar but will occur far apart. Solution is to TOKENize the fields i.e. break them further. Use the tokens in different fields for sorting to fix the error. Example: Either using the name or the address field records 1 and 2 will fall close.

Ahsan Abdullah 15 BSN Method: Matching Candidates Merging of records is a complex inferential process. Example-1: NomaNoman Example-1: Two persons with names spelled nearly but not identically, have the exact same address. We infer they are same person i.e. Noma Abdullah and Noman Abdullah. Example-2: Example-2: Two persons have same National ID numbers but names and addresses are completely different. We infer same person who changed his name and moved or the records represent different persons and NID is incorrect for one of them. Use of further information such as age, gender etc. can alter the decision. Example-3:NomaNoman Noma Noman Example-3: Noma-F and Noman-M we could perhaps infer that Noma and Noman are siblings i.e. brothers and sisters. Noma-30 and Noman-5 i.e. mother and son.

Ahsan Abdullah 16  Time Complexity: O(n log n)  O (n) for Key Creation  O (n log n) for Sorting  O (w n) for matching, where w  2  n  Constants vary a lot  At least three passes required on the dataset.  Complexity or rule and window size detrimental.  For large sets disk I/O is detrimental. Complexity Analysis of BSN Method

Ahsan Abdullah 17 BSN Method: Equational Theory To specify the inferences we need equational Theory.  Logic is NOT based on string equivalence.  Logic based on domain equivalence.  Requires declarative rule language.