De-Duplication A not so simple problem Covers Appendix Part 5.


1 De-Duplication A not so simple problem Covers Appendix Part 5

2 False? False positives occur when a group of records is identified as duplicates but the records do NOT represent the same customer. False negatives occur when actual redundant representations of the same customer are NOT identified.
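
The two error types can be counted once a hand-labelled sample exists. The sketch below is illustrative only: it assumes duplicates are expressed as pairs of record keys, and the keys shown are made up.

```python
def count_errors(predicted_pairs, true_pairs):
    """Compare flagged duplicate pairs against hand-labelled truth."""
    # Normalise so that (a, b) and (b, a) count as the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    actual = {frozenset(p) for p in true_pairs}
    false_positives = predicted - actual  # flagged, but not the same customer
    false_negatives = actual - predicted  # same customer, but not flagged
    return len(false_positives), len(false_negatives)

# Hypothetical record keys, for illustration only.
flagged = [(1, 2), (3, 4)]
labelled = [(2, 1), (5, 6)]
fp, fn = count_errors(flagged, labelled)
```

Here pair (3, 4) is a false positive and pair (5, 6) is a false negative.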

3 Customer Name – only personal names. Postal Address – only United States address formats. Tax ID – could be a personal National Insurance Number or another unique identifier.

4 Identical Would you argue that these are NOT duplicate customers?

5 Exact???

6 Abbreviation The abbreviation of first and middle names is a common challenge: Does a matching Tax ID guarantee that a variation is a duplicate? What about when Tax ID is missing?
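
One possible way to treat abbreviations is to let an initial stand for any name starting with that letter. This is a sketch under assumptions: names are written "First Middle Last", the example names are invented, and a compatible name is only a candidate duplicate, not proof of one.

```python
def token_matches(a, b):
    """True if two name tokens are equal or one is an initial of the other."""
    a, b = a.strip(".").lower(), b.strip(".").lower()
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def names_compatible(name_a, name_b):
    """Compare first/middle tokens loosely and the last name exactly."""
    a, b = name_a.split(), name_b.split()
    if not a or not b or a[-1].lower() != b[-1].lower():
        return False
    # zip stops at the shorter list, so a missing middle name is tolerated.
    return all(token_matches(x, y) for x, y in zip(a[:-1], b[:-1]))
```

With this rule "John A. Smith" and "J. Smith" are compatible, but so are "John Smith" and "James Smith" once either is abbreviated to "J. Smith", which is why the Tax ID question matters.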

7 Marriage Marriages can be good for people but possibly bad for their data: Did the hyphenated last name on Key 252 help overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?
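
A hyphenated last name can act as a bridge between the old and new name. A minimal sketch, assuming last names are already separated from given names; the names used are invented, not taken from the slide.

```python
def surname_parts(last_name):
    """Split a possibly hyphenated last name into lowercase parts."""
    return {part.lower() for part in last_name.split("-") if part}

def surnames_overlap(last_a, last_b):
    """True if the two last names share at least one component."""
    return bool(surname_parts(last_a) & surname_parts(last_b))
```

"Smith-Jones" overlaps with both "Smith" and "Jones", but "Smith" and "Jones" do not overlap with each other, so without the hyphenated record nothing in the name links the two.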

8 False Positives For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address? For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?
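
The questions above suggest a three-way outcome rather than a yes/no answer. The sketch below is one possible rule set, not a recommended strategy: the field names, the 0.8 similarity threshold and the decision to send uncertain cases to manual review are all assumptions.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Loose string comparison; the 0.8 threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def classify(rec_a, rec_b):
    """Return 'match', 'review' or 'distinct' for two customer records."""
    name_ok = similar(rec_a["name"], rec_b["name"])
    addr_ok = rec_a["address"].strip().lower() == rec_b["address"].strip().lower()
    tax_a, tax_b = rec_a.get("tax_id"), rec_b.get("tax_id")
    if tax_a and tax_b:
        if tax_a != tax_b:
            return "distinct"
        # Same Tax ID: only a full agreement is merged automatically.
        return "match" if (name_ok and addr_ok) else "review"
    # Tax ID missing on at least one side: never merge automatically.
    if name_ok and addr_ok:
        return "review"
    return "distinct"
```

Sending the doubtful cases to review trades false positives for manual effort; a stricter or looser rule set shifts that balance.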

9 Same Address A common challenge is the same family name and the exact same postal address
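
Records that share a family name and address can be grouped first and then inspected. A sketch assuming each record already has separate `first`, `last` and `address` fields (hypothetical field names):

```python
from collections import defaultdict

def household_groups(records):
    """Group records sharing a last name and postal address."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["last"].lower(), rec["address"].lower())
        groups[key].append(rec)
    # Keep only groups with more than one record; these need a closer look.
    return {k: v for k, v in groups.items() if len(v) > 1}

def distinct_first_names(group):
    """Lowercased first names in a group; more than one suggests a household."""
    return {rec["first"].lower() for rec in group}
```

A group with several distinct first names is more likely a family than a set of duplicates, so it should not be merged on name and address alone.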

10 What goes in Report Appendix? Discuss deduplication – What is your business strategy Show via a flow chart how you would attempt deduplication

11 Mailing List Management Functional Requirements Set out what the new system will do. You have some experience with this from the CS22120 Group Project. An attempt to describe, logically, the functionality of the system. You need to describe it, NOT build it.

12 Requirements Functional – What is it supposed to do Non-Functional requirements – Computer Environment – Personnel – Web based

13 Some functions Set up required fields Add, Modify and Delete Fields Import initial list – Field matching – Excel, CSV programs Add, Modify and Delete Records Merge records from externally purchased files
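
Field matching on import could look something like the sketch below. The required fields and the accepted header spellings are hypothetical; a real system would let the user set them up.

```python
import csv
import io

# Hypothetical required fields and the source headers accepted for each.
FIELD_ALIASES = {
    "name": {"name", "customer name", "full name"},
    "address": {"address", "postal address"},
    "postcode": {"postcode", "post code", "zip"},
}

def match_fields(headers):
    """Map each required field to the first source header that matches it."""
    mapping = {}
    for header in headers:
        cleaned = header.strip().lower()
        for field, aliases in FIELD_ALIASES.items():
            if cleaned in aliases and field not in mapping:
                mapping[field] = header
    return mapping

def import_rows(csv_text):
    """Read CSV text and return rows keyed by the required field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = match_fields(reader.fieldnames or [])
    missing = set(FIELD_ALIASES) - set(mapping)
    if missing:
        raise ValueError(f"unmatched required fields: {sorted(missing)}")
    return [
        {field: (row[src] or "").strip() for field, src in mapping.items()}
        for row in reader
    ]

sample = "Full Name,Postal Address,Post Code\nA. Example,1 High Street,AB1 2CD\n"
rows = import_rows(sample)
```

Rejecting a file with unmatched required fields is one design choice; prompting the user to map them by hand is another.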

14 Mailing List Functionality cont’d Cleanse using Post Office Address File (PAF) – Contains all addresses in the UK – Use to correct an address from its post code – Can add the correct: street name, post town, county
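
Cleansing from the post code can be sketched as a lookup. The dictionary below is only a stand-in: the real PAF is a licensed file with its own layout, and the post code and place names here are invented.

```python
# Stand-in for PAF data; the post code and place names are invented.
PAF_SAMPLE = {
    "ZZ1 1ZZ": {
        "street": "Example Road",
        "post_town": "Exampleton",
        "county": "Exampleshire",
    },
}

def normalise_postcode(raw):
    """Uppercase and put a single space before the final three characters."""
    compact = "".join(raw.split()).upper()
    return f"{compact[:-3]} {compact[-3:]}" if len(compact) > 3 else compact

def cleanse(record, paf=PAF_SAMPLE):
    """Return a copy of record with street, post town and county from the lookup."""
    postcode = normalise_postcode(record.get("postcode", ""))
    entry = paf.get(postcode)
    if entry is None:
        # Unknown post code: leave the other fields untouched.
        return {**record, "postcode": postcode}
    return {**record, "postcode": postcode, **entry}
```

A record with post code "zz11zz" and a misspelt street comes back with "ZZ1 1ZZ" and the street, post town and county from the lookup.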

15 Sorting Sort by – Post code – Geographic Areas – Job Title – SIC codes – Turnover (Ascending/Descending/Random) – And combinations of above
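
Combined sorts with a mix of ascending and descending keys can rely on a stable sort applied once per key. A sketch with hypothetical field names:

```python
import random

def sort_records(records, spec):
    """spec is a list of (field, descending) pairs, most significant first."""
    result = list(records)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the combined ordering.
    for field, descending in reversed(spec):
        result.sort(key=lambda r: r[field], reverse=descending)
    return result

def random_order(records, seed=None):
    """Return the records in random order; seed makes the order repeatable."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled

records = [
    {"postcode": "B", "turnover": 10},
    {"postcode": "A", "turnover": 5},
    {"postcode": "A", "turnover": 20},
]
by_area_then_turnover = sort_records(records, [("postcode", False), ("turnover", True)])
```

The example sorts by post code ascending and, within each post code, by turnover descending.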

16 Mailing List Functionality cont’d Select Number of records to deliver and maybe by – Post code – Job Title – SIC codes – Turnover (Ascending/Descending/Random) – Add false “ghosts” – File formats
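
Selecting records for delivery might be sketched as follows. It assumes that "ghosts" means seed records planted in the delivered list so that reuse of the list can be detected; the field names and the filter are hypothetical.

```python
def select_for_delivery(records, count, predicate=None, ghosts=()):
    """Pick up to count records that pass predicate, then append ghost records."""
    chosen = [r for r in records if predicate is None or predicate(r)][:count]
    # Ghosts are added on top of the requested count and are not marked,
    # so the recipient cannot tell them apart from real records.
    return chosen + [dict(g) for g in ghosts]

customers = [
    {"name": "A", "postcode": "AB1 2CD"},
    {"name": "B", "postcode": "XY9 8ZZ"},
    {"name": "C", "postcode": "AB3 4EF"},
    {"name": "D", "postcode": "AB5 6GH"},
]
delivery = select_for_delivery(
    customers,
    count=2,
    predicate=lambda r: r["postcode"].startswith("AB"),
    ghosts=[{"name": "Ghost", "postcode": "AB9 9ZZ"}],
)
```

Whether ghosts count towards the number of records delivered is a business decision; here they are added on top.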

17 Product? Must be able to distribute software – How? – Web or local OS – Hardware Platform

18 Competition Mailing Houses – Data discs – Web – Mailing list management services Software Companies – Dedupe software – Mailing List Management Software CHECK THESE OUT FOR THE REPORT

