De-Duplication: A not so simple problem (Covers Appendix Part 5)
False? False positives occur when a group of records is identified as duplicates but does NOT represent the same customer. False negatives occur when actual redundant representations of the same customer are NOT identified.
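One way to make the two error types concrete is to compare the duplicate groups a tool proposes against the groups known to be correct. The sketch below is only an illustration: the record keys and the function names `pairs_from_groups` and `match_errors` are hypothetical and do not come from the slides.

```python
from itertools import combinations


def pairs_from_groups(groups):
    """Expand groups of record keys into the set of unordered key pairs they imply."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def match_errors(predicted_groups, true_groups):
    """Return (false_positives, false_negatives) as sets of key pairs."""
    predicted = pairs_from_groups(predicted_groups)
    actual = pairs_from_groups(true_groups)
    false_positives = predicted - actual   # flagged as duplicates, but different customers
    false_negatives = actual - predicted   # same customer, but not flagged
    return false_positives, false_negatives


# Hypothetical keys, for illustration only.
predicted = [{101, 102}, {103, 104}]
truth = [{101, 102}, {104, 105}]
fp, fn = match_errors(predicted, truth)
```

Here the pair (103, 104) is a false positive and the pair (104, 105) is a false negative.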
Customer Name – only personal names. Postal Address – only United States address formats. Tax ID – could be a personal National Insurance Number or another unique identifier.
Identical Would you argue that these are NOT duplicate customers?
Abbreviation The abbreviation of first and middle names is a common challenge. Does a matching Tax ID guarantee that a variation is a duplicate? What about when the Tax ID is missing?
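A minimal sketch of an abbreviation-tolerant name comparison, assuming names are written as given names followed by the family name. The sample names are made up, and `names_compatible` is a hypothetical helper, not a standard routine.

```python
def tokens(name):
    """Lower-case the name and split it into words, dropping full stops."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def names_compatible(name_a, name_b):
    """True when family names are equal and each given name agrees or is an initial of the other."""
    ta, tb = tokens(name_a), tokens(name_b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False  # assumption: the last word is the family name and must match exactly
    # Compare given names position by position; a missing middle name is tolerated.
    for x, y in zip(ta[:-1], tb[:-1]):
        if x == y:
            continue
        if len(x) == 1 and y.startswith(x):
            continue
        if len(y) == 1 and x.startswith(y):
            continue
        return False
    return True


print(names_compatible("John Michael Smith", "J. M. Smith"))  # abbreviated form is accepted
print(names_compatible("J Smith", "Jane Smith"))              # also accepted: a false-positive risk
```

The second call shows why the name alone is not enough: "J Smith" is compatible with both John and Jane, so a Tax ID or address is still needed to decide.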
Marriage Marriages can be good for people but possibly bad for their data. Did the hyphenated last name on Key 252 help overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?
False Positives For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address? For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?
Same Address A common challenge is records that share the same family name and the exact same postal address.
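The questions on the last few slides suggest that no single field settles the matter. One possible rule set, sketched here under stated assumptions, sends uncertain pairs to manual review instead of merging them automatically. The 0.8 name threshold, the three labels, and the rule that two different Tax IDs mean different customers are all assumptions made for illustration, as is the function name `classify_pair`.

```python
from difflib import SequenceMatcher


def norm(value):
    """Lower-case, drop full stops and commas, and collapse whitespace; None becomes ''."""
    if not value:
        return ""
    return " ".join(value.lower().replace(".", "").replace(",", "").split())


def classify_pair(a, b, name_threshold=0.8):
    """Label a pair of customer records as 'match', 'review' or 'distinct'."""
    name_sim = SequenceMatcher(None, norm(a["name"]), norm(b["name"])).ratio()
    addr_a, addr_b = norm(a.get("address")), norm(b.get("address"))
    same_address = bool(addr_a) and addr_a == addr_b
    tax_a, tax_b = norm(a.get("tax_id")), norm(b.get("tax_id"))
    same_tax = bool(tax_a) and tax_a == tax_b

    if tax_a and tax_b and tax_a != tax_b:
        return "distinct"   # assumption: two different Tax IDs mean two customers
    if same_tax and same_address and name_sim >= name_threshold:
        return "match"      # all three fields agree
    if same_tax or (same_address and name_sim >= name_threshold):
        return "review"     # some evidence only: let a person decide
    return "distinct"
```

Two family members at one address with no Tax IDs would land in "review" under these rules, which reflects the doubt the slide raises without forcing a merge.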
What goes in the Report Appendix? Discuss deduplication – what is your business strategy? Show via a flow chart how you would attempt deduplication.
Mailing List Management Functional Requirements: Set out what the new system will do. You have some experience with this from the CS22120 Group Project. An attempt to describe, logically, the functionality of the system. You need to describe it, NOT build it.
Requirements: Functional – what is it supposed to do? Non-Functional requirements – Computer Environment – Personnel – Web based
Some functions: Set up required fields; Add, Modify and Delete Fields; Import initial list – Field matching – Excel, CSV programs; Add, Modify and Delete Records; Merge records from externally purchased files.
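Field matching on import can be pictured as a mapping from the column names in the incoming file to the fields the system has set up. The following is a rough sketch using Python's standard `csv` module; the column names, the field names and the sample row are invented for illustration.

```python
import csv
import io

# Hypothetical mapping: column name in the supplied file -> field name in our list.
FIELD_MAP = {"Full Name": "name", "Postcode": "post_code", "Job": "job_title"}


def import_records(text, field_map):
    """Read CSV text and return records keyed by our own field names."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        # Columns missing from the file become empty strings.
        records.append({target: (row.get(source) or "").strip()
                        for source, target in field_map.items()})
    return records


sample = "Full Name,Postcode,Job\nA Example,ZZ1 1ZZ,Director\n"
imported = import_records(sample, FIELD_MAP)
```

An externally purchased file would go through the same step with its own mapping before its records are merged and deduplicated.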
Mailing List Functionality cont’d: Cleanse using the Post Office Address File (PAF) – contains all addresses in the UK – use to correct an address from its post code – can add the correct street name, post town and county.
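Cleansing against PAF can be sketched as a lookup keyed on the post code. The real PAF is a licensed data set with its own format, so the dictionary below is only a stand-in with made-up entries, and it assumes one street per post code, which will not always hold. The function names are hypothetical.

```python
# Stand-in for PAF: post code -> reference address parts (made-up values).
PAF_SAMPLE = {
    "ZZ1 1ZZ": {"street": "Example Street", "post_town": "Exampleton", "county": "Exampleshire"},
}


def normalise_post_code(post_code):
    """Upper-case and re-space a UK post code; the inward part is the last three characters."""
    compact = "".join(post_code.split()).upper()
    if len(compact) > 3:
        return compact[:-3] + " " + compact[-3:]
    return compact


def cleanse(record, paf):
    """Return a copy of the record with address parts corrected from the reference file."""
    key = normalise_post_code(record.get("post_code", ""))
    cleaned = dict(record, post_code=key)
    reference = paf.get(key)
    cleaned["paf_matched"] = reference is not None
    if reference is not None:
        cleaned.update(reference)  # overwrite street, post town and county
    return cleaned


fixed = cleanse({"name": "A Example", "street": "Exmple St", "post_code": "zz11zz"}, PAF_SAMPLE)
```

Keeping a `paf_matched` flag (an assumption of this sketch) lets unmatched records be reported instead of silently passed through.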
Sorting Sort by – Post code – Geographic Areas – Job Title – SIC codes – Turnover (Ascending/Descending/Random) – And combinations of above
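Sorting by a combination of fields can be described with an ordered list of field names, plus a separate option for random order. This is a small sketch only; the field names and the `order` values are assumptions chosen to mirror the slide.

```python
import random


def sort_records(records, fields, order="ascending", seed=None):
    """Sort by the given fields in turn, or shuffle when order is 'random'."""
    if order == "random":
        shuffled = list(records)
        random.Random(seed).shuffle(shuffled)  # seed makes the shuffle repeatable
        return shuffled
    return sorted(records,
                  key=lambda r: tuple(r[f] for f in fields),
                  reverse=(order == "descending"))


rows = [
    {"post_code": "B", "turnover": 2},
    {"post_code": "A", "turnover": 5},
    {"post_code": "A", "turnover": 1},
]
by_code_then_turnover = sort_records(rows, ["post_code", "turnover"])
```

Note that this applies one direction to the whole combination; mixing ascending post code with descending turnover would need a key per field.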
Mailing List Functionality cont’d Select Number of records to deliver and maybe by – Post code – Job Title – SIC codes – Turnover (Ascending/Descending/Random) – Add false “ghosts” – File formats
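Selection and ghosts can be sketched together. The code assumes that "ghosts" are decoy records the list owner plants in a delivered file so that unauthorised reuse can be spotted later; the slide does not spell this out. The function name, field names and sample records are hypothetical.

```python
import random


def build_delivery(records, count, ghosts, seed=None, **criteria):
    """Pick up to `count` records matching the criteria, then mix in the ghost records."""
    matching = [r for r in records
                if all(r.get(field) == value for field, value in criteria.items())]
    delivery = matching[:count] + [dict(g) for g in ghosts]
    random.Random(seed).shuffle(delivery)  # avoid ghosts sitting together at the end
    return delivery


customers = [
    {"name": "A Example", "job_title": "Director"},
    {"name": "B Sample", "job_title": "Manager"},
    {"name": "C Specimen", "job_title": "Director"},
]
ghost_records = [{"name": "G Decoy", "job_title": "Director"}]
delivery = build_delivery(customers, 1, ghost_records, seed=1, job_title="Director")
```

The ghosts are not flagged in the delivered file; the owner would keep their own list of which records are decoys.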
Product? Must be able to distribute software – How? – Web or local OS – Hardware Platform
Competition: Mailing Houses – Data discs – Web – Mailing list management services. Software Companies – Dedupe software – Mailing List Management Software. CHECK THESE OUT FOR THE REPORT.