Kamiya Chaudhary Daniel Green
Noise: irrelevant, inconsistent, duplicate and missing data
Introduced into the system by: hardware failures, programming errors and gibberish input
Impact: unnecessary increase in stored data; can lead to wrong analysis because incorrect data is evaluated
Ideally, the data used for analysis should be relevant so that the results are accurate. However, in the real world it is practically impossible to have a system with no noisy data. Therefore, certain techniques are proposed to remove as much noisy data as possible and so improve the quality of the analysis. The goal is to remove the data that hinders the analysis.
Distance-based, density-based, clustering-based, HCleaner
HCleaner: Hyperclique-Based Data Cleaner. Eliminates data objects that are not tightly connected to other data objects in the data set. Hyperclique patterns are generated instead of frequent patterns. Hyperclique patterns contain items that are strongly correlated with each other.
Generates hyperclique patterns. Eliminates objects that are not part of any hyperclique pattern, based on h-confidence. H-confidence is a measure that reflects the overall affinity among the items within a pattern. The higher the h-confidence threshold, the more closely related the objects in a pattern are. Uses size-3 hyperclique patterns because a pair alone can reflect a strong correlation between one irrelevant object and another irrelevant object.
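As a rough illustration of the measure, h-confidence is commonly defined as the support of the whole pattern divided by the largest support of any single item in it. The sketch below assumes that definition and a toy list of transactions; the function names `support` and `h_confidence` and the sample data are made up for this example and are not taken from the presentation.

```python
def support(pattern, transactions):
    """Fraction of transactions that contain every item in `pattern`."""
    pattern = set(pattern)
    hits = sum(1 for t in transactions if pattern <= set(t))
    return hits / len(transactions)


def h_confidence(pattern, transactions):
    """Support of the pattern divided by the largest single-item support."""
    max_item_support = max(support([item], transactions) for item in pattern)
    if max_item_support == 0:
        return 0.0
    return support(pattern, transactions) / max_item_support


# Toy data (assumed for illustration only).
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"c", "d"},
    {"d"},
]

# {a, b, c} co-occur often relative to how often each appears alone.
print(h_confidence({"a", "b", "c"}, transactions))
# {a, d} never co-occur, so the affinity is zero.
print(h_confidence({"a", "d"}, transactions))
```

A pattern counts as a hyperclique pattern when this value meets the chosen threshold, so raising the threshold keeps only the more tightly related groups.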
Therefore, an object appearing in a size-3 pattern has at least 2 other objects with which it has a guaranteed pairwise similarity. The computational cost comes from generating the size-3 hyperclique patterns. The data to be labeled as noise is not an input to the algorithm but a result of it.
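The elimination step can be sketched as follows. This is a brute-force illustration, not the miner used in the paper: it treats each object as an "item" described by a set of binary attributes, checks every group of three objects, and keeps an object only if it belongs to at least one group whose h-confidence meets the threshold. The name `hcleaner_sketch`, the threshold value and the sample objects are assumptions for the example.

```python
from itertools import combinations


def hcleaner_sketch(objects, min_hconf, size=3):
    """Return (kept, noise): objects covered / not covered by any
    size-`size` group whose h-confidence is at least `min_hconf`.

    `objects` maps an object id to the set of attributes it has.
    With objects as items, h-confidence of a group reduces to
    (attributes shared by all members) / (largest attribute count).
    """
    covered = set()
    for group in combinations(sorted(objects), size):
        shared = set.intersection(*(objects[o] for o in group))
        largest = max(len(objects[o]) for o in group)
        if largest and len(shared) / largest >= min_hconf:
            covered.update(group)
    noise = set(objects) - covered
    return covered, noise


# Toy data (assumed): d4 and d5 resemble each other but nothing else.
objects = {
    "d1": {"x", "y", "z"},
    "d2": {"x", "y", "z"},
    "d3": {"x", "y"},
    "d4": {"q"},
    "d5": {"q", "r"},
}

kept, noise = hcleaner_sketch(objects, min_hconf=0.6)
print(sorted(kept))
print(sorted(noise))
```

In this toy data d4 and d5 share an attribute with each other, yet neither appears in any qualifying group of three, so both end up labeled as noise. Enumerating all triples is cubic in the number of objects; a real implementation would rely on a hyperclique miner with pruning instead.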
Noise reduction is an essential step in data cleansing. Data cleansing contributes to accurate analysis results. As proposed in the paper, HCleaner does not need to go beyond size-3 hyperclique patterns, so there is no combinatorial growth of the pattern space, which makes it an efficient and scalable algorithm.
Make customers aware of unclaimed assets. Help businesses identify and prevent dormant accounts.
When a customer sets up a bank account, there is a potential for the customer to abandon it. Accounts with a positive balance become dormant after a certain period of time. Assets inside a dormant account cannot legally be used by the business.
To build an application that helps identify where accounts have gone dormant.
Understand basic facts about dormant accounts. Identify the areas where dormant accounts exist according to various factors such as geography, bank, bank branch and amount. Provide better answers to such questions in the future.
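A minimal sketch of how such an application might flag dormant accounts and count them by one factor. Everything specific here is an assumption for illustration: the field names (`branch`, `balance`, `last_activity`), the 365-day dormancy period and the sample records are hypothetical, and the real period depends on the applicable rules.

```python
from collections import Counter
from datetime import date, timedelta

# Assumed dormancy threshold; the actual period is set by regulation.
DORMANCY_PERIOD = timedelta(days=365)


def is_dormant(account, today):
    """Positive balance and no activity for at least DORMANCY_PERIOD."""
    inactive_for = today - account["last_activity"]
    return account["balance"] > 0 and inactive_for >= DORMANCY_PERIOD


def dormant_counts_by(accounts, key, today):
    """Count dormant accounts grouped by a field such as 'branch'."""
    return Counter(a[key] for a in accounts if is_dormant(a, today))


# Hypothetical sample records.
accounts = [
    {"branch": "North", "balance": 120.0, "last_activity": date(2022, 5, 1)},
    {"branch": "North", "balance": 0.0, "last_activity": date(2021, 1, 1)},
    {"branch": "South", "balance": 50.0, "last_activity": date(2023, 11, 15)},
    {"branch": "South", "balance": 75.5, "last_activity": date(2022, 12, 31)},
    {"branch": "North", "balance": 10.0, "last_activity": date(2020, 3, 3)},
]

counts = dormant_counts_by(accounts, "branch", today=date(2024, 1, 1))
print(dict(counts))
```

The same grouping function can be pointed at any other field, such as bank or region, to see where dormant accounts concentrate.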
Week: Progress
1.5 weeks: Data warehouse consolidation, including data cleansing
2 weeks: Developing the data mining system and front end
0.5 week: Testing