Published by Omar Codner. Modified over 9 years ago.
1
Kamiya Chaudhary Daniel Green
2
Noise: irrelevant, inconsistent, duplicate, and missing data. It is introduced into the system by hardware failures, programming errors, and gibberish input. Impact: it needlessly increases the amount of stored data and can lead to wrong analysis, because incorrect data is evaluated.
3
Ideally, only relevant data should be used for the analysis, so that the results are accurate. However, in the real world it is practically impossible to have a system with no noisy data. Therefore, certain techniques have been proposed to remove as much noisy data as possible and so improve the quality of the analysis. The goal is to remove the data that hinders the analysis.
4
Noise removal techniques: distance-based, density-based, clustering-based, and HCleaner.
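The slide only names the techniques. As an illustration of the distance-based idea, here is a minimal sketch that treats the points farthest from their k-th nearest neighbor as noise. The function name `knn_distance_noise`, the parameters `k` and `n_noise`, and the sample points are assumptions made for this example, not something taken from the presentation.

```python
import math

def knn_distance_noise(points, k, n_noise):
    """Return indices of the n_noise points farthest from their k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    # Largest k-th-neighbor distance first: these are the most isolated points.
    most_isolated = sorted(scores, reverse=True)[:n_noise]
    return sorted(i for _, i in most_isolated)

# Hypothetical data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_distance_noise(points, k=2, n_noise=1))  # [4]
```

Note that the number of noise points is an input here, which contrasts with HCleaner, where the set of noise objects is a result of the algorithm.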
5
HCleaner: Hyperclique-Based Data Cleaner. It eliminates data objects that are not tightly connected to other data objects in the data set. Hyperclique patterns are generated instead of frequent patterns. A hyperclique pattern contains items that are strongly correlated with each other.
6
Hyperclique pattern                                                Support   H-confidence
{earrings, gold ring, bracelet}                                    0.019%    45.8%
{nokia battery, nokia adapter, nokia wireless phone}               0.049%    52.8%
{coffee maker, can opener, toaster}                                0.014%    61.5%
{baby bumper pad, diaper stacker, baby crib sheet}                 0.028%    72.7%
{skirt tub, 3pc bath set, shower curtain}                          0.26%     74.4%
{jar cookie, container 3pc, box bread, soup tureen, goblets 8ps}   0.012%    77.8%
7
HCleaner generates hyperclique patterns and eliminates objects that are not part of any hyperclique pattern, based on h-confidence. H-confidence is a measure that reflects the overall affinity among the items within a pattern. The higher the h-confidence threshold, the more closely related the objects in a pattern are. HCleaner uses size-3 hyperclique patterns, because a pair alone can reflect a strong correlation between one irrelevant object and another irrelevant object.
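The slides do not spell out how h-confidence is computed. Under the usual definition in the hyperclique literature, the h-confidence of a pattern is its support divided by the largest support of any single item in the pattern. The sketch below assumes that definition; the helper names and the toy transactions are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def h_confidence(itemset, transactions):
    """Support of the pattern divided by the largest single-item support."""
    largest_single = max(support_count([item], transactions) for item in itemset)
    if largest_single == 0:
        return 0.0
    return support_count(itemset, transactions) / largest_single

# Toy transactions, invented for illustration only.
transactions = [
    {"coffee maker", "can opener", "toaster"},
    {"coffee maker", "can opener", "toaster"},
    {"coffee maker", "toaster"},
    {"can opener"},
    {"bread"},
]
pattern = ["coffee maker", "can opener", "toaster"]
h = h_confidence(pattern, transactions)  # 2 joint occurrences / 3 for the most frequent item
```

Because the denominator is the most frequent item, a high h-confidence means that every item in the pattern, even the most common one, usually appears together with the rest.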
8
Therefore, an object that appears in a size-3 pattern has at least 2 other objects with which it has a guaranteed pairwise similarity. The computational cost comes from generating the size-3 hyperclique patterns. Which data is labeled as noise is not an input to the algorithm but a result of it.
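The filtering step can be sketched as follows. This is a brute-force illustration under stated assumptions, not the implementation from the paper: objects are described by sets of binary attributes, each object plays the role of an item, and the h-confidence of three objects is the number of attributes they all share divided by the largest attribute count among them. The function name `hcleaner_sketch`, the `min_shared` parameter, and the data are hypothetical.

```python
from itertools import combinations

def hcleaner_sketch(objects, h_conf_threshold, min_shared=1):
    """Split objects into (kept, noise) using size-3 patterns over binary attributes.

    objects: dict mapping an object name to its set of attributes.
    """
    kept = set()
    for trio in combinations(objects, 3):
        attr_sets = [objects[name] for name in trio]
        shared = len(set.intersection(*attr_sets))   # joint support of the trio
        largest = max(len(s) for s in attr_sets)     # largest single-object support
        if largest and shared >= min_shared and shared / largest >= h_conf_threshold:
            kept.update(trio)
    noise = set(objects) - kept
    return kept, noise

# "d" and "e" match each other perfectly but belong to no qualifying trio.
objects = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y"},
    "d": {"q", "r"},
    "e": {"q", "r"},
}
kept, noise = hcleaner_sketch(objects, h_conf_threshold=0.6)
```

In this toy data the pair "d" and "e" is strongly correlated, yet both end up in `noise` because no third object supports them, which is the motivation the slides give for using size-3 patterns. The noise set is produced by the procedure rather than being supplied to it.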
9
Noise reduction is an essential step in data cleansing. Data cleansing contributes to accurate analysis results. As proposed by the paper, HCleaner does not need to go beyond size-3 hyperclique patterns, so there is no combinatorial growth of the pattern space, which makes it an efficient and scalable algorithm.
10
http://www-users.cs.umn.edu/~kumar/papers/noise_removal_tkde.pdf
http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
http://books.google.com/books?id=gdkY4QHy0XIC&pg=PA887&lpg=PA887&dq=noise+reduction+data+warehouse&source=bl&ots=280EhMs0dB&sig=1TKs37HYn9LFlq-qPCoHvAHGTKY&hl=en&sa=X&ei=lstaVPidJ8f6oQT8goLYCw&ved=0CFYQ6AEwBg#v=onepage&q=noise%20reduction%20data%20warehouse&f=false
11
Kamiya Chaudhary Daniel Green
12
Problem statement Proposed solution
13
Data on dormant accounts is scattered across banks. It would be beneficial if the data were centralized.
14
To build a centralized system for all the dormant bank accounts which could be used by various other systems.
15
One centralized location for all information. Easier to search. Answers various questions and may also lead to new questions about where dormant accounts exist.
16
Collect data from various systems. Cleanse the data. Generate a snowflake schema.
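A snowflake schema for this data might look like the sketch below, built with Python's standard `sqlite3` module. The slides do not show the actual schema, so every table and column name here is a hypothetical choice: a fact table of dormant accounts points to a branch dimension, which is normalized further into bank and region tables.

```python
import sqlite3

SCHEMA = """
CREATE TABLE region (region_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE bank   (bank_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- The branch dimension is normalized into bank and region (the "snowflake" part).
CREATE TABLE branch (
    branch_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    bank_id   INTEGER NOT NULL REFERENCES bank(bank_id),
    region_id INTEGER NOT NULL REFERENCES region(region_id)
);
CREATE TABLE dormant_account_fact (
    account_id    INTEGER PRIMARY KEY,
    branch_id     INTEGER NOT NULL REFERENCES branch(branch_id),
    balance       REAL NOT NULL,
    last_activity TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows, for illustration only.
conn.execute("INSERT INTO region VALUES (1, 'North')")
conn.execute("INSERT INTO bank VALUES (1, 'Bank A')")
conn.execute("INSERT INTO branch VALUES (10, 'Main St', 1, 1)")
conn.executemany(
    "INSERT INTO dormant_account_fact VALUES (?, ?, ?, ?)",
    [(100, 10, 250.0, "2010-01-15"), (101, 10, 75.5, "2009-06-30")],
)

# Count and total balance of dormant accounts per bank and region.
rows = conn.execute("""
    SELECT bank.name, region.name, COUNT(*), SUM(f.balance)
    FROM dormant_account_fact AS f
    JOIN branch ON branch.branch_id = f.branch_id
    JOIN bank   ON bank.bank_id     = branch.bank_id
    JOIN region ON region.region_id = branch.region_id
    GROUP BY bank.name, region.name
""").fetchall()
```

Keeping bank and region in their own tables avoids repeating their names on every branch row, at the cost of extra joins when querying.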
17
Daniel Green Kamiya Chaudhary
18
Motivation Background Problem statement Proposed solution Benefits Schedule
19
Make customers aware of unclaimed assets. Help businesses identify and prevent dormant accounts.
20
When a customer sets up a bank account, there is a potential for the customer to abandon it. Accounts with a positive balance become dormant after a certain period of time. Assets inside a dormant account cannot legally be used by the business.
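The dormancy rule described above can be sketched as a simple check. The slides say only "a certain period of time", so the three-year `DORMANCY_PERIOD` below is a placeholder assumption, as are the function name and the use of a last-activity date to measure inactivity.

```python
from datetime import date, timedelta

# Placeholder value: the actual period is not given in the slides.
DORMANCY_PERIOD = timedelta(days=3 * 365)

def is_dormant(balance, last_activity, today, period=DORMANCY_PERIOD):
    """An account is dormant if it holds a positive balance and has been inactive for the period."""
    return balance > 0 and (today - last_activity) >= period

today = date(2014, 11, 1)
old_account = is_dormant(120.0, date(2010, 1, 15), today)     # inactive for more than 3 years
recent_account = is_dormant(120.0, date(2014, 6, 1), today)   # recently active
empty_account = is_dormant(0.0, date(2010, 1, 15), today)     # no positive balance
```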
21
To build an application to help identify where accounts have gone dormant.
22
Understand basic facts about dormant accounts. Identify the areas where dormant accounts exist, broken down by factors such as geography, bank, bank branch, and amount. Provide better answers to such questions in the future.
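Breaking dormant accounts down by one of these factors amounts to a group-and-aggregate step. The sketch below assumes each account is a dictionary with hypothetical keys `region`, `bank`, `branch`, and `amount`; the records are invented for illustration.

```python
from collections import defaultdict

def summarize(accounts, key):
    """Count dormant accounts and total their amounts for each value of key."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for account in accounts:
        group = summary[account[key]]
        group["count"] += 1
        group["total"] += account["amount"]
    return dict(summary)

accounts = [
    {"region": "North", "bank": "Bank A", "branch": "Main St", "amount": 100.0},
    {"region": "North", "bank": "Bank B", "branch": "High St", "amount": 200.0},
    {"region": "South", "bank": "Bank A", "branch": "Dock Rd", "amount": 50.0},
]
by_region = summarize(accounts, "region")
by_bank = summarize(accounts, "bank")
```

The same function answers the question for any of the factors by changing `key`, for example `summarize(accounts, "branch")`.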
23
Week         Progress
1.5 weeks    Data warehouse consolidation, including data cleansing
2 weeks      Developing the data mining system and front end
0.5 week     Testing
24
https://www.ki.informatik.hu-berlin.de/mac/lehre/lehrmaterial/Informationsintegration/Rahm00.pdf
http://www.edmontonjournal.com/news/unclaimed-bank-accounts/index.html
http://www.helpwithmybank.gov/get-answers/bank-accounts/inactive-accounts/bank-accounts-inactive-accounts-quesindx.html
http://www.helpwithmybank.gov/get-answers/bank-accounts/inactive-accounts/faq-bank-accounts-inactive-accounts-01.html