Kamiya Chaudhary Daniel Green. Noise – Irrelevant, inconsistent, duplicate and missing data Introduced in the system - hardware failures, programming.

Presentation on theme: "Kamiya Chaudhary Daniel Green. Noise – Irrelevant, inconsistent, duplicate and missing data Introduced in the system - hardware failures, programming."— Presentation transcript:

1 Kamiya Chaudhary Daniel Green

2 Noise: irrelevant, inconsistent, duplicate, and missing data. Introduced into the system by hardware failures, programming errors, and gibberish input. Impact: an unnecessary increase in stored data, and possibly wrong analysis because incorrect data is evaluated.

3 Ideally, the data used for analysis should contain only relevant data, so that the results are accurate. However, in the real world it is practically impossible to have a system with no noisy data. Therefore, certain techniques have been proposed to remove as much noisy data as possible and improve the quality of the analysis. The goal is to remove the data that hinders the analysis.

4 • Distance Based • Density Based • Clustering Based • HCleaner

5 HCleaner: HyperClique-Based Data Cleaner. Eliminates data objects that are not tightly connected to other data objects in the data set. Hyperclique patterns are generated instead of frequent patterns. Hyperclique patterns contain items that are strongly correlated with each other.

6 Hyperclique pattern                                               Support   H-confidence
  {earings, gold ring, bracelet}                                    0.019%    45.8%
  {nokia battery, nokia adapter, nokia wireless phone}              0.049%    52.8%
  {coffee maker, can opener, toaster}                               0.014%    61.5%
  {baby bumper pad, diaper stacker, baby crib sheet}                0.028%    72.7%
  {skirt tub, 3pc bath set, shower curtain}                         0.26%     74.4%
  {jar cookie, container 3pc, box bread, soup tureen, goblets 8ps}  0.012%    77.8%
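
The support and h-confidence columns above can be illustrated with a small sketch. It assumes the standard definition from the hyperclique pattern literature: the support of the whole pattern divided by the largest support of any single item in it. The function names and the toy transactions are hypothetical and are not taken from the slides.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def h_confidence(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Support of the pattern divided by the largest single-item support."""
    items = list(itemset)
    max_single = max(support({i}, transactions) for i in items)
    if max_single == 0:
        return 0.0
    return support(items, transactions) / max_single


# Toy data (hypothetical): five market baskets.
transactions = [
    {"coffee maker", "can opener", "toaster"},
    {"coffee maker", "can opener", "toaster"},
    {"coffee maker", "toaster"},
    {"milk"},
    {"milk", "toaster"},
]
pattern = {"coffee maker", "can opener", "toaster"}

print(support(pattern, transactions))       # 0.4 (2 of 5 baskets)
print(h_confidence(pattern, transactions))  # 0.5 (0.4 / 0.8, toaster is in 4 of 5)
```

A pattern can have very low support, as in the table, and still have a high h-confidence, because the measure compares the pattern only against its own most frequent item.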

7 • Generates hyperclique patterns • Eliminates objects that are not part of any hyperclique pattern, based on h-confidence • H-confidence is a measure that reflects the overall affinity among the items within a pattern • The higher the h-confidence threshold, the more closely related the objects in a pattern are • Uses hyperclique patterns of size 3 because an irrelevant object can be strongly correlated with another irrelevant object.

8 • Therefore, an object that appears in a size-3 pattern has at least two other objects with a guaranteed pairwise similarity to it • The computational cost comes from generating the size-3 hyperclique patterns • The data to be labeled as noise is not an input to the algorithm but a result of it.
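
The filtering idea on these two slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each object is represented by a set of features, the support of a group of objects is taken to be the number of features they all share, and every triple is enumerated by brute force instead of using a dedicated hyperclique miner. The names `hcleaner_sketch`, `features_of`, and `h_threshold`, and the toy data, are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def h_confidence(group: Iterable[str], features_of: Dict[str, Set[int]]) -> float:
    """Shared feature count divided by the largest feature count in the group."""
    members = list(group)
    shared = set.intersection(*(features_of[o] for o in members))
    largest = max(len(features_of[o]) for o in members)
    return len(shared) / largest if largest else 0.0


def hcleaner_sketch(
    features_of: Dict[str, Set[int]], h_threshold: float
) -> Tuple[Set[str], Set[str]]:
    """Keep objects covered by at least one size-3 pattern; the rest is noise."""
    covered: Set[str] = set()
    # Brute-force enumeration of triples (assumption: fine for a toy example only).
    for triple in combinations(sorted(features_of), 3):
        score = h_confidence(triple, features_of)
        if score > 0 and score >= h_threshold:
            covered.update(triple)
    noise = set(features_of) - covered
    return covered, noise


# Toy data (hypothetical): "d" and "e" resemble each other but nothing else.
objects = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3},
    "c": {1, 2, 3, 5},
    "d": {7, 8},
    "e": {8, 9},
}

kept, noise = hcleaner_sketch(objects, h_threshold=0.5)
print(sorted(kept))   # ['a', 'b', 'c']
print(sorted(noise))  # ['d', 'e']
```

In this toy data the pair ("d", "e") on its own reaches an h-confidence of 0.5, so a size-2 test would keep both objects. Requiring a size-3 pattern removes them, which is the motivation given on the slides. Note also that the noise set is an output of the procedure, not an input.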

9 • Noise reduction is an essential step in data cleansing • Data cleansing contributes to accurate analysis results • As proposed by the paper, HCleaner does not need to go beyond hyperclique patterns, and no combinatorial growth of the pattern space is required, which makes it an efficient and scalable algorithm.

10 • users.cs.umn.edu/~kumar/papers/noise_removal_tkde.pdf • preprocess.pdf • 87&lpg=PA887&dq=noise+reduction+data+warehouse&source=bl&ots=280EhMs0dB&sig=1TKs37HYn9LFlq-qPCoHvAHGTKY&hl=en&sa=X&ei=lstaVPidJ8f6oQT8goLYCw&ved=0CFYQ6AEwBg#v=onepage&q=noise%20reduction%20data%20warehouse&f=false

11 Kamiya Chaudhary Daniel Green

12 • Problem statement • Proposed solution

13 Information about dormant accounts is scattered across banks. It would be beneficial if this data were centralized.

14 To build a centralized system for all dormant bank accounts that various other systems could use.

15 • One centralized location for all information • Easier to search • Answers various questions and may also lead to new questions about where dormant accounts exist.

16 • Collect data from various systems • Data cleansing • Generate a snowflake schema

17 Daniel Green Kamiya Chaudhary

18 • Motivation • Background • Problem statement • Proposed solution • Benefits • Schedule

19 • Make customers aware of unclaimed assets • Help businesses identify and prevent dormant accounts

20 When a customer sets up a bank account, there is a chance that the customer will later abandon it. Accounts with a positive balance become dormant after a certain period of time. Assets in a dormant account cannot legally be used by the business.

21 To build an application to help identify where accounts have gone dormant.

22 • Understand basic facts about dormant accounts • Identify the areas where dormant accounts exist according to various factors such as geography, bank, bank branch, and amount • Provide better answers to future questions

23 Week                Progress
   1 and a half weeks  Data warehouse consolidation, including data cleansing
   2 weeks             Developing the data mining system and front end
   Half a week         Testing

24 • https://www.ki.informatik.hu-berlin.de/mac/lehre/lehrmaterial/Informationsintegration/Rahm00.pdf • d-bank-accounts/index.html • answers/bank-accounts/inactive-accounts/bank-accounts-inactive-accounts-quesindx.html • answers/bank-accounts/inactive-accounts/faq-bank-accounts-inactive-accounts-01.html

