Published by Roy Gravlin. Modified over 4 years ago.

1
Technical Seminar 2004
An Adaptive Algorithm for Detection of Duplicate Records
Presented by: Ramakanta Behera (IT200127207)
Under the guidance of: Miss Ipsita Mishra

2
INTRODUCTION
A “records set” is a list of prior distinct records.
A new record is to be verified for a duplicate against the records set.
A database is a collection of related data.
Various algorithms exist for this task, such as machine learning algorithms, learnable string similarity measures, and adaptive algorithms.

3
OBJECTIVES
Reduce the cost of duplicate record detection.
Make the detection procedure scalable.
Cache information about prior distinct records, so that the prior records themselves need not be retained for later searches.
Keep the algorithm adaptive.

4
PREVALENT METHODS
The Brute Force Method
Each check has a cost on the order of the number of records in the records set, and all prior records must be stored.
Method by Rail et al.
The comparison of a new record against the records set is reduced from a full-text match to a comparison of two integers.
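For comparison, the brute-force method can be sketched as below. This is a minimal illustration rather than code from the seminar; the names `is_duplicate_brute_force` and `add_if_new` are hypothetical. It makes both costs visible: a linear scan per check, and a list that must hold every prior record.

```python
def is_duplicate_brute_force(new_record, records_set):
    """Compare the new record against every prior record (linear scan)."""
    for prior in records_set:
        if prior == new_record:  # full-text match against one stored record
            return True
    return False


def add_if_new(new_record, records_set):
    """Append the record only if it is distinct; return True if it was a duplicate."""
    if is_duplicate_brute_force(new_record, records_set):
        return True
    records_set.append(new_record)  # every distinct record has to be kept
    return False
```

Each call to `add_if_new` inspects up to `len(records_set)` stored records, which is the cost the proposed algorithm aims to avoid.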

5
OUTLINE OF THE PROPOSED SOLUTION
The central idea behind the present algorithm is based on the fundamental property of primality of numbers.
Fig: Hashing. f(x) maps the record set into the integer number space (I).
Fig: Extended hashing into prime space. f(x) maps the record set to an integer (I), and g(x) maps that integer to a prime number (P).

6
Fig: The complete algorithm. Records r1, r2, …, rn are mapped by f(x) to integers I1, I2, …, In, which are mapped by g(x) to primes P1, P2, …, Pn. The product of these primes is Pprior = P1 * P2 * … * Pn.

7
REALIZATION OF THE ALGORITHM
Two functions, f(x) and g(x), are to be realized for the implementation of the algorithm:
Realizing f(x)
Realizing g(x)
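The slides do not say how f(x) and g(x) are realized, so the following is only one possible sketch under stated assumptions. Here f(x) is assumed to be a SHA-256 digest truncated to `HASH_BITS` bits, and g(x) is assumed to be a lookup into a precomputed table of primes. `HASH_BITS`, `PRIME_TABLE`, and `first_primes` are hypothetical names introduced for illustration. Note that a truncated hash can collide, so this f(x) is not truly unique per record as the algorithm requires.

```python
import hashlib

HASH_BITS = 12  # hypothetical width, kept small so the prime table is cheap to build


def first_primes(count):
    """Return the first `count` primes using trial division by earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < count:
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break  # no divisor found up to sqrt(candidate)
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes


# One prime per possible hash value (assumption: the table fits in memory).
PRIME_TABLE = first_primes(2 ** HASH_BITS)


def f(record):
    """Hash a record (a string) to an integer in [0, 2**HASH_BITS)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** HASH_BITS)


def g(hash_value):
    """Map hash value k to the (k+1)-th prime, so distinct integers get distinct primes."""
    return PRIME_TABLE[hash_value]
```

Because g(x) is a one-to-one table lookup, any loss of uniqueness in this sketch comes from f(x) alone.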

8
STEPS OF THE ALGORITHM
Step 1: Each new record is hashed to obtain a hash value (Hnew) that is unique to each distinct record.
Step 2: Hnew is mapped to its corresponding unique prime (Pnew).
Step 3: Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate and is already represented in Pprior. Otherwise, it is a distinct record.
Step 4: If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way is what renders the algorithm adaptive.
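The four steps above can be sketched as follows. This is an illustrative sketch, not the presenter's implementation: the class name `DuplicateDetector`, the truncated SHA-256 hash, and the prime table are assumptions made so that the example is runnable. Python's arbitrary-precision integers stand in for however Pprior is actually stored.

```python
import hashlib

HASH_BITS = 10  # hypothetical hash width; a real f(x) would need to be collision-free


def _first_primes(count):
    """Return the first `count` primes by trial division (fine for a small table)."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


PRIMES = _first_primes(2 ** HASH_BITS)


def record_hash(record):
    """Assumed f(x): SHA-256 of the record, reduced to HASH_BITS bits."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** HASH_BITS)


class DuplicateDetector:
    def __init__(self):
        self.p_prior = 1  # empty product: no records seen yet

    def check_and_add(self, record):
        """Return True if the record is a duplicate, otherwise fold it into p_prior."""
        h_new = record_hash(record)    # Step 1: hash the new record
        p_new = PRIMES[h_new]          # Step 2: map the hash to its prime
        if self.p_prior % p_new == 0:  # Step 3: exact division means duplicate
            return True
        self.p_prior *= p_new          # Step 4: update the running product
        return False
```

A first call such as `detector.check_and_add("alice")` returns `False`, and repeating the same call returns `True`. Only the single integer `p_prior` is kept between calls, so the prior records themselves are never stored.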

9
Fig: Flowchart

10
IMPLEMENTATIONS
There are three important implementation details that need to be discussed:
Size of Records set
Use of Logarithms
Subsets of Records set
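The slide lists these three points without elaboration. One concern they plausibly relate to (an assumption, not something the slides state) is that a product of primes grows with the number of records it represents. The sketch below measures the storage a product needs in bits and shows that a sum of logarithms gives nearly the same figure without forming the product. The function names are hypothetical.

```python
import math


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def product_bits(primes):
    """Exact number of bits needed to store the product of the given primes."""
    product = 1
    for p in primes:
        product *= p
    return product.bit_length()


def estimated_bits(primes):
    """Estimate the same size from logarithms: log2 of a product is a sum of log2 terms."""
    return math.floor(sum(math.log2(p) for p in primes)) + 1
```

Comparing `product_bits(first_primes(n))` for increasing `n` shows how quickly the product grows. Under the same assumption, this growth is one possible motivation for splitting the records set into subsets, each with its own smaller product.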

11
CONCLUSION
A new approach to handling duplicate records is presented.
This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of “duplicate record detection”.

12
THANK YOU !!!

