Search for Approximate Matches in Large Databases
Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes


1 Search for Approximate Matches in Large Databases
Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes

2 Motivation
Fast identification of approximate matches in large sets of records.
Applications: medical databases, customer records, national security.

3 Outline
Records and queries; Search for matches; Experimental results

4 Table of records
We specify a table of records by a list of attributes.
Example: We can describe patients in a hospital by their sex, age, and diagnosis.

5 Records and queries
A record includes a specific value for each attribute. A query may include lists of values and numeric ranges.
Example
Record: Sex: female; Age: 30; Dx: asthma
Query: Sex: male, female; Age: 20..40; Dx: asthma, flu

6 Query types
A point query includes a specific value for each attribute. A region query includes lists of values or numeric ranges.
Example
Point query: Sex: female; Age: 30; Dx: asthma
Region query: Sex: male, female; Age: 20..40; Dx: asthma, flu

7 Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query.
[Figure: a record and a query, with axes labeled Age, Sex, and Dx.]
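The exact-match definition above can be sketched in a few lines of Python. The representation is an assumption made for illustration, not taken from the slides: a record is a dict of attribute values, and a query maps each attribute either to a set of allowed values or to an inclusive `(low, high)` numeric range. The helper names `value_matches` and `is_exact_match` are hypothetical.

```python
# A minimal sketch of the exact-match test, under assumed data representations.
# Record: dict of attribute -> value.
# Query:  dict of attribute -> set of allowed values, or (low, high) inclusive range.

def value_matches(value, constraint):
    """Return True if a single value satisfies a single query constraint."""
    if isinstance(constraint, tuple):      # numeric range, e.g. (20, 40)
        low, high = constraint
        return low <= value <= high
    return value in constraint             # list of values, stored as a set


def is_exact_match(record, query):
    """A record matches exactly if every attribute value satisfies the query."""
    return all(value_matches(record[attr], c) for attr, c in query.items())


# The example record and region query from the slides.
record = {"sex": "female", "age": 30, "dx": "asthma"}
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
print(is_exact_match(record, query))
```

A point query is the special case in which every set has one element and every range has equal endpoints.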

8 Approximate matches
A record is an approximate match for a query if it is “close” to the query region.
[Figure: a record and a query, with axes labeled Age, Sex, and Dx.]

9 Approximate queries
An approximate query includes: a point or region; a distance function; the number of matches; a distance limit.
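The four parts of an approximate query can be illustrated with a linear scan. The slides do not say which distance function is used, so the one below is purely an assumption: a numeric value contributes its gap to the nearest end of the range, a categorical value contributes 1 when it is missing from the list, and the contributions are summed. The names `attribute_distance`, `region_distance`, and `approximate_matches` are hypothetical.

```python
import heapq

# A minimal sketch of an approximate query evaluated by linear scan.
# The distance function is an assumption; the slides leave it unspecified.

def attribute_distance(value, constraint):
    """Distance from one value to one constraint (0 if the value satisfies it)."""
    if isinstance(constraint, tuple):          # numeric range (low, high)
        low, high = constraint
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0
    return 0 if value in constraint else 1     # set of allowed values


def region_distance(record, query):
    """Sum of per-attribute distances from the record to the query region."""
    return sum(attribute_distance(record[a], c) for a, c in query.items())


def approximate_matches(records, query, max_matches, distance_limit,
                        distance=region_distance):
    """Return up to max_matches (record, distance) pairs within distance_limit,
    closest first."""
    scored = ((distance(r, query), i) for i, r in enumerate(records))
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [(records[i], d) for d, i in heapq.nsmallest(max_matches, within)]


records = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 45, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 80, "dx": "fracture"},
]
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
for rec, d in approximate_matches(records, query, max_matches=2, distance_limit=10):
    print(d, rec)
```

Mixing years of age with a 0/1 category mismatch is only a convenience here; a real distance function would weight the attributes.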

10 Outline
Records and queries; Search for matches; Experimental results

11 Indexing structure
Maintain a PATRICIA tree of records. Group nodes into fixed-size disk blocks.
Example records: (male, 30, asthma); (female, 30, asthma); (male, 40, flu); (female, 50, flu); (female, 30, ulcer); (female, 30, fracture)
[Figure: tree with levels sex, age, and diagnosis, indexing the six example records.]
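To make the tree layout concrete, here is a simplified sketch that indexes the six example records with one tree level per attribute. It is not a PATRICIA tree as such: a PATRICIA tree also collapses chains of single-child nodes, and the grouping of nodes into disk blocks is omitted entirely. The attribute order (sex, age, diagnosis) and the names `ATTRIBUTES`, `insert`, and `build_index` are assumptions.

```python
# A simplified, uncompressed trie over attribute values (nested dicts).
# Assumptions: attribute order sex -> age -> dx; no path compression
# (which a real PATRICIA tree would have); no disk-block layout.

ATTRIBUTES = ("sex", "age", "dx")


def insert(root, record):
    """Insert one record, creating one tree level per attribute."""
    node = root
    for attr in ATTRIBUTES[:-1]:
        node = node.setdefault(record[attr], {})
    # The last level maps the final attribute value to the records stored there.
    node.setdefault(record[ATTRIBUTES[-1]], []).append(record)


def build_index(records):
    """Build the nested-dict index from an iterable of records."""
    root = {}
    for record in records:
        insert(root, record)
    return root


records = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
index = build_index(records)
print(sorted(index["female"][30]))   # diagnoses stored under female, age 30
```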

12 Search for matches
Depth-first search for exact matches. Best-first search for approximate matches.
Example records: (male, 30, asthma); (female, 30, asthma); (male, 40, flu); (female, 50, flu); (female, 30, ulcer); (female, 30, fracture)
[Figure: tree with levels sex, age, and diagnosis, indexing the six example records.]
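The two search strategies can be sketched over a small in-memory tree. Everything below is an illustration under stated assumptions rather than the authors' implementation: records are `(sex, age, dx)` tuples, the tree is an uncompressed nested dict, a query is a tuple of constraints in the same attribute order, and the distance is an assumed sum of per-attribute gaps. Because that distance is additive and non-negative, the distance accumulated along a path is a lower bound for every record below it, which is what lets best-first search return matches in order of distance.

```python
import heapq
from itertools import count

# Sketch of depth-first (exact) and best-first (approximate) search.
# Assumptions: tuple records, nested-dict tree, additive distance function.

def build_index(records):
    root = {}
    for rec in records:
        node = root
        for value in rec[:-1]:
            node = node.setdefault(value, {})
        node.setdefault(rec[-1], []).append(rec)   # leaf: list of records
    return root


def attribute_distance(value, constraint):
    if isinstance(constraint, tuple):              # numeric range (low, high)
        low, high = constraint
        return max(low - value, value - high, 0)
    return 0 if value in constraint else 1         # set of allowed values


def exact_matches(node, query, depth=0):
    """Depth-first search: follow only branches that satisfy the query."""
    if depth == len(query):
        yield from node
        return
    for value, child in node.items():
        if attribute_distance(value, query[depth]) == 0:
            yield from exact_matches(child, query, depth + 1)


def approximate_matches(root, query, max_matches, distance_limit):
    """Best-first search: always expand the path with the smallest distance."""
    tie = count()                                  # avoids comparing dicts
    frontier = [(0, next(tie), 0, root)]           # (distance, tie, depth, node)
    results = []
    while frontier and len(results) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(query):                    # reached a list of records
            results.extend((rec, dist) for rec in node)
            continue
        for value, child in node.items():
            d = dist + attribute_distance(value, query[depth])
            if d <= distance_limit:                # prune branches past the limit
                heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return results[:max_matches]


records = [("male", 30, "asthma"), ("female", 30, "asthma"), ("male", 40, "flu"),
           ("female", 50, "flu"), ("female", 30, "ulcer"), ("female", 30, "fracture")]
index = build_index(records)
query = ({"male", "female"}, (20, 40), {"asthma", "flu"})
print(sorted(exact_matches(index, query)))
print(approximate_matches(index, query, max_matches=4, distance_limit=5))
```

Depth-first search never needs a priority queue because it discards any branch that violates the query; best-first search keeps such branches as long as they stay within the distance limit.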

13 Outline
Records and queries; Search for matches; Experimental results

14 Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002: twenty-one attributes, 1.6 million records.
Use of a Pentium computer: 2.4 GHz CPU, 1 Gbyte memory, 400 MHz bus.

15 Variables
Control variables: number of records, memory size, query type.
Measurements: retrieval time.

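16 Small memory
Number of records: 100 to 1,672,016. Memory size: 4 MByte.
[Plot: retrieval time (msec; ticks 1, 10, 100) versus number of records (ticks 10^2 to 10^6) for range, approximate, and exact queries, with a marker for the available memory. Growth-rate annotations, in extraction order: lg n, n^0.15, lg n, n^0.5; their assignment to curves is not recoverable from the transcript.]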

17 Large memory
Number of records: 1,672,016. Memory size: 64 to 1,024 MByte.
[Plot: retrieval time (msec; ticks 1, 10, 100, 1,000, 10,000) versus memory size (MBytes; ticks 64, 128, 256, 512, 1,024) for range, approximate, and exact queries.]

18 Summary
Retrieval time grows as a fractional power (about 0.5) of the database size. If we extrapolate this growth rate, retrieval times remain reasonable for very large databases.

19 Summary
Retrieval time grows as a fractional power (about 0.5) of the database size. If we extrapolate this growth rate, retrieval times remain reasonable for very large databases:

Number of records (n)      n^0.5 time (seconds)
1,000,000                  0.05
100,000,000                0.50
10,000,000,000             5.00
1,000,000,000,000          50.00
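The extrapolated times follow from scaling one reference point by the square root of the size ratio. The sketch below takes the first row of the summary (0.05 seconds at 1,000,000 records) as that reference and assumes the exponent stays at 0.5 for larger databases; the function name `extrapolate_time` and its parameters are hypothetical.

```python
# Extrapolating retrieval time as t(n) = t_ref * (n / n_ref) ** exponent.
# Assumptions: reference point 0.05 s at 1,000,000 records (from the summary),
# and an exponent of 0.5 that keeps holding as the database grows.

def extrapolate_time(n, n_ref=1_000_000, t_ref=0.05, exponent=0.5):
    """Estimated retrieval time in seconds for a database of n records."""
    return t_ref * (n / n_ref) ** exponent


for n in (1_000_000, 100_000_000, 10_000_000_000, 1_000_000_000_000):
    print(f"{n:>20,}  {extrapolate_time(n):6.2f} s")
```

Each hundredfold increase in the number of records multiplies the estimated time by ten.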

