Novelty Detection and Profile Tracking from Massive Data Jaime Carbonell Eugene Fink Santosh Ananthraman.


1 Novelty Detection and Profile Tracking from Massive Data Jaime Carbonell Eugene Fink Santosh Ananthraman

2 Motivation Search for interesting patterns in large data sets

3 Motivation Search for interesting patterns in large data sets Current applications Processing of intelligence data Prediction of “natural” threats Future applications Scientific discoveries Analysis of business data … and more …

4 Outline Main results of the ARGUS project - Approximate matching - Streaming data - Novelty detection More about approximate matching - Records and queries - Search for matches - Experimental results

5 Large data sets Large: From a million (10^6) to several billion (10^10) records Data: Structured records with numbers, strings, and nominal values Sets: Databases and streams of records Specific sets: Hospital admissions (1.7 million records) Network flow (5 trillion records) Federal wire (simulated data)

6 Main results We have developed a system that addresses three problems: Retrieval of approximate matches for known patterns Processing of streaming data Identification of new patterns and gradual changes in old patterns

7 Approximate matching Fast identification of approximate matches in large sets of records Examples Misspelled names Inexact numbers Spatial proximity

8 Streaming data Continuous search for matches in a stream of new records Maintain a set of “pending” queries Identify matches for these queries among incoming records

9 RETE network Identify common parts of queries and arrange them into a RETE network, which significantly reduces the matching time Hundreds to thousands of pending queries Tens to hundreds of records per second
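The idea of sharing common query parts can be sketched as follows. This is not a full RETE network, only a minimal illustration under stated assumptions: each distinct condition is evaluated once per incoming record, and the result is reused by every pending query that contains it. The condition names, query names, and attribute keys are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

Record = Dict[str, object]

# Hypothetical shared conditions; each is evaluated at most once per record.
CONDITIONS: Dict[str, Callable[[Record], bool]] = {
    "age_20_40": lambda r: 20 <= r["age"] <= 40,
    "asthma": lambda r: r["dx"] == "asthma",
    "female": lambda r: r["sex"] == "female",
}

# Hypothetical pending queries, written as sets of shared condition names.
PENDING: Dict[str, FrozenSet[str]] = {
    "q1": frozenset({"age_20_40", "asthma"}),
    "q2": frozenset({"age_20_40", "female"}),
}


def match_stream_record(record: Record) -> List[str]:
    """Return the names of the pending queries matched by one incoming record."""
    # Collect every condition used by at least one pending query.
    needed = set().union(*PENDING.values())
    # Evaluate each shared condition once, then reuse the result.
    truth = {name: CONDITIONS[name](record) for name in needed}
    return sorted(q for q, conds in PENDING.items() if all(truth[c] for c in conds))
```

Here "age_20_40" is tested once even though both queries depend on it; the saving grows with the number of pending queries that share conditions.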

10 Novelty detection Identify “normal” clusters in the historic data Search for new clusters in the incoming data Track density changes in the existing clusters
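The slides do not give the detection algorithm, so the following is only a rough sketch of one way to track density changes: bin records by their distance from a cluster center and compare the historic and incoming histograms. The function names, bin width, and threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def density_by_distance(distances: List[float], bin_width: float) -> Dict[int, float]:
    """Fraction of records that fall into each distance bin from the cluster center."""
    counts = Counter(int(d // bin_width) for d in distances)
    total = len(distances)
    return {b: c / total for b, c in counts.items()}


def changed_bins(historic: List[float], incoming: List[float],
                 bin_width: float = 1.0, threshold: float = 0.2) -> List[int]:
    """Return the bins whose density differs by more than the threshold."""
    old = density_by_distance(historic, bin_width)
    new = density_by_distance(incoming, bin_width)
    return sorted(
        b for b in set(old) | set(new)
        if abs(new.get(b, 0.0) - old.get(b, 0.0)) > threshold
    )
```

A bin that appears only in the incoming data hints at a new cluster away from the center, while a changed bin that exists in both hints at growth or decline of an existing one.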

11 Example: Static event [Plot: density vs. distance from the center]

12 Example: New event [Plots: density vs. distance]

13 Example: Hidden event [Plot: density vs. distance]

14 Example: Growing event [Plot: density vs. distance]

15 Visualization Display of records, clusters, and queries in two and three dimensions Access to data tables and analysis results

16 Example: Data and clusters

17 Example: Density analysis

18 Information flow

19 Outline Main results of the ARGUS project - Approximate matching - Streaming data - Novelty detection More about approximate matching - Records and queries - Search for matches - Experimental results

20 Motivation Retrieval of relevant records based on partially inaccurate information Inaccurate records Inaccurate queries Incomplete knowledge

21 Table of records We specify a table of records by a list of attributes Example We can describe patients in a hospital by their sex, age, and diagnosis

22 Records and queries A record includes a specific value for each attribute A query may include lists of values and numeric ranges Example Record: Sex: female; Age: 30; Dx: asthma Query: Sex: male, female; Age: 20..40; Dx: asthma, flu

23 Query types A point query includes a specific value for each attribute A region query includes lists of values or numeric ranges Example Point query: Sex: female; Age: 30; Dx: asthma Region query: Sex: male, female; Age: 20..40; Dx: asthma, flu

24 Exact matches A record is an exact match for a query if every value in the record belongs to the respective range in the query [Figure: record and query region over the attributes Age, Sex, and Dx]
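The record, region query, and exact-match test described above can be sketched as follows. The representation is an assumption: a numeric range is a (low, high) tuple, a list of allowed values is a frozenset, and the attribute keys are chosen for illustration.

```python
from typing import Dict, FrozenSet, Tuple, Union

Value = Union[str, float]
Constraint = Union[FrozenSet[str], Tuple[float, float]]


def value_matches(value: Value, constraint: Constraint) -> bool:
    """True if the value lies in the numeric range or in the list of allowed values."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high
    return value in constraint


def is_exact_match(record: Dict[str, Value], query: Dict[str, Constraint]) -> bool:
    """A record matches exactly if every attribute satisfies its constraint."""
    return all(value_matches(record[attr], c) for attr, c in query.items())


# The example record and region query from the slides.
record = {"sex": "female", "age": 30, "dx": "asthma"}
region_query = {
    "sex": frozenset({"male", "female"}),
    "age": (20, 40),
    "dx": frozenset({"asthma", "flu"}),
}
```

A point query is the special case in which every constraint admits exactly one value, for example `{"age": (30, 30), "dx": frozenset({"asthma"})}`.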

25 Approximate matches A record is an approximate match for a query if it is “close” to the query region [Figure: record and query region over the attributes Age, Sex, and Dx]

26 Approximate queries An approximate query includes Point or region Distance function Number of matches Distance limit
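The four parts of an approximate query can be put together in a short sketch. The distance function here is an assumption, since the slides do not define one: a number contributes its gap to the nearest end of the range, a nominal value contributes 0 or 1, and hypothetical per-attribute weights combine them. The sketch scans all records linearly rather than using an index.

```python
import heapq
from typing import Dict, FrozenSet, List, Tuple, Union

Constraint = Union[FrozenSet[str], Tuple[float, float]]


def attribute_distance(value, constraint: Constraint) -> float:
    """Distance from one value to its constraint; 0 means it lies inside."""
    if isinstance(constraint, tuple):
        low, high = constraint
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0.0
    return 0.0 if value in constraint else 1.0


def distance(record: Dict, query: Dict[str, Constraint], weights: Dict[str, float]) -> float:
    """Weighted sum of per-attribute distances (an assumed distance function)."""
    return sum(weights[a] * attribute_distance(record[a], c) for a, c in query.items())


def approximate_matches(records: List[Dict], query: Dict[str, Constraint],
                        weights: Dict[str, float], max_matches: int,
                        distance_limit: float) -> List[Dict]:
    """Return up to max_matches closest records within the distance limit."""
    scored = [(distance(r, query, weights), i) for i, r in enumerate(records)]
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [records[i] for _, i in heapq.nsmallest(max_matches, within)]
```

Both limits act together: the number of matches caps the size of the answer, and the distance limit keeps distant records out even when fewer matches are available.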

27 Indexing tree Maintain a PATRICIA tree of records Group nodes into fixed-size disk blocks Example records: male, 30, asthma; female, 30, asthma; male, 40, flu; female, 50, flu; female, 30, ulcer; female, 30, fracture [Tree diagram with levels diagnosis, age, and sex]

28 Search for matches Depth-first search for exact matches Best-first search for approximate matches Example records: male, 30, asthma; female, 30, asthma; male, 40, flu; female, 50, flu; female, 30, ulcer; female, 30, fracture [Tree diagram with levels diagnosis, age, and sex]
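Best-first search over an attribute tree can be sketched as follows. This is a simplified stand-in, not the system's PATRICIA tree with disk blocks: the tree is a nested dictionary with the assumed level order diagnosis, age, sex, and the per-attribute distance is hypothetical. The search expands the node with the smallest accumulated distance first, so records come out in order of increasing distance from a point query.

```python
import heapq
from itertools import count
from typing import Dict, List, Tuple

ORDER = ("dx", "age", "sex")  # assumed order of tree levels


def build_tree(records: List[Dict]) -> Dict:
    """Nested dictionaries keyed by attribute values; leaves are lists of records."""
    tree: Dict = {}
    for rec in records:
        node = tree
        for attr in ORDER[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[ORDER[-1]], []).append(rec)
    return tree


def attr_distance(attr: str, a, b) -> float:
    """Hypothetical distance: scaled age gap, or 0/1 for nominal values."""
    if attr == "age":
        return abs(a - b) / 10.0
    return 0.0 if a == b else 1.0


def best_first(tree: Dict, query: Dict, k: int) -> List[Tuple[float, Dict]]:
    """Return the k closest (distance, record) pairs for a point query."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(0.0, next(tie), 0, tree)]
    results: List[Tuple[float, Dict]] = []
    while heap and len(results) < k:
        bound, _, depth, node = heapq.heappop(heap)
        if depth == len(ORDER):  # a leaf: the bound is the exact distance
            results.extend((bound, rec) for rec in node)
            continue
        attr = ORDER[depth]
        for value, child in node.items():
            d = bound + attr_distance(attr, value, query[attr])
            heapq.heappush(heap, (d, next(tie), depth + 1, child))
    return results[:k]
```

Because each level can only add to the accumulated distance, a leaf popped from the heap is never farther than any record still unexplored, which lets the search stop as soon as k records are found.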

29 Performance Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002 Twenty-one attributes 1.7 million records Use of a Pentium computer 2.4 GHz CPU 1 GByte memory 400 MHz bus

30 Variables Control variables Number of records Memory size Query type Measurements Retrieval time

31 Small memory Number of records: 100 to 1,670,000 Memory size: 4 MByte [Plot: retrieval time (msec, 10 to 1000) vs. number of records (10^2 to 10^6) for range queries, approximate queries, and point queries; the available memory is marked; curve annotations: lg n n^0.15 lg n n^0.5]

32 Large memory Number of records: 1,670,000 Memory size: 64 to 1,024 MByte [Plot: retrieval time (msec, 1 to 10,000) vs. memory size (64, 128, 256, 512, 1,024 MBytes) for range queries, approximate queries, and point queries]

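33 Scalability Retrieval time grows as a fractional power (about 0.5) of the database size, that is, roughly as n^0.5
Number of records (n): 1,000,000; time: 0.05 seconds
Number of records (n): 100,000,000; time: 0.50 seconds
Number of records (n): 10,000,000,000; time: 5.00 seconds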
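The scalability figures follow from the n^0.5 growth rate: a 100-fold increase in records gives a 10-fold increase in time. A small sketch of this extrapolation, taking the first data point (0.05 seconds at 1,000,000 records) as the reference; the function name and parameters are illustrative.

```python
def predicted_time(n: float, n_ref: float = 1_000_000,
                   t_ref: float = 0.05, exponent: float = 0.5) -> float:
    """Extrapolate retrieval time in seconds, assuming time grows as n ** exponent."""
    return t_ref * (n / n_ref) ** exponent


for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,} records: about {predicted_time(n):.2f} seconds")
```

The exponent 0.5 is the approximate value stated on the slide, so the extrapolated times are rough estimates rather than measurements.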

34 Distributed architecture Indexing trees on multiple computers When the system receives a query, it searches all trees in parallel When the system receives a new record, it adds the record to one of the trees
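The two operations of the distributed architecture can be sketched on a single machine. This is an illustration under stated assumptions: plain lists stand in for the indexing trees, threads stand in for separate computers, and round-robin placement is an assumed policy, since the slides only say that a record goes to one of the trees.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Callable, Dict, List


class DistributedIndex:
    """Toy model: several shards searched in parallel, each record stored in one."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[List[Dict]] = [[] for _ in range(num_shards)]
        self._next = cycle(range(num_shards))  # assumed round-robin placement

    def add(self, record: Dict) -> None:
        """Add a new record to exactly one shard."""
        self.shards[next(self._next)].append(record)

    def search(self, predicate: Callable[[Dict], bool]) -> List[Dict]:
        """Search all shards in parallel and merge the partial results."""
        def search_shard(shard: List[Dict]) -> List[Dict]:
            return [r for r in shard if predicate(r)]

        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            parts = list(pool.map(search_shard, self.shards))
        return [r for part in parts for r in part]
```

Since every record lives in exactly one shard, the merged result contains no duplicates and needs no deduplication step.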

35 Conclusions We have developed a set of tools for analysis of massive structured data Experiments have shown that these tools improve the productivity of intelligence analysts Future work includes development of more tools and application to other domains

