NIMD 1 Project Argus Massive Data NIMD PI Meeting December 2, 2004.


NIMD 2 Massive Structured Data
Static data
– Focus on 10^10 to 10^12 records
– Typical record size 100 to 1,000 bytes
– Typical collection size between terabyte and petabyte
– Smaller than large collections including unstructured data because field size is much smaller
Streaming data
– 1,000 to 5,000 records per second
– Approx 100M to 400M records per day
– Static data corresponds to a few years of stream
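The stream figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a worked calculation; the function names are hypothetical and not part of Project Argus.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def records_per_day(records_per_second: int) -> int:
    """Daily volume of a stream that arrives at a constant rate."""
    return records_per_second * SECONDS_PER_DAY


def days_to_accumulate(total_records: int, records_per_second: int) -> float:
    """Days of streaming needed to build a static collection of the given size."""
    return total_records / records_per_day(records_per_second)


low = records_per_day(1_000)    # 86,400,000, roughly 100M
high = records_per_day(5_000)   # 432,000,000, roughly 400M

# Assumed mid-range case: 10^11 records arriving at 1,000 records per second.
years = days_to_accumulate(10**11, 1_000) / 365
```

Under that assumed mid-range case the static collection corresponds to a little over three years of stream, in line with the slide's "a few years".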

NIMD 3 Approximate Structured Matching
[Figure: a range or point query; records are classed as exact match, near match, or no match according to their distance from the query]

NIMD 4 Data Matching and Retrieval
Matcher finds data that matches query exactly or is close to it
Different versions for different data volumes

Version      Volume (records)   Time complexity
In-memory    10^6 to 10^8       Logarithmic
Disk-based   10^7 to 10^10      Low power
Distributed  10^9 to 10^12      Same as underlying matcher
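As an illustration of exact, near, and range matching with logarithmic lookup, here is a minimal in-memory sketch over a single numeric field. It is not the Argus matcher: the class name, the single-field restriction, the absolute-difference distance, and the `tolerance` parameter are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


class InMemoryMatcher:
    """Toy matcher over one numeric field, kept in a sorted list."""

    def __init__(self, values):
        self._values = sorted(values)

    def range_query(self, low, high):
        """Return all values in [low, high]; two binary searches plus output size."""
        lo = bisect_left(self._values, low)
        hi = bisect_right(self._values, high)
        return self._values[lo:hi]

    def classify(self, point, tolerance):
        """Return ('exact' | 'near' | 'none', closest value or None)."""
        if not self._values:
            return ("none", None)
        i = bisect_left(self._values, point)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = self._values[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda v: abs(v - point))
        distance = abs(best - point)
        if distance == 0:
            return ("exact", best)
        if distance <= tolerance:
            return ("near", best)
        return ("none", None)
```

For example, with values 10, 20, 30, 40 and a tolerance of 2, a point query for 20 is an exact match, 21 is a near match to 20, and 25 has no match.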

NIMD 5 Disk-Matcher Experiments
[Figure: retrieval time (msec) versus number of records (10^2 to 10^6) for range queries, approximate queries, and exact queries, with the available-memory limit marked; curve labels: lg n, n^0.15, lg n, n^0.5]

NIMD 6 Monitoring Streaming Data
[Diagram; component labels: Analyst, Query, Query Table, Rete Network Generator, Rete Networks, Data Streams, Data Tables, Intermediate Tables, Do_queries, Scheduler, Stream Anomaly Monitoring, Identified Threats]

NIMD 7 Monitoring Streaming Data
Monitoring structured data streams for anomalies, hazards or alerts posted by analysts.
Alert profiles = continuous persistent queries (10^5)
Daily stream volumes target 10^8+ records.
System is optimized for very high selectivity queries
– “Needle in a field of haystacks” challenge
– Alert profiles can be anything (relational, aggregation, …)
Functions atop DBMS (now), or full DYNAMiX matcher (coming soon)
Based on modified Rete algorithm
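To make "continuous persistent queries" concrete, the sketch below registers alert profiles once and tests every arriving record against them. It is an assumption-laden toy, not the Argus design: the `AlertMonitor` class, the equality index on (field, value), and the sample profile are hypothetical. The index only serves to show why highly selective profiles are cheap: a record touches only the profiles keyed on one of its own field values.

```python
from collections import defaultdict


class AlertMonitor:
    """Toy monitor: persistent profiles, evaluated against each incoming record."""

    def __init__(self):
        # (field, value) -> list of (profile name, extra predicate)
        self._by_key = defaultdict(list)

    def register(self, name, field, value, predicate=lambda record: True):
        """Store a profile that fires when record[field] == value and predicate holds."""
        self._by_key[(field, value)].append((name, predicate))

    def process(self, record):
        """Return the names of all profiles triggered by this record."""
        fired = []
        for field, value in record.items():
            # Only profiles indexed under this exact field value are examined.
            for name, predicate in self._by_key.get((field, value), ()):
                if predicate(record):
                    fired.append(name)
        return fired


monitor = AlertMonitor()
# Hypothetical profile, for illustration only.
monitor.register("large-wire", "type", "wire", lambda r: r["amount"] > 10_000)
alerts = monitor.process({"type": "wire", "amount": 50_000})
```

Most records match no indexed key and are dismissed after a few dictionary lookups, which is the behaviour a needle-in-haystacks workload needs.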

NIMD 8 Adapted Rete Algorithm
(n+Δn)(m+Δm) = n·m + Δn·m + n·Δm + Δn·Δm
– Old results: n·m
– New incremental results: Δn·m + n·Δm + Δn·Δm
When Δn and Δm are very small compared to n and m, the Rete time complexity of an incremental join is worst case O(n+m), and using B-trees it goes to O(log n + log m + Δn + Δm)
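The decomposition above can be sketched as an equi-join that keeps its inputs indexed and, for each new batch, computes only the three incremental terms. This is an illustrative sketch, not the adapted Rete implementation: the `IncrementalJoin` class is hypothetical, and it uses hash indexes where the slide mentions B-trees.

```python
from collections import defaultdict


class IncrementalJoin:
    """Equi-join on one key that returns only the results added by each batch."""

    def __init__(self, key):
        self._key = key
        self._left = defaultdict(list)   # key value -> old left rows (the n side)
        self._right = defaultdict(list)  # key value -> old right rows (the m side)

    def add(self, new_left=(), new_right=()):
        key = self._key
        new_left, new_right = list(new_left), list(new_right)
        delta = []

        # Term Δn·m: new left rows against old right rows.
        for l in new_left:
            delta.extend((l, r) for r in self._right.get(l[key], ()))

        # Term n·Δm: old left rows against new right rows.
        for r in new_right:
            delta.extend((l, r) for l in self._left.get(r[key], ()))

        # Term Δn·Δm: new left rows against new right rows.
        new_right_index = defaultdict(list)
        for r in new_right:
            new_right_index[r[key]].append(r)
        for l in new_left:
            delta.extend((l, r) for r in new_right_index.get(l[key], ()))

        # Fold the batch into the stored state; the old n·m results are never recomputed.
        for l in new_left:
            self._left[l[key]].append(l)
        for r in new_right:
            self._right[r[key]].append(r)
        return delta
```

Each call does work proportional to the batch size plus the number of new result pairs, and the union of all returned deltas equals the full join of everything seen so far.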

NIMD 9 Finding Novel Patterns in Data
Primary topic of Hypothesis Generation and Tracking paper
Scales well for massive data because algorithms are near linear in number of records, rather than n^2

NIMD 10 Need for Suitable Data
Most suitable data is classified or proprietary
Fabricated data does not have “right” distribution
– Risk of tailoring solution to fabricated characteristics
Ideal is real data processed to be unclassified, but still retaining relevant characteristics of original
