
1 Incremental Pattern Discovery on Streams, Graphs and Tensors. Jimeng Sun. Ph.D. Thesis Proposal, May 15, 2006.

2 Thesis Committee: Christos Faloutsos (Chair), Tom Mitchell, Hui Zhang, David Steier (PricewaterhouseCoopers), Philip Yu (IBM Watson Research Center).

3 Thesis Proposal. Goal: incremental pattern discovery for streaming applications.
Streams: E1: environmental sensor networks; E2: cluster/data-center monitoring.
Graphs: E3: social network analysis.
Tensors: E4: network forensics; E5: financial auditing; E6: fMRI brain image analysis.
How to summarize streaming data efficiently and incrementally?

4 E1: Environmental Sensor Monitoring. A water distribution network under normal operation may have hundreds of measurements, and they are often related! [Figure: chlorine concentration streams over Phases 1-3, from sensors near and away from a leak. Data: CMU Civil Engineering, Prof. Jeanne M. VanBriesen.]

5 E1: Environmental Sensor Monitoring. The same water distribution network, now with a major leak. [Same figure as slide 4, with the major-leak period marked.]

6 E1: Environmental Sensor Monitoring. We would like to discover a few "hidden (latent) variables" that summarize the key trends. [Figure: SPIRIT compresses the n actual chlorine-concentration streams into k hidden variables, k = 1-2, across Phases 1-3.]

7 E3: Social network analysis. Traditionally, people focus on static networks and find community structures. We plan to monitor how the community structure changes over time and to identify abnormal individuals.

8 E4: Network forensics. Directional network flows: a large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004], generates 450 GB/hour even with compression. Task: identify abnormal traffic patterns and find the cause. [Figure: source-destination traffic matrices for normal vs. abnormal traffic.] Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie.

9 Commonality of all the examples. Data: continuously arriving, large volume, multi-dimensional, unlabeled. Task: incremental pattern discovery of main trends and anomalies.

10 Thesis statement. Incremental and efficient summarization of heterogeneous streaming data, through a general and concise representation, enables many real applications in different domains.

11 Outline: Motivating examples; Data model and mining framework; Related work; Current work; Proposed work; Conclusion.

12 Static data model: tensor. Formally, a generalization of matrices, represented as a multi-array (data cube). Correspondence by order: 1st order is a vector, 2nd order a matrix, 3rd order a 3-D array.
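A minimal illustration (not from the proposal; shapes and names are made up) of the order/representation correspondence above, using NumPy multi-arrays:

```python
# Tensors of different orders represented as NumPy multi-arrays.
import numpy as np

x = np.random.rand(5)          # 1st order: a vector (e.g., one snapshot of 5 streams)
A = np.random.rand(5, 4)       # 2nd order: a matrix (e.g., author x keyword counts)
T = np.random.rand(5, 4, 3)    # 3rd order: a 3-D array (e.g., source x destination x port)

print(x.ndim, A.ndim, T.ndim)  # orders: 1 2 3
```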

13 Dynamic data model (our focus): tensor streams. A sequence of Mth-order tensors, where n (the number of timestamps) increases over time. Correspondence by order: 1st order is multiple streams; 2nd order is a time-evolving graph (e.g., author x keyword); 3rd order is a sequence of 3-D arrays.

14 Our framework for incremental pattern discovery (mining flow): data streams → (preprocessing) → tensor streams → (tensor analysis) → core tensors and projections → application modules (anomaly detection, clustering, prediction).

15 Outline: Motivating examples; Data model and mining framework; Related work; Current work; Proposed work; Conclusion.

16 Related work.
Low-rank approximation: PCA, SVD (orthogonal projection); CUR [Drineas05] (example-based projection).
Multilinear analysis: tensors (matricizing, mode-product); tensor decompositions (Tucker, PARAFAC, HOSVD).
Stream mining (scan the data once to identify patterns): sampling [Vitter85], [Gibbons98]; sketches [Indyk00], [Cormode03].
Graph mining: explorative [Faloutsos04], [Kumar99], [Leskovec05], ...; algorithmic [Yan05], [Cormode05], ...
Our work builds on all four areas.

17 Background: singular value decomposition (SVD). SVD gives the best rank-k approximation in L2; PCA is an important application of SVD. Note that U and V are dense and may have negative entries. [Figure: A (m x n) factored as U times the diagonal singular-value matrix times V^T, truncated to rank k.]
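The following sketch (an illustrative NumPy implementation, not code from the thesis) computes the best rank-k approximation the slide refers to:

```python
# Truncated SVD: best rank-k approximation of A in the L2/Frobenius sense.
import numpy as np

def rank_k_approx(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(100, 50)
A_k = rank_k_approx(A, 10)
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)  # relative reconstruction error
```

Note that the factors U and Vt are dense even when A is sparse, which is exactly the drawback revisited on slide 41.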

18 Background: latent semantic indexing (LSI). Singular vectors are useful for clustering: the document-term matrix factors into document-concept, concept-association, and concept-term matrices, grouping terms such as "pattern, cluster, frequent" (DM) and "query, cache" (DB) into concepts.

19 Background: tensor operations. Matricizing: unfold a tensor into a matrix; e.g., a source x destination x port tensor unfolds into a source x (destination*port) matrix.
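A small sketch of matricizing (mode-d unfolding), assuming NumPy and C-order flattening; the example mirrors the source x destination x port tensor on the slide:

```python
# Mode-d matricizing: rows index the chosen mode, columns index the remaining modes combined.
import numpy as np

def unfold(T, mode):
    # Move the chosen mode to the front, then flatten all remaining modes together.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.random.rand(4, 5, 6)   # source x destination x port
X0 = unfold(T, 0)             # 4 x 30: source x (destination*port)
X1 = unfold(T, 1)             # 5 x 24: destination x (source*port)
```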

20 Background: tensor operations. Mode-product: multiply a tensor by a matrix along one mode; e.g., a "group" x source matrix maps a source x destination x port tensor to a "group" x destination x port tensor.
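A matching sketch of the mode-d product (again an illustrative NumPy implementation, not the thesis code), reusing the unfolding idea from the previous slide:

```python
# Mode-d product: multiply tensor T by matrix U along mode `mode`.
import numpy as np

def mode_product(T, U, mode):
    # Unfold along `mode`, multiply on the left, then fold back with the new mode size.
    Tm = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    rest = tuple(s for i, s in enumerate(T.shape) if i != mode)
    out = (U @ Tm).reshape((U.shape[0],) + rest)
    return np.moveaxis(out, 0, mode)

T = np.random.rand(4, 5, 6)   # source x destination x port
U = np.random.rand(2, 4)      # maps 4 sources into 2 "groups"
G = mode_product(T, U, 0)     # shape (2, 5, 6): "group" x destination x port
```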

21 Outline: Data model; Framework; Related work; Current work (dynamic and streaming tensor analysis, DTA/STA; compact matrix decomposition, CMD); Proposed work; Conclusion.

22 Methodology map (data order vs. static or dynamic analysis).
1st order, dynamic: 1st-order DTA; SPIRIT (1st-order STA).
2nd order, static: SVD, PCA, CMD; dynamic: DTA, STA.
3rd order and higher, static: PARAFAC, HOSVD, TensorPCA; dynamic: DTA, STA.

23 Tensor analysis. Given a sequence of tensors, find the projection matrices U_1, ..., U_M such that the reconstruction error e is minimized. Note that this is a generalization of PCA when n is a constant.
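One plausible written form of this objective (an assumption here, following standard Tucker-style formulations and the mode-product notation of slide 20; the slide itself shows the formula as a figure):

```latex
\min_{U_1,\dots,U_M}\; e \;=\; \sum_{t}
\Bigl\| \mathcal{X}_t \;-\;
\mathcal{X}_t \times_1 \bigl(U_1 U_1^{\mathsf T}\bigr)
\times_2 \cdots \times_M \bigl(U_M U_M^{\mathsf T}\bigr) \Bigr\|_F^2
```

where each U_d is the projection matrix for mode d and x_d denotes the mode-d product.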

24 Why do we care? Anomaly detection: reconstruction-error driven, at multiple resolutions. Multiway latent semantic indexing (LSI): e.g., grouping authors (Philip Yu, Michael Stonebreaker) and keywords (query, pattern) into concepts over time.

25 1st-order DTA: problem. Given x_1, ..., x_n, where each x_i ∈ R^N, find U ∈ R^{N×R} such that the error e is small. [Figure: the n x N data matrix X (sensors over time) is summarized by an N x R projection U, with hidden variables such as indoor vs. outdoor; note that Y = XU.]

26 1st-order DTA.
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^{N×N}. Output: new projection matrix U ∈ R^{N×R}.
Algorithm: 1. Update the variance matrix: C_new = x^T x + C. 2. Diagonalize: U Λ U^T = C_new. 3. Determine the rank R and return U.
Diagonalization has to be done for every new x!
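A minimal sketch of this 1st-order DTA update (an assumed NumPy implementation of the three steps above, not the thesis code):

```python
# 1st-order DTA: update the variance matrix with each new vector, then re-diagonalize.
import numpy as np

def dta_1st_order(C, x, R):
    """C: old N x N variance matrix, x: new length-N vector, R: target rank."""
    C_new = C + np.outer(x, x)                # step 1: C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)  # step 2: diagonalize C_new
    order = np.argsort(eigvals)[::-1]         # sort eigenpairs by decreasing eigenvalue
    U = eigvecs[:, order[:R]]                 # step 3: keep the top-R eigenvectors
    return C_new, U

N, R = 10, 2
C = np.zeros((N, N))
for _ in range(100):                          # stream of measurement vectors
    x = np.random.rand(N)
    C, U = dta_1st_order(C, x, R)             # note: a full diagonalization per new x
    y = U.T @ x                               # hidden variables at this timestamp
```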

27 1st-order STA: SPIRIT. Adjust U smoothly as new data arrive, without diagonalization. For each new point x: project onto the current line, estimate the error, and rotate the line in the direction of the error, in proportion to its magnitude.
For each new point x and for i = 1, ..., k:
y_i := U_i^T x (projection onto U_i)
d_i ← d_i + y_i^2 (energy ∝ i-th eigenvalue)
e_i := x − y_i U_i (error)
U_i ← U_i + (1/d_i) y_i e_i (update estimate)
x ← x − y_i U_i (repeat with the remainder)
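A sketch of the SPIRIT update loop transcribed from the pseudocode above (fixed k, no energy-based rank adaptation; variable names follow the slide, and the initialization is an assumption):

```python
# SPIRIT: incrementally adjust k projection directions without diagonalization.
import numpy as np

def spirit_update(U, d, x):
    """U: N x k directions, d: length-k energy estimates, x: new length-N vector."""
    x = x.copy()
    y = np.zeros(U.shape[1])
    for i in range(U.shape[1]):
        y[i] = U[:, i] @ x                    # y_i := U_i^T x  (projection)
        d[i] += y[i] ** 2                     # d_i += y_i^2    (energy ~ i-th eigenvalue)
        e = x - y[i] * U[:, i]                # e_i := x - y_i U_i  (error)
        U[:, i] += (1.0 / d[i]) * y[i] * e    # rotate U_i toward the error
        x -= y[i] * U[:, i]                   # repeat with the remainder
    return U, d, y

N, k = 10, 2
U = np.linalg.qr(np.random.rand(N, k))[0]     # initial orthonormal directions (assumed)
d = np.full(k, 1e-3)                          # small initial energies (assumed)
for _ in range(100):
    U, d, y = spirit_update(U, d, np.random.rand(N))
```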

28 Mth-order DTA.

29 Mth-order DTA: complexity. Storage: O(∏ N_i), i.e., the size of an input tensor at a single timestamp. Computation: Σ N_i^3 (or Σ N_i^2) for diagonalizing the variance matrices, plus Σ N_i ∏ N_i for the matrix multiplications X_(d)^T X_(d). For low-order tensors (order < 3), diagonalization is the main cost; for high-order tensors, matrix multiplication is the main cost.

30 Streaming tensor analysis (STA). Run SPIRIT along each mode. Complexity: storage O(∏ N_i); computation Σ R_i ∏ N_i, which is smaller than DTA. [Figure: a single SPIRIT update of U_1 from the projection y_1 and error e_1.]

31 Experiment goals: computational efficiency; accurate approximation; real applications (anomaly detection, clustering).

32 Data set 1: network data. TCP flows collected at the CMU backbone; raw data 500 GB with compression. We construct 2nd- or 3rd-order tensors with hourly windows (source x destination, or source x destination x port). Each tensor is 500 x 500 or 500 x 500 x 100, biased-sampled from over 22k hosts, over 1200 timestamps (hours). The data are sparse with a power-law distribution (e.g., 10 AM to 11 AM on 01/06/2005).

33 Data set 2: bibliographic data (DBLP). Papers from the VLDB and KDD conferences. We construct 2nd-order tensors (author x keyword) with yearly windows; each tensor is 4584 x 3741, over 11 timestamps (years).

34 Computational cost. Performance metric: CPU time (sec); OTA is the offline tensor analysis. Observations: DTA and STA are orders of magnitude faster than OTA. The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time). [Plots: 3rd-order network tensor and 2nd-order DBLP tensor.]

35 Accuracy comparison. Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%. Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes between timestamps. [Plots: 3rd-order network tensor and 2nd-order DBLP tensor.]

36 Network anomaly detection. Reconstruction error gives an indication of anomalies. The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin). [Plot: reconstruction error over time, normal vs. abnormal traffic.]

37 Multiway LSI. Two groups are correctly identified, databases (DB) and data mining (DM); people and concepts drift over time.
Authors | Keywords | Year
michael carey, michael stonebreaker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995 (DB)
surajit chaudhuri, mitch cherniack, michael stonebreaker, ugur etintemel | distribut, systems, view, storage, servic, process, cache | 2004 (DB)
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004 (DM)

38 Quick summary of DTA/STA. The tensor stream is a general data model. DTA/STA incrementally decompose tensors into core tensors and projection matrices. The results can be used in other applications: anomaly detection and multiway LSI. Incremental computation!

39 Outline: Data model; Framework; Related work; Current work (dynamic and streaming tensor analysis, DTA/STA; compact matrix decomposition, CMD); Proposed work; Conclusion.

40 Methodology map (repeated from slide 22); the next part focuses on compact matrix decomposition (CMD).

41 Disadvantage of orthogonal projection on sparse data. Real data are often (very) sparse; orthogonal projection does not preserve the sparsity in the data, leading to more space than the original data and large computational cost.
Data | Size | Nonzero percent
Network flow | 22k-by-22k | 0.0025%
DBLP (author, conference) | 428k-by-3.6k | 0.004%

42 Interpretability problem of orthogonal projection. Each column of the projection matrix U_i is a linear combination of all dimensions along a given mode, e.g., U_i(:,1) = [0.5; -0.5; 0.5; 0.5]. All the data are projected onto the span of U_i, so the projections are hard to interpret.

43 Compact matrix decomposition (CMD). Example-based projection: use actual rows and columns to specify the subspace. Given a matrix A ∈ R^{m×n}, find three matrices C ∈ R^{m×c}, U ∈ R^{c×r}, R ∈ R^{r×n} such that ||A − CUR|| is small; U is the pseudo-inverse of the intersection X of C and R. [Figure: orthogonal projection vs. example-based projection.]

44 CMD algorithm (high level). [Slide illustration: "CMU from 4K feet".]

45 CMD algorithm (high level). 1. Biased sampling with replacement of columns and rows from A. 2. Remove duplicates with proper scaling (turning the raw samples C_d and R_d into C and R). 3. Construct U from C and R (the pseudo-inverse of the intersection of C and R). [Figure: a small 0/1 example matrix A with its sampled, deduplicated, and rescaled columns and rows.]
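A hedged sketch of these three steps (NumPy; the biased-sampling probabilities follow the norm-based sampling common to CUR/CMD, while the duplicate rescaling of slide 47 and the exact U construction of [Sun06] are simplified to the basic pseudo-skeleton form A ≈ C·pinv(W)·R, with W the intersection of C and R):

```python
# CMD-style example-based decomposition: sample actual columns/rows, build U from their intersection.
import numpy as np

def cmd_sketch(A, c, r, rng=np.random.default_rng(0)):
    # 1. Biased sampling with replacement, proportional to squared L2 norms.
    col_p = (A ** 2).sum(axis=0) / (A ** 2).sum()
    row_p = (A ** 2).sum(axis=1) / (A ** 2).sum()
    cols = rng.choice(A.shape[1], size=c, replace=True, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=row_p)

    # 2. Remove duplicate picks (real CMD keeps one copy and rescales it, see slide 47).
    cols = np.unique(cols)
    rows = np.unique(rows)

    # 3. C and R are actual columns/rows of A (so they stay sparse if A is sparse);
    #    U is the pseudo-inverse of their intersection.
    C = A[:, cols]
    R = A[rows, :]
    U = np.linalg.pinv(A[np.ix_(rows, cols)])
    return C, U, R

A = np.random.rand(60, 50)
C, U, R = cmd_sketch(A, c=20, r=20)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```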

46 CMD algorithm (low level). [Slide illustration: "CMU from 3 feet".]

47 CMD algorithm (low level). Remove duplicates with proper scaling: C_i ← u_i^{1/2} C_i and R_i ← v_i R_i, where u_i and v_i are the numbers of occurrences of column C_i and row R_i. Theorem: the matrices C and C_d have the same singular values and left singular vectors (proof: see [Sun06]).

48 Experiment. Performance metrics: space ratio to the original data; CPU time (sec); accuracy = 1 − reconstruction error.
Data | Dimension | Nonzeros
Network flow (source, destination) | 22k-by-22k | 12K
DBLP (author, conference) | 428K-by-3.6K | 64K

49 Space efficiency. CMD uses much less space to achieve the same accuracy. CUR limitation: duplicate columns and rows. SVD limitation: orthogonal projection densifies the data. [Plots: Network and DBLP.]

50 Computational efficiency. CMD is the fastest of the three: CMD and CUR require an SVD only on the sampled columns, and CUR is much worse than CMD due to duplicate columns. SVD is slowest since it operates on the entire data. [Plots: Network and DBLP.]

51 Quick summary of CMD. CMD: A ≈ C U R, where C and R are sampled and scaled columns and rows (sparse) and U is a small dense matrix. Properties: interpretability (interpret the matrix through sampled rows and columns); efficiency (in both computation and space). Application: anomaly detection, with efficient computation and an intuitive model.

52 My related publications.
Sun, J., Tao, D., Faloutsos, C. Beyond Streams and Graphs: Dynamic Tensor Analysis. Submitted.
Sun, J., Xie, Y., Zhang, H., Faloutsos, C. Compact Matrix Decomposition for Large Graphs: Theory and Practice. Submitted.
Hoke, E., Sun, J., Faloutsos, C. InteMon: Intelligent Monitoring System for Large Clusters. Submitted.
Sun, J., Papadimitriou, S., Faloutsos, C. Distributed Pattern Discovery in Multiple Streams. PAKDD 2006.
Papadimitriou, S., Sun, J., Faloutsos, C. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005.
Sun, J., Papadimitriou, S., Faloutsos, C. Online Latent Variable Detection in Sensor Networks. ICDE 2005.

53 Outline: Motivating examples; Data model and mining framework; Background and related work; Current work (dynamic and streaming tensor analysis, DTA/STA; compact matrix decomposition, CMD); Proposed work; Conclusion.

54 Proposed work: methodology and evaluation. Goal: real data, real applications, real patterns. The methodology roadmap covers orthogonal projection (SPIRIT and distributed SPIRIT at 1st order; DTA/STA tensor analysis at Mth order), example-based projection (CMD [P1, P2] at 2nd order; example-based tensor decomposition [P3] at Mth order), and other divergences [P4].

55 P1: Effective example-based projection. Occasionally CMD does not give an accurate result, especially when the "large" columns and rows span nearly parallel subspaces; the current heuristic keeps sampling those "large" columns/rows. Recent work [Drineas06] provides a relative-error guarantee, ||A − CUR|| ≤ (1 + ε) ||A − A_k||, where A_k is the best rank-k approximation from SVD. Our idea: pick the column that disagrees the most with the already-selected columns. [Figure: columns chosen by CMD vs. the new scheme on a small 0/1 matrix.]

56 P2: Incremental CMD. Given time-evolving graphs (a 2nd-order tensor stream), we currently need to apply CMD at every timestamp on a new (slightly changed) graph. How can we compute CMD efficiently over time? Our idea: [illustrated on adjacency matrices at t = 1 and t = 2].

57 P3: Example-based tensor decomposition. CMD currently applies to matrices (2nd-order tensors) only; we want to generalize CMD to higher orders. Build infrastructure: a sparse tensor package [Kolda 06]; prototype sparse tensor access methods (how to store a sparse tensor? how to access a subset of a tensor?). Our goal: implement tensor CMD efficiently.

58 P4: Other divergences. Currently, the model implicitly assumes a Gaussian distribution and Euclidean distance, but many real data are not Gaussian. Our goal is to support other distributions and distance measures: Euclidean distance ↔ Gaussian distribution; KL divergence ↔ multinomial distribution; Bregman divergences ↔ the exponential family.
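For reference, the general Bregman divergence generated by a strictly convex function φ (a standard definition, not specific to the proposal):

```latex
D_{\varphi}(x, y) \;=\; \varphi(x) \;-\; \varphi(y) \;-\; \bigl\langle \nabla \varphi(y),\, x - y \bigr\rangle
```

Taking φ(x) = ||x||^2 recovers the squared Euclidean distance (the Gaussian case), and φ(x) = Σ_i x_i log x_i recovers the (generalized) KL divergence (the multinomial case).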

59 Evaluation plan: real data, real applications, real success or failure.
Data | Tasks | Order
Environmental monitoring | temperature and humidity in a large building; chlorine concentration in water distribution; real-time summarization and anomaly detection | 1st
Machine monitoring | monitor a number of system parameters; identify unusual patterns in real time | 1st
DBLP/IMDB | time-evolving graphs; find community structure | 2nd
Network flow | identify interesting patterns; identify attacks | 2nd or 3rd
Other data: fMRI | brain image data; classification | 3rd
Other data: financial | transaction data; identify anomalies that may indicate frauds or errors | >= 1

60 Timeline. P1 (effective example-based projection), P2 (incremental CMD), P3 (example-based tensor decomposition), and P4 (other divergences) occupy months 1-3, 4-6, 7-8, and 9-11; thesis writing runs over months 7-12, and the defense follows after 12 months.

