
1
Beyond Streams and Graphs: Dynamic Tensor Analysis. Jimeng Sun, Christos Faloutsos, Dacheng Tao. Speaker: Jimeng Sun

2
Motivation. Goal: incremental pattern discovery on streaming applications. Streams: E1, environmental sensor networks; E2, cluster/data-center monitoring. Graphs: E3, social network analysis. Tensors: E4, network forensics; E5, financial auditing; E6, fMRI brain-image analysis. How to summarize streaming data effectively and incrementally?

3
E3: Social network analysis. Traditionally, people focus on static networks and find community structures. We plan to monitor the change of the community structure over time and identify abnormal individuals.

4
E4: Network forensics. Directional network flows: a large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]; 450 GB/hour with compression. Task: identify abnormal traffic patterns and find the cause. (Figure: source-destination matrices for normal vs. abnormal traffic.) Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie.

5
Static data model. For a given timestamp, the stream measurements can be modeled as a tensor. Dimension: a single stream. Mode: a group of dimensions of the same kind, e.g., source, destination, port. (Figure: a source × destination matrix at time = 0.)

6
Static data model (cont.). Tensor: formally, a generalization of matrices, represented as a multi-array (data cube).

Order:          1st     2nd     3rd
Correspondence: vector  matrix  3D array
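The static model can be sketched in a few lines of NumPy; the flow records and mode sizes below are illustrative, not from the paper's data:

```python
import numpy as np

# Hypothetical flow records (source, destination, port, byte count);
# all names and values here are made up for illustration.
records = [
    (0, 1, 2, 5.0),
    (0, 1, 2, 3.0),   # repeated (src, dst, port) cells accumulate
    (2, 0, 1, 7.0),
]

n_src, n_dst, n_port = 3, 3, 3
X = np.zeros((n_src, n_dst, n_port))   # one 3rd-order tensor per timestamp
for src, dst, port, val in records:
    X[src, dst, port] += val
```

Each timestamp thus yields one such tensor, with the three modes playing the roles of source, destination, and port.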

7
Dynamic data model (our focus). Streams come with structure, e.g., (time, source, destination, port) or (time, author, keyword). (Figure: a source × destination tensor growing along the time mode.)

8
Dynamic data model (cont.). Tensor streams: a sequence of Mth-order tensors, where n is increasing over time.

Order:          1st               2nd                   3rd
Correspondence: multiple streams  time-evolving graphs  3D arrays

9
Dynamic tensor analysis. (Figure: old tensors are summarized by old core tensors and the projection matrices U_Source and U_Destination; a new tensor arrives and updates them.)

10
Roadmap Motivation and main ideas Background and related work Dynamic and streaming tensor analysis Experiments Conclusion

11
Background – singular value decomposition (SVD). A = U Σ V^T; truncating to the top k singular values gives the best rank-k approximation in L2: A ≈ U_k Σ_k V_k^T, where A is m × n, U_k is m × k, Σ_k is k × k, and V_k^T is k × n. PCA is an important application of SVD.
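A small NumPy sketch of the best rank-k approximation (random data, sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation in L2
err = np.linalg.norm(A - A_k)                  # Frobenius error
```

By the Eckart–Young theorem, `err` equals the root-sum-square of the discarded singular values, which is why no rank-k matrix can do better.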

12
Latent semantic indexing (LSI). Singular vectors are useful for clustering and correlation detection. (Figure: a document-term matrix factors into document-concept, concept-association, and concept-term matrices; DB terms such as "query" and "cache" and DM terms such as "pattern," "cluster," and "frequent" load on separate concepts.)

13
Tensor operation: matricize, X_(d). Unfold a tensor into a matrix along mode d. (Slide acknowledgment: Tammy Kolda.)
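A minimal matricize sketch in NumPy; conventions for ordering the remaining modes vary, and this one simply keeps NumPy's native axis order:

```python
import numpy as np

def matricize(X, d):
    """Mode-d unfolding: rows index mode d, columns index
    all remaining modes (one common convention)."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

X = np.arange(24).reshape(2, 3, 4)
X0 = matricize(X, 0)   # shape (2, 12)
X1 = matricize(X, 1)   # shape (3, 8)
```

Whichever column convention is used, what matters downstream is that mode d becomes the row index, so X_(d) X_(d)^T has size N_d × N_d.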

14
Tensor operation: mode-product. Multiply a tensor with a matrix along one mode, e.g., projecting the source mode of a (source, destination, port) tensor into source groups.
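The mode-product can be sketched with np.tensordot; the shapes below are illustrative:

```python
import numpy as np

def mode_product(X, U, d):
    """Multiply tensor X by matrix U along mode d:
    (X ×_d U) replaces size N_d of mode d with U.shape[0]."""
    Y = np.tensordot(U, X, axes=([1], [d]))    # new mode lands in axis 0
    return np.moveaxis(Y, 0, d)                # move it back to position d

X = np.ones((4, 5, 6))
U = np.ones((3, 4))        # maps mode 0 from size 4 down to size 3
Y = mode_product(X, U, 0)  # shape (3, 5, 6)
```

This is the operation that projects each mode of a tensor onto its projection matrix when forming the core tensor.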

15
Related work. Low-rank approximation — PCA, SVD: orthogonal projections. Multilinear analysis — tensor decompositions: Tucker, PARAFAC, HOSVD. Stream mining — scan data once to identify patterns; sampling: [Vitter85], [Gibbons98]; sketches: [Indyk00], [Cormode03]. Graph mining — explorative: [Faloutsos04], [Kumar99], [Leskovec05]; algorithmic: [Yan05], [Cormode05]. Our work sits at the intersection of these areas.

16
Roadmap Motivation and main ideas Background and related work Dynamic and streaming tensor analysis Experiments Conclusion

17
Tensor analysis. Given a sequence of tensors, find the projection matrices such that the reconstruction error e is minimized. Note that this is a generalization of PCA when n is a constant.

18
Why do we care? Anomaly detection: reconstruction-error driven, at multiple resolutions. Multiway latent semantic indexing (LSI): e.g., tracking authors such as Philip Yu and Michael Stonebraker against query patterns over time.

19
1st-order DTA – problem. Given x_1, …, x_n, where each x_i ∈ R^N, find U ∈ R^(N×R) such that the error e is small. (Figure: N sensor streams X, e.g., indoor vs. outdoor sensors, projected by U^T onto R hidden variables Y.) Note that Y = XU.

20
1st-order DTA. Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N×N). Output: new projection matrix U ∈ R^(N×R). Algorithm: 1. update the variance matrix, C_new = x^T x + C; 2. diagonalize, U Λ U^T = C_new; 3. determine the rank R and return U. Diagonalization has to be done for every new x!
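The three steps can be sketched as follows (a minimal version: the rank R is fixed here rather than chosen from the retained energy):

```python
import numpy as np

def dta_update(C, x, R):
    """One 1st-order DTA step (sketch): fold the new vector x into
    the variance matrix, rediagonalize, keep the top-R eigenvectors."""
    C_new = C + np.outer(x, x)            # step 1: C_new = x^T x + C
    evals, evecs = np.linalg.eigh(C_new)  # step 2: U Lambda U^T = C_new
    order = np.argsort(evals)[::-1]       # eigh returns ascending order
    U = evecs[:, order[:R]]               # step 3: top-R projection matrix
    return C_new, U

N, R = 5, 2
C = np.zeros((N, N))
rng = np.random.default_rng(1)
for _ in range(10):                       # a stream of data vectors
    x = rng.standard_normal(N)
    C, U = dta_update(C, x, R)
```

The eigendecomposition on every arrival is exactly the cost that STA (next slide) avoids.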

21
1st-order STA. Adjust U smoothly when new data arrive, without diagonalization [VLDB05]. Intuition: project each new point onto the current line, estimate the error, and rotate the line in the direction of the error, in proportion to its magnitude. For each new point x and for i = 1, …, k: y_i := U_i^T x (project onto U_i); d_i ← d_i + y_i^2 (energy ≈ i-th eigenvalue); e_i := x − y_i U_i (error); U_i ← U_i + (1/d_i) y_i e_i (update estimate); x ← x − y_i U_i (repeat with the remainder).
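The per-point loop can be sketched as below (SPIRIT-style; the forgetting factor and the initialization are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def sta_update(U, d, x, forget=1.0):
    """One 1st-order STA step (sketch): rotate each basis vector
    toward its residual error; no diagonalization needed."""
    x = x.astype(float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                   # project onto current direction
        d[i] = forget * d[i] + y * y      # energy ~ i-th eigenvalue
        e = x - y * U[:, i]               # residual error
        U[:, i] += (y / d[i]) * e         # rotate in proportion to error
        x -= y * U[:, i]                  # deflate, repeat with remainder
    return U, d

N, k = 4, 2
rng = np.random.default_rng(2)
U = np.eye(N)[:, :k]                      # initial orthonormal guess
d = np.full(k, 1e-3)                      # small initial energies
for _ in range(50):
    U, d = sta_update(U, d, rng.standard_normal(N))
```

Each update costs O(Nk), versus the per-arrival eigendecomposition of DTA.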

22
M th order DTA

23
Mth-order DTA – complexity. Storage: O(∏ N_i), i.e., the size of one input tensor at a single timestamp. Computation: Σ N_i^3 (or N_i^2) for diagonalizing the C matrices, plus the matrix multiplications X_(d)^T X_(d). For low-order tensors (order < 3), diagonalization is the main cost; for high-order tensors, matrix multiplication is the main cost.
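An Mth-order DTA step can be sketched by running the 1st-order update per mode on the unfolded tensor (mode sizes and ranks below are illustrative):

```python
import numpy as np

def matricize(X, d):
    """Mode-d unfolding: rows index mode d."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def dta_step(Cs, X, ranks):
    """One Mth-order DTA step (sketch): per mode, fold the unfolded
    tensor into that mode's variance matrix and rediagonalize."""
    Us = []
    for d in range(X.ndim):
        Xd = matricize(X, d)                       # N_d x prod(other N_i)
        Cs[d] = Cs[d] + Xd @ Xd.T                  # variance-matrix update
        evals, evecs = np.linalg.eigh(Cs[d])
        order = np.argsort(evals)[::-1]
        Us.append(evecs[:, order[:ranks[d]]])      # top-R_d eigenvectors
    return Cs, Us

shape, ranks = (4, 5, 6), (2, 2, 3)
Cs = [np.zeros((n, n)) for n in shape]
rng = np.random.default_rng(3)
for _ in range(5):                                  # tensor stream
    Cs, Us = dta_step(Cs, rng.standard_normal(shape), ranks)
```

The per-mode costs visible here are exactly the two terms in the complexity: one eigendecomposition of an N_d × N_d matrix plus one N_d × ∏N_i multiplication per mode.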

24
Mth-order STA. Run 1st-order STA along each mode. Complexity — storage: O(∏ N_i); computation: Σ R_i ∏ N_i, which is smaller than DTA. (Figure: a new point x projected onto U_1 as y_1, with error e_1 driving the update of U_1.)

25
Roadmap Motivation and main ideas Background and related work Dynamic and streaming tensor analysis Experiments Conclusion

26
Experiment objectives: computational efficiency; accurate approximation; real applications — anomaly detection and clustering.

27
Data set 1: network data. TCP flows collected at the CMU backbone; raw data 500 GB with compression. Construct 3rd-order tensors with hourly windows; each tensor: 500 × 500; timestamps are hours. The data are sparse, and entry values follow a power-law distribution (figure: one hour, 10AM–11AM on 01/06/2005).

28
Data set 2: bibliographic data (DBLP). Papers from the VLDB and KDD conferences. Construct 2nd-order tensors with yearly windows; each tensor: 4584; timestamps are years.

29
Computational cost. Performance metric: CPU time (sec), on the 3rd-order network tensor and the 2nd-order DBLP tensor; OTA is the offline tensor analysis. Observations: DTA and STA are orders of magnitude faster than OTA. The slight upward trend in DBLP is due to the increasing number of papers each year (the data become denser over time).

30
Accuracy comparison. Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%. Observation: on both the 3rd-order network tensor and the 2nd-order DBLP tensor, DTA performs very close to OTA; STA performs worse on DBLP due to the bigger changes.

31
Network anomaly detection. Reconstruction error over time gives an indication of anomalies. The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).
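Error-driven detection can be sketched as below; the error stream and the 3-sigma threshold are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative only: a stream of per-timestamp reconstruction errors
errors = list(rng.normal(1.0, 0.1, size=100))
errors[60] = 5.0                            # an injected scan-like spike

mu, sigma = np.mean(errors), np.std(errors)
threshold = mu + 3 * sigma                  # flag errors far above typical
anomalies = [t for t, e in enumerate(errors) if e > threshold]
```

Any timestamp whose tensor reconstructs poorly under the current projection matrices is flagged for inspection.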

32
Multiway LSI. Two groups are correctly identified — databases (DB) and data mining (DM) — and people and concepts drift over time.

Authors: michael carey, michael stonebraker, h. jagadish, hector garcia-molina | Keywords: queri, parallel, optimization, concurr, objectorient | Year: 1995
Authors: surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | Keywords: distribut, systems, view, storage, servic, process, cache | Year: 2004
Authors: jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | Keywords: streams, pattern, support, cluster, index, gener, queri | Year: 2004

33
Conclusion. The tensor stream is a general data model. DTA/STA incrementally decompose tensors into core tensors and projection matrices. The results of DTA/STA can be used in other applications: anomaly detection and multiway LSI.

34
The world is not flat; neither should data mining be. Final word: think structurally! Contact: Jimeng Sun
