Published by Lexie Whiton. Modified over 2 years ago.

1
**Beyond Streams and Graphs: Dynamic Tensor Analysis**

Jimeng Sun, Dacheng Tao, Christos Faloutsos

Speaker: Jimeng Sun

2
**Motivation**

Goal: incremental pattern discovery on streaming applications.

- Streams
  - E1: Environmental sensor networks
  - E2: Cluster/data center monitoring
- Graphs
  - E3: Social network analysis
- Tensors
  - E4: Network forensics
  - E5: Financial auditing
  - E6: fMRI brain image analysis

How to summarize streaming data efficiently and incrementally?

Statement: Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.

3
**E3: Social network analysis**

Traditionally, people focus on static networks and find community structures. We plan to monitor the change of the community structure over time and identify abnormal individuals.

4
**E4: Network forensics**

- Directional network flows
- A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
  - 450 GB/hour with compression
- Task: identify abnormal traffic patterns and find out the cause

[Figure: source-by-destination traffic matrices for abnormal traffic and normal traffic]

Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie

5
**Static Data model**

For a timestamp, the stream measurements can be modeled using a tensor.

- Dimension: a single stream, e.g., <Christos, “graph”>
- Mode: a group of dimensions of the same kind, e.g., Source, Destination, Port

[Figure: source-by-destination matrix at Time = 0]

6
**Static Data model (cont.)**

Tensor

- Formally, $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$
- Generalization of matrices
- Represented as a multi-array (data cube)

| Order | Correspondence |
|---|---|
| 1st | Vector |
| 2nd | Matrix |
| 3rd | 3D array |

7
**Dynamic Data model (our focus)**

[Figure: source-by-destination matrices stacked along the time axis]

Streams come with structure:

- (time, source, destination, port)
- (time, author, keyword)
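As an illustration of how structured records could be turned into one tensor per time window, here is a minimal sketch. The record layout `(time, source, destination, port, value)`, the integer ids, and the helper name `build_tensors` are assumptions made for this example and do not come from the slides.

```python
import numpy as np

def build_tensors(records, shape, window):
    """Group (time, source, destination, port, value) records into one tensor per window."""
    tensors = {}
    for t, s, d, p, v in records:
        w = t // window                      # index of the time window this record falls in
        if w not in tensors:
            tensors[w] = np.zeros(shape)
        tensors[w][s, d, p] += v             # accumulate the measurement in its cell
    # Windows without any record are simply skipped in this sketch.
    return [tensors[w] for w in sorted(tensors)]

# Hypothetical records: times in seconds, hourly windows, 2 sources, 2 destinations, 3 ports.
records = [(10, 0, 1, 2, 1.0), (20, 0, 1, 2, 2.0), (3700, 1, 0, 0, 5.0)]
stream = build_tensors(records, shape=(2, 2, 3), window=3600)
```

Each element of `stream` is then a 3rd order tensor for one hour, which matches the (time, source, destination, port) structure listed above.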

8
**Dynamic Data model (cont.)**

Tensor streams: a sequence of Mth order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, where n is increasing over time.

| Order | Correspondence |
|---|---|
| 1st | Multiple streams |
| 2nd | Time-evolving graphs |
| 3rd | 3D arrays |

[Example figures: axes labeled time, author, keyword]

9
**Dynamic tensor analysis**

[Figure: old tensors, the new tensor (Source by Destination), projection matrices U_Source and U_Destination, and old cores]

10
**Roadmap**

- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion

11
**Background – Singular value decomposition (SVD)**

- Best rank-k approximation in L2
- PCA is an important application of SVD

[Figure: matrices A, U, V^T, U^T and Y with dimension labels m, n, k and R]
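The best rank-k property can be checked numerically. The sketch below uses NumPy's `np.linalg.svd` on a small random matrix; the helper name `best_rank_k` and the test matrix are assumptions for illustration only.

```python
import numpy as np

def best_rank_k(A, k):
    """Rank-k approximation of A built from its k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A_k = best_rank_k(A, 2)

# In the spectral (L2) norm, the approximation error equals the first
# discarded singular value (Eckart-Young theorem).
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, 2)
```

Here `err` and `s[2]` agree up to floating-point rounding, which is the sense in which the truncated SVD is the best rank-k approximation.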

12
**Latent semantic indexing (LSI)**

Singular vectors are useful for clustering or correlation detection.

[Figure: document-term matrix (terms: cluster, cache, frequent, pattern, query; document groups DM and DB) = document-concept × concept-association × concept-term]

13
**Tensor Operation: Matricize X(d)**

Unfold a tensor into a matrix.

[Figure: unfolding example]

Acknowledgment: Tammy Kolda for this slide.
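A minimal sketch of matricizing with NumPy follows. It assumes the convention in which the columns of X(d) index mode d, so that X(d)ᵀX(d) is an N_d × N_d matrix; other texts place mode d on the rows instead, and the ordering of the remaining modes is likewise a convention. The function name `unfold` is hypothetical.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that columns index `mode` and rows index all other modes."""
    return np.moveaxis(X, mode, -1).reshape(-1, X.shape[mode])

X = np.arange(24).reshape(2, 3, 4)   # a small 3rd order tensor
X0 = unfold(X, 0)                    # shape (12, 2)
X1 = unfold(X, 1)                    # shape (8, 3)
X2 = unfold(X, 2)                    # shape (6, 4)
```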

14
**Tensor Operation: Mode-product**

Multiply a tensor with a matrix.

[Figure: a tensor with axes source, destination and port, and a matrix that maps source to group]
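The mode-product can be sketched with `np.tensordot`. This example assumes the matrix has shape N_d × R and replaces mode d (size N_d) by a mode of size R, for instance mapping sources to groups; the function name and the sizes are illustrative assumptions.

```python
import numpy as np

def mode_product(X, U, mode):
    """Contract mode `mode` of X (size N_d) with the rows of U (shape N_d x R)."""
    Y = np.tensordot(X, U, axes=([mode], [0]))   # the new axis of size R ends up last
    return np.moveaxis(Y, -1, mode)              # move it back to position `mode`

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4, 3))   # source x destination x port
U = rng.standard_normal((5, 2))      # source x group
G = mode_product(X, U, 0)            # group x destination x port
```

The result agrees with the explicit contraction `np.einsum('sdp,sg->gdp', X, U)`.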

15
**Related work**

- Low-rank approximation
  - PCA, SVD: orthogonal-based projection
- Multilinear analysis
  - Tensor decompositions: Tucker, PARAFAC, HOSVD
- Stream mining
  - Scan data once to identify patterns
  - Sampling: [Vitter85], [Gibbons98]
  - Sketches: [Indyk00], [Cormode03]
- Graph mining
  - Explorative: [Faloutsos04] [Kumar99] [Leskovec05] …
  - Algorithmic: [Yan05] [Cormode05] …

[Figure label: Our Work]

16
**Roadmap**

- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion

17
**Tensor analysis**

Given a sequence of tensors, find the projection matrices such that the reconstruction error e is minimized.

Note that this is a generalization of PCA when n is a constant.
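The formula for e did not survive on this slide. One common way to write such a reconstruction error, given here as an assumption, uses tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$ of order M, one projection matrix $U_i$ per mode, and the mode-product $\times_i$:

```latex
e \;=\; \sum_{t=1}^{n} \left\| \mathcal{X}_t \;-\; \mathcal{X}_t \prod_{i=1}^{M} \times_i \left( U_i U_i^{T} \right) \right\|_F^2
```

Under this reading, for M = 1 with the data vectors stacked as rows of a matrix X, the error becomes $\| X - X U U^{T} \|_F^2$, which is the familiar PCA objective.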

18
**Why do we care?**

- Anomaly detection
  - Reconstruction error driven
  - Multiple resolution
- Multiway latent semantic indexing (LSI)

[Figure: authors (Philip Yu, Michael Stonebraker) and keywords (Pattern, Query) over time]

19
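**1st order DTA - problem**

Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^N$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small.

Note that $Y = XU$.

[Figure: data matrix with rows x1 … xn (time by sensors, indoor and outdoor sensors), projected through U to Y with R columns]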

20
**1st order DTA**

- Input: new data vector $x \in \mathbb{R}^N$, old variance matrix $C \in \mathbb{R}^{N \times N}$
- Output: new projection matrix $U \in \mathbb{R}^{N \times R}$

Algorithm:

1. Update the variance matrix: $C_{new} = x^T x + C$
2. Diagonalize: $U \Lambda U^T = C_{new}$
3. Determine the rank R and return U

[Figure: old X and new x over time, the matrices C and C_new, and the projection U]

Diagonalization has to be done for every new x!
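The three steps can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: x is treated as a row vector so that xᵀx is an N × N outer product, and the rank R is passed in as a fixed argument rather than determined by the algorithm.

```python
import numpy as np

def dta_first_order(x, C, R):
    """One 1st order DTA step: update C, diagonalize it, keep the top R eigenvectors."""
    C_new = C + np.outer(x, x)                  # step 1: C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)    # step 2: U Lambda U^T = C_new
    top = np.argsort(eigvals)[::-1][:R]         # step 3: R largest eigenvalues (R assumed given)
    return eigvecs[:, top], C_new

# Hypothetical stream: points scattered along one direction plus small noise.
rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, 2.0]) / 3.0
C = np.zeros((3, 3))
for _ in range(200):
    x = rng.standard_normal() * direction + 0.01 * rng.standard_normal(3)
    U, C = dta_first_order(x, C, R=1)
alignment = abs(U[:, 0] @ direction)            # close to 1 when U recovers the direction
```

Note that `np.linalg.eigh` runs on every new vector, which is exactly the per-update cost the slide points out.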

21
**1st order STA**

Adjust U smoothly when new data arrive, without diagonalization [VLDB05].

For each new point x:

- Project onto the current line
- Estimate the error
- Rotate the line in the direction of the error and in proportion to its magnitude

For each new point x and for i = 1, …, k:

- $y_i := U_i^T x$ (projection onto $U_i$)
- $d_i \leftarrow d_i + y_i^2$ (energy, i-th eigenvalue)
- $e_i := x - y_i U_i$ (error)
- $U_i \leftarrow U_i + (1/d_i)\, y_i e_i$ (update estimate)
- $x \leftarrow x - y_i U_i$ (repeat with remainder)

[Figure: points in the Sensor 1 / Sensor 2 plane with the direction U and the error]
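The update rules above translate almost line by line into NumPy. In this sketch the starting guess for U, the small positive initial energy, and the synthetic stream are assumptions chosen for the example; the slide does not specify them.

```python
import numpy as np

def sta_first_order(x, U, d):
    """One 1st order STA step. U is N x k (columns U_i), d holds the k energy estimates."""
    x = np.array(x, dtype=float)          # working copy; becomes the remainder
    for i in range(U.shape[1]):
        y = U[:, i] @ x                   # y_i := U_i^T x
        d[i] += y * y                     # d_i <- d_i + y_i^2
        e = x - y * U[:, i]               # e_i := x - y_i U_i
        U[:, i] += (y / d[i]) * e         # U_i <- U_i + (1/d_i) y_i e_i
        x -= y * U[:, i]                  # x <- x - y_i U_i
    return U, d

rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, 2.0]) / 3.0
U = np.array([[1.0], [0.0], [0.0]])       # assumed starting guess
d = np.array([1e-6])                      # assumed small positive initial energy
for t in range(300):
    amplitude = 1.0 + 0.5 * np.sin(t)
    x = amplitude * direction + 0.005 * rng.standard_normal(3)
    U, d = sta_first_order(x, U, d)
alignment = abs(U[:, 0] @ direction) / np.linalg.norm(U[:, 0])
```

No eigendecomposition is performed; each step costs only a few vector operations per tracked direction.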

22
Mth order DTA

23
**Mth order DTA – complexity**

- Storage: $O(\prod_i N_i)$, i.e., the size of an input tensor at a single timestamp
- Computation: $\sum_i N_i^3$ (or $\sum_i N_i^2$) for the diagonalization of C, plus $\sum_i N_i \prod_j N_j$ for the matrix multiplication $X_{(d)}^T X_{(d)}$
- For low-order tensors (order < 3), diagonalization is the main cost
- For high-order tensors, matrix multiplication is the main cost
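To show where the two cost terms come from, here is a hedged sketch of a per-mode update in the spirit of Mth order DTA. It assumes that one variance matrix is kept per mode and that the ranks are fixed in advance; the function names `unfold`, `dta_step` and `core` are hypothetical, and details of the original algorithm may differ.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that columns index `mode` (shape: product of other sizes x N_mode)."""
    return np.moveaxis(X, mode, -1).reshape(-1, X.shape[mode])

def dta_step(X, Cs, ranks):
    """Fold tensor X into the per-mode variance matrices Cs and re-diagonalize each one."""
    Us = []
    for d, R in enumerate(ranks):
        Xd = unfold(X, d)
        Cs[d] = Cs[d] + Xd.T @ Xd                  # matrix multiplication: about N_d * prod_j N_j
        eigvals, eigvecs = np.linalg.eigh(Cs[d])   # diagonalization: about N_d^3
        top = np.argsort(eigvals)[::-1][:R]
        Us.append(eigvecs[:, top])
    return Us

def core(X, Us):
    """Project X onto the current subspaces, one mode at a time."""
    Y = X
    for d, U in enumerate(Us):
        Y = np.moveaxis(np.tensordot(Y, U, axes=([d], [0])), -1, d)
    return Y

rng = np.random.default_rng(0)
shape, ranks = (6, 5, 4), (3, 2, 2)
Cs = [np.zeros((n, n)) for n in shape]
for _ in range(10):
    X = rng.standard_normal(shape)
    Us = dta_step(X, Cs, ranks)
Y = core(X, Us)                                    # core tensor of shape `ranks`
```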

24
**Mth order STA**

Run 1st order STA along each mode.

Complexity:

- Storage: $O(\prod_i N_i)$
- Computation: $\sum_i R_i \prod_j N_j$, which is smaller than DTA

[Figure: x projected onto U1 giving y1, the error e1, and the updated U1]

25
**Roadmap**

- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion

26
**Experiment**

Objectives:

- Computational efficiency
- Accurate approximation
- Real applications
  - Anomaly detection
  - Clustering

27
**Data set 1: Network data**

- TCP flows collected at CMU backbone
- Raw data: 500 GB with compression
- Construct 3rd order tensors with hourly windows with <source, destination, port>
- Each tensor: 500 × 500 × 100
- 1200 timestamps (hours)

[Figure: tensor values for 10AM to 11AM on 01/06/2005; sparse data, power-law distribution]

28
**Data set 2: Bibliographic data (DBLP)**

- Papers from VLDB and KDD conferences
- Construct 2nd order tensors with yearly windows with <author, keywords>
- Each tensor: 4584 × 3741
- 11 timestamps (years)

29
**Computational cost**

[Figures: 3rd order network tensor; 2nd order DBLP tensor]

- OTA is the offline tensor analysis
- Performance metric: CPU time (sec)
- Observations:
  - DTA and STA are orders of magnitude faster than OTA
  - The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)

30
**Accuracy comparison**

[Figures: 3rd order network tensor; 2nd order DBLP tensor]

- Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%
- Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes

31
**Network anomaly detection**

[Figures: reconstruction error over time; abnormal traffic; normal traffic]

- Reconstruction error gives an indication of anomalies.
- The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).
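A small sketch of reconstruction-error scoring is given below. The relative Frobenius-norm error, the rank-1 "normal" pattern, and the injected spike are all assumptions chosen to keep the example tiny; the slides do not state the exact error measure or data used.

```python
import numpy as np

def reconstruct(X, Us):
    """Project X onto the subspace spanned by each U_d (columns orthonormal) and back."""
    Y = X
    for d, U in enumerate(Us):
        P = U @ U.T                                            # projector for mode d
        Y = np.moveaxis(np.tensordot(Y, P, axes=([d], [0])), -1, d)
    return Y

def relative_error(X, Us):
    """Relative reconstruction error; large values suggest an anomaly."""
    return np.linalg.norm(X - reconstruct(X, Us)) / np.linalg.norm(X)

# Hypothetical "normal" pattern: a rank-1 tensor, with projections matching it exactly.
a = np.ones(4) / 2.0
b = np.ones(3) / np.sqrt(3.0)
c = np.ones(2) / np.sqrt(2.0)
normal = np.einsum('i,j,k->ijk', a, b, c)
Us = [a[:, None], b[:, None], c[:, None]]

abnormal = normal.copy()
abnormal[0, 0, 0] += 5.0                                       # injected spike

err_normal = relative_error(normal, Us)
err_abnormal = relative_error(abnormal, Us)
```

The spike lies mostly outside the learned subspaces, so `err_abnormal` is far larger than `err_normal`, which is the effect the plot on this slide relies on.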

32
**Multiway LSI**

| Authors | Keywords | Year |
|---|---|---|
| michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995 |
| surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004 |
| jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | |

Group labels in the figure: DB, DM.

- Two groups are correctly identified: Databases and Data mining
- People and concepts are drifting over time

33
**Conclusion**

- Tensor stream is a general data model
- DTA/STA incrementally decompose tensors into core tensors and projection matrices
- The result of DTA/STA can be used in other applications
  - Anomaly detection
  - Multiway LSI

34
**Final word: Think structurally!**

The world is not flat, neither should data mining be.

Contact: Jimeng Sun
