Beyond Streams and Graphs: Dynamic Tensor Analysis

Jimeng Sun, Dacheng Tao, Christos Faloutsos
Speaker: Jimeng Sun

Motivation
Goal: incremental pattern discovery on streaming applications.
Streams: E1: Environmental sensor networks; E2: Cluster/data center monitoring.
Graphs: E3: Social network analysis.
Tensors: E4: Network forensics; E5: Financial auditing; E6: fMRI brain image analysis.
Question: How to summarize streaming data efficiently and incrementally?
Statement: Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.

E3: Social network analysis
Traditionally, people focus on static networks and find community structures. We plan to monitor how the community structure changes over time and to identify abnormal individuals.

E4: Network forensics
Directional network flows.
A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]: 450 GB/hour with compression.
Task: identify abnormal traffic patterns and find out the cause.
[Figure: source-by-destination traffic plots for abnormal traffic and normal traffic.]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie.

Static Data model
For a timestamp, the stream measurements can be modeled using a tensor.
Dimension: a single stream, e.g., <Christos, "graph">.
Mode: a group of dimensions of the same kind, e.g., Source, Destination, Port.
[Figure: Source × Destination matrix at Time = 0.]

Static Data model (cont.)
Tensor: formally, $\mathcal{X} \in \mathbb{R}^{N_1 \times \cdots \times N_M}$. A generalization of matrices, represented as a multi-array or data cube.
1st order: vector.
2nd order: matrix.
3rd order: 3D array.

Dynamic Data model (our focus)
Streams come with structure: (time, source, destination, port), (time, author, keyword).
[Figure: Source × Destination matrices stacked along the time axis.]

Dynamic Data model (cont.)
Tensor streams: a sequence of Mth-order tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, where n is increasing over time.
1st order: multiple streams.
2nd order: time-evolving graphs.
3rd order: 3D arrays.
[Figure: examples with axes keyword, author, and time.]

Dynamic tensor analysis
[Figure: old tensors and a new tensor over the Source and Destination modes, with projection matrices U_Source and U_Destination and the old cores.]

Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion

Background – Singular value decomposition (SVD)
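SVD gives the best rank-k approximation in L2: $A \approx U \Sigma V^T$, with A of size m × n, U of size m × k, Σ of size k × k, and $V^T$ of size k × n.
PCA is an important application of SVD.
[Figure: SVD factorization of A, and the PCA projection Y obtained through $U^T$.]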
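A minimal NumPy sketch of the rank-k approximation mentioned on this slide. The function name `rank_k_approx` and the random test matrix are illustrative assumptions, not part of the talk.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the truncated SVD factors of A and the rank-k reconstruction."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = (U_k * s_k) @ Vt_k  # scale column j of U_k by s_k[j], then multiply by V^T
    return U_k, s_k, Vt_k, A_k

# Illustrative data (assumption): a random 6 x 4 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U_k, s_k, Vt_k, A_k = rank_k_approx(A, k=2)

# The spectral-norm error of the truncation equals the (k+1)-th singular value.
sigma = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, 2)
```

The projection of the rows of A onto the top-k right singular vectors, `A @ Vt_k.T`, is the PCA-style low-dimensional representation when A has been mean-centered.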

Latent semantic indexing (LSI)
Singular vectors are useful for clustering or correlation detection.
[Figure: a document-term matrix (document groups DB and DM; terms cluster, cache, frequent, pattern, query) factored into document-concept × concept-association × concept-term matrices.]

Tensor Operation: Matricize X(d)
Unfold a tensor into a matrix.
[Figure: unfolding example.]
Acknowledgment to Tammy Kolda for this slide.
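A minimal sketch of mode-d matricization in NumPy. The layout is an assumption: mode d indexes the columns, so that $X_{(d)}^T X_{(d)}$ is an $N_d \times N_d$ matrix as used on the complexity slide; many texts use the transposed layout. The function name `matricize` is hypothetical.

```python
import numpy as np

def matricize(X, d):
    """Unfold tensor X along mode d into a matrix of shape (prod of other modes, N_d)."""
    X = np.asarray(X)
    # Move mode d to the last axis, then flatten all remaining modes into the rows.
    return np.moveaxis(X, d, -1).reshape(-1, X.shape[d])

# Illustrative 2 x 3 x 4 tensor (assumption).
X = np.arange(24).reshape(2, 3, 4)
X1 = matricize(X, 1)   # 8 rows (2 * 4), 3 columns
```

Each row of `X1` is one mode-1 fiber of the tensor, that is, the entries obtained by fixing every index except the one along mode 1.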

Tensor Operation: Mode-product
Multiply a tensor with a matrix.
[Figure: a source × destination × port tensor multiplied along the source mode by a source × group matrix, giving a group × destination × port tensor.]
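A minimal sketch of the mode product in NumPy. It assumes the matrix U has shape $N_d \times R$ and that mode d of the tensor is contracted with the rows of U; other texts multiply by a matrix of shape $R \times N_d$ instead. The name `mode_product` and the sizes are illustrative.

```python
import numpy as np

def mode_product(X, U, d):
    """Contract mode d of tensor X (size N_d) with the rows of U (shape N_d x R)."""
    # tensordot drops the contracted mode and appends the new size-R mode last.
    out = np.tensordot(X, U, axes=([d], [0]))
    # Move the new mode back to position d so the mode order is preserved.
    return np.moveaxis(out, -1, d)

# Illustrative sizes (assumption): 4 sources, 5 destinations, 3 ports, 2 source groups.
X = np.random.default_rng(0).random((4, 5, 3))
U = np.random.default_rng(1).random((4, 2))
G = mode_product(X, U, d=0)   # shape (2, 5, 3): group x destination x port
```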

Related work
Low rank approximation: PCA, SVD (orthogonal based projection).
Multilinear analysis: tensor decompositions (Tucker, PARAFAC, HOSVD).
Stream mining: scan data once to identify patterns. Sampling: [Vitter85], [Gibbons98]. Sketches: [Indyk00], [Cormode03].
Graph mining: Explorative: [Faloutsos04][Kumar99][Leskovec05]… Algorithmic: [Yan05][Cormode05]…
[Figure: "Our Work" placed among these four areas.]

Tensor analysis
Given a sequence of tensors $\mathcal{X}_1, \ldots, \mathcal{X}_n$, find the projection matrices $U_1, \ldots, U_M$ such that the reconstruction error e is minimized:
$e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t \prod_{i=1}^{M} \times_i \left( U_i U_i^T \right) \right\|_F^2$
Note that this is a generalization of PCA when n is a constant.
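A minimal sketch of how the reconstruction error e could be evaluated. It assumes e is the sum over timestamps of the squared Frobenius norm of each tensor minus its projection, where the projection applies $U_i U_i^T$ along every mode i. The function names are hypothetical.

```python
import numpy as np

def project_and_reconstruct(X, Us):
    """Apply U_i U_i^T along every mode i of X."""
    Y = X
    for i, U in enumerate(Us):
        P = U @ U.T  # N_i x N_i projector onto the column space of U
        Y = np.moveaxis(np.tensordot(Y, P, axes=([i], [0])), -1, i)
    return Y

def reconstruction_error(tensors, Us):
    """Sum of squared Frobenius norms of the residuals over all tensors."""
    return sum(float(np.sum((X - project_and_reconstruct(X, Us)) ** 2)) for X in tensors)

# Illustrative data (assumption): three random 4 x 5 x 3 tensors.
rng = np.random.default_rng(0)
tensors = [rng.random((4, 5, 3)) for _ in range(3)]
# Random orthonormal projection matrices with ranks (2, 2, 1), built with QR.
Us = [np.linalg.qr(rng.standard_normal((n, r)))[0] for n, r in zip((4, 5, 3), (2, 2, 1))]
e = reconstruction_error(tensors, Us)
```

With full-rank orthonormal matrices (for example identity matrices) the projection is lossless and the error is zero; lowering the ranks trades accuracy for a smaller summary.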

Why do we care?
Anomaly detection: reconstruction error driven; multiple resolution.
Multiway latent semantic indexing (LSI).
[Figure: multiway LSI example with authors Philip Yu and Michael Stonebraker, keywords Pattern and Query, and a time axis.]

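1st order DTA - problem
Given $x_1, \ldots, x_n$ where each $x_i \in \mathbb{R}^{N}$, find $U \in \mathbb{R}^{N \times R}$ such that the error e is small:
$e = \sum_{i=1}^{n} \left\| x_i - x_i U U^T \right\|^2$
Note that Y = XU.
[Figure: data matrix X (time n × sensors N, rows $x_1, \ldots, x_n$) projected to Y with R columns; example sensors: indoor, outdoor.]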

1st order DTA
Input: new data vector $x \in \mathbb{R}^{N}$ (a row vector), old variance matrix $C \in \mathbb{R}^{N \times N}$.
Output: new projection matrix $U \in \mathbb{R}^{N \times R}$.
Algorithm:
1. Update the variance matrix: $C_{new} = x^T x + C$.
2. Diagonalize: $U \Lambda U^T = C_{new}$.
3. Determine the rank R and return U.
Diagonalization has to be done for every new x!
[Figure: old data X over time and the new vector x; $C_{new}$ is formed from C and $x^T x$ and diagonalized into U and $U^T$.]
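The three steps above can be sketched in NumPy as follows. The slide does not say how the rank R is determined; keeping enough eigenvectors to cover a fixed fraction of the total energy (the `energy` parameter) is an assumption made here for illustration, as is the function name.

```python
import numpy as np

def dta_first_order(x, C, energy=0.95):
    """One 1st-order DTA step: update the variance matrix and re-diagonalize it."""
    x = np.asarray(x, dtype=float).reshape(1, -1)   # row vector, 1 x N
    C_new = x.T @ x + C                             # step 1: update variance matrix
    eigvals, eigvecs = np.linalg.eigh(C_new)        # step 2: diagonalize (C_new is symmetric)
    order = np.argsort(eigvals)[::-1]               # sort eigenpairs by decreasing eigenvalue
    eigvals = np.clip(eigvals[order], 0.0, None)    # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]
    total = eigvals.sum()
    if total == 0.0:
        R = 1
    else:
        # Step 3 (assumed rule): smallest R whose eigenvalues cover the energy fraction.
        cumulative = np.cumsum(eigvals) / total
        R = min(int(np.searchsorted(cumulative, energy)) + 1, len(eigvals))
    return eigvecs[:, :R], C_new

# Illustrative stream (assumption): two sensors whose readings are perfectly correlated.
C = np.zeros((2, 2))
for x in ([3.0, 4.0], [6.0, 8.0]):
    U, C = dta_first_order(x, C)
```

Because the eigendecomposition is recomputed for every arriving vector, the per-update cost is dominated by diagonalizing an N × N matrix, which is the drawback the slide points out.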

1st order STA
Adjust U smoothly when new data arrive, without diagonalization [VLDB05].
For each new point x: project onto the current line, estimate the error, then rotate the line in the direction of the error and in proportion to its magnitude.
For each new point x and for i = 1, …, k:
$y_i := U_i^T x$ (projection onto $U_i$)
$d_i \leftarrow d_i + y_i^2$ (energy $\propto$ i-th eigenvalue)
$e_i := x - y_i U_i$ (error)
$U_i \leftarrow U_i + (1/d_i)\, y_i e_i$ (update estimate)
$x \leftarrow x - y_i U_i$ (repeat with remainder)
[Figure: scatter of Sensor 1 versus Sensor 2 with the line U and the error of a new point.]
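The update rules above can be sketched as a small NumPy function. The starting values (U as columns of the identity, a small positive energy d) and the function name are assumptions made for illustration; the slide does not specify them.

```python
import numpy as np

def sta_update(U, d, x):
    """One 1st-order STA step for a new point x; returns updated copies of U and d."""
    U = np.array(U, dtype=float)   # N x k, column i is the current estimate U_i
    d = np.array(d, dtype=float)   # length k, energy accumulated along each U_i
    x = np.array(x, dtype=float)   # new point, length N
    for i in range(U.shape[1]):
        y = U[:, i] @ x                  # projection onto U_i
        d[i] += y * y                    # update energy
        e = x - y * U[:, i]              # error
        U[:, i] += (y / d[i]) * e        # rotate U_i toward the error
        x = x - y * U[:, i]              # continue with the remainder
    return U, d

# Illustrative stream (assumption): two sensors that always move along direction (0.6, 0.8).
v = np.array([0.6, 0.8])
U = np.array([[1.0], [0.0]])   # assumed start: first coordinate axis
d = np.array([1e-3])           # assumed start: small positive energy
for t in range(50):
    U, d = sta_update(U, d, (1 + t % 3) * v)
cosine = abs(float(U[:, 0] @ v)) / np.linalg.norm(U[:, 0])
```

No eigendecomposition is performed: each new point costs one pass over the k tracked directions, and the estimate drifts toward the dominant direction of the data.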

Mth order DTA

Mth order DTA – complexity
Storage: $O(\prod_{i} N_i)$, i.e., the size of an input tensor at a single timestamp.
Computation: $\sum_{i} N_i^3$ (or $\sum_{i} N_i^2$) for the diagonalization of C, plus $\sum_{i} N_i \prod_{j} N_j$ for the matrix multiplication $X_{(d)}^T X_{(d)}$.
For low-order tensors (order < 3), diagonalization is the main cost.
For high-order tensors, matrix multiplication is the main cost.
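A minimal sketch of one Mth-order update consistent with the two cost terms above: for each mode d, the tensor is unfolded, the variance matrix is updated with $X_{(d)}^T X_{(d)}$, and the result is diagonalized. The unfolding layout (mode d as columns), the fixed per-mode ranks, and the function name are assumptions for illustration.

```python
import numpy as np

def dta_update(X, Cs, ranks):
    """One Mth-order DTA step: per-mode variance update followed by diagonalization."""
    Us, new_Cs = [], []
    for d, (C, R) in enumerate(zip(Cs, ranks)):
        # Unfold along mode d: rows are the other modes, columns are mode d.
        Xd = np.moveaxis(X, d, -1).reshape(-1, X.shape[d])
        C_new = C + Xd.T @ Xd                  # matrix multiplication term
        eigvals, eigvecs = np.linalg.eigh(C_new)   # diagonalization term
        top = np.argsort(eigvals)[::-1][:R]    # indices of the R largest eigenvalues
        Us.append(eigvecs[:, top])
        new_Cs.append(C_new)
    return Us, new_Cs

# Illustrative 4 x 5 x 3 tensor with assumed ranks (2, 2, 1).
rng = np.random.default_rng(0)
X = rng.random((4, 5, 3))
Cs = [np.zeros((n, n)) for n in X.shape]
Us, Cs = dta_update(X, Cs, ranks=(2, 2, 1))
```

Only the per-mode variance matrices need to be kept between timestamps, so past tensors do not have to be stored.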

Mth order STA
Run 1st order STA along each mode.
Complexity:
Storage: $O(\prod_{i} N_i)$.
Computation: $\sum_{i} R_i \prod_{j} N_j$, which is smaller than DTA.
[Figure: a point x projected onto $U_1$ as $y_1$ with error $e_1$, and the updated $U_1$.]
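A minimal sketch of running 1st-order STA along each mode: the tensor is unfolded along a mode and every resulting vector is fed to the tracking update. Processing all vectors (rather than a sample), the starting values, and the function names are assumptions made for illustration.

```python
import numpy as np

def sta_track(U, d, x):
    """1st-order STA step for one vector x; returns updated copies of U and d."""
    U, d, x = U.copy(), d.copy(), np.array(x, dtype=float)
    for i in range(U.shape[1]):
        y = U[:, i] @ x                # projection onto U_i
        d[i] += y * y                  # update energy
        e = x - y * U[:, i]            # error
        U[:, i] += (y / d[i]) * e      # adjust U_i
        x = x - y * U[:, i]            # remainder for the next direction
    return U, d

def sta_tensor_update(X, Us, ds):
    """Run the 1st-order tracking along every mode of tensor X."""
    new_Us, new_ds = [], []
    for mode, (U, d) in enumerate(zip(Us, ds)):
        # Each row is one vector of length N_mode taken along this mode.
        rows = np.moveaxis(X, mode, -1).reshape(-1, X.shape[mode])
        for x in rows:
            U, d = sta_track(U, d, x)
        new_Us.append(U)
        new_ds.append(d)
    return new_Us, new_ds

# Illustrative rank-1 tensor (assumption) built from three known vectors.
a, b, c = np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, 2.0, 1.0]), np.array([2.0, 1.0])
X = np.einsum('i,j,k->ijk', a, b, c)
Us = [np.eye(n)[:, :1] for n in X.shape]   # assumed start: first coordinate axis per mode
ds = [np.array([1e-3]) for _ in X.shape]   # assumed start: small positive energy
Us, ds = sta_tensor_update(X, Us, ds)
```

Each vector of length $N_i$ costs about $R_i N_i$ operations, and a mode has $\prod_{j \ne i} N_j$ such vectors, which matches the $R_i \prod_j N_j$ term per mode.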

Experiment
Objectives: computational efficiency; accurate approximation.
Real applications: anomaly detection; clustering.

Data set 1: Network data
TCP flows collected at the CMU backbone.
Raw data: 500 GB with compression.
Construct 3rd order tensors with hourly windows with <source, destination, port>.
Each tensor: 500 × 500 × 100.
1200 timestamps (hours).
Sparse data with a power-law distribution.
[Figure: distribution of values for the tensor from 10AM to 11AM on 01/06/2005.]

Data set 2: Bibliographic data (DBLP)
Papers from the VLDB and KDD conferences.
Construct 2nd order tensors with yearly windows with <author, keywords>.
Each tensor: 4584 × 3741.
11 timestamps (years).

Computational cost
[Figure panels: 3rd order network tensor; 2nd order DBLP tensor.]
OTA is the offline tensor analysis.
Performance metric: CPU time (sec).
Observations: DTA and STA are orders of magnitude faster than OTA. The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time).

Accuracy comparison
[Figure panels: 3rd order network tensor; 2nd order DBLP tensor.]
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%.
Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.

Network anomaly detection
[Figure: reconstruction error over time, with example plots of abnormal traffic and normal traffic.]
Reconstruction error gives an indication of anomalies.
The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).
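A minimal sketch of turning a series of reconstruction errors into anomaly flags. The thresholding rule (mean plus k standard deviations), the function name, and the example values are assumptions for illustration; the slide only states that the reconstruction error indicates anomalies.

```python
import numpy as np

def flag_anomalies(errors, k=3.0):
    """Return indices of timestamps whose error exceeds mean + k * std (assumed rule)."""
    errors = np.asarray(errors, dtype=float)
    threshold = errors.mean() + k * errors.std()
    return np.flatnonzero(errors > threshold)

# Illustrative series (assumption): 20 ordinary hours followed by one unusual hour.
errors = [1.0] * 20 + [10.0]
flagged = flag_anomalies(errors)
```

In practice the threshold would be tuned to the data, and flagged timestamps would then be inspected mode by mode to find the responsible sources, destinations, or ports.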

Multiway LSI
Authors: michael carey, michael stonebraker, h. jagadish, hector garcia-molina. Keywords: queri, parallel, optimization, concurr, objectorient. Year: 1995.
Authors: surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel. Keywords: distribut, systems, view, storage, servic, process, cache. Year: 2004.
Authors: jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal. Keywords: streams, pattern, support, cluster, index, gener, queri.
Group labels: DB, DM.
Two groups are correctly identified: Databases and Data mining.
People and concepts are drifting over time.

Conclusion
Tensor stream is a general data model.
DTA/STA incrementally decompose tensors into core tensors and projection matrices.
The results of DTA/STA can be used in other applications: anomaly detection; multiway LSI.

Final word: Think structurally!
The world is not flat, neither should data mining be. Contact: Jimeng Sun
