School of Computer Science Carnegie Mellon Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos.

1 School of Computer Science Carnegie Mellon Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos

2 School of Computer Science Carnegie Mellon USC 04 C. Faloutsos Joint work with: Anthony Brockwell (CMU/Stat), Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Chenxi Wang (CMU), Yang Wang (CMU)

3 Outline: Introduction - motivation; Problem #1: Stream Mining (motivation, main idea, experimental results); Problem #2: Graphs & Virus propagation; Conclusions

4 Introduction. Sensor devices: temperature and weather measurements; road traffic data; geological observations; patient physiological data. Embedded devices: network routers; intelligent (active) disks.

5 Introduction. Limited resources: memory, bandwidth, power, CPU. Remote environments: no human intervention.

6 Introduction: problem definition. Given a semi-infinite stream of values (time series) x_1, x_2, …, x_t, …, find patterns, forecasts, outliers…

7 Introduction. E.g., [Figure: example time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise??”]

8 Introduction. [Figure: example time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise??”] Can we capture these patterns automatically, and with limited resources?

9 Related work. Statistics: time series forecasting (AR(I)MA and seasonal variants; ARFIMA, GARCH, …). Main problem: “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]. Typically: resource intensive; cannot update online.

10 Related work. Databases: continuous queries. Typically a different focus (“compression”, not generative models); a largely orthogonal problem. References: Gilbert, Guha, Indyk et al. (STOC 2002); Garofalakis, Gibbons (SIGMOD 2002); Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003); Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002); Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002); Madden+ [SIGMOD02], [SIGMOD03].

11 Goals. Adapt to and handle arbitrary periodic components; no human intervention/tuning. Also: single pass over the data; limited memory (logarithmic); constant-time update.

12 Outline: Introduction - motivation; Problem #1: Stream Mining (motivation, main idea, experimental results); Problem #2: Graphs & Virus propagation; Conclusions

13 Wavelets: “straight” signal. [Figure: signal x_t over time t, shown with components I_1, I_2, …, I_8, each plotted against t]

14 Wavelets: introduction (Haar). [Figure: signal x_t over time t and its Haar coefficients on a time/frequency grid: W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; V_{4,1}]
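
As a rough illustration of the Haar decomposition pictured on this slide, the Python sketch below computes the detail coefficients level by level from pairwise differences and sums. The function name `haar_transform` and the orthonormal 1/√2 scaling are assumptions made for illustration; the slides do not state a normalization or show an implementation.

```python
import math


def haar_transform(x):
    """Haar decomposition of a signal whose length is a power of two.

    Returns (details, smooth): details[l - 1] is the list of level-l detail
    coefficients (finest level first), and smooth is the single coarsest
    scaling coefficient that remains at the end.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    current = [float(v) for v in x]
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        # Detail = scaled difference of each pair; smooth = scaled sum.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        current = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, current[0]


# Eight points give 4 + 2 + 1 detail coefficients plus one smooth coefficient.
details, smooth = haar_transform([3, 1, 0, 4, 8, 6, 9, 9])
```

With this scaling the transform is orthonormal, so the sum of squared coefficients equals the sum of squared signal values.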

15 Wavelets: so? Wavelets compress many real signals well (image compression and processing; vision; astronomy, seismology, …). Wavelet coefficients can be updated as new points arrive [Kotidis+].

16 Wavelets: correlations. [Figure: signal x_t over time t and its Haar coefficients W_{1,1}, …, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1} on the time/frequency grid, with an “=” annotation]

17 Wavelets: correlations. [Figure: signal x_t over time t and its Haar coefficients W_{1,1}, …, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1} on the time/frequency grid]

18 Main idea: correlations. Wavelets are good… …we can do even better: one number… …and the fact that they are equal/correlated.

19 Proposed method. $W_{l,t} = \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$ and, at another level, $W_{l',t'} = \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$ [Figure: coefficients W_{l,t}, W_{l,t-1}, W_{l,t-2} and W_{l',t'}, W_{l',t'-1}, W_{l',t'-2} marked on the time/frequency grid] Small windows suffice… (k~4)

20 More details… Update of wavelet coefficients (incremental). Update of linear models (incremental; RLS). Feature selection (single-pass): not all correlations are significant; throw away the insignificant ones. Very important!! [see paper]
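
The incremental update of a linear model by recursive least squares (RLS) can be sketched as follows. This is a generic textbook RLS without a forgetting factor, not the authors' code; the class name `RecursiveLeastSquares`, the helper `fit_ar`, and the initial gain of 1e6 are all assumptions made for illustration. Each update touches a k×k matrix once, so it costs O(k²) time regardless of how many points have been seen.

```python
import math


class RecursiveLeastSquares:
    """Generic RLS for y ~ b . x with k coefficients (illustrative sketch)."""

    def __init__(self, k, init_gain=1e6):
        self.k = k
        self.b = [0.0] * k
        # P starts as a large multiple of the identity (weak prior on b).
        self.P = [[init_gain if i == j else 0.0 for j in range(k)]
                  for i in range(k)]

    def update(self, x, y):
        """Fold in one observation (x, y); returns the prediction error."""
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(self.b[i] * x[i] for i in range(k))
        for i in range(k):
            self.b[i] += gain[i] * err
            for j in range(k):
                # Rank-one downdate of P; P stays symmetric.
                self.P[i][j] -= gain[i] * Px[j]
        return err


def fit_ar(series, k):
    """Fit series[t] ~ sum_i b[i-1] * series[t-i] in a single pass."""
    rls = RecursiveLeastSquares(k)
    for t in range(k, len(series)):
        # Regressors are the k most recent values, newest first.
        x = [series[t - i] for i in range(1, k + 1)]
        rls.update(x, series[t])
    return rls.b


# A pure sinusoid satisfies s_t = 2*cos(w)*s_{t-1} - s_{t-2} exactly.
coefficients = fit_ar([math.sin(0.3 * t) for t in range(200)], k=2)
```

In the setting of the slides, the same update would presumably be applied to sequences of wavelet coefficients rather than to raw samples.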

21 Complexity [see paper]. Model update: space $O(\lg N + m k^2) \approx O(\lg N)$; time $O(k^2) \approx O(1)$. Where N: number of points (so far); k: number of regression coefficients, fixed; m: number of linear models, $O(\lg N)$.

22 Outline: Introduction - motivation; Problem #1: Stream Mining (motivation, main idea, experimental results); Problem #2: Graphs & Virus propagation; Conclusions

23 Setup. First half used for model estimation; models applied forward to forecast the entire second half. AR, Seasonal AR (SAR): R; simplest possible estimation, no maximum likelihood estimation (MLE), etc.; … vs. Python scripts.

24 Results: synthetic data, triangle pulse. AR captures the wrong trend (or none); Seasonal AR (SAR) estimation fails.

25 Results: synthetic data, mix (sine + square pulse). AR captures the wrong trend (or none); Seasonal AR estimation fails.

26 Results: real data, automobile traffic. Daily periodicity with rush-hour peaks; bursty “noise” at smaller time scales (filtered).

27 Results: real data, automobile traffic. Daily periodicity with rush-hour peaks; bursty “noise” at smaller time scales. AR fails to capture any trend (average); Seasonal AR estimation fails.

28 Results: real data, automobile traffic. Daily periodicity with rush-hour peaks; bursty “noise” at smaller time scales. AWSOM spots periodicities, automatically.

29 Results: real data, automobile traffic. Daily periodicity with rush-hour peaks; bursty “noise” at smaller time scales. Generation with identified noise.

30 Results: real data, sunspot intensity. Slightly time-varying “period”. AR captures the wrong trend (average). Seasonal ARIMA: captures an immediate, wrong downward trend; requires a human to determine the seasonal component period (fixed).

31 Results: real data, sunspot intensity. Slightly time-varying “period”. Estimation: 40 minutes (R) vs. 9 seconds (Python).

32 Variance. Variance (log-power) vs. scale: “noise” diagnostic (if decreasing linear…); can be used to estimate noise parameters (~Hurst exponent). [Plot annotation: ~1 hour]

33 Running time. [Plot: time (t) vs. stream size (N)]

34 Space requirements. Equal total number of model parameters.

35 Conclusion. Adapt to and handle arbitrary periodic components; no human intervention/tuning; single pass over the data; limited memory (logarithmic); constant-time update.

36 Conclusion. No human: adapt to and handle arbitrary periodic components; no human intervention/tuning. Limited resources: single pass over the data; limited memory (logarithmic); constant-time update.

37 Outline: Introduction - motivation; Problem #1: Streams; Problem #2: Graphs & Virus propagation (motivation & problem definition, related work, main idea, experiments); Conclusions

38 Introduction. Graphs are ubiquitous: Internet map [lumeta.com]; food web [Martinez ’91]; protein interactions [genomebiology.com]; friendship network [Moody ’01].

39 Introduction. What can we do with graph analysis? Immunization; information dissemination; network value of a customer [Domingos+]. [Figure: “needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]

40 Problem definition. Q1: How does a virus spread across an arbitrary network? Q2: Will it create an epidemic? (In a sensor setting with a ‘gossip’ protocol: will a rumor/query spread?)

41 Framework. Susceptible-Infected-Susceptible (SIS) model: cured nodes immediately become susceptible. [Diagram: Susceptible/healthy → Infected & infectious, labeled “infected by neighbor”; Infected & infectious → Susceptible/healthy, labeled “cured internally”]

42 Framework. β: prob. an infected neighbor attacks; δ: prob. an infected node heals. [Diagram: Susceptible → Infected, labeled “infected by neighbor”; Infected → Susceptible, labeled “cured internally”]

43 The model. (Virus) birth rate β: probability that an infected neighbor attacks. (Virus) death rate δ: probability that an infected node heals. [Diagram: node N with neighbors N1, N2, N3; states Infected and Healthy; transitions labeled “Prob. β” and “Prob. δ”]
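
To make the β/δ dynamics concrete, here is a minimal discrete-time SIS simulation. The function names `simulate_sis` and `star_graph`, the synchronous update order (attacks and cures in one step are both evaluated against the set of nodes infected at the start of that step), and the adjacency-dict representation are assumptions of this sketch; the slides do not specify a simulation procedure.

```python
import random


def star_graph(d):
    """Adjacency dict for a star: hub 0 connected to spokes 1..d."""
    adj = {0: list(range(1, d + 1))}
    for spoke in range(1, d + 1):
        adj[spoke] = [0]
    return adj


def simulate_sis(adj, beta, delta, initially_infected, steps, rng):
    """Return the number of infected nodes at time 0, 1, ..., steps.

    adj maps each node to its neighbor list (undirected graph).
    Each step, every infected node attacks each healthy neighbor with
    probability beta and heals with probability delta; healed nodes are
    immediately susceptible again (SIS).
    """
    infected = set(initially_infected)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        cured = {u for u in infected if rng.random() < delta}
        infected = (infected - cured) | newly_infected
        history.append(len(infected))
    return history


# Example run on a star with one hub and 99 spokes, seeded for repeatability.
counts = simulate_sis(star_graph(99), beta=0.05, delta=0.4,
                      initially_infected=[0], steps=50, rng=random.Random(0))
```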

44 Epidemic threshold τ. Defined as the value of τ such that, if β/δ < τ, an epidemic cannot happen. Thus: given a graph, compute its epidemic threshold.

45 Epidemic threshold τ. What should τ depend on? Avg. degree? And/or highest degree? And/or variance of degree? And/or determinant of the adjacency matrix?

46 Basic homogeneous model. Homogeneous graphs [Kephart-White ’91, ’93]: epidemic threshold = $1/\langle k \rangle$, where $\langle k \rangle$ is the average degree. Homogeneous connectivity, i.e., all nodes have ~same degree (unrealistic).

47 Power-law networks. Model for Barabási-Albert networks [Pastor-Satorras & Vespignani, ’01, ’02]: epidemic threshold = $\langle k \rangle / \langle k^2 \rangle$; for BA-type networks only, with γ = 3 (γ = power-law exponent, i.e., the slope of the power law).

48 Epidemic threshold. Homogeneous graphs: $1/\langle k \rangle$. BA (γ = 3): $\langle k \rangle / \langle k^2 \rangle$. More complicated graphs? Arbitrary, REAL graphs? How many parameters??
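
The two degree-based formulas can be evaluated from a degree sequence alone, as in the sketch below. The function name `degree_thresholds` is hypothetical, and applying both formulas to a star graph (one hub, 99 spokes) is purely illustrative: neither model was derived for such a topology, which is the point of the questions on this slide.

```python
def degree_thresholds(degrees):
    """Return (1/<k>, <k>/<k^2>) for a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                   # <k>: average degree
    mean_k2 = sum(d * d for d in degrees) / n   # <k^2>: second moment
    return 1.0 / mean_k, mean_k / mean_k2


# Star with one hub of degree 99 and 99 spokes of degree 1.
star_degrees = [99] + [1] * 99
homogeneous_tau, power_law_tau = degree_thresholds(star_degrees)
```

For this degree sequence, ⟨k⟩ = 1.98 and ⟨k²⟩ = 99, so the two formulas disagree by more than an order of magnitude on the same graph.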

49 Epidemic threshold. [Theorem] We have no epidemic if $\beta/\delta < \tau = 1/\lambda_{1,A}$.

50 Epidemic threshold. [Theorem] We have no epidemic if $\beta/\delta < \tau = 1/\lambda_{1,A}$, where β: attack prob.; δ: recovery prob.; τ: epidemic threshold; $\lambda_{1,A}$: largest eigenvalue of adj. matrix A. Proof: [Wang+03]
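
Given the theorem, computing the threshold for a graph reduces to finding the largest eigenvalue of its adjacency matrix. The sketch below uses plain power iteration on the shifted matrix A + I; the shift is a choice made here so that the iteration also converges on bipartite graphs such as stars, where A has eigenvalues of equal magnitude and opposite sign. The function names and the adjacency-dict input are assumptions, and the code expects an undirected graph with at least one edge.

```python
import math


def largest_eigenvalue(adj, max_iter=10000, tol=1e-12):
    """Largest adjacency eigenvalue of an undirected graph.

    adj maps each node to its neighbor list; the graph must be symmetric
    (v in adj[u] whenever u in adj[v]).
    """
    nodes = list(adj)
    v = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(max_iter):
        # One power-iteration step with the shifted matrix A + I.
        w = {u: v[u] + sum(v[n] for n in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in w.values()))
        v = {u: val / norm for u, val in w.items()}
        # Rayleigh quotient v^T A v for the unit vector v.
        new_lam = sum(v[u] * sum(v[n] for n in adj[u]) for u in nodes)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam


def epidemic_threshold(adj):
    """tau = 1 / lambda_{1,A}; requires a graph with at least one edge."""
    return 1.0 / largest_eigenvalue(adj)


# Complete graph on 5 nodes: every node has degree 4.
k5 = {u: [v for v in range(5) if v != u] for u in range(5)}
tau_k5 = epidemic_threshold(k5)
```

For large graphs a sparse eigensolver from a numerical library would be the more practical choice; the loop above is only meant to show what is being computed.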

51 Epidemic threshold for various networks. Sanity checks / older results: homogeneous networks: $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$, where $\langle k \rangle$ = average degree. This is the same result as that of Kephart & White!

52 Epidemic threshold for various networks. Sanity checks / older results: star networks: $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$, where d = the degree of the central node.
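
The star result can be checked directly: the vector that assigns √d to the hub and 1 to every spoke satisfies A v = √d · v. Because this eigenvector is strictly positive and the star is connected, the Perron-Frobenius theorem implies that √d is the largest eigenvalue. The helper name `is_eigenpair` below is hypothetical and the code is only a verification sketch.

```python
import math


def is_eigenpair(adj, v, lam, tol=1e-9):
    """Check that A v = lam * v for the adjacency matrix A given by adj."""
    return all(
        abs(sum(v[n] for n in adj[u]) - lam * v[u]) < tol
        for u in adj
    )


d = 99                      # degree of the central node
hub = 0
spokes = list(range(1, d + 1))
adj = {hub: spokes}
adj.update({s: [hub] for s in spokes})

# Candidate eigenvector: sqrt(d) on the hub, 1 on each spoke.
v = {hub: math.sqrt(d)}
v.update({s: 1.0 for s in spokes})

star_check = is_eigenpair(adj, v, math.sqrt(d))
tau_star = 1.0 / math.sqrt(d)
```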

53 Epidemic threshold for various networks. Sanity checks / older results: infinite power-law networks: $\lambda_{1,A} = \infty$; τ = 0: *any* virus has a chance! [Barabasi et al]. Finite power-law networks: $\tau = 1/\lambda_{1,A}$.

54 Outline: Introduction - motivation; Problem #1: Streams; Problem #2: Graphs & Virus propagation (motivation & problem definition, related work, main idea, experiments); Conclusions

55 Experiments. 2 graphs: star network (one “hub” + 99 “spokes”); “Oregon” Internet AS graph (10,900 nodes, 31,180 edges), topology.eecs.umich.edu/data.html. More in our paper: [SRDS ’03]

56 Experiments (Star). [Plot: curves for β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

57 Experiments (Oregon). [Plot: curves for β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

58 Our prediction vs. previous prediction. Our predictions are more accurate. [Plots for Oregon and Star: number of infected nodes vs. β/δ, with curves labeled “PL3” and “Our”]

59 Conclusions. We found an epidemic threshold that applies to any network topology, and that depends on only one parameter of the graph.

60 Overall conclusions. Automatic stream mining: AWSOM. Graphs and virus propagation: eigenvalue.

61 Ongoing / related work. Streams: how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]; ‘network tomography’ [w/ Airoldi+]. Graphs: graph partitioning [w/ Deepay+]; important subgraphs [w/ Tomkins + McCurley]; graph generators [RMAT, w/ Deepay].

62 Thank you! Contact info: christos @ cs.cmu.edu; spapadim @ cs.cmu.edu; deepay @ cs.cmu.edu

63 Main references. Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003. [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.

64 Additional references. C. Faloutsos, K. McCurley, A. Tomkins: Connection Subgraphs, SIAM-DM 2004 workshop on link analysis. D. Chakrabarti, Y. Zhan, C. Faloutsos: RMAT: A recursive graph generator, SIAM-DM 2004. Edoardo Airoldi, Christos Faloutsos: iFilter: Network tomography using particle filters (submitted).

