School of Computer Science, Carnegie Mellon. WRIGHT, 2005. C. Faloutsos. Data Mining using Fractals and Power Laws. Christos Faloutsos, Carnegie Mellon University.

1 Data Mining using Fractals and Power Laws. Christos Faloutsos, Carnegie Mellon University.

2 THANK YOU! Prof. N. Bourbakis

3 Overview. Goals/motivation: find patterns in large datasets: (A) sensor data; (B) network/graph data. Solutions: self-similarity and power laws. Discussion.

4 Applications of sensors/streams. ‘Smart house’: monitoring temperature, humidity, etc. Financial, sales, and economic series.

5 Motivation - Applications. Medical: ECGs+; blood pressure etc. monitoring. Scientific data: seismological; astronomical; environment / anti-pollution; meteorological.

6 Motivation - Applications (cont’d). Civil/automobile infrastructure: bridge vibrations [Oppenheim+02]; road conditions / traffic monitoring. [Figure: automobile traffic; # cars vs. time]

7 Motivation - Applications (cont’d). Computer systems: web servers (buffering, prefetching); network traffic monitoring; ... http://repository.cs.vt.edu/lbl-conn-7.tar.Z

8 Web traffic [Crovella Bestavros, SIGMETRICS’96]

9 Self-* Storage (Ganger+). ... survivable, self-managing storage infrastructure ... a storage brick (0.5–5 TB) ... ~1 PB. “self-*” = self-managing, self-tuning, self-healing, … Goal: 1 petabyte (PB) for CMU researchers. www.pdl.cmu.edu/SelfStar

10 Problem definition. Given: one or more sequences x_1, x_2, …, x_t, …; (y_1, y_2, …, y_t, …). Find: patterns; clusters; outliers; forecasts.

11 Problem #1. Find patterns in large datasets. [Figure: # bytes vs. time]

12 Problem #1. Find patterns in large datasets. [Figure: # bytes vs. time; Poisson (indep., ident. distr.)]

14 Problem #1. Find patterns in large datasets. [Figure: # bytes vs. time; Poisson (indep., ident. distr.)] Q: Then, how to generate such bursty traffic?

15 Overview. Goals/motivation: find patterns in large datasets: (A) sensor data; (B) network/graph data. Solutions: self-similarity and power laws. Discussion.

16 Problem #2 - network and graph mining. What does the Internet look like? What does the web look like? What constitutes a ‘normal’ social network? What is the ‘network value’ of a customer? Which gene/species affects the others the most?

17 Network and graph mining. Food Web [Martinez ’91]; Protein Interactions [genomebiology.com]; Friendship Network [Moody ’01]. Graphs are everywhere!

18 Problem #2. Given a graph: which node to market to / defend / immunize first? Are there unnatural subgraphs (e.g., criminals’ rings)? [Figure from Lumeta: ISPs 6/1999]

19 Solutions. New tools: power laws, self-similarity and ‘fractals’ work where traditional assumptions fail. Let’s see the details:

20 Overview. Goals/motivation: find patterns in large datasets: (A) sensor data; (B) network/graph data. Solutions: self-similarity and power laws. Discussion.

21 What is a fractal? = self-similar point set, e.g., Sierpinski triangle: ... zero area: (3/4)^inf; infinite length: (3/2)^inf. Q: What is its dimensionality?

22 What is a fractal? = self-similar point set, e.g., Sierpinski triangle: ... zero area: (3/4)^inf; infinite length: (3/2)^inf. Q: What is its dimensionality? A: log 3 / log 2 = 1.58 (!?!)
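
The arithmetic behind the answer can be checked with a short sketch. It assumes the usual midpoint construction, where each step keeps 3 of the 4 sub-triangles, so the area shrinks by 3/4 per step while the total edge length grows by 3/2 per step. The function names `sierpinski_stats` and `similarity_dimension` are made up for this illustration.

```python
import math

def sierpinski_stats(steps):
    """Area fraction, edge-length factor and piece count after `steps` subdivisions."""
    area = (3 / 4) ** steps      # each step keeps 3 of the 4 sub-triangles
    length = (3 / 2) ** steps    # 3x as many triangles, each with half the side
    pieces = 3 ** steps
    return area, length, pieces

def similarity_dimension(copies, scale):
    """Self-similarity dimension: log(#copies) / log(scale-down factor)."""
    return math.log(copies) / math.log(scale)

area, length, pieces = sierpinski_stats(20)
print(f"after 20 steps: area fraction {area:.4f}, length factor {length:.0f}, pieces {pieces}")
# 3 copies, each scaled down by a factor of 2
print(f"dimension = {similarity_dimension(3, 2):.2f}")
```

The same formula gives 1 for a line segment (2 copies at scale 2) and 2 for a filled square (4 copies at scale 2), which is why the value 1.58 sits between the two.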

23 Intrinsic (‘fractal’) dimension. Q: fractal dimension of a line? Q: fd of a plane?

24 Intrinsic (‘fractal’) dimension. Q: fractal dimension of a line? A: nn(<= r) ~ r^1 (‘power law’: y = x^a). Q: fd of a plane? A: nn(<= r) ~ r^2. fd == slope of (log(nn) vs. log(r)).

25 Sierpinski triangle. [Figure: log(#pairs within <= r) vs. log(r); slope 1.58] == ‘correlation integral’ = CDF of pairwise distances.

26 Observations: fractals <=> power laws. Closely related: fractals <=> self-similarity <=> scale-free <=> power laws (y = x^a; F = K r^-2) (vs. y = e^(-ax) or y = x^a + b). [Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
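
The ‘slope of log(#pairs within <= r) vs. log(r)’ recipe can be tried on synthetic data. The sketch below is an illustration under stated assumptions, not the code distributed with the talk: it samples Sierpinski-triangle points with the ‘chaos game’, counts pairs by brute force (quadratic in the number of points), and fits the slope by ordinary least squares. All function names and the choice of radii are hypothetical.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.0, 0.0  # start on a corner, so every iterate stays on the triangle
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        pts.append((x, y))
    return pts

def pairs_within(points, radii):
    """Correlation integral: number of point pairs at distance <= r, for each r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

radii = [2.0 ** -k for k in range(2, 7)]  # 1/4, 1/8, ..., 1/64
points = sierpinski_points(1000)
fd = loglog_slope(radii, pairs_within(points, radii))
print(f"estimated correlation fractal dimension: {fd:.2f}")
```

With a finite sample the estimate only approximates log 3 / log 2; the range of radii matters, since very large radii saturate and very small ones contain too few pairs.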

27 Outline. Problems. Self-similarity and power laws. Solutions to posed problems. Discussion.

28 Solution #1: traffic. Disk traces: self-similar (also: [Leland+94]). [Figure: # bytes vs. time] How to generate such traffic?

29 Solution #1: traffic. Disk traces (80-20 ‘law’), ‘multifractals’. [Figure: # bytes vs. time, split 20% / 80%]

30 80-20 / multifractals. [Figure: 20 / 80 split]

31 80-20 / multifractals. p ; (1-p) in general. Yes, there are dependencies. [Figure: 20 / 80 split]
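
One way to read the 80-20 slides is as a recursive generator: split the total volume between the two halves of the time interval in proportions p and 1-p, then repeat inside each half. The sketch below follows that reading; the function name `bmodel`, the random choice of which half gets the larger share, and the parameter values are assumptions made for illustration.

```python
import random

def bmodel(total, levels, bias=0.8, seed=0):
    """Spread `total` over 2**levels time slots by recursive bias/(1-bias) splits."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in series:
            big, small = volume * bias, volume * (1 - bias)
            # the larger share goes to the left or right half at random
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        series = nxt
    return series

traffic = bmodel(total=1_000_000, levels=10)  # 1024 time slots
top = sorted(traffic, reverse=True)[: len(traffic) // 5]
print(f"slots: {len(traffic)}, busiest slot: {max(traffic):.0f} bytes")
print(f"share carried by the busiest 20% of slots: {sum(top) / sum(traffic):.0%}")
```

Setting `bias=0.5` gives perfectly flat traffic; the further the bias is from 0.5, the burstier the series. The total volume is preserved at every level, since each split only redistributes it.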

32 More on 80/20: PQRS. Part of ‘self-* storage’ project. [Figure: cylinder # vs. time]

33 More on 80/20: PQRS. Part of ‘self-* storage’ project. [Figure: quadrants p, q, r, s, recursively subdivided]

34 Overview. Goals/motivation: find patterns in large datasets: (A) sensor data; (B) network/graph data. Solutions: self-similarity and power laws, for sensor/traffic data and for network/graph data. Discussion.

35 Problem #2 - topology. What does the Internet look like? Any rules?

36 Patterns? Avg degree is, say, 3.3. Pick a node at random and guess its degree, exactly (-> “mode”). [Figure: count vs. degree; avg: 3.3]

37 Patterns? Avg degree is, say, 3.3. Pick a node at random and guess its degree, exactly (-> “mode”). A: 1!! [Figure: count vs. degree; avg: 3.3]

38 Patterns? Avg degree is, say, 3.3. Pick a node at random: what is the degree you expect it to have? A: 1!! A’: very skewed distr. Corollary: the mean is meaningless! (and std -> infinity (!)) [Figure: count vs. degree; avg: 3.3]

39 Solution #2: Rank exponent R. A1: Power law in the degree distribution [SIGCOMM99]. [Figure: internet domains; log(degree) vs. log(rank); slope -0.82; labeled points att.com, ibm.com]
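
The rank exponent is the slope of the log(degree) vs. log(rank) plot. A minimal sketch of that fit follows; it uses plain least squares on made-up degrees that obey the power law exactly, so it only illustrates the definition and says nothing about how [SIGCOMM99] handled real, noisy data. The name `rank_exponent` is hypothetical.

```python
import math

def rank_exponent(degrees):
    """Slope of log(degree) vs. log(rank), with nodes sorted by decreasing degree."""
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic (made-up) degrees following degree = 1000 * rank^-0.82 exactly.
degrees = [1000 * rank ** -0.82 for rank in range(1, 501)]
print(f"rank exponent R = {rank_exponent(degrees):.2f}")  # prints: rank exponent R = -0.82
```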

40 Solution #2’: Eigen exponent E. A2: power law in the eigenvalues of the adjacency matrix. [Figure: eigenvalue vs. rank of decreasing eigenvalue, May 2001; exponent = slope; E = -0.48]

41 Power laws - discussion. Do they hold over time? Do they hold on other graphs/domains?

42 Power laws - discussion. Do they hold over time? Yes! For multiple years [Siganos+]. Do they hold on other graphs/domains? Yes! Web sites and links [Tomkins+], [Barabasi+]; peer-to-peer graphs (gnutella-style); who-trusts-whom (epinions.com).

43 Time evolution: rank R. The rank exponent has not changed! [Siganos+] [Figure: domain level; log(degree) vs. log(rank); slope -0.82; labeled points att.com, ibm.com]

44 The peer-to-peer topology. Number of immediate peers (= degree) follows a power law [Jovanovic+]. [Figure: count vs. degree]

45 epinions.com: who-trusts-whom [Richardson + Domingos, KDD 2001]. [Figure: count vs. (out) degree]

46 Why care about these patterns? Better graph generators [BRITE, INET] for simulations and extrapolations; ‘abnormal’ graph and subgraph detection.

47 Outline. Problems. Fractals. Solutions. Discussion: what else can they solve? How frequent are fractals?

48 What else can they solve? Separability [KDD’02]; forecasting [CIKM’02]; dimensionality reduction [SBBD’00]; non-linear axis scaling [KDD’02]; disk trace modeling [PEVA’02]; selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]; ...

49 Full Content Indexing, Search and Retrieval from Digital Video Archives. www.informedia.cs.cmu.edu [Figure: query (6 TB of data); search results (ranked); storyboard; collage with maps, common phrases, named entities and dynamic query sliders]

50 What else can they solve? Separability [KDD’02]; forecasting [CIKM’02]; dimensionality reduction [SBBD’00]; non-linear axis scaling [KDD’02]; disk trace modeling [PEVA’02]; selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]; ...

51 Problem #3 - spatial d.m. Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘spiral’ and ‘elliptical’ galaxies. Patterns? (not Gaussian; not uniform). Attraction/repulsion? Separability?

52 Solution #3: spatial d.m. CORRELATION INTEGRAL! [Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell and ell-ell pairs; 1.8 slope; plateau!; repulsion!]

53 Solution #3: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] [Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell and ell-ell pairs; 1.8 slope; plateau!; repulsion!]

54 spatial d.m. Heuristic on choosing # of clusters. [Figure: radii r1, r2]

55 Solution #3: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell and ell-ell pairs; 1.8 slope; plateau!; repulsion!]

56 Problem #4: dim. reduction. Given attributes x_1, ..., x_n (possibly non-linearly correlated), drop the useless ones. [Figure: mpg vs. cc]

57 Problem #4: dim. reduction. Given attributes x_1, ..., x_n (possibly non-linearly correlated), drop the useless ones. (Q: why? A: to avoid the ‘dimensionality curse’.) Solution: keep on dropping attributes, until the f.d. changes! [SBBD’00] [Figure: mpg vs. cc]
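
The ‘keep on dropping attributes until the f.d. changes’ idea can be sketched with a greedy loop. This is an illustration only and should not be read as the algorithm of [SBBD’00]: the brute-force pair counting, the fixed radii, the `tolerance` threshold and the names `correlation_fd` and `drop_useless` are all assumptions. The synthetic data has three attributes, where `b = a^2` is a non-linear function of `a` and `c` is independent.

```python
import math
import random

def correlation_fd(rows, radii=(0.05, 0.1, 0.2)):
    """Slope of log(#pairs within <= r) vs. log(r) for the given rows."""
    counts = [0] * len(radii)
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            d = math.dist(rows[i], rows[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    lx = [math.log(r) for r in radii]
    ly = [math.log(c) for c in counts]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)

def drop_useless(rows, names, tolerance=0.3):
    """Greedily drop attributes while the fractal dimension stays nearly unchanged."""
    keep = list(range(len(names)))
    target = correlation_fd(rows)  # f.d. of the full dataset
    changed = True
    while changed and len(keep) > 1:
        changed = False
        for col in reversed(keep):
            trial = [c for c in keep if c != col]
            projected = [tuple(row[c] for c in trial) for row in rows]
            if abs(correlation_fd(projected) - target) <= tolerance:
                keep, changed = trial, True  # attribute `col` added nothing
                break
    return [names[c] for c in keep]

rng = random.Random(0)
rows = []
for _ in range(300):
    a, c = rng.random(), rng.random()
    rows.append((a, a * a, c))  # b = a^2 depends non-linearly on a
kept = drop_useless(rows, ["a", "b", "c"])
print("attributes kept:", kept)
```

Dropping `c` collapses the points onto a curve and the estimated dimension falls from about 2 to about 1, so `c` is kept; one of `a` and `b` can go because each determines the other. A linear method such as PCA would not see that `b` is redundant.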

58 Outline. Problems. Fractals. Solutions. Discussion: what else can they solve? How frequent are fractals?

59 Fractals & power laws appear in numerous settings: medical; geographical / geological; social; computer-system related.

60 Fractals: brain scans. [Figure: log(#octants) vs. octree levels; slope 2.63 = fd]

61 fMRI brain scans. Center for Cognitive Brain Imaging @ CMU; Tom Mitchell, Marcel Just, ++. fMRI goal: human brain function. Which voxels are active for a given cognitive task?

62 More fractals. Periphery of malignant tumors: ~1.5; benign: ~1.3 [Burdet+].

63 More fractals. Cardiovascular system: 3 (!); lungs: ~2.9.

64 Fractals & power laws appear in numerous settings: medical; geographical / geological; social; computer-system related.

65 More fractals. Coastlines: 1.2-1.58. [Figure: example curves with dimensions 1, 1.1, 1.3]

67 Cross-roads of Montgomery county: any rules? [Figure: GIS points]

68 GIS. A: self-similarity: intrinsic dim. = 1.51. [Figure: log(#pairs within <= r) vs. log(r); slope 1.51]

69 Examples: LB county. Long Beach county of CA (road end-points). [Figure: log(#pairs) vs. log(r); slope 1.7]

70 More power laws: areas, Korcak’s law. Scandinavian lakes. Any pattern?

71 More power laws: areas, Korcak’s law. Scandinavian lakes: area vs. complementary cumulative count (log-log axes). [Figure: log(count(>= area)) vs. log(area)]

72 More power laws: Korcak. Japan islands: area vs. cumulative count (log-log axes). [Figure: log(count(>= area)) vs. log(area)]
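
The Korcak plots are complementary cumulative counts on log-log axes. A minimal sketch of how such a plot could be built and its slope read off is given below; the ‘areas’ are synthetic Pareto samples with a made-up exponent of 0.8, not lake or island data, and the helper names `ccdf` and `loglog_slope` are hypothetical.

```python
import bisect
import math
import random

def ccdf(values):
    """Return (value, count of items >= value) pairs, sorted by increasing value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, n - bisect.bisect_left(ordered, v)) for v in ordered]

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(value)."""
    lx = [math.log(v) for v, _ in pairs]
    ly = [math.log(c) for _, c in pairs]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)

rng = random.Random(0)
areas = [rng.paretovariate(0.8) for _ in range(5000)]  # synthetic 'areas'
slope = loglog_slope(ccdf(areas))
print(f"slope of log(count(>= area)) vs. log(area): {slope:.2f}")
```

For a power-law tail the points fall close to a straight line whose slope is the negative of the exponent; an exponential distribution would bend downward on the same axes.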

73 More power laws. Energy of earthquakes (Gutenberg-Richter law) [simscience.org]. [Figures: energy released vs. day; log(count) vs. magnitude = log(energy)]

74 Fractals & power laws appear in numerous settings: medical; geographical / geological; social; computer-system related.

75 A famous power law: Zipf’s law. Bible: rank vs. frequency (log-log). [Figure: “rank/frequency plot”; log(freq) vs. log(rank); top words “the”, “a”]

76 TELCO data. [Figure: count of customers vs. # of service units; ‘best customer’ marked]

77 SALES data, store #96. [Figure: count of products vs. # units sold; “aspirin” marked]

78 Olympic medals (Sydney ’00, Athens ’04). [Figure: log(#medals) vs. log(rank)]

80 Even more power laws. Income distribution (Pareto’s law); size of firms; publication counts (Lotka’s law).

81 Even more power laws. Library science (Lotka’s law of publication count) and citation counts (citeseer.nj.nec.com 6/2001). [Figure: log(count) vs. log(#citations); ‘Ullman’ marked]

82 Even more power laws. Web hit counts [w/ A. Montgomery]. [Figure: web site traffic; log(count) vs. log(freq); Zipf; “yahoo.com” marked]

83 Fractals & power laws appear in numerous settings: medical; geographical / geological; social; computer-system related.

84 Power laws, cont’d. In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]. [Figure: log(freq) vs. log indegree, from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins]

85 Power laws, cont’d. In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]. [Figure: log(freq) vs. log indegree, from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins]

86 Power laws, cont’d. In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]; length of file transfers [Crovella+Bestavros ’96]; duration of UNIX jobs [Harchol-Balter].

87 Conclusions. Fascinating problems in data mining: find patterns in sensors/streams and in graphs/networks.

88 Conclusions - cont’d. New tools for data mining: self-similarity & power laws appear in many cases. Bad news: they lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance). Good news: ‘correlation integral’ for separability; rank/frequency plots; 80-20 (multifractals); (Hurst exponent, strange attractors, renormalization theory, ++).

89 Resources. Manfred Schroeder, “Fractals, Chaos, Power Laws”, 1991.

90 References. [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the ‘Correlation’ Fractal Dimension, Proc. of VLDB, pp. 299-310, 1995. M. Crovella and A. Bestavros, Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes, SIGMETRICS ’96.

91 References. J. Considine, F. Li, G. Kollios and J. Byers, Approximate Aggregation Techniques for Sensor Databases, ICDE ’04 (best paper award). [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13.

92 References. [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the ‘80-20 Law’, Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996. [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000.

93 References. [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996. [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999.

94 References. [ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger and D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994. [brite] Alberto Medina, Anukool Lakhina, Ibrahim Matta and John Byers, BRITE: An Approach to Universal Topology Generation, MASCOTS ’01.

95 References. [icde99] Guido Proietti and Christos Faloutsos, I/O Complexity for Range Queries on Region Data Stored Using an R-tree, ICDE ’99. Stan Sclaroff, Leonid Taycher and Marco La Cascia, “ImageRover: A Content-Based Image Browser for the World Wide Web”, Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp. 2-9, 1997.

96 References. [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos, Tri-plots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA.

97 Thank you! Contact info: christos cs.cmu.edu; www.cs.cmu.edu/~christos (w/ papers, datasets, code for fractal dimension estimation, etc.)

