Powerful Tools for Data Mining: Fractals, Power laws, SVD. C. Faloutsos, Carnegie Mellon University.


PART II: Fractals

Carnegie Mellon, MDIC 04. Copyright: C. Faloutsos (2004)

Intro to fractals - outline
- Motivation: 3 problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples and tools
- Discussion: putting fractals to work!
- Conclusions: practitioner’s guide
- Appendix: gory details (box-counting plots)

Problem #1: GIS - points
Road end-points of Montgomery county:
- Q1: how many disk accesses (d.a.) for an R-tree?
- Q2: what distribution? Not uniform, not Gaussian; no rules??

Problem #2: spatial data mining (d.m.)
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘spiral’ and ‘elliptical’ galaxies (cf. stores and households ...)
- patterns?
- attraction/repulsion?
- how many ‘spi’ within r from an ‘ell’?

Problem #3: traffic
Disk trace (from HP, J. Wilkes); Web traffic: fit a model.
[Plot: #bytes vs time, real trace compared with Poisson]
- how many explosions to expect?
- queue length distribution?

Common answer: fractals / self-similarities / power laws
Seminal works from Hilbert, Minkowski, Cantor, Mandelbrot (Hausdorff, Lyapunov, Ken Wilson, …)

What is a fractal?
A self-similar point set, e.g., the Sierpinski triangle: zero area; infinite length!

Definitions (cont’d)
Paradox: infinite perimeter; zero area!
‘Dimensionality’: between 1 and 2; actually log(3)/log(2) ≈ 1.58

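Definition of fractal dimension (fd), ONLY for a perfectly self-similar point set:
fd = log(n)/log(f), where n is the number of self-similar pieces and f is the factor by which each piece is scaled down.
Sierpinski triangle: fd = log(3)/log(2) ≈ 1.58 (zero area; infinite length!)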

Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: 1 (= log(2)/log(2)!)
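The ratio log(n)/log(f) is easy to check numerically. The sketch below is only an illustration; the function name `self_similar_dimension` is made up here and is not from the slides.

```python
import math

def self_similar_dimension(n_pieces, scale_factor):
    """log(n)/log(f): n self-similar pieces, each shrunk by a factor f."""
    return math.log(n_pieces) / math.log(scale_factor)

sierpinski_fd = self_similar_dimension(3, 2)  # 3 half-size copies
line_fd = self_similar_dimension(2, 2)        # a segment = 2 half-size segments
square_fd = self_similar_dimension(4, 2)      # a square = 4 half-size squares
print(sierpinski_fd, line_fd, square_fd)
```

The line and the square come out as integers, while the Sierpinski triangle lands strictly between 1 and 2.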

Intrinsic (‘fractal’) dimension
Q: definition for a given set of points?
[Table: x and y coordinates of the points]

Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn(<= r) ~ r^1, where nn is the number of neighbors (‘power law’: y = x^a)
Q: fd of a plane?
A: nn(<= r) ~ r^2
fd == slope of (log(nn) vs log(r))

Intrinsic (‘fractal’) dimension
Algorithm, to estimate it?
Notice: avg nn(<= r) is exactly tot#pairs(<= r) / N

Sierpinski triangle
[Plot: log(#pairs within <= r) vs log(r), the ‘correlation integral’; slope 1.58]
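The pair-count plot can be reproduced with a brute-force sketch. Everything here is an assumption made for illustration: the points are sampled with the "chaos game", the sample size and radii are arbitrary, and the O(N^2) pair count is the naive method, not a fast one.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle via the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    points = []
    for step in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2   # jump halfway to a random corner
        if step >= 20:                      # skip the initial transient
            points.append((x, y))
    return points

def pair_counts(points, radii):
    """Count unordered pairs at distance <= r, for each r (brute force)."""
    counts = [0] * len(radii)
    for i, (xi, yi) in enumerate(points):
        for xj, yj in points[i + 1:]:
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

points = sierpinski_points(1000)
radii = [2.0 ** -k for k in range(3, 8)]    # 1/8 down to 1/128
counts = pair_counts(points, radii)
fd = fit_slope([math.log(r) for r in radii], [math.log(c) for c in counts])
print(round(fd, 2))
```

With a finite sample the fitted slope only approximates log(3)/log(2); the radii should stay inside the range where the plot is linear.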

Observations:
Euclidean objects have integer fractal dimensions
- point: 0
- lines and smooth curves: 1
- smooth surfaces: 2
fractal dimension -> roughness of the periphery

Important properties
- fd = embedding dimension -> uniform point set
- a point set may have several fd, depending on scale

Problem #1: GIS points
Cross-roads of Montgomery county: any rules?

Solution #1
A: self-similarity -> fractals; scale-free; power laws (y = x^a, F = C*r^(-2))
avg#neighbors(<= r) ~ r^D, here with D = 1.51
[Plot: log(#pairs within <= r) vs log(r); slope 1.51]

Examples: MG county. Montgomery County of MD (road end-points)

Examples: LB county. Long Beach county of CA (road end-points)

Solution #2: spatial d.m.
Galaxies (‘BOPS’ plot [sigmod2000])
[Plot: log(#pairs(<= r)) vs log(r)]

Solution #2: spatial d.m.
[Plot: log(#pairs within <= r) vs log(r), for spi-spi, spi-ell and ell-ell pairs]
- slope
- plateau!
- repulsion!

spatial d.m.: heuristic on choosing # of clusters
[Figure: radii r1 and r2]

spatial d.m.
[Plot: log(#pairs within <= r) vs log(r), for spi-spi, spi-ell and ell-ell pairs]
- slope
- plateau!
- repulsion!!
- duplicates

Solution #3: traffic
Disk traces: self-similar
[Plot: #bytes vs time]

Solution #3: traffic
Disk traces (80-20 ‘law’ = ‘multifractal’)
[Plot: #bytes vs time, split 20% / 80%]
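One common way to generate a bursty series with an 80-20 flavor is a recursive split: give 80% of the volume to one half of the interval and 20% to the other, then repeat inside each half. The sketch below assumes that construction; the function name, the random choice of side and the parameter values are illustrative and are not claimed to be the model fitted to the HP trace.

```python
import random

def b_model(total, levels, bias=0.8, seed=0):
    """Split `total` recursively: each interval passes a `bias` share to one
    half and the remainder to the other, with the side chosen at random."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in series:
            hi, lo = volume * bias, volume * (1 - bias)
            nxt.extend((hi, lo) if rng.random() < 0.5 else (lo, hi))
        series = nxt
    return series

trace = b_model(1_000_000, 10)   # 2^10 time slots, 1,000,000 bytes in total
print(len(trace), max(trace), min(trace))
```

The total volume is preserved at every level, but a few slots end up carrying most of it, unlike a Poisson-like series whose slots all look alike.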

Solution #3: traffic
Clarification:
- fractal: a set of points that is self-similar
- multifractal: a probability density function that is self-similar
Many other time-sequences are bursty/clustered: (such as?)

Tape accesses
[Figure: accesses over time, Tape #1 ... Tape #N]
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise ...)

Tape accesses
[Plot: # tapes retrieved vs # qualifying records; curves: Poisson and real]

More tools
- Zipf’s law
- Korcak’s law / “fat fractals”

A famous power law: Zipf’s law
Q: vocabulary word frequency in a document: any pattern?
[Plot: freq. per vocabulary word, from “aaron” to “zoo”]

A famous power law: Zipf’s law
Bible: rank vs frequency (log-log)
[Plot: log(freq) vs log(rank); labeled points “a” and “the”]
Similarly in many other languages; for customers and sales volume; city populations, etc.

A famous power law: Zipf’s law
- Zipf distribution: freq = 1 / rank
- generalized Zipf: freq = 1 / (rank)^a
[Plot: log(freq) vs log(rank)]
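A rank-frequency plot and the exponent a of the generalized Zipf law can be sketched as follows. The token list is synthetic (the k-th word occurs about 1000/k times), and the function names are made up for this example.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, freq), ...] with rank 1 = most frequent token."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_exponent(rank_freq):
    """Fit log(freq) = c - a*log(rank) by least squares and return a."""
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = num / sum((x - mx) ** 2 for x in xs)
    return -slope

# Synthetic vocabulary: the k-th word occurs about 1000/k times.
tokens = [f"w{k}" for k in range(1, 51) for _ in range(round(1000 / k))]
a = zipf_exponent(rank_frequency(tokens))
print(round(a, 2))
```

On real text the fit is usually restricted to the part of the plot that looks linear, since the head and the tail of the ranking often bend away from the line.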

Olympic medals (Sydney)
[Plot: log(#medals) vs rank]

More power laws: areas (Korcak’s law)
Scandinavian lakes: any pattern?

More power laws: areas (Korcak’s law)
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs log(area)]
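The complementary cumulative count behind such a plot takes only a sort. In this sketch the "areas" are synthetic values built so that count(>= area) follows a power law with an arbitrarily chosen exponent B; they are not the lake data from the slide.

```python
import math

def ccdf(values):
    """Return [(v, number of elements >= v), ...] for each distinct v, ascending."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, n - i) for i, v in enumerate(ordered)
            if i == 0 or v != ordered[i - 1]]

B = 0.8                                                # assumed exponent
areas = [k ** (-1.0 / B) for k in range(1, 201)]       # k-th largest area
points = ccdf(areas)

xs = [math.log(area) for area, _ in points]
ys = [math.log(count) for _, count in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 3))
```

The log-log slope recovers -B here because the data were built that way; on real data the slope is read off the linear part of the plot.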

More power laws: Korcak
Japan islands: area vs cumulative count (log-log axes)
[Plot: log(count(>= area)) vs log(area)]

(Korcak’s law: Aegean islands)

Korcak’s law & “fat fractals”
Q: How to generate such regions?
A: recursively, from a single region

So far we’ve seen:
concepts:
- fractals, multifractals and fat fractals
tools:
- correlation integral (= pair-count plot)
- rank/frequency plot (Zipf’s law)
- CCDF (Korcak’s law)

Other applications: Internet
What does the Internet look like?
Internet routers: how many neighbors within h hops?
[Figure: Internet map, with CMU marked]

Internet topology
Internet routers: how many neighbors within h hops?
Reachability function: number of neighbors within r hops, vs r (log-log). Mbone routers, 1995.
[Plot: log(#pairs) vs log(hops); slope 2.8]
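The reachability function can be computed with one breadth-first search per node. The sketch assumes an undirected graph stored as an adjacency dict and counts ordered pairs (u, v), u != v; the tiny path graph is a made-up example, not router data.

```python
from collections import deque

def hop_plot(adj, max_hops):
    """pairs[h] = number of ordered node pairs (u, v), u != v, within h hops."""
    pairs = [0] * (max_hops + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue                  # do not expand beyond max_hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:                     # skip the source itself
                for h in range(d, max_hops + 1):
                    pairs[h] += 1
    return pairs

# Path graph 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reach = hop_plot(path, 3)
print(reach)
```

Plotting log(pairs[h]) against log(h) for h >= 1 gives the hop plot; the slope is meaningful only before the counts saturate at the graph diameter.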

More power laws on the Internet
Degree vs rank, for Internet domains (log-log) [sigcomm99]
[Plot: log(degree) vs log(rank); slope -0.82]

More power laws: Internet
pdf of degrees
[Plot: log(count) vs log(degree); slope -2.2]

Even more power laws on the Internet
Scree plot for Internet domains (log-log) [sigcomm99]
[Plot: log(i-th eigenvalue) vs log(i); slope label 0.47]

More apps: Brain scans
Oct-trees; brain scans
[Plot: log(#octants) vs octree levels; fd = 2.63]

More apps: Medical images
[Burdett et al, SPIE ‘93]:
- benign tumors: fd ~ 2.37
- malignant: fd ~ 2.56

More fractals:
- cardiovascular system: 3 (!)
- stock prices (LYCOS), random walks: 1.5 [Plots: 1 year; 2 years]
- coastlines: (Norway!)

More power laws
- duration of UNIX jobs
- energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Plots: amplitude vs day; log(freq) vs magnitude]

Even more power laws:
- publication counts (Lotka’s law)
- distribution of UNIX file sizes
- income distribution (Pareto’s law)
- web hit counts [Huberman]

Power laws, cont’d
- in- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]
- length of file transfers [Bestavros+]
- click-stream data (w/ A. Montgomery (CMU-GSIA) + MediaMetrix)

Settings for fractals:
Points; areas (-> fat fractals), e.g.:
- cities/stores/hospitals, over earth’s surface
- time-stamps of events (customer arrivals, packet losses, criminal actions) over time
- regions (sales areas, islands, patches of habitats) over space

Settings for fractals:
Customer feature vectors (age, income, frequency of visits, amount of sales per visit)
[Figure: ‘good’ customers vs ‘bad’ customers]

Some uses of fractals:
- Detect non-existence of rules (if points are uniform)
- Detect non-homogeneous regions (e.g., legal login time-stamps may have different fd than intruders’)
- Estimate number of neighbors / customers / competitors within a radius
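The last use follows directly from avg#neighbors(<= r) ~ r^fd: a count measured at one radius can be scaled to another. A minimal sketch, assuming both radii lie inside the range where the power law holds; the function name and the numbers are hypothetical.

```python
def estimate_neighbors(known_count, known_radius, new_radius, fd):
    """Scale a measured neighbor count to another radius, assuming count ~ r^fd."""
    return known_count * (new_radius / known_radius) ** fd

# Hypothetical: 10 neighbors on average within radius 1, point set with fd = 1.51
fractal_estimate = estimate_neighbors(10, 1.0, 2.0, 1.51)
uniform_estimate = estimate_neighbors(10, 1.0, 2.0, 2.0)   # uniform 2-d assumption
print(fractal_estimate, uniform_estimate)
```

Assuming uniformity (fd = 2) overestimates the count when the radius grows and the true fd is lower.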

Multi-Fractals
Setting: points or objects, w/ some value, e.g.:
- cities w/ populations
- positions on earth and amount of gold/water/oil underneath
- product ids and sales per product
- people and their salaries
- months and count of accidents

Use of multifractals:
Estimate tape/disk accesses
- how many of the 100 tapes contain my 50 phonecall records?
- how many days without an accident?
[Figure: accesses over time, Tape #1 ... Tape #N]

Use of multifractals
How often do we exceed the threshold?
[Plot: #bytes vs time, compared with Poisson]

Use of multifractals, cont’d
Extrapolations for/from samples
[Plot: #bytes vs time]

Use of multifractals, cont’d
How many distinct products account for 90% of the sales?
[Figure: 20% / 80% split]
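When per-product sales are available, the question can be answered directly by sorting; the multifractal model is what lets one extrapolate when they are not. A small sketch of the direct count, with made-up sales figures and a hypothetical function name:

```python
def products_for_share(sales, share=0.9):
    """Smallest number of top-selling products whose sales reach `share` of the total."""
    total = sum(sales)
    running = 0.0
    for k, amount in enumerate(sorted(sales, reverse=True), start=1):
        running += amount
        if running >= share * total:
            return k
    return len(sales)

sales = [60, 25, 10, 3, 2]            # hypothetical sales per product
needed = products_for_share(sales, 0.9)
print(needed)
```

With skewed sales, a small fraction of the products is enough to reach the target share.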

Conclusions
Real data often disobey textbook assumptions (Gaussian, Poisson, uniformity, independence)
- avoid ‘mean’; use median, or even better, use:
fractals, self-similarity, and power laws, to find patterns. Specifically:

Conclusions
- tool#1: (for points) ‘correlation integral’: (#pairs within <= r) vs (distance r)
- tool#2: (for categorical values) rank-frequency plot (à la Zipf)
- tool#3: (for numerical values) CCDF: complementary cumulative distribution function (# of elements with value >= a)

Practitioner’s guide:
tool#1: #pairs vs distance, for a set of objects, with a distance function (slope = intrinsic dimensionality)
[Plots: Internet, log(#pairs) vs log(hops), slope 2.8; MG county, log(#pairs(within <= r)) vs log(r), slope 1.51]

Practitioner’s guide:
tool#2: rank-frequency plot (for categorical attributes)
[Plots: Internet domains, log(degree) vs log(rank); Bible, log(freq) vs log(rank)]

Practitioner’s guide:
tool#3: CCDF, for (skewed) numerical attributes (e.g., areas of islands/lakes, UNIX jobs ...)
[Plot: Scandinavian lakes, log(count(>= area)) vs log(area)]

Resources: Software for fractal dimension –

Books
Strongly recommended intro book:
- Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991
Classic book on fractals:
- B. Mandelbrot, Fractal Geometry of Nature, W.H. Freeman, 1977

References
- [ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp. 1-15, Feb
- [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13
- [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension, Proc. of VLDB, p , 1995
- [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’, Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept
- [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept
- [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999
- [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree, International Conference on Data Engineering (ICDE), Sydney, Australia, March 23-26, 1999
- [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000

Appendix - Gory details
Bad news: there is more than one fractal dimension
- Minkowski fd; Hausdorff fd; Correlation fd; Information fd
Great news:
- they can all be computed fast!
- they usually have nearby values

Fast estimation of fd(s):
How, for the (correlation) fractal dimension?
A: Box-counting plot
[Figure: grid of side r over the points, p_i per cell; plot: log(sum(p_i^2)) vs log(r)]

Definitions
- p_i: the percentage (or count) of points in the i-th cell
- r: the side of the grid

Fast estimation of fd(s):
Compute sum(p_i^2) for another grid side, r’
[Figure: grid of side r’, p_i’ per cell; plot: log(sum(p_i^2)) vs log(r)]

Fast estimation of fd(s):
etc.; if the resulting plot has a linear part, its slope is the correlation fractal dimension D2
[Plot: log(sum(p_i^2)) vs log(r)]
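The box-counting recipe for D2 can be sketched in a few lines: one pass over the points per grid side, then a line fit. The two test sets (a straight line and a regular 2-d grid of points) and the grid sides are chosen here only so that the expected slopes, 1 and 2, are easy to check; the function names are illustrative.

```python
import math
from collections import Counter

def sum_squared_occupancy(points, r):
    """sum_i p_i^2, with p_i the fraction of points in the i-th cell of side r."""
    cells = Counter((math.floor(x / r), math.floor(y / r)) for x, y in points)
    n = len(points)
    return sum((count / n) ** 2 for count in cells.values())

def correlation_dimension(points, sides):
    """Slope of log(sum p_i^2) vs log(r), fitted over the given grid sides."""
    xs = [math.log(r) for r in sides]
    ys = [math.log(sum_squared_occupancy(points, r)) for r in sides]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

sides = [2.0 ** -k for k in range(1, 5)]                  # 1/2 ... 1/16
line = [((i + 0.5) / 4096, 0.3) for i in range(4096)]     # points on a segment
grid = [((i + 0.5) / 64, (j + 0.5) / 64)                  # points filling a square
        for i in range(64) for j in range(64)]
d2_line = correlation_dimension(line, sides)
d2_grid = correlation_dimension(grid, sides)
print(d2_line, d2_grid)
```

Each grid side costs one pass over the data, which is where the fast estimation comes from; on real data only the linear part of the plot should be fitted.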

Definitions (cont’d)
Many more fractal dimensions Dq (related to Renyi entropies).
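The formula on this slide did not survive in the transcript. For reference, the usual textbook definition of the generalized dimensions is given below, with p_i the fraction of points in the i-th cell of a grid of side r and the derivative taken over the range of r where the plot is linear. It is offered as the standard form, not as a copy of the slide.

```latex
D_q = \frac{1}{q-1}\,
      \frac{\partial \log \sum_i p_i^{\,q}}{\partial \log r}
      \quad (q \neq 1),
\qquad
D_1 = \frac{\partial \sum_i p_i \log p_i}{\partial \log r}
```

Setting q = 2 gives the slope of log(sum(p_i^2)) vs log(r), and q = 0 gives the negated slope of log(number of non-empty cells) vs log(r).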

Hausdorff or box-counting fd:
Box-counting plot: log(N(r)) vs log(r)
- r: grid side
- N(r): count of non-empty cells
(Hausdorff) fractal dimension D0: the negated slope of this plot, D0 = - d log(N(r)) / d log(r)

Definitions (cont’d)
Hausdorff fd:
[Figure: grid of side r; plot: log(#non-empty cells) vs log(r), slope labeled D0]
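Counting non-empty cells is even simpler than summing squared occupancies. The sketch below uses synthetic point sets (a segment and a filled square) and takes D0 as the negated slope of log(N(r)) vs log(r), since N(r) grows as r shrinks; the names and grid sides are illustrative.

```python
import math

def nonempty_cells(points, r):
    """N(r): number of grid cells of side r that contain at least one point."""
    return len({(math.floor(x / r), math.floor(y / r)) for x, y in points})

def box_counting_dimension(points, sides):
    """D0 estimate: negated slope of log(N(r)) vs log(r)."""
    xs = [math.log(r) for r in sides]
    ys = [math.log(nonempty_cells(points, r)) for r in sides]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -num / sum((x - mx) ** 2 for x in xs)

sides = [2.0 ** -k for k in range(1, 5)]                  # 1/2 ... 1/16
line = [((i + 0.5) / 4096, 0.3) for i in range(4096)]     # points on a segment
grid = [((i + 0.5) / 64, (j + 0.5) / 64)                  # points filling a square
        for i in range(64) for j in range(64)]
d0_line = box_counting_dimension(line, sides)
d0_grid = box_counting_dimension(grid, sides)
print(d0_line, d0_grid)
```

For very small r every point sits alone in its cell and N(r) flattens at the number of points, so the fit should stop before that plateau.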

Observations
- q=0: Hausdorff fractal dimension
- q=2: Correlation fractal dimension (identical to the exponent of the number of neighbors vs radius)
- q=1: Information fractal dimension

Observations, cont’d
In general, the Dq’s take similar, but not identical, values, except for perfectly self-similar point sets, where Dq = Dq’ for any q, q’.

Conclusions
- many fractal dimensions, with nearby values
- can be computed quickly (O(N) or O(N log(N)))
- (code: on the web)