Carnegie Mellon
Data Mining – Research Directions
C. Faloutsos, CMU

NSF-IDM, C. Faloutsos

Past
Data mining: ‘find rules / interesting patterns’
- ML -> decision trees, ANN, …
- DB -> association rules (A.R.), DataCubes, OLAP, clustering (BIRCH, BFR, …), decision trees
- Stat: SVD/PCA, …
Most of them: already in commercial products

Past
Often, (implicit) assumptions about:
- Gaussian distributions (e.g., clustering)
- Poisson arrivals (time series)
- uniformity / independence
Often inadequate, e.g.:

Problem #1: GIS - points
Road end-points of Montgomery county
Q: distribution?
- not uniform
- not Gaussian
- no rules??

Problem #2: Internet
Internet routers: how many neighbors within h hops?

Problem #3: traffic
Disk trace (from HP); Web traffic - fit a model
[Figure: #bytes vs. time; panel labeled ‘Poisson’]

Common answer:
Fractals / self-similarities / power laws
Seminal works by Hilbert, Minkowski, Cantor, Mandelbrot (Hausdorff, Lyapunov, Wilson, …)

What is a fractal?
= self-similar point set, e.g., the Sierpinski triangle
Important: intrinsic, or ‘fractal’, dimension = log(N)/log(r) = log(3)/log(2) = 1.58 (!)
(N = 3 copies, each scaled down by a factor r = 2)

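The log(N)/log(r) formula can be checked with a few lines of Python. This is only a sketch: the function name `similarity_dimension` and its parameter names are made up for illustration, and the formula applies to exactly self-similar sets only.

```python
import math

def similarity_dimension(n_copies, scale_factor):
    """log(N)/log(r) for a set made of n_copies pieces, each shrunk by scale_factor."""
    return math.log(n_copies) / math.log(scale_factor)

# Sierpinski triangle: 3 copies, each half the size of the whole.
sierpinski_fd = similarity_dimension(3, 2)
# Sanity checks: a segment is 2 half-size segments, a square is 4 half-size squares.
line_fd = similarity_dimension(2, 2)
square_fd = similarity_dimension(4, 2)

print(round(sierpinski_fd, 2), round(line_fd, 2), round(square_fd, 2))
```

The line and the square come out with their usual integer dimensions, while the Sierpinski triangle falls in between.
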
Intrinsic (‘fractal’) dimension
(nn(r): number of neighbors within distance r)
Q: fractal dimension of a line?
A: nn(r) ~ r^1
Q: fd of a plane?
A: nn(r) ~ r^2

Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

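A minimal sketch of how such a pair-count plot could be produced, assuming points sampled with the ‘chaos game’ (a standard way to draw the Sierpinski triangle, not something the slides specify). The function names, the sample size, and the radii are all illustrative choices; the slope is a rough estimate and depends on them.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.1, 0.1
    pts = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2   # jump halfway to a random corner
        if i >= 20:                         # discard a short burn-in
            pts.append((x, y))
    return pts

def pair_counts(pts, radii):
    """Number of point pairs within distance <= r, for each r in radii."""
    counts = [0] * len(radii)
    for i in range(len(pts)):
        xi, yi = pts[i]
        for j in range(i + 1, len(pts)):
            d = math.hypot(xi - pts[j][0], yi - pts[j][1])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

points = sierpinski_points(800)
radii = [0.02, 0.04, 0.08, 0.16]
slope = loglog_slope(radii, pair_counts(points, radii))
print(f"estimated slope of the pair-count plot: {slope:.2f}")
```

The all-pairs loop is quadratic in the number of points, which is fine for a demonstration but would need a spatial index or sampling on real datasets.
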
Problem #1: GIS points
Cross-roads of Montgomery county: any rules?

Solution #1
A: self-similarity -> fractals, scale-free, power laws (y = x^a, F = C*r^(-2))
avg #neighbors(<= r) ~ r^D
[Figure: log(#pairs within <= r) vs. log(r)]

Solution #2: Internet topology
Internet routers: how many neighbors within h hops?
Reachability function: number of neighbors within h hops, vs. h (log-log). Mbone routers, 1995.
[Figure: log(#pairs) vs. log(hops); slope 3.3]

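The reachability function can be sketched with a breadth-first search from every node. The code below is an illustration under stated assumptions: `hop_plot` is a hypothetical name, the graph is an adjacency dictionary, pairs are counted as ordered pairs excluding self-pairs, and the 8-node ring is a toy graph, not router data.

```python
from collections import deque

def hop_plot(adj, max_hops):
    """Return [P(1), ..., P(max_hops)]: ordered pairs (u, v), u != v, at most h hops apart."""
    exact = [0] * (max_hops + 1)            # exact[h]: pairs exactly h hops apart
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:         # do not expand beyond max_hops
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    exact[dist[v]] += 1
                    queue.append(v)
    cumulative, running = [], 0
    for h in range(1, max_hops + 1):
        running += exact[h]
        cumulative.append(running)
    return cumulative

# Toy graph (an assumption for illustration): a ring of 8 nodes.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
pairs_within = hop_plot(ring, 4)
print(pairs_within)
```

Plotting log(pairs_within[h - 1]) against log(h) and fitting a line gives the kind of slope shown on the slide.
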
Solution #3: traffic
Disk traces (Hurst exponent, variance plot, fractional Gaussian noise, multifractals)
[Figure: #bytes vs. time]

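One common way to read a Hurst exponent off a variance plot is the aggregated-variance method: average the series over blocks of size m, and fit the slope of log(variance of block means) against log(m); the slope is roughly 2H - 2. The sketch below assumes that method (the slides only name the plot), uses hypothetical function names, and runs on synthetic independent noise, for which H should be near 0.5.

```python
import math
import random

def block_means(series, m):
    """Means of consecutive, non-overlapping blocks of length m."""
    n_blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def hurst_from_variance_plot(series, block_sizes):
    """Estimate H from the slope of log(var of block means) vs. log(block size)."""
    log_m = [math.log(m) for m in block_sizes]
    log_v = [math.log(variance(block_means(series, m))) for m in block_sizes]
    mx, mv = sum(log_m) / len(log_m), sum(log_v) / len(log_v)
    slope = (sum((a - mx) * (b - mv) for a, b in zip(log_m, log_v))
             / sum((a - mx) ** 2 for a in log_m))
    return 1.0 + slope / 2.0                # slope = 2H - 2

# Synthetic input (not a real trace): independent Gaussian noise.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
hurst = hurst_from_variance_plot(noise, [1, 2, 4, 8, 16, 32, 64])
print(f"estimated Hurst exponent: {hurst:.2f}")
```

A self-similar trace would show a shallower slope than independent noise, that is, an estimate above 0.5.
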
More examples of fractals
Galaxies (Sloan Digital Sky Survey)

Brain scans
Oct-trees; brain scans
[Figure: log(#octants) vs. octree level; slope 2.63 = fd]

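Counting occupied octants per octree level amounts to box counting. A minimal sketch, assuming 3-D points in the unit cube and a regular grid of 2^level cells per side; the function names and the two synthetic test sets (a diagonal line and a flat sheet) are illustrative, not brain-scan data.

```python
import math

def occupied_cells(points, level):
    """Number of distinct grid cells (2**level per side) that contain a point."""
    side = 2 ** level
    return len({tuple(min(int(c * side), side - 1) for c in p) for p in points})

def box_counting_dimension(points, levels):
    """Least-squares slope of log2(#occupied cells) against the level."""
    xs = list(levels)
    ys = [math.log2(occupied_cells(points, lv)) for lv in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# Synthetic point sets with known dimension.
diagonal = [(i / 4096, i / 4096, i / 4096) for i in range(4096)]
sheet = [((i + 0.5) / 64, (j + 0.5) / 64, 0.5) for i in range(64) for j in range(64)]

line_dim = box_counting_dimension(diagonal, range(1, 7))
sheet_dim = box_counting_dimension(sheet, range(1, 7))
print(line_dim, sheet_dim)
```

The levels must stay coarse enough that every cell the set passes through actually contains a sample point; otherwise the count saturates and the slope flattens.
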
More fractals and power laws:
- coastlines: (Norway!)
- cardiovascular system: 3 (!)
- stock prices: 1.5

More power laws on the Internet
Degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank)]

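A degree-vs-rank exponent can be estimated by sorting the degrees in decreasing order and fitting a line on log-log axes. The sketch below uses a hypothetical function name and synthetic degrees that follow an exact power law with exponent -0.8; that value is an arbitrary choice for illustration, not a measurement.

```python
import math

def rank_exponent(degrees):
    """Least-squares slope of log(degree) vs. log(rank), rank 1 = largest degree."""
    ordered = sorted(degrees, reverse=True)
    lx = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ly = [math.log(d) for d in ordered]      # degrees must be positive
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic degrees: degree = 1000 * rank^(-0.8), so the fit should recover -0.8.
synthetic_degrees = [1000.0 * rank ** -0.8 for rank in range(1, 201)]
exponent = rank_exponent(synthetic_degrees)
print(f"fitted rank exponent: {exponent:.3f}")
```

On real data the points scatter around the line, so the correlation coefficient of the fit is worth reporting next to the slope.
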
More tools:
- ‘fat fractals’ -> islands, lakes, etc.
- multi-fractals: ‘law’ … (multi-fractal spectrum, Hoelder exponent, …)

More power laws: GIS areas
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]

More power laws: GIS areas
Japan islands: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]

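The quantity on the vertical axis, count(>= area), is simple to compute from a list of areas. A small sketch, with a hypothetical function name and made-up area values:

```python
from bisect import bisect_left

def count_at_least(areas):
    """Return (area, count(>= area)) for each distinct area, in ascending order."""
    ordered = sorted(areas)
    n = len(ordered)
    # bisect_left gives the number of values strictly smaller than a.
    return [(a, n - bisect_left(ordered, a)) for a in sorted(set(ordered))]

# Made-up areas, for illustration only.
example = count_at_least([5.0, 1.0, 2.0, 2.0])
print(example)
```

Taking logarithms of both columns gives the points of the log-log plot; a straight line there indicates a power law.
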
Multifractals – ‘law’
‘law’, recursively applied
- bias: p

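One reading of a ‘law’ applied recursively with bias p is a binomial multiplicative cascade: split an interval in two, give a fraction p of the mass to one half and 1 - p to the other, and repeat inside each half. The sketch below assumes that reading; the function name, the choice of always favoring the left half, and p = 0.8 are illustrative assumptions.

```python
def binomial_cascade(p, levels):
    """Masses of the 2**levels cells after splitting mass p / (1 - p) at every level."""
    masses = [1.0]
    for _ in range(levels):
        # Each cell passes a fraction p to its left half and 1 - p to its right half.
        masses = [part for m in masses for part in (m * p, m * (1.0 - p))]
    return masses

cells = binomial_cascade(0.8, 3)            # p = 0.8 is an arbitrary example value
print([round(m, 3) for m in cells])
```

Even after a few levels most of the mass sits in a handful of cells, which is the bursty behavior that a uniformity assumption misses.
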
Tape accesses
[Figure: accesses to Tape#1 … Tape#N over time; # tapes retrieved vs. # qualifying records, with the curve labeled ‘unif’]

More power laws
- distribution of file sizes (‘Zipf’s law’)
- income distribution (Pareto’s law)
- publication counts (Lotka’s law)
- length of articles in a newspaper (Zipf)
- web hit counts [Huberman]
- duration of UNIX jobs [Harchol-Balter]
- length of file transfers [Bestavros+]

Conclusions
Real datasets are very often self-similar:
- in geographic, medical, astrophysics, financial, … settings
- in network/web traffic; the Internet topology
Then, we could look for:
- fractal/intrinsic dimension
- power laws: y = x^a

Therefore:
Need to ‘borrow’ tools and scale them up, or to develop new data mining tools:
- tools from physics, math, graphics, …
- beyond Gaussian, Poisson, uniformity, independence
- beyond ‘mean’ and ‘variance’: slopes and exponents instead

Resource:
Manfred Schroeder, “Fractals, Chaos, Power Laws”, Freeman and Co., 1991