Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU www.cs.cmu.edu/~christos.

1 Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU www.cs.cmu.edu/~christos

2 Carnegie Mellon NSF-IDM2000 - C. Faloutsos
Past
Data mining: ‘find rules / interesting patterns’
- ML -> decision trees, ANN, …
- DB -> association rules (A.R.), DataCubes, OLAP, clustering (BIRCH, BFR, …), decision trees
- Stat: SVD/PCA, …
Most of them: already in commercial products

3 Past
Often, (implicit) assumptions about:
- Gaussian distributions (e.g., clustering)
- Poisson arrivals (time series)
- Uniformity/independence
Often inadequate, e.g.:

4 Problem #1: GIS - points
Road end-points of Montgomery county. Q: distribution?
- not uniform
- not Gaussian
- no rules??

5 Problem #2: Internet
Internet routers: how many neighbors within h hops?

6 Problem #3: traffic
Disk trace (from HP); Web traffic - fit a model
[Plot: #bytes vs. time; label: Poisson]

7 Common answer: Fractals / self-similarities / power laws
Seminal works from Hilbert, Minkowski, Cantor, Mandelbrot, (Hausdorff, Lyapunov, Wilson, …)

8 What is a fractal?
= self-similar point set, e.g., Sierpinski triangle
Important: intrinsic, or ‘fractal’, dimension = log(N)/log(r) = log(3)/log(2) = 1.58 (!)
(for the Sierpinski triangle: N = 3 copies, each scaled down by a factor r = 2)

9 Intrinsic (‘fractal’) dimension
nn(r): number of neighbors within distance r
Q: fractal dimension of a line? A: nn(r) ~ r^1
Q: fd of a plane? A: nn(r) ~ r^2

10 Sierpinski triangle
[Plot: log(#pairs within <= r) vs. log(r); slope 1.58]
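
A log-log pair-count plot like this one can be reproduced in a few lines. The sketch below is an illustration, not code from the talk: it samples the Sierpinski triangle with the chaos game (an assumed generator), counts the pairs within distance r for a handful of radii, and fits the slope by least squares. The function names, the sample size, and the radii are all assumptions chosen for the example.

```python
import bisect
import math
import random


def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the chaos game."""
    rng = random.Random(seed)
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    points = []
    for i in range(n + 20):  # the first 20 iterates are burn-in and are dropped
        vx, vy = rng.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2  # jump halfway to a random vertex
        if i >= 20:
            points.append((x, y))
    return points


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) versus log(r)."""
    dists = sorted(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )
    counts = [bisect.bisect_right(dists, r) for r in radii]
    return fit_slope([math.log(r) for r in radii], [math.log(c) for c in counts])


if __name__ == "__main__":
    pts = sierpinski_points(1200)
    radii = [0.01 * 2 ** k for k in range(5)]  # 0.01, 0.02, ..., 0.16
    print(round(correlation_dimension(pts, radii), 2))
```

With a finite sample the fitted slope should land near log(3)/log(2), not exactly on it; the all-pairs distance computation is quadratic, so this form only suits small samples.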

11 Problem #1: GIS points
Cross-roads of Montgomery county: any rules?

12 Solution #1
A: self-similarity -> fractals
- scale-free
- power laws (y = x^a, F = C*r^(-2))
- avg #neighbors(<= r) ~ r^D
[Plot: log(#pairs within <= r) vs. log(r)]

13 Solution #2: Internet topology
Internet routers: how many neighbors within h hops?
Reachability function: number of neighbors within h hops, vs. h (log-log). Mbone routers, 1995.
[Plot: log(#pairs) vs. log(hops); annotation: 3.3]
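
As an illustration of the reachability function, the sketch below counts node pairs within h hops using breadth-first search. It is not the code behind the Mbone plot: the adjacency-dict representation, the function names, and the choice to count ordered pairs while excluding self-pairs are assumptions made for the example.

```python
from collections import deque


def hop_counts(adjacency, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reachability(adjacency, max_hops):
    """Number of ordered node pairs (u != v) within h hops, for h = 1..max_hops."""
    per_distance = [0] * (max_hops + 1)
    for u in adjacency:
        for d in hop_counts(adjacency, u).values():
            if 0 < d <= max_hops:
                per_distance[d] += 1
    # Turn the per-distance counts into cumulative counts.
    total, result = 0, []
    for h in range(1, max_hops + 1):
        total += per_distance[h]
        result.append(total)
    return result


# Toy example: a path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(reachability(path, 3))  # [6, 10, 12]
```

Plotting the log of these counts against log(h) gives the kind of hop plot described on the slide; one BFS per node is fine for small graphs but costly for large ones.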

14 Solution #3: traffic
Disk traces (Hurst exponent, variance plot, fractional Gaussian noise, multifractals)
[Plot: #bytes vs. time]
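
One common way to read a Hurst exponent off a variance plot is the aggregated-variance method: average the series over blocks of size m, and use the fact that the variance of the block means scales roughly as m^(2H-2). The sketch below assumes that estimator; the slides name the tools but do not say which procedure was used, and the function names and block sizes are hypothetical.

```python
import math
import random


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def block_means(series, m):
    """Means of consecutive, non-overlapping blocks of length m."""
    n_blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def hurst_variance_plot(series, block_sizes):
    """Estimate H from the slope of log(variance of block means) vs. log(m)."""
    xs = [math.log(m) for m in block_sizes]
    ys = [math.log(variance(block_means(series, m))) for m in block_sizes]
    return 1 + fit_slope(xs, ys) / 2  # slope = 2H - 2


# Independent noise has no long-range dependence, so H should be near 0.5.
rng = random.Random(1)
noise = [rng.gauss(0, 1) for _ in range(8192)]
print(round(hurst_variance_plot(noise, [1, 2, 4, 8, 16, 32, 64]), 2))
```

Self-similar traffic would give an estimate noticeably above 0.5; the independent-noise input here only serves as a baseline check.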

15 More examples of fractals
Galaxies (Sloan Digital Sky Survey)

16 Brain scans
Oct-trees; brain scans
[Plot: log(#octants) vs. octree level; slope 2.63 = fd]
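
The octree plot amounts to box counting: at each octree level, count the octants that contain data, then take the slope of log(#octants) against the log of the resolution. The sketch below illustrates this on synthetic points; it assumes points in the unit cube, and the function names are hypothetical. It is not the procedure used on the brain-scan data.

```python
import math


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def occupied_octants(points, level):
    """Count octree cells at `level` that hold at least one point.

    Assumes every point lies in the unit cube [0, 1)^3.
    """
    cells_per_axis = 2 ** level
    return len({
        (int(x * cells_per_axis), int(y * cells_per_axis), int(z * cells_per_axis))
        for x, y, z in points
    })


def box_counting_dimension(points, levels):
    """Slope of log(#occupied octants) vs. log(cells per axis)."""
    xs = [level * math.log(2) for level in levels]
    ys = [math.log(occupied_octants(points, level)) for level in levels]
    return fit_slope(xs, ys)


# Points along the cube diagonal form a line, so the dimension should be 1.
line = [((i + 0.5) / 4096,) * 3 for i in range(4096)]
print(round(box_counting_dimension(line, range(1, 7)), 2))  # 1.0
```

The levels must stay coarse enough that most occupied cells still hold several points; past that resolution the count saturates at the number of points and the slope flattens.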

17 More fractals and power laws
- Coastlines: 1.2-1.58 (Norway!)
- Cardiovascular system: 3 (!)
- Stock prices: 1.5

18 More power laws on the Internet
Degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Plot: log(degree) vs. log(rank)]

19 More tools
- ‘Fat fractals’ -> islands, lakes, etc.
- Multi-fractals: 80-20 ‘law’ … (multi-fractal spectrum, Hoelder exponent, …)

20 More power laws: GIS areas
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]

21 More power laws: GIS areas
Japan islands: area vs. complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]
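
The quantity on the vertical axis, count(>= area), is straightforward to compute from a list of areas. A minimal sketch, with a hypothetical function name and toy input in place of the lake and island datasets:

```python
from collections import Counter


def complementary_cumulative_count(areas):
    """Return (area, number of items with size >= area) for each distinct area."""
    freq = Counter(areas)
    remaining = len(areas)
    result = []
    for area in sorted(freq):
        result.append((area, remaining))
        remaining -= freq[area]  # items of this size drop out for larger areas
    return result


print(complementary_cumulative_count([1, 2, 2, 5]))  # [(1, 4), (2, 3), (5, 1)]
```

Taking the log of both columns and plotting them gives the log-log view used on these slides; a roughly straight line suggests a power law.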

22 Multifractals – 80-20 law
80-20 ‘law’, recursively applied - bias: p
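
A recursively applied 80-20 split can be sketched as follows: start with the total volume in one interval, split it into fractions p and 1 - p over the two halves, and repeat inside each half. The code below is an illustration under assumptions: the function name is hypothetical, and placing the heavier share on a random side is a choice made here, not something the slide specifies.

```python
import random


def recursive_split(total, p, levels, seed=0):
    """Spread `total` over 2**levels slots by repeated p / (1 - p) splits."""
    rng = random.Random(seed)
    slots = [float(total)]
    for _ in range(levels):
        next_slots = []
        for amount in slots:
            big, small = amount * p, amount * (1 - p)
            # Assumption: the heavier share lands on a random side.
            next_slots.extend((big, small) if rng.random() < 0.5 else (small, big))
        slots = next_slots
    return slots


trace = recursive_split(total=1000.0, p=0.8, levels=10)
print(len(trace), round(sum(trace), 6), round(max(trace), 3))
```

The total is preserved at every level, while the largest slot holds total * p**levels, which is what makes the resulting trace bursty.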

23 Tape accesses
[Figure: tapes Tape#1, Tape#N along a time axis. Plot: # tapes retrieved vs. # qual. records; label: unif]

24 More power laws
- Distribution of file sizes (‘Zipf’s law’)
- Income distribution (Pareto’s law)
- Publication counts (Lotka’s law)
- Length of articles in a newspaper (Zipf)
- Web hit counts [Huberman]
- Duration of UNIX jobs [Harchol-Balter]
- Length of file transfers [Bestavros+]

25 Conclusions
Real datasets are very often self-similar, in:
- geographic, medical, astrophysics, financial … settings
- network/web traffic; the Internet topology
Then, we could look for:
- fractal/intrinsic dimension
- power laws: y = x^a

26 Therefore
Need to ‘borrow’ tools + scale them up, or to develop new data mining tools:
- tools from physics, math, graphics, …
- beyond Gaussian, Poisson, uniformity, independence
- beyond ‘mean’ and ‘variance’: slopes and exponents instead

27 Resource
Manfred Schroeder, “Fractals, Chaos, Power Laws”, Freeman and Co., 1991

