Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU www.cs.cmu.edu/~christos.

Presentation transcript:

Carnegie Mellon Data Mining – Research Directions C. Faloutsos CMU

Carnegie Mellon NSF-IDM C. Faloutsos
Past
Data mining: ‘find rules / interesting patterns’
- ML -> decision trees, ANN, …
- DB -> association rules (A.R.), DataCubes, OLAP, clustering (BIRCH, BFR, …), decision trees
- Stat: SVD/PCA, …
Most of them: already in commercial products

Past
Often, (implicit) assumptions about
- Gaussian distributions (e.g., clustering)
- Poisson arrivals (time series)
- Uniformity/independence
Often inadequate, e.g.:

Problem #1: GIS - points
Road end-points of Montgomery county. Q: distribution? Not uniform, not Gaussian; no rules??

Problem #2: Internet
Internet routers: how many neighbors within h hops?

Problem #3: traffic
Disk trace (from HP); Web traffic: fit a model.
[Plot: #bytes vs. time, compared with Poisson]

Common answer: Fractals / self-similarities / power laws
Seminal works from Hilbert, Minkowski, Cantor, Mandelbrot, (Hausdorff, Lyapunov, Wilson, …)

What is a fractal?
= self-similar point set, e.g., Sierpinski triangle.
Important: intrinsic, or ‘fractal’, dimension = log(N)/log(r) = log(3)/log(2) = 1.58 (!), since the triangle consists of N = 3 copies of itself, each scaled down by a factor of r = 2.

Intrinsic (‘fractal’) dimension
(nn(r): number of neighbors within distance r)
Q: fractal dimension of a line? A: nn(r) ~ r^1
Q: fd of a plane? A: nn(r) ~ r^2

Sierpinski triangle
[Plot: log(#pairs within <= r) vs. log(r); slope 1.58]
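
The pair-count plot can be reproduced numerically. The following is a minimal sketch, not the author's code: it assumes NumPy, generates Sierpinski points with the chaos game (a standard construction that the slides do not mention), and fits the slope of log(#pairs within <= r) against log(r). The function names, sample size and radii are illustrative choices.

```python
import numpy as np

def sierpinski_points(n, seed=0):
    """Chaos game: repeatedly jump halfway toward a randomly chosen corner."""
    rng = np.random.default_rng(seed)
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.25, 0.25])  # arbitrary start inside the triangle
    pts = np.empty((n, 2))
    for i in range(n):
        p = (p + corners[rng.integers(3)]) / 2.0
        pts[i] = p
    return pts

def correlation_dimension(pts, radii):
    """Slope of log(#pairs within distance <= r) versus log(r)."""
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(pts), k=1)]  # each unordered pair once
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

radii = np.logspace(-2, -0.7, 8)  # illustrative range of radii
line = np.column_stack([np.linspace(0, 1, 1500), np.zeros(1500)])
print(correlation_dimension(line, radii))                     # close to 1
print(correlation_dimension(sierpinski_points(1500), radii))  # near log(3)/log(2)
```

The estimate depends on the sample size and on the range of radii, so it should only be expected to land near 1.58, not exactly on it.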

Problem #1: GIS points
Cross-roads of Montgomery county: any rules?

Solution #1
A: self-similarity -> fractals, scale-free, power laws (y = x^a, F = C*r^(-2))
avg #neighbors(<= r) ~ r^D
[Plot: log(#pairs within <= r) vs. log(r)]

Solution #2: Internet topology
Internet routers: how many neighbors within h hops?
Reachability function: number of neighbors within r hops, vs. r (log-log). Mbone routers, 1995.
[Plot: log(#pairs) vs. log(hops); labeled 3.3]
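
The reachability function can be computed with a breadth-first search from every node. This is a minimal sketch under stated assumptions: the graph is an adjacency dictionary, pairs are counted as ordered pairs (u, v) with u != v, and the toy path graph is hypothetical, not the Mbone data.

```python
from collections import deque

def neighbors_within(adj, source, max_hops):
    """BFS from source; entry h-1 is the number of nodes within <= h hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [sum(1 for d in dist.values() if 1 <= d <= h)
            for h in range(1, max_hops + 1)]

def hop_plot(adj, max_hops):
    """Total number of ordered pairs within h hops, for h = 1..max_hops."""
    totals = [0] * max_hops
    for u in adj:
        for i, count in enumerate(neighbors_within(adj, u, max_hops)):
            totals[i] += count
    return totals

# Hypothetical toy graph: the path 0-1-2-3-4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(hop_plot(path, 4))  # [8, 14, 18, 20]
```

Plotting the logarithm of these totals against log(hops) gives the log-log curve whose slope the slide reports.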

Solution #3: traffic
Disk traces (Hurst exponent, variance plot, fractional Gaussian noise, multifractals)
[Plot: #bytes vs. time]
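
A variance plot can be turned into a Hurst exponent estimate as follows. The sketch assumes the standard aggregated-variance method (the variance of block means of size m scales as m^(2H - 2)); the slides name the plot but do not spell out the procedure, and the synthetic Poisson trace is a stand-in for real disk or Web traffic.

```python
import numpy as np

def hurst_variance_plot(x, block_sizes):
    """Aggregated-variance estimate: Var(block means of size m) ~ m^(2H - 2)."""
    x = np.asarray(x, dtype=float)
    variances = []
    for m in block_sizes:
        n_blocks = len(x) // m
        means = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        variances.append(means.var())
    slope, _ = np.polyfit(np.log(block_sizes), np.log(variances), 1)
    return 1.0 + slope / 2.0

# Synthetic, independent Poisson counts (assumption: stands in for a trace)
rng = np.random.default_rng(0)
iid = rng.poisson(lam=10, size=2**16)
blocks = [2**k for k in range(1, 9)]
print(hurst_variance_plot(iid, blocks))  # near 0.5 for independent arrivals
```

Independent arrivals give H near 0.5; self-similar, bursty traffic would be expected to give a larger value, which is the contrast with the Poisson model that the slides draw.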

More examples of fractals
Galaxies (Sloan Digital Sky Survey)

Brain scans
Oct-trees; brain scans
[Plot: log(#octants) vs. octree levels; 2.63 = fd]
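
Counting occupied octants per octree level is a form of box counting. The sketch below assumes points normalized to the unit cube [0, 1)^3 and a cell side of 2^(-level); the diagonal test set is a hypothetical example with a known dimension of 1, not brain-scan data.

```python
import numpy as np

def box_counting_dimension(pts, levels):
    """Slope of log2(#occupied cells) versus grid level (cell side = 2**-level)."""
    pts = np.asarray(pts, dtype=float)
    counts = []
    for level in levels:
        cells = np.floor(pts * 2**level).astype(np.int64)  # cell index per point
        counts.append(len(np.unique(cells, axis=0)))       # occupied cells
    slope, _ = np.polyfit(list(levels), np.log2(counts), 1)
    return slope

# Hypothetical test set: points along the main diagonal of the unit cube
t = np.linspace(0, 1, 5000, endpoint=False)
diagonal = np.column_stack([t, t, t])
print(box_counting_dimension(diagonal, range(1, 7)))  # approximately 1
```

Using base-2 logarithms against the level number makes the fitted slope the dimension directly, because each level halves the cell side.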

More fractals and power laws:
- Coastlines: (Norway!)
- Cardiovascular system: 3 (!)
- Stock prices: 1.5

More power laws on the Internet
Degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Plot: log(degree) vs. log(rank)]

More tools:
‘fat fractals’ -> islands, lakes etc.
Multi-fractals: ‘law’ … (multi-fractal spectrum, Hoelder exponent…)

More power laws: GIS areas
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]

More power laws: GIS areas
Japan islands: area vs. complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]
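
The count(>= area) curves for lakes and islands can be sketched as below. The function name and the synthetic areas are assumptions: the areas are constructed so that count(>= v) follows an exact power law with exponent 1.5, and they are not the Scandinavian or Japanese data.

```python
import numpy as np

def ccdf_slope(values):
    """Fit log(count(>= v)) against log(v); returns the log-log slope."""
    v = np.sort(np.asarray(values, dtype=float))
    counts = len(v) - np.arange(len(v))  # items >= v[i], assuming no ties
    slope, _ = np.polyfit(np.log(v), np.log(counts), 1)
    return slope

# Synthetic areas (assumption): count(>= v) = N * v^(-1.5) by construction
n = 1000
areas = (np.arange(1, n + 1) / n) ** (-1 / 1.5)
print(ccdf_slope(areas))  # approximately -1.5
```

A straight line on these log-log axes is the signature of a power law; on real data the fit is usually restricted to the range where the plot is visibly linear.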

Multifractals – law
‘law’, recursively applied - bias: p
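
One way to read "recursively applied, with bias p" is a binary multiplicative cascade: split the total into fractions p and 1 - p, then repeat on each half. The sketch below is an illustration under that assumption. The default p = 0.8 is a hypothetical choice, since the transcript does not preserve the value, and the function name is invented for the example.

```python
import numpy as np

def biased_cascade(levels, p=0.8, total=1.0):
    """Split the total into fractions p and 1 - p, then recurse on each half."""
    mass = np.array([total])
    for _ in range(levels):
        # each cell becomes two adjacent cells holding p and (1 - p) of its mass
        mass = np.column_stack([p * mass, (1 - p) * mass]).ravel()
    return mass

trace = biased_cascade(10)   # 2**10 cells, e.g. bytes per time tick
print(len(trace), trace.sum(), trace.max())
```

The total is conserved at every level while the largest cell keeps a fraction p^levels of it, which produces burstiness at every scale.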

Tape accesses
[Figure labels: time, Tape#1, Tape#N, # tapes retrieved, # qual. records, unif]

More power laws
- Distribution of file sizes (‘Zipf’s law’)
- Income distribution (Pareto’s law)
- Publication counts (Lotka’s law)
- Length of articles in a newspaper (Zipf)
- Web hit counts [Huberman]
- Duration of UNIX jobs [Harchol-Balter]
- Length of file transfers [Bestavros+]

Conclusions
Real datasets: very often, self-similar, in
– geographic, medical, astrophysics, financial … settings;
– network/web traffic; the internet topology
Then, we could look for
– fractal/intrinsic dimension
– power laws: y = x^a

Therefore:
Need to ‘borrow’ tools + scale them up, or to develop new data mining tools
– Tools from physics, math, graphics, …
– Beyond Gaussian, Poisson, uniformity, independence
– Beyond ‘mean’ and ‘variance’: slopes and exponents instead.

Resource:
Manfred Schroeder, “Fractals, Chaos, Power Laws”, Freeman and Co., 1991