15-826: Multimedia Databases and Data Mining


15-826: Multimedia Databases and Data Mining Lecture #8: Fractals - introduction C. Faloutsos

Must-read Material
Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, Proc. ACM SIGACT-SIGMOD-SIGART PODS, May 1994, pp. 4-13, Minneapolis, MN.
15-826 Copyright: C. Faloutsos (2017)

Recommended Material
optional, but very useful: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991
- Chapter 10: boxcounting method
- Chapter 1: Sierpinski triangle

Outline
Goal: ‘Find similar / interesting things’
- Intro to DB
- Indexing - similarity search
- Data Mining

Indexing - Detailed outline
- primary key indexing
- secondary key / multi-key indexing
- spatial access methods
  - z-ordering
  - R-trees
  - misc
- fractals
  - intro
  - applications
- text

Intro to fractals - outline
- Motivation – 3 problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples and tools
- Discussion - putting fractals to work!
- Conclusions – practitioner’s guide
- Appendix: gory details - boxcounting plots

Problem #1: GIS - points
Road end-points of Montgomery county:
- Q1: how many disk accesses (d.a.) for an R-tree?
- Q2: distribution? Not uniform, not Gaussian; no rules??

Problem #2 - spatial d.m.
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘spiral’ and ‘elliptical’ galaxies (stores and households ...)
- patterns?
- attraction/repulsion?
- how many ‘spi’ within r from an ‘ell’?

Problem #3: traffic
disk trace (from HP - J. Wilkes); Web traffic - fit a model
- how many explosions to expect?
- queue length distr.?
[Plot: # bytes vs. time]

Problem #3: traffic
[Plot: # bytes vs. time, labeled ‘Poisson’ (indep., ident. distr.)]
Q: Then, how to generate such bursty traffic?

Common answer: Fractals / self-similarities / power laws
Seminal works from Hilbert, Minkowski, Cantor, Mandelbrot, (Hausdorff, Lyapunov, Ken Wilson, …)

Road map
- Motivation – 3 problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples and tools
- Discussion - putting fractals to work!
- Conclusions – practitioner’s guide
- Appendix: gory details - boxcounting plots

What is a fractal?
= self-similar point set, e.g., Sierpinski triangle: ... zero area; infinite length!
Dimensionality??

Definitions (cont’d)
- Paradox: infinite perimeter; zero area!
- ‘dimensionality’: between 1 and 2
- actually: log(3)/log(2) = 1.58...

Dfn of fd: ONLY for a perfectly self-similar point set:
fd = log(n)/log(f) = log(3)/log(2) = 1.58
where the set is made of n copies of itself, each scaled down by a factor f (Sierpinski triangle: n = 3, f = 2; zero area, infinite length!)

Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: 1 (= log(2)/log(2)!)
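As a worked check of the log(n)/log(f) definition, here is a short sketch. It assumes n denotes the number of self-similar copies and f the factor by which each copy is scaled down; the filled-square case is added only for illustration.

```latex
\begin{align*}
\text{line segment } (n = 2,\ f = 2):&\quad \frac{\log 2}{\log 2} = 1 \\
\text{filled square } (n = 4,\ f = 2):&\quad \frac{\log 4}{\log 2} = 2 \\
\text{Sierpinski triangle } (n = 3,\ f = 2):&\quad \frac{\log 3}{\log 2} \approx 1.585
\end{align*}
```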

Intrinsic (‘fractal’) dimension
Q: dfn for a given set of points?
[Table: x and y coordinates of the sample points]

Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn(<= r) ~ r^1 (‘power law’: y = x^a), where nn(<= r) is the number of neighbors within distance r
Q: fd of a plane?
A: nn(<= r) ~ r^2
fd == slope of (log(nn) vs. log(r))

Intrinsic (‘fractal’) dimension: explanations
Local fractal dimension of point ‘P’?
A: nn_P(<= r) ~ r^1
If this equation holds for several values of r, then the exponent is the local fractal dimension of point P: local fd = exp = 1

Intrinsic (‘fractal’) dimension: explanations
Local fractal dimension of point ‘P’?
A: nn_P(<= r) ~ r^1
If this is true for all points of the cloud, then the exponent is the global f.d., or simply the f.d.

Intrinsic (‘fractal’) dimension: explanations
Global fractal dimension?
A: if sum over all points P of [ nn_P(<= r) ] ~ r^1, then the exponent (here, 1) is the global f.d., or simply the f.d.

Intrinsic (‘fractal’) dimension: explanations
Algorithm, to estimate it?
Notice: the sum over all points P of [ nn_P(<= r) ] is exactly tot#pairs(<= r), including ‘mirror’ pairs

Sierpinski triangle: the ‘correlation integral’
[Plot: log(#pairs within <= r) vs. log(r); slope 1.58]
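The pair-count idea can be tried out with a minimal sketch in plain Python. The assumptions here are not from the slides: the points are generated with the "chaos game", the radii are chosen by hand, and the helper names (`sierpinski_points`, `pair_counts`, `loglog_slope`) are hypothetical. It counts unordered pairs, which halves the totals but leaves the log-log slope unchanged.

```python
import math
import random


def sierpinski_points(n, seed=0):
    """Generate n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    points = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        if i >= 20:  # skip the first few iterations (burn-in)
            points.append((x, y))
    return points


def pair_counts(points, radii):
    """For each r in radii, count unordered pairs at distance <= r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


points = sierpinski_points(1000)
radii = [0.01 * 2 ** k for k in range(5)]  # 0.01, 0.02, ..., 0.16
slope = loglog_slope(radii, pair_counts(points, radii))
print(f"estimated fractal dimension (slope): {slope:.2f}")
```

This brute-force version is quadratic in the number of points, so it only suits small samples; the estimate should land near log(3)/log(2), with some noise from the finite sample.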

Observations
Euclidean objects have integer fractal dimensions:
- point: 0
- lines and smooth curves: 1
- smooth surfaces: 2
fractal dimension -> roughness of the periphery

Important properties
- fd = embedding dimension -> uniform point set
- a point set may have several fd, depending on scale
[Figure panels labeled: 2-d, 1-d, 0-d]

Road map
- Motivation – 3 problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples and tools
- Discussion - putting fractals to work!
- Conclusions – practitioner’s guide
- Appendix: gory details - boxcounting plots

Problem #1: GIS points
Cross-roads of Montgomery county: any rules?

Solution #1
A: self-similarity <=> fractals <=> scale-free <=> power-laws (y = x^a, F = C*r^(-2))
avg#neighbors(<= r) ~ r^D, with D = 1.51 here
[Plot: log(#pairs(within <= r)) vs. log(r); slope 1.51]

Examples: MG county
Montgomery County of MD (road end-points)

Examples: LB county
Long Beach county of CA (road end-points)

Solution #2: spatial d.m.
Galaxies (‘BOPS’ plot - [sigmod2000])
[Plot: log(#pairs) vs. log(r)]

Solution #2: spatial d.m.
[Plot: log(#pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!

Spatial d.m.
Heuristic on choosing # of clusters
[Figure: radii r1 and r2 marked]

Spatial d.m.
[Plot: log(#pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates

Solution #3: traffic
disk traces: self-similar
[Plot: #bytes vs. time]

Solution #3: traffic
disk traces (80-20 ‘law’ = ‘multifractal’)
[Plot: #bytes vs. time, split 20% / 80%]

80-20 / multifractals
[Figure: recursive 20 / 80 split]
- p ; (1-p) in general
- yes, there are dependencies
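One way to generate such bursty traffic is to apply the p ; (1-p) split recursively. The sketch below is illustrative only: the function name `b_model`, the total volume, and the rule that the larger share goes to a randomly chosen half are all assumptions, not details taken from the slides.

```python
import random


def b_model(total, levels, p=0.8, seed=0):
    """Recursively split `total` over 2**levels time slots.

    At each level every interval is halved; one half receives a
    fraction p of its volume and the other half receives (1 - p).
    """
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        nxt = []
        for v in series:
            big, small = v * p, v * (1 - p)
            # assumption: the half that gets the larger share is random
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        series = nxt
    return series


traffic = b_model(total=1_000_000, levels=10)
print(len(traffic), "slots; busiest slot holds", round(max(traffic)), "bytes")
```

With p = 0.5 every slot receives the same volume; the further p moves from 0.5, the more of the total concentrates in a few slots.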

More on 80/20: PQRS
Part of ‘self-* storage’ project [Wang+’02]
[Figure: cylinder# vs. time, split recursively into quadrants p, q, r, s]

Solution #3: traffic
Clarification:
- fractal: a set of points that is self-similar
- multifractal: a probability density function that is self-similar
Many other time-sequences are bursty/clustered: (such as?)

Example: network traffic
http://repository.cs.vt.edu/lbl-conn-7.tar.Z

Web traffic [Crovella Bestavros, SIGMETRICS’96]
[Plots at four time scales: 1000 sec; 100 sec; 10 sec; 1 sec]

Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: accesses over time, Tape #1 ... Tape #N]

Tape accesses
[Figure: accesses over time, Tape #1 ... Tape #N]
[Plot: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]

Road map
- Motivation – 3 problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More tools and examples
- Discussion - putting fractals to work!
- Conclusions – practitioner’s guide
- Appendix: gory details - boxcounting plots

A counter-intuitive example
- avg degree is, say, 3.3
- pick a node at random – guess its degree, exactly (-> “mode”)
- A: 1!!
[Plot: count vs. degree; avg: 3.3]

A counter-intuitive example
- avg degree is, say, 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!
- A’: very skewed distr.
- Corollary: the mean is meaningless! (and std -> infinity (!))
[Plot: count vs. degree; avg: 3.3]
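The gap between the mean and the most likely value can be reproduced with a small sketch. The degree sequence below is made up for illustration (count of nodes with degree d proportional to 1/d^2, for d = 1..100); it is not the data behind the slide.

```python
from statistics import mean, mode

# Hypothetical, heavily skewed degree sequence:
# round(10000 / d**2) nodes have degree d, for d = 1..100.
degrees = []
for d in range(1, 101):
    degrees.extend([d] * round(10_000 / d ** 2))

print("most common degree (mode):", mode(degrees))
print("average degree (mean):", round(mean(degrees), 2))
```

Most nodes have degree 1, yet the few high-degree nodes pull the average well above that, so the mean says little about a node picked at random.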

Rank exponent R
Power law in the degree distribution [SIGCOMM99]
[Plot: internet domains, log(degree) vs. log(rank); slope -0.82; labeled points: att.com, ibm.com]

More tools
- Zipf’s law
- Korcak’s law / “fat fractals”

A famous power law: Zipf’s law
Q: vocabulary word frequency in a document - any pattern?
[Plot: freq. per vocabulary word, from ‘aaron’ to ‘zoo’]

A famous power law: Zipf’s law
Bible - rank vs. frequency (log-log)
[Plot: log(freq) vs. log(rank); labeled points: “the”, “a”]

A famous power law: Zipf’s law
Bible - rank vs. frequency (log-log)
similarly, in many other languages; for customers and sales volume; city populations, etc.
[Plot: log(freq) vs. log(rank)]

A famous power law: Zipf’s law
- Zipf distr: freq ~ 1 / rank
- generalized Zipf: freq ~ 1 / (rank)^a
[Plot: log(freq) vs. log(rank)]
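The exponent a of the generalized Zipf form can be estimated as the negated slope of the rank-frequency plot on log-log axes. The following is a minimal sketch: the function name `zipf_exponent` is hypothetical, and the corpus is synthetic, built so that the word of rank k appears about 1200 / k times.

```python
import math
from collections import Counter


def zipf_exponent(words):
    """Fit freq ~ 1 / rank^a by least squares on log-log axes; return a."""
    freqs = sorted(Counter(words).values(), reverse=True)
    lx = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ly = [math.log(f) for f in freqs]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    den = sum((x - mx) ** 2 for x in lx)
    return -num / den  # the slope is negative; a is its magnitude


# Synthetic corpus: the word of rank k occurs round(1200 / k) times.
corpus = []
for k in range(1, 51):
    corpus.extend([f"w{k}"] * round(1200 / k))

a = zipf_exponent(corpus)
print(f"estimated Zipf exponent a: {a:.2f}")
```

On real text the fit is usually restricted to a range of ranks, since the rarest words (frequency 1 or 2) form flat steps on the plot.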

Olympic medals (Sydney)
[Plot: log(#medals) vs. rank]

Olympic medals (Sydney’00, Athens’04)
[Plot: log(#medals) vs. log(rank)]

TELCO data
[Plot: count of customers vs. # of service units; ‘best customer’ marked]

SALES data – store#96
[Plot: count of products vs. # units sold; “aspirin” marked]

More power laws: areas – Korcak’s law
Scandinavian lakes: any pattern?

More power laws: areas – Korcak’s law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]
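The complementary cumulative count used in this plot is simple to compute. Here is a minimal sketch; the function name `ccdf` and the sample areas are made up for illustration.

```python
def ccdf(values):
    """Return (value, number of items >= value) pairs, sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    pairs = []
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:  # first occurrence of this value
            pairs.append((v, n - i))       # n - i items are >= v
    return pairs


areas = [1, 1, 2, 3, 5, 8, 8, 20]  # made-up lake areas
for area, count in ccdf(areas):
    print(area, count)
```

Plotting log(count) against log(area) for these pairs gives the kind of plot described above; a straight line there indicates a power law.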

More power laws: Korcak
Japan islands: area vs. cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]

(Korcak’s law: Aegean islands)

Korcak’s law & “fat fractals”
Q: How to generate such regions?
A: recursively, from a single region ...

so far we’ve seen:
- concepts: fractals, multifractals and fat fractals
- tools:
  - correlation integral (= pair-count plot)
  - rank/frequency plot (Zipf’s law)
  - CCDF (Korcak’s law)
  (the rank/frequency plot and the CCDF carry the same info)

Next:
- More examples / applications
- Practitioner’s guide
- Box-counting: fast estimation of correlation integral