Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU
U. of Alberta, 2001, C. Faloutsos
Outline (goal: ‘Find similar / interesting things’)
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
Problem
- Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
  - allow fast, approximate queries, and
  - find rules/patterns
Sample queries
- Similarity search
  - Find pairs of branches with similar sales patterns
  - Find medical cases similar to Smith's
  - Find pairs of sensor series that move in sync
Sample queries (cont’d)
- Rule discovery
  - Clusters (of patients; of customers; ...)
  - Forecasting (total sales for next year?)
  - Outliers (e.g., fraud detection)
Outline (goal: ‘Find similar / interesting things’)
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
Indexing - Multimedia
- Problem: given a set of (multimedia) objects, find the ones similar to a desired query object (quickly!)
[Figure: three time series, $price vs. day (1 to 365)]
- Distance function: supplied by an expert
‘GEMINI’ - Pictorially
[Figure: sequences S1 ... Sn (value vs. day, 1 to 365) mapped to feature points F(S1) ... F(Sn); feature axes: e.g., avg and e.g., std]
- off-the-shelf SAMs (Spatial Access Methods)
‘GEMINI’
- fast; ‘correct’ (= no false dismissals)
- used for
  - images (e.g., QBIC) (2x, 10x faster)
  - shapes (27x faster)
  - video (e.g., InforMedia)
  - time sequences ([Rafiei+Mendelzon], ++)
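The filter-and-refine idea behind ‘GEMINI’ can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between equal-length sequences and the two features from the picture (avg and std). The names `features`, `euclidean`, and `range_query` are hypothetical, and a plain linear scan stands in for the spatial access method. The filter is safe because sqrt(n) times the distance in (avg, std) space never exceeds the true Euclidean distance, which is what gives ‘no false dismissals’.

```python
import math
from statistics import fmean, pstdev

def features(seq):
    """Map a sequence to a 2-d feature point (avg, population std)."""
    return (fmean(seq), pstdev(seq))

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(db, query, eps):
    """Return indices of sequences within eps of the query."""
    n = len(query)
    fq = features(query)
    hits = []
    for i, seq in enumerate(db):
        # Filter step: this quantity lower-bounds the true distance, so
        # discarding on it can add false alarms but never false dismissals.
        lower_bound = math.sqrt(n) * euclidean(features(seq), fq)
        if lower_bound > eps + 1e-9:  # small tolerance for rounding
            continue
        # Refine step: check the surviving candidate with the true distance.
        if euclidean(seq, query) <= eps:
            hits.append(i)
    return hits
```

In a real system the feature points would be stored in a spatial index so that the filter step does not touch every sequence.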
Remaining issues
- How to extract features automatically?
- How to merge similarity scores from different media?
Outline (goal: ‘Find similar / interesting things’)
- Problem - Applications
- Indexing - similarity search
  - Visualization: FastMap
  - Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
FastMap
[Figure: pairwise distance matrix over objects O1 ... O5, with entries such as ~100 and ~1, to be mapped to points (??)]
FastMap
- Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
- We want a linear algorithm: FastMap [SIGMOD95]
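A rough sketch of how a FastMap-style projection can work, given only a distance function. It assumes the usual law-of-cosines projection onto the line through two far-apart pivot objects, followed by recursion on the residual distances. The function name `fastmap`, the pivot heuristic, and the parameter names are illustrative choices, not taken from the slides.

```python
import math

def fastmap(objects, dist, k):
    """Map objects to k-d points using only pairwise distances."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left over after removing the axes found so far.
        d2 = dist(objects[i], objects[j]) ** 2
        for h in range(axis):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)

    for axis in range(k):
        # Heuristic pivot choice: farthest from object 0, then farthest from that.
        b = max(range(n), key=lambda j: residual_sq(0, j, axis))
        a = max(range(n), key=lambda j: residual_sq(b, j, axis))
        d_ab_sq = residual_sq(a, b, axis)
        if d_ab_sq == 0.0:
            break  # all remaining distances are zero; nothing left to map
        d_ab = math.sqrt(d_ab_sq)
        for i in range(n):
            # Law of cosines: position of object i along the line a -> b.
            coords[i][axis] = (
                residual_sq(a, i, axis) + d_ab_sq - residual_sq(b, i, axis)
            ) / (2.0 * d_ab)
    return coords
```

Each axis needs a number of distance computations proportional to N, which is where the linear cost in the number of objects comes from.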
Applications: time sequences
- Given n co-evolving time sequences, visualize them and find rules [ICDE00]
[Figure: exchange rate vs. time for HKD, JPY, DEM]
Applications - financial
- Currency exchange rates [ICDE00]
[Figure: scatter plot with points labeled USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]
Application: VideoTrails [ACM MM97]
VideoTrails - usage
- scene-cut detection (about 10% errors)
- scene classification (e.g., dialogue vs. action)
Outline (goal: ‘Find similar / interesting things’)
- Problem - Applications
- Indexing - similarity search
  - Visualization: FastMap
  - Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
Merging similarity scores
- e.g., video: text, color, motion, audio
  - weights change with the query!
- Solution 1: user specifies weights
- Solution 2: user gives examples
  - and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
  - but: how about disjunctive queries?
DEMO
- server demo
‘FALCON’
[Figure: example sequences shaped like inverted Vs and like Vs]
- Trader wants only ‘unstable’ stocks
‘FALCON’
[Figure: the same inverted Vs and Vs]
- average: is flat!
“Single query point” methods
- Rocchio
[Figure: feature space with axes avg and std; x marks the single query point]
“Single query point” methods
- Rocchio, MindReader, MARS
- The averaging effect in action...
[Figure: three panels (Rocchio, MindReader, MARS), each with x marking the single query point]
Main idea: FALCON contours [Wu+, VLDB 2000]
[Figure: contours in feature space; axes: feature1 (e.g., avg) and feature2 (e.g., std)]
A: Aggregate dissimilarity
- parameter: about -5, acting as a ‘soft OR’
[Figure: query point x and examples g1, g2]
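To make the ‘soft OR’ behaviour concrete, here is a small sketch. It assumes the aggregate dissimilarity is a generalized (power) mean of the distances from a candidate x to the examples g1, g2, ..., with a negative exponent. The name `aggregate_dissimilarity` and the parameter name `alpha` are hypothetical, and the exact formula used by FALCON should be checked against [Wu+, vldb2000].

```python
import math

def aggregate_dissimilarity(x, examples, alpha=-5.0):
    """Power mean of the distances from x to each example point.

    With a negative alpha the smallest distance dominates, so x scores
    well if it is close to ANY example (a soft OR).
    """
    dists = [math.dist(x, g) for g in examples]
    if alpha < 0 and any(d == 0.0 for d in dists):
        return 0.0  # x coincides with an example; avoid dividing by zero
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)
```

With examples at (0, 0) and (10, 0), a candidate at (1, 0) gets a lower dissimilarity than the midpoint (5, 0) when alpha is -5, whereas a plain average (alpha = 1) rates both candidates the same. That is the averaging problem of the single-query-point methods.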
FALCON
- converges quickly (~5 iterations)
- good precision/recall
- is fast (can use off-the-shelf ‘spatial/metric access methods’)
Conclusions for indexing + visualization
- GEMINI: fast indexing, exploiting off-the-shelf SAMs
- FastMap: automatic feature extraction in O(N) time
- FALCON: relevance feedback for disjunctive queries
Outline (goal: ‘Find similar / interesting things’)
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
Data mining & fractals - Road map
- Motivation - problems / case study
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
Problem #1 - spatial d.m.
- Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘spiral’ and ‘elliptical’ galaxies
- (also: stores & households; healthy & ill subjects)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability?
Problem #2: dim. reduction
- Given attributes x_1, ..., x_n
  - possibly non-linearly correlated
- drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’)
[Figure: scatter plot of mpg vs. engine size]
Answer: Fractals / self-similarities / power laws
What is a fractal?
- = self-similar point set, e.g., Sierpinski triangle
- zero area; infinite length!
Definitions (cont’d)
- Paradox: infinite perimeter; zero area!
- ‘dimensionality’: between 1 and 2
- actually: log(3)/log(2) = 1.58... (long story)
Intrinsic (‘fractal’) dimension
[Figure: points along a line in an x-y plot; e.g., x = #cylinders, y = miles / gallon]
- Q: fractal dimension of a line?
- A: nn(<= r) ~ r^1
- Q: fd of a plane?
- A: nn(<= r) ~ r^2
- fd == slope of log(nn) vs. log(r)
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
- this plot == ‘correlation integral’
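The slope-of-the-log-log-plot definition can be tried out directly. The sketch below counts pairs within each radius by brute force (quadratic in the number of points, so only suitable for small samples) and fits the slope by least squares. The function names and the choice of radii are illustrative assumptions. The radii must be large enough that every count is non-zero.

```python
import math

def pair_counts(points, radii):
    """Correlation integral: number of point pairs within each radius."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

def correlation_dimension(points, radii):
    """Estimate the fractal dimension as the slope of the correlation integral."""
    return loglog_slope(radii, pair_counts(points, radii))
```

For 200 points spaced evenly along a diagonal line in the plane, with radii 5, 10, 20, 40, the estimate comes out close to 1, even though the points live in a 2-d space.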
Observations
- self-similarity -> fractals, scale-free, power-laws (y = x^a, F = C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Road map
- Motivation - problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- Conclusions
Solution #1: spatial d.m.
- Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘BOPS’ plot [SIGMOD 2000]
- clusters?
- separable?
- attraction/repulsion?
- data ‘scrubbing’: duplicates?
Solution #1: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs]
- slope
- plateau!
- repulsion!
Spatial d.m.
[Figure: plot with radii r1 and r2 marked]
- Heuristic on choosing # of clusters
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs]
- slope
- plateau!
- repulsion!!
- duplicates
Problem #2: Dim. reduction
- Solution: drop the attributes that don’t increase the ‘partial f.d.’ (PFD)
- Definition: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
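One way to act on the ‘drop attributes that don’t increase the PFD’ rule is sketched below. This is a simplified greedy forward pass, not the algorithm from [SBBD00]: it estimates the fractal dimension of each projection with a brute-force correlation integral and keeps an attribute only if it raises the PFD by more than a threshold. The names `fractal_dim` and `select_attributes`, the threshold `min_gain`, and the radii are all illustrative assumptions.

```python
import math

def fractal_dim(points, radii):
    """Slope of log(#pairs within r) vs. log(r), by least squares."""
    logs = []
    for r in radii:
        pairs = sum(
            1
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) <= r
        )
        logs.append((math.log(r), math.log(pairs)))
    mx = sum(x for x, _ in logs) / len(logs)
    my = sum(y for _, y in logs) / len(logs)
    sxx = sum((x - mx) ** 2 for x, _ in logs)
    return sum((x - mx) * (y - my) for x, y in logs) / sxx

def select_attributes(rows, radii, min_gain=0.3):
    """Keep an attribute only if adding it raises the partial f.d. (PFD)."""
    kept, current_fd = [], 0.0
    for attr in range(len(rows[0])):
        trial = kept + [attr]
        # Project every row onto the trial attribute set and measure its PFD.
        projected = [tuple(row[a] for a in trial) for row in rows]
        fd = fractal_dim(projected, radii)
        if fd - current_fd > min_gain:
            kept, current_fd = trial, fd
    return kept
```

On rows of the form (i, i*i/100, 7), the second attribute is a non-linear function of the first and the third is constant, so neither raises the PFD above roughly 1 and only attribute 0 is kept. Unlike variance-based methods, this makes no assumption that the correlation is linear.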
Problem #2: dim. reduction
[Figure: example point sets annotated PFD~1, PFD=1, PFD=0, PFD=1; global FD=1]
Problem #2: dim. reduction
[Figure: example point sets annotated PFD~1, PFD=1, PFD=1, PFD=0, PFD=1; global FD=1]
- Notice: ‘max variance’ would fail here
Problem #2: dim. reduction
[Figure: example point sets annotated PFD~1, PFD=1, PFD=0, PFD=1; global FD=1]
- Notice: SVD would fail here
Currency dataset
Self-similar?
- currency: fd = 1.98
- eigenfaces: fd = 4.25
FDR on the ‘currency’ dataset
[Figure: FDR results, with a reference curve labeled ‘if unif + indep.’]
- HKD: “useless”
- more than 1.98 axes are needed
Road map
- Motivation - problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- Conclusions
App.: traffic
- disk traces: self-similar (also: web traffic; comm. errors; etc.)
[Figure: #bytes vs. time]
More apps: brain scans
- Oct-trees; brain scans
[Figure: log(#octants) vs. octree levels; fd = 2.63]
More fractals: stock prices (LYCOS)
- random walks
[Figure: price plots over 1 year and over 2 years]
More fractals: coastlines
- fractal dimension up to 1.58
Examples: MG county
- Montgomery County of MD (road end-points)
Examples: LB county
- Long Beach county of CA (road end-points)
More power laws: Zipf’s law
- Bible: rank vs. frequency (log-log)
[Figure: log(freq) vs. log(rank); points labeled “a” and “the”]
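A rank-frequency plot like this is easy to reproduce on any word list. The sketch below is a minimal illustration (the function name `zipf_slope` is hypothetical): it counts word frequencies, ranks them, and fits a least-squares line to log(freq) vs. log(rank). A slope near -1 is the classic Zipf behaviour.

```python
import math
from collections import Counter

def zipf_slope(words):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
```

The input needs at least two distinct words, otherwise there is no line to fit.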
More power laws
- Freq. distr. of first names; last names (Mandelbrot)
Internet
- Internet routers: how many neighbors within h hops?
[Figure: router map, with ‘U of Alberta’ marked]
Internet topology
- Internet routers: how many neighbors within h hops? [SIGCOMM 99]
- Reachability function: number of neighbors within r hops, vs. r (log-log); Mbone routers, 1995
[Figure: log(#pairs) vs. log(hops); slope 2.8]
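The reachability function can be computed with one breadth-first search per node. The sketch below assumes an undirected graph given as an adjacency dictionary, and it counts unordered pairs of distinct nodes; the exact counting convention in [SIGCOMM 99] may differ. The name `pairs_within_hops` is hypothetical.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """counts[h] = number of unordered node pairs at most h hops apart."""
    counts = [0] * (max_hops + 1)
    for src in adj:
        # Breadth-first search from src, cut off at max_hops.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != src:
                for h in range(d, max_hops + 1):
                    counts[h] += 1
    # Every pair was seen once from each endpoint.
    return [c // 2 for c in counts]
```

Plotting log(counts[h]) against log(h) for h >= 1 gives the kind of hop plot shown on the slide; on a 4-node path graph the counts for 1, 2, and 3 hops are 3, 5, and 6.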
More power laws: areas - Korcak’s law
- Scandinavian lakes ([ICDE99], w/ Proietti)
- area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
Olympic medals
[Figure: log(# medals) vs. log rank]
More power laws
- Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: log(count) vs. magnitude; amplitude vs. day]
Even more power laws
- Income distribution (Pareto’s law)
- sales distributions
- duration of UNIX jobs
- distribution of UNIX file sizes
- publication counts (Lotka’s law)
Even more power laws
- web hit frequencies ([Huberman])
- hyper-link distribution ([Barabasi], ++)
Overall conclusions
- ‘Find similar/interesting things’ in multimedia databases
- Indexing: feature extraction (‘GEMINI’)
  - automatic feature extraction: FastMap
  - relevance feedback: FALCON
Conclusions (cont’d)
- New tools for Data Mining: Fractals/power laws
  - appear everywhere
  - lead to skewed distributions (not Gaussian, Poisson, uniform, or independent)
  - ‘correlation integral’ for separability/cluster detection
  - PFD for dimensionality reduction
Conclusions (cont’d)
- Fractals/power laws also
  - can model bursty time sequences (buffering/prefetching)
  - help with selectivity estimation (‘how many neighbors within x km?’)
  - diagnose the dim. curse (it’s the fractal dim. that matters! [ICDE2000])
Resources
- Software and papers:
  - Fractal dimension (FracDim)
  - Separability (SIGMOD 2000)
  - Relevance feedback for query by content (FALCON, VLDB 2000)