1
Indexing and Data Mining in Multimedia Databases
Christos Faloutsos CMU
2
Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
USC 2001 C. Faloutsos
3
Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
Allow fast, approximate queries, and
Find rules/patterns
4
Sample queries
Similarity search
Find pairs of branches with similar sales patterns
Find medical cases similar to Smith's
Find pairs of sensor series that move in sync
Find shapes like a spark-plug
5
Sample queries – cont’d
Rule discovery
Clusters (of branches; of sensor data; ...)
Forecasting (total sales for next year?)
Outliers (e.g., unexpected part failures; fraud detection)
6
Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
related CMU and resources
7
Indexing - Multimedia
Problem: given a set of (multimedia) objects, find the ones similar to a desirable query object
8
distance function: by expert
[Figure: three time series, $price vs. day (1 to 365)]
9
‘GEMINI’ - Pictorially
[Figure: each sequence S1, ..., Sn (value vs. day, 1 to 365) is mapped to a feature point F(S1), ..., F(Sn); feature axes: e.g., avg and e.g., std]
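The picture can be read as a filter-and-refine search. What follows is a minimal sketch and not the GEMINI implementation: it assumes the expert's distance is plain Euclidean distance on equal-length series, and it relies on the fact that, with avg and std as features, the feature-space distance scaled by sqrt(n) never exceeds that Euclidean distance, so the filter step cannot dismiss a true answer. The function names are illustrative, and the linear scan stands in for a spatial index.

```python
import math

def features(series):
    """Map a series to the two features named on the slide: (avg, std)."""
    n = len(series)
    avg = sum(series) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in series) / n)
    return avg, std

def euclidean(a, b):
    """Assumed stand-in for the expert's distance between two series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature_distance(fa, fb, n):
    """Distance between feature points, scaled by sqrt(n) so that it never
    exceeds euclidean() on the original length-n series."""
    return math.sqrt(n) * math.hypot(fa[0] - fb[0], fa[1] - fb[1])

def range_query(database, query, eps):
    """Return all series within eps of the query: filter, then refine."""
    n = len(query)
    fq = features(query)
    # Filter step: cheap test in 2-d feature space (an index would go here).
    candidates = [s for s in database
                  if feature_distance(features(s), fq, n) <= eps]
    # Refine step: discard false alarms using the full distance.
    return [s for s in candidates if euclidean(s, query) <= eps]
```

The filter may let false alarms through, which the refine step removes; it does not lose true matches, because the feature distance is a lower bound.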
10
Remaining issues
how to extract features automatically?
how to merge similarity scores from different media
11
Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
Visualization: Fastmap
Relevance feedback: FALCON
Data Mining / Fractals
Conclusions
12
FastMap
[Figure: pairwise distances among objects O1 to O5 (entries such as 1, ~1, 100, ~100), to be mapped to points: ??]
13
FastMap
Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
We want a linear algorithm: FastMap [SIGMOD95]
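A minimal sketch of the FastMap idea, written from the published description and not taken from the author's code: pick two far-apart pivot objects, project every object onto the line through them using the cosine law, then repeat on the leftover (residual) distances. The function name, the pivot heuristic starting at object 0, and the clamping of negative residuals to zero are choices made here for illustration.

```python
import math

def fastmap(objects, dist, k):
    """Map each object to a k-d point using only calls to dist(a, b)."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, col):
        # Squared distance between objects i and j after subtracting the
        # part already explained by the first `col` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)

    for col in range(k):
        # Pivot heuristic: from an arbitrary object, hop to the farthest
        # object, then to the object farthest from that one.
        b = max(range(n), key=lambda j: residual_sq(0, j, col))
        a = max(range(n), key=lambda j: residual_sq(b, j, col))
        dab_sq = residual_sq(a, b, col)
        if dab_sq == 0.0:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine law: projection of object i on the line through a and b.
            coords[i][col] = (residual_sq(a, i, col) + dab_sq
                              - residual_sq(b, i, col)) / (2.0 * dab)
    return coords
```

Each output dimension costs about 4N distance evaluations, so the work grows linearly with the number of objects N, in contrast to the O(N**2) of MDS.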
14
Applications: time sequences
given n co-evolving time sequences
visualize them + find rules [ICDE00]
[Figure: rate vs. time for DEM, JPY, HKD]
15
Applications - financial
currency exchange rates [ICDE00]
[Figure: plot with points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
16
Applications - financial
currency exchange rates [ICDE00]
[Figure: plot with points labeled USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]
17
Application: VideoTrails
[ACM MM97]
18
VideoTrails - usage
scene-cut detection (about 10% errors)
scene classification (e.g., dialogue vs action)
19
Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
Visualization: Fastmap
Relevance feedback: FALCON
Data Mining / Fractals
Conclusions
20
Merging similarity scores
e.g., video: text, color, motion, audio
weights change with the query!
solution 1: user specifies weights
solution 2: user gives examples and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
but: how about disjunctive queries?
21
‘FALCON’
Trader wants only ‘unstable’ stocks
[Figure labels: Vs, Inverted, Vs]
22
“Single query point” methods
[Figure: good examples (+) and a single query point (x); method: Rocchio]
23
“Single query point” methods
[Figure: good examples (+) and query points (x) for Rocchio, MindReader, MARS]
The averaging effect in action...
24
Main idea: FALCON Contours
[Wu+, vldb2000]
[Figure: good examples (+) plotted on feature1 (e.g., temperature) vs. feature2 (e.g., frequency)]
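To make the contrast with “single query point” methods concrete, here is a small sketch. It assumes an aggregate distance of the form ((1/k) * sum of d(x, g_i)^alpha)^(1/alpha) with a negative exponent alpha, which behaves like a soft OR over the good examples; the value alpha = -5 and all function names are assumptions for illustration, not details confirmed by the slides.

```python
import math

def euclid(p, q):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_distance(x, good):
    """'Single query point' ranking: distance to the average good example."""
    center = [sum(g[d] for g in good) / len(good) for d in range(len(x))]
    return euclid(x, center)

def aggregate_distance(x, good, alpha=-5.0):
    """Soft-OR ranking: x scores well if it is close to ANY good example.
    alpha < 0 is an assumed setting, not a value taken from the slides."""
    dists = [euclid(x, g) for g in good]
    if min(dists) == 0.0:
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)
```

With two good examples far apart, the centroid ranks the empty midpoint between them first (the averaging effect), while the aggregate distance prefers points near either example, giving disjoint contours.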
25
Conclusions for indexing + visualization
GEMINI: fast indexing, exploiting off-the-shelf spatial access methods (SAMs)
FastMap: automatic feature extraction in O(N) time
FALCON: relevance feedback for disjunctive queries
26
Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
27
Data mining & fractals – Road map
Motivation – problems / case study
Definition of fractals and power laws
Solutions to posed problems
More examples
28
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- ‘spiral’ and ‘elliptical’ galaxies (stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
attraction/repulsion?
separability??
29
Problem #2: dim. reduction
given attributes x1, ... xn, possibly non-linearly correlated
drop the useless ones
(Q: why? A: to avoid the ‘dimensionality curse’)
30
Answer:
Fractals / self-similarities / power laws
31
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle: zero area; infinite length!
...
32
Definitions (cont’d)
Paradox: Infinite perimeter; Zero area!
‘dimensionality’: between 1 and 2
actually: Log(3)/Log(2) = 1.58… (long story)
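A short version of the long story, as a worked equation: doubling the side of a line segment yields 2 copies of it, doubling the side of a square yields 4 copies, and doubling the side of a Sierpinski triangle yields 3 copies. Treating the exponent D in "2^D = number of copies" as the dimension gives:

```latex
\[
  \underbrace{2^{1} = 2}_{\text{line}}, \qquad
  \underbrace{2^{2} = 4}_{\text{square}}, \qquad
  2^{D} = 3 \;\Longrightarrow\; D = \frac{\log 3}{\log 2} \approx 1.585 .
\]
```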
33
Intrinsic (‘fractal’) dimension
E.g.: #cylinders; miles / gallon
Q: fractal dimension of a line?
[Figure: x-y data with values 5, 1, 4, 2, 3]
34
Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)
35
Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)
Q: fd of a plane?
A: nn ( <= r ) ~ r^2
fd == slope of (log(nn) vs log(r))
36
Sierpinski triangle
== ‘correlation integral’
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
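The slope of that plot can be estimated directly from a point set. The sketch below is illustrative only: it samples the Sierpinski triangle with the "chaos game", counts pairs within each radius by brute force (quadratic time, fine for small samples), and fits a least-squares line in log-log scale. The function names and the choice of radii are assumptions, and every radius must be large enough to catch at least one pair.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle via the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = corners[0]
    points = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        points.append((x, y))
    return points

def pairs_within(points, radii):
    """For each radius r, count the pairs of points at distance <= r."""
    counts = [0] * len(radii)
    for i, (xi, yi) in enumerate(points):
        for xj, yj in points[i + 1:]:
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(v) for v in xs]
    ly = [math.log(v) for v in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r)."""
    return loglog_slope(radii, pairs_within(points, radii))
```

On a few hundred Sierpinski points with radii such as 0.02, 0.04, 0.08, 0.16, the estimate should land near the 1.58 of the slide, somewhat lower because of edge effects; points along a straight line should give a slope close to 1.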
37
Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples
Conclusions
38
Solution #1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
clusters?
separable?
attraction/repulsion?
data ‘scrubbing’ – duplicates?
39
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r); curves ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!
40
Solution #1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r); curves ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!
41
spatial d.m.
[Figure: characteristic distances r1 and r2]
Heuristic on choosing # of clusters
42
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r); curves ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!
43
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r); curves ell-ell, spi-spi, spi-ell]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates
44
Problem #2: Dim. reduction
45
Solution:
drop the attributes that don’t increase the ‘partial f.d.’ PFD
dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
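One way to turn this definition into a procedure is sketched below. It is a greedy forward-selection variant chosen for brevity, not necessarily the algorithm of the cited paper: start with no attributes, repeatedly add the attribute that raises the PFD the most, and stop when the best gain is small. The PFD is estimated as a correlation dimension by brute-force pair counting; the radii, the `min_gain` threshold and the function names are assumptions, and attribute values are assumed to be scaled to [0, 1].

```python
import math

RADII = (0.02, 0.04, 0.08, 0.16)  # assumed scales for data scaled to [0, 1]

def partial_fd(rows, attrs, radii=RADII):
    """Correlation dimension of the rows projected on attributes `attrs`."""
    pts = [[row[a] for a in attrs] for row in rows]
    counts = [0] * len(radii)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            d = math.sqrt(sum((u - v) ** 2 for u, v in zip(pts[i], pts[j])))
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    # Least-squares slope of log(count) vs. log(r); needs every count > 0.
    lx = [math.log(r) for r in radii]
    ly = [math.log(c) for c in counts]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

def select_attributes(rows, min_gain=0.3):
    """Greedy forward selection: add the attribute that raises the PFD the
    most; stop when the best achievable gain falls below min_gain."""
    remaining = list(range(len(rows[0])))
    kept, current = [], 0.0
    while remaining:
        scores = {a: partial_fd(rows, kept + [a]) for a in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break  # no remaining attribute adds a real degree of freedom
        kept.append(best)
        remaining.remove(best)
        current = scores[best]
    return kept
```

For rows of the form (x, x*x, z) with x and z independent and uniform, the second attribute is a non-linear function of the first, so adding it to the first barely changes the PFD; the sketch should keep two attributes, one of which is z.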
46
Problem #2: dim. reduction
[Figure: two example point sets with global FD=1; projection labels PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]
47
Problem #2: dim. reduction
[Figure: two example point sets with global FD=1; projection labels PFD=1, PFD=1, PFD=0, PFD=1, PFD~1]
Notice: ‘max variance’ would fail here
48
Problem #2: dim. reduction
[Figure: two example point sets with global FD=1; projection labels PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]
Notice: SVD would fail here
49
Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples: fractals; power laws
Conclusions
50
disk traffic
Not Poisson, not(?) iid - BUT: self-similar
How to model it?
[Figure: #bytes vs. time]
51
traffic
disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, split 20% / 80%]
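One common reading of the 80-20 ‘law’ is a recursive split: give 80% of the traffic to one half of the time interval and 20% to the other, then repeat inside each half. The sketch below generates such a series; it is an illustration under that assumption, and the function name, the random choice of which half is the busy one, and the parameter names are not taken from the slides.

```python
import random

def bursty_series(total, levels, bias=0.8, seed=0):
    """Spread `total` bytes over 2**levels time slots by repeated splits:
    at each split a randomly chosen half gets a fraction `bias` of the
    volume and the other half gets the rest."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        halves = []
        for volume in series:
            big, small = bias * volume, (1.0 - bias) * volume
            # Decide at random whether the busy half comes first or second.
            halves.extend((big, small) if rng.random() < 0.5 else (small, big))
        series = halves
    return series
```

With `bias=0.5` every slot receives the same volume, so the series is flat; pushing the bias toward 1 makes it increasingly bursty while the total stays fixed.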
52
Traffic
Many other time-sequences are bursty/clustered: (such as?)
53
Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: accesses over time on Tape#1 ... Tape#N]
54
Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: 50-50 = Poisson, real; inset: accesses over time on Tape#1 ... Tape#N]
55
More apps: Brain scans
Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]
56
GIS points
Cross-roads of Montgomery county: any rules?
57
GIS
A: self-similarity: intrinsic dim. = 1.51
avg#neighbors(<= r ) ~ r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]
58
Examples: LB county
Long Beach county of CA (road end-points)
59
More fractals:
cardiovascular system: 3 (!)
stock prices (LYCOS) - random walks: 1.5
Coastlines: (?)
[Figure: time axis marked 1 year, 2 years]
60
61
Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples: fractals; power laws
Conclusions
62
Fractals <-> Power laws
self-similarity <=> fractals <=> scale-free <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
63
Zipf’s law
RANK-FREQUENCY plot: (in log-log scales)
[Figure: log(freq) vs. log(rank) for the Bible; top words “the”, “and”]
Zipf’s (first) Law:
64
Zipf’s law
similarly for first names (slope ~ -1), last names (~ -0.7), etc
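A rank-frequency slope like the ones quoted above can be measured with a few lines. This is a sketch with assumed function names: count the items, sort the counts in decreasing order, and fit a least-squares line to log(frequency) vs. log(rank).

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent item."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    lx = [math.log(rank) for rank, _ in pairs]
    ly = [math.log(freq) for _, freq in pairs]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den
```

A synthetic corpus in which the k-th most common word occurs about 1000/k times gives a slope close to -1. On real data the fit is usually restricted to the range of ranks where the log-log plot looks straight.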
65
More power laws
Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
66
Clickstream data
<url, u-id, ....>
Web Site Traffic
[Figure: log(count) vs. log(freq); Zipf]
67
Lotka’s law
library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); J. Ullman marked]
68
Korcak’s law
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
69
More power laws: Korcak
Japan islands: area vs cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
70
(Korcak’s law: Aegean islands)
71
Olympic medals:
[Figure: log(# medals) vs. log rank; Russia, China, USA marked]
72
SALES data – store#96
[Figure: count of products vs. # units sold]
73
TELCO data
[Figure: count of customers vs. # of service units]
74
More power laws on the Internet
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]
75
Even more power laws:
Income distribution (Pareto’s law)
duration of UNIX jobs [Harchol-Balter]
Distribution of UNIX file sizes
Web graph [CLEVER-IBM; Barabasi]
76
Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
Indexing: feature extraction (‘GEMINI’)
automatic feature extraction: FastMap
Relevance feedback: FALCON
77
Conclusions - cont’d
New tools for Data Mining: Fractals/power laws:
appear everywhere
lead to skewed distributions (not Gaussian, Poisson, uniform, or independent)
‘correlation integral’ for separability/cluster detection
PFD for dimensionality reduction
78
Resources:
Software and papers: www.cs.cmu.edu/~christos
Fractal dimension (FracDim)
Separability (sigmod 2000, kdd2001)
Relevance feedback for query by content (FALCON – vldb 2000)
79
Resources
Manfred Schroeder “Fractals, Chaos, Power Laws”