Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos.


1 Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos

2 USC 2001C. Faloutsos2 Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Resources

3 USC 2001C. Faloutsos3 Problem Given a large collection of (multimedia) records, find similar/interesting things, ie: Allow fast, approximate queries, and Find rules/patterns

4 USC 2001C. Faloutsos4 Sample queries Similarity search –Find pairs of branches with similar sales patterns –find medical cases similar to Smith's –Find pairs of sensor series that move in sync –Find shapes like a spark-plug

5 USC 2001C. Faloutsos5 Sample queries –cont’d Rule discovery –Clusters (of branches; of sensor data;...) –Forecasting (total sales for next year?) –Outliers (eg., unexpected part failures; fraud detection)

6 USC 2001C. Faloutsos6 Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Related projects @ CMU and resources

7 USC 2001C. Faloutsos7 Indexing - Multimedia Problem: given a set of (multimedia) objects, find the ones similar to a desirable query object

8 USC 2001C. Faloutsos8 [Figure: three time series, $price vs. day (1 to 365)] distance function: by expert

9 USC 2001C. Faloutsos9 ‘GEMINI’ - Pictorially [Figure: sequences S1 ... Sn, each over days 1 to 365, mapped to points F(S1) ... F(Sn) in a feature space with axes eg, avg and eg, std]
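The mapping on this slide can be sketched as a filter-and-refine range query. This is a minimal sketch, not the method from the talk: it assumes equal-length sequences and Euclidean distance (the slides say the distance function comes from an expert), and the function names are hypothetical. The features are scaled by sqrt(n) so that the distance between feature points never exceeds the true distance, which means the filter step produces false alarms but no false dismissals.

```python
import math

def features(seq):
    """Map a sequence S to F(S) = sqrt(n) * (avg, std), as in the slide's picture."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)  # population std
    scale = math.sqrt(n)  # assumption: scaling makes the feature distance a lower bound
    return (scale * mean, scale * std)

def euclid(a, b):
    """Euclidean distance between two equal-length tuples or lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(db, query, eps):
    """Return (candidates, hits): indices passing the filter, then the true matches."""
    fq = features(query)
    # In practice the feature points would be stored in a spatial access method;
    # a linear scan stands in for the index here.
    feats = [features(s) for s in db]
    candidates = [i for i, f in enumerate(feats) if euclid(f, fq) <= eps]
    # Refine: discard false alarms using the full distance.
    hits = [i for i in candidates if euclid(db[i], query) <= eps]
    return candidates, hits

db = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10], [4, 3, 2, 1]]
candidates, hits = range_query(db, [1, 2, 3, 4], eps=1.5)
print(candidates, hits)
```

The reversed sequence [4, 3, 2, 1] has the same average and standard deviation as the query, so it survives the filter and is only removed in the refinement step.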

10 USC 2001C. Faloutsos10 Remaining issues how to extract features automatically? how to merge similarity scores from different media

11 USC 2001C. Faloutsos11 Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search –Visualization: Fastmap –Relevance feedback: FALCON Data Mining / Fractals Conclusions

12 USC 2001C. Faloutsos12 FastMap
     O1   O2   O3   O4   O5
O1    0    1    1  100  100
O2    1    0    1  100  100
O3    1    1    0  100  100
O4  100  100  100    0    1
O5  100  100  100    1    0
[Figure labels: ~100, ~1, ??]

13 USC 2001C. Faloutsos13 FastMap Multi-dimensional scaling (MDS) can do that, but in O(N**2) time We want a linear algorithm: FastMap [SIGMOD95]
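A minimal sketch of the FastMap idea, applied to the five-object distance matrix from the previous slide (O1 to O3 are 1 apart, O4 and O5 are 1 apart, and the two groups are 100 apart). The function name and the pivot heuristic's fixed starting object are assumptions made for illustration; the published algorithm should be consulted for details. Residual distances are computed on demand, so each axis needs only a linear number of distance evaluations.

```python
import math

def fastmap(dist, n, k):
    """Embed n objects into k dimensions given a distance function dist(i, j)."""
    coords = [[0.0] * k for _ in range(n)]

    def resid2(i, j, axis):
        # Squared distance left over after projecting on the first `axis` axes.
        d2 = dist(i, j) ** 2
        for t in range(axis):
            d2 -= (coords[i][t] - coords[j][t]) ** 2
        return max(d2, 0.0)  # clamp small negative values from rounding

    for axis in range(k):
        # Pivot heuristic (assumption: start from object 0): take the object
        # farthest from it, then the object farthest from that one.
        b = max(range(n), key=lambda j: resid2(0, j, axis))
        a = max(range(n), key=lambda j: resid2(b, j, axis))
        dab2 = resid2(a, b, axis)
        if dab2 == 0.0:
            break  # all remaining distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i on the line through the pivots.
            coords[i][axis] = (resid2(a, i, axis) + dab2 - resid2(b, i, axis)) / (2 * dab)
    return coords

D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap(lambda i, j: D[i][j], n=5, k=2)
for name, point in zip(["O1", "O2", "O3", "O4", "O5"], coords):
    print(name, [round(c, 3) for c in point])
```

On this input the first axis alone already separates the two groups by about 100, while objects within a group stay close together.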

14 USC 2001C. Faloutsos14 Applications: time sequences given n co-evolving time sequences visualize them + find rules [ICDE00] time rate HKD JPY DEM

15 USC 2001C. Faloutsos15 Applications - financial currency exchange rates [ICDE00] USD(t) USD(t-5) FRF GBP JPY HKD

16 USC 2001C. Faloutsos16 Applications - financial currency exchange rates [ICDE00] USD HKD JPY FRF DEM GBP USD(t) USD(t-5)

17 USC 2001C. Faloutsos17 Application: VideoTrails [ACM MM97]

18 USC 2001C. Faloutsos18 VideoTrails - usage scene-cut detection (about 10% errors) scene classification (eg., dialogue vs action)

19 USC 2001C. Faloutsos19 Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search –Visualization: Fastmap –Relevance feedback: FALCON Data Mining / Fractals Conclusions

20 USC 2001C. Faloutsos20 Merging similarity scores eg., video: text, color, motion, audio –weights change with the query! solution 1: user specifies weights solution 2: user gives examples –and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader) –but: how about disjunctive queries?

21 USC 2001C. Faloutsos21 ‘FALCON’ [Figure labels: Inverted Vs; Vs] Trader wants only ‘unstable’ stocks

22 USC 2001C. Faloutsos22 “Single query point” methods Rocchio [Figure: scatter of ‘+’ points with a single ‘x’]

23 USC 2001C. Faloutsos23 “Single query point” methods Rocchio, MindReader, MARS [Figure: three scatter plots of ‘+’ points, each with one ‘x’] The averaging effect in action...

24 USC 2001C. Faloutsos24 Main idea: FALCON Contours [Figure: ‘+’ points with contours; axes: feature1 (eg., temperature), feature2 (eg., frequency)] [Wu+, vldb2000]
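The contrast between a single query point and FALCON-style contours can be illustrated with a small sketch. It assumes an aggregate dissimilarity of the form D(x) = ((1/k) * sum_i d(x, g_i)^alpha)^(1/alpha) with a negative alpha, so that being close to any one good example is enough; this form, the value alpha = -5, and the function names are assumptions for illustration, and the cited VLDB 2000 paper should be consulted for the actual definition.

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_distance(x, good):
    """Single-query-point scoring: distance to the average of the good examples."""
    dim = len(x)
    centroid = [sum(g[t] for g in good) / len(good) for t in range(dim)]
    return euclid(x, centroid)

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Assumed aggregate: a negative alpha behaves like a soft 'OR' over good examples."""
    dists = [euclid(x, g) for g in good]
    if alpha < 0 and any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)

# Two good examples far apart: a disjunctive query ("like this one OR like that one").
good = [(0.0, 0.0), (10.0, 0.0)]
near = (0.5, 0.0)  # close to the first good example
mid = (5.0, 0.0)   # halfway between them, close to neither

print(centroid_distance(near, good), centroid_distance(mid, good))
print(aggregate_dissimilarity(near, good), aggregate_dissimilarity(mid, good))
```

The centroid score prefers the midpoint, which resembles neither example, while the aggregate score prefers the point next to a good example.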

25 USC 2001C. Faloutsos25 Conclusions for indexing + visualization GEMINI: fast indexing, exploiting off-the- shelf SAMs FastMap: automatic feature extraction in O(N) time FALCON: relevance feedback for disjunctive queries

26 USC 2001C. Faloutsos26 Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Resources

27 USC 2001C. Faloutsos27 Data mining & fractals – Road map Motivation – problems / case study Definition of fractals and power laws Solutions to posed problems More examples

28 USC 2001C. Faloutsos28 Problem #1 - spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies (stores & households ; mpg & MTBF...) - patterns? (not Gaussian; not uniform) -attraction/repulsion? - separability??

29 USC 2001C. Faloutsos29 Problem#2: dim. reduction given attributes x 1,... x n –possibly, non-linearly correlated drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’)

30 USC 2001C. Faloutsos30 Answer: Fractals / self-similarities / power laws

31 USC 2001C. Faloutsos31 What is a fractal? = self-similar point set, e.g., Sierpinski triangle:... zero area; infinite length!

32 USC 2001C. Faloutsos32 Definitions (cont’d) Paradox: Infinite perimeter ; Zero area! ‘dimensionality’: between 1 and 2 actually: Log(3)/Log(2) = 1.58… (long story)
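The ‘long story’ can be shortened under one assumption: that ‘dimensionality’ here means the self-similarity dimension. The triangle consists of N = 3 copies of itself, each scaled by s = 1/2, and each construction step multiplies the area by 3/4 and the perimeter by 3/2. The symbols N, s, D, A_k and P_k are introduced here for illustration.

```latex
D = \frac{\log N}{\log (1/s)} = \frac{\log 3}{\log 2} \approx 1.585,
\qquad
A_k = \left(\tfrac{3}{4}\right)^{k} A_0 \to 0,
\qquad
P_k = \left(\tfrac{3}{2}\right)^{k} P_0 \to \infty
\quad (k \to \infty)
```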

33 USC 2001C. Faloutsos33 Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? Eg: #cylinders; miles / gallon
x  y
5  1
4  2
3  3
2  4

34 USC 2001C. Faloutsos34 Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)

35 USC 2001C. Faloutsos35 Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a) Q: fd of a plane? A: nn ( <= r ) ~ r^2 fd== slope of (log(nn) vs log(r) )

36 USC 2001C. Faloutsos36 Sierpinski triangle log( r ) log(#pairs within <=r ) 1.58 == ‘correlation integral’
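The slope of log(#pairs within <= r) against log(r) can be estimated directly. The sketch below is a brute-force illustration, quadratic in the number of points, with hypothetical function names; it uses regular grids for a line and a plane rather than real data, and finite grids bias the slopes slightly below the ideal values of 1 and 2.

```python
import math

def pair_counts(points, radii):
    """Count unordered pairs of points within each radius (the 'correlation integral')."""
    limits = [r * r for r in radii]
    counts = [0] * len(radii)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            for t, lim in enumerate(limits):
                if d2 <= lim:
                    counts[t] += 1
    return counts

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs log(r) over the given radii."""
    counts = pair_counts(points, radii)
    return fit_slope([math.log(r) for r in radii], [math.log(c) for c in counts])

radii = [2, 4, 8]
line = [(i, 0) for i in range(200)]                      # points along a line in 2-d
plane = [(i, j) for i in range(25) for j in range(25)]   # points filling a square

fd_line = correlation_dimension(line, radii)
fd_plane = correlation_dimension(plane, radii)
print(round(fd_line, 2), round(fd_plane, 2))
```

Both point sets live in a 2-d space, yet the line's slope stays near 1: the measure reflects the intrinsic dimension of the data, not the number of attributes.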

37 USC 2001C. Faloutsos37 Road map Motivation – problems / case studies Definition of fractals and power laws Solutions to posed problems More examples Conclusions

38 USC 2001C. Faloutsos38 Solution#1: spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]) clusters? separable? attraction/repulsion? data ‘scrubbing’ – duplicates?

39 USC 2001C. Faloutsos39 Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!

40 USC 2001C. Faloutsos40 Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! [w/ Seeger, Traina, Traina, SIGMOD00]

41 USC 2001C. Faloutsos41 spatial d.m. r1r2 r1 r2 Heuristic on choosing # of clusters

42 USC 2001C. Faloutsos42 Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!

43 USC 2001C. Faloutsos43 Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! -repulsion!! -duplicates

44 USC 2001C. Faloutsos44 Problem #2: Dim. reduction

45 USC 2001C. Faloutsos45 Solution: drop the attributes that don’t increase the ‘partial f.d.’ PFD dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]

46 USC 2001C. Faloutsos46 Problem #2: dim. reduction PFD~1 global FD=1 PFD=1 PFD=0 PFD=1

47 USC 2001C. Faloutsos47 Problem #2: dim. reduction PFD~1 PFD=1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: ‘max variance’ would fail here

48 USC 2001C. Faloutsos48 Problem #2: dim. reduction PFD~1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: SVD would fail here

49 USC 2001C. Faloutsos49 Road map Motivation – problems / case studies Definition of fractals and power laws Solutions to posed problems More examples –fractals –power laws Conclusions

50 USC 2001C. Faloutsos50 disk traffic Not Poisson, not(?) iid - BUT: self-similar How to model it? time #bytes

51 USC 2001C. Faloutsos51 traffic disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02]) time #bytes 20% 80%
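One way to picture the 80-20 ‘law’ is a recursive split: the total volume of an interval is divided between its two halves as 80% and 20%, and the same rule is applied again inside each half. The sketch below generates such a trace; the function name, parameters, and the random choice of which half gets the larger share are assumptions for illustration, not the fitting procedure of the cited ICDE’02 work.

```python
import random

def bursty_trace(total, levels, bias=0.8, seed=0):
    """Split `total` bytes over 2**levels time slots with a recursive bias/(1-bias) rule."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    volumes = [float(total)]
    for _ in range(levels):
        next_volumes = []
        for v in volumes:
            high, low = v * bias, v * (1.0 - bias)
            # Assumption: the busier half is chosen at random at every split.
            if rng.random() < 0.5:
                next_volumes.extend((high, low))
            else:
                next_volumes.extend((low, high))
        volumes = next_volumes
    return volumes

trace = bursty_trace(total=1000.0, levels=10)
busiest = max(trace)
print(len(trace), round(sum(trace), 6), round(busiest, 3))
```

With 10 levels there are 1024 slots; a uniform trace would put about 0.1% of the volume in each slot, whereas here the busiest slot alone holds 0.8^10, roughly 10.7%, of the total.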

52 USC 2001C. Faloutsos52 Traffic Many other time-sequences are bursty/clustered: (such as?)

53 USC 2001C. Faloutsos53 Tape accesses time Tape#1 Tape# N # tapes needed, to retrieve n records? (# days down, due to failures / hurricanes / communication noise...)

54 USC 2001C. Faloutsos54 Tape accesses time Tape#1 Tape# N # tapes retrieved # qual. records 50-50 = Poisson real

55 USC 2001C. Faloutsos55 More apps: Brain scans Oct-trees; brain-scans octree levels Log(#octants) 2.63 = fd

56 USC 2001C. Faloutsos56 Cross-roads of Montgomery county: any rules? GIS points

57 USC 2001C. Faloutsos57 GIS A: self-similarity: intrinsic dim. = 1.51 avg#neighbors(<= r ) = r^D log( r ) log(#pairs(within <= r)) 1.51

58 USC 2001C. Faloutsos58 Examples:LB county Long Beach county of CA (road end-points)

59 USC 2001C. Faloutsos59 More fractals: cardiovascular system: 3 (!) stock prices (LYCOS) - random walks: 1.5 Coastlines: 1.2-1.58 (?) [Figure labels: 1 year; 2 years]

60 USC 2001C. Faloutsos60

61 USC 2001C. Faloutsos61 Road map Motivation – problems / case studies Definition of fractals and power laws Solutions to posed problems More examples –fractals –power laws Conclusions

62 USC 2001C. Faloutsos62 Fractals Power laws self-similarity -> fractals scale-free power-laws (y=x^a, F=C*r^(-2)) log( r ) log(#pairs within <=r ) 1.58

63 USC 2001C. Faloutsos63 Zipf’s law Bible RANK-FREQUENCY plot (in log-log scales): Zipf’s (first) Law [Figure: log(freq) vs. log(rank); points labeled “the”, “and”]

64 USC 2001C. Faloutsos64 Zipf’s law similarly for first names (slope ~-1) last names (~ -0.7) etc
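A rank-frequency plot and its log-log slope can be computed in a few lines. This is a minimal sketch with hypothetical function names and a tiny synthetic token list whose counts are exactly proportional to 1/rank; real text only follows the law approximately.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Frequencies sorted in decreasing order; position i holds the count at rank i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic corpus: the word at rank r appears 120 / r times.
tokens = ["the"] * 120 + ["and"] * 60 + ["of"] * 40 + ["to"] * 30 + ["in"] * 24
freqs = rank_frequency(tokens)
zipf_slope = loglog_slope(freqs)
print(freqs, round(zipf_slope, 3))
```

A slope near -1 corresponds to the classic form of the law; other data sets can be checked by passing their own counts to `loglog_slope`.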

65 USC 2001C. Faloutsos65 More power laws Energy of earthquakes (Gutenberg-Richter law) [simscience.org] [Figure: amplitude vs. day; log(count) vs. magnitude]

66 USC 2001C. Faloutsos66 Web Site Traffic log(freq) log(count) Zipf Clickstream data

67 USC 2001C. Faloutsos67 Lotka’s law library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001) log(#citations) log(count) J. Ullman

68 USC 2001C. Faloutsos68 Korcak’s law Scandinavian lakes area vs complementary cumulative count (log-log axes) log(count( >= area)) log(area)

69 USC 2001C. Faloutsos69 More power laws: Korcak Japan islands; area vs cumulative count (log-log axes) log(area) log(count( >= area))

70 USC 2001C. Faloutsos70 (Korcak’s law: Aegean islands)

71 USC 2001C. Faloutsos71 Olympic medals: log rank log(# medals) USA China Russia

72 USC 2001C. Faloutsos72 SALES data – store#96 # units sold count of products

73 USC 2001C. Faloutsos73 TELCO data # of service units count of customers

74 USC 2001C. Faloutsos74 More power laws on the Internet degree vs rank, for Internet domains (log-log) [sigcomm99] log(rank) log(degree) -0.82

75 USC 2001C. Faloutsos75 Even more power laws: Income distribution (Pareto’s law); duration of UNIX jobs [Harchol-Balter] Distribution of UNIX file sizes Web graph [CLEVER-IBM; Barabasi]

76 USC 2001C. Faloutsos76 Overall Conclusions: ‘Find similar/interesting things’ in multimedia databases Indexing: feature extraction (‘GEMINI’) –automatic feature extraction: FastMap –Relevance feedback: FALCON

77 USC 2001C. Faloutsos77 Conclusions - cont’d New tools for Data Mining: Fractals/power laws: –appear everywhere –lead to skewed distributions (not Gaussian, Poisson, uniform, or independent) –‘correlation integral’ for separability/cluster detection –PFD for dimensionality reduction

78 USC 2001C. Faloutsos78 Resources: Software and papers: –www.cs.cmu.edu/~christos –Fractal dimension (FracDim) –Separability (sigmod 2000, kdd2001) –Relevance feedback for query by content (FALCON – vldb 2000)

79 USC 2001C. Faloutsos79 Resources Manfred Schroeder “Chaos, Fractals and Power Laws”

