Presentation is loading. Please wait.

Presentation is loading. Please wait.

15-826: Multimedia Databases and Data Mining

Similar presentations


Presentation on theme: "15-826: Multimedia Databases and Data Mining"— Presentation transcript:

1 15-826: Multimedia Databases and Data Mining
Faloutsos 15-826 15-826: Multimedia Databases and Data Mining Lecture #22: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos 15-826 (c) C. Faloutsos and J-Y Pan (2017)

2 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia 15-826 (c) C. Faloutsos and J-Y Pan (2017)

3 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

4 Motivation: (Q1) Find patterns in data
Faloutsos 15-826 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Sample at 60 Hz (550 ticks=9.2sec) Left Knee Energy exerted Take-off Right Knee Landing Energy exerted 15-826 (c) C. Faloutsos and J-Y Pan (2017)

5 Motivation: (Q1) Find patterns in data
Faloutsos 15-826 Motivation: (Q1) Find patterns in data Take-off Landing Human would say Pattern 1: along diagonal Pattern 2: along vertical axis How to find these automatically? R:L=60:1 R:L=1:1 x:y= 15.51:14.12 X:y = -0.29:17.65 Each point is the measurement at a time tick (total 550 points). 15-826 (c) C. Faloutsos and J-Y Pan (2017)

6 Motivation: (Q2) Find hidden variables
Faloutsos 15-826 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) Stock prices Alcoa American Express AA: Alcoa AXP: American Express BA: Boeing CAT: Caterpillar C: Citigroup “General trend” Boeing Citi Group “Internet bubble” 15-826 (c) C. Faloutsos and J-Y Pan (2017)

7 Motivation: (Q2) Find hidden variables
Faloutsos 15-826 Motivation: (Q2) Find hidden variables Caterpillar Intel 0.94 0.03 0.63 0.64 “General trend” Hidden variables “Internet bubble” 15-826 (c) C. Faloutsos and J-Y Pan (2017)

8 Motivation: (Q2) Find hidden variables
Faloutsos 15-826 Motivation: (Q2) Find hidden variables Caterpillar Intel ? B1,CAT B2,INTC B1,INTC B2,CAT ? ? Hidden variable 1 Hidden variable 2 15-826 (c) C. Faloutsos and J-Y Pan (2017)

9 Motivation: Find hidden variables
Faloutsos 15-826 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing) 15-826 (c) C. Faloutsos and J-Y Pan (2017)

10 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

11 Formulation: Finding patterns
Faloutsos 15-826 Formulation: Finding patterns x:y= 15.51:14.12 X:y = -0.29:17.65 Given n data points, each with m attributes. Find patterns that describe data properties the best. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

12 Linear representation
Faloutsos 15-826 Linear representation Surprisingly, the good ones are those given by ICA. A small comparison between PCA and ICA: PCA: major variations in distribution (!= patterns) together gives good low-rank approximation individually may not capture patterns Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns): 15-826 (c) C. Faloutsos and J-Y Pan (2017)

13 Patterns as data “vocabulary”
Faloutsos 15-826 Patterns as data “vocabulary” Good pattern ≈ sparse coding A data point can be viewed as a linear sum of the two patterns. A set of patterns is good, if the data representation is sparse – only one pattern is used. (Question) … Only b1 is needed to describe xi. (Q) Given data xi’s, compute hi,j’s and bi’s that are “sparse”? 15-826 (c) C. Faloutsos and J-Y Pan (2017)

14 Patterns in motion capture data
Faloutsos 15-826 Patterns in motion capture data Sparse ~ non-Gaussian ~ “Independent” Data matrix Left Right Basis vectors Hidden variables ? b2 b1 n=550 ticks “Independent”: e.g., minimize mutual information. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

15 Patterns in motion capture data
Faloutsos 15-826 Patterns in motion capture data Sparse ~ non-Gaussian ~ “Independent” Data matrix Basis vectors Hidden variables ? X = (U L ) VT 15-826 (c) C. Faloutsos and J-Y Pan (2017)

16 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Find topics in documents Hidden variables in stock prices Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05]
Faloutsos 15-826 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] (Q) Different modalities Video frames Step 1: Data points (matrix) or Stock prices Step 2: Compute patterns (Q) What pattern? So, this is how AutoSplit works. or Step 3: Interpret patterns (Q) How? Text documents Data mining (Case studies) 15-826 (c) C. Faloutsos and J-Y Pan (2017)

18 Finding patterns in high-dimensional data
Faloutsos 15-826 details Finding patterns in high-dimensional data Dimensionality reduction PCA finds the hyperplane. ICA finds the correct patterns. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

19 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Find topics in documents Hidden variables in stock prices Visual vocabulary for retinal images Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

20 Topic discovery on text streams
Faloutsos 15-826 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream Documents are sorted by date/time Subsequent documents may have different topics Topic 1 Topic 3 Topic 1 Date/Time 15-826 (c) C. Faloutsos and J-Y Pan (2017)

21 Topic discovery on text streams
Faloutsos 15-826 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream FIND: the document boundaries AND: the terms of each topic Date/Time 15-826 (c) C. Faloutsos and J-Y Pan (2017) ??

22 Topic discovery on text streams
Faloutsos 15-826 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Topic 1 Topic 3 Topic 1 Date/Time 15-826 (c) C. Faloutsos and J-Y Pan (2017)

23 Topic discovery in documents
Faloutsos 15-826 Topic discovery in documents Step 1 aaron zoo xi = [1, 5, …, 0] Windowing (n=1659) New stories X[nxm] (30 words) m=3887 (dictionary size) Continuing our example on text, Step 1: give a data matrix, Step 2: extract patterns, Step 3: Data: CNN headline news (Jan.-Jun. 1998) Windows of 30 words are taken 50% overlap, 1659 windows Word vectors (3887-Dim) PCA to 10-Dim Task: Segmentation and topic discovery Extract PCA bases for comparison ICA on the 10-Dim data vectors Extract ICA bases Note: 10-Dim bases are mapped back to original space to identify member terms. Documents of 10 topics in one single text stream (sorted by date) [Kitagawa and Hamamoto] Step 3 b’i = [0, 0.7, …, 0.6] aaron zoo animal B’[10x3887] Step 2 X[nxm’] = H[nxm’] B[m’xm’] Find hyperplane (m’=10) Find patterns (Q) What does b’i mean? 15-826 (c) C. Faloutsos and J-Y Pan (2017)

24 Step 3: Interpret the patterns
Faloutsos 15-826 Step 3: Interpret the patterns aaron animal zoo b’i = [0, 0.7, …, 0.6] Top words : “animal”, “zoo”, … A hidden topic! m=3887 (dictionary size) ID Sorted word list A Mckinne Sergeant sexual Major Armi B bomb Rudolph Clinic Atlanta Birmingham C Winfrei Beef Texa Oprah Cattl D Viagra Drug Impot Pill Doctor E Zamora Graham Kill Former Jone F Medal Olymp Gold Women Game G Pope Cube Castro Cuban Visit H Asia Economi Japan Econom Asian I Super Bowl Team Re J Peopl Tornado Florida Topics found General idea: related to the data attributes 15-826 (c) C. Faloutsos and J-Y Pan (2017)

25 Step 3: Evaluate the patterns
Faloutsos 15-826 Step 3: Evaluate the patterns ID True Topic 1 Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2 A bomb explodes in a Birmingham, AL abortion clinic 3 The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4 New impotency drug Viagra is approved for use 5 Diane Zamora is convicted of helping to murder her lover’s girlfriend 6 1998 Winter Olympic games 7 The Pope’s historic visit to Cube in Winter 1998 8 The economic crisis in Asia 9 Superbowl XXXII 10 Tornado in Florida ID Sorted word list A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone AutoSplit finds correct topics. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

26 Step 3: Evaluate the patterns
Faloutsos 15-826 Step 3: Evaluate the patterns ID AutoSplit A mckinne sergeant sexual major armi B bomb rudolph clinic atlanta birmingham C winfrei beef texa oprah cattl D viagra drug Impot pill doctor E zamora graham kill former jone ID PCA A’ mckinne bomb women sexual sergeant B’ rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ drug E’ zamora graham olymp ID PCA A’ mckinne bomb women sexual sergeant B’ rudolph clinic atlanta C’ winfrei viagra texa beef oprah D’ drug E’ zamora graham olymp AutoSplit’s topics are better than PCA. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

27 Step 3: Evaluate the patterns
Faloutsos 15-826 Step 3: Evaluate the patterns AutoSplit A B C D E Topic 1 Topic 2 PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. AutoSplit’s topics are better than PCA. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

28 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Find topics in documents Hidden variables in stock prices Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

29 Find hidden variables (DJIA stocks)
Faloutsos 15-826 Find hidden variables (DJIA stocks) Weekly DJIA closing prices 01/02/ /05/2002, n=660 data points A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group 15-826 (c) C. Faloutsos and J-Y Pan (2017)

30 Formulation: Find hidden variables
Faloutsos 15-826 Formulation: Find hidden variables AA1, …, XOM1 AAn, …, XOMn H11, H12, …, H1m Hn1, Hn2, …, Hnm B11, B12, …, B1m Bm1, Bm2, …, Bmm ? = ? Date Hidden variable Date 15-826 (c) C. Faloutsos and J-Y Pan (2017)

31 Characterize hidden variable by the companies it influences
Faloutsos 15-826 Characterize hidden variable by the companies it influences Caterpillar Intel B1,CAT 0.94 0.64 B2,INTC 0.63 0.03 B1,INTC B2,CAT “General trend” “Internet bubble” 15-826 (c) C. Faloutsos and J-Y Pan (2017)

32 Companies related to hidden variable 1
Faloutsos 15-826 Companies related to hidden variable 1 B1,j Highest Lowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard Talk a little bit about “which company doing what”. #AA Alcoa #AXP American Express #BA Boeing #CAT Caterpillar #C CitiGroup #DIS Disney #DD Du Pont #EK Eastman Kodak #GE General Electric #GM General Motors #HWP Hewlett-Packard #HD Home Depot #HON Honeywell #INTC Intel #IBM International Business Machines #IP International Paper #JNJ Johnson & Johnson #JPM JP Morgan Bank #KO Coca Cola #MCD McDonalds #MSFT Microsoft #MMM Minnesota Mining and Manufacturing (3M) #MO Philip Morris #MRK Merck #PG Procter and Gamble #SBC SBC Communications #T AT&T #UTX United Technologies #WMT WalMart Stores #XOM Exxon Mobil “General trend” 15-826 (c) C. Faloutsos and J-Y Pan (2017)

33 Companies related to hidden variable 1
Faloutsos Companies related to hidden variable 1 15-826 B1,j Highest Lowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard #AA Alcoa #AXP American Express #BA Boeing #CAT Caterpillar #C CitiGroup #DIS Disney #DD Du Pont #EK Eastman Kodak #GE General Electric #GM General Motors #HWP Hewlett-Packard #HD Home Depot #HON Honeywell #INTC Intel #IBM International Business Machines #IP International Paper #JNJ Johnson & Johnson #JPM JP Morgan Bank #KO Coca Cola #MCD McDonalds #MSFT Microsoft #MMM Minnesota Mining and Manufacturing (3M) #MO Philip Morris #MRK Merck #PG Procter and Gamble #SBC SBC Communications #T AT&T #UTX United Technologies #WMT WalMart Stores #XOM Exxon Mobil All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

34 General trend (and outlier)
Faloutsos 15-826 General trend (and outlier) “General trend” AT&T T: AT&T UTX: United Technologies WMT: Walmart Stores XOM: Exxon Mobil United Technologies Walmart Exxon Mobil 15-826 (c) C. Faloutsos and J-Y Pan (2017)

35 Companies related to hidden variable 2
Faloutsos Companies related to hidden variable 2 15-826 B2,j Highest Lowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont Philip Morris: big tobacco/cigarette company Caterpillar: Equipment company Procter and Gamble, Du Pont: chemical companies #AA Alcoa #AXP American Express #BA Boeing #CAT Caterpillar #C CitiGroup #DIS Disney #DD Du Pont #EK Eastman Kodak #GE General Electric #GM General Motors #HWP Hewlett-Packard #HD Home Depot #HON Honeywell #INTC Intel #IBM International Business Machines #IP International Paper #JNJ Johnson & Johnson #JPM JP Morgan Bank #KO Coca Cola #MCD McDonalds #MSFT Microsoft #MMM Minnesota Mining and Manufacturing (3M) #MO Philip Morris #MRK Merck #PG Procter and Gamble #SBC SBC Communications #T AT&T #UTX United Technologies #WMT WalMart Stores #XOM Exxon Mobil Tech company “Internet bubble” 15-826 (c) C. Faloutsos and J-Y Pan (2017)

36 Companies related to hidden variable 2
Faloutsos 15-826 Companies related to hidden variable 2 B2,j Highest Lowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont #AA Alcoa #AXP American Express #BA Boeing #CAT Caterpillar #C CitiGroup #DIS Disney #DD Du Pont #EK Eastman Kodak #GE General Electric #GM General Motors #HWP Hewlett-Packard #HD Home Depot #HON Honeywell #INTC Intel #IBM International Business Machines #IP International Paper #JNJ Johnson & Johnson #JPM JP Morgan Bank #KO Coca Cola #MCD McDonalds #MSFT Microsoft #MMM Minnesota Mining and Manufacturing (3M) #MO Philip Morris #MRK Merck #PG Procter and Gamble #SBC SBC Communications #T AT&T #UTX United Technologies #WMT WalMart Stores #XOM Exxon Mobil Tech company Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). 15-826 (c) C. Faloutsos and J-Y Pan (2017)

37 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Outline Motivation Formulation PCA and ICA Example applications Find topics in documents Hidden variables in stock prices Visual vocabulary for retinal images Conclusion 15-826 (c) C. Faloutsos and J-Y Pan (2017)

38 (c) C. Faloutsos and J-Y Pan (2017)
15-826 ICA PCA Conclusion ICA: more flexible than PCA in finding patterns. Many applications Find topics and “vocabulary” for images Find hidden variables in time series (e.g., stock prices) Blind source separation Rule of thumb: plot after PCA; if ‘chicken-feet’, try ICA 15-826 (c) C. Faloutsos and J-Y Pan (2017)

39 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia 15-826 (c) C. Faloutsos and J-Y Pan (2017)

40 (c) C. Faloutsos and J-Y Pan (2017)
15-826 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006. Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005. Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004. 15-826 (c) C. Faloutsos and J-Y Pan (2017)

41 (c) C. Faloutsos and J-Y Pan (2017)
15-826 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001 15-826 (c) C. Faloutsos and J-Y Pan (2017)

42 (c) C. Faloutsos and J-Y Pan (2017)
15-826 Software Open source software: ‘fastICA’ Or ‘autosplit’: 15-826 (c) C. Faloutsos and J-Y Pan (2017)


Download ppt "15-826: Multimedia Databases and Data Mining"

Similar presentations


Ads by Google