Download presentation

Presentation is loading. Please wait.

Published byRosemary Whicker Modified over 2 years ago

1
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#1 15-826: Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos

2
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#2 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia

3
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#3 Outline Motivation Formulation PCA and ICA Example applications Conclusion

4
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#4 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted

5
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#5 Motivation: (Q1) Find patterns in data Human would say –Pattern 1: along diagonal –Pattern 2: along vertical axis How to find these automatically? Each point is the measurement at a time tick (total 550 points). Take-off Landing R:L=60:1 R:L=1:1

6
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#6 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) “Internet bubble” Alcoa American Express Boeing Citi Group … “General trend” Stock prices

7
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#7 Motivation: (Q2) Find hidden variables “General trend” “Internet bubble” 0.94 0.030.63 0.64 CaterpillarIntel Hidden variables

8
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#8 Motivation: (Q2) Find hidden variables Hidden variable 1 Hidden variable 2 B 1,CAT B 1,INTC B 2,CAT B 2,INTC ? ?? CaterpillarIntel

9
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#9 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing)

10
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#10 Outline Motivation Formulation PCA and ICA Example applications Conclusion

11
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#11 Formulation: Finding patterns Given n data points, each with m attributes. Find patterns that describe data properties the best.

12
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#12 Linear representation Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns):

13
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#13 Patterns as data “vocabulary” Good pattern ≈ sparse coding (Q) Given data x i ’s, compute h i,j ’s and b i ’s that are “sparse”? Only b 1 is needed to describe x i.

14
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#14 Patterns in motion capture data b2b2 b1b1 n=550 ticks Data matrix LeftRight Basis vectors Hidden variables ?? Sparse ~ non-Gaussian ~ “Independent” “Independent”: e.g., minimize mutual information.

15
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#15 Patterns in motion capture data Data matrix Basis vectors Hidden variables ?? X = (U ) V T Sparse ~ non-Gaussian ~ “Independent”

16
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#16 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

17
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] Step 1: Data points (matrix) Step 2: Compute patterns Step 3: Interpret patterns … or … … Data mining (Case studies) (Q) What pattern? (Q) Different modalities (Q) How? Video frames Stock prices Text documents

18
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#18 Finding patterns in high- dimensional data PCA finds the hyperplane.ICA finds the correct patterns. Dimensionality reduction details

19
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#19 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

20
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#20 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –Documents are sorted by date/time –Subsequent documents may have different topics Date/Time Topic 1 Topic 3Topic 1 … …

21
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#21 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –FIND: the document boundaries –AND: the terms of each topic Date/Time … ??

22
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#22 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Date/Time Topic 1 Topic 3Topic 1 …

23
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#23 Topic discovery in documents X [nxm] x i = [1, 5, …, 0] Step 1 Windowing Step 2 X [nxm’] = H [nxm’] B [m’xm’] (1)Find hyperplane (m’=10) (2)Find patterns aaron zoo m=3887 (dictionary size) (n=1659) New stories … (30 words) Step 3 b’ i = [0, 0.7, …, 0.6] aaron zoo animal B’ [10x3887] (Q) What does b’ i mean?

24
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#24 Step 3: Interpret the patterns b’ i = [0, 0.7, …, 0.6] aaron zoo animal Topics found A hidden topic! Top words : “animal”, “zoo”, … IDSorted word list AMckinneSergeantsexualMajorArmi BbombRudolphClinicAtlantaBirmingham CWinfreiBeefTexaOprahCattl DViagraDrugImpotPillDoctor EZamoraGrahamKillFormerJone FMedalOlympGoldWomenGame GPopeCubeCastroCubanVisit HAsiaEconomiJapanEconomAsian ISuperBowlGameTeamRe JPeoplTornadoFloridaRebomb General idea: related to the data attributes m=3887 (dictionary size)

25
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#25 Step 3: Evaluate the patterns IDTrue Topic 1Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2A bomb explodes in a Birmingham, AL abortion clinic 3The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4New impotency drug Viagra is approved for use 5Diane Zamora is convicted of helping to murder her lover’s girlfriend 61998 Winter Olympic games 7The Pope’s historic visit to Cube in Winter 1998 8The economic crisis in Asia 9Superbowl XXXII 10Tornado in Florida AutoSplit finds correct topics. IDSorted word list Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone

26
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#26 Step 3: Evaluate the patterns IDAutoSplit Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone AutoSplit’s topics are better than PCA. IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp

27
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#27 Step 3: Evaluate the patterns AutoSplit A B C D E AutoSplit’s topics are better than PCA. PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. Topic 1 Topic 2

28
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#28 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

29
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#29 Find hidden variables (DJIA stocks) Weekly DJIA closing prices –01/02/1990-08/05/2002, n=660 data points –A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group

30
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#30 Formulation: Find hidden variables AA 1, …, XOM 1 … AA n, …, XOM n H 11, H 12, …, H 1m … H n1, H n2, …, H nm = B 11, B 12, …, B 1m … B m1, B m2, …, B mm ? ? DateHidden variable Date

31
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#31 Characterize hidden variable by the companies it influences “General trend” “Internet bubble” 0.94 0.630.03 0.64 CaterpillarIntel B 1,CAT B 1,INTC B 2,CAT B 2,INTC

32
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#32 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar0.938512AT&T0.021885 Boeing0.911120WalMart0.624570 MMM0.906542Intel0.638010 Coca Cola0.903858Home Depot0.647774 Du Pont0.900317Hewlett-Packard0.658768 “General trend”

33
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#33 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar0.938512AT&T0.021885 Boeing0.911120WalMart0.624570 MMM0.906542Intel0.638010 Coca Cola0.903858Home Depot0.647774 Du Pont0.900317Hewlett-Packard0.658768 All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

34
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#34 General trend (and outlier) “General trend” AT&T United Technologies Walmart Exxon Mobil

35
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#35 Companies related to hidden variable 2 B 2,j HighestLowest Intel0.641102Philip Morris-0.194843 Hewlett-Packard0.621159 International Paper -0.089569 GE0.509164Caterpillar0.031678 American Express0.504871 Procter and Gamble 0.109576 Disney0.490529Du Pont0.133337 2000-2001 “Internet bubble” Tech company

36
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#36 Companies related to hidden variable 2 B 2,j HighestLowest Intel0.641102Philip Morris-0.194843 Hewlett-Packard0.621159 International Paper -0.089569 GE0.509164Caterpillar0.031678 American Express0.504871 Procter and Gamble 0.109576 Disney0.490529Du Pont0.133337 Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). Tech company

37
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#37 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

38
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#38 Conclusion ICA: more flexible than PCA in finding patterns. Many applications –Find topics and “vocabulary” for images –Find hidden variables in time series (e.g., stock prices) –Blind source separation ICA PCA

39
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#39 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia

40
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#40 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006. Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005. Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130. Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific- Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

41
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#41 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001

42
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#42 Software Open source software: ‘fastICA’ http://research.ics.tkk.fi/ica/fastica/ http://research.ics.tkk.fi/ica/fastica/ Or ‘autosplit’: www.cs.cmu.edu/~jypan/software/autosplit_cmu.tar.gz

Similar presentations

OK

PCA vs ICA vs LDA. How to represent images? Why representation methods are needed?? –Curse of dimensionality – width x height x channels –Noise reduction.

PCA vs ICA vs LDA. How to represent images? Why representation methods are needed?? –Curse of dimensionality – width x height x channels –Noise reduction.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google