Download presentation

Presentation is loading. Please wait.

Published byRosemary Whicker Modified about 1 year ago

1
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)# : Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos

2
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#2 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia

3
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#3 Outline Motivation Formulation PCA and ICA Example applications Conclusion

4
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#4 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted

5
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#5 Motivation: (Q1) Find patterns in data Human would say –Pattern 1: along diagonal –Pattern 2: along vertical axis How to find these automatically? Each point is the measurement at a time tick (total 550 points). Take-off Landing R:L=60:1 R:L=1:1

6
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#6 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) “Internet bubble” Alcoa American Express Boeing Citi Group … “General trend” Stock prices

7
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#7 Motivation: (Q2) Find hidden variables “General trend” “Internet bubble” CaterpillarIntel Hidden variables

8
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#8 Motivation: (Q2) Find hidden variables Hidden variable 1 Hidden variable 2 B 1,CAT B 1,INTC B 2,CAT B 2,INTC ? ?? CaterpillarIntel

9
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#9 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing)

10
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#10 Outline Motivation Formulation PCA and ICA Example applications Conclusion

11
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#11 Formulation: Finding patterns Given n data points, each with m attributes. Find patterns that describe data properties the best.

12
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#12 Linear representation Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns):

13
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#13 Patterns as data “vocabulary” Good pattern ≈ sparse coding (Q) Given data x i ’s, compute h i,j ’s and b i ’s that are “sparse”? Only b 1 is needed to describe x i.

14
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#14 Patterns in motion capture data b2b2 b1b1 n=550 ticks Data matrix LeftRight Basis vectors Hidden variables ?? Sparse ~ non-Gaussian ~ “Independent” “Independent”: e.g., minimize mutual information.

15
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#15 Patterns in motion capture data Data matrix Basis vectors Hidden variables ?? X = (U ) V T Sparse ~ non-Gaussian ~ “Independent”

16
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#16 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

17
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] Step 1: Data points (matrix) Step 2: Compute patterns Step 3: Interpret patterns … or … … Data mining (Case studies) (Q) What pattern? (Q) Different modalities (Q) How? Video frames Stock prices Text documents

18
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#18 Finding patterns in high- dimensional data PCA finds the hyperplane.ICA finds the correct patterns. Dimensionality reduction details

19
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#19 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

20
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#20 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –Documents are sorted by date/time –Subsequent documents may have different topics Date/Time Topic 1 Topic 3Topic 1 … …

21
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#21 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –FIND: the document boundaries –AND: the terms of each topic Date/Time … ??

22
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#22 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Date/Time Topic 1 Topic 3Topic 1 …

23
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#23 Topic discovery in documents X [nxm] x i = [1, 5, …, 0] Step 1 Windowing Step 2 X [nxm’] = H [nxm’] B [m’xm’] (1)Find hyperplane (m’=10) (2)Find patterns aaron zoo m=3887 (dictionary size) (n=1659) New stories … (30 words) Step 3 b’ i = [0, 0.7, …, 0.6] aaron zoo animal B’ [10x3887] (Q) What does b’ i mean?

24
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#24 Step 3: Interpret the patterns b’ i = [0, 0.7, …, 0.6] aaron zoo animal Topics found A hidden topic! Top words : “animal”, “zoo”, … IDSorted word list AMckinneSergeantsexualMajorArmi BbombRudolphClinicAtlantaBirmingham CWinfreiBeefTexaOprahCattl DViagraDrugImpotPillDoctor EZamoraGrahamKillFormerJone FMedalOlympGoldWomenGame GPopeCubeCastroCubanVisit HAsiaEconomiJapanEconomAsian ISuperBowlGameTeamRe JPeoplTornadoFloridaRebomb General idea: related to the data attributes m=3887 (dictionary size)

25
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#25 Step 3: Evaluate the patterns IDTrue Topic 1Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2A bomb explodes in a Birmingham, AL abortion clinic 3The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4New impotency drug Viagra is approved for use 5Diane Zamora is convicted of helping to murder her lover’s girlfriend Winter Olympic games 7The Pope’s historic visit to Cube in Winter The economic crisis in Asia 9Superbowl XXXII 10Tornado in Florida AutoSplit finds correct topics. IDSorted word list Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone

26
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#26 Step 3: Evaluate the patterns IDAutoSplit Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone AutoSplit’s topics are better than PCA. IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp

27
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#27 Step 3: Evaluate the patterns AutoSplit A B C D E AutoSplit’s topics are better than PCA. PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. Topic 1 Topic 2

28
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#28 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

29
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#29 Find hidden variables (DJIA stocks) Weekly DJIA closing prices –01/02/ /05/2002, n=660 data points –A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group

30
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#30 Formulation: Find hidden variables AA 1, …, XOM 1 … AA n, …, XOM n H 11, H 12, …, H 1m … H n1, H n2, …, H nm = B 11, B 12, …, B 1m … B m1, B m2, …, B mm ? ? DateHidden variable Date

31
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#31 Characterize hidden variable by the companies it influences “General trend” “Internet bubble” CaterpillarIntel B 1,CAT B 1,INTC B 2,CAT B 2,INTC

32
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#32 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard “General trend”

33
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#33 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

34
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#34 General trend (and outlier) “General trend” AT&T United Technologies Walmart Exxon Mobil

35
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#35 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont “Internet bubble” Tech company

36
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#36 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). Tech company

37
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#37 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

38
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#38 Conclusion ICA: more flexible than PCA in finding patterns. Many applications –Find topics and “vocabulary” for images –Find hidden variables in time series (e.g., stock prices) –Blind source separation ICA PCA

39
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#39 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia

40
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#40 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific- Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

41
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#41 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001

42
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#42 Software Open source software: ‘fastICA’ Or ‘autosplit’:

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google