Download presentation
Presentation is loading. Please wait.
Published byRosemary Whicker Modified over 9 years ago
1
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#1 15-826: Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos
2
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#2 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia
3
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#3 Outline Motivation Formulation PCA and ICA Example applications Conclusion
4
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#4 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted
5
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#5 Motivation: (Q1) Find patterns in data Human would say –Pattern 1: along diagonal –Pattern 2: along vertical axis How to find these automatically? Each point is the measurement at a time tick (total 550 points). Take-off Landing R:L=60:1 R:L=1:1
6
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#6 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) “Internet bubble” Alcoa American Express Boeing Citi Group … “General trend” Stock prices
7
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#7 Motivation: (Q2) Find hidden variables “General trend” “Internet bubble” 0.94 0.030.63 0.64 CaterpillarIntel Hidden variables
8
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#8 Motivation: (Q2) Find hidden variables Hidden variable 1 Hidden variable 2 B 1,CAT B 1,INTC B 2,CAT B 2,INTC ? ?? CaterpillarIntel
9
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#9 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing)
10
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#10 Outline Motivation Formulation PCA and ICA Example applications Conclusion
11
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#11 Formulation: Finding patterns Given n data points, each with m attributes. Find patterns that describe data properties the best.
12
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#12 Linear representation Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns):
13
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#13 Patterns as data “vocabulary” Good pattern ≈ sparse coding (Q) Given data x i ’s, compute h i,j ’s and b i ’s that are “sparse”? Only b 1 is needed to describe x i.
14
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#14 Patterns in motion capture data b2b2 b1b1 n=550 ticks Data matrix LeftRight Basis vectors Hidden variables ?? Sparse ~ non-Gaussian ~ “Independent” “Independent”: e.g., minimize mutual information.
15
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#15 Patterns in motion capture data Data matrix Basis vectors Hidden variables ?? X = (U ) V T Sparse ~ non-Gaussian ~ “Independent”
16
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#16 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion
17
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] Step 1: Data points (matrix) Step 2: Compute patterns Step 3: Interpret patterns … or … … Data mining (Case studies) (Q) What pattern? (Q) Different modalities (Q) How? Video frames Stock prices Text documents
18
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#18 Finding patterns in high- dimensional data PCA finds the hyperplane.ICA finds the correct patterns. Dimensionality reduction details
19
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#19 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion
20
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#20 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –Documents are sorted by date/time –Subsequent documents may have different topics Date/Time Topic 1 Topic 3Topic 1 … …
21
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#21 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –FIND: the document boundaries –AND: the terms of each topic Date/Time … ??
22
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#22 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Date/Time Topic 1 Topic 3Topic 1 …
23
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#23 Topic discovery in documents X [nxm] x i = [1, 5, …, 0] Step 1 Windowing Step 2 X [nxm’] = H [nxm’] B [m’xm’] (1)Find hyperplane (m’=10) (2)Find patterns aaron zoo m=3887 (dictionary size) (n=1659) New stories … (30 words) Step 3 b’ i = [0, 0.7, …, 0.6] aaron zoo animal B’ [10x3887] (Q) What does b’ i mean?
24
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#24 Step 3: Interpret the patterns b’ i = [0, 0.7, …, 0.6] aaron zoo animal Topics found A hidden topic! Top words : “animal”, “zoo”, … IDSorted word list AMckinneSergeantsexualMajorArmi BbombRudolphClinicAtlantaBirmingham CWinfreiBeefTexaOprahCattl DViagraDrugImpotPillDoctor EZamoraGrahamKillFormerJone FMedalOlympGoldWomenGame GPopeCubeCastroCubanVisit HAsiaEconomiJapanEconomAsian ISuperBowlGameTeamRe JPeoplTornadoFloridaRebomb General idea: related to the data attributes m=3887 (dictionary size)
25
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#25 Step 3: Evaluate the patterns IDTrue Topic 1Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2A bomb explodes in a Birmingham, AL abortion clinic 3The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4New impotency drug Viagra is approved for use 5Diane Zamora is convicted of helping to murder her lover’s girlfriend 61998 Winter Olympic games 7The Pope’s historic visit to Cube in Winter 1998 8The economic crisis in Asia 9Superbowl XXXII 10Tornado in Florida AutoSplit finds correct topics. IDSorted word list Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone
26
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#26 Step 3: Evaluate the patterns IDAutoSplit Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone AutoSplit’s topics are better than PCA. IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp
27
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#27 Step 3: Evaluate the patterns AutoSplit A B C D E AutoSplit’s topics are better than PCA. PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. Topic 1 Topic 2
28
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#28 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion
29
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#29 Find hidden variables (DJIA stocks) Weekly DJIA closing prices –01/02/1990-08/05/2002, n=660 data points –A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group
30
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#30 Formulation: Find hidden variables AA 1, …, XOM 1 … AA n, …, XOM n H 11, H 12, …, H 1m … H n1, H n2, …, H nm = B 11, B 12, …, B 1m … B m1, B m2, …, B mm ? ? DateHidden variable Date
31
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#31 Characterize hidden variable by the companies it influences “General trend” “Internet bubble” 0.94 0.630.03 0.64 CaterpillarIntel B 1,CAT B 1,INTC B 2,CAT B 2,INTC
32
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#32 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar0.938512AT&T0.021885 Boeing0.911120WalMart0.624570 MMM0.906542Intel0.638010 Coca Cola0.903858Home Depot0.647774 Du Pont0.900317Hewlett-Packard0.658768 “General trend”
33
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#33 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar0.938512AT&T0.021885 Boeing0.911120WalMart0.624570 MMM0.906542Intel0.638010 Coca Cola0.903858Home Depot0.647774 Du Pont0.900317Hewlett-Packard0.658768 All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.
34
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#34 General trend (and outlier) “General trend” AT&T United Technologies Walmart Exxon Mobil
35
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#35 Companies related to hidden variable 2 B 2,j HighestLowest Intel0.641102Philip Morris-0.194843 Hewlett-Packard0.621159 International Paper -0.089569 GE0.509164Caterpillar0.031678 American Express0.504871 Procter and Gamble 0.109576 Disney0.490529Du Pont0.133337 2000-2001 “Internet bubble” Tech company
36
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#36 Companies related to hidden variable 2 B 2,j HighestLowest Intel0.641102Philip Morris-0.194843 Hewlett-Packard0.621159 International Paper -0.089569 GE0.509164Caterpillar0.031678 American Express0.504871 Procter and Gamble 0.109576 Disney0.490529Du Pont0.133337 Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). Tech company
37
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#37 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion
38
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#38 Conclusion ICA: more flexible than PCA in finding patterns. Many applications –Find topics and “vocabulary” for images –Find hidden variables in time series (e.g., stock prices) –Blind source separation ICA PCA
39
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#39 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia
40
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#40 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006. Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005. Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp.125-130. Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific- Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.
41
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#41 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001
42
CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#42 Software Open source software: ‘fastICA’ http://research.ics.tkk.fi/ica/fastica/ http://research.ics.tkk.fi/ica/fastica/ Or ‘autosplit’: www.cs.cmu.edu/~jypan/software/autosplit_cmu.tar.gz
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.