CMU SCS (c) C. Faloutsos and J-Y Pan (2013)# : Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#2 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#3 Outline Motivation Formulation PCA and ICA Example applications Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#4 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#5 Motivation: (Q1) Find patterns in data Human would say –Pattern 1: along diagonal –Pattern 2: along vertical axis How to find these automatically? Each point is the measurement at a time tick (total 550 points). Take-off Landing R:L=60:1 R:L=1:1
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#6 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) “Internet bubble” Alcoa American Express Boeing Citi Group … “General trend” Stock prices
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#7 Motivation: (Q2) Find hidden variables “General trend” “Internet bubble” CaterpillarIntel Hidden variables
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#8 Motivation: (Q2) Find hidden variables Hidden variable 1 Hidden variable 2 B 1,CAT B 1,INTC B 2,CAT B 2,INTC ? ?? CaterpillarIntel
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#9 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing)
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#10 Outline Motivation Formulation PCA and ICA Example applications Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#11 Formulation: Finding patterns Given n data points, each with m attributes. Find patterns that describe data properties the best.
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#12 Linear representation Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns):
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#13 Patterns as data “vocabulary” Good pattern ≈ sparse coding (Q) Given data x i ’s, compute h i,j ’s and b i ’s that are “sparse”? Only b 1 is needed to describe x i.
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#14 Patterns in motion capture data b2b2 b1b1 n=550 ticks Data matrix LeftRight Basis vectors Hidden variables ?? Sparse ~ non-Gaussian ~ “Independent” “Independent”: e.g., minimize mutual information.
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#15 Patterns in motion capture data Data matrix Basis vectors Hidden variables ?? X = (U ) V T Sparse ~ non-Gaussian ~ “Independent”
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#16 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] Step 1: Data points (matrix) Step 2: Compute patterns Step 3: Interpret patterns … or … … Data mining (Case studies) (Q) What pattern? (Q) Different modalities (Q) How? Video frames Stock prices Text documents
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#18 Finding patterns in high- dimensional data PCA finds the hyperplane.ICA finds the correct patterns. Dimensionality reduction details
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#19 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#20 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –Documents are sorted by date/time –Subsequent documents may have different topics Date/Time Topic 1 Topic 3Topic 1 … …
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#21 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –FIND: the document boundaries –AND: the terms of each topic Date/Time … ??
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#22 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Date/Time Topic 1 Topic 3Topic 1 …
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#23 Topic discovery in documents X [nxm] x i = [1, 5, …, 0] Step 1 Windowing Step 2 X [nxm’] = H [nxm’] B [m’xm’] (1)Find hyperplane (m’=10) (2)Find patterns aaron zoo m=3887 (dictionary size) (n=1659) New stories … (30 words) Step 3 b’ i = [0, 0.7, …, 0.6] aaron zoo animal B’ [10x3887] (Q) What does b’ i mean?
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#24 Step 3: Interpret the patterns b’ i = [0, 0.7, …, 0.6] aaron zoo animal Topics found A hidden topic! Top words : “animal”, “zoo”, … IDSorted word list AMckinneSergeantsexualMajorArmi BbombRudolphClinicAtlantaBirmingham CWinfreiBeefTexaOprahCattl DViagraDrugImpotPillDoctor EZamoraGrahamKillFormerJone FMedalOlympGoldWomenGame GPopeCubeCastroCubanVisit HAsiaEconomiJapanEconomAsian ISuperBowlGameTeamRe JPeoplTornadoFloridaRebomb General idea: related to the data attributes m=3887 (dictionary size)
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#25 Step 3: Evaluate the patterns IDTrue Topic 1Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2A bomb explodes in a Birmingham, AL abortion clinic 3The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4New impotency drug Viagra is approved for use 5Diane Zamora is convicted of helping to murder her lover’s girlfriend Winter Olympic games 7The Pope’s historic visit to Cube in Winter The economic crisis in Asia 9Superbowl XXXII 10Tornado in Florida AutoSplit finds correct topics. IDSorted word list Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#26 Step 3: Evaluate the patterns IDAutoSplit Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone AutoSplit’s topics are better than PCA. IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#27 Step 3: Evaluate the patterns AutoSplit A B C D E AutoSplit’s topics are better than PCA. PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. Topic 1 Topic 2
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#28 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#29 Find hidden variables (DJIA stocks) Weekly DJIA closing prices –01/02/ /05/2002, n=660 data points –A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#30 Formulation: Find hidden variables AA 1, …, XOM 1 … AA n, …, XOM n H 11, H 12, …, H 1m … H n1, H n2, …, H nm = B 11, B 12, …, B 1m … B m1, B m2, …, B mm ? ? DateHidden variable Date
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#31 Characterize hidden variable by the companies it influences “General trend” “Internet bubble” CaterpillarIntel B 1,CAT B 1,INTC B 2,CAT B 2,INTC
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#32 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard “General trend”
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#33 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#34 General trend (and outlier) “General trend” AT&T United Technologies Walmart Exxon Mobil
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#35 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont “Internet bubble” Tech company
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#36 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). Tech company
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#37 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#38 Conclusion ICA: more flexible than PCA in finding patterns. Many applications –Find topics and “vocabulary” for images –Find hidden variables in time series (e.g., stock prices) –Blind source separation ICA PCA
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#39 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#40 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific- Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#41 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001
CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#42 Software Open source software: ‘fastICA’ Or ‘autosplit’: