CMU SCS 15-826(c) C. Faloutsos and J-Y Pan (2013)#1 15-826: Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Visual Vocabulary Construction for Mining Biomedical Images Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan Presented by Li An, CIS, TU.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
FUNNEL: Automatic Mining of Spatially Coevolving Epidemics Yasuko Matsubara, Yasushi Sakurai (Kumamoto University) Willem G. van Panhuis (University of.
Covariance Matrix Applications
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Dimension reduction (2) Projection pursuit ICA NCA Partial Least Squares Blais. “The role of the environment in synaptic plasticity…..” (1998) Liao et.
Color Imaging Analysis of Spatio-chromatic Decorrelation for Colour Image Reconstruction Mark S. Drew and Steven Bergner
Uncertainty and Information Integration in Biomedical Applications Claudia Plant Research Group for Bioimaging TU München.
Detecting Faces in Images: A Survey
Dimensionality Reduction PCA -- SVD
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
Independent Component Analysis & Blind Source Separation
Effective Image Database Search via Dimensionality Reduction Anders Bjorholm Dahl and Henrik Aanæs IEEE Computer Society Conference on Computer Vision.
WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara.
Independent Component Analysis (ICA)
CMU SCS Indexing and Mining Biological Images Christos Faloutsos CMU.
Independent Component Analysis & Blind Source Separation Ata Kaban The University of Birmingham.
CMU SCS Indexing and Mining Biological Images Christos Faloutsos CMU.
Independent Component Analysis (ICA) and Factor Analysis (FA)
Data Mining By Archana Ketkar.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
CMU SCS Multimedia and Graph mining Christos Faloutsos CMU.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.
Description of 3D-Shape Using a Complex Function on the Sphere Dejan Vranić and Dietmar Saupe Slides prepared by Nat for CS
1 Web Basics Section 1.1 Compare the Internet and the Web Compare Web sites and Web pages Identify Web browser components Describe types of Web sites Section.
Multimedia Databases (MMDB)
Image recognition using analysis of the frequency domain features 1.
Outline... Feature extraction Feature selection / Dim. reduction Classification...
Chapter 1 Introduction to Data Mining
Independent Component Analysis on Images Instructor: Dr. Longin Jan Latecki Presented by: Bo Han.
Parallel ICA Algorithm and Modeling Hongtao Du March 25, 2004.
Independent Component Analysis Zhen Wei, Li Jin, Yuxue Jin Department of Statistics Stanford University An Introduction.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
Kernel Methods A B M Shawkat Ali 1 2 Data Mining ¤ DM or KDD (Knowledge Discovery in Databases) Extracting previously unknown, valid, and actionable.
Fast Mining and Forecasting of Complex Time-Stamped Events Yasuko Matsubara (Kyoto University), Yasushi Sakurai (NTT), Christos Faloutsos (CMU), Tomoharu.
CMU SCS U Kang (CMU) 1KDD 2012 GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries U Kang Christos Faloutsos School of Computer.
Mining and Querying Multimedia Data Fan Guo Sep 19, 2011 Committee Members: Christos Faloutsos, Chair Eric P. Xing William W. Cohen Ambuj K. Singh, University.
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz.
Kansas State University Department of Computing and Information Sciences CIS 730: Introduction to Artificial Intelligence Friday, 14 November 2003 William.
PCA vs ICA vs LDA. How to represent images? Why representation methods are needed?? –Curse of dimensionality – width x height x channels –Noise reduction.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Streaming Pattern Discovery in Multiple Time-Series Jimeng Sun Spiros Papadimitrou Christos Faloutsos PARALLEL DATA LABORATORY Carnegie Mellon University.
Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science.
Introduction to Independent Component Analysis Math 285 project Fall 2015 Jingmei Lu Xixi Lu 12/10/2015.
An Introduction of Independent Component Analysis (ICA) Xiaoling Wang Jan. 28, 2003.
Technische Universität München Yulia Gembarzhevskaya LARGE-SCALE MALWARE CLASSIFICATON USING RANDOM PROJECTIONS AND NEURAL NETWORKS Technische Universität.
Term Project Proposal By J. H. Wang Apr. 7, 2017.
15-826: Multimedia Databases and Data Mining
Automatic Video Shot Detection from MPEG Bit Stream
Principal Component Analysis (PCA)
LSI, SVD and Data Management
Principal Component Analysis
PCA vs ICA vs LDA.
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Bayesian belief networks 2. PCA and ICA
A Fast Fixed-Point Algorithm for Independent Component Analysis
Discovery of Hidden Structure in High-Dimensional Data
Restructuring Sparse High Dimensional Data for Effective Retrieval
Machine Learning.
Presentation transcript:

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)# : Multimedia Databases and Data Mining Lecture #21: Independent Component Analysis (ICA) Jia-Yu Pan and Christos Faloutsos

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#2 Must-read Material AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto, PAKDD 2004, Sydney, Australia

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#3 Outline Motivation Formulation PCA and ICA Example applications Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#4 Motivation: (Q1) Find patterns in data Motion capture data: broad jumps Left Knee Right Knee Energy exerted Take-off Landing Energy exerted

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#5 Motivation: (Q1) Find patterns in data Human would say –Pattern 1: along diagonal –Pattern 2: along vertical axis How to find these automatically? Each point is the measurement at a time tick (total 550 points). Take-off Landing R:L=60:1 R:L=1:1

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#6 Motivation: (Q2) Find hidden variables Hidden variables (=‘topics’ = concepts) “Internet bubble” Alcoa American Express Boeing Citi Group … “General trend” Stock prices

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#7 Motivation: (Q2) Find hidden variables “General trend” “Internet bubble” CaterpillarIntel Hidden variables

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#8 Motivation: (Q2) Find hidden variables Hidden variable 1 Hidden variable 2 B 1,CAT B 1,INTC B 2,CAT B 2,INTC ? ?? CaterpillarIntel

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#9 Motivation: Find hidden variables There are two sound sources in a cocktail party… =“blind source separation” (= we don’t know the sources, nor their mixing)

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#10 Outline Motivation Formulation PCA and ICA Example applications Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#11 Formulation: Finding patterns Given n data points, each with m attributes. Find patterns that describe data properties the best.

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#12 Linear representation Find vectors that describe the data set the best. Each point: linear combination of the vectors (patterns):

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#13 Patterns as data “vocabulary” Good pattern ≈ sparse coding (Q) Given data x i ’s, compute h i,j ’s and b i ’s that are “sparse”? Only b 1 is needed to describe x i.

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#14 Patterns in motion capture data b2b2 b1b1 n=550 ticks Data matrix LeftRight Basis vectors Hidden variables ?? Sparse ~ non-Gaussian ~ “Independent” “Independent”: e.g., minimize mutual information.

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#15 Patterns in motion capture data Data matrix Basis vectors Hidden variables ?? X = (U  ) V T Sparse ~ non-Gaussian ~ “Independent”

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#16 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#17 Pattern discovery with ICA: AutoSplit [PAKDD 04][WIRI 05] Step 1: Data points (matrix) Step 2: Compute patterns Step 3: Interpret patterns … or … … Data mining (Case studies) (Q) What pattern? (Q) Different modalities (Q) How? Video frames Stock prices Text documents

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#18 Finding patterns in high- dimensional data PCA finds the hyperplane.ICA finds the correct patterns. Dimensionality reduction details

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#19 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#20 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –Documents are sorted by date/time –Subsequent documents may have different topics Date/Time Topic 1 Topic 3Topic 1 … …

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#21 Topic discovery on text streams Data: CNN headline news (Jan.-Jun. 1998) Documents of 10 topics in one single text stream –FIND: the document boundaries –AND: the terms of each topic Date/Time … ??

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#22 Topic discovery on text streams Known: number of topics = 10 Unknown: (1) topic of each document (2) topic description Date/Time Topic 1 Topic 3Topic 1 …

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#23 Topic discovery in documents X [nxm] x i = [1, 5, …, 0] Step 1 Windowing Step 2 X [nxm’] = H [nxm’] B [m’xm’] (1)Find hyperplane (m’=10) (2)Find patterns aaron zoo m=3887 (dictionary size) (n=1659) New stories … (30 words) Step 3 b’ i = [0, 0.7, …, 0.6] aaron zoo animal B’ [10x3887] (Q) What does b’ i mean?

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#24 Step 3: Interpret the patterns b’ i = [0, 0.7, …, 0.6] aaron zoo animal Topics found A hidden topic! Top words : “animal”, “zoo”, … IDSorted word list AMckinneSergeantsexualMajorArmi BbombRudolphClinicAtlantaBirmingham CWinfreiBeefTexaOprahCattl DViagraDrugImpotPillDoctor EZamoraGrahamKillFormerJone FMedalOlympGoldWomenGame GPopeCubeCastroCubanVisit HAsiaEconomiJapanEconomAsian ISuperBowlGameTeamRe JPeoplTornadoFloridaRebomb General idea: related to the data attributes m=3887 (dictionary size)

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#25 Step 3: Evaluate the patterns IDTrue Topic 1Sgt. Gene Mckinney is on trial for alleged sexual misconduct 2A bomb explodes in a Birmingham, AL abortion clinic 3The Cattle Industry in Texas sues Oprah Winfrey for defaming beef 4New impotency drug Viagra is approved for use 5Diane Zamora is convicted of helping to murder her lover’s girlfriend Winter Olympic games 7The Pope’s historic visit to Cube in Winter The economic crisis in Asia 9Superbowl XXXII 10Tornado in Florida AutoSplit finds correct topics. IDSorted word list Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#26 Step 3: Evaluate the patterns IDAutoSplit Amckinnesergeantsexualmajorarmi Bbombrudolphclinicatlantabirmingham Cwinfreibeeftexaoprahcattl DviagradrugImpotpilldoctor Ezamoragrahamkillformerjone AutoSplit’s topics are better than PCA. IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp IDPCA A’mckinnebombwomensexualsergeant B’bombmckinnerudolphclinicatlanta C’winfreiviagratexabeefoprah D’viagrawinfreidrugtexabeef E’zamoraviagrawinfreigrahamolymp

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#27 Step 3: Evaluate the patterns AutoSplit A B C D E AutoSplit’s topics are better than PCA. PCA A’ B’ C’ D’ E’ PCA vectors mix the topics. Topic 1 Topic 2

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#28 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#29 Find hidden variables (DJIA stocks) Weekly DJIA closing prices –01/02/ /05/2002, n=660 data points –A data point: prices of 29 companies at the time Alcoa American Express Boeing Caterpillar Citi Group

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#30 Formulation: Find hidden variables AA 1, …, XOM 1 … AA n, …, XOM n H 11, H 12, …, H 1m … H n1, H n2, …, H nm = B 11, B 12, …, B 1m … B m1, B m2, …, B mm ? ? DateHidden variable Date

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#31 Characterize hidden variable by the companies it influences “General trend” “Internet bubble” CaterpillarIntel B 1,CAT B 1,INTC B 2,CAT B 2,INTC

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#32 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard “General trend”

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#33 Companies related to hidden variable 1 B 1,j HighestLowest Caterpillar AT&T Boeing WalMart MMM Intel Coca Cola Home Depot Du Pont Hewlett-Packard All companies are affected by the “general trend” variable (with weights 0.6~0.9), except AT&T.

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#34 General trend (and outlier) “General trend” AT&T United Technologies Walmart Exxon Mobil

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#35 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont “Internet bubble” Tech company

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#36 Companies related to hidden variable 2 B 2,j HighestLowest Intel Philip Morris Hewlett-Packard International Paper GE Caterpillar American Express Procter and Gamble Disney Du Pont Companies affected by the “internet bubble” variable (with weights 0.5~0.6) are tech-related. Other companies are un-related (weights < 0.15). Tech company

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#37 Outline Motivation Formulation PCA and ICA Example applications –Find topics in documents –Hidden variables in stock prices –Visual vocabulary for retinal images Conclusion

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#38 Conclusion ICA: more flexible than PCA in finding patterns. Many applications –Find topics and “vocabulary” for images –Find hidden variables in time series (e.g., stock prices) –Blind source separation ICA PCA

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#39 Citation AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases, Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos and Masafumi Hamamoto PAKDD 2004, Sydney, Australia

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#40 References Jia-Yu Pan, Andre Guilherme Ribeiro Balan, Eric P. Xing, Agma Juci Machado Traina, and Christos Faloutsos. Automatic Mining of Fruit Fly Embryo Images. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Arnab Bhattacharya, Vebjorn Ljosa, Jia-Yu Pan, Mark R. Verardo, Hyungjeong Yang, Christos Faloutsos, and Ambuj K. Singh. ViVo: Visual Vocabulary Construction for Mining Biomedical Images. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), Masafumi Hamamoto, Hiroyuki Kitagawa, Jia-Yu Pan, and Christos Faloutsos. A Comparative Study of Feature Vector-Based Topic Detection Schemes for Text Streams. In Proceedings of International Workshop on Challenges in Web Information Retrieval and Integration (WIRI), 2005, pp Jia-Yu Pan, Hiroyuki Kitagawa, Christos Faloutsos, and Masafumi Hamamoto. AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In Proceedings of the The Eighth Pacific- Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2004.

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#41 References Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis, John Wiley & Sons, 2001

CMU SCS (c) C. Faloutsos and J-Y Pan (2013)#42 Software Open source software: ‘fastICA’ Or ‘autosplit’: