
1 FODAVA-Lead: Dimension Reduction and Data Reduction: Foundations for Visualization
Haesun Park
Division of Computational Science and Engineering, College of Computing
Georgia Institute of Technology
FODAVA Kick-off Meeting, Sep. 2008

2 FODAVA-Lead Proposed Research
Fundamental challenges: two important constraints on any Data and Visual Analytics (DAVA) system:
- Speed: necessary for real-time, interactive use. Even back-end data analysis and transformation operations must appear essentially instantaneous to users; massive data sizes make this challenging.
- Screen space: the number of available pixels is a fundamentally limiting constraint.
Goal: effective representation and efficient transformation of large data sets by data reduction and dimension reduction.

3 FODAVA-Lead Research Goals
Development of fundamental theory and algorithms for data representations and transformations to enable visual understanding:
- Dimension reduction: feature selection by sparse recovery; manifold learning; dimension reduction with prior information / interpretability constraints; ...
- Data reduction: multi-resolution data approximation; anomaly cleaning and detection
- Data fusion: ...
- Fast algorithms: large-scale optimization problems / matrix decompositions; dynamic and time-varying data
- Integration with DAVA systems (e.g., text analysis and Jigsaw)

4 Research Interests (H. Park)
Effective dimension reduction with prior knowledge:
- Dimension reduction for clustered data: Linear Discriminant Analysis (LDA), generalized LDA (LDA/GSVD), Orthogonal Centroid Method (OCM), fast adaptive algorithms
- Dimension reduction for nonnegative data: Nonnegative Matrix Factorization (NMF)
- Applications: text classification, face recognition, fingerprint classification, gene clustering in microarray analysis, ...
Efficient and effective numerical algorithm development and analysis:
- Algorithms for massive data analysis: dimension reduction, clustering and classification, adaptive methods
- Applications: microarray analysis (gene selection, missing value estimation), protein structure prediction, biometric recognition, text analysis

5 2D Representation: Utilize Cluster Structure if Known
2D representation of 700x1000 data with 7 clusters: LDA vs. SVD vs. PCA.
[Figure: three scatter plots, panels LDA+PCA(2), SVD(2), PCA(2)]

6 Dimension Reduction for Clustered Data (LDA/GSVD)
(Howland, Jeon & Park, SIMAX '03; Howland & Park, TPAMI '04)
A = [a_1, ..., a_n] in R^{m x n}, clustered data; N_i = set of items in class i, |N_i| = n_i, r classes in total; c_i = centroid of class i, c = global centroid.
Measures of cluster quality (scatter matrices):
S_b = sum_{1 <= i <= r} sum_{j in N_i} (c_i - c)(c_i - c)^T
S_w = sum_{1 <= i <= r} sum_{j in N_i} (a_j - c_i)(a_j - c_i)^T
S_t = sum_{1 <= i <= n} (a_i - c)(a_i - c)^T
High-quality clusters have small trace(S_w) and large trace(S_b).
Want G: m x q such that trace(G^T S_w G) is minimized and trace(G^T S_b G) is maximized.
This leads to the generalized eigenproblem S_w^{-1} S_b x = lambda x, i.e., S_b x = lambda S_w x, i.e., beta^2 H_b H_b^T x = alpha^2 H_w H_w^T x, which the GSVD solves: U^T H_b^T X = D_1, V^T H_w^T X = D_2.
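
A minimal numerical sketch of this criterion (illustrative names; a small ridge keeps S_w nonsingular here, whereas LDA/GSVD itself handles singular S_w exactly via the GSVD):

```python
# Build S_w and S_b from clustered data and take the leading generalized
# eigenvectors of S_b x = lambda S_w x, following the slide's notation.
import numpy as np
from scipy.linalg import eigh

def lda_directions(A, labels, q=2):
    """A: m x n, columns are data points; labels: length-n class indices.
    Returns G: m x q approximately maximizing trace((G^T S_w G)^-1 (G^T S_b G))."""
    labels = np.asarray(labels)
    m = A.shape[0]
    c = A.mean(axis=1)                        # global centroid
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for k in np.unique(labels):
        Ak = A[:, labels == k]
        ck = Ak.mean(axis=1)                  # class centroid c_i
        D = Ak - ck[:, None]
        S_w += D @ D.T                        # within-class scatter
        d = (ck - c)[:, None]
        S_b += Ak.shape[1] * (d @ d.T)        # between-class scatter
    ridge = 1e-8 * (np.trace(S_w) / m + 1.0)  # assumption: small regularizer
    vals, vecs = eigh(S_b, S_w + ridge * np.eye(m))
    return vecs[:, np.argsort(vals)[::-1][:q]]
```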

7 QRD Preprocessing in Dimension Reduction (Distance-Preserving Dimension Reduction)
For undersampled data A: m x n with m >> n, compute the QR decomposition A = [Q_1 Q_2] [R; 0], i.e., A = Q_1 R, where the columns of Q_1 form an orthonormal basis for range(A) when rank(A) = n.
Dimension reduction of A by Q_1^T: Q_1^T A = R, an n x n matrix.
Q_1^T preserves distances in the L_2 norm:
||a_i||_2 = ||Q_1^T a_i||_2 and ||a_i - a_j||_2 = ||Q_1^T (a_i - a_j)||_2
and in cosine distance: cos(a_i, a_j) = cos(Q_1^T a_i, Q_1^T a_j).
Applicable to PCA, LDA, LDA/GSVD, regularized LDA, Isomap, LLE, ...
Updating and downdating can be done fast, which is important for interactive visualization.
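
A short sketch of this preprocessing step and its distance-preservation properties (sizes and names are illustrative):

```python
# For undersampled A (m x n, m >> n), replace each column a_i by Q1^T a_i
# in R^n. Distances and cosines are preserved because Q1 has orthonormal
# columns and every a_i lies in range(Q1).
import numpy as np

m, n = 10000, 200
A = np.random.rand(m, n)                     # columns are data points

Q1, R = np.linalg.qr(A, mode='reduced')      # thin QR: A = Q1 R, Q1: m x n
B = Q1.T @ A                                 # reduced data (equals R, n x n)

i, j = 0, 1
assert np.isclose(np.linalg.norm(A[:, i] - A[:, j]),
                  np.linalg.norm(B[:, i] - B[:, j]))       # L2 distance kept
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos(A[:, i], A[:, j]), cos(B[:, i], B[:, j]))  # cosine kept
```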

8 Speed-Up with QRD Preprocessing (computation time)

Data       Dim.           #r   LDA/GSVD   regLDA (LDA)   QR+LDA/GSVD   QR+LDA/regGSVD
Text        5896 x 2107    4        8.8           42.2          0.14             0.03
Yale       77760 x 165    15          -              -          0.96             0.22
AT&T       10304 x 400    40          -              -          0.07             0.02
Feret       3000 x 130    10       10.9            9.3          0.03             0.01
OptDigit      64 x 5610   10        8.9           79.6          0                0.02
Isolet       617 x 7797   26       98.1           99.3         36.7              0

9 LDA for Data with Sub-clusters
The unimodal Gaussian assumption for each cluster in LDA may not hold when sub-cluster structure exists, e.g., in facial recognition (sentiment classes with per-person sub-clusters) and cross-language processing (topic classes with per-language sub-clusters).
[Figure: cluster/sub-cluster structure - Sentiment #1/#2 across Person #1/#2/#3; Technology/Sports topics across English/Korean]

Sentiment recognition    PCA     LDA     tensorFaces   Regularized h-LDA
Accuracy (%)             63.53   75.83   69.61         81.95

10 Dimension Reduction for Visualization of Clustered Data
max trace((G^T S_w G)^{-1} (G^T S_b G))  ->  LDA (Fisher '36, Rao '48)
max trace(G^T S_b G)  ->  Orthogonal Centroid (Park et al. '03)
  IN-SPIRE: OC with rank(G) = 2; can be updated easily and nonlinearized
max trace(G^T (S_w + S_b) G)  ->  PCA (Hotelling '33)
max trace(G^T (A A^T) G)  ->  LSI (Deerwester et al. '90)
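
As a sketch, the Orthogonal Centroid solution can be obtained from a thin QR decomposition of the centroid matrix (illustrative code, not the IN-SPIRE implementation):

```python
# max trace(G^T S_b G) over G with orthonormal columns is solved by an
# orthonormal basis of the centroid matrix, computed here via thin QR.
import numpy as np

def orthogonal_centroid(A, labels):
    """A: m x n, columns are data points. Returns G: m x r with G^T G = I."""
    labels = np.asarray(labels)
    C = np.column_stack([A[:, labels == k].mean(axis=1)
                         for k in np.unique(labels)])  # centroid matrix, m x r
    G, _ = np.linalg.qr(C)        # orthonormal basis of range(C)
    return G

# Usage: Y = orthogonal_centroid(A, labels).T @ A gives the r-dimensional
# representation; for a 2-D view one might keep two columns, as with
# rank(G) = 2 in IN-SPIRE.
```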

11 Nonlinear Discriminant Analysis by Kernel Functions
Fingerprint classes: Left Loop, Right Loop, Whorl, Arch, Tented Arch.
Construction of directional images by DFT:
1. Compute directionality in a local neighborhood by FFT
2. Compute the dominant direction
3. Find the core point for unified centering of fingerprints within the same class
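
A rough sketch of steps 1-2, estimating a block's dominant ridge direction from the peak of its 2-D FFT magnitude (an illustrative reconstruction of the idea, not the exact procedure behind the slide):

```python
# The dominant spatial frequency of a local block indicates the ridge
# direction: ridges run perpendicular to the peak frequency vector.
import numpy as np

def dominant_direction(block):
    """block: square 2-D array (a local neighborhood of the image)."""
    F = np.fft.fftshift(np.fft.fft2(block - block.mean()))
    mag = np.abs(F)
    k = block.shape[0] // 2
    mag[k, k] = 0.0                          # ignore any residual DC term
    u, v = np.unravel_index(np.argmax(mag), mag.shape)
    theta = np.arctan2(u - k, v - k)         # angle of peak frequency
    return theta + np.pi / 2.0               # ridges are perpendicular to it
```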

12 Fingerprint Classification Results on NIST Fingerprint Database 4
4000 fingerprint images of size 512 x 512.
KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions.
By KDA/GSVD, the dimension is reduced from 105 x 105 to 4.
(C. Park and H. Park, Pattern Recognition, '06)

Rejection rate (%)            0      1.8     8.5
KDA/GSVD                      90.7   91.3    92.8
kNN & NN (Jain et al. '99)    -      90.0    91.2
SVM (Yao et al. '03)          -      90.0    92.2
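
To illustrate the kernel idea, here is a sketch using the empirical kernel map with a Gaussian kernel; it shows one simple way to nonlinearize a linear discriminant method, but it is not the KDA/GSVD algorithm itself:

```python
# Represent each point by its kernel similarities to all training points,
# then run a linear discriminant method (e.g., the lda_directions sketch
# above) in that space. sigma is an illustrative bandwidth parameter.
import numpy as np

def empirical_kernel_map(A, sigma=1.0):
    """A: m x n, columns are points. Returns K: n x n, K[i, j] = k(a_i, a_j)."""
    sq = (A * A).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)  # pairwise squared dists
    return np.exp(-d2 / (2.0 * sigma ** 2))           # Gaussian (RBF) kernel

# Usage: K = empirical_kernel_map(A); feed K (columns as the new
# representations) and the class labels to a linear discriminant method.
```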

13 Nonnegativity-Preserving Dimension Reduction: Nonnegative Matrix Factorization (NMF)
(Paatero & Tapper '94; Lee & Seung, Nature '99; Pauca et al., SIAM DM '04; Hoyer '04; Lin '05; Berry '06; Kim & Park, Bioinformatics '06; Kim & Park, SIAM Journal on Matrix Analysis and Applications '08; ...)
Given A >= 0, find W >= 0 and H >= 0 such that A ~= WH:
min_{W >= 0, H >= 0} || A - WH ||_F
Why nonnegativity constraints?
- Better approximation vs. better representation/interpretation
- Nonnegativity constraints are often physically meaningful
- Interpretation of analysis results is possible
- Fastest algorithm for NMF, with theoretical convergence
- Can be used as a clustering algorithm
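
A minimal NMF sketch using the classic multiplicative updates of Lee & Seung ('99); the fast, provably convergent algorithm referenced on this slide (Kim & Park's alternating nonnegative least squares) is a different method, and this simpler variant is only illustrative:

```python
# Alternately rescale H and W; the multiplicative form keeps both
# factors nonnegative and decreases || A - WH ||_F.
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Approximate nonnegative A (m x n) as W @ H, W: m x k, H: k x n, both >= 0."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H, stays nonnegative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W, stays nonnegative
    return W, H   # columns of W act like parts/topics; H gives memberships
```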

14 How This Research Will Influence FODAVA
- Better representation and transformation of data: improved theory and methods that more accurately incorporate prior knowledge
- Capacity to process more data faster: fast and scalable algorithms that can represent and transform larger data sets in less time
- Improved visual interaction capability: fast algorithms for efficient handling of dynamic and transient data
- Information synthesis: visual representation of information of different types on one map

15 Developing New Understanding
- Dimension reduction in DAVA requires new modeling, optimization criteria, and algorithms
- Design efficient and effective algorithms for data representation and transformation
- Balance between speed and accuracy
More on community-building plans tomorrow. Thank you!

