Data Fusion: why, how and when?
Johan A. Westerhuis (1,2), Age K. Smilde (1,3,4)
1: Swammerdam Institute for Life Sciences, University of Amsterdam
2: Metabolomics Platform, North-West University, Potchefstroom, South Africa
3: Amsterdam Medical Centre, University of Amsterdam
4: Department of Food Science, University of Copenhagen

Explorative analysis of complex Metabolomics data
[Figure: experimental and measurement design; labels: samples, time, metabolites.]

Content
What is fusion?
Goals of fusion
Types of fusion
Low-level fusion: framework; common and distinct variation
Fusion with prior knowledge: idea; challenge test; nutrikinetics
Future perspectives and open issues

What is fusion: multiple data sets
Fusing data from multiple metabolomics (MX) platforms: NMR, LC-MS, GC-MS
Fusing MX with other data: MX, PX, TX
Fusing different compartments:

Statistical HeterospectroscopY (SHY)
Same sample: hierarchical PCA / PLS, STOCSY, Statistical HeterospectroscopY (SHY)
Same individual, different sample: hierarchical PCA, correlation
Same sample or same individual: correlation networks, O2PLS

Hetero-STOCSY
HET-STOCSY: correlation matrix between 1H NMR and P NMR.
SHY (Statistical Heterospectroscopy): correlation matrix between NMR and e.g. LC, GC or CE.
Interpretation: find correlations between peaks in different data blocks.
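The correlation matrix between two data blocks can be sketched as follows. This is a minimal illustration, assuming both blocks are measured on the same samples (rows); the function name `cross_correlation` and the simulated "NMR" and "LC" blocks are hypothetical and not taken from the slides.

```python
import numpy as np

def cross_correlation(X1, X2):
    """Pearson correlation between every column of X1 and every column of X2.

    X1: (n_samples, p1), X2: (n_samples, p2); rows must be the same samples.
    Returns a (p1, p2) matrix.
    """
    Z1 = (X1 - X1.mean(axis=0)) / X1.std(axis=0, ddof=1)
    Z2 = (X2 - X2.mean(axis=0)) / X2.std(axis=0, ddof=1)
    return Z1.T @ Z2 / (X1.shape[0] - 1)

# Simulated data (assumption): one compound is seen on both platforms.
rng = np.random.default_rng(0)
n = 30
shared = rng.normal(size=n)
nmr = rng.normal(size=(n, 5))
lc = rng.normal(size=(n, 4))
nmr[:, 2] += 3 * shared   # NMR peak 2 carries the shared signal
lc[:, 1] += 3 * shared    # LC peak 1 carries the same signal

R = cross_correlation(nmr, lc)
# Position of the strongest correlation between the two blocks.
i, j = np.unravel_index(np.abs(R).argmax(), R.shape)
```

High absolute entries of `R` point to peak pairs in the two blocks that vary together over the samples, which is the interpretation step named on the slide.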

O2PLS: Populus WT and transgenic plants to explore oxidative stress response.
Transcriptomics, proteomics, metabolomics

O2PLS loadings and pathway mapping

Types of Fusion: Integration
[Figure: data blocks measured on shared samples: proteins, metabolites, mRNA, miRNA.]
When we use genome-based metabolic models, pathways, gene-protein connections, interactions with miRNA, etc., I call it integration.

Goals of fusion
Global exploratory analysis: more information about the samples to look at.
Comprehensive biomarkers: use correlations between features of different datasets to discriminate groups.
Common vs distinctive processes: some information is common to two or more datasets, but other information might not be.
Mechanistic modelling: information comes from different sources.

Type of fusion: network visualization
[Figure labels: time series data; metabolomics data; significant changes; RNA-seq; miRNA-seq; DNase-seq.]

Types of fusion: data driven (low-level)
Optimal preprocessing for each block separately.
Methods: consensus PCA, multiblock PCA, hierarchical PCA, DISCO, O2-PLS, JIVE.

Types of fusion: data-driven (mid-level)
First, variable selection based on block-wise arguments: to remove noise and to focus on selected effects.
Then, fusion of the selected variables.
Methods: consensus PCA, multiblock PCA, hierarchical PCA, DISCO, O2-PLS, JIVE.

Types of fusion: data driven (high-level)
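Each block is analysed separately, giving Result 1, Result 2, etc. (e.g. prediction of class label). These block-wise results are then combined into an overall result.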
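One simple way to combine block-wise results is to average the class probabilities predicted from each block. The sketch below assumes that each block already has its own classifier; the function `fuse_high_level` and the probability values are hypothetical, and averaging is only one of several possible combination rules (majority voting is another).

```python
import numpy as np

def fuse_high_level(block_probs, weights=None):
    """Combine per-block class probabilities into one overall prediction.

    block_probs: list of (n_samples, n_classes) arrays, one per block.
    weights: optional per-block weights; equal weights if None.
    """
    P = np.stack(block_probs)                       # (n_blocks, n_samples, n_classes)
    fused = np.average(P, axis=0, weights=weights)  # weighted mean over blocks
    return fused, fused.argmax(axis=1)

# Hypothetical class probabilities from two separately analysed blocks.
p_block1 = np.array([[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]])
p_block2 = np.array([[0.70, 0.30], [0.20, 0.80], [0.75, 0.25]])

fused, labels = fuse_high_level([p_block1, p_block2])
```

For the third sample the two blocks disagree; the overall result follows the more confident block.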

Types of fusion: model-driven
[Figure: model-driven fusion of blood and urine data; labels: dose, baseline level.]

Types of fusion: summary
Visual-driven, data-driven, model-driven.
In this order: increasing effort, increasing complexity, increasing knowledge.

Framework for low-level fusion
Symmetric fusion
Model for each block
Quantification of modes
Association rules
Linking function
Van Mechelen & Smilde, 2010

Quantification of a mode
Notation: superscript = mode, subscript = block.
Mode 1 quantifiers; mode 2 quantifiers.

Block-specific association rules
W = weights. The association rule f can differ between blocks.
[Equation labels: P1, P2, I2.]

Other types of block models (1)
[Figure: time profiles of individuals, each modelled per metabolite (Model Metabolite 1, Model Metabolite 2); parameters shown for HA and sP: xmax, t, ke.]

Linking function L(.)

Possibilities for L(.):
a) Identity link
b) Inclusion link
c) Partial (vertical) identity link
d) Partial (horizontal) identity link
e) Using prior information: time

Default type of Multiblock, hierarchical, ... Component analysis
[Figure: the LC-MS and GC-MS blocks (metabolites) each give block scores and block loadings; the block scores are combined with super weights into super scores, with super loadings.]

Global scores, block scores, super weights
Example: "Sparse multi-block PLSR for biomarker discovery when integrating data from LC–MS and NMR metabolomics", İbrahim Karaman et al., DOI:
[Figure labels: global scores; block scores; super weights; "only 1 block score" for each of two blocks.]

Common and distinct: low-level fusion, idea
[Figure: LC-MS and GC-MS blocks (experimental conditions x metabolites) split into a common part, a distinctive LC-MS part and a distinctive GC-MS part.]

DIStinct and COmmon Partial (vertical) identity link:

Separating sums of squares: Common vs Distinctive
Common over blocks vs distinctive for a single block.
Common: high correlations between both sets of data.
Distinct: high correlations within a single set of data.
Focus on the common part, which could be disturbed by distinctive variation (time pattern, grouping).
The common part may be known (not interesting, e.g. a time effect) but overwhelming, with the subtle effects in the distinctive part (filter).
The common part may contain systematic bias (batch effects), which can be filtered away.

Analysis of common and distinct information

DIStinctive and COmmon components with Simultaneous Component Analysis (DISCO-SCA)
[Figure: blocks X1, X2, X3 share the objects mode and are concatenated along the variables mode.]
First, the data are subjected to an SCA (a PCA on all matrices simultaneously).
Then the SCA loadings are rotated orthogonally towards a simple structure, with zeros for the distinctive parts and free values for the common parts.
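The first step, SCA as a PCA on the concatenated blocks, can be sketched as follows. This is an illustration under stated assumptions: blocks are stored as objects x variables, only column centering is applied, and the function name `sca` and the simulated blocks are hypothetical. The rotation towards the target structure is not included.

```python
import numpy as np

def sca(blocks, n_components):
    """SCA as a PCA on column-centered blocks concatenated along the variables.

    blocks: list of (n_objects, p_b) arrays with the same objects in the rows.
    Returns common scores T, a list of block loadings P_b, and the fraction
    of variation explained in each block.
    """
    centered = [X - X.mean(axis=0) for X in blocks]
    X = np.hstack(centered)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]      # scores shared by all blocks
    P = Vt[:n_components].T                         # loadings for all variables
    splits = np.cumsum([B.shape[1] for B in blocks])[:-1]
    P_blocks = np.split(P, splits, axis=0)          # cut loadings back into blocks
    explained = [1 - np.sum((Xc - T @ Pb.T) ** 2) / np.sum(Xc ** 2)
                 for Xc, Pb in zip(centered, P_blocks)]
    return T, P_blocks, explained

# Simulated example (assumption): three blocks driven by one common source.
rng = np.random.default_rng(1)
common = rng.normal(size=(20, 1))
blocks = [common @ rng.normal(size=(1, p)) + 0.3 * rng.normal(size=(20, p))
          for p in (6, 9, 4)]
T, P_blocks, explained = sca(blocks, n_components=2)
```

The per-block explained variation is the kind of quantity used to judge whether a component is common (high in all blocks) or distinctive (high in one block only).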

DIStinctive and Common components with Simultaneous-Component Analysis (DISCO-SCA)
[Figure: target loading matrix for blocks X1, X2, X3 (variables), with global, local and unique components.]
Rotate the loadings (orthogonally) towards the target and counter-rotate the scores.
With multiple components per group this becomes a massive loading matrix.
Rotating towards the target becomes more and more difficult with a growing number of blocks.
The target may not be reached, so the unique components are not really unique.
Schouteden, M., et al. (2013). Behavior Research Methods, 45(3), 822–33. doi: /s

DISCO: example
Metabolome of E. coli screened under various experimental conditions and fermentation times.
Both GC/MS and LC/MS were used for the same 28 samples of E. coli.
Proof of concept: GC/MS and LC/MS are known to detect common and specific classes of compounds. Do we recover the associated common and specific biological processes?

DISCO: results (1)
[Figure: variation accounted for in each data block by five SCA and DISCO-SCA components.]
Small but not 0: the target is not exactly reached.

DISCO: results (2)
First component: LC-specific. Effect of elevated pH in the early growth phase. Energy metabolism: GXP, UXP and CXP.
Second component: LC-specific. Effect on flavin nucleotides (FAD, FMN) and CoA esters: more abundant with elevated pH / reduced phosphate level at the mid-logarithmic phase; depleted in the wild type strain.
Fourth component: GC-specific. Effect of succinate catabolism, leading to an increase in the concentration of metabolites like fumarate, malate, aspartate and α-ketoglutarate.
Fifth component: common. Linear fermentation time effect.

Joint and Individual Variation Explained (JIVE)
Each block (drawn as variables x objects) is decomposed as
$X_i = P_{Ci} T_C^T + P_{Di} T_{Di}^T + E_i$, for $i = 1, 2, 3$,
that is: joint (common) + individual (distinct) + residual.
Calculate the common part by SCA on $[X_1\ X_2\ X_3]$: $C_i = P_{Ci} T_C^T$.
Calculate the distinct part on $(X_i - C_i)$, orthogonal to the common part: $D_i = P_{Di} T_{Di}^T$.
Lock, E. F., Hoadley, K. A., Marron, J. S., & Nobel, A. B. (2013). The Annals of Applied Statistics, 7(1), 523–542.
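The two calculation steps on this slide can be sketched in a single pass. This is a simplified illustration, not the published JIVE procedure, which alternates between the joint and individual estimates until convergence. The blocks are stored here as objects x variables (the transpose of the slide's drawing), and the names `low_rank` and `jive_one_pass` and the simulated data are hypothetical.

```python
import numpy as np

def low_rank(X, r):
    """Best rank-r approximation of X via the truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def jive_one_pass(blocks, r_joint, r_indiv):
    """One-pass common/distinct split: X_i = C_i + D_i + E_i per block."""
    centered = [X - X.mean(axis=0) for X in blocks]
    # Common scores: SCA on the concatenated blocks.
    U, _, _ = np.linalg.svd(np.hstack(centered), full_matrices=False)
    Tc = U[:, :r_joint]                      # orthonormal common scores
    parts = []
    for Xc, r in zip(centered, r_indiv):
        C = Tc @ (Tc.T @ Xc)                 # common part of this block
        D = low_rank(Xc - C, r)              # distinct part, fitted on the remainder
        parts.append((C, D, Xc - C - D))     # residual closes the decomposition
    return Tc, parts

# Simulated example (assumption): one shared source plus block-specific noise.
rng = np.random.default_rng(2)
shared = rng.normal(size=(15, 1))
blocks = [shared @ rng.normal(size=(1, p)) + rng.normal(size=(15, p))
          for p in (6, 8, 5)]
Tc, parts = jive_one_pass(blocks, r_joint=1, r_indiv=[1, 1, 1])
```

Because each distinct part is fitted on the remainder after removing the common part, its scores are orthogonal to the common scores by construction.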

OnPLS (n>2) Common part
[Figure: for blocks X1 and X2 (objects x variables), the weight vectors w1 and w2 are obtained from the cross-product matrix X1^T X2.]

OnPLS Distinctive (orthogonal) part
$X_i = w_i T_i^T + P_{oi} T_{oi}^T + E_i$, for $i = 1, 2, 3$: common + distinct + residual.
Common scores are related (same colour in the figure), but not the same.
Distinct scores are orthogonal to the common part and cannot predict ALL other blocks.
Trygg, J., & Wold, S. (2003). Journal of Chemometrics, 17(1), 53–64.

Example analysis with OnPLS Common information in transcripts, proteins and metabolites
Wild-type (WT) controls and two different transgenic hybrid aspen plants (AS-SOD9, AS-SOD24) expressing a high-isoelectric-point superoxide dismutase (hipI-SOD) gene.
HipI-SOD has a suggested role in ROS regulation and plant development.
Blocks on shared objects (WT, SOD9, SOD24): X1 transcripts, X2 proteins, X3 metabolites.

Example analysis with OnPLS Common information in transcripts, proteins and metabolites
WT, AS-SOD9 and AS-SOD24 plants in triplicate, normalized on WT.
High expression for proteins related to: ROS detoxification; maintenance of the cells' redox balance.
Srivastava V. et al., OnPLS integration of transcriptomic, proteomic and metabolomic data shows multi-level oxidative stress responses in the cambium of transgenic hipI-superoxide dismutase Populus plants, BMC Genomics (2013), 14:893.

Example data: STATegra project
Hematopoietic stem cells.
Time course of a cell differentiation process in mouse, of the pre-B-cell-like B3 cell line under the controlled induction of the transcription factor Ikaros (which has a role as tumor suppressor and as regulator of cell cycle progression).
Time series: 0, 2, 6, 12, 18, 24 h.
Control vs Ikaros cells.
3 biological repeats.
Measured: mRNA, miRNA, proteomics, metabolomics, ...

JIVE, [miRNA mRNA Proteomics, Metabolomics]
Explained joint variation. The common variation slightly decreases with added blocks.
Values listed per block, for the models in which that block takes part:
miRNA: 0.6669, 0.6697, 0.6740
mRNA: 0.6445, 0.6455, 0.6484
Prot: 0.1701, 0.1758
Met: 0.5686

Distinct variation
Distinct variation shows effects not explained in the common part. Note the different size of the distinct score values!

OnPLS results: explained joint variation
Each dataset has its own scores. One set of global scores can also be obtained. The data-specific scores give a good impression of how well the global scores fit each data set.
Values listed per block, for the models in which that block takes part:
miRNA: 0.57, 0.55, 0.58
mRNA: 0.54, 0.53, 0.63
Prot: 0.30, 0.40
Met: 0.65

Adding prior information to the method
Often prior information is available that could potentially be used to improve the data analysis methods.
Information on the samples: groups; time series (smoothness, kinetics, ...); same individual.
Information on the variables: sparsity (only a few variables are expected to be affected by the treatment); variables work in groups / networks.
Add penalties to scores / loadings.
[Figure labels: X1, X2, PK metabolites.]

Class information added
Adding class information eases group separation, but leads to overfitted situations.
Validation: calculate Tnew without the class block.
[Figure: blocks X1, X2 and the class block with loadings P1, P2, P3; new samples X1,new and X2,new give Tnew using P1 and P2 only.]
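The validation step can be sketched as a least-squares projection of the new samples onto the loadings of the measured blocks only. This is an illustration under assumptions: the model is taken to be of the form [X1 X2 Class] ≈ T [P1; P2; P3]^T, the new samples are already preprocessed with the training parameters, and the function name `scores_without_class_block` is hypothetical.

```python
import numpy as np

def scores_without_class_block(X_new_blocks, P_blocks):
    """Scores for new samples from the measured blocks only.

    X_new_blocks: list of (n_new, p_b) arrays (X1,new and X2,new).
    P_blocks: list of (p_b, n_components) loadings (P1 and P2, not P3).
    Solves X_new ~ T_new P^T for T_new in the least-squares sense.
    """
    X_new = np.hstack(X_new_blocks)
    P = np.vstack(P_blocks)
    T_new_T, *_ = np.linalg.lstsq(P, X_new.T, rcond=None)
    return T_new_T.T

# Synthetic check (assumption): noise-free new samples built from known scores.
rng = np.random.default_rng(3)
T_true = rng.normal(size=(5, 2))
P1 = rng.normal(size=(4, 2))
P2 = rng.normal(size=(3, 2))
T_new = scores_without_class_block([T_true @ P1.T, T_true @ P2.T], [P1, P2])
```

Plotting such scores for test samples next to the training scores shows whether the group separation survives once the class block is left out.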

Real data (mRNA & miRNA)
Gene expression and miRNA data on a common set of GBM (Glioblastoma Multiforme) tumor samples from The Cancer Genome Atlas (TCGA).
mRNA: x 234 (samples)
miRNA: 534 x 234 (samples)
For simplicity, only two groups were selected.

Adding class information
[Figure panels: with class information (block scaling used); no class information (block scaling used).]
With block scaling, the class block gets too much attention.

Validation of Class effect
Class information is present in the training samples (o) but lacking in the test samples (*). The group effect was clearly overfitted due to block scaling.

Validation / Discussion
Do we need fusion?
Did it help to fuse? How do we know that?
At which level do we need to fuse?
Which type of fusion to use?
Are the goals reached?
How reliable are the results?
How to use the classical resampling methods?
Can we do prediction?
…

PART II: Complexities
Relative concentrations, untargeted profiling metabolomics fusion. Issues: block scaling, sample alignment.
Same compartment / targeted metabolomics fusion: real concentrations; mechanistic, kinetic, mass balance; measurement errors.
Genes, proteins, metabolites: are samples representing the same "thing"? Timing issues.

Block scaling
After autoscaling, the common scores will be biased towards the larger block (here LC vs GC).
Block scaling: make the sum of squares (SSQ) of each block equal.
The SSQ of each column is then larger for smaller blocks.
Idea: block-scale towards the rank of the separate blocks, SSQ(Xb) = rank(Xb).
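Both scaling options can be sketched in a few lines. This is a minimal illustration, assuming blocks stored as samples x variables; the function name `block_scale`, the `target` argument and the simulated block sizes are hypothetical.

```python
import numpy as np

def block_scale(X, target="unit"):
    """Autoscale the columns of X, then rescale the whole block.

    target="unit": block SSQ becomes 1 (equal SSQ for every block).
    target="rank": block SSQ becomes rank(X), as in the idea on the slide.
    """
    Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaling
    ssq = np.sum(Xa ** 2)
    goal = 1.0 if target == "unit" else float(np.linalg.matrix_rank(Xa))
    return Xa * np.sqrt(goal / ssq)

# Simulated blocks (assumption): a large LC block and a small GC block.
rng = np.random.default_rng(4)
lc = rng.normal(size=(28, 200))
gc = rng.normal(size=(28, 40))

lc_unit, gc_unit = block_scale(lc), block_scale(gc)
col_ssq_lc = np.sum(lc_unit ** 2, axis=0)   # 1/200 per column
col_ssq_gc = np.sum(gc_unit ** 2, axis=0)   # 1/40 per column

lc_rank = block_scale(lc, target="rank")
```

With equal block SSQ, each column of the small block carries more weight than each column of the large block, which is the side effect noted above. Scaling towards the rank ties the weight of a block to its dimensionality rather than to its number of variables.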

Sample alignment 1
Technical replicates on two platforms are not aligned.
Averages should be used for the fusion.

Sample alignment 2: different biological batches
[Figure: miRNA, mRNA, proteomics and metabolomics blocks.]
Different biological batches for metabolomics: the samples are not aligned.
Fusion analysis can only be done on averages or medians.

Metabolomics?
Although no within-group variation exists in the metabolomics data, the samples are drawn apart in the distinctive plot. This is due to a forced difference in the common scores. Deflation problem?

Case study: LUMC (K. Willems van Dijk) Measured at DCL
16 obese patients and 15 obese patients with diabetes
187 lipids
15 samples replicated
11 quality controls (QC)
43 amino acids
19 quality controls (QC)
Internal standard correction
QC within-batch correction

Two platforms used to measure the same samples
[Figure: amino acid and lipid blocks measured on the same samples, with J1 and J2 metabolites.]
Issues: within-block and between-block correlations?

Simultaneous Component Analysis for low level data fusion
$X = [M_1\ M_2] = T V^T + [E_1\ E_2]$, with $V^T = [P_1^T\ P_2^T]$ (amino acids block $M_1$, lipids block $M_2$).
The loadings point to the important metabolites.
Issues: good coverage of metabolites in the model; scaling / block scaling.

Simultaneous Component Discriminant Analysis
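[Figure: SCA on the amino acid and lipid blocks gives scores (PC1 vs PC2) for obese (To) and obese and diabetic (To+d) patients; LDA on the training scores gives class predictions for the test samples. Model parts: SCA loadings, LDA weights.]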

Block scaling / weighting: use of measurement error
Measurement error depends on: the platform used; the metabolite measured; the concentration of the metabolite; a combination of the above.
Use the measurement error of repeats to quantify the quality of the measurements.
QC sample (single concentration): RSD.
Repeats (multiple concentrations): standard deviation as a function of mean intensity.
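The RSD from repeated QC injections can be computed per metabolite as the standard deviation divided by the mean. The sketch below is an illustration only; the function name `rsd` and the QC intensities are hypothetical.

```python
import numpy as np

def rsd(qc):
    """Relative standard deviation per metabolite.

    qc: (n_qc_injections, n_metabolites) intensities of the same QC sample.
    """
    return qc.std(axis=0, ddof=1) / qc.mean(axis=0)

# Hypothetical QC intensities for two metabolites over three injections.
qc = np.array([[100.0, 10.0],
               [110.0, 10.5],
               [ 90.0,  9.5]])
quality = rsd(qc)
```

Here the first metabolite has an RSD of 10% and the second of 5%, so the second would be regarded as the better measured variable.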

Metabolomics platforms: measurement errors
[Figure: standard deviation (STD) versus mean for selected variables (Var. 15, Var. 365, Var. 118, Var. 213) on GC-MS and LC-MS.]

Measurement error Amino Acids
Most amino acids have low concentrations and also a low error variance. Only L-glutamine is present at high levels. The RL error model is estimated on all amino acids combined.

Weighted Simultaneous Component (Discriminant) Analysis for data fusion
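[Equation: the SCA model for X = [M1 M2] (amino acids, lipids) with scores T, loadings P1 and P2 and residuals E1 and E2, now including the weight term W-1, where W is the measurement error standard deviation. The loadings point to the important metabolites.]
Sorry, forgot the transpose.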

Amino acids log and glog transformation

AUROC results
Complex weighted SCA approaches do NOT outperform simple (G)LOG/SQRT transformations.
It is difficult to get good estimates of the error model using few replicates.
If the true error model is used, weighted methods perform better.

Different platform: same compound
Different RSD
Different number of repeats
Different IS, QC, ...
Different response factor

Simulated IS corrected GC and MS
Simulation settings: LC RSD = 3 x GC RSD; different response factor; heteroscedastic error.
[Figure: simulated responses of the two platforms compared with the true values.]

Log transform to make multiplicative error additive
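Simulation settings: LC RSD = 3 x GC RSD; different response factor.
[Figure: log-transformed responses of the two platforms compared with the true values.]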

Center data to have them on equal footing Calculate weighted average using 1/RSD as weight
Simulation setting: LC RSD = 3 x GC RSD.
[Figure: centered responses of the two platforms, the true values, and the weighted average of GC and LC.]
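The three steps (log transform, centering, weighted average with 1/RSD as weight) can be sketched as follows. This is an illustration under assumptions: the function name `fuse_platforms`, the response factors and the RSD values are hypothetical, and the example is noise-free so that the effect of centering is easy to see.

```python
import numpy as np

def fuse_platforms(gc, lc, rsd_gc, rsd_lc):
    """Fuse two measurements of the same compound over the same samples.

    1. Log transform: multiplicative error becomes additive.
    2. Center: removes the platform-specific response factor.
    3. Weighted average with 1/RSD as weight.
    """
    gc_c = np.log(gc) - np.log(gc).mean()
    lc_c = np.log(lc) - np.log(lc).mean()
    w_gc, w_lc = 1.0 / rsd_gc, 1.0 / rsd_lc
    return (w_gc * gc_c + w_lc * lc_c) / (w_gc + w_lc)

# Hypothetical true levels and response factors; LC RSD = 3 x GC RSD.
true = np.array([1.0, 2.0, 4.0, 8.0])
gc = 2.0 * true      # GC response factor 2
lc = 0.5 * true      # LC response factor 0.5
fused = fuse_platforms(gc, lc, rsd_gc=0.05, rsd_lc=0.15)
```

On the log scale a response factor is a constant offset, so centering puts both platforms on equal footing; with noisy data the weights make the more precise platform (GC here) dominate the average.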

Real concentrations from different platforms
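[Figure: kinetic model in which the dose q0 is absorbed with rate constant ka after a lag time (tlag), giving the amount q(t), and is eliminated with rate constant ke.]
Calculate the AUC (area under the curve: total amount).
Kinetic parameters per individual!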
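A kinetic curve with these parameters and its AUC can be sketched as follows. The model form is an assumption: a one-compartment model with first-order absorption and elimination and ka different from ke, which matches the labels on the slide but is not stated there. The function names and parameter values are hypothetical.

```python
import numpy as np

def amount(t, q0, ka, ke, tlag):
    """Amount q(t) for first-order absorption (ka) and elimination (ke).

    Assumes ka != ke. Nothing happens before the lag time tlag.
    """
    tau = np.clip(t - tlag, 0.0, None)
    return q0 * ka / (ka - ke) * (np.exp(-ke * tau) - np.exp(-ka * tau))

def auc_trapezoid(t, y):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

# Hypothetical parameters for one individual.
t = np.linspace(0.0, 200.0, 20001)
q = amount(t, q0=100.0, ka=1.0, ke=0.1, tlag=0.5)
auc = auc_trapezoid(t, q)
```

For this model the AUC over the full time axis equals q0 / ke, so the numerical value can be checked against 1000 here. Fitting ka, ke and tlag per individual gives the kinetic parameters that can then be fused across platforms.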

Finalizing
Low-level data fusion: many different methods (algorithms).
Sample alignment
Block scaling
Real concentrations?
Same compartment?

Separating variance according to experimental design (ANOVA)
Splitting up sums of squares requires orthogonality (balanced experimental design).
[Figure: three panels of x1 versus x2 with points for Monkey 1 to Monkey 5.]

Global, local and unique information from three blocks
All three methods separate variation into common (global) and distinctive (local + unique). The latter can again be separated into local and truly unique.
