Canadian Bioinformatics Workshops
Module 4: Backgrounder in Statistical Methods
Jeff Xia
Informatics and Statistics for Metabolomics
May 26-27, 2016
Yesterday (figure: a raw NMR spectrum, x-axis in ppm)
Today (figure: a PCA scores plot, PC1 vs. PC2, separating PAP, ANIT, and control samples)
Learning Objectives
- Learn about summary statistics and normal distributions
- Learn about univariate statistics (t-tests and ANOVA)
- Learn about p-value calculation and hypothesis testing
- Learn about multivariate statistics (clustering, PCA and PLS-DA)
What is Statistics?
Statistics is a way to get information from (usually big and complex) data:
Data → Statistics → Information
Main Components
Input: metabolomics data
- A matrix containing numerical values
- Meta-data (data about data): class labels, experimental factors
Output: useful information
- Significant features
- Clustering patterns
- Rules (for prediction)
- Models
- ...
Types of Data
The data matrix (X) and the meta-data (Y)
- Quantitative: discrete or continuous
- Categorical: binary, nominal, or ordinal
Quantitative Data
The data matrix
- Continuous: microarray intensities, metabolite concentrations
- Discrete: read counts
The two types need to be treated with different statistical models.
Categorical Data
- Binary data: 0/1, Y/N, Case/Control
- Nominal data (> two groups): Single = 1, Married = 2, Divorced = 3, Widowed = 4; order is not important
- Ordinal data: Low < Medium < High; order matters
Some Jargon (I)
- Data are the observed values of a variable.
- A variable is some characteristic of a population or sample, e.g. a gene or a compound.
- The values of a variable are its range of possible values, e.g. measurements of gene expression or metabolite concentration.
- The dimension of a dataset is the number of variables it contains; omics data are usually called high-dimensional data.
Some Jargon (II)
- Univariate: measuring one variable per subject
- Bivariate: measuring two variables per subject
- Multivariate: measuring many variables per subject
Key Concepts in Statistics
Issues when making inferences
From samples to population
So how do we know whether the effect observed in our sample was genuine? We don't. Instead we use p values to indicate our level of certainty that our results represent a genuine effect, i.e. one caused by a true effect present in the whole population.
P values
- A p value is the probability that the observed result was obtained by chance, i.e. when the null hypothesis is true (there is no difference between the two groups)
- Each test result (e.g. a t value) is associated with a particular p value
- The α level is set a priori (usually 0.05); it is essentially an acceptance threshold
- If p < α, we reject the null hypothesis and accept the experimental hypothesis, concluding that we are 95% certain our experimental effect is genuine
- If p > α, we fail to reject the null hypothesis: we cannot conclude there is a significant difference between the two groups
More on this topic later.
Summary/Descriptive Statistics
How do we describe the data?
- Central tendency (the center of the data): mean, median, mode
- Variability (the spread of the data): variance, standard deviation
- Relative standing (the distribution of data within the spread): quantiles, range, IQR (inter-quartile range)
Mean, Median, Mode
- Mean: the average value; affected by extreme values in the distribution
- Median: the "middlemost" value; usually halfway between the mode and the mean
- Mode: the most common value
In a normal distribution the mean, median, and mode are all equal; in skewed distributions they are unequal.
Mean, Median & Mode (figure: a skewed distribution with the mode, median, and mean marked)
Variance, SD and SEM
- Variance (σ²): the average squared distance to the center (mean). Because it is squared, its unit is not meaningful, and outliers contribute disproportionately.
- SD (σ), standard deviation: the square root of the variance. "Standardized", so its unit is meaningful.
- SEM, standard error of the mean: quantifies the precision of the mean, taking into account both the SD and the sample size.
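To make these three quantities concrete, here is a minimal Python sketch using NumPy with made-up concentration values. Note it uses the sample formulas (dividing by n − 1), whereas the slide equations later divide by N:

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])  # toy metabolite concentrations

mean = x.mean()
var = x.var(ddof=1)          # sample variance (divides by n - 1)
sd = x.std(ddof=1)           # sample standard deviation
sem = sd / np.sqrt(len(x))   # standard error of the mean

print(f"mean={mean:.3f}, variance={var:.3f}, SD={sd:.3f}, SEM={sem:.3f}")
```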
Quantiles
- The 1st quartile (Q1) is the value below which 25% of the observations fall (75% are larger)
- Q2 is the same as the median (50% smaller, 50% larger)
- Q3 is the value above which only 25% of the observations fall
- The range is minimum to maximum
Mean vs. Variance
Most univariate tests compare the difference in means, assuming equal variances.
Univariate Statistics
Univariate means a single variable. If you measure a population using some single measure such as height, weight, test score, or IQ, you are measuring a single variable. If you plot that single variable over the whole population, recording the frequency at which each value occurs, you will get the following:
A Bell Curve (figure: frequency vs. height), also called a Gaussian or normal distribution
Features of a Normal Distribution
- Symmetric distribution
- Has an average or mean value (μ) at the centre
- Has a characteristic width called the standard deviation (σ)
- The most common type of distribution known
Normal Distribution
Almost any set of biological or physical measurements will display some variation, and these will almost always follow a normal distribution. The larger the set of measurements, the more "normal" the curve. The minimum number of measurements needed to get an approximately normal distribution is about 30-40.
Some Equations
Mean: μ = Σxᵢ / N
Variance: σ² = Σ(xᵢ − μ)² / N
Standard deviation: σ = √( Σ(xᵢ − μ)² / N )
Standard Deviation (σ) (figure: areas under the normal curve, with the regions containing 95% and 99% of values marked)
Different Distributions (figure: unimodal vs. bimodal)
Skewed Distribution
Resembles an exponential or Poisson-like distribution, with lots of extreme values (outliers) far from the mean or mode. It is hard to do useful statistical tests with this type of distribution.
Fixing a Skewed Distribution
A skewed or exponentially decaying distribution can be transformed into a "normal" or Gaussian distribution by applying a log transformation. This brings the outliers a little closer to the mean because it rescales the x-variable, and it makes the distribution much more Gaussian.
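A minimal Python sketch of this idea, using synthetic lognormal (right-skewed) data and SciPy's skewness measure; the exact numbers are illustrative only:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data

print(f"skewness before: {skew(x):.2f}")         # strongly positive
print(f"skewness after:  {skew(np.log(x)):.2f}") # near zero: ~Gaussian
```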
Log Transformation (figure: a skewed distribution on a linear scale becomes a normal distribution after log transformation)
Log Transformation (Real Data)
Centering, scaling, and transformations
BMC Genomics. 2006; 7: 142
The Result (figure: two height distributions plotted together). Are they different?
t-tests
Compare the means between 2 samples/conditions:
- If 2 samples are taken from the same population, then they should have fairly similar means
- If 2 means are statistically different, then the samples are likely to be drawn from 2 different populations, i.e. they really are different
Types of t-tests

                   Independent samples           Related samples
Parametric         Independent-samples t-test    Paired-samples t-test
Non-parametric     Mann-Whitney U-test           Wilcoxon test
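A minimal sketch of all four tests using SciPy on synthetic data (group values and sizes are made up; for the paired tests the two arrays are assumed to be measurements on the same subjects):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(5.0, 1.0, 20)  # e.g. control group
b = rng.normal(6.0, 1.0, 20)  # e.g. treated group

print(stats.ttest_ind(a, b))     # independent-samples t-test (parametric)
print(stats.ttest_rel(a, b))     # paired-samples t-test (parametric)
print(stats.mannwhitneyu(a, b))  # Mann-Whitney U-test (non-parametric)
print(stats.wilcoxon(a, b))      # Wilcoxon signed-rank test (non-parametric, paired)
```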
Paired t-tests reduce variance (figure)
Another approach to group differences
Analysis Of VAriance (ANOVA)
- Uses variances, not means
- Handles multiple groups (> 2)
- H0: no differences between groups
- H1: differences between groups
- Based on the F distribution (F-tests)
Calculating F
F = the between-group variance divided by the within-group variance (the model variance / error variance). For F to be significant, the between-group variance should be considerably larger than the within-group variance: a large value of F indicates relatively more difference between groups than within groups, which is evidence against H0.
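A minimal sketch of a one-way ANOVA on three synthetic groups using SciPy; the group means and sizes are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(5.0, 1.0, 15)
g2 = rng.normal(5.5, 1.0, 15)
g3 = rng.normal(7.0, 1.0, 15)

f_stat, p_value = stats.f_oneway(g1, g2, g3)  # one-way ANOVA across 3 groups
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```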
What can be concluded from a significant ANOVA?
There is a significant difference between the groups, but NOT where this difference lies. Finding exactly where the differences lie requires further statistical analyses: post-hoc tests.
Different types of ANOVA
- One-way ANOVA: one factor with more than 2 levels
- Factorial ANOVAs: more than 1 factor (e.g. two-way, three-way). These allow for possible interactions between factors as well as main effects; for example, 2 factors with 2 levels each gives a 2 x 2 factorial design
- Mixed-design ANOVAs: some factors independent, others related
Conclusions
- t-tests assess whether two group means differ significantly; they can compare two samples, or one sample to a given value
- ANOVAs compare more than two groups, or more complicated scenarios; they use variances instead of means
Understanding P values
The p-value
The p-value is the probability of seeing a result as extreme or more extreme than the result from a given sample, if the null hypothesis is true. How do we calculate a p value?
How do we compute a p value
(figure: a normal null distribution with the 95% and 99% critical regions marked)
Non-normal distribution
What if we don't know the distribution? If the only thing we know is that the data do not follow a normal distribution, models based on the normal distribution will perform poorly. Common approaches:
- Normalization
- Non-parametric tests
- Permutation (i.e. empirical p values)
Normalization
A normal distribution is a basic assumption of many tests (t-tests, ANOVA, etc.). Common fixes:
- Log transformation
- Z-score (auto-scaling)
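A minimal sketch of z-score auto-scaling with NumPy; the matrix values are made up, with rows as samples and columns as metabolites:

```python
import numpy as np

X = np.array([[10.0, 200.0],
              [12.0, 180.0],
              [11.0, 260.0]])  # rows = samples, columns = metabolites

# Auto-scaling (z-score): centre each column, divide by its SD
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.mean(axis=0))         # ~0 for each column
print(Z.std(axis=0, ddof=1))  # 1 for each column
```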
Boxplots of “standardized” data
Non-parametric tests
Based on ranks, so some information is lost: Sample 1 (1000, 1, ...) and Sample 2 (1.1, 1.09, ...) are compared only by rank. Sometimes useful, but not as powerful as parametric tests when the data are normally distributed or can be normalized.
Empirical P values
The p values mentioned so far are based on well-defined models (e.g. normal distributions). What if we don't know the distribution, and the only thing we know is that the data do not follow a normal distribution? A model based on the normal distribution will perform poorly. Instead, we can derive the null distribution from the data itself (the dataset needs to be relatively big), then calculate the p value from it; this is known as an empirical p value.
Basic Principle
1. Under the null hypothesis, all data come from the same distribution
2. Calculate our statistic, such as the mean difference, from the original data
3. Shuffle the data with respect to group labels and recalculate the statistic (mean difference)
4. Repeat step 3 many times
5. Find out where our original statistic lies within the resulting null distribution
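A minimal Python sketch of this permutation procedure on synthetic case/control data; the group sizes, effect size, and number of permutations are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
case = rng.normal(5.5, 1.0, 20)
control = rng.normal(5.0, 1.0, 20)

observed = case.mean() - control.mean()
pooled = np.concatenate([case, control])

n_perm = 1000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)               # shuffle group labels
    diff = perm[:20].mean() - perm[20:].mean()   # recompute the statistic
    if abs(diff) >= abs(observed):               # two-sided comparison
        count += 1

p_emp = (count + 1) / (n_perm + 1)               # +1 avoids p = 0
print(f"observed diff = {observed:.3f}, empirical p = {p_emp:.4f}")
```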
A simple example
To find out whether there is a mean difference between case vs. control. (The slide shows a table of case and control values with their group means; the observed mean difference is 0.541.)
Permutation One
The case/control labels are randomly swapped and the statistic recomputed. (The slide shows the same values with shuffled labels; the permuted mean difference is 0.329.)
Permutations
Repeat many, many times (e.g. 1000 times), recording the permuted statistic each time.
Compute empirical p value
Out of 1000 permutations:
- If three of the permuted mean differences are at least as large as the original one, p = 3/1000 = 0.003
- If none of the permuted mean differences is bigger than the original one, report p < 1/1001 (the extra 1 in the denominator prevents a p value of exactly zero)
General Advantages
- Does not rely on distributional assumptions
- Corrects for hidden selection
- Corrects for hidden correlation
Disadvantage? Computationally intensive.
Question
If you were asked to compute empirical p values for a multiple-group comparison (i.e. ANOVA), what statistic would you calculate from the permuted samples?
Hypothesis Testing & multiple testing issues
Hypothesis Testing
We start by assuming the null hypothesis (H0) is true, and asking: "How likely is the result we got from our sample?" If that probability (the p-value) is small, it suggests the observed result cannot easily be explained by chance.
Hypothesis Testing (more details)
Goal: make statements regarding unknown population parameter values based on sample data.
Elements of a hypothesis test:
1. Assume that samples are randomly selected from the population
2. State the null hypothesis (H0) that the distributions of the values in the two populations are the same
3. Define a threshold (α) for declaring a p value significant
4. Select an appropriate statistical test and calculate the p value
5. If the p value is less than α, conclude that the difference is statistically significant and reject the null hypothesis; otherwise, conclude that the difference is not statistically significant and declare that the null hypothesis cannot be rejected
Hypothesis Testing & P Value
The p value is the probability of obtaining a test statistic at least as extreme as the one actually observed, if the null hypothesis is true. One "rejects the null hypothesis" when the p value is less than the significance level α, which is often 0.05 or 0.01. When the null hypothesis is rejected, the result is said to be statistically significant.
Multiple Testing Issues
Omics data are high-dimensional: 100s to ~10,000s of variables mean lots of hypothesis tests. Performing t-tests on a typical microarray dataset might mean performing 10,000 separate hypothesis tests. If we use a standard p value cut-off of 0.05, we would expect 500 (10,000 × 0.05) genes to be deemed "significant" by chance alone.
Multiple Testing Correction (I)
Family-Wise Error Rate (FWER), e.g. Bonferroni correction:
Corrected p-value = p-value × n (the number of genes tested), required to be < 0.05.
As a consequence, if testing 1,000 genes at a time, the highest accepted individual p value is 0.05/1000 = 0.00005, making the correction very stringent. With a family-wise error rate of 0.05 (i.e., the probability of at least one error in the family), the expected number of false positives will be 0.05.
Multiple Testing Correction (II)
False Discovery Rate (FDR), e.g. Benjamini-Hochberg:
An FDR of 0.05 means that 5% of the genes called significant are expected to be false positives; e.g., if 100 genes are declared differentially expressed, about 5 of them are expected to be false positives. By controlling the FDR, one can choose how many of the subsequent follow-up experiments one is willing to have be in vain.
High-dimensional data
So far, our analyses deal with a single variable at a time (univariate analysis):
- t-tests: one variable, two groups
- ANOVA: one variable, > 2 groups
- First analyze a single variable, then apply the procedure to all variables, and finally adjust for multiple testing
Visualization is limited to three dimensions. How can we analyze and visualize high-dimensional data?
Multivariate Statistics
Multivariate means multiple variables. If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, and eye colour, you are performing multivariate statistics. Multivariate statistics requires more complex multidimensional analyses or dimension-reduction methods.
Normal distribution – a single variable
Bivariate Normal
Trivariate Normal
The Reality
Most statistical methods were developed before the omics era and are designed for single or very few variables (typically fewer than 10): t-tests, ANOVA, linear/logistic regression. These methods assume there are more samples (n) than variables (p), i.e. n > p, for parameter estimation. In omics data, n << p.
The Practice
Omics data analysis has developed beyond the scope of classical statistics. It borrows a lot of "strength" from several other fields that deal with the large and complex data of the digital age:
- Machine learning
- Chemometrics
- Visualization
Machine Learning
- Unsupervised learning: explore the data to find intrinsic structures in them, regardless of whether those structures relate to the class labels.
- Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns can then be used to predict the values of the target attribute in future data instances.
Unsupervised Learning methods for high-dimensional data
- Clustering: organize the 1000s of variables into blocks; variables in each block are more homogeneous, and each block can be treated as a unit
- Dimension reduction: reduce the high-dimensional data into low dimensions, i.e. from 1000s of variables to 2-3 variables, e.g. principal component analysis
Clustering
Definition: a process by which objects that are logically similar in characteristics are grouped together.
Clustering Requires...
- A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
- A threshold value with which to decide whether an object belongs with a cluster
- A way of measuring the "distance" between two clusters
- A cluster seed (an object to begin the clustering process)
Two common clustering algorithms
- K-means or partitioning methods: divide a set of N objects into M clusters, with or without overlap
- Hierarchical methods: produce a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
K-means clustering
1. Make the first object the centroid for the first cluster
2. For the next object, calculate its similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing cluster and re-compute the centroid; otherwise use the object to start a new cluster
4. Return to step 2 and repeat until done
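In practice one rarely hand-codes this loop. Below is a minimal sketch using scikit-learn's KMeans on synthetic data; note that scikit-learn implements the standard iterative centroid-refinement version of k-means rather than the exact threshold-based procedure described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Two synthetic groups of samples in 5 dimensions
X = np.vstack([rng.normal(0, 1, (20, 5)),
               rng.normal(4, 1, (20, 5))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per sample
print(km.cluster_centers_)  # final centroids
```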
K-means clustering – initial clusters (figure: choose the first objects, compute their centroids, then test each new object and join it to the nearest cluster)
Nearest Neighbor Algorithm
The nearest neighbour algorithm is an agglomerative (bottom-up) approach. It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
Hierarchical Clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters), using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2
Key Parameters: similarities
- Similarity between samples: calculate the similarity between all possible pairs of profiles; the two most similar clusters are grouped together to form a new cluster
- Similarity between clusters: calculate the similarity between the new cluster and all remaining clusters
Similarity Measurements
- Euclidean distance: measures absolute differences between profiles
- Pearson correlation: measures trend similarity, ranging from −1 to +1
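A minimal sketch contrasting the two measures, using SciPy on two made-up profiles that share a trend but differ in magnitude:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # same trend as a, different magnitude

print(euclidean(a, b))    # large: absolute values differ
print(pearsonr(a, b)[0])  # 1.0: identical trend
```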
Clustering: Single Linkage
Dissimilarity between two clusters = the minimum dissimilarity between members of the two clusters. Tends to generate "long chains".
Clustering: Complete Linkage
Dissimilarity between two clusters = the maximum dissimilarity between members of the two clusters. Tends to generate "clumps".
Clustering: Average Group Linkage
Dissimilarity between two clusters = the distance between the two cluster means.
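A minimal sketch of hierarchical clustering under these three linkage rules, using SciPy on synthetic data; the cluster count and data values are arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 4)),
               rng.normal(5, 1, (10, 4))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters
    print(method, labels)
```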
Hierarchical clustering & heatmap
(figure: find the two most similar metabolite expression levels or curves, find the next closest pair, and iterate; the nested merges form the dendrogram beside the heatmap)
Principal Component Analysis (PCA)
Project high-dimensional data into lower dimensions that capture the most variance in the data.
Assumption: the main directions of variance ≈ the major data characteristics.
Visualizing PCA
PCA of a "bagel":
- One projection produces a wiener (hot dog)
- Another projection produces an "O"
- The "O" projection captures most of the variation and has the largest eigenvalue (PC1)
- The wiener projection is PC2 and gives depth information
PCA - The Details
PCA involves calculating the eigenvalue (singular value) decomposition of the data covariance matrix. PCA is an orthogonal linear transformation: it transforms the data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (the 1st PC), the second greatest variance on the 2nd PC, and so on. Given samples s1 ... sk measured on variables x1 ... xn, each score is a loading-weighted combination of the original variables (scores = loadings × data):
t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
where the scores t are uncorrelated (orthogonal) eigenvectors and p are the loadings.
PCA - The Details
From k original variables x1, x2, ..., xk, produce k new variables t1, t2, ..., tk:
t1 = a11x1 + a12x2 + ... + a1kxk
t2 = a21x1 + a22x2 + ... + a2kxk
...
tk = ak1x1 + ak2x2 + ... + akkxk
PCA - The Details
Such that:
- the tk's are uncorrelated (orthogonal)
- t1 explains as much as possible of the original variance in the data set
- t2 explains as much as possible of the remaining variance, and so on
(figure: a data cloud with the 1st principal component t1 and the 2nd principal component t2 as orthogonal axes)
Principal Components Analysis on:
- Covariance matrix: variables must be in the same units; emphasizes variables with the most variance
- Correlation matrix: variables are standardized (mean 0.0, SD 1.0); variables can be in different units; all variables have the same impact on the analysis
Eigenfaces
A famous application of PCA is face recognition, where the principal components are called eigenfaces. Eigenfaces look somewhat like generic faces.
Widely used in metabolomics
(figure: hundreds of peaks reduced to 2 components; the scores plot of PC1 vs. PC2 separates the PAP, ANIT, and control groups)
PCA Loadings Plot
The loadings plot shows how much each variable (metabolite) contributed to the different principal components. Variables at the extreme corners contribute most to the separation seen in the scores plot.
Scores & Loadings
PCA Details/Advice
In some cases PCA will not identify any clear clusters or obvious groupings no matter how many components are used. If so, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished. As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, it is probably not worthwhile using other statistical techniques to try to separate them.
PCA Summary
Rotates a multivariate dataset into a new configuration that is easier to interpret. Purposes:
- Data overview
- Outlier detection
- Looking at relationships between variables
PLS-DA
When the experimental effects are subtle or moderate, PCA will not show good separation patterns. PLS-DA is a supervised method: it maximizes the covariance between the data matrix (X) and the class labels (Y). It therefore tends to always produce some separation pattern with respect to the conditions.
PCA vs. PLS-DA
Use PLS-DA with Caution
PLS-DA is based on regression: it first converts the class labels into numbers and then performs PLS regression between the data matrix and the numerical Y. PLS-DA is susceptible to over-fitting, producing patterns of separation even for data randomly drawn from the same population. You therefore need cross-validation and permutation tests.
Cross Validation
Goal: test whether your model can predict class labels for new samples, by splitting the dataset into a training set and a test set (figure).
Common Splitting Strategies
Leave-one-out (n-fold cross-validation) (figure)
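scikit-learn has no PLS-DA class as such; a common workaround, sketched below on synthetic data, is to code the class labels as numbers and use PLSRegression with a 0.5 decision threshold inside leave-one-out cross-validation. All data and parameter choices here are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (15, 50)),
               rng.normal(0.8, 1.0, (15, 50))])
y = np.array([0] * 15 + [1] * 15)  # class labels coded as numbers

correct = 0
for train, test in LeaveOneOut().split(X):
    pls = PLSRegression(n_components=2).fit(X[train], y[train])
    pred = (pls.predict(X[test]).ravel() > 0.5).astype(int)  # 0.5 threshold
    correct += int(pred[0] == y[test][0])

print(f"LOOCV accuracy: {correct / len(y):.2f}")
```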
Components and Features
Cross-validation is also used to determine the optimal number of components needed to build the PLS-DA model. Three common performance measures:
- Sum of squares captured by the model (R2)
- Cross-validated R2 (also known as Q2)
- Prediction accuracy
Permutation Tests
Goal: test whether your model is significantly different from null models.
1. Randomly shuffle the class labels (y) and build the (null) model between the new y and X
2. Test whether similar patterns of separation still appear
3. Repeat to compute empirical p values
If the result is similar to the permuted results (i.e. the null models), then we cannot say y and X are significantly correlated.
Compute empirical p values
Over 1000 permutations, if none of the permuted statistics is bigger than the original one, report p < 1/1001 (the extra 1 in the denominator prevents a p value of exactly zero).
PLS-DA VIP Score
Variable Importance in Projection (VIP) scores estimate the importance of each variable in the projection used in a PLS-DA model, and are often used for variable selection. A variable with a VIP score > 1 can be considered important in a given model; variables with VIP scores < 1 are less important and may be good candidates for exclusion from the model.
VIP Plots
116
Assessing Regression Model Performance
How we assess a model depends on how we want to use it: to understand the relationship between the predictors and the response, or to predict future observations' responses. One basic measure is the root mean squared error:
RMSE = √( Σ(yᵢ − ŷᵢ)² / n )
Assessing Classification Model Performance
For balanced data:
- Accuracy: 9/13 correct => 69% accuracy
- Error rate: 1 − accuracy => 31%
Accuracy is not suitable for imbalanced data. In a population where cancer incidence is low (~5 cases in 1000 people), a classifier that predicts all people to be healthy (majority vote) is 99.5% accurate while detecting no cases.
Evaluating Performance
Basic concepts: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN).
- Sensitivity (Sn) = true positive rate: Sn = TP/(TP + FN)
- Specificity (Sp) = true negative rate: Sp = TN/(TN + FP)
An Example (figure: test-result distributions for control and disease groups; patients above the cutoff are called "positive" and those below are called "negative", giving the TP, TN, FP, and FN counts)
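A minimal sketch computing sensitivity and specificity from hypothetical classifier calls with NumPy:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])  # 1 = disease, 0 = control
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])  # classifier calls

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

print(f"Sn = {tp / (tp + fn):.2f}")  # sensitivity, true positive rate
print(f"Sp = {tn / (tn + fp):.2f}")  # specificity, true negative rate
```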
ROC curves
ROC = Receiver Operating Characteristic, a historic name from radar studies. Very popular in biomedical applications, to assess the performance of classifiers and to compare different biomarker models. An ROC curve is a plot of the true positive rate (TPR) vs. the false positive rate (FPR) for a binary classifier (i.e. positive/negative) as its cutoff point is varied.
ROC Curve
(figure: true positive rate (sensitivity) vs. false positive rate (1 − specificity), each from 0% to 100%; the curve shows the tradeoff between sensitivity and specificity)
Area under the ROC curve (AUC)
An overall measure of test performance. Comparisons between two tests can be based on differences between their (estimated) AUCs.
(figure: example ROC curves)
- AUC = 100%: the best possible test
- AUC = 95%: a good test
- AUC = 70%: a poor test
- AUC = 50%: the worst case (no discrimination)
Other Supervised Classification Methods
- SIMCA: Soft Independent Modeling of Class Analogy
- OPLS: Orthogonal Projections to Latent Structures
- Support Vector Machines
- Random Forest
- Naïve Bayes classifiers
- Neural Networks
Breaching the Data Barrier
- Unsupervised methods: PCA, K-means clustering, Factor Analysis
- Supervised methods: PLS-DA, LDA, PLS-Regression
- Machine learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
Data Analysis Progression
1. Unsupervised methods: PCA or clustering to see if natural clusters form or if the data separate well; the data are "unlabeled" (no prior knowledge)
2. Supervised methods / machine learning: the data are labeled (prior knowledge); used to see if the data can be classified, and to help separate less obvious clusters or features
3. Statistical significance: supervised methods always generate clusters, which can be very misleading; check whether the clusters are real by label permutation
Note of Caution
Supervised classification methods are powerful: they learn from experience, generalize from previous examples, and perform pattern recognition. But too many people skip the PCA or clustering steps and jump straight to supervised methods. Some get great separation and think the job is done; this is where the errors begin. Too many don't assess significance using permutation testing or n-fold cross-validation. If separation isn't at least partially obvious by eyeballing your data, you may be treading on thin ice.