Introduction to Microarray Data Analysis - II BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University.


Introduction to Microarray Data Analysis - II BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University

Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles

How does a two-channel microarray work? The printing process introduces errors and larger variance. Comparative hybridization experiment.

How does a microarray work? Fabrication expense and the frequency of errors increase with probe length, so short 25-mer oligonucleotide probes are employed. Problem: cross-hybridization. Solution: introduce a mismatch probe that differs from the perfect-match probe at one (central) position. The difference between the two readings gives a more accurate measurement.

How do we use microarray? Inference Clustering

Normalization Which normalization algorithm to use? Inter-slide normalization. Not just for Affymetrix arrays.
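The slides do not name a specific algorithm, so as one illustrative inter-slide option (array values made up, `quantile_normalize` is a hypothetical helper name), quantile normalization forces every array onto the same empirical distribution:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (genes x arrays):
    every array is mapped onto the mean empirical distribution."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # rank of each value within its array
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)    # average of the k-th smallest values
    return mean_quantiles[ranks]

# Toy data: 4 genes measured on 3 arrays with different overall scales.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
# After normalization, every column contains the same set of values,
# so between-array intensity differences are removed.
```

The rank trick (`argsort` of `argsort`) replaces each measurement by the average of the equally-ranked measurements across arrays, which is the standard simple form of the method (ties are broken arbitrarily here).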

Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles

Hypothesis Testing Two sets of samples drawn from two distributions (N = 2)

Hypothesis Testing Two sets of samples drawn from two distributions (N = 2). Let μ1 and μ2 be the means of the two distributions. Null hypothesis: μ1 = μ2. Alternative hypothesis: μ1 ≠ μ2.

Student’s t-test

The p-value can be computed from the t-value and the degrees of freedom (related to the number of samples) to bound the probability of a type-I error (claiming an insignificant difference to be significant), assuming normal distributions.
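A minimal sketch with made-up expression values for one gene in two groups; the second calculation shows how the p-value follows from the t-value and the degrees of freedom:

```python
import numpy as np
from scipy import stats

# Hypothetical log-expression values for one gene in two groups.
group1 = np.array([7.1, 6.8, 7.4, 7.0, 6.9])
group2 = np.array([8.2, 8.0, 7.9, 8.5, 8.1])

# Student's t-test (equal variances assumed).
t, p = stats.ttest_ind(group1, group2)

# The same p-value recovered from the t-value and the degrees of
# freedom: two-sided tail of the t distribution with N1 + N2 - 2 df.
df = len(group1) + len(group2) - 2
p_manual = 2 * stats.t.sf(abs(t), df)
```

Both computations agree; the second one makes explicit that the p-value is purely a function of the t-value and the sample sizes.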

Student’s t-test Dependent (paired) t-test

Permutation (t-)test The t-test relies on a parametric distributional assumption (normality). Permutation tests do not depend on such an assumption; examples include the permutation t-test and the Wilcoxon rank-sum test. Procedure: perform a regular t-test to obtain the t-value t0. Then randomly permute the N1 + N2 samples and designate the first N1 as group 1, the rest as group 2. Perform the t-test again and record the t-value t. Over all possible permutations, count how many t-values are larger than t0; call this count K0. The permutation p-value is K0 divided by the number of permutations.
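A sketch of this procedure, using a sample of random permutations rather than enumerating all of them (the data and the helper name `perm_ttest_pvalue` are made up for illustration):

```python
import numpy as np
from scipy import stats

def perm_ttest_pvalue(x, y, n_perm=2000, seed=0):
    """Permutation t-test: the fraction of random relabelings whose
    |t| is at least as large as the observed |t0|."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    t0 = abs(stats.ttest_ind(x, y).statistic)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the N1 + N2 samples
        t = abs(stats.ttest_ind(pooled[:len(x)], pooled[len(x):]).statistic)
        if t >= t0:
            count += 1
    return count / n_perm  # K0 / number of permutations

x = np.array([7.1, 6.8, 7.4, 7.0, 6.9])
y = np.array([8.2, 8.0, 7.9, 8.5, 8.1])
p_perm = perm_ttest_pvalue(x, y)
```

Because no distributional form is assumed, this estimate remains valid when the normality assumption behind the parametric p-value is in doubt.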

Multiple Classes (N > 2) F-test The null hypothesis is that the distribution of gene expression is the same for all classes; the alternative hypothesis is that at least one class has a distribution different from the others. Which class is different cannot be determined by the F-test (ANOVA); it can only be identified post hoc.
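A sketch with made-up expression values for one gene across three classes; note the F-test only signals that some class differs, not which one:

```python
from scipy import stats

# Hypothetical expression of one gene in three classes (N = 3).
class_a = [5.1, 5.3, 4.9, 5.2]
class_b = [5.0, 5.2, 5.1, 4.8]
class_c = [6.4, 6.1, 6.6, 6.3]  # this class differs, but the F-test
                                # itself does not say so

# One-way ANOVA across all classes at once.
F, p = stats.f_oneway(class_a, class_b, class_c)
```

A small p-value here rejects the hypothesis that all three classes share one distribution; identifying `class_c` as the culprit would take a post-hoc comparison.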

Example GEO Dataset Subgroup Effect

Gene Discovery and Multiple T-tests Controlling False Positives With a p-value cutoff of 0.05 (the probability of a false positive, i.e., type-I error) and 22,000 probesets, the expected number of false discoveries is 22,000 × 0.05 = 1,100. Focusing on those 1,100 genes in a second specimen still yields an expected 1,100 × 0.05 = 55 false discoveries.

Gene Discovery and Multiple T-tests Controlling False Positives State the set of genes explicitly before the experiments. Problem: not always feasible, defeats the purpose of large-scale screening, and could miss important discoveries. Alternative: statistical tests to control the false positives.

Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives: controlling for no false positives (very stringent, e.g., Bonferroni methods); controlling the number of false positives; controlling the proportion of false positives. Note that in the screening stage a false positive is better than a false negative, as the latter means missing a possibly important discovery.

Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives Controlling for no false positives (very stringent): Bonferroni methods and multivariate permutation methods. Bonferroni inequality: the probability of a union of events is at most the sum of their probabilities (area of union ≤ sum of areas).

Gene Discovery and Multiple T-tests Bonferroni methods Bonferroni adjustment If Ei is the event of a false-positive discovery for gene i, then, conservatively speaking, a false positive is almost guaranteed for K > 19 tests (the expected number of false positives, K × 0.05, reaches 1 at K = 20). So change the p-value cutoff from p0 to p0/K. This is called the Bonferroni adjustment. If K = 20 and p0 = 0.05, we call gene i significantly differentially expressed if pi < 0.05/20 = 0.0025.

Gene Discovery and Multiple T-tests Bonferroni methods Bonferroni adjustment Too conservative: excessive stringency leads to increased false negatives (type-II errors), and it causes problems for meta-analysis. Variation: the sequential Bonferroni test (Holm-Bonferroni test). Sort the K p-values from small to large to get p1 ≤ p2 ≤ … ≤ pK, and change the cutoff for the ith-smallest p-value to p0/(K-i+1), i.e., compare p1 against p0/K, p2 against p0/(K-1), …, pK against p0. If pj ≤ p0/(K-j+1) for all j ≤ i but p(i+1) > p0/(K-i), reject the null hypotheses for genes 1 to i and retain the null hypotheses for genes i+1 to K.
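Both adjustments can be sketched directly (the p-values are made up, and `bonferroni_reject`/`holm_reject` are illustrative helper names, not a library API):

```python
import numpy as np

def bonferroni_reject(pvals, p0=0.05):
    """Bonferroni: reject H0_i iff p_i < p0 / K."""
    K = len(pvals)
    return [p < p0 / K for p in pvals]

def holm_reject(pvals, p0=0.05):
    """Holm-Bonferroni: compare the i-th smallest p-value (1-indexed)
    against p0 / (K - i + 1); stop at the first failure."""
    K = len(pvals)
    order = np.argsort(pvals)
    reject = [False] * K
    for rank, idx in enumerate(order):       # rank = i - 1
        if pvals[idx] <= p0 / (K - rank):    # threshold p0 / (K - i + 1)
            reject[idx] = True
        else:
            break                            # retain all remaining nulls
    return reject

pvals = [0.001, 0.010, 0.020, 0.300, 0.800]
bonf = bonferroni_reject(pvals)  # single cutoff 0.05 / 5 = 0.01
holm = holm_reject(pvals)        # stepwise cutoffs 0.01, 0.0125, 0.0167, ...
```

On these values Holm rejects one more hypothesis than plain Bonferroni (0.010 passes its relaxed second-step cutoff of 0.0125), illustrating why the sequential version is less conservative.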

Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives Controlling the number of false positives Simple approach: choose a p-value cutoff lower than the usual 0.05 but higher than the Bonferroni-adjusted cutoff. More sophisticated: a version of multivariate permutation.

Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives Controlling the proportion of false positives Let γ be the proportion (percentage) of false positives among the total discovered genes: γ = (false positives) / (total positives). The p-value cutoff pD is chosen so that γ stays below the desired level. There are other ways of estimating false positives; details can be found in Tusher et al., PNAS 98.
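One standard procedure that controls exactly this proportion (the false discovery rate) is Benjamini-Hochberg, sketched here with made-up p-values; Tusher et al.'s SAM, cited on the slide, estimates false positives differently (by permutation):

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.10):
    """Reject the hypotheses with the i smallest p-values, where i is
    the largest index with p_(i) <= (i / K) * fdr."""
    p = np.asarray(pvals)
    K = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, K + 1) / K) * fdr  # (i / K) * fdr
    below = p[order] <= thresholds
    reject = np.zeros(K, dtype=bool)
    if below.any():
        last = np.max(np.where(below)[0])
        reject[order[:last + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
keep = benjamini_hochberg(pvals, fdr=0.10)
```

Note that 0.039 fails its own step-up threshold but is still rejected because a later p-value passes; the procedure cuts at the largest qualifying index, which is what keeps the expected false-positive proportion below the chosen level.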

Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles

Review of Microarray and Gene Discovery Clustering and Classification Preprocessing Distance measures Popular algorithms (not necessarily the best ones) More sophisticated ones Evaluation Data mining

-Clustering or classification? -Is training data available? -What domain specific knowledge can be applied? -What preprocessing of data is needed? -Log / data scale and numerical stability -Filtering / denoising -Nonlinear kernel -Feature selection (do I need to use all the data?) -Is the dimensionality of the data too high?

How do we process microarray data (clustering)? - Feature selection: genes, transformations of expression levels. - Genes discovered in class comparison (t-test). Risk: missing genes. - Iterative approach: select genes under different p-value cutoffs, then pick the cutoff with good performance using cross-validation. - Principal components (pros and cons). - Discriminant analysis (e.g., LDA).

Distance Measure (Metric?) -What do you mean by “similar”? -Euclidean -Uncentered correlation -Pearson correlation

Distance Metric -Euclidean Example (probesets for Lip1 and Ap1s1): dE(Lip1, Ap1s1) = 12883

Distance Metric -Pearson Correlation r ranges from 1 (perfect correlation) to -1 (perfect anti-correlation).

Distance Metric -Pearson Correlation Example (probesets for Lip1 and Ap1s1): dP(Lip1, Ap1s1) = 0.904

Distance Metric -Uncentered Correlation Example (probesets for Lip1 and Ap1s1): du(Lip1, Ap1s1) corresponds to an angle of about 33.4°

Distance Metric -Difference between Pearson correlation and uncentered correlation (probesets for Lip1 and Ap1s1): Pearson correlation subtracts each profile's mean, so a baseline expression level is possible; uncentered correlation treats all values as signal.
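The three measures can be computed side by side; the Lip1/Ap1s1 expression values from the slides are not reproduced here, so the vectors below are made up to show the baseline effect:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    """Centered correlation: subtracting each profile's mean means
    a constant baseline shift does not change the value."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def uncentered_corr(x, y):
    """Cosine of the angle between the raw profiles: everything,
    including the baseline, is treated as signal."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([100.0, 200.0, 300.0, 400.0])
y = np.array([1100.0, 1200.0, 1300.0, 1400.0])  # same shape, shifted baseline

# Pearson sees identical shapes (correlation 1); uncentered correlation
# is pulled below 1 by the baseline; Euclidean distance is large.
```

This is exactly the distinction on the slide: two profiles with the same shape but different baselines are "identical" to Pearson, merely similar to uncentered correlation, and far apart in Euclidean distance.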

Distance Metric -Difference between Euclidean and correlation

Distance Metric -Missing: negative correlation may also mean “close” in a signaling pathway; use 1-|PCC| or 1-PCC^2 to treat strong negative correlation as proximity.

Review of Microarray and Gene Discovery Clustering and Classification Preprocessing Distance measures Popular algorithms (not necessarily the best ones) More sophisticated ones Evaluation Data mining

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Single linkage: The linking distance is the minimum distance between two clusters.

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Complete linkage: The linking distance is the maximum distance between two clusters.

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Single linkage – Prone to chaining and sensitive to noise Complete linkage – Tends to produce compact clusters Average linkage – Sensitive to distance metric
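A sketch of the three linkage methods with scipy, on toy profiles with two obvious groups (data made up; on cleanly separated data all three linkages recover the same clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy expression profiles: two clearly separated groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],
              [8.0, 8.1], [8.2, 7.9]])

results = {}
for method in ("single", "complete", "average"):  # average = UPGMA
    Z = linkage(X, method=method, metric="euclidean")
    # Cut the dendrogram so that exactly two clusters remain.
    results[method] = fcluster(Z, t=2, criterion="maxclust")
```

The differences among single, complete, and average linkage only show up on messier data (chaining for single linkage, compactness for complete linkage), which is the point of the comparison on this slide.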

-Unsupervised Learning – Hierarchical Clustering

Dendrograms Distance – the height of each horizontal line represents the distance between the two groups it merges. Order – open-source R uses the convention that the tighter cluster goes on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.

-Unsupervised Learning - K-means -Vector quantization -K-D trees -Need to try different K, sensitive to initialization

-Unsupervised Learning - K-means [cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20); % 4 = K (number of clusters), 'corr' = distance metric, 20 random restarts

-Unsupervised Learning - K-means -The number of classes K needs to be specified -Does not always converge -Sensitive to initialization
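A rough Python counterpart to the MATLAB call above: plain Lloyd's algorithm with random restarts (the 'rep' idea). This sketch uses Euclidean distance rather than the slide's correlation distance, and the data are made up:

```python
import numpy as np

def kmeans(X, k, n_init=20, n_iter=100, seed=0):
    """Lloyd's algorithm with random restarts; keep the restart with
    the lowest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        # Random initialization: each run only finds a local optimum,
        # which is why several restarts are needed.
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = ((X - centers[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_cost = labels.copy(), cost
    return best_labels, best_cost

# Toy data: two well-separated groups of profiles, K = 2.
X = np.vstack([np.zeros((4, 3)), np.full((4, 3), 10.0)])
labels, cost = kmeans(X, 2)
```

The restarts directly address the initialization sensitivity noted on the slide: an unlucky start converges to a poor local optimum, but the best-of-20 result is kept.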

-Issues -Lack of consistency or representative features (5.3 TP PTEN doesn’t make sense) -Data structure is missing -Not robust to outliers and noise D’Haeseleer 2005 Nat. Biotechnol 23(12):

-Model-based clustering methods (Han) Pan et al., Genome Biology, research0009

-Structure-based clustering methods

-Supervised Learning -Support vector machines (SVM) and Kernels -Only (binary) classifier, no data model

-Accuracy vs. generality -Overfitting -Model selection Figure: prediction error vs. model complexity for training and testing samples (reproduced from Hastie et al.)
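The curve on the slide can be reproduced in miniature with synthetic data: as polynomial degree (model complexity) grows, training error keeps shrinking while test error is typically lowest at an intermediate degree:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = lambda x: np.sin(2 * np.pi * x)

# Noisy samples of the same underlying function, split train/test.
x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.01, 0.99, 20)
y_train = truth(x_train) + rng.normal(0.0, 0.3, x_train.size)
y_test = truth(x_test) + rng.normal(0.0, 0.3, x_test.size)

train_err, test_err = [], []
for degree in range(1, 10):  # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
# Training error decreases with degree (the model memorizes noise);
# test error bottoms out at an intermediate degree and then worsens.
```

Choosing the degree that minimizes the held-out error, rather than the training error, is the model-selection step the slide refers to.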