Statistics Tools in GeneSpring The Center for Bioinformatics UNC at Chapel Hill Jianping Jin Ph.D. Bioinformatics Scientist Phone: (919)843-6015 E-mail:



What Can GeneSpring Do?
- Works with both Affymetrix and two-color data.
- Displays data graphically (classification, graph, tree, scatter plot, Venn diagram, ...).
- Performs statistical analyses.
- Annotates genes (updates from GenBank, LocusLink, UniGene; biochemical pathways).
- ...

What statistical analyses does GS do?
- Clustering: k-means (non-hierarchical), self-organizing maps, gene trees (hierarchical dendrograms).
- Principal component analysis.
- t-test analyses (p-values).
- Finding genes: like a known gene or an average of genes; like a pattern drawn with the mouse; with high confidence; with relative expression in certain ranges.
- Pathway analysis: finding genes that fit in a certain place in a pathway.
- Sequence analysis to automatically find regulatory sequences.
- Automatic functional annotation of sub-trees in dendrograms.
- ...

Tree Clustering
1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance

Notation for the Formulas
- Result: the result of the calculation for genes A and B.
- $n$: the number of samples being correlated over.
- $a$: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of expression values for gene A.
- $b$: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of expression values for gene B.
- $a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$.
- $|a| = \sqrt{a \cdot a}$.

Standard Correlation
- Equation: $a \cdot b / (|a|\,|b|)$.
- Also called "Pearson correlation around zero".
- Measures the angular separation of the expression vectors for genes A and B.
- Answers the question "do the peaks match up?"
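The equation above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not GeneSpring's own code; the function name `standard_correlation` is made up for the example.

```python
import math

def standard_correlation(a, b):
    """Uncentered correlation a.b / (|a||b|): the cosine of the angle
    between the two expression vectors."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined for an all-zero vector")
    return dot / (norm_a * norm_b)

# Two genes whose peaks line up exactly (one is a scaled copy of the other).
print(standard_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the vectors are not centered, two profiles that are both always positive tend to score high even when their shapes differ, which is the main practical difference from the Pearson correlation.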

Pearson Correlation
- Equation: $A \cdot B / (|A|\,|B|)$.
- Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of each expression vector.
- $A$ is formed by subtracting the mean of all elements of $a$ from each element of $a$: $A_i = a_i - \bar{a}$ (the sign convention does not change the result).
- $B$ is formed from $b$ in the same way.
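As a minimal sketch of the centering step, the snippet below builds $A$ and $B$ and then applies the same formula as the standard correlation. The name `pearson_correlation` is chosen for the example only.

```python
import math

def pearson_correlation(a, b):
    """Center each vector on its own mean, then compute A.B / (|A||B|)."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [x - mean_a for x in a]
    B = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(A, B))
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    if norm_A == 0 or norm_B == 0:
        raise ValueError("undefined for a constant vector")
    return dot / (norm_A * norm_B)

# A rising and a falling profile are perfectly anti-correlated after centering.
print(pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```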

Spearman Confidence
- Let $r$ be the value of the Spearman correlation. Then SC = 1 - (probability of getting a value of $r$ or higher by chance).
- A measure of similarity, not a correlation.
- SC is high when the Spearman correlation is high and the p-value is low.
- Takes into account the number of sub-experiments in the experiment set.

Two-sided Spearman Confidence
- A measure of similarity, very similar to the Spearman confidence.
- A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.
- Answers the question "which genes behave similarly or opposite to a specific gene?"
- Probably not good for k-means or tree clustering.
- Value: 1 - (probability of getting a Spearman correlation of $|r|$ or higher, or $-|r|$ or lower, by chance).
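The one-sided and two-sided confidences can be illustrated with an exact permutation test. This is an assumption made for the example: the slides do not say how GeneSpring computes the chance probability, and enumerating all permutations is only practical for a handful of samples. All function names are hypothetical.

```python
import math
from itertools import permutations

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dot = sum((p - mx) * (q - my) for p, q in zip(x, y))
    nx = math.sqrt(sum((p - mx) ** 2 for p in x))
    ny = math.sqrt(sum((q - my) ** 2 for q in y))
    if nx == 0 or ny == 0:
        raise ValueError("undefined for a constant vector")
    return dot / (nx * ny)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def spearman_confidence(a, b, two_sided=False):
    """1 - P(correlation at least as extreme as observed), where the null
    distribution is every re-pairing of the ranks (assumed, exact test)."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)
    hits = total = 0
    for perm in permutations(rb):
        rp = pearson(ra, perm)
        total += 1
        if two_sided:
            hits += abs(rp) >= abs(r) - 1e-12
        else:
            hits += rp >= r - 1e-12
    return 1 - hits / total

a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
print(spearman_confidence(a, b), spearman_confidence(a, b, two_sided=True))
```

With only four samples even a perfect rank match cannot exceed a confidence of 23/24, which shows why the measure depends on the number of sub-experiments.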

Distance
- A measure of dissimilarity, not a correlation at all.
- The Euclidean distance between the expression profiles of genes A and B, where each profile is a point in N-dimensional space.
- Distance $= |a - b| / \sqrt{N}$, where $N$ is the number of experiment points.
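A short sketch of the distance formula follows; the function name is illustrative. Dividing by $\sqrt{N}$ makes the value a root-mean-square difference per experiment point, so it stays comparable across experiments with different numbers of points.

```python
import math

def profile_distance(a, b):
    """Euclidean distance between two profiles, divided by sqrt(N)."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    if not a:
        raise ValueError("profiles must not be empty")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared) / math.sqrt(len(a))

print(profile_distance([0.0, 0.0], [3.0, 4.0]))  # 5 / sqrt(2)
```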

Special Case Correlations
- Smooth correlation, change correlation, and upregulated correlation.
- All three are modified versions of the standard correlation.
- They only make sense when the data form a sequence, such as "before"/"after", a time series, or a drug series.

Smooth Correlation
- Make a new vector $A$ from $a$ by interpolating the average of each consecutive pair of elements of $a$ and inserting this new value between the old values.
- Do this for each pair of elements that would be connected by a line in the graph screen.
- Do the same to make a vector $B$ from $b$.
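The interpolation step can be sketched as below. Two assumptions are made for the example: every consecutive pair of points is treated as connected by a line, and the smoothed vectors are then compared with the standard correlation (the special-case correlations are described as modified versions of it). Function names are hypothetical.

```python
import math

def smooth(a):
    """Insert the average of each consecutive pair between the old values."""
    if not a:
        return []
    out = []
    for left, right in zip(a, a[1:]):
        out.append(left)
        out.append((left + right) / 2)
    out.append(a[-1])
    return out

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def smooth_correlation(a, b):
    """Standard correlation of the interpolated vectors A and B."""
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```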

Change Correlation
- The opposite of what the smooth correlation looks for: only the change in expression level between adjacent points matters.
- Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector.
- Less sensitive to outliers than using the ratio directly.
- The value created between two values $a_i$ and $a_{i+1}$ is $\arctan(a_{i+1}/a_i) - \pi/4$.
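A minimal sketch of the arc tangent transformation, assuming strictly positive expression values so that the ratio is defined (the function name is made up). An unchanged level gives a ratio of 1 and therefore a value of 0, and the result is bounded no matter how large the ratio gets, which is where the robustness to outliers comes from.

```python
import math

def change_vector(a):
    """atan(a[i+1] / a[i]) - pi/4 for each adjacent pair of points."""
    if any(x <= 0 for x in a):
        raise ValueError("this sketch assumes positive expression values")
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(a, a[1:])]

# Flat, then up, then down.
print(change_vector([1.0, 1.0, 2.0, 1.0]))
```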

Upregulated Correlation
- Very similar to the change correlation, but it considers only positive changes: all negative values of the arc tangent term are set to zero.
- Make a new vector $A$ from $a$ by looking at the change between each pair of adjacent elements of $a$.
- The value created between two values $a_i$ and $a_{i+1}$ is $\max(\arctan(a_{i+1}/a_i) - \pi/4,\ 0)$.
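The clipping step can be sketched the same way, again assuming positive expression values and using an illustrative function name.

```python
import math

def upregulated_vector(a):
    """max(atan(a[i+1] / a[i]) - pi/4, 0) for each adjacent pair:
    decreases are flattened to zero, increases are kept."""
    if any(x <= 0 for x in a):
        raise ValueError("this sketch assumes positive expression values")
    return [max(math.atan(nxt / cur) - math.pi / 4, 0.0)
            for cur, nxt in zip(a, a[1:])]

# The drop from 2 to 1 contributes 0; the rise from 1 to 3 is kept.
print(upregulated_vector([2.0, 1.0, 3.0]))
```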

Algorithm to Build Gene Tree
1. Determine whether only one gene or subtree is left. If so, go to step 5.
2. Find the two closest genes or subtrees.
3. Merge these two into one subtree.
4. Return to step 1.
5. Merge together branches where the distance between sub-branches is less than the separation ratio, considering genes that are less than the minimum distance apart.
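Steps 1 to 4 amount to ordinary agglomerative clustering and can be sketched as follows. Several things here are assumptions for the example: Euclidean distance as the measure, the average pairwise distance between members as the distance between subtrees (the slides do not say which linkage is used), and nested tuples as the tree representation. Step 5, the post-processing with the separation ratio and minimum distance, is not implemented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(members_a, members_b):
    """Mean pairwise distance between two groups of profiles (assumed)."""
    total = sum(euclidean(p, q) for p in members_a for q in members_b)
    return total / (len(members_a) * len(members_b))

def build_gene_tree(profiles):
    """profiles: dict mapping gene name -> expression vector.
    Returns a nested tuple such as ('g3', ('g1', 'g2'))."""
    if not profiles:
        raise ValueError("need at least one gene")
    # Each entry is (subtree, list of member profiles).
    clusters = [(name, [vec]) for name, vec in profiles.items()]
    while len(clusters) > 1:                      # step 1
        best = None
        for i in range(len(clusters)):            # step 2: closest pair
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),   # step 3: merge
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                   # step 4: repeat
    return clusters[0][0]

tree = build_gene_tree({"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [10.0, 10.0]})
print(tree)
```

The search for the closest pair is quadratic in the number of subtrees on every pass, so this sketch is only suitable for small gene lists.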

Algorithm to Build Tree
- Minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group, making the groups less specific.
- Separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.

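Principal Components Analysis
- Not a clustering method.
- Finds the principal components: a set of expression patterns that are the most abundant "building blocks" of the data.
- The first principal component is the linear combination of expression patterns that accounts for the most variability in the data; the second accounts for the most of the remaining variability, and so on.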
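To make the idea of a first principal component concrete, here is a small sketch that finds it by power iteration on the covariance matrix. Power iteration is a technique chosen for this example, not necessarily what GeneSpring uses; rows are taken to be gene profiles and columns to be conditions, and the function name is hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector (one weight per condition) along which the centered
    profiles vary the most, found by power iteration."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two profiles")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector; it fails only if it happens to be exactly
    # orthogonal to the leading component.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variance along the start direction")
        v = [x / norm for x in w]
    return v

# Profiles lying on a line where the second condition is twice the first.
print(first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]))
```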

k-Means Clustering
- Starts by dividing genes into a user-defined number (k) of equal-sized groups; the groups are then refined based on expression patterns.
- Creates centroids at the average location of each group of genes.
- With each iteration, genes are reassigned to the group with the closest centroid.
- After all of the genes have been reassigned, the locations of the centroids are recalculated.
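The loop described above can be sketched as follows. The example assumes that the initial equal-sized groups are formed from the order of the gene list and that "closest" means squared Euclidean distance; neither detail is stated on the slide, and the function name is illustrative.

```python
def k_means(profiles, k, max_iter=100):
    """Return a group index (0..k-1) for each profile."""
    n = len(profiles)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of genes")
    # Initial split into k (nearly) equal-sized groups by list order.
    assignment = [i * k // n for i in range(n)]
    centroids = [None] * k
    for _ in range(max_iter):
        # Recompute each centroid as the average of its group.
        for g in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == g]
            if members:  # an emptied group keeps its previous centroid
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
        # Reassign every gene to the group with the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda g: sum((x - c) ** 2 for x, c in zip(p, centroids[g])))
            for p in profiles
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

genes = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
print(k_means(genes, 2))  # [0, 1, 0, 1]
```

The initial split here puts an unrelated pair in each group, and two passes are enough to sort the genes into the low and high profiles.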

Self-Organizing Maps
- Similar to k-means clustering, but also shows the relationship between groups in a 2-D map.
- Places the nodes so that they best represent the variability of the data while still maintaining similarity between adjacent nodes; e.g. node (1,2) is one unit away from node (1,3).

What does t-test mean in GS?
- Replicates: one-sample Student's t-test.
- Comparisons between two groups: Student's two-sample t-test.
- Comparisons among multiple groups: one-way analysis of variance (ANOVA).
- Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
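For the replicate case, the one-sample t statistic can be sketched as below. The reference value of 1.0 in the usage line is only an illustrative choice, and the p-value lookup against the t distribution with n - 1 degrees of freedom is left out to keep the example dependency-free.

```python
import math

def one_sample_t(values, reference):
    """t = (mean - reference) / (s / sqrt(n)) with sample std deviation s."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when all replicates are identical")
    return (mean - reference) / math.sqrt(variance / n)

# Three replicate measurements of one gene tested against 1.0.
print(one_sample_t([1.0, 2.0, 3.0], 1.0))
```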

Filter Genes Analysis Tools
- Global Error Model: filters out genes with large standard deviations or error values.
- Raw data filtering: removes genes that are too close to the background.
- Sample-to-sample comparison: fold comparison among different samples.
- Statistical group comparison: filters out genes that do not vary significantly across different groups.
- Data File Restriction: filters on other fields (P/S call, +/- pairs).

Statistical Group Comparison
- Finds genes with a statistically significant difference in mean expression level across the groups.
- For two groups: Student's two-sample t-test.
- For multiple groups: ANOVA.
- Nonparametric comparison: for each gene, the rank order of the values is used for the analysis: the Wilcoxon two-sample test (Mann-Whitney U test) for two groups, and the Kruskal-Wallis test for multiple groups.
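The two-group statistics named above can be sketched side by side. The t statistic uses the pooled variance of Student's test, and the U statistic is computed directly from pairwise comparisons. Both functions return only the test statistic; converting it to a p-value is omitted, and the function names are made up for the example.

```python
import math

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each group needs at least two values")
    mx, my = sum(x) / nx, sum(y) / ny
    ss_x = sum((v - mx) ** 2 for v in x)
    ss_y = sum((v - my) ** 2 for v in y)
    pooled = (ss_x + ss_y) / (nx + ny - 2)
    if pooled == 0:
        raise ValueError("t is undefined when both groups have no variance")
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

def mann_whitney_u(x, y):
    """U for group x: number of (x, y) pairs with x > y, ties count 0.5.
    Only the rank order of the values matters."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y)

control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
print(two_sample_t(control, treated), mann_whitney_u(control, treated))
```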

Data Normalization
- In two-color experiments: normalize vs. the control channel (green) for each gene.
- Normalize each sample to itself or to a positive control to make different samples comparable to one another.
- Normalize each gene to itself to remove the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).
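Two of these normalizations can be sketched briefly. The per-gene step below divides by the gene's median across samples, which is an assumption made for the example because the slide does not say which summary value is used; the two-color step is a plain signal-to-control ratio per gene. Function names are hypothetical.

```python
from statistics import median

def two_color_ratios(signal, control):
    """Per-gene ratio of the signal channel to the control channel.
    Raises ZeroDivisionError if a control value is zero."""
    if len(signal) != len(control):
        raise ValueError("channels must cover the same genes")
    return [s / c for s, c in zip(signal, control)]

def normalize_gene_to_itself(values):
    """Divide one gene's readings by that gene's median (assumed)."""
    center = median(values)
    if center == 0:
        raise ValueError("cannot normalize a gene with a zero median")
    return [v / center for v in values]

print(two_color_ratios([10.0, 20.0], [5.0, 40.0]))   # [2.0, 0.5]
print(normalize_gene_to_itself([2.0, 4.0, 8.0]))     # [0.5, 1.0, 2.0]
```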

NCI-60 cell lines

DrugActivity_AT