Statistics Tools in GeneSpring The Center for Bioinformatics UNC at Chapel Hill Jianping Jin Ph.D. Bioinformatics Scientist Phone: (919)843-6015 E-mail:


1 Statistics Tools in GeneSpring The Center for Bioinformatics UNC at Chapel Hill Jianping Jin Ph.D. Bioinformatics Scientist Phone: (919)843-6015 E-mail: jjin@email.unc.edu Fax: (919)966-6821

2 What Can GeneSpring Do? Works with both Affymetrix and two-color data. Views data graphically (classification, graph, tree, scatter plot, Venn diagram …). Performs statistical analyses. Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways). …

3 What statistical analyses does GS do? Clustering: k-means (non-hierarchical), self-organizing maps, gene trees (hierarchical dendrograms). Principal component analysis. t-test analyses (p-values). Finding genes: like a known gene or the average of several genes; like a pattern drawn with the mouse; with high confidence; with relative expression in certain ranges. Pathway analysis: finding genes that fit in a certain place in a pathway. Sequence analysis: automatically finding regulatory sequences. Automatic functional annotation of sub-trees in dendrograms. …

4 Tree Clustering 1. Standard correlation 2. Smooth correlation 3. Change correlation 4. Upregulated correlation 5. Pearson correlation 6. Spearman correlation 7. Spearman confidence 8. Two-sided Spearman confidence 9. Distance

5 Notations to the Formulas Result: the result of the calculation for genes A and B. n: the number of samples being correlated over. a: the vector $(a_1, a_2, a_3, \dots, a_n)$ of expression values for gene A. b: the vector $(b_1, b_2, b_3, \dots, b_n)$ of expression values for gene B. $a \cdot b = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$. $|a| = \sqrt{a \cdot a}$.

6 Standard Correlation Equation: $a \cdot b / (|a||b|)$, also called “Pearson correlation around zero”. Measures the angular separation of the expression vectors for genes A and B. Answers the question “do the peaks match up?”
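The formula on this slide can be written out directly. A minimal sketch in plain Python; the function name and the example expression values are made up for illustration and are not part of GeneSpring:

```python
import math

def standard_correlation(a, b):
    """Uncentered correlation a.b / (|a||b|): the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical expression profiles over four samples.
gene_a = [1.0, 2.0, 4.0, 2.0]
gene_b = [2.0, 4.0, 8.0, 4.0]   # same shape as gene_a, twice the magnitude
gene_c = [4.0, 2.0, 1.0, 2.0]   # peak in a different sample

same_shape = standard_correlation(gene_a, gene_b)
shifted_peak = standard_correlation(gene_a, gene_c)
```

Because only the angle matters, gene_b scores the maximum against gene_a despite being twice as large, while gene_c scores lower because its peak sits elsewhere.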

7 Pearson Correlation Equation: $A \cdot B / (|A||B|)$. Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of each expression vector. A is made by subtracting the mean of all elements of a from each element of a: $A_i = a_i - \bar{a}$. Do the same for b to make the vector B.
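A minimal sketch of the centering step followed by the same angle formula, in plain Python. The function name and sample values are illustrative assumptions:

```python
import math

def pearson_correlation(a, b):
    """Center each vector on its own mean, then compute A.B / (|A||B|)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [x - mean_a for x in a]
    B = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(A, B))
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    return dot / (norm_A * norm_B)

# Hypothetical profiles: gene_b is gene_a plus a constant, gene_c is gene_a reversed.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

shifted = pearson_correlation(gene_a, gene_b)
reversed_trend = pearson_correlation(gene_a, gene_c)
```

Centering removes the constant offset, so gene_b correlates perfectly with gene_a, and the reversed profile gives the opposite extreme. A profile with no variation at all has a zero-length centered vector and would raise a division error here.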

8 Spearman Confidence A measure of similarity, not a correlation. With r = the value of the Spearman correlation, SC = 1 - (probability you would get a value of r or higher by chance). The SC value is high when the Spearman correlation is high and its p-value is low. Takes account of the number of sub-experiments in your experiment set.

9 Two-sided Spearman Confidence A measure of similarity, very similar to the Spearman confidence. A two-sided test of whether the Spearman correlation is significantly greater than or less than zero. Answers the question “which genes behave similarly or opposite to a specific gene?” Probably not good for k-means or tree clustering. Value = 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).
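The slides do not say how GeneSpring computes the chance probability. As a sketch only, the version below assumes an exact permutation test: it enumerates every reordering of one gene's ranks, which is practical only for a handful of samples. All names are hypothetical.

```python
import math
from itertools import permutations

def ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    A, B = [x - ma for x in a], [y - mb for y in b]
    dot = sum(x * y for x, y in zip(A, B))
    return dot / math.sqrt(sum(x * x for x in A) * sum(y * y for y in B))

def spearman_confidence(a, b, two_sided=False):
    """1 - P(correlation at least as extreme as the observed r), by exhaustive permutation."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)            # Spearman correlation = Pearson on ranks
    hits = total = 0
    for perm in permutations(rb):
        r_perm = pearson(ra, perm)
        if two_sided:
            hits += abs(r_perm) >= abs(r) - 1e-12
        else:
            hits += r_perm >= r - 1e-12
        total += 1
    return 1 - hits / total

# Hypothetical profiles over five samples, perfectly monotone with each other.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 5.0, 9.0, 10.0]
one_sided = spearman_confidence(gene_a, gene_b)
two_sided = spearman_confidence(gene_a, gene_b, two_sided=True)
```

With more samples there are more possible orderings, so the same correlation yields a higher confidence; that is one way to read "takes account of the number of sub-experiments".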

10 Distance A measure of dissimilarity, not a correlation at all. The Euclidean distance between the expression profiles of genes A and B (the values for each point in N-dimensional space). Distance = $|a - b| / \sqrt{N}$, where N is the number of experiment points.
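A minimal sketch of this distance in plain Python; the function name and values are illustrative:

```python
import math

def distance(a, b):
    """Euclidean distance between two profiles, divided by sqrt(number of points)."""
    n = len(a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / math.sqrt(n)

# Hypothetical profiles that differ by exactly 1 at each of four points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 3.0, 4.0, 5.0]
d = distance(gene_a, gene_b)
```

Dividing by the square root of N turns the result into a root-mean-square difference per point, so distances stay comparable between experiments with different numbers of points.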

11 Special Case Correlations Smooth correlation, change correlation and upregulated correlation. All three are modified versions of the standard correlation. They only make sense when the data are in a sequence, such as “before”/“after”, a time series, or a drug series.

12 Smooth Correlation Make a new vector A from a by interpolating the average of each consecutive pair of elements of a and inserting this new value between the old values. Do this for each pair of elements that would be connected by a line in the graph screen. Do the same to make a vector B from b.
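A sketch of the interpolation step, assuming every consecutive pair of points is connected by a line and that the standard correlation is then applied to the two smoothed vectors. Function names are hypothetical:

```python
import math

def smooth(v):
    """Insert the average of each consecutive pair between the original values."""
    out = [v[0]]
    for prev, cur in zip(v, v[1:]):
        out.append((prev + cur) / 2)
        out.append(cur)
    return out

def smooth_correlation(a, b):
    """Standard correlation A.B / (|A||B|) of the smoothed vectors."""
    A, B = smooth(a), smooth(b)
    dot = sum(x * y for x, y in zip(A, B))
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    return dot / (norm_A * norm_B)

smoothed = smooth([1.0, 3.0, 5.0])
```

A vector of n points becomes one of 2n - 1 points, which gives the overall trend more weight than any single reading.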

13 Change Correlation The opposite of what the smooth correlation looks for: only the change in expression level between adjacent points. Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector. Less sensitive to outliers than using the ratio directly. The value created between two values $a_i$ and $a_{i+1}$ is $\operatorname{atan}(a_{i+1}/a_i) - \pi/4$.
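A sketch of the transformation, assuming strictly positive expression values (a zero would break the ratio) and that the standard correlation is applied to the transformed vectors. Names and data are illustrative:

```python
import math

def change_vector(v):
    """atan(next / current) - pi/4 for each adjacent pair: 0 means no change."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(v, v[1:])]

def change_correlation(a, b):
    """Standard correlation of the two change vectors."""
    A, B = change_vector(a), change_vector(b)
    dot = sum(x * y for x, y in zip(A, B))
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    return dot / (norm_A * norm_B)

# Hypothetical profiles with identical step-to-step ratios at different intensities.
gene_a = [1.0, 2.0, 1.0, 2.0]
gene_b = [10.0, 20.0, 10.0, 20.0]
similarity = change_correlation(gene_a, gene_b)
```

The arc tangent bounds each term between -pi/4 and pi/4 for positive data, so a single very large ratio cannot dominate the vector; that is where the robustness to outliers comes from.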

14 Upregulated Correlation Very similar to the change correlation, but it only considers positive changes: all negative values of the arc tangent term are set to zero. Make a new vector A from a by looking at the change between each pair of elements of a. The value created between two values $a_i$ and $a_{i+1}$ is $\max(\operatorname{atan}(a_{i+1}/a_i) - \pi/4,\ 0)$.
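A sketch of the clipped transformation, again assuming positive expression values. Returning 0 when a gene never increases is an assumption made here to avoid dividing by zero; the slides do not say what GeneSpring does in that case.

```python
import math

def upregulated_vector(v):
    """max(atan(next / current) - pi/4, 0): keeps increases, zeroes decreases."""
    return [max(math.atan(nxt / cur) - math.pi / 4, 0.0) for cur, nxt in zip(v, v[1:])]

def upregulated_correlation(a, b):
    A, B = upregulated_vector(a), upregulated_vector(b)
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    if norm_A == 0 or norm_B == 0:
        return 0.0  # assumption: a gene with no increases matches nothing
    return sum(x * y for x, y in zip(A, B)) / (norm_A * norm_B)

# Hypothetical profiles: both rise at steps 1 and 3, but fall by different amounts at step 2.
gene_a = [1.0, 2.0, 1.0, 3.0]
gene_b = [5.0, 10.0, 1.0, 3.0]
similarity = upregulated_correlation(gene_a, gene_b)
```

The two genes fall by very different amounts in the middle step, yet they score as identical because only the increases are compared.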

15 Algorithm to Build Gene Tree 1. Determine if there is only one gene or subtree left. If yes, go to step five. 2. Find the two closest genes/subtrees. 3. Merge these two into one subtree. 4. Return to step one. 5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes with less than the minimum distance apart.
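The merge loop (everything before the final branch-merging step) can be sketched as below. The slides do not state how the distance between two subtrees is measured, so this sketch assumes the average pairwise Euclidean distance between their genes; the final step with the separation ratio and minimum distance is not attempted. All names and data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leaves(tree):
    """Gene names under a subtree, where a subtree is a name or a (left, right) pair."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

def build_tree(profiles, dist=euclidean):
    """Repeatedly merge the two closest genes/subtrees until one tree remains."""
    def linkage(t1, t2):
        # Assumed: average distance over all gene pairs drawn from the two subtrees.
        pairs = [(g, h) for g in leaves(t1) for h in leaves(t2)]
        return sum(dist(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)

    subtrees = list(profiles)
    while len(subtrees) > 1:                      # is more than one subtree left?
        candidates = [(i, j) for i in range(len(subtrees))
                      for j in range(i + 1, len(subtrees))]
        i, j = min(candidates,                    # find the two closest
                   key=lambda ij: linkage(subtrees[ij[0]], subtrees[ij[1]]))
        merged = (subtrees[i], subtrees[j])       # merge them into one subtree
        subtrees = [t for k, t in enumerate(subtrees) if k not in (i, j)] + [merged]
    return subtrees[0]

# Hypothetical profiles forming two obvious pairs.
profiles = {
    "g1": [1.0, 2.0, 3.0], "g2": [1.0, 2.0, 3.1],
    "g3": [9.0, 1.0, 0.0], "g4": [9.0, 1.0, 0.2],
}
tree = build_tree(profiles)
```

Any of the similarity measures from the earlier slides could be passed as `dist` after converting it to a dissimilarity, for example one minus the correlation.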

16 Algorithm to Build Tree The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group, making the groups less specific. The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.

17 Principal Components Analysis Not a clustering method. PCA finds the principal components: a set of expression patterns that are the most abundant building blocks of the data. The first PC is obtained by finding the linear combination of expression patterns that accounts for the most variability in the data, and so on for the remaining variability.
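To make "the linear combination that accounts for the most variability" concrete, the sketch below finds the first principal component by power iteration on the covariance matrix. This is a textbook technique chosen for brevity, not necessarily what GeneSpring does, and the data are invented.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector along which the mean-centered rows vary the most."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between the d columns.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary starting vector; it must not be exactly orthogonal to the answer.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: two measurements that rise together almost perfectly.
rows = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
pc1 = first_principal_component(rows)
```

For these rows the component points along the diagonal, giving the two measurements nearly equal weights, because almost all of the variation is shared between them.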

18 k-Means Clustering Divides genes into a user-defined number (k) of equal-sized groups, based on their expression patterns. Creates centroids at the average location of each group of genes. With each iteration, genes are reassigned to the group with the closest centroid. After all of the genes have been reassigned, the locations of the centroids are recalculated.
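The loop described on this slide can be sketched as follows. Two assumptions are made here that the slide does not state: the starting groups are cut in list order, and "closest" means Euclidean distance. Names and data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Average location of a group of profiles."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(profiles, k, max_iter=100):
    n = len(profiles)
    # Start from k roughly equal-sized groups, taken in list order (assumed).
    labels = [i * k // n for i in range(n)]
    for _ in range(max_iter):
        centroids = []
        for g in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == g]
            centroids.append(centroid(members) if members else None)
        live = [g for g in range(k) if centroids[g] is not None]
        # Reassign every gene to the group with the closest centroid.
        new_labels = [min(live, key=lambda g: euclidean(p, centroids[g]))
                      for p in profiles]
        if new_labels == labels:   # nothing moved, so the clustering is stable
            break
        labels = new_labels
    return labels

# Hypothetical profiles: two near the origin, two near (10, 10), interleaved.
profiles = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels = k_means(profiles, k=2)
```

The starting split puts the first two profiles together even though they are far apart; one round of reassignment is enough to regroup them by location.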

19 Self-Organizing Maps Similar to k-means clustering, but also shows the relationship between groups in a 2-D map. The map best represents the variability of the data while still maintaining similarity between adjacent nodes, e.g. point 1,2 is one unit away from 1,3.

20 What does t-test mean in GS? Replicates: one-sample Student’s t-test. Comparisons for two groups: Student’s two-sample t-test. Comparisons for multiple groups: one-way analysis of variance (ANOVA). Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
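For the replicate case, the one-sample t statistic compares the mean of the replicates with a reference value. A minimal sketch using only the Python standard library; the function name, replicate values and reference value are made up, and turning the statistic into a p-value (which needs the t distribution) is left out:

```python
import math
from statistics import mean, stdev

def one_sample_t(values, reference):
    """t statistic and degrees of freedom for mean(values) vs. a reference value."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    t = (mean(values) - reference) / standard_error
    return t, n - 1

# Hypothetical normalized expression of one gene across three replicates.
replicates = [1.0, 2.0, 3.0]
t_stat, dof = one_sample_t(replicates, reference=1.0)
```

A large absolute t value means the replicate mean is far from the reference relative to the scatter among replicates.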

21 Filter Genes Analysis Tools Global Error Model: filters out genes with large standard deviations or error values. Raw data filtering: gets rid of genes too close to the background. Sample-to-sample comparison: fold comparison among different samples. Statistical group comparison: filters out genes that do not vary significantly across different groups. Data File Restriction: based on other fields (P/S call, +/- pairs).

22 Statistical Group Comparison Finds genes with a statistically significant difference in mean expression level across all groups. For two groups: Student’s two-sample t-test. For multiple groups: ANOVA. Non-parametric comparison: for each gene, the rank order is used for the analysis, with the Wilcoxon two-sample test (Mann-Whitney U test) for two groups and the Kruskal-Wallis test for multiple groups.
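For the two-group case, the classic Student's statistic pools the variance of both groups. A minimal sketch with the Python standard library; names and values are illustrative, and the p-value lookup is omitted:

```python
import math
from statistics import mean, variance

def two_sample_t(group1, group2):
    """Student's two-sample t statistic (pooled variance) and degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / dof
    t = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, dof

# Hypothetical expression of one gene in two groups of three samples each.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat, dof = two_sample_t(control, treated)
```

Pooling assumes both groups share the same variance; when that is doubtful, the rank-based tests named on this slide are the alternative.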

23 Data Normalization In two-color experiments, normalize against the control channel (green) for each gene. Normalize each sample to itself or to a positive control to make different samples comparable to one another. Normalizing each gene to itself removes the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).
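Two of these normalizations can be sketched in a few lines. The slide does not say which statistic a gene is divided by when it is normalized "to itself"; the median is assumed here, and all names and numbers are hypothetical.

```python
from statistics import median

def normalize_to_control(signal, control):
    """Two-color case: divide each gene's signal by its control-channel value."""
    return [s / c for s, c in zip(signal, control)]

def normalize_gene_to_itself(values):
    """Divide one gene's readings across samples by their median (assumed statistic)."""
    m = median(values)
    return [v / m for v in values]

# Hypothetical raw intensities.
ratios = normalize_to_control([200.0, 50.0], [100.0, 100.0])
relative = normalize_gene_to_itself([10.0, 20.0, 40.0])
```

After per-gene normalization a bright gene and a dim gene with the same relative pattern end up on the same scale, which is what the correlation-based clustering measures compare.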

24 NCI-60 cell lines

25 DrugActivity_AT

