Frédéric Schütz Statistics and bioinformatics applied to –omics technologies Part II: Integrating biological knowledge Center.


Frédéric Schütz Statistics and bioinformatics applied to –omics technologies Part II: Integrating biological knowledge Center for Integrative Genomics University of Lausanne, Switzerland Bioinformatics Core Facility Swiss Institute of Bioinformatics

Contents: Class prediction; Gene Ontology analysis; Geneset analysis (GSEA, etc.)

Class discovery and class prediction
Example: patients from whom we obtained measurements (e.g. gene expression).
Class discovery: find natural groups in the data (e.g. sets of patients with similar gene expression).
Class prediction: given previous measurements for which the grouping is known (red and blue), can we predict the group to which a new observation belongs?
[Figures: two scatter plots of Gene 1 against Gene 2, one per task; in the class prediction plot the new observation is marked “?”]

Why do we want to do class prediction?
Many questions in biology and medicine are “class prediction” questions:
–Does a patient have a predisposition for a given disease?
–What is the prognosis for this patient?
–What will be the response of this patient to a given drug?
–Is this tumour benign or malignant?
–What type is this tumour?
–Which treatment should be used?

Class prediction: easy case
[Figure: scatter plot of Gene 1 against Gene 2 with a threshold line; everything on one side is classified as “red”, everything on the other side as “blue”]

Example Pierre Farmer et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene (2005) 24, 4660–4671 Blue points represent “oestrogen receptor (ER) status positive” determined by immunohistochemistry.

Class prediction: in practice
The two groups are not perfectly separated (and this is still a pretty good case…).
One variable (gene) is not sufficient to assign patients to groups.
Remember that with microarrays, we are not talking about just 2 measurements, but tens of thousands.
[Figure: scatter plot of Gene 1 against Gene 2]

Discrimination in general
Goal: assign objects (e.g. patients) to classes based on some measurements (e.g. gene expression).
Typically, in a microarray setting:
–10s or (at best) 100s of patients
–10,000s of genes
Unsupervised learning: nothing is known about the grouping of the data, and we try to find natural groups in the data.
Supervised learning: the classes are predefined; we use previously labelled objects to create a procedure for classification of future observations.

Some supervised analysis methods: K-nearest neighbours; Linear Discriminant Analysis; Classification trees; Support Vector Machines (SVM); etc.

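Example: 3-nearest neighbours
[Figure: scatter plot of Gene 1 against Gene 2 with a new, unlabelled point: red or blue?]

Example: 3-nearest neighbours
[Figure: the same plot; the three nearest neighbours are 2 red vs 1 blue, so the point is assigned to “red”]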

K-nearest neighbours
Choose a value for k (typical values: 3 or 5); in practice it can be chosen using the learning data (the value that produces the best result).
Find the k observations in the learning set that are closest to the new, unknown observation.
Predict the class by a majority vote, that is, choose the class that is most common among the neighbours.
A very simple method, with surprisingly good performance.
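The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Euclidean distance, the function name `knn_predict` and the six expression values are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the class of new_point by majority vote among its k nearest neighbours."""
    # Distance (Euclidean, an assumption) from the new observation to every learning observation
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among these neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (Gene 1, Gene 2) expression values for six patients
train_points = [(1.0, 1.2), (1.5, 0.8), (1.2, 1.9), (3.0, 3.2), (3.5, 2.8), (2.9, 3.9)]
train_labels = ["blue", "blue", "blue", "red", "red", "red"]

prediction = knn_predict(train_points, train_labels, (2.8, 3.0), k=3)
print(prediction)
```

With two classes, an odd value of k avoids tied votes, which is one reason 3 and 5 are common choices.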

Linear Discriminant Analysis
Suggested by R.A. Fisher in 1936.
Procedure to find a linear combination of the observed variables that best separates (discriminates) two classes of objects.
Using the “new variable”, objects from the same class are close together, and objects from different classes are further apart.
Straightforward to calculate.
Can easily be extended to more than two classes.
Similar idea to Principal Component Analysis (PCA).
Often forgotten in favour of PCA.

Back to the easy case
Discriminant = Gene 1
[Figure: scatter plot of Gene 1 against Gene 2 with a threshold on Gene 1; points with a high value of the discriminant are classified as “red”, points with a low value as “blue”]

Linear Discriminant Analysis: Example
The two groups are well separated, yet neither Gene 1 nor Gene 2 alone is able to discriminate between the two categories.
[Figure: scatter plot of Gene 1 against Gene 2]

Linear Discriminant Analysis: Example
However, the linear combination L = Gene 1 + Gene 2 discriminates well between the two groups.
Blue points tend to have smaller L values; red points tend to have bigger L values.
[Figure: the same scatter plot with an axis for L running from low values to high values]

Linear Discriminant Analysis: Example
A threshold is set between the means of the two groups.
Points with a value of L above the threshold are classified as red; points with a value of L below the threshold are classified as blue.
[Figure: the same scatter plot with the threshold marked on the L axis]
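The example can be reproduced numerically. The sketch below computes Fisher's discriminant direction for two genes as w = Sw⁻¹(mean_red − mean_blue), where Sw is the pooled within-class scatter matrix, and places the threshold halfway between the projected group means. The data points, the function names and the exact midpoint rule are assumptions chosen for illustration; the data are built so that neither gene separates the groups alone.

```python
def mean(points):
    """Mean of a list of (gene1, gene2) points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, centre):
    """2x2 scatter matrix of the points around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_lda(blue, red):
    """Return the discriminant weights w and the classification threshold."""
    m_blue, m_red = mean(blue), mean(red)
    s_blue, s_red = scatter(blue, m_blue), scatter(red, m_red)
    # Pooled within-class scatter matrix
    sw = [[s_blue[i][j] + s_red[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m_red[0] - m_blue[0], m_red[1] - m_blue[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold halfway between the projected means (an assumption of this sketch)
    project = lambda p: w[0] * p[0] + w[1] * p[1]
    threshold = (project(m_blue) + project(m_red)) / 2
    return w, threshold

def classify(point, w, threshold):
    """High values of the discriminant are red, low values are blue."""
    score = w[0] * point[0] + w[1] * point[1]
    return "red" if score > threshold else "blue"

# Hypothetical data: the ranges of Gene 1 and of Gene 2 overlap between groups
blue = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.5, 2.4), (2.5, 1.6)]
red = [(2.0, 4.0), (3.0, 3.0), (4.0, 2.0), (2.5, 3.6), (3.5, 2.4)]

w, threshold = fit_lda(blue, red)
print(w, threshold, classify((2.0, 3.5), w, threshold))
```

For these data the two weights come out nearly equal, so the fitted discriminant is close to a multiple of Gene 1 + Gene 2, the combination used in the example.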

Caveats: Overfitting
It is easy to create classifiers which fit the training data perfectly.
It is harder to find classifiers which still work as well when validated on new data.
A classifier must ALWAYS be tested on data independent of the data used to train it.
This is particularly important in microarray analysis:
–Few samples
–Many different measurements
If you are not careful, it is always possible to find a classifier that works well for your training data!

Caveats: Overfitting
[Figure: scatter plot of Gene 1 against Gene 2 in which one region is classified as red]
A perfect classifier for this data, but probably not so good with any new data.
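The danger can be demonstrated with data that contain no signal at all. In the sketch below, 20 simulated "patients" each have 1000 random "gene" measurements and a random label; a 1-nearest-neighbour classifier still reproduces the training labels perfectly. The sample sizes, the choice of classifier and the random seed are arbitrary assumptions made for illustration.

```python
import math
import random

def nearest_neighbour_label(train, point):
    """1-nearest neighbour: return the label of the closest training sample."""
    return min(train, key=lambda item: math.dist(item[0], point))[1]

def accuracy(train, test):
    """Fraction of test samples whose label is predicted correctly."""
    correct = sum(nearest_neighbour_label(train, x) == y for x, y in test)
    return correct / len(test)

rng = random.Random(42)
n_genes = 1000  # many measurements

def random_sample():
    # Pure noise: the label is independent of the measurements
    return [rng.gauss(0, 1) for _ in range(n_genes)], rng.choice(["red", "blue"])

training = [random_sample() for _ in range(20)]     # few samples
independent = [random_sample() for _ in range(20)]  # data not used for training

train_accuracy = accuracy(training, training)
test_accuracy = accuracy(training, independent)
print(train_accuracy, test_accuracy)
```

The training accuracy is 100% because each training sample is its own nearest neighbour. On the independent samples the labels are unrelated to the measurements, so the accuracy can only be expected to hover around 50%.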

Gene Ontology analysis
Many microarray experiments produce lists of genes that are significantly differentially expressed between two conditions (gene comparison).
In some (rare) cases, only a few genes are of interest, and they can easily be examined and validated.
In most cases, however, a long list of differentially expressed genes is returned, and these genes cannot be considered individually. It is harder to obtain biological understanding from this data.
One strategy: consider the functional annotation of the differentially expressed genes.
Question: what do these genes have in common that could be of interest?

Reminder: Gene Ontology (GO) project
Collaborative effort to address the need for consistent descriptions of gene products in different databases.
Three structured, controlled vocabularies (ontologies) that describe gene products in a species-independent manner, in terms of their associated:
–biological processes
–cellular components
–molecular functions
(From

Example (From PPARA, NR1C1, PPAR: Peroxisome proliferator-activated receptor alpha (TAS: Traceable Author Statement, IPI: Inferred from Physical Interaction)

Example of GO analysis
Simple microarray experiment: WT vs KO.
The microarray has 10,000 genes, 100 of which have the GO annotation “fatty acid transport”.
I obtain 1000 differentially expressed genes (10% of all genes).
If my experiment has nothing to do with “fatty acid transport”, I expect on average about 10% of the annotated genes (that is, 10 of the 100) to be differentially expressed.
If this proportion is higher, it means the list of differentially expressed genes is enriched in “fatty acid transport” genes.
If the difference is significant, it suggests a link between differential expression and this GO annotation: genes with this annotation are more likely to be differentially expressed than others.
This indicates that this biological process may be related to my KO experiment.
[Figure: 10,000 genes in total, of which 1000 (10%) are differentially expressed and 90% are not]

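[Figure: 10,000 genes in total, of which 1000 (10%) are differentially expressed and 90% are not. Number of “fatty acid transport” genes in each part:
–10 (10%) differentially expressed, 90 (90%) not: looks like a random distribution, no apparent association
–100 (100%) differentially expressed, 0 (0%) not: strong association
–in between: ?...]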

Statistical analysis
Assume that I found 20 differentially expressed genes with the GO annotation of interest.
Count the numbers of genes with or without the GO annotation, and compare with differential expression:

                          Differentially expressed   Not D.E.   Total
“Fatty acid transport”                          20         80     100
Others                                         980       8920    9900
Total                                         1000       9000   10000

A statistical test such as Fisher’s exact test can tell us the probability of observing this result (or a more extreme one) if there is no association between the rows and columns.
In this case, this probability (p-value) is small, which indicates that this biological process may be important in the difference between WT and KO.
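The p-value for this example can be computed directly. The sketch below implements the one-sided version of Fisher's exact test as the upper tail of the hypergeometric distribution: the probability that at least 20 of the 100 annotated genes fall among 1000 differentially expressed genes drawn from 10,000. The function name and the choice of a one-sided test are assumptions; web services and statistical packages may report a two-sided or corrected value instead.

```python
from math import comb

def enrichment_p_value(n_total, n_annotated, n_de, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least n_overlap annotated genes among the
    n_de differentially expressed genes if there is no association.
    """
    max_overlap = min(n_annotated, n_de)
    # Number of ways to obtain k annotated genes in the list, summed over k >= n_overlap
    favourable = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    # Divide by the number of ways to choose any list of n_de genes
    return favourable / comb(n_total, n_de)

# Numbers from the example: 10,000 genes, 100 annotated, 1000 D.E., 20 in common
expected = 100 * 1000 / 10_000  # annotated genes expected in the list by chance
p_value = enrichment_p_value(10_000, 100, 1_000, 20)
print(expected, p_value)
```

Exact integer arithmetic is used until the final division, which avoids rounding problems with the very large binomial coefficients involved.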

In practice
One can either suggest a GO annotation and see if it is enriched in the list of differentially expressed genes, or go “fishing” and try all potentially interesting GO annotations to see if any of them is enriched.
Easy to do: multiple services are available on the web.
–The user provides the list of differentially expressed genes
–The service returns the most significant GO annotations

Gene Ontology analysis: example. I
Microarray with about 22,000 genes.
We look at the 1% of the genes that are most different between different subtypes of cancer.
Which processes are likely to be different between these subtypes?
–Those for which more than 1% of the genes are differentially expressed are good candidates
[Table: column “Prop.” with the values 5%, 19%, 10%, 3%, 5%, 4%; row labels are missing]
Pierre Farmer et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene (2005) 24, 4660–4671

Gene Ontology analysis: example. II
To apply this GO analysis, we first need to define a list of differentially expressed genes.
This usually means calculating a “score” (e.g. p-value), and selecting a cut-off point.
While there are some traditional cut-off points (0.001, 0.01 or the “magical” 0.05), they remain fairly arbitrary.
–Is there really a difference between a gene associated with a p-value just below the cut-off and another one with a p-value just above it?

Gene Ontology analysis: example. III
Some genes may be differentially expressed, but the change may be so small (lost in the noise) that they will not appear in the list.
However, the difference in expression may appear at the level of a set of genes rather than individual genes.
Sets of genes may correspond e.g. to co-regulated genes, or genes belonging to the same pathway.
If the change of expression is consistent across genes in the set, it may indicate that the set is of interest, even if no individual gene shows a significant difference.

Gene set enrichment analysis (GSEA)

Series of papers describing a method for analyzing the expression of sets of genes.
Software available, along with a database of biologically relevant gene sets.
Relatively hot topic in bioinformatics/statistics: many different papers and methods published on the topic, with small or large differences.
GSEA usually refers to this particular program, but sometimes indicates any such method which examines sets of genes.

Principle of GSEA
We have a list of genes sorted according to a given measure (score for differential expression, correlation to a phenotype, etc.).
Among this list, we have a smaller set of genes of interest (e.g. all belonging to a given pathway).
Is the smaller set distributed randomly in the sorted list of genes?
–If yes, the set is less likely to be of interest
–If no, it may indicate that the function represented by the set is linked with the measure

Principle of GSEA (most methods)
[Figure: all genes, sorted from high values (e.g. up-regulated) to low values (e.g. down-regulated), with the positions of the genes of our set of interest marked in the list]
The location of the genes of our set of interest within the list seems random (uniform); the set does not appear to be linked with differential expression.

Principle of GSEA (most methods)
[Figure: the same sorted list, from high values (e.g. up-regulated) to low values (e.g. down-regulated), shown twice: in one panel the positions of the genes of our set of interest indicate a link with up-regulation, in the other a link with down-regulation]

Statistical analysis
“Random walk”:
–The list of genes is walked down from left to right
–Every time a gene belongs to our set S, the score goes up
–Every time a gene does not belong to the set, the score goes down
If the genes of the set are uniformly distributed, the score will never go very high (an “up” is soon followed by a “down”).
If the genes of the set are grouped together, the score will go higher before getting back to 0.
Using a permutation test, a p-value can be associated with the gene set.
From fig. 1 of Subramanian et al. PNAS 2005; 102;
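The random walk and its permutation test can be sketched as follows. This is a simplified, unweighted illustration and not the published GSEA algorithm: the step sizes (1/number of set genes for a hit, 1/number of other genes for a miss, so that the walk returns to 0), the use of random gene sets of the same size as the null distribution, and the gene names are all assumptions made for this example.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from 0 of the running score along the sorted list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the set must contain some, but not all, of the genes")
    up, down = 1.0 / n_hit, 1.0 / n_miss  # chosen so that the walk ends at 0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += up if is_hit else -down
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    as_extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            as_extreme += 1
    # Add 1 to both counts so that the estimated p-value is never exactly 0
    return observed, (as_extreme + 1) / (n_perm + 1)

# Hypothetical sorted list of 200 genes, from most up- to most down-regulated
genes = [f"g{i}" for i in range(200)]
top_set = set(genes[:10])       # all set genes grouped at the top of the list
spread_set = set(genes[::20])   # set genes spread evenly through the list

es_top, p_top = permutation_p_value(genes, top_set)
es_spread, p_spread = permutation_p_value(genes, spread_set)
print(es_top, p_top, es_spread, p_spread)
```

A set grouped at the top of the list pushes the running score up to about 1 before it decays back to 0, while an evenly spread set never moves far from 0, so only the first receives a small p-value.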

Statistical analysis
How can we summarise and assess an apparent link between a set and differential expression?
Each method uses different statistics.
The original GSEA method was based on the Kolmogorov-Smirnov test (compare the distribution of genes with a uniform distribution).
It was later replaced by an “Enrichment Score” (similar but weighted).

Example mRNA expression profiles from lymphoblastoid cell lines derived from 15 males and 17 females Identify gene sets correlated with the difference between males and females (False Discovery Rate) From table 2 of Subramanian et al. PNAS 2005; 102;

Example Gene expression patterns from a collection of 50 cancer cell lines p53 regulates gene expression in response to various signals of cellular stress 33 cell lines carry a mutation on the p53 gene, and 17 are normal. From table 2 of Subramanian et al. PNAS 2005; 102;

Conclusions
Gene Set Enrichment Analysis methods have quickly become widespread in the microarray community.
Intuitive method.
Can be used to confirm an association known or suspected (use a given gene set), or to go “fishing” for unknown associations (use a database of gene sets).
More generally, microarray analysis relies more and more on this kind of external biological knowledge.