Analysis of Gene Expression Data

Slides:



Advertisements
Similar presentations
1 A B C
Advertisements

Trend for Precision Soil Testing % Zone or Grid Samples Tested compared to Total Samples.
AGVISE Laboratories %Zone or Grid Samples – Northwood laboratory
Trend for Precision Soil Testing % Zone or Grid Samples Tested compared to Total Samples.
Simplifications of Context-Free Grammars
1
EuroCondens SGB E.
Worksheets.
STATISTICS HYPOTHESES TEST (I)
STATISTICS INTERVAL ESTIMATION Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University.
STATISTICS POINT ESTIMATION Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University.
Addition and Subtraction Equations
David Burdett May 11, 2004 Package Binding for WS CDL.
Whiteboardmaths.com © 2004 All rights reserved
CALENDAR.
Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION
The 5S numbers game..
A Fractional Order (Proportional and Derivative) Motion Controller Design for A Class of Second-order Systems Center for Self-Organizing Intelligent.
1 OFDM Synchronization Speaker:. Wireless Access Tech. Lab. CCU Wireless Access Tech. Lab. 2 Outline OFDM System Description Synchronization What is Synchronization?
Break Time Remaining 10:00.
The basics for simulations
Factoring Quadratics — ax² + bx + c Topic
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
PP Test Review Sections 6-1 to 6-6
1 IMDS Tutorial Integrated Microarray Database System.
MM4A6c: Apply the law of sines and the law of cosines.
Localisation and speech perception UK National Paediatric Bilateral Audit. Helen Cullington 11 April 2013.
Regression with Panel Data
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
Progressive Aerobic Cardiovascular Endurance Run
Biology 2 Plant Kingdom Identification Test Review.
2.5 Using Linear Models   Month Temp º F 70 º F 75 º F 78 º F.
Adding Up In Chunks.
Basic Gene Expression Data Analysis--Clustering
MaK_Full ahead loaded 1 Alarm Page Directory (F11)
Artificial Intelligence
When you see… Find the zeros You think….
Overview of Genevestigator
1 Using Bayesian Network for combining classifiers Leonardo Nogueira Matos Departamento de Computação Universidade Federal de Sergipe.
2011 WINNISQUAM COMMUNITY SURVEY YOUTH RISK BEHAVIOR GRADES 9-12 STUDENTS=1021.
Before Between After.
2011 FRANKLIN COMMUNITY SURVEY YOUTH RISK BEHAVIOR GRADES 9-12 STUDENTS=332.
Subtraction: Adding UP
: 3 00.
5 minutes.
1 Non Deterministic Automata. 2 Alphabet = Nondeterministic Finite Accepter (NFA)
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Static Equilibrium; Elasticity and Fracture
Essential Cell Biology
Converting a Fraction to %
Ch 14 實習(2).
Resistência dos Materiais, 5ª ed.
Clock will move after 1 minute
Copyright © 2013 Pearson Education, Inc. All rights reserved Chapter 11 Simple Linear Regression.
Lial/Hungerford/Holcomb/Mullins: Mathematics with Applications 11e Finite Mathematics with Applications 11e Copyright ©2015 Pearson Education, Inc. All.
Module 20: Correlation This module focuses on the calculating, interpreting and testing hypotheses about the Pearson Product Moment Correlation.
Physics for Scientists & Engineers, 3rd Edition
Chapter 14 Nonparametric Statistics
Select a time to count down from the clock above
9. Two Functions of Two Random Variables
1 Dr. Scott Schaefer Least Squares Curves, Rational Representations, Splines and Continuity.
Chart Deception Main Source: How to Lie with Charts, by Gerald E. Jones Dr. Michael R. Hyman, NMSU.
1 Non Deterministic Automata. 2 Alphabet = Nondeterministic Finite Accepter (NFA)
Schutzvermerk nach DIN 34 beachten 05/04/15 Seite 1 Training EPAM and CANopen Basic Solution: Password * * Level 1 Level 2 * Level 3 Password2 IP-Adr.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently.
Overview of Microarray. 2/71 Gene Expression Gene expression Production of mRNA is very much a reflection of the activity level of gene In the past, looking.
Microarray analysis Quantitation of Gene Expression Expression Data to Networks BIO520 BioinformaticsJim Lund Reading: Ch 16.
Presentation transcript:

Analysis of Gene Expression Data Rainer Breitling r.breitling@bio.gla.ac.uk Bioinformatics Research Centre and Institute of Biomedical and Life Sciences University of Glasgow

Outline Gene expression biology Measuring gene expression levels two technologies: Two-color cDNA arrays and single-color Affymetrix genechips Finding and understanding differentially expressed genes Advanced analysis (clustering and classification) Cutting-edge uses of microarray technology

Gene expression biology

The central dogma of biology

Genome information is complete for hundreds of organisms...

...but the complexity and diversity of the resulting phenotype is challenging whole-mount in situ hybridization of X. laevis tadpoles

The dramatic consequences of gene regulation in biology Same genome  Different tissues Different physiology Different proteome Different expression pattern Anise swallowtail, Papilio zelicaon

The complexity of eukaryotic gene expression regulation

Regulatory Networks – integrating it all together Genetic regulatory network controlling the development of the body plan of the sea urchin embryo Davidson et al., Science, 295(5560):1669-1678.

Gene expression distinguishes... ...physiological status (nutrition, environment) ...sex and age ...various tissues and cell types ...response to stimuli (drugs, signals, toxins) ...health and disease underlying pathogenic diversity progression and response to treatment patient classes of varying prospects

Measuring gene expression levels total amount of mRNA = optical density at appropriate (UV) wavelength mass separation and specific probing, one gene at a time = Northern blot comprehensive “molecular sorting” = microarray technology two-color cDNA or oligo arrays single-color Affymetrix genechips

cDNA microarray schema color code for relative expression From Duggan et al. Nature Genetics 21, 10 – 14 (1999)

cDNA microarray raw data can be custom-made in the laboratory always compares two samples relatively cheap up to about 20,000 mRNAs measured per array probes about 50 to a few hundred nucleotides Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. (DeRisi, Iyer & Brown, Science, 268: 680-687, 1997)

GeneChip® Affymetrix

GeneChip® Hybridization Image courtesy of Affymetrix.

Affymetrix genearrays single color (color code indicates only hybridization intensity) high density, perfectly addressable probes multiple probes per gene/mRNA

Affymetrix genechips contain “probe sets” instead of single probes per gene  better reliability of the results (each probe is [almost] an independent test)

Mismatch probes allow present/absence calls for every single probe set PM probes MM probes Wilcoxon Signed Rank Test : non-parametric test; Take the paired observations (PM-MM), calculate the differences, and rank them from smallest to largest by absolute value. Add all the ranks associated with positive differences, giving the T+ statistic. Finally, the p-value associated with this statistic is found from an appropriate table. (MathWorld)

Finding and understanding differentially expressed genes

Scatter plots classical scatter plot M-A plot for microarray analysis Differentially expressed genes are higher (or lower) in one of the samples Use an appropriate cut-off (‘distance’ from diagonal) to select relevant genes  highly arbitrary!

t-test = statistical significance of observed difference requires independent experimental replication assumes the data are identically normally distributed

Testing an intrinsic hypothesis Frequency Sample 2 Sample 1 Two samples (1, 2) with mean expression that differ by some amount d. If H0 : d = 0 is true, then the expected distribution of the test statistic t is -3 -2 -1 1 2 3 Probability

Volcano plot A volcano plot is a scatter plot of -log (p-value) from a t-test or one-way ANOVA, versus log ratio. It allows you to visualize fold-change and statistical significance at the same time, so that one can find genes that are significant and have large fold change, or genes that are significant but have small fold change. Scatter plot of -log(p-value) from a t-test vs. log ratio. Visualises fold-change and statistical significance at the same time: Find genes that are significant and have large fold change, and genes that are significant but have small fold change.

Is this gene changed? Rank Product: Comparison with all other genes on the array Expression of gene A Rank Product: RP = (3/10) * (1/10) * (2/10) * (5/10) intuitive non-parametric, powerful test statistic more reliable detection of changed genes in noisy data with few replicates Significance estimate based on random permutations: Probability that gene A shows such an effect by chance: p ≤ 0.03 Expectation to see any gene (out of 10) with such a effect: E-value ≈ 0.5 Breitling et al., FEBS Letters, 2004

Multiple Testing Problem microarrays measure expression of >10,000 genes at the same time  many thousands of statistical tests are performed type 1-error: Calling a gene significantly changed, even if it’s just by chance  protect yourself by Bonferroni correction type 2-error: Missing a significantly changed gene  reduce this problem by Benjamini-Hochberg false-discovery rate procedure

Multiple Testing Problem Bonferroni correction. n independent tests, control the probability that a spurious result passes the test at signficance level α  adjust acceptance level for each individual test as: Benjamini-Hochberg False Discovery Rate. Control the number of false positives (N1|0) among the top R genes at the significance level α.

The result of “differential expression” statistical analysis  a long list of genes!   Fold-Change Gene Symbol Gene Title 1 26.45 TNFAIP6 tumor necrosis factor, alpha-induced protein 6 2 25.79 THBS1 thrombospondin 1 3 23.08 SERPINE2 serine (or cysteine) proteinase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 2 4 21.5 PTX3 pentaxin-related gene, rapidly induced by IL-1 beta 5 18.82 6 16.68 CXCL10 chemokine (C-X-C motif) ligand 10 7 18.23 CCL4 chemokine (C-C motif) ligand 4 8 14.85 SOD2 superoxide dismutase 2, mitochondrial 9 13.62 IL1B interleukin 1, beta 10 11.53 CCL20 chemokine (C-C motif) ligand 20 11 11.82 CCL3 chemokine (C-C motif) ligand 3 12 11.27 13 10.89 GCH1 GTP cyclohydrolase 1 (dopa-responsive dystonia) 14 10.73 IL8 interleukin 8 15 9.98 ICAM1 intercellular adhesion molecule 1 (CD54), human rhinovirus receptor 16 9.97 SLC2A6 solute carrier family 2 (facilitated glucose transporter), member 6 17 8.36 BCL2A1 BCL2-related protein A1 18 7.33 TNFAIP2 tumor necrosis factor, alpha-induced protein 2 19 6.97 SERPINB2 serine (or cysteine) proteinase inhibitor, clade B (ovalbumin), member 2 20 6.69 MAFB v-maf musculoaponeurotic fibrosarcoma oncogene homolog B (avian)

Biological Interpretation Strategy Are certain types of genes more common at the top of the list and is that significant? Challenges: Some types of genes are more common in the genome/on the array The list of genes usually stops at an arbitrary cut-off (“significantly changed genes”) Classifying genes according to “gene type” is a tedious task Expectations and focused expertise might bias the interpretation Early discoveries might restrict further analysis Solution: Automated procedure using available annotations

iterative Group Analysis (iGA) iGA uses a simple hypergeometric distribution to obtain p-values Breitling et al. (2004), BMC Bioinformatics, 5:34.

Possible sources of classification adjacency in metabolic networks shared biological processes co-expression in microarray experiments co-occurrence in the biomedical literature gene ontology annotations (shared terms from a controlled vocabulary)

Graph-based iGA exploits the overlap of annotations to produce a comprehensive picture of the microarray results

Graph-based iGA 1. step: build the network

Graph-based iGA 2. step: assign experimentally determined ranks to genes

Graph-based iGA 3. step: find local minima p = 1/8 = 0.125

Graph-based iGA 4. step: extend subgraph from minima p=0.014 p=0.018

Graph-based iGA 5. step: select p-value minimum p=0.014 p=0.018

small ribosomal subunit large ribosomal subunit nucleolar rRNA processing translational elongation Breitling et al., BMC Bioinformatics, 2004

oxidative phosphorylation (complex V) respiratory chain complex II glyoxylate cycle citrate (TCA) cycle oxidative phosphorylation (complex V) respiratory chain complex III Breitling et al., BMC Bioinformatics, 2004

Advanced analysis (clustering and classification)

Classical study of cancer subtypes Golub et al. (1999) identification of diagnostic genes

Similarity between microarray experiments or expression patterns  distance between points in high dimensional space Pearson correlation (looks for similarity in shape of the response profile, not the absolute values) Euclidean distance (shortest direct path), takes absolute expression level into account There are many ways of conceptualizing the correlation coefficient. If you were to make a scatterplot of the values of x against y (pairing x1 with y1, x2 with y2, etc), then r reports how well you can fit a line to the values. The simplest way to think about the correlation coefficient is to plot x and y as curves, with r telling you how similar the shapes of the two curves are. The Pearson correlation coefficient is always between -1 and 1, with 1 meaning that the two series are identical, 0 meaning they are completely uncorrelated, and -1 meaning they are perfect opposites. The correlation coefficient is invariant under linear transformation of the data. That is, if you multiply all the values in y by 2, or add 7 to all the values in y, the correlation between x and y will be unchanged. Thus, two curves that have identical shape, but different magnitude, will still have a correlation of 1. The Euclidean distance takes the difference between two gene expression levels directly. It should therefore only be used for expression data that are suitably normalized, for example by converting the measured gene expression levels to log-ratios. Unlike the correlation-based distance measures, the Euclidean distance takes the magnitude of changes in the gene expression levels into account. It therefore preserves more information about the gene expression levels than the other distance measures mentioned above. An example of the Euclidean distance applied to k-means clustering can be found in De Hoon, Imoto, and Miyano (2002). The city-block distance, alternatively known as the Manhattan distance, is related to the Euclidean distance. Whereas the Euclidean distance corresponds to the length of the shortest path between two points, the city-block distance is the sum of distances along each dimension: This is equal to the distance you would have to walk between two points in a city, where you have to walk along city blocks. The city-block distance is a metric, as it satisfies the triangle inequality. As for the Euclidean distance, the expression data are subtracted directly from each other, and we should there fore make sure that they are properly normalized. Manhattan (or city-block) distance

Gene expression data analysis (Ramaswamy and Golub 2002)

Hierarchical clustering Combine most similar genes into agglomerative clusters, build tree of genes Do the same procedure along the second dimension to cluster samples Display the sorted expression values as a heatmap

Hierarchical clustering results Chi et al., PNAS | September 16, 2003 | vol. 100 | no. 19 | 10623-10628 “Endothelial cell diversity revealed by global expression profiling”

Biologically Valid Linear Factor Models of Gene Expression expression level of gene g in array a expression level of gene x in hypothetical process p contribution of process p to expression pattern in array a experiment- and gene-specific noise M. Girolami & R. Breitling (2004), Bioinformatics, 20(17):3021-33

Biologically Valid Linear Factor Models of Gene Expression M. Girolami & R. Breitling (2004), Bioinformatics, 20(17):3021-33

Support Vector Machines (SVM) for supervised classification Find separating hyperplane that maximizes the margin between the two classes  use this to classify new samples (e.g. in a microarray-based diagnostic test)

Excursus: Experimental design common reference loop Kerr & Churchill, Biostatistics. 2001. Jun;2(2):183-201 A-Optimality = minimize

Cutting-edge uses of microarray technology

Alternative splicing on microarrays Alternative Splicing Microarrays Reveal Functional Expression of Neuron-specific Regulators in Hodgkin Lymphoma Cells Examples of changes in alternative splicing in Hodgkin cell lines. A, changes in expression of p16/p14arf isoforms. The genomic structure of the locus and splicing patterns are represented. Alternative transcription initiation sites are indicated by brown arrows, exons are indicated by boxes, and splicing patterns are indicated by thin lines. Arrows indicate increase (arrows pointing up), decrease (arrows pointing down), or no change (double-headed horizontal arrows) in the detection of exons or splice junctions in RNAs from Hodgkin cell lines (indicated with a color code) as compared with a reference non-tumor B cell line. B, changes in expression of CD44 isoforms. Exons v2–v10 are variable exons that can be included or skipped in various combinations, affecting the cell adhesion properties of the protein product. Genomic structure, splicing patterns, and changes in splicing are indicated as in A. FIG. 3. Unsupervised clustering analyses of microarray data. Microarray results for splicing events (exons/splice junctions) for 16 RNA preparations were clustered using Gene Spring 6.2 software. Each column represents an RNA sample, and each row represents an exon/splicing event. Colors vary from blue to red with increasing signal intensity and from dark to bright with the confidence on the value observed in multiple measurements. Colors at the bottom row and in dendrograms represent the different cell lines, as indicated. Relogio et al., J. Biol. Chem., Vol. 280, Issue 6, 4779-4784, February 11, 2005

Customised detection of genetic polymorphisms in human patients individual genotype  personalised medicine example: ARRAYED PRIMER EXTENSION (APEX) 2.Complementary fragment of PCR amplified sample DNA is annealed to oligos. 1. Up to 6000 known 25-mer oligos are immobilized via 5’ end on a microarray 4. DNA fragments and unused dye terminators are washed off. Signal detection. 3. Template dependent single nucleotide extension by DNA polymerase. Terminator nucleotides are labelled with 4 different fluorescent dyes.

Identification of pathogens in environmental (patient) samples – Sequencing by hybridization Fig. 1. Pathogen screening with oligonucleotide arrays. (A) Scanned image of a portion of the MPID microarray showing relative fluorescence intensity (in pseudocolor scale, increasing from blue to white) of probes corresponding to 8 Yersinia pestis-diagnostic regions from which amplicons and probes are derived. (B) Magnification of the first 20 probe pairs of probe set Yp5 (from Y. pestis caf1 virulence gene) hybridizing to Y. pestis target DNA (yellow box in A above); each 20-base perfect match (PM) probe square distinctly matching the pathogen’s diagnostic sequence is paired with mismatch (MM) partner, physically located in the adjacent, lower position on the microarray. The relative fluorescence intensity results for each probe pair is graphically displayed with red bars for PM intensity and green bars for MM intensity. This probe set continues with additional 328 probe pairs. (C) Microarray probe design. Overlapping sequences advance by one nucleotide over the entire diagnostic sequence, essentially sequencing the entire region by hybridization. The beginning of the jcaf1 gene sequence (capitalized) is shown, with the first 20 corresponding MM probes underneath, identical to the PM probes except for a mismatch at position 11 of its partners 20-meroligonucleotide (substitution highlighted in red). between 3 and 10 probe sets per species, each containing a few hundred probes sensitivity about 500fg pathogen genomic DNA per sample Wilson et al. Molecular and Cellular Probes, Volume 16, Issue 2 , April 2002, Pages 119-127

Global identification of transcription factor target sites using chromatin immunoprecipitation plus whole-genome tiling microarrays (ChIP-chip) preferably the array should provide continuous genome coverage, not just ORFs Hanlon & Lieb: Current Opinion in Genetics & Development Volume 14, Issue 6 , December 2004, Pages 697-705

Directed graph of regulatory influences – gene network Inference of gene regulatory networks from gene expression data (indirect method, in contrast to the direct ChIP-chip approach remove ambiguous relationships (remove indirect connections) (A) Obtain the gene expression matrix  using several sets of the gene expression patterns resulting from disruption of one gene. (B) Binary relation R; Using the gene expression matrix  , if the intensity of gene ` ' is changed higher than a given threshold value  , or is changed lower than a given threshold 1/ resulting from the disruption of gene ` ,' it is defined that gene ` ' affects gene ` ' directly or indirectly. Thus the binary relation  is created by cutting the value of each element in the gene expression matrix  at the threshold ( or 1/ ). (C) Adjacency matrix A; The adjacency matrix A is derived directly from the binary relation  . If there is a relation that gene ` ' affects gene ` ', then the value of element  in the adjacency matrix  is set to 1; . (D) Equivalence set; If there is the relation, such that gene ` ' and ` ' affect each other, that is , we cannot decide which gene is located upstream or downstream in a gene regulatory network. This is the limitation and disadvantage of this method, however, we introduce an ``equivalence set'', which comprises a group of genes affecting each other and the group is treated as a one-gene node for determining influence on other genes. In Fig. 1, genes G3 and G4 are partitioned into a equivalence set G3**, or a new artificial gene G3**, and the adjacency matrix A is contracted to a matrix. After this process, any two genes, gene ` ' and gene ` ', stand in only one relation, that is ``Gene affects gene ,'' ``Gene affects gene '' or ``Gene and are independent.'' This shows that genes are in a semi-ordered (topologically sorted) matrix. (E) Skeleton matrix S; Semi-ordered matrix A( matrix) includes indirect effects between genes. In order to remove them, we process as follows: The value of line and column in semi-ordered matrix  and skeleton matrix  are represented as and , respectively. If equals 1, is set to max{ , 0}( ). Thus, all indirect effects are removed from the semi-ordered matrix. In Fig. 1, the relation between gene G1 and gene G3** are removed, and the skeleton matrix  is constructed. (F) Draw multi-level digraph; Arrows are drawn between nodes based on the value of each element in the skeleton matrix. The genes with parentheses indicate an equivalence set of genes. The multi-level digraph method shown in Fig. 1 is consistent with the experiment data E and the binary relation R. Directed graph of regulatory influences – gene network ABURATANI et al., DNA Res. 2003 Feb 28;10(1):1-8.

gene expression as a Quantitative Trait Genetical genomics gene expression as a Quantitative Trait qualitative expression quantitative expression the combination of genotype and expression information can identify cis- and trans-regulatory sites Fig. 2. Combined analysis of expression profile and genetic map data from Fig. 1. Four hypothetical cDNAs are given as example. For each, the microarray data is plotted in a histogram and analysed in combination with the molecular marker data, using multifactorial analysis of variance. The types of allele present for the markers A/a, B/b, D/d and F/f are differentiated by colour. The distribution of expression for cDNA1 shows a qualitative expression (a). Consequently, this is a marker in itself. The distribution of expression of cDNA2 (b) gives a quantitative profile, and its components can be resolved by grouping the segregating alleles at the markers A and B. The distribution of expression of cDNA3 (c) also gives a quantitative profile, but its components cannot be resolved despite the segregation of the F/f alleles of marker F. The quantitative expression profile cDNA4 (d) can be resolved based on D/d and F/f. Thus, marker F contributes to information about other cDNAs, whereas F by itself is not informative. Standard one-way analysis of variance of the profile of cDNA2 would distinguish Ab and AB (light shades) from ab and aB (dark shades), but a two-way analysis of variance is more appropriate and allows distinction between the four classes ab, aB, Ab and AB. For cDNA4, the same is true for the classes df, Df, dF and DF. In the case of cDNA2, there is no epistatic interaction, whereas the position of DF relative to the other three classes in case of cDNA4 indicates such an interaction. The multifactorial analysis of variance can easily be extended to three or more markers. It will allow assessing the significance and the strength of the actions and interactions of gene products. epistatic interaction Jansen & Nap, Trends Genet. 2001 Jul;17(7):388-91 and Jansen & Nap, Trends Genet. 2004 May;20(5):223-5.

Further reading Kerr MK, Churchill GA. Genet Res. 2001; 77: Statistical design and the analysis of gene expression microarray data. Eisen MB, Spellman PT, Brown PO, Botstein D. Proc Natl Acad Sci U S A. 1998; 95: Cluster analysis and display of genome-wide expression patterns. Hughes TR, Marton MJ, Jones AR, Roberts CJ, et al. Cell. 2000; 102: Functional discovery via a compendium of expression profiles. Wit E, McClure J. 2005: Statistics for Microarrays – Design, Analysis and Inference

Conclusions microarrays measure gene expression globally  new post-genomic biology two principal technologies: one-color (Affymetrix) and two-color (cDNA arrays) multiple measurements pose particular statistical challenges interpretation requires combination with previous knowledge creative application of microarrays opens new avenues for biological insight