1 Analysis of Gene Expression Anne R. Haake Rhys Price Jones.

1 1 Analysis of Gene Expression Anne R. Haake Rhys Price Jones

2 2 Gene Expression Analysis Connecting structural genomics to functional genomics

3 3 How do we relate gene identity to cell physiology, disease & drug discovery? Functional Genomics = “development and application of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics”

4 4 High Throughput Systems for Studying Global Gene Expression are Complex Need to consider: –the biology behind the experiments & the interpretation of the experiments –advancements in biotechnology –the computing issues

5 5 The Flow of Information A gene is expressed in 2 steps: DNA is transcribed into RNA (mRNA) RNA is translated into protein

6 6 Genotype to Phenotype Individual cells in an organism have the same genes (DNA) –the genotype but….not all genes are active (expressed) in each cell It is the expression of thousands of genes and their products (RNA, proteins), functioning in a complicated and orchestrated way, that make a specific cell what it is. –the phenotype

7 7 The Flow of Information

8 8 Gene Expression Depends on Context The subsets of genes that are expressed (RNA/protein) will differ among cells, tissues, organs, conditions… –the subset expressed confers unique properties to the cell muscle neuron liver

9 9 Differential Gene Expression The level of expression of genes also differs with the cellular context i.e. the amount of a given RNA will vary We can think of gene expression in eukaryotes as having both an “on/off” switch and “volume” control

10 10 Specific Patterns of Gene Expression Tissue/Cell type-specific -e.g. skin cell vs. brain cell -e.g. keratinocyte vs. melanocyte Developmental stage -e.g. embryonic skin cell vs. adult skin cell Disease state -e.g. normal skin cell vs. skin tumor cell Environment-specific (drugs, toxins) -e.g. skin cell untreated vs. treated

11 11 Analyze Gene Expression We measure gene expression by analyzing the genetic molecule, messenger RNA (mRNA) We also often are interested in measuring proteins

12 12 We Can Analyze RNA Content First, isolate mRNA from cells or tissues

13 13 Next, identify RNAs in the sample –One to few RNAs at a time –Multiple RNAs (high throughput techniques) Sometimes called global expression analysis Identify by hybridization (base-pairing) of a radioactive or enzyme-linked “probe” to the immobilized RNA Probe is complementary to RNA of interest –Called cDNA

14 14 RNA Expression Analysis One type of analysis is called a Dot Blot: samples are spotted onto filter and then hybridized with labeled probe So, the sequence is used to generate the data (via hybridization) but the data itself is image data. We scan the images to get intensities for each spot.

15 15 State of the Art: High-Throughput Methods Multiple genes → entire genome expression analyzed at once! RNA (the transcriptome) –DNA microarrays –SAGE: serial analysis of gene expression –MPSS: massively parallel signature sequencing Proteins (the proteome) –protein arrays –mass spectrometry

16 16 Gene Expression Analysis Thousands of different mRNAs are present in a given cell; together they make up the transcriptional profile It is important to remember that when a gene expression profile is analyzed in a given sample, it is just a snapshot in time and space.

17 17 Regulation of Gene Expression: How Do Different Transcriptional Profiles Arise? Prokaryotes (e.g. bacteria) –simple organisms; gene expression responsive primarily to environment –sets of genes are generally on or off –functionally related genes are organized into units called Operons Eukaryotes –Not just on/off and volume control but even more complicated! –complex multi-level control

18 The Expression Snapshot All of these mechanisms together determine which RNAs/proteins are present in the snapshot.

19 19 Gene Regulatory Networks Genes act in concert Interrelationships are complex! Scientists rarely get funded to study one gene anymore!! We need ways to understand what the snapshot means

20 20 Gene Networks "The approach to biology for the past 30 years has been to study individual proteins and genes in isolation. The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks." Leroy Hood, Director of the Institute for Systems Biology, Nature, Oct. 19, 2000.

21 21 Gene Networks: Some Examples Genes and their products are related through their roles in: –metabolic pathways –cell signalling networks

22 22 Metabolic Pathway

23 23 Cell Signalling Networks

24 24 What can we learn by studying global patterns of gene expression? Individual gene expression patterns Classifications: for diagnosis, prediction… –Groups of Genes –Molecular taxonomy of disease Gene Networks/Pathways: –Reconstruction of metabolic & regulatory pathways

25 25 Gene Expression Analysis Biotechnology –High-Throughput techniques of Biochemistry/ Molecular Biology –RNA or protein Informatics –Management of large, complex data sets –Data mining to gain useful information

26 26 High Throughput Techniques We will only consider microarrays in detail today

27 27 Gene Expression Analysis Using Microarrays Workflow: Experimental Design → Array Production → Sample Preparation → Scanning → Image Analysis → Data Processing → Data Analysis → Information Integration. The steps are divided between the Biology “Wet Lab” and the Computer Workstation.

28 28 GeneChip ® vs Spotted Arrays GeneChip ® Arrays use oligonucleotides Oligo arrays are built on a solid support Spotted arrays utilize nucleic acids made in solution Solutions are then “spotted” onto a solid support

29 29 What are cDNA (Spotted) Microarrays? A miniaturized simultaneous version of the traditional cDNA dot blot Enables massive gene expression profiling: 10,000s of genes at once cDNA “probes” are amplified directly from culture by PCR and purified Purified probes are printed onto coated glass slides

30 30 cDNA Microarray Expression Analysis Duggan et al. (1999) Nature Genetics 21: 11

31 ScanAlyze: Image Analysis 1. Spot Finding 2. Background subtraction 3. Intensity Calculation

32 32 Image Analysis Software Freeware or Shareware –ScanAlyze –MAGIC (MicroArray Genome Imaging and Clustering Tool) Spot –Automated spot finding

33 33 What do the spots represent? Fluorescence intensity is a measure of the relative abundance of individual mRNAs Experimental relative to control –expressed as a ratio from spotted arrays (2-color) Have to be careful when comparing between arrays and from experiment to experiment.

34 34 GeneChip ® Oligonucleotide Array

35 GeneChip ® Expression Analysis: Hybridization and Staining [Figure: cRNA target is hybridized to the array; the hybridized array is stained with streptavidin-phycoerythrin conjugate] Courtesy of M. Hessner, CAAGED Workshop

36 36 Affymetrix Chips 300,000 “Probes” Perfect Match and Mismatch Average Difference Values Courtesy of J. Glasner CAAGED Workshop
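
The Average Difference value named on this slide can be illustrated with a small sketch. It assumes the simplest form, the mean of PM minus MM over the probe pairs of one gene; production software may treat outlier probe pairs differently. The intensities and the name `average-difference` are made up for illustration.

```scheme
;; Hypothetical perfect-match (PM) and mismatch (MM) intensities
;; for the probe pairs of one gene.
(define pm '(1200 980 1500 1100))
(define mm '(300 410 350 290))

(define mean
  (lambda (l) (/ (apply + l) (length l))))

;; Simplified average difference: mean of PM - MM over the probe pairs.
(define average-difference
  (lambda (pms mms)
    (mean (map - pms mms))))

(average-difference pm mm)   ; 1715/2, i.e. 857.5
```

A large positive value suggests specific hybridization; a value near zero suggests the signal is mostly cross-hybridization.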

37 37 Current Problems Facing Expression Analysis on the Biotech side Standardization & quality control in the experiments (data quality at many levels) Cost

38 38 Problem in reproducibility of experimental data Lots of variation in arrays –more than 100 experimental steps Sources of variation –biological variability in each RNA extract –each labeling reaction is different –each slide is a separate hybridization –spots on the slide are variable across slides (and within slides when double spotted) –each “color” is scanned separately Need Replicates and Statistics!

39 39 The Value of Replicates What is a replicate? –Doing the same experiment more than once –An experimental design issue –How many? Most people don’t do true replicates –Why not? cost (the primary reason); limited sample size; ignorance of statistical considerations; competition

40 40 Outcome “Noisy” data Data preprocessing is necessary –normalization –scaling Heavy reliance on statistics today

41 41 Pre-processing Gene filtering –control genes –uninformative genes Normalization and scaling –allows comparisons across arrays –scaling to control dynamic range Transformation logarithmic transformation for improved statistical properties
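
One simple way to make two channels comparable is to rescale one channel so that both have the same mean intensity, then take log ratios. The sketch below assumes that global-scaling approach (other normalization methods exist) and uses made-up intensity pairs; `factor` and `normalized-log-ratios` are illustrative names.

```scheme
(define mean (lambda (l) (/ (apply + l) (length l))))
(define log2 (lambda (x) (/ (log x) (log 2))))

;; Hypothetical (red green) intensity pairs; green is systematically dimmer.
(define pairs '((2000 1000) (400 200) (900 300) (300 300)))

(define reds (map car pairs))
(define greens (map cadr pairs))

;; Scale factor that makes the two channel means equal (here 2).
(define factor (/ (mean reds) (mean greens)))

;; Log2 ratios after rescaling the green channel.
(define normalized-log-ratios
  (map (lambda (r g) (log2 (/ r (* factor g))))
       reds greens))
;; roughly (0 0 0.58 -1.0)
```

Before scaling, every gene but the last looks red-biased; after scaling, only the third gene remains clearly above zero.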

42 Normalization [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

43 43 Outcomes of Microarray Analysis Large, complex data sets –example of a routine study: 50,000 “genes” from 20 samples → approx. 1-2 × 10^6 pieces of data → challenges for Bioinformatics: annotation, storage, retrieval, sharing of data; information from the data

44 44 Current State of Microarray Data Availability Wide availability of technology has given rise to a large number of distributed databases data scattered among many independent sites (accessible via Internet) or not publicly available at all Need for standardization!

45 45 Public Repositories & Efforts Towards Standardization GeneX at US National Center for Genome Resources ArrayExpress at European Bioinformatics Institute Gene Expression Omnibus at US National Center for Biotechnology Information Stanford University Database

46 46 Standardization of the biological databases is a big issue A prime example of databases in need of standardization: the gene expression databases Why? –wide availability of technologies such as the microarray has given rise to a large number of heterogeneous, distributed databases –differ in annotation, database structure, availability –standardization is necessary to enable scientists to share and compare data

47 47 MGED Group and Standardization Issues Microarray Gene Expression Database (MGED) Group MGED is taking on the challenge of standardization Four major projects

48 48 MIAME - The formulation of the minimum information about a microarray experiment required to interpret and verify the results. MAGE - The establishment of a data exchange format (MAGE-ML) and object model (MAGE-OM) for microarray experiments. MGED Projects

49 49 MGED Projects Ontologies - The development of ontologies for microarray experiment description and biological material (biomaterial) annotation in particular. Normalization - The development of recommendations regarding experimental controls and data normalization methods.

50 50 Some Basic Statistics dot product mean standard deviation log base 2 etc.
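
The statistics listed here are used by later code under the names `mean`, `standarddeviation` and `dotproduct`, but their definitions are not shown. A minimal sketch of how they might be written follows. It assumes the population standard deviation (dividing by n), which is what the later Pearson correlation code needs to reproduce the quoted results; `log2` is an assumed helper name.

```scheme
(define mean
  (lambda (l)
    (/ (apply + l) (length l))))

;; Population standard deviation (divides by n, not n - 1).
(define standarddeviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (/ (apply + (map (lambda (x) (* (- x m) (- x m))) l))
               (length l))))))

(define dotproduct
  (lambda (xs ys)
    (apply + (map * xs ys))))

;; Logarithm base 2, via the natural logarithm.
(define log2
  (lambda (x)
    (/ (log x) (log 2))))

(mean '(1 2 3 4 3 2))            ; 5/2 (an exact rational)
(dotproduct '(1 2 3) '(4 5 6))   ; 32
(log2 8)                         ; close to 3.0
```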

51 51 Outline of Studio/Lab I will display some Scheme code –And provide some Java code for you I will progress towards hierarchical clustering You will progress towards k-means clustering You will work with random data –But next week we’ll be introduced to real data that is in need of investigation

52 52 Gene chips Spots representing thousands of genes Two populations of cDNA –different conditions to be compared One colored with Cy5 (red) One colored with Cy3 (green) Mixed, incubated with the chip Figures from Campbell-Heyer Chapter 4

53 53 Red/Green Intensity measurements
(define redgreens
  '((2345 2467) (3589 2158) (4109 1469) (1500 3589) (1246 1258)
    (1937 2104) (2561 1562) (2962 3012) (3585 1209) (2796 1005)
    (2170 4245) (1896 2996) (1023 3354) (1698 2896)))
Shows (red green) intensities for 14 (out of 6200!) genes. This is the kind of data you can get from the .cel files (raw data from laser scans of microarrays).

54 54 Should we normalize? Average of reds is 2386.9 Average of greens is 2380.3 What does John Quackenbush say? (page 420) Calculate standard deviations. Return to this issue For now, no normalization
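
The channel averages quoted above can be checked directly, along with the standard deviations the slide asks for. This is a sketch that assumes population standard deviation and repeats the helper definitions so that it runs on its own.

```scheme
(define redgreens
  '((2345 2467) (3589 2158) (4109 1469) (1500 3589) (1246 1258)
    (1937 2104) (2561 1562) (2962 3012) (3585 1209) (2796 1005)
    (2170 4245) (1896 2996) (1023 3354) (1698 2896)))

(define mean
  (lambda (l) (/ (apply + l) (length l))))

;; Population standard deviation (an assumption; divides by n).
(define standarddeviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (/ (apply + (map (lambda (x) (* (- x m) (- x m))) l))
               (length l))))))

(define reds (map car redgreens))
(define greens (map cadr redgreens))

(exact->inexact (mean reds))     ; about 2386.9
(exact->inexact (mean greens))   ; about 2380.3
(standarddeviation reds)         ; spread of the red channel
(standarddeviation greens)       ; spread of the green channel
```

The two means differ by well under one percent, which is consistent with the decision to skip normalization for now.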

55 55 Ratios of red values to green
(define redgreenratios
  (map (lambda (x) (round2 (/ (car x) (cadr x))))
       redgreens))
Produces (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78 0.51 0.63 0.31 0.59)
Which genes are expressed more in red than green? Should these values be normalized?
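
The ratio code relies on a helper `round2` that is not shown. One plausible definition (an assumption) rounds to two decimal places. The sketch also answers the first question on the slide by listing the 1-based positions of genes whose ratio exceeds 1; `positions-above` is a hypothetical helper.

```scheme
(define redgreens
  '((2345 2467) (3589 2158) (4109 1469) (1500 3589) (1246 1258)
    (1937 2104) (2561 1562) (2962 3012) (3585 1209) (2796 1005)
    (2170 4245) (1896 2996) (1023 3354) (1698 2896)))

;; Assumed definition: round to two decimal places.
(define round2
  (lambda (x)
    (/ (round (* 100 x)) 100.0)))

(define redgreenratios
  (map (lambda (x) (round2 (/ (car x) (cadr x))))
       redgreens))

;; 1-based positions of the entries greater than threshold.
(define positions-above
  (lambda (threshold l)
    (let loop ((rest l) (i 1) (acc '()))
      (cond ((null? rest) (reverse acc))
            ((> (car rest) threshold)
             (loop (cdr rest) (+ i 1) (cons i acc)))
            (else (loop (cdr rest) (+ i 1) acc))))))

(positions-above 1.0 redgreenratios)   ; (2 3 7 9 10)
```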

56 56 Yet another Color scheme (0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78 0.51 0.63 0.31 0.59) Highly expressed Neutral Less expressed >2.0 >1.3 close to 1.0 >0.5 <0.5 –Seems arbitrary? –Log scale?? Why oh why did they re-use red and green? Clustering? Meaning?
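
On the log scale question: taking log base 2 of the ratios makes over- and under-expression symmetric, so a ratio of 2.0 maps to +1 and a ratio of 0.5 maps to -1. A small sketch, with `log2` and `round2` as assumed helpers:

```scheme
(define log2
  (lambda (x) (/ (log x) (log 2))))

;; Assumed definition: round to two decimal places.
(define round2
  (lambda (x) (/ (round (* 100 x)) 100.0)))

(define redgreenratios
  '(0.95 1.66 2.8 0.42 0.99 0.92 1.64 0.98 2.97 2.78 0.51 0.63 0.31 0.59))

;; Log2 ratios: positive means more red, negative means more green.
(define logratios
  (map (lambda (r) (round2 (log2 r))) redgreenratios))

(log2 2.0)   ; about 1.0
(log2 0.5)   ; about -1.0
```

With raw ratios, under-expression is squeezed into the interval between 0 and 1, while over-expression is unbounded; the log transform removes that asymmetry.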

57 57 Larger experiment 12 Genes Expression values at 0, 2, 4, 6, 8 and 10 hours

58 58 Table 4.2 of Campbell/Heyer
Name  0 hrs  2 hrs   4 hrs   6 hrs   8 hrs   10 hrs
C     1      8       12      16      12      8
D     1      3       4       4       3       2
E     1      4       8       8       8       8
F     1      1       1       0.25    0.25    0.1
G     1      2       3       4       3       2
H     1      0.5     0.33    0.25    0.33    0.5
I     1      4       8       4       1       0.5
J     1      2       1       2       1       2
K     1      1       1       1       3       3
L     1      2       3       4       3       2
M     1      0.33    0.25    0.25    0.33    0.5
N     1      0.125   0.0833  0.0625  0.0833  0.125
Normalized how?

59 59 Take logs
C  0   3.0   3.58   4.0   3.58   3.0
D  0   1.58  2.0    2.0   1.58   1.0
E  0   2.0   3.0    3.0   3.0    3.0
F  0   0     0     -2.0  -2.0   -3.32
G  0   1.0   1.58   2.0   1.58   1.0
H  0  -1.0  -1.6   -2.0  -1.6   -1.0
I  0   2.0   3.0    2.0   0     -1.0
J  0   1.0   0      1.0   0      1.0
K  0   0     0      0     1.58   1.58
L  0   1.0   1.58   2.0   1.58   1.0
M  0  -1.6  -2.0   -2.0  -1.6   -1.0
N  0  -3.0  -3.59  -4.0  -3.59  -3.0
Compare
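
Later code refers to a list named `logtable42`. A sketch of how it might be built from the raw table is shown below for four of the rows. The name `table42`, the row layout (a symbol followed by the values) and the helpers `log2` and `round2` are assumptions.

```scheme
(define log2
  (lambda (x) (/ (log x) (log 2))))

;; Assumed definition: round to two decimal places.
(define round2
  (lambda (x) (/ (round (* 100 x)) 100.0)))

;; A few rows of Table 4.2: gene name followed by expression values.
(define table42
  '((C 1 8 12 16 12 8)
    (D 1 3 4 4 3 2)
    (E 1 4 8 8 8 8)
    (G 1 2 3 4 3 2)))

;; Same rows with every value replaced by its rounded log base 2.
(define logtable42
  (map (lambda (row)
         (cons (car row)
               (map (lambda (x) (round2 (log2 x))) (cdr row))))
       table42))

(assq 'C logtable42)   ; roughly (C 0 3.0 3.58 4.0 3.58 3.0)
```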

60 60 How Similar are two Rows? How similar are the expressions of two genes? First we’ll normalize each row
(define normalize   ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))
What are the new mean and standard deviation?
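
The closing question can be answered by running the code: after normalization a row has mean 0 and standard deviation 1, up to floating-point error. The sketch below assumes population standard deviation for `standarddeviation` and repeats the helpers so that it runs on its own.

```scheme
(define mean
  (lambda (l) (/ (apply + l) (length l))))

;; Population standard deviation (an assumption; divides by n).
(define standarddeviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (/ (apply + (map (lambda (x) (* (- x m) (- x m))) l))
               (length l))))))

(define normalize   ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))

(define row-g '(1 2 3 4 3 2))

(mean (normalize row-g))                ; very close to 0
(standarddeviation (normalize row-g))   ; very close to 1
```

Note that a constant row has standard deviation 0, so `normalize` would divide by zero on it.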

61 61 How Similar are two Rows? Calculate the Pearson Correlation between pairs of rows
(define pc   ; pearson correlation
  (lambda (xs ys)
    (/ (dotproduct (normalize xs) (normalize ys))
       (length xs))))
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 2 3 4 3 2))   ; row L
1.0
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 3 4 4 3 2))   ; row D
0.8971499589146109
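
For reference, here is a self-contained version of the correlation code with the helper functions filled in. The helper bodies are assumptions (in particular the population standard deviation), chosen so that dividing the dot product by the row length gives a value between -1 and 1.

```scheme
(define mean
  (lambda (l) (/ (apply + l) (length l))))

;; Population standard deviation (divides by n).
(define standarddeviation
  (lambda (l)
    (let ((m (mean l)))
      (sqrt (/ (apply + (map (lambda (x) (* (- x m) (- x m))) l))
               (length l))))))

(define dotproduct
  (lambda (xs ys) (apply + (map * xs ys))))

(define normalize   ; subtract mean and divide by sd
  (lambda (l)
    (let ((m (mean l))
          (s (standarddeviation l)))
      (map (lambda (x) (/ (- x m) s)) l))))

(define pc   ; pearson correlation
  (lambda (xs ys)
    (/ (dotproduct (normalize xs) (normalize ys))
       (length xs))))

(pc '(1 2 3 4 3 2) '(1 2 3 4 3 2))   ; rows G and L: about 1.0
(pc '(1 2 3 4 3 2) '(1 3 4 4 3 2))   ; rows G and D: about 0.897
```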

62 62 Some other pairs (rows from Table 4.2 of Campbell/Heyer)
> (pc '(1 3 4 4 3 2)                  ; row D
      '(1 0.33 0.25 0.25 0.33 0.5))   ; row M
-0.9260278787295065
> (pc '(1 2 3 4 3 2)                  ; row G
      '(1 0.5 0.33 0.25 0.33 0.5))    ; row H
-0.9090853650855358

63 63 Correlation is sensitive to relative magnitudes
pc(G,L) = 1 -- identically expressed genes
pc(G,D) = .897 -- similarly expressed genes
pc(D,M) = -.926 -- reciprocally expressed
pc(G,H) = -.909 -- also reciprocally expressed
What happens if, instead of using the expression data, we use the log transforms?
–pc(G,L) = 1.0
–pc(G,D) = 0.939
–pc(D,M) = -1.0
–pc(G,H) = -1.0

64 64 Hierarchical Clustering Repeat –Replace the two closest objects by their combination Until only one object remains
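
The loop above can be sketched in Scheme as follows. This is only an illustration: it assumes Euclidean distance as the measure of closeness (correlation would be another choice) and averages the two merged objects. The names `distance`, `closest-pair`, `remove-object` and `cluster` are hypothetical.

```scheme
;; An object is a name followed by its values, e.g. ("E" 0 2.0 3.0 3.0 3.0 3.0).
(define combine   ; concatenate the names, average the entries
  (lambda (xs ys)
    (cons (string-append (car xs) (car ys))
          (map (lambda (x y) (/ (+ x y) 2.0)) (cdr xs) (cdr ys)))))

(define distance   ; Euclidean distance between the values of two objects
  (lambda (xs ys)
    (sqrt (apply + (map (lambda (x y) (* (- x y) (- x y)))
                        (cdr xs) (cdr ys))))))

;; Return the two closest objects as a list (a b).
(define closest-pair
  (lambda (objects)
    (let outer ((xs objects) (best #f) (best-d #f))
      (if (null? xs)
          best
          (let inner ((ys (cdr xs)) (best best) (best-d best-d))
            (if (null? ys)
                (outer (cdr xs) best best-d)
                (let ((d (distance (car xs) (car ys))))
                  (if (or (not best-d) (< d best-d))
                      (inner (cdr ys) (list (car xs) (car ys)) d)
                      (inner (cdr ys) best best-d)))))))))

(define remove-object   ; remove obj (compared by identity) from the list
  (lambda (obj objects)
    (cond ((null? objects) '())
          ((eq? (car objects) obj) (cdr objects))
          (else (cons (car objects) (remove-object obj (cdr objects)))))))

;; Repeat: replace the two closest objects by their combination.
(define cluster
  (lambda (objects)
    (if (null? (cdr objects))
        (car objects)
        (let* ((pair (closest-pair objects))
               (a (car pair))
               (b (cadr pair))
               (rest (remove-object b (remove-object a objects))))
          (cluster (cons (combine a b) rest))))))

(define objects
  (list (list "C" 0 3.0 3.58 4.0 3.58 3.0)
        (list "E" 0 2.0 3.0 3.0 3.0 3.0)
        (list "G" 0 1.0 1.58 2.0 1.58 1.0)
        (list "L" 0 1.0 1.58 2.0 1.58 1.0)))

(car (cluster objects))   ; "CEGL": G and L merge first, then C and E
```

The combined name records only which genes were merged, not the order of the merges; keeping the tree structure would require storing the pairs instead of concatenating strings.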

65 65 What are the objects?
(define objects
  (map (lambda (x)
         (cons (symbol->string (car x)) (cdr x)))
       logtable42))
Initially, the objects are the genes with the log transformed expression levels. Typical object: ("E" 0 2.0 3.0 3.0 3.0 3.0)

66 66 Combining objects
(define combine
  (lambda (xs ys)
    (cons (string-append (car xs) (car ys))      ; combine names
          (map (lambda (x y) (/ (+ x y) 2.0))    ; average the entries
               (cdr xs) (cdr ys)))))
Typical combined pair: ("EG" 0 1.5 2.29 2.5 2.29 2.0)

67 67 Manual Hierarchical Clustering Let’s go to emacs

68 68 K-means Clustering -- Lloyd’s Algorithm
Partition data into k clusters
REPEAT
  FOR each datapoint {
    Calculate its distance to the centroid of each cluster
    IF this is minimal for its own cluster
      Leave the datapoint in its current cluster
    ELSE
      Place it in its closest cluster
  }
UNTIL no datapoint is moved
Goal: minimize the sum of squared distances from datapoints to their cluster centroids
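
A compact sketch of the same idea in Scheme follows. It makes two simplifying assumptions: it starts from k initial centroids rather than an initial partition, and it recomputes all centroids once per pass instead of after each move. An empty cluster keeps its previous centroid. All function names are hypothetical.

```scheme
(define sq-distance   ; squared Euclidean distance between two points
  (lambda (p q)
    (apply + (map (lambda (x y) (* (- x y) (- x y))) p q))))

(define centroid   ; component-wise mean of a non-empty list of points
  (lambda (points)
    (map (lambda (s) (/ s (length points)))
         (apply map + points))))

;; Index of the centroid nearest to point p.
(define nearest
  (lambda (p centroids)
    (let loop ((cs (cdr centroids)) (i 1)
               (best 0) (best-d (sq-distance p (car centroids))))
      (if (null? cs)
          best
          (let ((d (sq-distance p (car cs))))
            (if (< d best-d)
                (loop (cdr cs) (+ i 1) i d)
                (loop (cdr cs) (+ i 1) best best-d)))))))

;; The points whose label is i.
(define members
  (lambda (i points labels)
    (cond ((null? points) '())
          ((= (car labels) i)
           (cons (car points) (members i (cdr points) (cdr labels))))
          (else (members i (cdr points) (cdr labels))))))

;; Recompute each centroid; an empty cluster keeps its old centroid.
(define update
  (lambda (centroids points labels)
    (let loop ((cs centroids) (i 0))
      (if (null? cs)
          '()
          (let ((ms (members i points labels)))
            (cons (if (null? ms) (car cs) (centroid ms))
                  (loop (cdr cs) (+ i 1))))))))

;; Returns one cluster index per point once no point changes cluster.
(define kmeans
  (lambda (points centroids)
    (let loop ((centroids centroids) (labels '()))
      (let ((new-labels (map (lambda (p) (nearest p centroids)) points)))
        (if (equal? new-labels labels)
            labels
            (loop (update centroids points new-labels) new-labels))))))

(define points '((1 1) (1 2) (2 1) (8 8) (9 8) (8 9)))

(kmeans points '((1 1) (8 8)))   ; (0 0 0 1 1 1)
```

Different initial centroids can lead to different final clusters, which is the sensitivity to initialization discussed later.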

69 69 Analysis of k-means clustering
There are always exactly k clusters
No cluster is empty (why?)
The clusters are not hierarchical
The clusters do not overlap
Run time with n datapoints:
  Partitioning is O(n)
  FOR loop is O(nk)
  REPEAT loop is ??? (Kanungo et al.)

70 70 Pro and Con Pro With small k, may be faster than hierarchical Clusters may be “tighter” Con Sensitive to initial choice of k Sensitive to initial partition May converge to local, rather than global minimum Not clear how good resulting clusters are

71 71 Other Methods for Clustering Self Organizing Maps SWARM technology SOM/SWARM hybrid

72 72 Mining of Expression Data A gene expression pattern derived from a single microarray is simply a snapshot (one experimental sample vs reference) Usually want to understand a process or changes in expression over a collection of samples → gene expression profile

73 73 Three levels of microarray gene expression data processing Brazma et al., Nature Genetics, 29:365-371, 2001

74 74 Goal of Analysis of Expression Matrix Some statistical methods applied to: 1.“Group” similar genes together => groups of functionally similar genes. 2.”Group” similar cell samples together. 3.“Extract” representative genes in each group.

75 75 Typical approach Look for patterns –compare rows to find evidence for co-regulation of genes –compare columns to find evidence for relatedness among samples 1) Choose a measure of similarity (distance) among the objects being compared-each row or column is considered a vector in space 2) Then, group together objects (genes or samples) with similar properties-is a multidimensional analysis

76 76 Pattern Recognition Clustering Feature extraction/selection Classification-discrimination analysis

77 77 Analytic Approaches Clustering: Identification of associations between data points; organization of data into groups— exploratory analysis Clustering Algorithms –Hierarchical –K-means –Self-organizing maps –Others

78 Gene Expression Matrix & Hierarchical Clustering [Figure: gene expression matrix with axes labeled samples and genes] Eisen et al. nt/full/95/25/14863

79 79 Feature Selection & Classification First, identify features (genes) that discriminate between classes Then use features for classification –machine learning approach –supervised analysis –assignment of a new sample—pattern—to a previously specified class, based on sample features and a trained classifier

80 80 “Classic” Example: Classification of AML vs. ALL Comparing 2 acute leukemias: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) Biological/Clinical Problems: previously, no single reliable test to distinguish them; they differ greatly in clinical course & response to treatments Golub et al., Science Oct 15 1999: 531-537

81 81 Golub et al., Science Oct 15 1999: 531-537 Study Design

82 82 Results of the study 1) Clustering of microarray data using tumors of known type → found 1100 of 6817 genes correlated with class distinction 2) Formation of a class predictor = 50 most informative genes used as a training set → classification of unknown tumors Golub et al., Science Oct 15 1999: 531-537

83 83 The prediction of a new sample is based on 'weighted votes' of a set of informative genes
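
A rough sketch of weighted voting: each informative gene casts a vote equal to its weight times the distance of the sample's expression level from a decision boundary, and the class with the larger total wins. In this scheme the weight reflects how strongly the gene correlates with the class distinction, and the boundary lies midway between the two class means. The weights, boundaries and sample values below are made-up numbers, not values from Golub et al.

```scheme
;; Hypothetical (weight boundary) parameters for three informative genes.
(define genes '((1.2 100.0) (-0.8 250.0) (0.5 40.0)))

;; Hypothetical expression levels of those genes in a new sample.
(define sample '(180.0 300.0 35.0))

;; Vote of one gene: weight * (expression - boundary).
(define votes
  (map (lambda (g x) (* (car g) (- x (cadr g))))
       genes sample))

;; Total vote magnitude on one side (positive or negative votes).
(define total
  (lambda (pred)
    (apply + (map (lambda (v) (if (pred v) (abs v) 0)) votes))))

(define v-pos (total positive?))   ; votes for the first class (say AML)
(define v-neg (total negative?))   ; votes for the second class (say ALL)

(define prediction (if (> v-pos v-neg) 'AML 'ALL))

;; Margin of victory relative to the total vote, between 0 and 1.
(define prediction-strength
  (/ (abs (- v-pos v-neg)) (+ v-pos v-neg)))

prediction            ; AML
prediction-strength   ; roughly 0.39
```

A low prediction strength signals that the votes are nearly balanced, in which case the sample can be left unclassified rather than forced into a class.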


85 85 Free Software for Microarray Analysis Cluster & TreeView –Michael Eisen –

86 86 More Free Software Expression Profiler (European Bioinformatics Inst) GeneX (National Center for Genome Resources) ArrayViewer and MEV (TIGR) Many More!!

87 87 Microarray Analysis Software Popular commercial packages –Spotfire DecisionSite (Spotfire, Inc) –GeneSpring (Silicon Genetics) –Affymetrix Microarray Suite –Affymetrix Data Mining Tool (Affymetrix, Inc.) –Rosetta Inpharmatics' Resolver

88 88 Why cluster analysis may not be “the” answer Clustering methods typically require user inputs: Example: distance measure Clustering methods differ in the way that the number of clusters is specified. Clustering methods are often sensitive to the initialization condition (starting guess) Local vs. global sampling of clustering space

89 89 Cluster Analysis Challenges “Noise” in the data itself Large data sets –most of the techniques currently used were not developed for multidimensional data What about networks? –limitation of cluster analysis: similarity in expression pattern suggests co-regulation but doesn’t reveal cause-effect relationships

90 Using information networks as an interpretive layer between phenotypes and the underlying genes, proteins and metabolites Highly connected genes are often critical in the onset of cancer and metabolic diseases. However, drug treatment targeting less connected genes will have fewer side effects. Database stores information about the connections among cellular building blocks and traits. DNA chip/microarray Red indicates regions implicated in disease Human Chromosomes 5 & 13 Use network to understand the relationship between genes associated with disease regions. J. Blanchard-CAAGED Workshop 2002

91 91 Other Analytic Approaches are Being Explored to Reverse Engineer Networks Bayesian Networks –represent the dependence structure between multiple interacting quantities (e.g. expression levels of genes) –gene interactions & models of causal influence –good for “noisy” data

92 92 Analytic Challenges Advanced methods may require significant computational resources –Numerically complex calculations (large correlation matrices) –Combinatorially large search spaces (Bayesian nets) –Many training cycles (neural nets) –Global optimization (genetic algorithms) –Archiving, indexing, and correlation of large datasets (data extraction and data mining; visualization) S. Atlas CAAGED Workshop

93 93 “Take-home Messages” With current technology global gene expression studies are best used for hypothesis building Other experimental methods and smaller data sets needed to address the reproducibility problem New analytical approaches are needed to deal with the multidimensional data Need for high performance computing Need for Bio/CS/IT/Stats people to work together!
