Proteomics: Bioinformatics beyond sequences. Analyzing Global Gene Expression.


1 Proteomics: Bioinformatics beyond sequences

2 Analyzing Global Gene Expression

3 Microarray Data A “snapshot” of the amount of a particular gene being transcribed in a tissue Measured for tens of thousands of genes Use of multiple tissues on a single array allows for direct comparisons between tissues

4 Objectives of Microarray Studies Which genes are affected when exposed to a “treatment”? –Hit it with a stick and see what happens Given a “profile” of levels of expression for many genes, can the unknown “treatment” be predicted? –Tumor or disease classification Time course experiments allow the study of coregulation of genes, and for the reconstruction of regulatory networks

5 Microarray Technology Spotted arrays –Attach entire sequence of genes to the array –Create cDNA from a tissue (expressed genes) –Wash the pool of cDNAs over the array –Complementary sequences bind Oligonucleotide arrays (Affy chips) –Attach short (25bp) oligos instead of entire genes

6 [Figure: the gene (GTTCGA....) is transcribed into mRNA (GUUCGA....), which is copied into cDNA (CAAGCT....) via reverse transcription]
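The base pairing on this slide can be checked with a short Python sketch. The function names are illustrative only, and the cDNA is written position by position against the mRNA (as on the slide) rather than reversed into 5'-to-3' order.

```python
# Base placed in the cDNA opposite each mRNA base during reverse transcription.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def reverse_transcribe(mrna):
    """cDNA base paired to the mRNA, written in the same left-to-right order."""
    return "".join(RNA_TO_CDNA[base] for base in mrna.upper())

mrna = transcribe("GTTCGA")
cdna = reverse_transcribe(mrna)
print(mrna, cdna)  # GUUCGA CAAGCT
```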

7 Spotted arrays are usually treated with samples from two different tissues, each labeled with a different “color” of dye (Red and Green) [Figure labels: Highly expressed in tissue A; Highly expressed in tissue B]

8 Data Transformation Compute activation or repression by ratio of red/green control However, raw ratios are asymmetric, which causes discrepancies in interpreting repression vs. activation numbers: fourfold activation gives 4, while fourfold repression gives 0.25 Solution: Log transformation of data –log10(4) ≈ 0.6 while log10(0.25) ≈ -0.6
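A minimal Python sketch of the log transformation follows. The function name and the example intensities (400 and 100) are made up for illustration; only the fourfold ratios come from the slide.

```python
import math

def log_ratio(red, green, base=10):
    """Log-transformed expression ratio red/green for one spot."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log(red / green, base)

# Fourfold activation and fourfold repression become symmetric around zero.
up = log_ratio(400.0, 100.0)    # ratio 4
down = log_ratio(100.0, 400.0)  # ratio 0.25
print(round(up, 2), round(down, 2))  # 0.6 -0.6
```

Passing `base=2` gives the log2 scale used on the normalization slides.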

9 Microarray animation http://www.bio.davidson.edu/courses/genomics/chip/chip.html

10 Many computational and statistical problems Image analysis (spot identification, background, etc.) Data management and pipelining “Normalization” of data Clustering coregulated genes Classifying tissue types Regulatory network inference Promoter identification (when combined with genomic sequence data)

11 Normalization [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

12 Normalization by iterative linear regression: fit a line (y = mx + b) to the data set; set aside outliers (residuals > 2 × s.e.); repeat until r² changes by < 0.001; then apply slope and intercept to the original dataset D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
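The loop on this slide can be sketched in plain Python as below. This is an illustration under stated assumptions, not the Finkelstein et al. implementation: "s.e." is read as the standard error of the regression residuals, and "apply slope and intercept" is read as rescaling y so that the fitted line becomes y = x. All function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return m, b, r2

def iterative_fit(xs, ys, tol=0.001, max_iter=50):
    """Refit repeatedly, setting aside points with |residual| > 2 x s.e."""
    keep = list(zip(xs, ys))
    m, b, r2 = fit_line(xs, ys)
    for _ in range(max_iter):
        residuals = [y - (m * x + b) for x, y in keep]
        # Assumed definition: standard error of the regression (n - 2 d.o.f.).
        se = math.sqrt(sum(r * r for r in residuals) / (len(keep) - 2))
        trimmed = [p for p, r in zip(keep, residuals) if abs(r) <= 2 * se]
        if len(trimmed) < 3 or len(trimmed) == len(keep):
            break  # too few points left, or nothing more to set aside
        keep = trimmed
        new_m, new_b, new_r2 = fit_line(*zip(*keep))
        converged = abs(new_r2 - r2) < tol
        m, b, r2 = new_m, new_b, new_r2
        if converged:
            break
    return m, b

def normalize(xs, ys):
    """Assumed normalization: rescale every y so the fitted line maps to y = x."""
    m, b = iterative_fit(xs, ys)
    return [(y - b) / m for y in ys]

# Example: a clean line y = 2x + 1 with one gross outlier.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
ys[10] += 30.0
slope, intercept = iterative_fit(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

Outliers are only set aside while estimating the line; `normalize` still rescales every original point, which matches "apply slope and intercept to the original dataset".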

13 Normalization (Linear) [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

14 Normalization (Linear) [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

15 Clustering genes with similar expression patterns can be comparable to distance-based phylogenetic inference Compute a matrix of pairwise “profile similarity” scores between genes Use these scores in something like UPGMA Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868

16 Hierarchical Clustering Compare figure 5-4 to UPGMA, see any differences? Any data can be clustered, therefore we must be careful what conclusions we draw from our results Clustering can be non-deterministic (for example, when ties are broken arbitrarily or starting points are chosen at random) and, as such, can produce different results on different runs

17

18 Pearson correlation coefficient – one type of distance metric Calculate mean and standard deviation for the rows in question Subtract the appropriate mean from each value in a row and divide by the standard deviation to generate a normalized row of data Multiply corresponding values from each row and keep a running total Divide the total by number of elements in the row to get the correlation coefficient
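The recipe above translates almost line for line into Python. This sketch assumes the population standard deviation (dividing by n); with that choice, dividing the running total by the number of elements gives the Pearson coefficient. The function name is illustrative.

```python
import math

def pearson(row_a, row_b):
    """Pearson correlation of two equal-length expression rows."""
    n = len(row_a)
    if n == 0 or n != len(row_b):
        raise ValueError("rows must be non-empty and of equal length")
    mean_a = sum(row_a) / n
    mean_b = sum(row_b) / n
    # Population standard deviation (divide by n, not n - 1).
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in row_a) / n)
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in row_b) / n)
    if sd_a == 0 or sd_b == 0:
        raise ValueError("a flat row has no defined correlation")
    total = 0.0
    for a, b in zip(row_a, row_b):
        # Multiply the normalized values and keep a running total.
        total += ((a - mean_a) / sd_a) * ((b - mean_b) / sd_b)
    return total / n

same_shape = pearson([1, 2, 3, 4], [2, 4, 6, 8])
mirrored = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(round(same_shape, 6), round(mirrored, 6))
```

Rows with the same shape score close to 1.0 and mirrored rows close to -1.0, regardless of scale.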

19 Merit of this coefficient If identical patterns, value should be 1.0 Reciprocal patterns, value should be -1.0 USE LOG TRANSFORMED DATA for computation of Pearson coefficient Used in Clustering

20 Clustering genes Combine rows pairwise based on Pearson coefficients until all rows accounted for Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868
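A small sketch of pairwise merging driven by Pearson coefficients is given below. It uses UPGMA-style average linkage as an assumption; Eisen et al. describe their own procedure, which this does not claim to reproduce. The gene names and profiles are invented for illustration.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation (assumes neither row is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def cluster_similarity(c1, c2, profiles):
    """Average pairwise correlation between the members of two clusters."""
    scores = [pearson(profiles[g], profiles[h]) for g in c1 for h in c2]
    return sum(scores) / len(scores)

def hierarchical_cluster(profiles):
    """Merge the two most similar clusters until all rows are accounted for."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(pair[0], pair[1], profiles))
        merges.append((a, b, cluster_similarity(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical log-transformed profiles for four genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 1],
}
merges = hierarchical_cluster(profiles)
for left, right, score in merges:
    print(left, right, round(score, 3))
```

The order of the merges is the tree: early merges join tightly correlated rows, and the last merge joins the two most dissimilar groups.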

21 Guilt by association Genes exhibiting similar expression patterns are thought to be involved in common physiological processes Can be used to find potential regulatory sequences

22 K-means Clustering Given a set of n data points in d-dimensional space and an integer k We want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center No exact polynomial-time algorithms are known for this problem “A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

23 Euclidean Distance Now to find the distance between two points, say the origin and the point (3,4): Simple and Fast! Remember this when we consider the complexity!
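The formula itself did not survive in this transcript. As a sketch, the standard Euclidean distance (the square root of the summed squared coordinate differences) applied to the slide's example looks like this; the function name is illustrative.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

Each call costs one pass over the d coordinates, which is why the distance step stays cheap when complexity is considered.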

24 Finding a Centroid We use the following equation to find the n dimensional centroid point amid k n dimensional points: Let’s find the midpoint between 3 2D points, say: (2,4) (5,2) (8,9)
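The equation is missing from this transcript. Assuming the usual definition (the centroid is the coordinate-wise mean of the k points), the slide's example works out as follows; the function name is illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of k points that all have the same dimension."""
    k = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / k for i in range(dims))

print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```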

25 K-means Algorithm 1. Choose k initial center points randomly 2. Cluster data using Euclidean distance (or other distance metric) 3. Calculate new center points for each cluster using only points within the cluster 4. Re-cluster all data using the new center points (this step could cause data points to be placed in a different cluster) 5. Repeat steps 3 & 4 until the center points have moved such that in step 4 no data points are moved from one cluster to another, or some other convergence criterion is met From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
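The five steps can be sketched in plain Python as below. Two choices are assumptions made for the sketch: the initial centers are sampled from the data points themselves, and a fixed random seed is used so that runs are repeatable. Names and example data are illustrative.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) after clustering a list of point tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:             # step 5: no point changed cluster
            break
        labels = new_labels
        # Step 3: recompute each center from the points in its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old center if empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Changing `seed` changes the starting centers, which is the source of the run-to-run variation discussed on the later slides.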

26 An example with k=2 1. We pick k=2 centers at random 2. We cluster our data around these center points Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

27 K-means example with k=2 3. We recalculate centers based on our current clusters Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

28 K-means example with k=2 4. We re-cluster our data around our new center points Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

29 K-means example with k=2 5. We repeat the last two steps until no more data points are moved into a different cluster Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

30 Another example: Initially distribute codes randomly in pattern space [Figure: codes k1, k2, k3 in the X-Y plane]

31 Assign each point to the closest code [Figure: codes k1, k2, k3]

32 Move each code to the mean of all its assigned points [Figure: codes k1, k2, k3 at their old and new positions in the X-Y plane]

33 Repeat the process: reassign the data points to the codes Q: Which points are reassigned? [Figure: codes k1, k2, k3 in the X-Y plane]

34 Re-compute cluster means [Figure: codes k1, k2, k3 in the X-Y plane]

35 Move cluster centers to cluster means [Figure: codes k1, k2, k3 in the X-Y plane]

36 Advantages Simple, understandable Items automatically assigned to clusters Disadvantages Must pick number of clusters beforehand All items forced into a cluster Sensitive to outliers Extensions Adaptive k-means K-medoids (based on a median-like central point instead of the mean) –1,2,3,4,100 → average 22, median 3
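The outlier example on this slide can be checked with Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 100]
average = mean(values)    # pulled toward the outlier 100
middle = median(values)   # unaffected by how extreme the outlier is
print(average, middle)
```

Replacing 100 with 1000 changes the mean but leaves the median at 3, which is why median-based centers are less sensitive to outliers.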

37 Choosing k Use another clustering method Run algorithm on data with several different values of k Use advance knowledge about the characteristics of your test –Cancerous vs Non-Cancerous

38 Cluster Quality Since any data can be clustered, how do we know our clusters are meaningful? –The size (diameter) of the cluster vs. The inter-cluster distance –Distance between the members of a cluster and the cluster’s center –Diameter of the smallest sphere From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

39 Cluster Quality Continued Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter [Figure labels: size=5, distance=20, distance=5] Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
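As a rough sketch, the ratio can be computed as below. The reading of the figure labels (a cluster of diameter 5 whose nearest neighbor is either 20 or 5 away) is an assumption, as is the function name.

```python
def separation_ratio(diameter, distance_to_nearest_cluster):
    """Distance to the nearest cluster divided by the cluster's own diameter."""
    if diameter <= 0:
        raise ValueError("diameter must be positive")
    return distance_to_nearest_cluster / diameter

print(separation_ratio(5, 20))  # 4.0
print(separation_ratio(5, 5))   # 1.0
```

Under this reading, a larger ratio suggests a cluster that is compact relative to how far it sits from its neighbors.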

40 Cluster Quality Continued Quality can be assessed simply by looking at the diameter of a cluster A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created. From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

41 Characteristics of k-means Clustering The random selection of initial center points creates the following properties –Non-Determinism –May produce clusters without patterns One solution is to choose the centers randomly from existing patterns From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

42 Kohonen Self Organizing Feature Maps (SOFM) Creates a map in which similar patterns are plotted next to each other Data visualization technique that reduces n dimensions and displays similarities More complex than k-means or hierarchical clustering, but more meaningful Neural Network Technique –Inspired by the brain From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

43 Armidale Animal Breeding Summer Course, UNE, Feb. 2006 Self-Organizing Maps (SOM) 1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal [Figure: genes G1-G29 and nodes N1-N6; N = Nodes, G = Genes]

44 Self-Organizing Maps (SOM) 1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal [Figure: genes G1-G29 and nodes N1-N6; N = Nodes, G = Genes]

45 Self-Organizing Maps (SOM) 2. Choose a random gene, e.g., G9 3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away the node is from N2, the less it is moved. [Figure: genes G1-G29 and nodes N1-N6]

46 Self-Organizing Maps (SOM) 4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased. 5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node. [Figure: nodes N1-N6 nestled among clusters of genes G1-G29]
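Steps 1 to 5 can be sketched in plain Python as follows. Several details are assumptions chosen for the sketch rather than taken from the slides: a rectangular grid, nodes started on randomly chosen genes, a Gaussian neighborhood on the grid, and linearly decaying learning rate and radius. All names are hypothetical.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def train_som(genes, rows, cols, iterations=2000, seed=0):
    """Return {grid position: node vector} after training on gene profiles."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Step 1: one node per grid position (started on a random gene here).
    nodes = {pos: list(rng.choice(genes)) for pos in grid}
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        rate = 0.5 * remaining                   # step 4: moves shrink over time
        radius = 0.5 + 0.5 * max(rows, cols) * remaining
        gene = rng.choice(genes)                 # step 2: choose a random gene
        winner = min(grid, key=lambda pos: distance(nodes[pos], gene))
        for pos in grid:                         # step 3: move nodes toward it
            # Nodes far from the winner on the grid are moved less.
            influence = math.exp(-distance(pos, winner) ** 2 / (2 * radius ** 2))
            step = rate * influence
            nodes[pos] = [w + step * (g - w) for w, g in zip(nodes[pos], gene)]
    return nodes

def assign_clusters(genes, nodes):
    """Step 5: each gene joins the cluster of its closest node."""
    return [min(nodes, key=lambda pos: distance(nodes[pos], g)) for g in genes]

# Hypothetical two-dimensional expression profiles.
genes = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
         (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
nodes = train_som(genes, rows=1, cols=2, iterations=500)
labels = assign_clusters(genes, nodes)
print(labels)
```

The winning node always has an influence of 1, so it moves the most, and its grid neighbors follow by smaller amounts, as in step 3.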

47 Self-Organizing Maps (SOM) Situate grid of nodes along a plane where datapoints are distributed Perhaps a better view…

48 Self-Organizing Maps (SOM) Sample a gene and subject the closest node and neighboring nodes to its ‘gravitational’ influence Perhaps a better view…

49 Self-Organizing Maps (SOM) Perhaps a better view…

50 Self-Organizing Maps (SOM) Perhaps a better view… Sample another gene…

51 Self-Organizing Maps (SOM) Perhaps a better view… …and so on, and so on…

52 Self-Organizing Maps (SOM) Perhaps a better view… …until all genes have been sampled several times over. Each cluster is defined with reference to a node and comprises those genes for which that node is the closest.

53 Our Favorite Example With Yeast Reduce data set to 828 genes Clustered data into 30 clusters using a SOFM Each pattern is represented by its average (centroid) pattern Clustered data has same behavior Neighbors exhibit similar behavior

54 A SOFM Example With Yeast “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.

55 Benefits of SOFM SOFM contains the set of features extracted from the input patterns (reduces dimensions) SOFM yields a set of clusters A gene will always be more similar to a gene in its immediate neighborhood than to a gene farther away From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

56 Some freeware tools for microarray analysis indexed at Y.F. Leung’s Functional Genomics site: http://ihome.cuhk.edu.hk/~b400559/ MeV (TIGR) www.tigr.org MAExplorer (NCI) www.lecb.ncifcrf.gov/MAExplorer/ Expression Profiler (EBI) http://ep.ebi.ac.uk/ many of these tools require a Java Virtual Machine

57 Protein Interaction Tools and Techniques - Experimental Methods

58 Proteins Move in Pathways

59 Proteins Assemble

60 Proteins Interact

61 3D Structure Determination X-ray crystallography –grow crystal –collect diffraction data –calculate electron density –trace chain NMR spectroscopy –label protein –collect NMR spectra –assign spectra & NOEs –calculate structure using distance geometry

62 The Protein Fold Universe How Big Is It??? 500? 2000? 10000? ∞?

63 Structures in PDB PDB = 19860 structures Jan 03 PDB = 23997 structures Jan 04 “structural genomics” search = 156 structures Jan 03 search = 478 structures Jan 04

64 Structural Proteomics [Figure: chart of Sequences vs. Structures; axis from 0 to 100000]

65 Unique folds in PDB

66 Protein Interaction Domains http://www.mshri.on.ca/pawson/domains.html

67 Protein Interaction Domains http://www.mshri.on.ca/pawson/domains.html

68 Yeast Two-Hybrid Analysis Yeast two-hybrid experiments yield information on protein-protein interactions [Figure labels: GAL4 Binding Domain, GAL4 Activation Domain] X and Y are two proteins of interest If X & Y interact, then the reporter gene is expressed

69 Affinity Pull-down

70 DNA vs Protein Chip Technology DNA microtechnology –Can successfully read 1000’s of side by side measurements of RNA levels –BUT RNA ≠ protein = function Protein Microarray Technology –Goal: develop protein chip with proteins in active state. Proteins more challenging to prepare than DNA/RNA Protein functionality depends on state, modifications, binding partners, localization etc.

71 Arraying Process

72 Protein Chips Antibody Array Antigen Array Ligand Array Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever

73 Protein (Antigen) Chips [Figure labels: His6, GST, ORF, Nickel coating] H Zhu, J Klemic, S Chang, P Bertone, A Casamayor, K Klemic, D Smith, M Gerstein, M Reed, & M Snyder (2000). Analysis of yeast protein kinases using protein chips. Nature Genetics 26: 283-289

74 Protein (Antigen) Chips Nickel coating

75 Probe with anti-GST Mab Nickel coating

76 Anti-GST Probe

77 Probe with Cy3-labeled Calmodulin Nickel coating

78 “Functional” Protein Array Nickel coating

79 Rosetta Stone Method

80 Interologs, Homologs, Paralogs... Homolog –Common Ancestors –Common 3D Structure –Common Active Sites Ortholog –Derived from Speciation Paralog –Derived from Duplication Interolog –Conserved Protein-Protein Interaction

81 Finding Interologs If A and B interact in organism X, then if organism Y has a homolog of A (A’) and a homolog of B (B’) then A’ and B’ should interact too! Makes use of BLAST searches against entire proteome of well-studied organisms (yeast, E. coli) Requires list of known interacting partners
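The inference rule can be sketched as a simple lookup once the homolog assignments exist. The BLAST search itself is not shown; `homolog_map` stands in for its results, and every protein name below is hypothetical.

```python
def predict_interologs(known_interactions, homolog_map):
    """Predict interactions in organism Y from known interactions in organism X.

    known_interactions: iterable of (A, B) pairs known to interact in X.
    homolog_map: protein in X -> list of its homologs in Y (e.g. BLAST hits).
    """
    predictions = set()
    for a, b in known_interactions:
        # Both partners need at least one homolog for a prediction to be made.
        for a_prime in homolog_map.get(a, ()):
            for b_prime in homolog_map.get(b, ()):
                predictions.add((a_prime, b_prime))
    return predictions

known = [("A", "B"), ("C", "D")]
homologs = {"A": ["A1"], "B": ["B1", "B2"], "C": ["C1"]}
predicted = predict_interologs(known, homologs)
print(sorted(predicted))  # [('A1', 'B1'), ('A1', 'B2')]
```

The pair (C, D) yields no prediction because D has no homolog in the second organism, which shows why the method depends on a good list of known interacting partners and on proteome coverage.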

82 A Flood of Data High throughput techniques are leading to more and more data on protein interactions This is where bioinformatics can play a key role Some suggest that this is the “future” for bioinformatics

83 Interaction Databases BIND –http://www.blueprint.org/bind/bind.php DIP –http://dip.doe-mbi.ucla.edu/ MINT –http://mint.bio.uniroma2.it/mint/ PathCalling –http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

84 The BIND Database BIND - Biomolecular Interaction Network Database Conceived and Developed by Chris Hogue, Tony Pawson, Francis Ouellette Designed to capture almost all interactions between biomolecules (large and small) Largest database of its kind

85 BIND Can Encode... Simple binary interactions Enzymes, substrates and conformational changes Restriction enzymes Limited proteolysis Phosphorylation (reversible) Glycosylation Intron splicing Transcription factors

86 BIND

87 BIND Query Result click

88 BIND Details

89 click

90 BIND Details

91 DIP Database of Interacting Proteins http://dip.doe-mbi.ucla.edu/

92 DIP Query Page CGPC

93 DIP Results Page click

94 DIP Results Page

95 MINT Molecular Interaction Database http://mint.bio.uniroma2.it/mint/

96 MINT Results click

97

98 KEGG Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/kegg2.html

99 KEGG

100

101 TRANSPATH http://www.biobase.de/pages/products/transpath.html

102 BIOCARTA www.biocarta.com Go to “Pathways” Web interactive links to many signaling pathways and other eukaryotic protein-protein interactions

103

104

105 Other Databases http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html

106 Antigen Array (ELISA Chip) Mezzasoma et al. Clinical Chem. 48:121 (2002)

107 Protein Chips Antibody Array Antigen Array Ligand Array

108 Ciphergen “Ligand” Chips Hydrophobic (C8) Arrays Hydrophilic (SiO2) Arrays Anion exchange Arrays Cation exchange Arrays Immobilized Metal Affinity (NTA, nitrilotriacetic acid) Arrays Epoxy Surface (amine and thiol binding) Arrays

109 Ciphergen ProteinChip

110 Peptide/Protein Profile E. coli Salmonella

111

112 Mass spectroscopy offers immense precision and sensitivity in protein analysis

113 …and versatility

114 Trouble with 2D gels –Running gels in a reproducible manner is an “art” –Look at only fractions of proteins (i.e., difficult to resolve membrane proteins on 2D gels) Result: Investigators are pursuing proteomes based solely on mass spec data

115 Challenges for Human Proteomics –Small MW proteins (<10-12 kDa) –Low abundance proteins –High MW basic proteins –Hydrophobic proteins

116 Problem exemplified: Ran 24 x 2-D gels 2100 spots resolved Only 250 spots common between all gels Could draw conclusions on only 2% of visible spots Note: 10% of genes make >50% of protein in living cells

