Published by Ralf Jason Morgan. Modified over 8 years ago.
1
Proteomics: Bioinformatics beyond sequences
2
Analyzing Global Gene Expression
3
Microarray Data
A “snapshot” of the amount of a particular gene being transcribed in a tissue
Measured for tens of thousands of genes
Use of multiple tissues on a single array allows for direct comparisons between tissues
4
Objectives of Microarray Studies
Which genes are affected when exposed to a “treatment”?
–Hit it with a stick and see what happens
Given a “profile” of levels of expression for many genes, can the unknown “treatment” be predicted?
–Tumor or disease classification
Time course experiments allow the study of coregulation of genes, and the reconstruction of regulatory networks
5
Microarray Technology
Spotted arrays
–Attach entire sequence of genes to the array
–Create cDNA from a tissue (expressed genes)
–Wash the pool of cDNAs over the array
–Complementary sequences bind
Oligonucleotide arrays (Affy chips)
–Attach short (25bp) oligos instead of entire genes
6
The gene: GTTCGA....
mRNA: GUUCGA....
cDNA (via reverse transcription): CAAGCT....
7
Spotted arrays are usually treated with samples from two different tissues, each labeled with a different “color” of dye (red and green)
Figure labels: Highly expressed in tissue A; Highly expressed in tissue B
8
Data Transformation
Compute activation or repression by ratio of red/green control
However, raw ratios for repression and activation are not on comparable scales: 4-fold activation gives 4, while 4-fold repression gives 0.25
Solution: Log transformation of data
–log10(4) ≈ 0.6 while log10(0.25) ≈ -0.6
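The symmetry that the log transformation buys can be checked with a few lines of Python. This is a minimal sketch: the function name `log_ratio` and the intensity values are made up for illustration, and base 10 is used only to match the numbers on the slide (base 2 is also common).

```python
import math

def log_ratio(red, green, base=10):
    """Log of the red/green intensity ratio for one spot."""
    return math.log(red / green, base)

# Hypothetical intensities: 4-fold activation and 4-fold repression.
up = log_ratio(400.0, 100.0)    # ratio 4
down = log_ratio(100.0, 400.0)  # ratio 0.25

print(round(up, 2), round(down, 2))
```

After the transformation, activation and repression of the same magnitude differ only in sign.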
9
Microarray animation
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
10
Many computational and statistical problems
Image analysis (spot identification, background, etc.)
Data management and pipelining
“Normalization” of data
Clustering coregulated genes
Classifying tissue types
Regulatory network inference
Promoter identification (when combined with genomic sequence data)
11
Normalization
[Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]
12
Normalization by iterative linear regression
fit a line (y=mx+b) to the data set
set aside outliers (residuals > 2 x s.e.)
repeat until r² changes by < 0.001
then apply slope and intercept to the original dataset
D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
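The loop described on this slide might look roughly like the following. This is a sketch under stated assumptions, not the Finkelstein et al. implementation: the function names are hypothetical, "s.e." is taken to be the standard error of the residuals, and "apply slope and intercept" is assumed to mean rescaling y so that the fitted line becomes y = x.

```python
def fit_line(xs, ys):
    """Least-squares fit; returns (slope, intercept, r squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy * sxy / (sxx * syy)

def iterative_fit(xs, ys, cutoff=2.0, tol=0.001, max_rounds=50):
    """Refit after setting aside outliers until r squared stabilizes."""
    points = list(zip(xs, ys))
    slope, intercept, r2 = fit_line(xs, ys)
    for _ in range(max_rounds):
        residuals = [y - (slope * x + intercept) for x, y in points]
        # Assumption: s.e. = residual standard error with n - 2 degrees of freedom.
        s_e = (sum(r * r for r in residuals) / (len(points) - 2)) ** 0.5
        kept = [p for p, r in zip(points, residuals) if abs(r) <= cutoff * s_e]
        if len(kept) < 3 or len(kept) == len(points):
            break  # nothing (sensible) left to set aside
        new_slope, new_intercept, new_r2 = fit_line(*zip(*kept))
        converged = abs(new_r2 - r2) < tol
        points, slope, intercept, r2 = kept, new_slope, new_intercept, new_r2
        if converged:
            break
    return slope, intercept

def normalize(ys, slope, intercept):
    """Assumption: rescale y so the fitted line becomes y = x."""
    return [(y - intercept) / slope for y in ys]

# Made-up log2 signals: y = 2x + 1 with small noise and one outlier at x = 5.
xs = [float(x) for x in range(1, 11)]
ys = [2 * x + 1 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
ys[4] += 15.0
slope, intercept = iterative_fit(xs, ys)
normalized = normalize(ys, slope, intercept)
```

Note that the fit is estimated without the outliers, but the resulting slope and intercept are applied to every point in the original dataset, outliers included.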
13
Normalization (Linear)
[Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]
14
Normalization (Linear)
[Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]
15
Clustering genes with similar expression patterns can be comparable to distance-based phylogenetic inference Compute a matrix of pairwise “profile similarity” scores between genes Use these scores in something like UPGMA Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868
16
Hierarchical Clustering
Compare figure 5-4 to UPGMA, see any differences?
Any data can be clustered, therefore we must be careful what conclusions we draw from our results
Clustering results are not unique: a different distance metric or linkage rule, or a different random start in methods such as k-means, can and will produce different results
18
Pearson correlation coefficient – one type of similarity metric
Calculate mean and standard deviation for the rows in question
Subtract the appropriate mean from each value in a row and divide by the standard deviation to generate a normalized row of data
Multiply corresponding values from each row and keep a running total
Divide the total by the number of elements in the row to get the correlation coefficient (this assumes the standard deviation was also computed by dividing by the number of elements)
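The recipe on this slide translates almost line for line into Python. A minimal sketch follows; the function name and the example rows are invented, and the population standard deviation (dividing by n) is assumed so that the final division by n gives exactly the Pearson coefficient.

```python
def pearson(row_a, row_b):
    """Pearson correlation of two equal-length expression rows."""
    n = len(row_a)

    def standardize(row):
        mean = sum(row) / n
        # Population standard deviation (divide by n, not n - 1).
        sd = (sum((v - mean) ** 2 for v in row) / n) ** 0.5
        return [(v - mean) / sd for v in row]

    z_a, z_b = standardize(row_a), standardize(row_b)
    # Multiply corresponding values, keep a running total, divide by n.
    return sum(a * b for a, b in zip(z_a, z_b)) / n

same = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])      # identical pattern
opposite = pearson([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])  # reciprocal pattern
```

The two example pairs give values of about 1.0 and -1.0 respectively, since the first pair rises together and the second moves in opposite directions.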
19
Merit of this coefficient
If identical patterns, value should be 1.0
Reciprocal patterns, value should be –1.0
USE LOG TRANSFORMED DATA for computation of Pearson coefficient
Used in Clustering
20
Clustering genes Combine rows pairwise based on Pearson coefficients until all rows accounted for Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868
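One way to read "combine rows pairwise until all rows are accounted for" is agglomerative clustering with average linkage, sketched below. The gene names and profiles are made up, the function names are hypothetical, and average linkage is an assumption about the merge rule, not something stated on the slide.

```python
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length rows."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def cluster(profiles):
    """Merge the most similar pair of clusters until one tree remains.

    profiles maps gene name -> list of log ratios.
    Returns the tree as nested tuples of gene names.
    """
    clusters = [(name, [name]) for name in profiles]  # (subtree, member names)
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean correlation over all cross-cluster pairs.
            sims = [pearson(profiles[a], profiles[b])
                    for a in clusters[i][1] for b in clusters[j][1]]
            score = sum(sims) / len(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical log-ratio profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
tree = cluster(profiles)
```

The resulting nested tuples play the role of the dendrogram: genes with similar profiles are joined first, and the two opposing groups are joined last.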
21
Guilt by association Genes exhibiting similar expression patterns are thought to be involved in common physiological processes Can be used to find potential regulatory sequences
22
K-means Clustering
Given a set of n data points in d-dimensional space and an integer k
We want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center
No exact polynomial-time algorithms are known for this problem
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
23
Euclidean Distance Now to find the distance between two points, say the origin and the point (3,4): Simple and Fast! Remember this when we consider the complexity!
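For the example on this slide, the distance from the origin to (3,4) is the square root of 3² + 4². A small sketch, with a hypothetical helper name:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((0, 0), (3, 4))
print(d)  # 5.0
```

The cost is one pass over the d coordinates, which is why it is cheap to evaluate many times inside a clustering loop.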
24
Finding a Centroid
We use the following equation to find the n-dimensional centroid of k n-dimensional points: centroid = (x_1 + x_2 + ... + x_k) / k, computed separately for each coordinate
Let’s find the centroid of three 2D points, say: (2,4) (5,2) (8,9)
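Working the three-point example through in code (a minimal sketch; the function name is an illustration only):

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-dimension points."""
    k = len(points)
    return tuple(sum(axis) / k for axis in zip(*points))

c = centroid([(2, 4), (5, 2), (8, 9)])
print(c)  # (5.0, 5.0)
```

Each coordinate is averaged independently: (2 + 5 + 8) / 3 = 5 and (4 + 2 + 9) / 3 = 5.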
25
K-means Algorithm
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster using only points within the cluster
4. Re-cluster all data using the new center points
   –This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until no data points move from one cluster to another in step 4, or some other convergence criterion is met
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
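The steps above can be sketched in plain Python as follows. This is an illustration, not the book's code: the function names and test points are invented, and the initial centers are drawn from the data points themselves, which is one of several possible ways to "choose randomly".

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, labels) for a list of equal-dimension points."""
    rng = random.Random(seed)
    # Step 1 (assumption: initial centers are sampled from the data).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its own points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return centers, labels

# Two well-separated, made-up groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
```

Changing `seed` changes the starting centers; on less cleanly separated data that can change the final clusters, which is the non-determinism discussed later in the deck.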
26
An example with k=2 1.We Pick k=2 centers at random 2.We cluster our data around these center points Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
27
K-means example with k=2 3.We recalculate centers based on our current clusters Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
28
K-means example with k=2 4.We re-cluster our data around our new center points Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
29
K-means example with k=2 5. We repeat the last two steps until no more data points are moved into a different cluster Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
30
Another example
Initially distribute codes randomly in pattern space
[Figure: codes k1, k2, k3 among the data points, axes X and Y]
31
Assign each point to the closest code
32
Move each code to the mean of all its assigned points
33
Repeat the process – reassign the data points to the codes
Q: Which points are reassigned?
34
re-compute cluster means
35
move cluster centers to cluster means
36
Advantages
Simple, understandable
Items automatically assigned to clusters
Disadvantages
Must pick number of clusters beforehand
All items forced into a cluster
Sensitive to outliers
Extensions
Adaptive k-means
K-medoids (based on a median-like representative point instead of the mean)
–1,2,3,4,100: average 22, median 3
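The outlier example from this slide can be checked directly with the standard library:

```python
import statistics

values = [1, 2, 3, 4, 100]
average = statistics.mean(values)
middle = statistics.median(values)
print(average, middle)
```

A single extreme value drags the mean to 22 while the median stays at 3, which is why median-like centers are less sensitive to outliers.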
37
Choosing k Use another clustering method Run algorithm on data with several different values of k Use advance knowledge about the characteristics of your test –Cancerous vs Non-Cancerous
38
Cluster Quality Since any data can be clustered, how do we know our clusters are meaningful? –The size (diameter) of the cluster vs. The inter-cluster distance –Distance between the members of a cluster and the cluster’s center –Diameter of the smallest sphere From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
39
Cluster Quality Continued
Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter
Figure labels: size=5, distance=20, distance=5
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
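The ratio can be sketched as below, using the numbers from the figure labels. Two assumptions are made here that the slide does not spell out: "diameter" is taken as the largest pairwise distance within the cluster, and "distance to nearest cluster" as the distance between centroids. All names and points are illustrative.

```python
import math
from itertools import combinations

def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def diameter(cluster):
    """Largest distance between any two members (assumed definition)."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))

def quality(cluster, others):
    """Distance to the nearest other cluster divided by own diameter."""
    own = centroid(cluster)
    nearest = min(math.dist(own, centroid(o)) for o in others)
    return nearest / diameter(cluster)

# Made-up clusters: diameter 5, neighbors at distance 20 and at distance 5.
a = [(0.0, 0.0), (3.0, 4.0)]
far = [(1.5, 22.0)]
near = [(6.5, 2.0)]
well_separated = quality(a, [far])   # 20 / 5
poorly_separated = quality(a, [near])  # 5 / 5
```

A larger ratio means the cluster is tight relative to its separation from its neighbors; a ratio near 1 means the nearest cluster is about one diameter away.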
40
Cluster Quality Continued Quality can be assessed simply by looking at the diameter of a cluster A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created. From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
41
Characteristics of k-means Clustering The random selection of initial center points creates the following properties –Non-Determinism –May produce clusters without patterns One solution is to choose the centers randomly from existing patterns From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
42
Kohonen Self Organizing Feature Maps (SOFM) Creates a map in which similar patterns are plotted next to each other Data visualization technique that reduces n dimensions and displays similarities More complex than k-means or hierarchical clustering, but more meaningful Neural Network Technique –Inspired by the brain From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
43
Armidale Animal Breeding Summer Course, UNE, Feb. 2006
Self-Organizing Maps (SOM)
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal
[Figure: genes G1-G29 and nodes N1-N6; N = Nodes, G = Genes]
45
Self-Organizing Maps (SOM)
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away the node is from N2, the less it is moved.
[Figure: genes G1-G29 and nodes N1-N6]
46
Self-Organizing Maps (SOM)
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: nodes N1-N6, each surrounded by its cluster of genes]
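Steps 1 to 5 can be sketched in plain Python as follows. This is an illustration under several assumptions that the slides leave open: a rectangular grid, a Gaussian neighborhood on the grid, linearly decaying learning rate and radius, and nodes initialized at randomly chosen genes. All names and data are hypothetical.

```python
import math
import random

def train_som(genes, rows, cols, n_iter=2000, seed=0):
    """Return node positions for a rows x cols rectangular SOM."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]  # step 1
    nodes = [list(rng.choice(genes)) for _ in grid]
    for t in range(n_iter):
        frac = 1.0 - t / n_iter              # shrinks from 1 towards 0
        rate = 0.5 * frac                    # step 4: allowed movement decays
        radius = 0.5 + frac * max(rows, cols) / 2
        gene = rng.choice(genes)             # step 2: pick a random gene
        winner = min(range(len(nodes)),
                     key=lambda i: math.dist(nodes[i], gene))
        for i, pos in enumerate(grid):       # step 3: move nodes towards it
            grid_d = math.dist(pos, grid[winner])
            # Nodes far from the winner on the grid move less.
            step = rate * math.exp(-grid_d ** 2 / (2 * radius ** 2))
            nodes[i] = [n + step * (g - n) for n, g in zip(nodes[i], gene)]
    return nodes

def assign_genes(genes, nodes):
    """Step 5: each gene joins the cluster of its closest node."""
    return [min(range(len(nodes)), key=lambda i: math.dist(nodes[i], g))
            for g in genes]

# Made-up 2D expression profiles forming two loose groups.
genes = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
nodes = train_som(genes, rows=1, cols=2)
labels = assign_genes(genes, nodes)
```

Unlike k-means, every sampled gene pulls the winner's grid neighbors as well, which is what makes neighboring nodes end up with similar patterns.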
47
Self-Organizing Maps (SOM) Situate grid of nodes along a plane where datapoints are distributed Perhaps a better view…
48
Self-Organizing Maps (SOM) Sample a gene and subject the closest node and neighboring nodes to its ‘gravitational’ influence Perhaps a better view…
49
Self-Organizing Maps (SOM) Perhaps a better view…
50
Self-Organizing Maps (SOM) Perhaps a better view… Sample another gene…
51
Self-Organizing Maps (SOM) Perhaps a better view… …and so on, and so on…
52
Self-Organizing Maps (SOM)
Perhaps a better view…
…until all genes have been sampled several times over.
Each cluster is defined with reference to a node: it comprises those genes for which that node is the closest.
53
Our Favorite Example With Yeast Reduce data set to 828 genes Clustered data into 30 clusters using a SOFM Each pattern is represented by its average (centroid) pattern Clustered data has same behavior Neighbors exhibit similar behavior
54
A SOFM Example With Yeast
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
55
Benefits of SOFM
SOFM contains the set of features extracted from the input patterns (reduces dimensions)
SOFM yields a set of clusters
A gene will be more similar to genes in its immediate neighborhood than to genes further away
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
56
Some freeware tools for microarray analysis
Indexed at Y.F. Leung’s Functional Genomics site: http://ihome.cuhk.edu.hk/~b400559/
MeV (TIGR) www.tigr.org
MAExplorer (NCI) www.lecb.ncifcrf.gov/MAExplorer/
Expression Profiler (EBI) http://ep.ebi.ac.uk/
Many of these tools require a Java Virtual Machine
57
Protein Interaction Tools and Techniques - Experimental Methods
58
Proteins Move in Pathways
59
Proteins Assemble
60
Proteins Interact
61
3D Structure Determination
X-ray crystallography
–grow crystal
–collect diffraction data
–calculate electron density
–trace chain
NMR spectroscopy
–label protein
–collect NMR spectra
–assign spectra & NOEs
–calculate structure using distance geometry
62
The Protein Fold Universe
How Big Is It???
500? 2000? 10000? ∞?
63
Structures in PDB
PDB: 19860 structures (Jan 03), 23997 structures (Jan 04)
“structural genomics” search: 156 structures (Jan 03), 478 structures (Jan 04)
64
Structural Proteomics
[Figure: number of Sequences vs. Structures, axis from 0 to 100000]
65
Unique folds in PDB
66
Protein Interaction Domains http://www.mshri.on.ca/pawson/domains.html
67
Protein Interaction Domains http://www.mshri.on.ca/pawson/domains.html
68
Yeast Two-Hybrid Analysis
Yeast two-hybrid experiments yield information on protein-protein interactions
Figure labels: GAL4 Binding Domain, GAL4 Activation Domain
X and Y are two proteins of interest
If X & Y interact then reporter gene is expressed
69
Affinity Pull-down
70
DNA vs Protein Chip Technology
DNA microtechnology
–Can successfully read 1000’s of side by side measurements of RNA levels
–BUT RNA ≠ protein = function
Protein Microarray Technology
–Goal: develop protein chip with proteins in active state.
Proteins more challenging to prepare than DNA/RNA
Protein functionality depends on state, modifications, binding partners, localization etc.
71
Arraying Process
72
Protein Chips Antibody Array Antigen Array Ligand Array Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever
73
Protein (Antigen) Chips
Figure labels: His6, GST, ORF, Nickel coating
H Zhu, J Klemic, S Chang, P Bertone, A Casamayor, K Klemic, D Smith, M Gerstein, M Reed, & M Snyder (2000). Analysis of yeast protein kinases using protein chips. Nature Genetics 26: 283-289
74
Protein (Antigen) Chips Nickel coating
75
Probe with anti-GST Mab Nickel coating
76
Anti-GST Probe
77
Probe with Cy3-labeled Calmodulin Nickel coating
78
“Functional” Protein Array Nickel coating
79
Rosetta Stone Method
80
Interologs, Homologs, Paralogs...
Homolog
–Common Ancestors
–Common 3D Structure
–Common Active Sites
Ortholog
–Derived from Speciation
Paralog
–Derived from Duplication
Interolog
–Protein-Protein Interaction
81
Finding Interologs
If A and B interact in organism X, and organism Y has a homolog of A (A’) and a homolog of B (B’), then A’ and B’ should interact too!
Makes use of BLAST searches against the entire proteome of well-studied organisms (yeast, E. coli)
Requires a list of known interacting partners
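The transfer rule on this slide amounts to mapping each known interacting pair through a homolog table. A minimal sketch follows; the protein names are placeholders, the function name is hypothetical, and in practice the homolog table would come from BLAST searches rather than being written by hand.

```python
def predict_interologs(known_interactions, homologs):
    """Predict interactions in organism Y from interactions in organism X.

    known_interactions: iterable of (A, B) pairs observed in organism X.
    homologs: dict mapping a protein in X to its homologs in Y.
    Returns a set of predicted pairs, each stored as a sorted tuple.
    """
    predicted = set()
    for a, b in known_interactions:
        for a_prime in homologs.get(a, []):
            for b_prime in homologs.get(b, []):
                predicted.add(tuple(sorted((a_prime, b_prime))))
    return predicted

# Placeholder data: A-B and C-D interact in X; D has no homolog in Y.
known = [("A", "B"), ("C", "D")]
homolog_table = {"A": ["A'"], "B": ["B'"], "C": ["C'"]}
predictions = predict_interologs(known, homolog_table)
```

A pair is only predicted when both partners have a homolog in the target organism, so the C-D interaction yields nothing here.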
82
A Flood of Data High throughput techniques are leading to more and more data on protein interactions This is where bioinformatics can play a key role Some suggest that this is the “future” for bioinformatics
83
Interaction Databases
BIND
–http://www.blueprint.org/bind/bind.php
DIP
–http://dip.doe-mbi.ucla.edu/
MINT
–http://mint.bio.uniroma2.it/mint/
PathCalling
–http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast
84
The BIND Database BIND - Biomolecular Interaction Network Database Conceived and Developed by Chris Hogue, Tony Pawson, Francis Ouellette Designed to capture almost all interactions between biomolecules (large and small) Largest database of its kind
85
BIND Can Encode... Simple binary interactions Enzymes, substrates and conformational changes Restriction enzymes Limited proteolysis Phosphorylation (reversible) Glycosylation Intron splicing Transcriptional factors
86
BIND
87
BIND Query Result click
88
BIND Details
89
click
90
BIND Details
91
DIP Database of Interacting Proteins http://dip.doe-mbi.ucla.edu/
92
DIP Query Page CGPC
93
DIP Results Page click
94
DIP Results Page
95
MINT Molecular Interaction Database http://mint.bio.uniroma2.it/mint/
96
MINT Results click
98
KEGG Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/kegg2.html
99
KEGG
101
TRANSPATH http://www.biobase.de/pages/products/transpath.html
102
BIOCARTA
www.biocarta.com
Go to “Pathways”
Web interactive links to many signaling pathways and other eukaryotic protein-protein interactions
105
Other Databases http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
106
Antigen Array (ELISA Chip) Mezzasoma et al. Clinical Chem. 48:121 (2002)
107
Protein Chips Antibody Array Antigen Array Ligand Array
108
Ciphergen “Ligand” Chips
Hydrophobic (C8) Arrays
Hydrophilic (SiO2) Arrays
Anion exchange Arrays
Cation exchange Arrays
Immobilized Metal Affinity (NTA, nitrilotriacetic acid) Arrays
Epoxy Surface (amine and thiol binding) Arrays
109
Ciphergen ProteinChip
110
Peptide/Protein Profile E. coli Salmonella
112
Mass spectroscopy offers immense precision and sensitivity in protein analysis
113
…and versatility
114
Trouble with 2D gels
Running gels in a reproducible manner is an “art”
Look at only fractions of proteins (i.e., difficult to resolve membrane proteins on 2D gels)
Result: Investigators are pursuing proteomes based solely on mass spec data
115
Challenges for Human Proteomics Small MW proteins (<10-12 kDa) Low Abundance proteins High MW Basic proteins Hydrophobic proteins
116
Problem exemplified:
Ran 24 x 2-D gels
2100 spots resolved
Only 250 spots common between all gels
Could draw conclusions on only 2% of visible spots
Note: 10% of genes make >50% of protein in living cells