Proteomics: Bioinformatics beyond sequences. Analyzing Global Gene Expression.

Proteomics: Bioinformatics beyond sequences

Analyzing Global Gene Expression

Microarray Data A “snapshot” of the amount of a particular gene being transcribed in a tissue. Measured for tens of thousands of genes. Use of multiple tissues on a single array allows for direct comparisons between tissues

Objectives of Microarray Studies Which genes are affected when exposed to a “treatment”? –Hit it with a stick and see what happens Given a “profile” of levels of expression for many genes, can the unknown “treatment” be predicted? –Tumor or disease classification Time course experiments allow the study of coregulation of genes, and for the reconstruction of regulatory networks

Microarray Technology Spotted arrays –Attach entire sequence of genes to the array –Create cDNA from a tissue (expressed genes) –Wash the pool of cDNAs over the array –Complementary sequences bind Oligonucleotide arrays (Affy chips) –Attach short (25bp) oligos instead of entire genes

The gene: GTTCGA.... mRNA: GUUCGA.... cDNA (via reverse transcription): CAAGCT....

Spotted arrays are usually treated with samples from two different tissues, each labeled with a different “color” of dye (red and green). [Figure labels: highly expressed in tissue A; highly expressed in tissue B]

Data Transformation Compute activation or repression by ratio of red/green control. However, there are discrepancies in interpreting repression vs. activation numbers: a fourfold activation gives a ratio of 4, while a fourfold repression gives 0.25. Solution: log transformation of the data –log10(4) ≈ 0.6 while log10(0.25) ≈ −0.6
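A minimal sketch of the log transformation described on this slide. The function name `log_ratio` and the choice of base 10 as the default are assumptions made for illustration; the slide only gives the two log10 values.

```python
import math

def log_ratio(red, green, base=10):
    """Return the log-transformed red/green intensity ratio."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log(red / green, base)

# A fourfold activation and a fourfold repression become symmetric
# around zero once the ratio is log-transformed.
up = log_ratio(4.0, 1.0)    # about  0.6
down = log_ratio(1.0, 4.0)  # about -0.6
```

After the transform, equal fold changes in either direction have equal magnitude and opposite sign, which is what makes them directly comparable.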

Microarray animation http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Many computational and statistical problems Image analysis (spot identification, background, etc.) Data management and pipelining “Normalization” of data Clustering coregulated genes Classifying tissue types Regulatory network inference Promoter identification (when combined with genomic sequence data)

Normalization [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

Normalization by iterative linear regression –fit a line (y = mx + b) to the data set –set aside outliers (residuals > 2 × s.e.) –repeat until r² changes by < 0.001 –then apply the slope and intercept to the original data set. D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
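The iterative fit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of Finkelstein et al.: the function names are hypothetical, the standard error is taken as the residual standard error with n − 2 degrees of freedom, and "apply slope and intercept" is interpreted as rescaling y so that the fitted trend becomes y = x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return m, b, r2

def iterative_fit(xs, ys, cutoff=2.0, tol=0.001, max_rounds=50):
    """Refit after setting aside outliers until r^2 stabilizes."""
    pts = list(zip(xs, ys))
    if len(pts) < 3:
        raise ValueError("need at least three points")
    m, b, r2 = fit_line(xs, ys)
    for _ in range(max_rounds):
        residuals = [y - (m * x + b) for x, y in pts]
        # Assumed definition of s.e.: residual standard error (n - 2 d.o.f.).
        se = (sum(r * r for r in residuals) / (len(pts) - 2)) ** 0.5
        kept = [p for p, r in zip(pts, residuals) if abs(r) <= cutoff * se]
        if len(kept) == len(pts) or len(kept) < 3:
            break  # nothing left to set aside, or too few points to refit
        pts = kept
        new_m, new_b, new_r2 = fit_line([p[0] for p in pts],
                                        [p[1] for p in pts])
        converged = abs(new_r2 - r2) < tol
        m, b, r2 = new_m, new_b, new_r2
        if converged:
            break
    return m, b

def normalize(ys, m, b):
    """Assumed reading of 'apply slope and intercept': map the fit onto y = x."""
    return [(y - b) / m for y in ys]
```

The outliers are excluded only while estimating the line; `normalize` is then applied to every original value, outliers included.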

Normalization (Linear) [Figure: scatter plot with axes Cy3 signal (log2) and Cy5 signal (log2)]

Clustering genes with similar expression patterns can be comparable to distance-based phylogenetic inference Compute a matrix of pairwise “profile similarity” scores between genes Use these scores in something like UPGMA Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868

Hierarchical Clustering Compare figure 5-4 to UPGMA, see any differences? Any data can be clustered, therefore we must be careful what conclusions we draw from our results. Clustering can be non-deterministic and, as such, can produce different results on different runs

Pearson correlation coefficient – one type of distance metric 1. Calculate the mean and standard deviation for the rows in question. 2. Subtract the appropriate mean from each value in a row and divide by the standard deviation to generate a normalized row of data. 3. Multiply corresponding values from each row and keep a running total. 4. Divide the total by the number of elements in the row to get the correlation coefficient.
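The steps above can be sketched as a short function. The name `pearson` is chosen here for illustration, and the population standard deviation (dividing by n) is assumed, since that is what makes the final "divide by the number of elements" step yield a value between −1 and 1.

```python
import math

def pearson(row_a, row_b):
    """Pearson correlation of two equal-length expression rows."""
    n = len(row_a)
    if n == 0 or n != len(row_b):
        raise ValueError("rows must be non-empty and of equal length")

    def standardize(row):
        mean = sum(row) / n
        # Population standard deviation (divide by n), an assumption here.
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
        if sd == 0:
            raise ValueError("a constant row has no defined correlation")
        return [(v - mean) / sd for v in row]

    za, zb = standardize(row_a), standardize(row_b)
    # Multiply corresponding values, keep a running total, divide by n.
    return sum(a * b for a, b in zip(za, zb)) / n

# Made-up ratios: the second row is the reciprocal of the first.
forward = [1.0, 2.0, 4.0, 8.0]
reciprocal = [1.0 / v for v in forward]
r_same = pearson([math.log2(v) for v in forward],
                 [math.log2(v) for v in forward])
r_opposite = pearson([math.log2(v) for v in forward],
                     [math.log2(v) for v in reciprocal])
```

With log-transformed values, identical patterns give a coefficient of 1 and reciprocal patterns give −1 (up to floating-point rounding).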

Merit of this coefficient For identical patterns, the value should be 1.0. For reciprocal patterns, the value should be −1.0. USE LOG TRANSFORMED DATA for computation of the Pearson coefficient. Used in clustering

Clustering genes Combine rows pairwise based on Pearson coefficients until all rows are accounted for. Eisen et al. 1998. Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863-14868

Guilt by association Genes exhibiting similar expression patterns are thought to be involved in common physiological processes Can be used to find potential regulatory sequences

K-means Clustering Given a set of n data points in d-dimensional space and an integer k, we want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. “A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

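Euclidean Distance The distance between two points is the square root of the sum of the squared differences of their coordinates. Now to find the distance between two points, say the origin and the point (3,4): sqrt(3² + 4²) = sqrt(25) = 5. Simple and fast! Remember this when we consider the complexity!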
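A small sketch of the distance calculation for points of any dimension; the function name `euclidean` is an illustrative choice.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The slide's example: the origin and the point (3, 4).
d = euclidean((0, 0), (3, 4))  # 5.0
```

The cost is one pass over the coordinates, which is why it stays cheap even when it is evaluated for every point against every center.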

Finding a Centroid The n-dimensional centroid of k n-dimensional points is their coordinate-wise mean: sum each coordinate over the k points and divide by k. Let’s find the centroid of three 2-D points, say (2,4), (5,2), (8,9): ((2+5+8)/3, (4+2+9)/3) = (5, 5)
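The centroid calculation can be sketched as a coordinate-wise mean; the function name `centroid` is an illustrative choice.

```python
def centroid(points):
    """Coordinate-wise mean of k points that share the same dimension."""
    k = len(points)
    if k == 0:
        raise ValueError("need at least one point")
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / k for i in range(dims))

# The slide's three 2-D points.
c = centroid([(2, 4), (5, 2), (8, 9)])  # (5.0, 5.0)
```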

K-means Algorithm 1. Choose k initial center points randomly. 2. Cluster the data using Euclidean distance (or another distance metric). 3. Calculate new center points for each cluster using only the points within the cluster. 4. Re-cluster all data using the new center points; this step could cause data points to be placed in a different cluster. 5. Repeat steps 3 and 4 until no data points move from one cluster to another in step 4, or some other convergence criterion is met. From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
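The algorithm above can be sketched in a few lines. This is an illustration, not the book's implementation: the function name, the `initial_centers` and `seed` parameters, and the choice to draw random starting centers from the data points themselves are all assumptions.

```python
import math
import random

def kmeans(points, k, initial_centers=None, seed=0, max_iter=100):
    """Return (centers, labels) after iterating until no point changes cluster."""
    rng = random.Random(seed)
    # Step 1: pick k starting centers (assumption: sampled from the data).
    centers = (list(initial_centers) if initial_centers is not None
               else rng.sample(points, k))
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Step 5: no point moved, so the clustering has converged.
        labels = new_labels
        # Step 3: move each center to the mean of its own cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centers, labels
```

Passing `initial_centers` makes a run repeatable; leaving it out and varying `seed` shows how different random starts can lead to different clusterings.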

An example with k=2 1. We pick k=2 centers at random 2. We cluster our data around these center points. Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

K-means example with k=2 3. We recalculate centers based on our current clusters. Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

K-means example with k=2 4. We re-cluster our data around our new center points. Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

K-means example with k=2 5. We repeat the last two steps until no more data points are moved into a different cluster. Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Another example [Figure: X-Y plot with three codes k1, k2, k3] Initially distribute the codes randomly in pattern space

[Figure: codes k1, k2, k3] Assign each point to the closest code

[Figure: X-Y plot] Move each code to the mean of all its assigned points

[Figure: X-Y plot] Repeat the process: reassign the data points to the codes. Q: Which points are reassigned?

[Figure: X-Y plot] Re-compute the cluster means

[Figure: X-Y plot] Move the cluster centers to the cluster means

Advantages –Simple, understandable –Items automatically assigned to clusters Disadvantages –Must pick the number of clusters beforehand –All items forced into a cluster –Sensitive to outliers Extensions –Adaptive k-means –K-medoids (based on median instead of mean): for 1, 2, 3, 4, 100 the average is 22 and the median is 3
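The outlier example on this slide can be checked directly. This sketch only reproduces the slide's arithmetic with Python's `statistics` module; it is not an implementation of k-medoids.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 100]

# A single outlier (100) drags the mean far from most of the data,
# while the median stays with the bulk of the values.
avg = mean(values)    # 22
mid = median(values)  # 3
```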

Choosing k Use another clustering method Run algorithm on data with several different values of k Use advance knowledge about the characteristics of your test –Cancerous vs Non-Cancerous

Cluster Quality Since any data can be clustered, how do we know our clusters are meaningful? –The size (diameter) of the cluster vs. the inter-cluster distance –Distance between the members of a cluster and the cluster’s center –Diameter of the smallest sphere From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Cluster Quality Continued Quality of a cluster is assessed by the ratio of the distance to the nearest cluster to the cluster diameter. [Figure labels: size = 5, distance = 20, distance = 5] Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
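One possible way to compute this ratio is sketched below. Several choices here are assumptions rather than the book's definitions: the diameter is taken as the largest pairwise distance within the cluster, and the distance to another cluster is measured between centroids. The function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def diameter(cluster):
    """Largest pairwise distance inside the cluster (0.0 for one point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)),
               default=0.0)

def separation_ratio(cluster, other_clusters):
    """Distance to the nearest other cluster divided by the cluster diameter."""
    size = diameter(cluster)
    if size == 0:
        raise ValueError("cluster diameter is zero")
    own = centroid(cluster)
    nearest = min(math.dist(own, centroid(o)) for o in other_clusters)
    return nearest / size

# Made-up points chosen to match the figure labels: diameter 5,
# with a neighbor at distance 20 and another at distance 5.
cluster = [(0.0, 0.0), (3.0, 4.0)]
well_separated = separation_ratio(cluster, [[(21.5, 2.0)]])  # 20 / 5
poorly_separated = separation_ratio(cluster, [[(6.5, 2.0)]])  # 5 / 5
```

A larger ratio means the cluster is compact relative to how far away its nearest neighbor is.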

Cluster Quality Continued Quality can be assessed simply by looking at the diameter of a cluster A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created. From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Characteristics of k-means Clustering The random selection of initial center points creates the following properties –Non-Determinism –May produce clusters without patterns One solution is to choose the centers randomly from existing patterns From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Kohonen Self Organizing Feature Maps (SOFM) Creates a map in which similar patterns are plotted next to each other Data visualization technique that reduces n dimensions and displays similarities More complex than k-means or hierarchical clustering, but more meaningful Neural Network Technique –Inspired by the brain From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Armidale Animal Breeding Summer Course, UNE, Feb. 2006 Self-Organizing Maps (SOM) 1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal. [Figure: genes G1 to G29 scattered in the plane with nodes N1 to N6; N = nodes, G = genes]


Self-Organizing Maps (SOM) 2. Choose a random gene, e.g., G9. 3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away a node is from N2, the less it is moved. [Figure: genes G1 to G29 and nodes N1 to N6]

Self-Organizing Maps (SOM) 4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased. 5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node. [Figure: nodes N1 to N6 settled among three groups of genes]
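Steps 1 to 5 can be sketched as follows. This is an illustrative toy version under several assumptions not stated on the slides: nodes sit on a rectangular grid, they start at randomly chosen gene profiles, the influence of a sampled gene falls off as a Gaussian of grid distance from the winning node, and both the learning rate and the neighborhood radius shrink linearly. All function and parameter names are hypothetical.

```python
import math
import random

def train_som(genes, rows, cols, iterations=2000, seed=0, start_rate=0.5):
    """Return the trained node positions for a rows x cols rectangular SOM."""
    rng = random.Random(seed)
    # Step 1: fix the node geometry (here a rectangular grid).
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: each node starts at a randomly chosen gene profile.
    nodes = [list(rng.choice(genes)) for _ in grid]
    start_radius = max(rows, cols) / 2
    for t in range(iterations):
        # Step 4: the allowed movement shrinks with every iteration.
        frac = 1.0 - t / iterations
        rate = start_rate * frac
        radius = max(start_radius * frac, 1e-9)
        # Step 2: choose a random gene.
        gene = rng.choice(genes)
        # Step 3: the closest node moves most, grid neighbors move less.
        winner = min(range(len(nodes)),
                     key=lambda i: math.dist(gene, nodes[i]))
        for i, pos in enumerate(grid):
            grid_dist = math.dist(pos, grid[winner])
            influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            step = rate * influence
            nodes[i] = [n + step * (g - n) for n, g in zip(nodes[i], gene)]
    return nodes

def assign(genes, nodes):
    """Step 5: each gene belongs to the cluster of its closest node."""
    return [min(range(len(nodes)), key=lambda i: math.dist(g, nodes[i]))
            for g in genes]
```

Because every update moves a node part of the way toward a real gene, the nodes never leave the region spanned by the data; the `seed` parameter makes a run repeatable.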

Self-Organizing Maps (SOM) Perhaps a better view… Situate a grid of nodes along a plane where the data points are distributed

Self-Organizing Maps (SOM) Perhaps a better view… Sample a gene and subject the closest node and its neighboring nodes to the gene’s ‘gravitational’ influence

Self-Organizing Maps (SOM) Perhaps a better view… Sample another gene…

Self-Organizing Maps (SOM) Perhaps a better view… …and so on, and so on…

Self-Organizing Maps (SOM) Perhaps a better view… …until all genes have been sampled several times over. Each cluster is defined with reference to a node: it comprises those genes for which that node is the closest node.

Our Favorite Example With Yeast Reduce data set to 828 genes Clustered data into 30 clusters using a SOFM Each pattern is represented by its average (centroid) pattern Clustered data has same behavior Neighbors exhibit similar behavior

A SOFM Example With Yeast “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.

Benefits of SOFM SOFM contains the set of features extracted from the input patterns (reduces dimensions). SOFM yields a set of clusters. A gene will always be more similar to a gene in its immediate neighborhood than to a gene further away. From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Some freeware tools for microarray analysis indexed at Y.F. Leung’s Functional Genomics site: http://ihome.cuhk.edu.hk/~b400559/ MeV (TIGR) www.tigr.org MAExplorer (NCI) www.lecb.ncifcrf.gov/MAExplorer/ Expression Profiler (EBI) http://ep.ebi.ac.uk/ many of these tools require a Java Virtual Machine

Protein Interaction Tools and Techniques - Experimental Methods

Proteins Move in Pathways

Proteins Assemble

Proteins Interact

3D Structure Determination X-ray crystallography –grow crystal –collect diffract. data –calculate e- density –trace chain NMR spectroscopy –label protein –collect NMR spectra –assign spectra & NOEs –calculate structure using distance geom.

The Protein Fold Universe How Big Is It??? 500? 2000? 10000? 8 ?

Structures in PDB PDB = 19860 structures Jan 03 PDB = 23997 structures Jan 04 “structural genomics” search = 156 structures Jan 03 search = 478 structures Jan 04

Structural Proteomics [Figure: number of sequences vs. number of structures; axis ticks from 0 to 100000]

Unique folds in PDB

Protein Interaction Domains http://www.mshri.on.ca/pawson/domains.html

Yeast Two-Hybrid Analysis Yeast two-hybrid experiments yield information on protein protein interactions GAL4 Binding Domain GAL4 Activation Domain X and Y are two proteins of interest If X & Y interact then reporter gene is expressed

Affinity Pull-down

DNA vs Protein Chip Technology DNA microtechnology –Can successfully read 1000’s of side by side measurements of RNA levels –BUT RNA ≠ protein = function Protein Microarray Technology –Goal: develop protein chip with proteins in active state. Proteins more challenging to prepare than DNA/RNA Protein functionality depends on state, modifications, binding partners, localization etc.

Arraying Process

Protein Chips Antibody Array Antigen Array Ligand Array Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever

Protein (Antigen) Chips [Figure labels: His6, GST, ORF, nickel coating] H Zhu, J Klemic, S Chang, P Bertone, A Casamayor, K Klemic, D Smith, M Gerstein, M Reed, & M Snyder (2000). Analysis of yeast protein kinases using protein chips. Nature Genetics 26: 283-289

Protein (Antigen) Chips Nickel coating

Probe with anti-GST Mab Nickel coating

Anti-GST Probe

Probe with Cy3-labeled Calmodulin Nickel coating

“Functional” Protein Array Nickel coating

Rosetta Stone Method

Interologs, Homologs, Paralogs... Homolog –Common Ancestors –Common 3D Structure –Common Active Sites Ortholog –Derived from Speciation Paralog –Derived from Duplication Interolog –Protein-Protein Interaction

Finding Interologs If A and B interact in organism X, then if organism Y has a homolog of A (A’) and a homolog of B (B’) then A’ and B’ should interact too! Makes use of BLAST searches against entire proteome of well-studied organisms (yeast, E. coli) Requires list of known interacting partners
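The inference rule above can be sketched as a simple lookup. This is an illustration only: the function name is hypothetical, the protein names are made up, and the `homologs` mapping is assumed to have been built beforehand (for example from BLAST hits), which this sketch does not do.

```python
def predict_interologs(known_interactions, homologs):
    """Predict interactions in organism Y from known interactions in organism X.

    known_interactions: iterable of (A, B) pairs known to interact in X.
    homologs: dict mapping each X protein to a list of its homologs in Y.
    """
    predicted = set()
    for a, b in known_interactions:
        for a_prime in homologs.get(a, ()):
            for b_prime in homologs.get(b, ()):
                # Store each pair in sorted order so (A', B') and (B', A')
                # count as the same predicted interaction.
                predicted.add(tuple(sorted((a_prime, b_prime))))
    return predicted

# Made-up example data.
known = [("A", "B"), ("A", "C")]
homolog_map = {"A": ["A1"], "B": ["B1", "B2"]}  # C has no homolog in Y
candidates = predict_interologs(known, homolog_map)
```

Pairs are only predicted when both partners have a homolog, so the interaction between A and C contributes nothing here. Predictions of this kind are hypotheses to be tested, not confirmed interactions.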

A Flood of Data High throughput techniques are leading to more and more data on protein interactions This is where bioinformatics can play a key role Some suggest that this is the “future” for bioinformatics

Interaction Databases BIND –http://www.blueprint.org/bind/bind.php DIP –http://dip.doe-mbi.ucla.edu/ MINT –http://mint.bio.uniroma2.it/mint/ PathCalling –http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

The BIND Database BIND - Biomolecular Interaction Network Database Conceived and Developed by Chris Hogue, Tony Pawson, Francis Ouellette Designed to capture almost all interactions between biomolecules (large and small) Largest database of its kind

BIND Can Encode... Simple binary interactions Enzymes, substrates and conformational changes Restriction enzymes Limited proteolysis Phosphorylation (reversible) Glycosylation Intron splicing Transcriptional factors

BIND Query Result click

BIND Details

DIP Database of Interacting Proteins http://dip.doe-mbi.ucla.edu/

DIP Query Page CGPC

DIP Results Page click

DIP Results Page

MINT Molecular Interaction Database http://mint.bio.uniroma2.it/mint/

MINT Results click

KEGG Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/kegg2.html

TRANSPATH http://www.biobase.de/pages/products/transpath.html

BIOCARTA www.biocarta.com Go to “Pathways” Web interactive links to many signaling pathways and other eukaryotic protein-protein interactions

Other Databases http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html

Antigen Array (ELISA Chip) Mezzasoma et al. Clinical Chem. 48:121 (2002)

Protein Chips Antibody Array Antigen Array Ligand Array

Ciphergen “Ligand” Chips Hydrophobic (C8) Arrays Hydrophilic (SiO2) Arrays Anion exchange Arrays Cation exchange Arrays Immobilized Metal Affinity (NTA, nitrilotriacetic acid) Arrays Epoxy Surface (amine and thiol binding) Arrays

Ciphergen ProteinChip

Peptide/Protein Profile E. coli Salmonella

Mass spectrometry offers immense precision and sensitivity in protein analysis

…and versatility

Trouble with 2D gels –Running gels in a reproducible manner is an “art” –They look at only a fraction of proteins (i.e., it is difficult to resolve membrane proteins on 2D gels) Result: Investigators are pursuing proteomes based solely on mass spec data

Challenges for Human Proteomics –Small MW proteins (<10-12 kDa) –Low abundance proteins –High MW basic proteins –Hydrophobic proteins

Problem exemplified: Ran 24 x 2-D gels 2100 spots resolved Only 250 spots common between all gels Could draw conclusions on only 2% of visible spots Note: 10% of genes make >50% of protein in living cells
