Galina Glazko and Arcady Mushegian

Generalized Expression Profiles Predict Protein Association With Merozoite Invasion
Galina Glazko and Arcady Mushegian

Outline
- Strategies for exploring expression profile space
- Y2, an algorithm for similarity search in the expression profile space
- Prediction of new invasion proteins using Y2
- Sequence-based validation of predictions
- Conclusions

Different strategies for searching expression profile space
Goal: find the similarities that are biologically significant.
- Time warping: align two differently sampled time series by dynamic programming (Aach and Church, 2001)
- Gene Expression Search Tool (GEST): match profiles based on a Bayesian similarity metric (Hunter et al., 2001)
- Y2: use a probabilistic model of related profiles as a query in searching profile space for other related profiles
All approaches are based on the idea of similarity search in sequence space (alignment, distance sensitivity and profile search).

Dynamic Time Warping (DTW)
Two sequences Q and C are similar, but out of phase.
Time series:
$Q = q_1, q_2, \ldots, q_i, \ldots, q_n$ (1)
$C = c_1, c_2, \ldots, c_j, \ldots, c_m$ (2)
Make an n x m matrix of distances $d(q_i, c_j) = (q_i - c_j)^2$.
Warping path W: a set of matrix elements that defines a mapping between Q and C, with $w_k = (i, j)_k$:
$W = w_1, w_2, \ldots, w_k, \ldots, w_K$, where $\max(m, n) \le K \le m + n - 1$
Objective: find the path that minimizes the warping cost
$DTW(Q, C) = \min\left\{ \sqrt{\sum_{k=1}^{K} w_k} \right\}$
This can be solved by dynamic programming (e.g. the Needleman and Wunsch approach), using the cumulative distance
$g(i, j) = d(q_i, c_j) + \min\{ g(i-1, j-1),\ g(i-1, j),\ g(i, j-1) \}$
Figure panels: to align Q and C, construct the matrix and search for the optimal warping path; the resulting alignment.
Keogh, E. 2002. Exact indexing of dynamic time warping. Proc. of 28th VLDB Conf.
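The recurrence for g(i,j) on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Keogh or by Aach and Church: the function name `dtw`, the 1-based path indices, and the tie-breaking rule in the traceback are choices made here for the example.

```python
import math


def dtw(q, c):
    """Return (DTW distance, warping path) for two numeric sequences.

    The path is a list of 1-based (i, j) pairs, matching the slide's notation.
    """
    n, m = len(q), len(c)
    inf = float("inf")
    # g[i][j] holds the minimal cumulative cost of a path ending at (i, j).
    # Row 0 and column 0 are padding so that the recurrence needs no special cases.
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2  # d(q_i, c_j)
            g[i][j] = d + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])

    # Trace back from (n, m), always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i, j))
        candidates = [
            (g[i - 1][j - 1], i - 1, j - 1),
            (g[i - 1][j], i - 1, j),
            (g[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return math.sqrt(g[n][m]), path
```

For example, `dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])` gives a distance of zero, because the second series is the first one delayed by one time point and the warping path absorbs the delay.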

DTW: application to expression profiles
Aach and Church, 2001: yeast cell cycle time series
- alpha factor: 18 time points, every 7 min
- cdc15: 24 time points
Standards: tightly regulated cell cycle genes with similar patterns in both series (MET, CLN2, CLB2)
Results:
- Time points in differently sampled time series are matched correctly
- A high time-warp alignment score can be the sole basis of pattern clustering
Figure: average trajectories of the MET, CLB2, and CLN2 clusters; corresponding time states are aligned.

Search for experiments that have similar gene expression levels
Idea: searching expression databases is like searching the sequence database (Hunter et al., 2001).
Figure: a query state, given as expression levels of genes g1, g2, g3, g4, …, gn-1, gn, is searched against database states state1, …, statek.
Purpose: identify previously unsuspected relationships among cellular states
Assumption: a well-chosen distance measure is a crucial factor for the accuracy of similarity search

Metric for similarity search in expression profile space. I
Purpose: “not all differences are treated equally in comparison of protein sequences… find a similarity metric for gene expression data that reflects a biological significance of the observed differences in a manner analogous to the substitution matrices in sequence searches” (Hunter et al., 2001)
Compare the performance of Euclidean Distance (ED), Correlation Similarity Distance (CS) and a new Bayesian Similarity Metric on a combined set of 92 yeast expression experiments: 73 from Spellman et al. (1998) + 9 from DeRisi et al. (1997) + 10 from Chu et al. (1998).

Metric for similarity search in expression profile space. II
- f(g1,…,gn) = f(G): joint probability distribution function of the true expression levels
- e(o1,…,on) = e(O,G): probability density for experimental errors given g1,…,gn
- Two likelihoods of the observed data are compared: one under the hypothesis that X and Y are observations of the same state, the other under the hypothesis that X and Y come from independent states
- The ratio of the two likelihoods is the Bayes factor for distinguishing between the two hypotheses
Conclusion: no statistically significant advantage of the Bayesian metric over ED and CS

Analogy with search in the sequence space
In standard database searches (BLAST): homologs are lost after identity drops below some threshold.
Position-Specific Iterative BLAST (PSI-BLAST): a position-specific scoring matrix (PSSM) is built from the highest-scoring hits and updated at each iteration. More homologs are found (although still not all of them).
Swindells, M., Inpharmatica Ltd

Why not identity, though?
Purpose: find all gene expression profiles similar to the query profile.
Simple solution (SS):
- compute pairwise distances (ED, CS or other) d(gq,gi), i=1,…,N, between the query gq and each database entry
- construct the distribution of distances d(gq,gi)
- use e.g. the closest top 5% to choose profiles significantly similar to gq
Figure: a query gene is compared with gene1, …, geneN; two example profiles, P1 and P2 (is P2 a noisy version of P1?).
BUT: where to put the similarity threshold? How to treat missing values? How to account for cases when P2 is a noisy realization of P1?
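The simple solution above amounts to ranking the database by distance to the query and keeping a fixed fraction. A minimal sketch, assuming Euclidean distance and profiles of equal length with no missing values; the names `simple_search` and `top_fraction` are hypothetical and introduced only for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_search(query, database, top_fraction=0.05):
    """Return the names of the profiles closest to the query.

    database maps a gene name to its expression profile. The closest
    top_fraction of entries is kept (at least one), nearest first.
    """
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]
```

The sketch makes the slide's objection concrete: the cut-off is a fixed fraction of the database, so it returns the same number of hits whether or not any profile is biologically related to the query.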

Why not identity, though?
Purpose: find all gene expression profiles similar to the query profile.
Figure: a query gene is compared with gene1, …, geneN.
General solution (GS): a strategy similar to PSI-BLAST, with a probabilistic model and iterative search:
- allow a more flexible search with more matches
- decrease noise in the gene expression data

Y2, an iterative algorithm for similarity search in the expression profile space
General purpose: find profiles similar not just to one query, but to a group of related profiles.
Iterative sequence similarity search relies on the position-specific scoring matrix (PSSM) or other probabilistic models.
Similarity search in the expression profile space: compare the query profile with the expression profiles of all other genes, and construct a condition-specific similarity matrix (CSSM). The CSSM is updated at each iteration of the search to account for new profiles exceeding the similarity threshold.

Preliminary steps of the Y2 algorithm: discretization, CSSM for the entire data set
Discretization: for every profile i:
- compute the range of its expression values, Emax_i and Emin_i
- fix the number of categories, K, and step = (Emax_i - Emin_i)/K
- transform the profile into a discrete vector
Choice of K: K minimizes the distortion between discretized seed profiles over 1<k<Kl, where 2<Kl<N, s is the number of profiles in the seed, and trx_i is the transformed ith profile.
CSSM construction for the entire data set {cij}: for each category at every time point, its frequency of appearance at this time point is computed.
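A rough sketch of the two preliminary steps follows. The slide's exact transformation formula and the distortion criterion for choosing K are not preserved in the transcript, so this example assumes plain equal-width binning and takes K as given; `discretize` and `category_frequencies` are hypothetical names, and constant profiles are mapped to category 0 by convention.

```python
def discretize(profile, k):
    """Map each expression value to one of k equal-width categories (0..k-1)."""
    lo, hi = min(profile), max(profile)
    step = (hi - lo) / k
    if step == 0:
        # Assumption: a flat profile carries no shape information.
        return [0] * len(profile)
    # The maximum value would fall into bin k, so it is clamped to k - 1.
    return [min(k - 1, int((x - lo) / step)) for x in profile]


def category_frequencies(discrete_profiles, k):
    """Return freq[category][time_point] over a set of discretized profiles."""
    n_points = len(discrete_profiles[0])
    counts = [[0] * n_points for _ in range(k)]
    for prof in discrete_profiles:
        for t, cat in enumerate(prof):
            counts[cat][t] += 1
    total = len(discrete_profiles)
    return [[c / total for c in row] for row in counts]
```

Because each profile is binned over its own range, the discrete vector captures the shape of the profile rather than its absolute expression level.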

Iterative steps
DEFINITION: if the correlation coefficient r between the query Q1 and the ith profile Pi satisfies |r(Q1,Pi)| > fixed_threshold, then (Q1, Pi) form a High-Scoring Pair (HSP).
STEP 1: Find all HSPs to form the seed for the iterative search.
STEP 2: Construct the weight matrix $w_{ij} = \log(q_{ij}/c_{ij})$, where the target frequencies $q_{ij}$ are computed for CSSMsubset from the set of HSPs and the background frequencies $c_{ij}$ from the entire set.
STEP 3: Search the profile space, matching CSSMsubset to each profile: $S(\mathrm{profile}) = \sum w_{ij}$. Construct the empirical distribution of S(profile). A profile is a new match when Pr(S_profile) < fixed_threshold.
STEP 4: Update the match list with the new profiles; update CSSMsubset.
STEP 5: The process converges when no new profiles are found at step 3.
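The five steps can be put together as the loop below. It is a sketch under several assumptions that the slide does not state: pseudocounts are added so that log(q/c) is always defined, Pr(S) is read as the upper-tail empirical p-value (the fraction of database profiles scoring at least as high), and the profiles are supplied both raw (for the correlation seed) and already discretized. The names `y2_search`, `r_min`, `p_max` and `pseudo` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def frequencies(discrete, k, pseudo=1.0):
    """freq[category][time_point], smoothed with pseudocounts (an assumption)."""
    n_points = len(discrete[0])
    counts = [[pseudo] * n_points for _ in range(k)]
    for prof in discrete:
        for t, cat in enumerate(prof):
            counts[cat][t] += 1
    total = len(discrete) + k * pseudo
    return [[c / total for c in row] for row in counts]


def y2_search(query, raw, discrete, k, r_min=0.9, p_max=0.05, max_iter=50):
    """Iterative profile search; returns sorted indices of matching profiles.

    raw and discrete are parallel lists: the measured profiles and their
    discretized versions with categories 0..k-1.
    """
    n_points = len(discrete[0])
    background = frequencies(discrete, k)  # c_ij, from the entire set
    # STEP 1: seed = all high-scoring pairs with the query.
    matches = {i for i, p in enumerate(raw) if abs(pearson(query, p)) > r_min}
    if not matches:
        return []
    for _ in range(max_iter):
        # STEP 2: log-odds weights from the current match set.
        target = frequencies([discrete[i] for i in sorted(matches)], k)  # q_ij
        w = [[math.log(target[c][t] / background[c][t]) for t in range(n_points)]
             for c in range(k)]
        # STEP 3: score every profile and compare with the empirical distribution.
        scores = [sum(w[cat][t] for t, cat in enumerate(p)) for p in discrete]
        new = set()
        for i, s in enumerate(scores):
            p_value = sum(1 for x in scores if x >= s) / len(scores)
            if p_value < p_max and i not in matches:
                new.add(i)
        # STEP 5: stop when no new profiles are found.
        if not new:
            break
        # STEP 4: extend the match list; the weights are rebuilt next round.
        matches |= new
    return sorted(matches)
```

With an upper-tail empirical p-value, the smallest attainable value is 1/N, so a threshold of 0.05 can only admit new matches when the database holds more than 20 profiles.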

Application of Y2 to Plasmodium. Overview
Purpose: predict new invasion proteins by the similarity of their expression profiles to those of proteins with an established role in invasion.
Approach: use the expression profile of each known invasion factor as a query in an iterative search of the expression profile database; identify all profiles similar to the query at a given level of significance.
Validation of candidates:
(1) comparison with Bozdech et al. (2003), where the “simple SS” approach is used
(2) sequence-based analysis of the identified proteins and their predicted functions

Parasite life cycle
Disease begins only once the asexual parasite multiplies in red blood cells (RBC).
Miller et al., 2002

Intraerythrocytic Developmental Cycle (IDC) of Plasmodium
The IDC starts with (1) merozoite invasion of the RBC and formation of the PV: ring stage; followed by (2) maturation: trophozoite stage; and (3) reinvasion: schizont stage.
Gene expression is stage-specific.
Figure panels: schizont-specific gene; trophozoite-specific DHFR-TS; relatively constant expression, adenylosuccinate lyase (Fig. 1 A, B, C, D, Bozdech et al., 2003).

Vaccine candidates
Antibodies to merozoite surface proteins block invasion and inhibit P. falciparum growth and multiplication => best vaccine candidates.
Figure label: schizont. Ramasamy et al., 2001
Expression profiles for invasion proteins: sharp induction at mid to late schizont.
The 7 best-known candidates (RESA1, MSP1, MSP3, MSP5, RAP1, AMA1, EBA175) were used in Bozdech et al. to find genes with expression profiles similar to one of the candidates (Simple Search approach).
Chosen candidates: top 5% of expression profiles ranked by increasing Euclidean distance; 262 ORFs, including 189 of unknown function.

Four steps of invasion
Invasion is an ordered process: 1) initial binding; 2) reorientation; 3) junction formation; 4) parasite entry
| Antigen | Stage | Step | Specific Domain/Motif | TM |
| --- | --- | --- | --- | --- |
| MSP-1; covers the entire surface of merozoite | Schizont | 1 | 2 epidermal growth factor (EGF) modules (EGF: protein/protein interaction module found in other proteins) | - |
| EBA-175 | Mid-late schizont | 3 | Receptor-binding domain DBL (Duffy-binding-like): Plasmodium-specific; cysteine-rich motif | + |
| Actin-myosin cytoskeletal components | | 4 | | |
Many invasion proteins have been identified; many remain to be found

Data set
(1) Collect all probes for RESA1, MSP1, MSP3, MSP5, RAP1, AMA1, EBA175; if there are multiple Affy probes, choose the one with the highest correlation to the other probes
(2) Search against the Quality Control data set (5077 probes)
Missing data and outliers (>3 s.d. from the profile mean) are replaced by the mean for this profile.
Figure: generalized expression profiles of the seven query proteins; all reach their maxima at the schizont stage.
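The cleaning rule for missing data and outliers might look like the sketch below. Two details are assumptions, since the slide does not specify them: missing values are represented as `None`, and the mean and standard deviation are computed once from all observed values (population standard deviation), outliers included. The function name `clean_profile` is hypothetical.

```python
import statistics


def clean_profile(profile, n_sd=3.0):
    """Replace missing values (None) and outliers with the profile mean.

    A value is an outlier when it lies more than n_sd standard deviations
    from the mean of the observed values.
    """
    observed = [x for x in profile if x is not None]
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    return [
        mean if x is None or abs(x - mean) > n_sd * sd else x
        for x in profile
    ]
```

Note that with this rule a single value can exceed 3 s.d. only in a profile with more than ten observed points, so short profiles are effectively cleaned of missing values only.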

Iterations
The first round of search collects only profiles with a correlation of 0.9 or higher.
Further iterations: expression profiles with a wider range of correlations.
Matches found by the Y2 algorithm show trends of gradual decrease of expression during the ring stage (1-12 hrs), minimal or no expression during the trophozoite stage (12-30 hrs), and gradual increase of expression during the schizont stage (30-48 hrs).
Figure panels: query RESA1, average correlation is 0.9 and higher; query RESA1, average correlation is 0.678.

Merozoite Invasion Candidates
Summary row (column headers as given: CORR./K, IT, MATCHES, SHARED, Y2-U, 5%ED-U): cor09_step15 4 196.4 409 187 10
Results of the Y2 search: 409 probes shared with the 5%ED approach; 187 unique, with an average expression peak at about 30 h (schizont stage).
Consider two sets of hypothetical proteins:
- found by the Y2 approach only (HPs1, 108)
- found by both the Y2 and 5%ED methods (HPs12, 152)

Structural properties of the two HP sets
Prediction method: PHDhtm (Rost et al., 1995)
| SET | Protein length | TM regions | %Loop regions | %Helical regions |
| --- | --- | --- | --- | --- |
| HPs1 | 888.39±8.33 | 1.17±0.02 | 91.08±0.13 | 8.92±0.13 |
| HPs12 | 929.08±6.85 | 1.28±0.02 | 94.09±0.07 | 5.85±0.07 |
40% of the 1st and 50% of the 2nd HP sets contain TM regions: do they localize on the cell surface?
25% of the 1st and 20% of the 2nd HP sets have > 1 TM region, some have several: membrane channels?

Functional properties of the two HP sets
Methods: blast, psiblast, rpsblast (after filtering for low-complexity regions)
| SET | Still HP after psi & rps blasts | Orthologs in P. yoelii |
| --- | --- | --- |
| HPs1 | 64% | 70% |
| HPs12 | 53% | 88% |
These proteins have parasite-specific functions and are crucial for parasite survival; 30% and 12% of them, respectively, are human-sp.
In both sets (HPs1, HPs12):
- Absent: housekeeping genes involved in gene expression, in intermediate metabolism, and in signal transduction from the cytoplasm to the nucleus
- Present: domains involved in lipid metabolism, synthesis and membrane remodeling; proteins with chaperone activity; components of the cytoskeleton and of secretory vesicles; multiple protein kinases and phosphatases

Differences between HP sets
Both approaches find structurally and functionally similar sets of proteins, but Y2 is more sensitive:
- finds ETRAMP proteins, expressed mostly at the early ring stage and located at the parasite-host cell interface
- finds proteins identified by MudPIT as parasite proteins on the surface of the infected erythrocyte (PIESPs, Florens et al., 2004)
Overall, HPs1 seems to contain more unknown parasite- and Plasmodium-specific proteins whose functions remain to be studied.

Conclusions
- The Y2 approach is sensitive and specific: starting with 7 profiles for antigens, it found previously identified as well as new similar profiles.
- The candidates' properties are in good agreement with those of earlier proposed invasion candidates.
- Y2 may be useful for protein functional annotation when annotation by transfer is impossible (no homologs).

Acknowledgments
Arcady Mushegian
Amy Ubben, Jie Chen, Mike Coleman, Malcolm Cook, Frank Emmert-Streib, Earl Glynn, Manisha Goel, Piotr Kozbial, Jing Liu