Galina Glazko and Arcady Mushegian
Generalized Expression Profiles Predict Protein Association With Merozoite Invasion
Outline
- Strategies for exploring expression profile space
- Y2, an algorithm for similarity search in the expression profile space
- Prediction of new invasion proteins using Y2
- Sequence-based prediction validation
- Conclusions
Different strategies to search the expression profile space
Goal: find the similarities that are biologically significant
- Time warping: align two differently sampled time series by dynamic programming (Aach and Church, 2001)
- Gene Expression Search Tool (GEST): match profiles based on a Bayesian similarity metric (Hunter et al., 2001)
- Y2: use a probabilistic model of related profiles as a query in searching the profile space for other related profiles
All approaches are based on the idea of similarity search in sequence space (alignment, distance sensitivity and profile search)
Dynamic Time Warping (DTW)
Two sequences Q and C are similar, but out of phase.
Time series: $Q = q_1, q_2, \ldots, q_i, \ldots, q_n$ (1) and $C = c_1, c_2, \ldots, c_j, \ldots, c_m$ (2)
Build an n x m matrix of distances $d(q_i, c_j) = (q_i - c_j)^2$.
Warping path W: a set of matrix elements that defines a mapping between Q and C, $w_k = (i, j)_k$, $W = w_1, w_2, \ldots, w_k, \ldots, w_K$, with $\max(m, n) \le K \le m + n - 1$.
Objective: find the path that minimizes the warping cost, $DTW(Q, C) = \min \sqrt{\sum_{k=1}^{K} w_k}$.
The minimum can be found by dynamic programming, e.g. the Needleman and Wunsch approach: $g(i, j) = d(q_i, c_j) + \min\{g(i-1, j-1),\ g(i-1, j),\ g(i, j-1)\}$.
To align Q and C, construct the matrix and search for the optimal warping path.
[Figure: the resulting alignment]
Keogh, E. 2002. Exact indexing of dynamic time warping. Proc. of 28th VLDB Conf.
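The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Keogh or by Aach and Church; the function name `dtw` and the choice to return the path as zero-based index pairs are assumptions made for the example.

```python
import math

def dtw(q, c):
    """Return (warping cost, warping path) for sequences q and c.

    g[i][j] holds the cumulative cost of the best path that ends by
    matching q[i-1] with c[j-1]; row 0 and column 0 are padding.
    """
    n, m = len(q), len(c)
    inf = float("inf")
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2          # d(q_i, c_j)
            g[i][j] = d + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])

    # Trace the optimal path back from the last cell to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (g[i - 1][j - 1], i - 1, j - 1),
            (g[i - 1][j], i - 1, j),
            (g[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return math.sqrt(g[n][m]), path


cost, path = dtw([0, 1, 2, 3], [0, 0, 1, 2, 3])
print(cost, path)
```

In the usage line, the second series repeats its first value, so the path maps the first point of Q to two points of C and the warping cost is zero.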
DTW: application to expression profiles
Aach and Church, 2001: yeast cell cycle time series
- alpha factor: 18 time points, every 7 min
- cdc15: 24 time points
Standards: tightly regulated cell cycle genes with similar patterns in both series (MET, CLN2, CLB2)
Result:
- Time points in differently sampled time series are matched correctly
- A high time-warp alignment score can be the sole basis of pattern clustering
[Figure: average trajectories of the MET, CLB2, and CLN2 clusters; corresponding time points are aligned]
Search for experiments that have similar gene expression levels
Idea: searching expression databases is like searching a sequence database (Hunter et al., 2001).
[Diagram: a query state with expression levels g1, g2, g3, g4, ..., gn-1, gn is searched against state1, ..., statek]
Purpose: identify previously unsuspected relationships among cellular states
Assumption: a crucial factor for the accuracy of similarity search is a well-chosen distance measure
Metric for similarity search in expression profile space. I
Purpose: “not all differences are treated equally in comparison of protein sequences… find a similarity metric for gene expression data that reflects a biological significance of the observed differences in a manner analogous to the substitution matrices in sequence searches” (Hunter et al., 2001)
Compare the performance of Euclidean Distance (ED), Correlation Similarity Distance (CS) and the new Bayesian Similarity Metric on a combined set of 92 yeast expression experiments: 73 from Spellman et al. (1998) + 9 from DeRisi et al. (1997) + 10 from Chu et al. (1998).
Metric for similarity search in expression profile space. II
f(g1,…,gn) = f(G): joint probability distribution function of true expression levels
e(o1,…,on) = e(O,G): probability density for experimental errors given g1,…,gn
If X and Y are observations of the same state, the likelihood of the observed data is computed with a single shared G; if X and Y are from independent states, the likelihood is the product of their separate likelihoods.
The ratio of the two likelihoods is the Bayes factor for distinguishing between the two hypotheses.
Conclusion: no statistically significant advantage of the Bayesian metric over ED and CS
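The two likelihoods and their ratio can be written out from the definitions of f(G) and e(O,G) on this slide. The block below is a sketch of that derivation under the assumption that the true levels G are integrated out; the symbols $L_{\text{same}}$, $L_{\text{indep}}$ and $B$ are labels introduced here, and the exact form used by Hunter et al. (2001) may differ.

```latex
% Same state: X and Y share one set of true expression levels G.
L_{\text{same}}(X, Y) = \int f(G)\, e(X, G)\, e(Y, G)\, dG

% Independent states: each observation has its own true levels.
L_{\text{indep}}(X, Y) = \left( \int f(G)\, e(X, G)\, dG \right)
                         \left( \int f(G')\, e(Y, G')\, dG' \right)

% Bayes factor for "same state" against "independent states".
B(X, Y) = \frac{L_{\text{same}}(X, Y)}{L_{\text{indep}}(X, Y)}
```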
Analogy with search in the sequence space
In standard database searches (BLAST): homologs are lost once identity drops below some threshold.
Position-Specific Iterative BLAST (PSI-BLAST): a position-specific scoring matrix (PSSM) is built from the highest-scoring hits and updated at each iteration. More homologs are found (although still not all of them).
Swindells, M. Inpharmatica Ltd
Why not identity, though?
Purpose: find all gene expression profiles similar to the query profile.
[Diagram: query gene compared with gene1, ..., geneN; profiles P1 and P2, where P2 may be a noisy version of P1]
Simple solution (SS):
- compute pairwise distances (ED, CS or other) d(gq,gi), i=1,…,N, between the query gq and each database entry
- construct the distribution of distances d(gq,gi)
- use e.g. the closest 5% to choose profiles significantly similar to gq
BUT: where to put the similarity threshold? How to treat missing values? How to account for cases when P2 is a noisy realization of P1?
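The simple solution can be sketched as a ranked nearest-neighbour search. The code below is an illustration only: it assumes Euclidean distance, complete profiles of equal length, and a database held as a dictionary from gene name to profile; the names `euclidean` and `simple_search` are made up for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_search(query, database, fraction=0.05):
    """Return the names of the closest `fraction` of database profiles.

    database maps a gene name to its expression profile. Profiles are
    ranked by increasing distance to the query and the top slice is kept.
    """
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    n_keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_keep]


# Toy database of 20 flat profiles at levels 0..19.
db = {f"g{i}": [float(i)] * 3 for i in range(20)}
print(simple_search([3.1, 3.1, 3.1], db))
```

The sketch makes the slide's objections concrete: the 5% cut-off is arbitrary, missing values have no defined distance, and a noisy copy of the query can fall just outside the retained slice.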
Why not identity, though?
Purpose: find all gene expression profiles similar to the query profile.
[Diagram: query gene compared with gene1, ..., geneN]
General solution (GS): a strategy similar to PSI-BLAST, with a probabilistic model and iterative search:
- allow a more flexible search with more matches
- decrease noise in the gene expression data
Y2, an iterative algorithm for similarity search in the expression profile space
General purpose: find profiles similar not just to one query, but to a group of related profiles
Iterative sequence similarity search relies on the position-specific scoring matrix (PSSM) or other probabilistic models
Similarity search in the expression profile space: compare the query profile with the expression profiles of all other genes, and construct a condition-specific similarity matrix (CSSM).
The CSSM is updated at each iteration of the search to account for new profiles exceeding the similarity threshold.
Preliminary steps of the Y2 algorithm: discretization, CSSM for the entire data set
Discretization: for every profile i:
- compute the range of its expression values, Emaxi and Emini
- fix the number of categories, K, and step = (Emaxi-Emini)/K
- transform the profile into a discrete vector
Choice of K: the value that minimizes the distortion between discretized seed profiles over 1<k<Kl, where 2<Kl<N, s is the number of profiles in the seed, and trxi is the transformed ith profile
CSSM construction for the entire data set {cij}: for each category at every time point, its frequency of appearance at this time point is computed.
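The two preliminary steps can be sketched as follows. The slide does not give the exact binning rule, so the code assumes equal-width bins of size step with the maximum value placed in the last category; the function names `discretize` and `category_frequencies` are hypothetical, and the choice of K by minimizing distortion is not implemented.

```python
from collections import Counter

def discretize(profile, k):
    """Map each expression value to a category 0..k-1 (equal-width bins).

    Assumption: bin index = floor((x - Emin) / step), with Emax clipped
    into the top category. A flat profile maps to category 0 throughout.
    """
    e_min, e_max = min(profile), max(profile)
    step = (e_max - e_min) / k
    if step == 0:
        return [0] * len(profile)
    return [min(int((x - e_min) / step), k - 1) for x in profile]

def category_frequencies(discrete_profiles, k):
    """Return freq[j][cat]: frequency of category cat at time point j."""
    n_profiles = len(discrete_profiles)
    n_points = len(discrete_profiles[0])
    freq = []
    for j in range(n_points):
        counts = Counter(p[j] for p in discrete_profiles)
        freq.append([counts[cat] / n_profiles for cat in range(k)])
    return freq


print(discretize([0, 1, 2, 3, 4], 4))
print(category_frequencies([[0, 1], [0, 0]], 2))
```

Because the range is computed per profile, discretization keeps the shape of each profile and discards its absolute scale.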
Iterative steps
DEFINITION: IF the correlation coefficient r() between the query Q1 and the ith profile Pi satisfies |r(Q1,Pi)| > fixed_threshold, THEN (Q1, Pi) form a High-Scoring Pair, HSP
STEP 1: Find all HSPs to form the seed for the iterative search.
STEP 2: Construct the weight matrix wij = log(qij/cij). Compute target frequencies qij for CSSMsubset from the set of HSPs, and background frequencies cij from the entire set.
STEP 3: Search the profile space, matching CSSMsubset to each profile: S(profile) = ∑ wij. Construct the empirical distribution of S(profile). A profile is a new match when Pr(Sprofile) < fixed_threshold.
STEP 4: Update the match list with the new profiles; update CSSMsubset.
STEP 5: The process converges when no new profiles are found at step 3.
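The five steps can be put together in a short sketch. It is not the authors' implementation: the pseudocount added to the frequencies, the definition of Pr(S) as the fraction of all profiles scoring at least S, and the names `y2_search`, `frequencies`, `r_min` and `p_max` are all assumptions made for illustration. Profiles are expected in two forms, continuous (`raw`) for the correlation seed and already discretized to categories 0..k-1 (`disc`) for scoring.

```python
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def frequencies(rows, k, pseudo):
    """Category frequencies per time point, smoothed by a pseudocount."""
    total = len(rows) + pseudo * k
    return [[(sum(1 for r in rows if r[j] == cat) + pseudo) / total
             for cat in range(k)]
            for j in range(len(rows[0]))]

def y2_search(query, raw, disc, k, r_min=0.9, p_max=0.05,
              pseudo=0.5, max_iter=50):
    """Return the set of indices of profiles matched to the query."""
    background = frequencies(disc, k, pseudo)                 # c_ij
    # Step 1: the seed is every profile forming an HSP with the query.
    matches = {i for i, p in enumerate(raw)
               if abs(pearson(query, p)) > r_min}
    for _ in range(max_iter):
        if not matches:
            break
        # Step 2: target frequencies q_ij and weights w_ij = log(q/c).
        target = frequencies([disc[i] for i in matches], k, pseudo)
        w = [[math.log(target[j][c] / background[j][c]) for c in range(k)]
             for j in range(len(target))]
        # Step 3: score every profile and build the empirical distribution.
        scores = [sum(w[j][c] for j, c in enumerate(row)) for row in disc]
        n = len(scores)
        new = set()
        for i, s in enumerate(scores):
            p_value = sum(1 for t in scores if t >= s) / n    # assumed Pr(S)
            if p_value < p_max and i not in matches:
                new.add(i)
        # Step 5: converged when step 3 yields nothing new.
        if not new:
            break
        # Step 4: extend the match list; the CSSM is rebuilt next round.
        matches |= new
    return matches
```

With the empirical definition of Pr(S) assumed here, the smallest attainable value is 1/N, so the threshold `p_max` only has an effect when the data set is much larger than 1/`p_max` profiles.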
Application of Y2 to Plasmodium. Overview
Purpose: predict new invasion proteins by similarity of their expression profiles to those of proteins with an established role in invasion.
Approach: use each expression profile of a known invasion factor as a query in an iterative search of the expression profile database; identify all profiles similar to the query at a given level of significance.
Validation of candidates:
(1) comparison with Bozdech et al. (2003), where the simple search (SS) approach is used
(2) sequence-based analysis of the identified proteins and their predicted functions
Parasite life cycle
Disease begins only once the asexual parasite multiplies in red blood cells (RBC)
Miller et al., 2002
Intraerythrocytic Developmental Cycle of Plasmodium (IDC)
The IDC starts with merozoite invasion of the RBC: (1) formation of the PV: ring stage; (2) maturation: trophozoite stage; (3) reinvasion: schizont stage.
Gene expression is stage-specific
[Figure panels: a schizont-specific gene; trophozoite-specific DHFR-TS; relatively constant expression, adenylosuccinate lyase]
Fig. 1 A, B, C, D, Bozdech et al., 2003
Vaccine candidates
Antibodies against merozoite surface proteins block invasion and inhibit P. falciparum growth and multiplication => best vaccine candidates (Ramasamy et al., 2001)
Expression profiles for invasion proteins: sharp induction at mid to late schizont
The 7 best-known candidates (RESA1, MSP1, MSP3, MSP5, RAP1, AMA1, EBA175) were used in Bozdech et al. to find genes with expression profiles similar to one of the candidates (Simple Search approach).
Chosen candidates: top 5% of expression profiles ranked by increasing Euclidean distance; 262 ORFs, including 189 of unknown function.
Four steps of invasion
Invasion is an ordered process: 1) initial binding; 2) reorientation; 3) junction formation; 4) parasite entry
Antigen | Stage | Step | Specific Domain/Motif | TM
MSP-1; covers the entire surface of the merozoite | Schizont | 1 | 2 epidermal growth factor (EGF) modules (EGF: pr/pr interaction module found in other proteins) | -
EBA-175 | Mid-late schizont | 3 | Receptor-binding domain DBL (Duffy-binding-like): Plasmodium-specific; cysteine-rich motif | +
Actin-myosin cytoskeletal components | | 4 | |
Many invasion proteins have been identified; many remain to be found
Data set
(1) Collect all probes for RESA1, MSP1, MSP3, MSP5, RAP1, AMA1, EBA175; if there are multiple Affy probes, choose the one with the highest correlation to the other probes
(2) Search against the Quality Control data set (5077 probes)
Missing data and outliers (> 3 s.d. from the profile mean) are replaced by the mean for this profile.
[Figure: generalized expression profiles of the seven query proteins; all reach their maxima at the schizont stage]
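The replacement rule for missing data and outliers can be sketched as below. Several details are assumptions because the slide does not state them: missing values are represented as None or NaN, the mean and standard deviation are computed once from all observed values (outliers included), and the population standard deviation is used. The name `clean_profile` is hypothetical.

```python
import math
import statistics

def clean_profile(profile, n_sd=3.0):
    """Replace missing values and outliers by the profile mean.

    A value is an outlier when it lies more than n_sd standard
    deviations from the mean of the observed values.
    """
    def is_missing(x):
        return x is None or math.isnan(x)

    observed = [x for x in profile if not is_missing(x)]
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)

    cleaned = []
    for x in profile:
        if is_missing(x):
            cleaned.append(mean)
        elif sd > 0 and abs(x - mean) > n_sd * sd:
            cleaned.append(mean)
        else:
            cleaned.append(x)
    return cleaned


print(clean_profile([1.0, None, 3.0]))
```

With a population standard deviation, a single value among n can deviate by at most sqrt(n - 1) standard deviations, so the 3 s.d. rule can only trigger on profiles with more than ten observed time points.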
Iterations
The first round of the search collects only profiles with a correlation of 0.9 or higher
Further iterations: expression profiles with a wider range of correlations
[Figure: query RESA1, average correlation is 0.9 and higher]
[Figure: query RESA1, average correlation is 0.678]
Matches found by the Y2 algorithm show a gradual decrease of expression during the ring stage, 1-12 hrs; minimal or no expression during the trophozoite stage, 12-30 hrs; and a gradual increase of expression during the schizont stage, 30-48 hrs
Merozoite Invasion Candidates
CORR./K | IT | MATCHES | SHARED | Y2-U | 5%ED-U
cor09_step15 | 4 | 196.4 | 409 | 187 | 10
Results of the Y2 search: 409 probes shared with the 5%ED approach; 187 unique, with an average expression peak at about 30 h (schizont stage).
Consider two sets of hypothetical proteins:
- found by the Y2 approach only (HPs1, 108)
- found by both the Y2 and 5%ED methods (HPs12, 152)
Structural properties of the two HP sets
Prediction method: PHDhtm (Rost et al., 1995)
SET | Protein length | TM regions | %Loop regions | %Helical regions
HPs1 | 888.39±8.33 | 1.17±0.02 | 91.08±0.13 | 8.92±0.13
HPs12 | 929.08±6.85 | 1.28±0.02 | 94.09±0.07 | 5.85±0.07
40% of the 1st and 50% of the 2nd HP set contain TM regions: do they localize on the cell surface?
25% of the 1st and 20% of the 2nd HP set have > 1 TM region, and some have several: membrane channels?
Functional properties of the two HP sets
Methods: blast, psiblast, rpsblast (after filtering for low-complexity regions)
SET | Still HP after psiblast and rpsblast | Orthologs in P. yoelii
HPs1 | 64% | 70%
HPs12 | 53% | 88%
These proteins have parasite-specific functions and are crucial for parasite survival; 30% and 12% of them are human-sp.
In both sets, HPs1 and HPs12:
- Absent: housekeeping genes involved in gene expression, in intermediate metabolism, and in signal transduction from the cytoplasm to the nucleus.
- Present: domains involved in lipid metabolism, synthesis and membrane remodeling; proteins with chaperone activity; components of the cytoskeleton and of secretory vesicles; multiple protein kinases and phosphatases
Differences between the HP sets
Both approaches find structurally and functionally similar sets of proteins, but Y2 is more sensitive:
- it finds ETRAMP proteins, expressed mostly at the early ring stage and located at the parasite-host cell interface
- it finds proteins identified by MudPIT as parasite proteins on the surface of the infected erythrocyte (PIESPs, Florens et al., 2004)
Overall, HPs1 seems to contain more unknown parasite- and Plasmodium-specific proteins whose functions remain to be studied
Conclusions
The Y2 approach is sensitive and specific: starting with 7 antigen profiles, it found previously identified as well as new similar profiles.
The candidates' properties are in good agreement with those of previously proposed invasion candidates.
Y2 may be useful for protein functional annotation when annotation by transfer is impossible (no homologs)
Acknowledgments
Arcady Mushegian
Amy Ubben, Jie Chen, Mike Coleman, Malcolm Cook, Frank Emmert-Streib, Earl Glynn, Manisha Goel, Piotr Kozbial, Jing Liu
[Picture: Stowers Institute]