Presentation on theme: "Reliable Integration Strategy of PPI databases JWH 2009 / 6 / 19."— Presentation transcript:
Reliable Integration Strategy of PPI databases JWH 2009 / 6 / 19
Contents
Introduction
Reason for Recent Problem
– Low Prediction Accuracy in newly published PPI data
Confidence Leveling Strategies
– In terms of Interaction Type and Evaluation Methods
Control
– In terms of Domain Appearances
Evaluation
Conclusion & Work to do
Introduction
PPI integration
– Means merging heterogeneous PPI databases into a single data source.
– There is an evident need to integrate multiple sources.
– Technical problems exist.
It is essential and important because
– Different distributors serve different interests.
– Most machine-learning-based PPI prediction methods (e.g., PreSPI) are highly sensitive to the choice of training set.
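A minimal sketch of what merging several sources into a single data source could look like. It assumes each source has already been reduced to pairs of protein identifiers in one shared namespace (the slides do not describe ID mapping). The function name `merge_ppi_sources` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def merge_ppi_sources(sources):
    """Merge PPI sources into one mapping: pair -> set of source names.

    `sources` maps a source name (e.g. "DIP") to an iterable of (a, b) pairs.
    Pairs are treated as unordered, so (a, b) and (b, a) are the same PPI.
    """
    merged = defaultdict(set)
    for name, pairs in sources.items():
        for a, b in pairs:
            key = tuple(sorted((a, b)))  # canonical, order-independent key
            merged[key].add(name)
    return dict(merged)


# Hypothetical toy input, not real database content.
toy_sources = {
    "DIP": [("P1", "P2"), ("P2", "P3")],
    "MINT": [("P2", "P1")],
    "IntAct": [("P3", "P4")],
}
merged = merge_ppi_sources(toy_sources)
```

Keeping the set of contributing sources per pair preserves provenance, which is useful later when a confidence level depends on where a pair was reported.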
Introduction
Meanwhile, in PreSPI:
– Only the Database of Interacting Proteins (DIP) was used as the PPI source.
– PPI type was not considered.
– Its domain source, InterPro, has a redundancy problem.
[Figure: Domain A, Domain B, Domain C]
Introduction
Meanwhile, in PreSPI:
– Only the Database of Interacting Proteins (DIP) was used as the PPI source. (+ MINT, IntAct)
– PPI type was not considered. (PSI-MI)
– Its domain source, InterPro, has a redundancy problem. (Pfam-A)
[Figure: Domain A, Domain B, Domain C]
Recent Problem
Reason for the recent low prediction accuracy
– About 33% of PPIs overlap.
– The test set may contain exactly the same PPIs that exist in the learning set.
– Prediction accuracy decreases to 52% sensitivity and 94% specificity.
[Figure: IntAct + DIP, Integrated DB]
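One way to detect the overlap described above is to compare test pairs against the learning set with an order-independent key. This is a sketch under the assumption that both sets use the same protein identifiers; `canonical` and `split_overlap` are hypothetical helper names, not part of PreSPI.

```python
def canonical(pair):
    """Return the pair with its two identifiers in sorted order."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def split_overlap(train_pairs, test_pairs):
    """Split test pairs into (unseen, seen) relative to the training pairs."""
    train = {canonical(p) for p in train_pairs}
    unseen, seen = [], []
    for p in test_pairs:
        (seen if canonical(p) in train else unseen).append(p)
    return unseen, seen


# Hypothetical toy data.
train = [("P1", "P2"), ("P3", "P4")]
test = [("P2", "P1"), ("P5", "P6")]
unseen, seen = split_overlap(train, test)
overlap_ratio = len(seen) / len(test)
```

Evaluating only on `unseen` avoids scoring the predictor on pairs it has already been trained on.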
Recent Problem
Domain distribution analysis for each DB, and results of the domain distribution analysis after the update
NEW1: DIP ∪ MINT ∪ IntAct (no pre-processing)
NEW2: DIP ∪ MINT ∪ IntAct (no colocalization)
NEW3: DIP ∪ MINT ∪ IntAct (no colocalization, no association)
                                      OLD      NEW1     NEW2     NEW3
# of PPI pairs                        50085    65902    65794    52529
# of PPI pairs (domains are known)    35829    45385    45307    34574
Portion of available PPI pairs        71.5%    68.9%    68.9%    65.8%
# of proteins                         5720     6001     5998     5947
# of proteins (domains are known)     3949     4188     4186     4134
Avg. # of domains for one protein     1.45     1.43     1.43     1.44
Sensitivity                           63.4%    50.68%   50.23%   49.82%
Recent Problem
Domain distribution analysis for each DB, and results of the domain distribution analysis after the update
NEW1: DIP ∪ MINT ∪ IntAct (no pre-processing)
NEW2: DIP ∪ MINT ∪ IntAct (no colocalization)
NEW3: DIP ∪ MINT ∪ IntAct (no colocalization, no association)
                                      OLD      NEW1     NEW2     NEW3
# of PPI pairs                        42483    65902    65794    52529
# of PPI pairs (domains are known)    30870    45385    45307    34574
Portion of available PPI pairs        72.7%    68.9%    68.9%    65.8%
# of proteins                         5720     6001     5998     5947
# of proteins (domains are known)     3949     4188     4186     4134
Avg. # of domains for one protein     1.45     1.43     1.43     1.44
Sensitivity                           52.0%    50.68%   50.23%   49.82%
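The "portion of available PPI pairs" row is consistent with dividing the number of pairs whose domains are known by the total number of pairs. A quick check using the counts from the table (the helper name `portion` is hypothetical):

```python
def portion(known, total):
    """Percentage of PPI pairs whose domains are known, to one decimal."""
    return round(100.0 * known / total, 1)


# (pairs with known domains, all pairs), taken from the table above.
counts = {
    "OLD": (30870, 42483),
    "NEW1": (45385, 65902),
    "NEW2": (45307, 65794),
    "NEW3": (34574, 52529),
}
portions = {name: portion(known, total) for name, (known, total) in counts.items()}
```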
Confidence Leveling Strategy
Control
– Detected Interaction Type
– Evaluation Method
– Domain Appearances: when both proteins in a binary PPI contain only rarely occurring domains, the pair harms prediction accuracy.
Interaction Type Control: PSI-MI Ontology Tree
– Association (MI:0914): Molecules that are experimentally shown to be associated, potentially by sharing just one interactor. Associated molecules are often co-purified by a pull-down or coimmunoprecipitation and share the same bait molecule.
– Physical association (MI:0915): Molecules that are experimentally shown to belong to the same functional or structural complex.
– Direct interaction (MI:0407): Interaction that is proven to involve only its interactors.
– Physical interaction (MI:0218): Interaction among molecules that can be direct or indirect. OBSOLETE: split into "association; MI:0914" and "physical association; MI:0915". For remapping, consider the experimental setting of the interaction. For bulk remapping, a possible criterion is that any physical interaction with a bait among its participants becomes "association; MI:0914", and the others become "physical association; MI:0915". Two-hybrid interactions are an exception and can be "physical association; MI:0915".
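The bulk-remapping rule quoted for the obsolete MI:0218 term can be sketched as a small function. The inputs `has_bait` and `is_two_hybrid` are assumed to have been extracted from each record beforehand, and the function name is hypothetical.

```python
ASSOCIATION = "MI:0914"
PHYSICAL_ASSOCIATION = "MI:0915"
OBSOLETE_PHYSICAL_INTERACTION = "MI:0218"


def remap_interaction_type(type_id, has_bait, is_two_hybrid):
    """Remap the obsolete MI:0218 term; leave every other term unchanged."""
    if type_id != OBSOLETE_PHYSICAL_INTERACTION:
        return type_id
    if is_two_hybrid:
        # Two-hybrid interactions are the stated exception.
        return PHYSICAL_ASSOCIATION
    # A bait among the participants means "association", otherwise
    # "physical association".
    return ASSOCIATION if has_bait else PHYSICAL_ASSOCIATION
```

After remapping, a filter such as NEW3 (no association) could drop records whose type is `MI:0914`.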
Evaluation Method (DIP)
Confidence Score
DIP evidence examples:
– dip:0005(high throughput): non-core
– dip:0005(high throughput)|dip:0005(high throughput): core
– dip:0002(small scale): core
– dip:0004(small scale): core
Evidence                                   Confidence
High throughput only (non-core)            1
High throughput multi (no small scale)     2
Small scale                                3
One small scale + high throughput          4
Two or more small scale                    5
Confidence   Count
1            47255
2            1417
3            4282
4            517
5            2728
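The DIP leveling rule can be sketched as follows, assuming the evidence field has the `|`-separated form shown on the slide and that each item is labeled "small scale" or "high throughput". The function names are hypothetical, and treating "two or more small scale plus high throughput" as level 5 is an assumption the slide does not spell out.

```python
def count_evidence(field):
    """Count (small-scale, high-throughput) items in a '|'-separated field."""
    items = [x.strip() for x in field.split("|") if x.strip()]
    small = sum("small scale" in x for x in items)
    high = sum("high throughput" in x for x in items)
    return small, high


def dip_confidence(n_small_scale, n_high_throughput):
    """Map evidence counts to the 1-5 confidence level from the slide."""
    if n_small_scale >= 2:
        return 5  # assumption: extra high-throughput evidence does not matter
    if n_small_scale == 1:
        return 4 if n_high_throughput >= 1 else 3
    if n_high_throughput >= 2:
        return 2
    if n_high_throughput == 1:
        return 1
    raise ValueError("no recognized evidence")


level = dip_confidence(*count_evidence(
    "dip:0005(high throughput)|dip:0005(high throughput)"))
```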
Evaluation Method (MINT)
Confidence Score
MINT:
– mint-score 0.0 ~ 1.0, experimental knowledge based
– no mint score
– average mint-score of "MI:0018, two hybrid": 0.32
mint-score     Confidence
0.0 ~ 0.32     1
0.32 ~ 0.49    2
0.49 ~ 0.66    3
0.66 ~ 0.83    4
0.83 ~         5
Confidence   Count
1            84077
2            7214
3            2517
4            1156
5            292
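The score-to-level binning can be sketched with a sorted list of cut points. The slide does not say which side of each boundary is inclusive, so this sketch assumes a score equal to a cut point falls into the higher level. It also leaves records without a score unassigned, because the slide does not state how they are handled. The function name is hypothetical.

```python
from bisect import bisect_right

# Cut points between levels 1|2, 2|3, 3|4 and 4|5, taken from the slide.
MINT_CUTS = [0.32, 0.49, 0.66, 0.83]


def mint_confidence(score):
    """Map a mint-score in [0, 1] to a level 1-5; None stays None."""
    if score is None:
        return None  # assumption: handling of missing scores is left open
    if not 0.0 <= score <= 1.0:
        raise ValueError("mint-score must lie in [0, 1]")
    return bisect_right(MINT_CUTS, score) + 1
```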
Evaluation Method (IntAct)
Confidence Score
IntAct:
– Any child of "transcriptional complementation assay": high-throughput
– Confidence levels are assigned in the same way as for DIP.
Confidence   Count
1            69167
2            3285
3            65083
4            1543
5            14843
Domain Appearances
Restrict PPIs that contain domains appearing very rarely across all protein interactions.
[Figure: proteins P1 and P2 with domains D1, D2, D3, D4]
– If the appearance frequencies of D1, D2, D3 and D4 are all smaller than a threshold, the PPI is removed manually.
– How can we decide the threshold?
– 2116 PPI pairs have the PF00012 domain, while only one pair has PF00887.
– 137 domains appear in only one pair in the PPI databases.
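A sketch of the domain-appearance filter described above: count in how many PPI pairs each domain appears, then drop a pair when every domain on both proteins falls below the threshold. The function names and toy identifiers are hypothetical, and keeping pairs that have no known domains is an assumption.

```python
from collections import Counter


def pair_domains(pair, domains_of):
    """Union of the domains of both proteins in a pair."""
    a, b = pair
    return set(domains_of.get(a, ())) | set(domains_of.get(b, ()))


def domain_pair_counts(ppis, domains_of):
    """Number of PPI pairs in which each domain appears."""
    counts = Counter()
    for pair in ppis:
        counts.update(pair_domains(pair, domains_of))
    return counts


def filter_rare_domain_ppis(ppis, domains_of, threshold):
    """Keep pairs unless all of their domains appear in < threshold pairs."""
    counts = domain_pair_counts(ppis, domains_of)
    kept = []
    for pair in ppis:
        doms = pair_domains(pair, domains_of)
        if doms and all(counts[d] < threshold for d in doms):
            continue  # every domain is rare: remove this PPI
        kept.append(pair)
    return kept


# Hypothetical toy data.
ppis = [("P1", "P2"), ("P1", "P3"), ("P4", "P5")]
domains_of = {"P1": ["D1"], "P2": ["D1"], "P3": ["D1"], "P4": ["D2"], "P5": ["D2"]}
kept = filter_rare_domain_ppis(ppis, domains_of, threshold=2)
```

Running the filter over a range of thresholds and comparing the resulting prediction accuracy is one possible way to approach the open question of how to choose the threshold.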
Conclusion
– The PPI databases distributed by different research groups are heterogeneous, and the overlap among them is very small.
– PPI prediction methods based on machine learning are very sensitive to the choice of training set.
– The quality of an integrated PPI database can be managed through confidence leveling.