Reliable Integration Strategy of PPI databases JWH 2009 / 6 / 19.


1 Reliable Integration Strategy of PPI databases JWH 2009 / 6 / 19

2 Contents
Introduction
Reason for Recent Problem
– Low prediction accuracy on newly published PPI data
Confidence Leveling Strategies
– In terms of interaction type and evaluation method control
– In terms of domain appearances
Evaluation
Conclusion & Work to do

3 Introduction
PPI integration
– Means merging heterogeneous PPI databases into a single data source.
– There is an evident need to integrate multiple sources.
– Technical problems exist.
It is essential because
– Different distributors maintain databases for different interests.
– Most machine-learning-based PPI prediction methods (e.g., PreSPI) are highly sensitive to the choice of training set.

4 Introduction
Meanwhile, in PreSPI,
– Only the Database of Interacting Proteins (DIP) was used as the PPI source.
– PPI type was not considered.
– Its domain source, InterPro, has a redundancy problem, such as:
[Figure: Domain A, Domain B, Domain C]

5 Introduction
Meanwhile, in PreSPI,
– Only the Database of Interacting Proteins (DIP) was used as the PPI source. (slide annotation: + MINT, IntAct)
– PPI type was not considered. (slide annotation: PSI-MI)
– Its domain source, InterPro, has a redundancy problem, such as: (slide annotation: Pfam-A)
[Figure: Domain A, Domain B, Domain C]

6 Recent Problem
Reason for the recent low prediction accuracy
– About 33% of the PPIs overlap.
– The test set may contain exactly the same PPIs that exist in the learning set.
– Prediction accuracy decreases to 52% sensitivity and 94% specificity.
[Figure: IntAct + DIP, Integrated DB]
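The overlap and train/test leakage described on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the protein identifiers, and the choice of the union of both databases as the denominator for the overlap fraction are all assumptions, and it presumes both databases already use the same protein identifier namespace.

```python
def pair_key(a, b):
    """Order-independent key, so (A, B) and (B, A) count as the same PPI."""
    return tuple(sorted((a, b)))


def overlap_fraction(db_a, db_b):
    """Share of the union of two PPI lists that occurs in both (assumed definition)."""
    keys_a = {pair_key(a, b) for a, b in db_a}
    keys_b = {pair_key(a, b) for a, b in db_b}
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def remove_leakage(train, test):
    """Drop test pairs that also occur in the training set."""
    train_keys = {pair_key(a, b) for a, b in train}
    return [(a, b) for a, b in test if pair_key(a, b) not in train_keys]


# Hypothetical identifiers, for illustration only.
dip = [("P1", "P2"), ("P3", "P4")]
intact = [("P2", "P1"), ("P5", "P6")]
shared = overlap_fraction(dip, intact)
clean_test = remove_leakage(dip, intact)
```

Canonicalizing each pair before comparison matters because binary interactions are unordered; without it, the same PPI recorded in opposite orientation by two databases would be treated as two distinct pairs.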

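7 Recent Problem
Domain distribution analysis for each DB, and results of the domain distribution analysis after the update
NEW1: DIP U MINT U IntAct (no pre-processing)
NEW2: DIP U MINT U IntAct (no colocalization)
NEW3: DIP U MINT U IntAct (no colocalization, no association)

                                        OLD     NEW1    NEW2    NEW3
# of PPI pairs                          (values not in transcript)
# of PPI pairs (domains are known)      (values not in transcript)
Portion of available PPI pairs          71.5%, 68.9%, 65.8% (only three values survive for the four columns)
# of proteins                           (values not in transcript)
# of proteins (domains are known)       (values not in transcript)
Avg. # of domains for one protein       (values not in transcript)
Sensitivity                             63.4%   50.68%  50.23%  49.82%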

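8 Recent Problem
Domain distribution analysis for each DB, and results of the domain distribution analysis after the update
NEW1: DIP U MINT U IntAct (no pre-processing)
NEW2: DIP U MINT U IntAct (no colocalization)
NEW3: DIP U MINT U IntAct (no colocalization, no association)

                                        OLD     NEW1    NEW2    NEW3
# of PPI pairs                          (values not in transcript)
# of PPI pairs (domains are known)      (values not in transcript)
Portion of available PPI pairs          72.7%, 68.9%, 65.8% (only three values survive for the four columns)
# of proteins                           (values not in transcript)
# of proteins (domains are known)       (values not in transcript)
Avg. # of domains for one protein       (values not in transcript)
Sensitivity                             52.0%   50.68%  50.23%  49.82%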

9 Confidence Leveling Strategy
Control
– Detected interaction type
– Evaluation method
Domain appearances
– When both proteins in a binary PPI contain only rarely appearing domains, the pair harms prediction accuracy.

10 PSI-MI Ontology Tree (Interaction Type Control)
– Association (MI:0914): Molecules that are experimentally shown to be associated, potentially by sharing just one interactor. Often associated molecules are co-purified by a pull-down or coimmunoprecipitation and share the same bait molecule.
– Physical association (MI:0915): Molecules that are experimentally shown to belong to the same functional or structural complex.
– Direct interaction (MI:0407): Interaction that is proven to involve only its interactors.
– Physical interaction (MI:0218): Interaction among molecules that can be direct or indirect. OBSOLETE: split into “association; MI:0914” and “physical association; MI:0915”. For remapping, consider the experimental setting of the interaction. For bulk remapping, a possible criterion is that any physical interaction that has a bait among its participants becomes “association; MI:0914”, and the others become “physical association; MI:0915”. Two hybrid interactions are an exception and can be “physical association; MI:0915”.
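The bulk remapping rule quoted above, together with the type-based filtering used for the NEW2 and NEW3 sets, could look roughly like the sketch below. The record layout (`type`, `method`, `has_bait` keys) and the function names are hypothetical, the colocalization identifier MI:0403 does not appear on the slides and is an assumption, and two hybrid is matched by the exact ID MI:0018 only, without walking the ontology.

```python
ASSOCIATION = "MI:0914"
PHYSICAL_ASSOCIATION = "MI:0915"
PHYSICAL_INTERACTION = "MI:0218"   # obsolete term that needs remapping
TWO_HYBRID = "MI:0018"
COLOCALIZATION = "MI:0403"         # assumed ID, not given on the slides


def remap_interaction_type(type_id, method_id, has_bait):
    """Apply the bulk remapping rule to an obsolete MI:0218 annotation."""
    if type_id != PHYSICAL_INTERACTION:
        return type_id                     # nothing to remap
    if method_id == TWO_HYBRID:
        return PHYSICAL_ASSOCIATION        # the two hybrid exception
    return ASSOCIATION if has_bait else PHYSICAL_ASSOCIATION


def drop_types(records, excluded):
    """Keep only records whose (remapped) interaction type is not excluded."""
    kept = []
    for rec in records:
        new_type = remap_interaction_type(rec["type"], rec["method"], rec["has_bait"])
        if new_type not in excluded:
            kept.append({**rec, "type": new_type})
    return kept


# Hypothetical records, for illustration only.
records = [
    {"pair": ("P1", "P2"), "type": "MI:0218", "method": "MI:0018", "has_bait": True},
    {"pair": ("P3", "P4"), "type": "MI:0218", "method": "other", "has_bait": True},
    {"pair": ("P5", "P6"), "type": "MI:0407", "method": "other", "has_bait": False},
]
new3_like = drop_types(records, {COLOCALIZATION, ASSOCIATION})
```

Matching on exact identifiers means that excluding MI:0914 does not also exclude records annotated as MI:0915, which seems to be the intent of the "no association" filter, although the slides do not state this explicitly.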

11

12 Evaluation Method (DIP)
Confidence score
DIP:
– dip:0005 (high throughput) → non-core
– dip:0005 (high throughput) | dip:0005 (high throughput) → core
– dip:0002 (small scale) → core
– dip:0004 (small scale) → core
Confidence levels:
– High throughput only → 1 → non-core
– Multiple high throughput (no small scale) → 2
– Small scale → 3
– One small scale + high throughput → 4
– Two or more small scale → 5
[Table: Confidence, Count (values not in transcript)]
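The five DIP confidence levels listed above can be written as a small function. This is a sketch under assumptions: the function name and the representation of evidence as a list of scale labels are hypothetical, and combinations the slide does not spell out (for example, two small-scale experiments plus high-throughput evidence) are resolved here by checking the small-scale count first.

```python
def dip_confidence_level(evidence):
    """Map a PPI's evidence labels to a confidence level from 1 to 5.

    `evidence` is a list of strings, each either "high throughput"
    or "small scale" (one entry per supporting experiment).
    """
    small = sum(1 for e in evidence if e == "small scale")
    high = sum(1 for e in evidence if e == "high throughput")
    if small >= 2:
        return 5                        # two or more small scale
    if small == 1:
        return 4 if high >= 1 else 3    # small scale with or without high throughput
    if high >= 2:
        return 2                        # multiple high throughput, no small scale
    if high == 1:
        return 1                        # single high throughput only
    raise ValueError("no recognised evidence labels")


levels = [
    dip_confidence_level(["high throughput"]),
    dip_confidence_level(["high throughput", "high throughput"]),
    dip_confidence_level(["small scale"]),
    dip_confidence_level(["small scale", "high throughput"]),
    dip_confidence_level(["small scale", "small scale"]),
]
```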

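13 Evaluation Method (MINT)
Confidence score
MINT:
– mint-score → 0.0 ~ 1.0
– Experimental knowledge based → no mint-score
– Average mint-score of “MI:0018, two hybrid” → (value not in transcript)
Confidence levels: score ranges ending at 0.32, 0.49, 0.66 and 0.83, followed by a final range that maps to 5 (the range starts and the other level labels are missing from the transcript)
[Table: Confidence, Count (values not in transcript)]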
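A binning of mint-scores into confidence levels might be sketched as below. Only the cut points 0.32, 0.49, 0.66, 0.83 and the top level 5 come from the slide; the assignment of levels 1 to 4 in ascending order, the inclusive upper bounds, and the handling of interactions without a mint-score are assumptions made for illustration.

```python
from bisect import bisect_left

# Upper bounds of the first four score ranges, taken from the slide.
MINT_CUTS = [0.32, 0.49, 0.66, 0.83]


def mint_confidence_level(score):
    """Map a mint-score in [0.0, 1.0] to an assumed confidence level 1..5.

    Returns None when no mint-score is available; the slide mentions such
    cases but the transcript does not preserve how they were scored.
    """
    if score is None:
        return None
    if not 0.0 <= score <= 1.0:
        raise ValueError("mint-score must lie between 0.0 and 1.0")
    # bisect_left treats each cut as an inclusive upper bound (assumption).
    return bisect_left(MINT_CUTS, score) + 1


example_levels = [mint_confidence_level(s) for s in (0.10, 0.32, 0.50, 0.70, 0.90)]
```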

14 Evaluation Method (IntAct)
Confidence score
IntAct:
– Any child of transcriptional complementation assay → high throughput
– Confidence levels are then assigned in the same way as for DIP.
[Table: Confidence, Count (values not in transcript)]
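The IntAct rule above amounts to an ancestor lookup in the detection-method ontology. The sketch below uses a tiny hand-written excerpt with term names instead of the real PSI-MI file, so the parent map, the function names, and the default of labelling every other method as small scale are all assumptions. The real ontology is a graph in which a term may have several parents, which this single-parent map does not model.

```python
ROOT = "transcriptional complementation assay"

# Illustrative child -> parent excerpt, not the full PSI-MI ontology.
PARENT = {
    "two hybrid": ROOT,
    "two hybrid array": "two hybrid",
    "pull down": "affinity chromatography technology",
}


def is_descendant(term, ancestor, parent=PARENT):
    """Walk up the parent chain and report whether `ancestor` is reached."""
    seen = set()
    current = parent.get(term)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)          # guards against accidental cycles
        current = parent.get(current)
    return False


def intact_scale(method):
    """Label a detection method as high throughput or small scale."""
    if method == ROOT or is_descendant(method, ROOT):
        return "high throughput"
    return "small scale"           # assumed default for all other methods


scales = [intact_scale(m) for m in ("two hybrid array", "two hybrid", "pull down")]
```

The resulting labels have the same form as the DIP evidence labels, so the same five-level scheme can then be applied to IntAct records.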

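15 Domain Appearances
PPIs containing domains that appear very rarely across all protein interactions are restricted.
[Figure: proteins P1 and P2 with domains D1, D2, D3, D4]
– If the appearance frequencies of D1, D2, D3 and D4 are all smaller than the threshold, this PPI is removed manually.
– How can we decide the threshold?
– 2116 PPI pairs have the PF00012 domain, while only one pair has PF (accession lost in transcript).
– (count lost in transcript) domains appear in only one pair in the PPI databases.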
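The rare-domain filter described on this slide can be sketched as follows. The slide says removal was done manually, so this automated version is only an illustration: the function names and data layout are hypothetical, a domain's appearance count is taken to be the number of PPI pairs in which it occurs on either protein, and pairs with no known domains are simply kept.

```python
from collections import Counter


def domain_pair_counts(ppis, domains):
    """Count, for each domain, the number of PPI pairs it appears in."""
    counts = Counter()
    for p1, p2 in ppis:
        counts.update(domains.get(p1, set()) | domains.get(p2, set()))
    return counts


def filter_rare_domain_ppis(ppis, domains, threshold):
    """Drop pairs whose domains all appear in fewer than `threshold` pairs."""
    counts = domain_pair_counts(ppis, domains)
    kept = []
    for p1, p2 in ppis:
        pair_domains = domains.get(p1, set()) | domains.get(p2, set())
        all_rare = bool(pair_domains) and all(
            counts[d] < threshold for d in pair_domains
        )
        if not all_rare:
            kept.append((p1, p2))
    return kept


# Hypothetical proteins and domains, named after the labels on the slide.
domains = {"A": {"D1"}, "B": {"D1"}, "C": {"D1", "D2"}, "X": {"D3"}, "Y": {"D4"}}
ppis = [("A", "B"), ("A", "C"), ("B", "C"), ("X", "Y")]
counts = domain_pair_counts(ppis, domains)
kept = filter_rare_domain_ppis(ppis, domains, threshold=2)
```

In this toy data D3 and D4 each occur in a single pair, so the pair (X, Y) is removed at a threshold of 2, while every pair that contains the frequent domain D1 is kept.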

16 Domain Appearances
For all PPI pairs (sensitivity, specificity):
– Threshold = 10 (209): 52.1%, 94.3%
– Threshold = 20 (584): 53.5%, 94.4%
– Threshold = 30 (1172): 54.3%, 94.3%

Appearances   Count
<10           209
<20           584
<30           1172
<40           (not in transcript)
<50           1611

17 Conclusion
– The PPI databases distributed by different research groups are heterogeneous, and the overlap among them is very small.
– PPI prediction methods based on machine learning are very sensitive to the choice of training set.
– We can manage the quality of the integrated PPI database through confidence leveling.

