University of Texas at Austin Machine Learning Group Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions.


1 University of Texas at Austin Machine Learning Group
Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions
Razvan C. Bunescu, Raymond J. Mooney
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin
{razvan, mooney}@cs.utexas.edu
Arun K. Ramani, Edward M. Marcotte
Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics, University of Texas at Austin
{arun, marcotte}@icmb.utexas.edu

2 Outline
• Introduction & Motivation.
• Two benchmark tests of accuracy.
• Framework for the extraction of interactions.
• Future Work.
• Conclusions.

3 Introduction
• Large-scale protein networks facilitate a better understanding of the interactions between proteins.
  • Coverage is most complete for yeast; progress for human has been minimal.
• Most known interactions between human proteins are reported in Medline.
• Reactome, BIND, HPRD: databases with protein interactions manually curated from Medline.
Example: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

4 Motivation
• Many interactions from Medline are not covered by current databases.
• Databases are generally biased toward different classes of interactions.
• Manually extracting interactions is a very laborious process.
Aim: Automatically identify pairs of interacting proteins with high accuracy.

5 Outline
• Introduction & Motivation.
• Two benchmark tests of accuracy.
  • Functional Annotation.
  • Physical Interaction.
• Framework for the extraction of interactions.
• Future Work.
• Conclusions.

6 Accuracy Benchmarks: Shared Functional Annotations
• The accuracy of an interaction dataset correlates well with the percentage of interaction partners sharing functional annotations.
• A functional annotation is a pathway between the two proteins in a particular ontology:
  • KEGG: 55 pathways at the lowest level.
  • GO: 1,356 pathways at level 8 of the biological process annotation.

7 Accuracy Benchmarks: Shared Known Physical Interactions
• Assumption: accurate datasets are more enriched in pairs of proteins known to participate in a physical interaction.
• Reactome and BIND are more accurate than the others, so they are used as the source of known physical interactions.
• Total: 11,425 interactions between 1,710 proteins.

8 Accuracy Benchmarks: LLR Scoring Scheme
• Use the log-likelihood ratio (LLR) of protein pairs with respect to:
  • Sharing functional annotations.
  • Physically interacting.
• LLR = ln [ P(D|I) / P(D|¬I) ], where P(D|I) and P(D|¬I) are the probabilities of observing the data D conditioned on the proteins sharing (I) or not sharing (¬I) benchmark associations.
• Higher LLR values indicate higher accuracy.
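The LLR score above can be sketched in a few lines. This is a minimal illustration, assuming the two conditional probabilities are estimated as simple fractions of benchmark pairs; the function name and count-based estimates are hypothetical, not the authors' code.

```python
import math

def llr(hits_in_benchmark, total_benchmark_pairs,
        hits_in_background, total_background_pairs):
    """Log-likelihood ratio of a dataset D: ln[ P(D|I) / P(D|~I) ].

    P(D|I)  is estimated as the fraction of benchmark-linked (I) pairs in D,
    P(D|~I) as the fraction of non-linked (~I) pairs in D.
    (Illustrative estimates; the slides derive D from the two benchmarks.)
    """
    p_d_given_i = hits_in_benchmark / total_benchmark_pairs
    p_d_given_not_i = hits_in_background / total_background_pairs
    return math.log(p_d_given_i / p_d_given_not_i)
```

A dataset that recovers benchmark pairs four times more often than non-benchmark pairs gets LLR = ln 4 ≈ 1.39; higher is better, matching the slide.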

9 Outline
• Introduction & Motivation.
• Two benchmark tests of accuracy.
• Framework for the extraction of interactions.
• Future Work.
• Conclusions.

10 Framework for Interaction Extraction
Pipeline: Medline abstract → Protein Extraction → Medline abstract (proteins tagged) → Interaction Extraction → Interactions Database
• Extensive comparative experiments in [Bunescu et al. 2005]:
  • Protein Extraction: Maximum Entropy tagger.
  • Interaction Extraction: ELCS (Extraction using Longest Common Subsequences).
• The current framework aims to improve on the previous approach on a much larger scale (750K Medline abstracts).

11 Framework for Interaction Extraction
1) [Protein Extraction] Identify protein names using a Conditional Random Fields (CRF) tagger [Lafferty et al. 2001] trained on a dataset of 750 Medline abstracts, manually tagged for proteins.
2) [Interaction Extraction] Keeping the most confident extractions, detect which pairs of proteins are interacting. Two methods:
  2.1) Co-citation analysis (document level).
  2.2) Learning of interaction extractors (sentence level).

12 1) A CRF tagger for protein names
• Protein extraction is a sequence tagging task, where each word is assigned a tag from: O(-utside), B(-egin), C(-ontinue), E(-nd), U(-nique).
  O  O  O  O  O  O  B  E  O  O  O  O  O
  In synchronized human osteosarcoma cells , cyclin D1 is induced in early G1
• The input text is first preprocessed:
  • Tokenized.
  • Split into sentences (Ratnaparkhi's MXTerminator).
  • Tagged with part-of-speech (POS) tags (Brill's tagger).
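The O/B/C/E/U scheme above is easy to pin down with a small helper that converts annotated protein spans into per-token tags. This is an illustrative sketch (function and argument names are assumed, not from the slides):

```python
def obceu_tags(tokens, protein_spans):
    """Tag each token with O/B/C/E/U given protein name spans.

    protein_spans: list of (start, end) token indices, end exclusive.
    Single-token names get U(-nique); multi-token names get
    B(-egin) ... C(-ontinue) ... E(-nd); everything else is O(-utside).
    """
    tags = ["O"] * len(tokens)
    for start, end in protein_spans:
        if end - start == 1:
            tags[start] = "U"                  # one-token protein name
        else:
            tags[start] = "B"
            tags[end - 1] = "E"
            for i in range(start + 1, end - 1):
                tags[i] = "C"
    return tags
```

On the slide's example sentence, the two-token name "cyclin D1" yields the B E pair shown, with every other token tagged O.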

13 1) A CRF tagger for protein names
• Each token position in a sentence is associated with a vector of binary features based on the (current tag, previous tag) combination and observed values such as:
  • Words before, after, or at the current position.
  • Their POS tags and capitalization patterns.
  • A binary flag set to true if the word is part of a protein dictionary.
Example (words with their POS tags):
  IN VBN JJ NN NNS , NN NNP VBZ VBN IN JJ
  In synchronized human osteosarcoma cells , cyclin D1 is induced in early
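The observation features listed above can be sketched as a per-position feature extractor. A hedged illustration, assuming a simple "word shape" encoding for capitalization patterns (upper → A, lower → a, digit → 0); all names here are hypothetical:

```python
def token_features(tokens, pos_tags, i, protein_dict):
    """Sketch of observation features at position i for a CRF tagger."""
    def shape(w):
        # capitalization pattern, e.g. 'D1' -> 'A0', 'cyclin' -> 'aaaaaa'
        return "".join("A" if c.isupper() else
                       "a" if c.islower() else
                       "0" if c.isdigit() else c for c in w)
    return {
        "word": tokens[i],
        "word-1": tokens[i - 1] if i > 0 else "<s>",
        "word+1": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "pos": pos_tags[i],
        "pos-1": pos_tags[i - 1] if i > 0 else "<s>",
        "pos+1": pos_tags[i + 1] if i + 1 < len(tokens) else "</s>",
        "shape": shape(tokens[i]),
        "in-dict": tokens[i] in protein_dict,  # protein dictionary flag
    }
```

In a CRF, each (feature value, current tag, previous tag) combination becomes one binary feature with a learned weight.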

14 1) A CRF tagger for protein names
• The CRF model is trained on 750 Medline abstracts manually annotated for proteins.
• Experimentally, CRFs give better performance than Maximum Entropy models: they allow local tagging decisions to compete against each other in a global sentence model.
• The model is used for tagging a large set (750K) of Medline abstracts citing the word 'human'.
• Each extracted protein is assigned a normalized confidence value.
• For the Interaction Extraction step, we keep only proteins scoring 0.8 or better.

15 2.1) Interaction Extraction using Co-citation Analysis
• Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins.
• Compute the probability of co-citation under a random model (hypergeometric distribution):
  P(k | N, n, m) = C(n, k) C(N - n, m - k) / C(N, m)
  N: total number of abstracts (750K)
  n: abstracts citing the first protein
  m: abstracts citing the second protein
  k: abstracts citing both proteins
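The hypergeometric random model can be computed directly from binomial coefficients. A minimal sketch using the variable names defined on the slide (the function name is assumed; the slides do not show code):

```python
from math import comb

def cocitation_prob(k, N, n, m):
    """P(exactly k co-citing abstracts) under the random model:
    a hypergeometric distribution,
        P(k | N, n, m) = C(n, k) * C(N - n, m - k) / C(N, m)
    N: total abstracts, n: abstracts citing protein 1,
    m: abstracts citing protein 2, k: abstracts citing both."""
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)
```

A pair whose observed k is improbably large under this model (e.g. a small upper-tail sum over k' ≥ k) is flagged as a likely interaction, per the next slide.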

16 2.1) Interaction Extraction using Co-citation Analysis
• Protein pairs which co-occur in a large number of abstracts (high k) are assigned a low probability under the random model.
• Empirically, protein pairs whose observed co-citation rate is given a low probability under the random model score high on the functional annotation benchmark.
• Result: close to 15K extracted interactions that score comparably to or better than HPRD on the functional annotation benchmark.

17 2.1) Co-citation Analysis with Bayesian Reranking
1. Use a trained Naïve Bayes model to measure the likelihood that an abstract discusses physical protein interactions.
2. For a given pair of proteins, compute the average score of the co-citing abstracts.
3. Use the average score to re-rank the 15K already extracted pairs.
Pipeline: Medline abstract → CRF tagger → Medline abstract (proteins tagged) → Co-citation Analysis → Ranked Interactions; Naïve Bayes scores → Re-ranked Interactions
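Steps 2 and 3 above reduce to averaging per-abstract classifier scores and re-sorting. A sketch under the assumption that the Naïve Bayes model has already scored each abstract (data-structure names are hypothetical):

```python
def rerank(pairs, abstract_scores, cocited):
    """Re-rank co-cited protein pairs by the average Naive Bayes score
    of their co-citing abstracts (illustrative sketch).

    pairs:           list of (protein1, protein2)
    abstract_scores: {abstract_id: P(abstract discusses physical
                      interactions), from the Naive Bayes model}
    cocited:         {(protein1, protein2): [ids of abstracts citing both]}
    """
    def avg_score(pair):
        ids = cocited[pair]
        return sum(abstract_scores[a] for a in ids) / len(ids)
    # highest average interaction-likelihood first
    return sorted(pairs, key=avg_score, reverse=True)
```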

18 Integrating Extracted Data with Existing Databases
• Extracted: 6,580 interactions between 3,737 human proteins.
• Total: 31,609 interactions between 7,748 human proteins.

19 2.1) Co-citation Analysis: Evaluation

20 2.1) Co-citation Analysis: Evaluation

21 2.2) Learning of Interaction Extractors
• Proteins may be co-cited for reasons other than interactions.
• Solution: sentence-level extraction with a binary classifier.
• Given a sentence containing the two protein names, output:
  • Positive: if the sentence asserts an interaction between the two.
  • Negative: otherwise.
• If the sentence contains n > 2 protein names, replicate it into C(n, 2) sentences, each with only two protein names.
• Training data: AImed, a collection of Medline abstracts, manually tagged.
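The C(n, 2) replication step can be sketched directly with `itertools.combinations`. An illustrative helper (names assumed): each instance keeps the full sentence but designates a different pair of protein mentions as the two candidates for classification.

```python
from itertools import combinations

def candidate_instances(tokens, protein_spans):
    """Replicate a sentence with n tagged protein mentions into
    C(n, 2) two-protein classification instances (sketch).

    tokens:        the sentence as a token list
    protein_spans: list of (start, end) token spans, one per mention
    Each instance = (tokens, focus span 1, focus span 2); the other
    mentions are left as plain text for that instance.
    """
    return [(tokens, protein_spans[i], protein_spans[j])
            for i, j in combinations(range(len(protein_spans)), 2)]
```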

22 AImed
• Total of 225 documents (200 with interactions + 25 without interactions).
• Annotations for proteins and interactions.
Example: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit. Immunoprecipitation experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity ..."
  cyclin D1 ... becomes associated with p9Ckshs1 => Interaction
  cyclin D1 is associated with both p34cdc2 => Interaction
  cyclin D1 is associated with both p34cdc2 and p33cdk2 => Interaction

23 ELCS (Extraction using Longest Common Subsequences) [Bunescu et al., 2005]
• A method for inducing rules that extract interactions between previously tagged proteins.
• Each rule consists of a sequence of words with allowable word gaps between them, similar to [Blaschke & Valencia, 2001, 2002]:
  - (7) interactions (0) between (5) PROT (9) PROT (17).
• Any pair of proteins in a sentence forms a positive example if tagged as interacting; otherwise it forms a negative example.
• Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
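A rule of this shape can be matched greedily left to right. This sketch assumes one plausible reading of the notation, where each parenthesized number is the maximum count of tokens that may be skipped before the next rule word; it is not the authors' exact matching semantics.

```python
def rule_matches(rule, tokens):
    """Check whether an ELCS-style rule matches a token sequence.

    rule: list of (max_gap, word) pairs; each word must be found after
    at most max_gap skipped tokens (assumed interpretation of the
    '- (7) interactions (0) between (5) PROT (9) PROT' notation)."""
    pos = 0
    for max_gap, word in rule:
        hit = -1
        for j in range(pos, min(pos + max_gap + 1, len(tokens))):
            if tokens[j] == word:
                hit = j
                break
        if hit < 0:
            return False           # word not found within the allowed gap
        pos = hit + 1
    return True
```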

24 ERK (Extraction using a Relation Kernel)
• The patterns (features) are sparse subsequences of words constrained to be anchored on the two protein names.
• The feature space can be further pruned down: in almost all examples, a sentence asserts a relationship between two entities using one of the following patterns:
  • [FI] Fore-Inter: 'interaction of P1 with P2', 'activation of P1 by P2'
  • [I] Inter: 'P1 interacts with P2', 'P1 is activated by P2'
  • [IA] Inter-After: 'P1 - P2 complex', 'P1 and P2 interact'
• The three types of patterns are restricted to use at most 4 words (besides the two protein anchors).

25 ERK (Extraction using a Relation Kernel)
• The kernel K(S1, S2) is the number of common patterns between S1 and S2, weighted by their span in the two sentences.
• K(S1, S2) can be computed with the dynamic programming procedure from [Lodhi et al., 2002].
• Train an SVM model to find a max-margin linear discriminator between positive and negative examples.
Example:
  S1: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."
  S2: "Experiments with human osteosarcoma cells and Ewing's sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and ..."
  [FI] patterns: "human cells P1 associated with P2", ...
  [I] patterns: "P1 associated with P2", ...
  [IA] patterns: "P1 associated with P2 ,", ...
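The span-weighted counting of common subsequences is exactly what the string subsequence kernel of [Lodhi et al., 2002] computes. Below is a simplified word-level sketch of that recursion; the actual relation kernel on the slide additionally anchors the patterns on the two protein names, which this sketch omits.

```python
from functools import lru_cache

def subsequence_kernel(s, t, n, lam=0.5):
    """String subsequence kernel of Lodhi et al. (2002), over words:
    counts common (possibly gapped) subsequences of length n, each
    occurrence pair weighted by lam ** (total span in both sequences).
    Simplified, unnormalized sketch."""
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def kp(i, a, b):                      # K'_i of the original recursion
        if i == 0:
            return 1.0
        if min(len(a), len(b)) < i:
            return 0.0
        x, ap = a[-1], a[:-1]
        total = lam * kp(i, ap, b)
        for j in range(len(b)):
            if b[j] == x:
                total += kp(i - 1, ap, b[:j]) * lam ** (len(b) - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(a, b):                          # K_n
        if min(len(a), len(b)) < n:
            return 0.0
        x, ap = a[-1], a[:-1]
        total = k(ap, b)
        for j in range(len(b)):
            if b[j] == x:
                total += kp(n - 1, ap, b[:j]) * lam ** 2
        return total

    return k(s, t)
```

With lam = 1 the kernel simply counts pairs of common subsequence occurrences (e.g. "cat" vs "cat" shares "ca", "ct", "at" for n = 2); with lam < 1, gappy matches are down-weighted by their span, as the slide describes.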

26 Evaluation: ERK vs ELCS vs Manual
• Compare results using the standard measures of precision and recall:
  Precision = TP / (TP + FP)    Recall = TP / (TP + FN)
• All three systems were tested on AImed, using gold-standard proteins.

27 Evaluation: ERK vs ELCS vs Manual

28 Future Work & Conclusions
Future Work:
• Analyze the complete set of 750K abstracts using the relational kernel and integrate the results into an improved composite dataset.
Conclusions:
• Created a large database of interacting human proteins by consolidating interactions automatically extracted from Medline abstracts with existing databases.
• Final database: 31,609 interactions between 7,748 human proteins.

29 For Further Information
Consolidated database available online:
  http://bioinformatics.icmb.utexas.edu/idserve/
Papers available online:
  http://www.cs.utexas.edu/users/ml/publication/bioinformatics.html
"Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome," Ramani, A.K., Bunescu, R.C., Mooney, R.J. and Marcotte, E.M., Genome Biology, 6, 5, r40 (2005).
"Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions," Arun Ramani, Edward Marcotte, Razvan Bunescu, Raymond Mooney, to appear in the Proceedings of the ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.
"Collective Information Extraction with Relational Markov Networks," Razvan Bunescu and Raymond J. Mooney, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004), pp. 439-446, Barcelona, Spain, July 2004.
"Comparative Experiments on Learning Information Extractors for Proteins and their Interactions," Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong, Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33, 2 (2005), pp. 139-155.

30 The End

31 Protein Interaction Datasets: Normalization
• Need a shared convention for referencing proteins and their interactions.
• Map each interacting protein to a LocusLink ID => small loss of proteins.
• Consider interactions symmetric => many duplicates eliminated.
• Omit self-interactions: they cannot be evaluated on the functional annotation benchmark.
Example: HPRD reduced from 12,013 to 6,054 unique symmetric, non-self interactions.
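The three normalization steps above can be sketched as one pass over a dataset. An illustrative sketch (names and data structures assumed, not from the slides): canonical ordering of each ID pair makes symmetric duplicates collapse inside a set.

```python
def normalize(interactions, locuslink):
    """Normalize an interaction dataset as described above (sketch).

    interactions: iterable of (name_a, name_b) protein name pairs
    locuslink:    {protein name: LocusLink ID} mapping
    Returns the set of unique, symmetric, non-self ID pairs.
    """
    unique = set()
    for a, b in interactions:
        if a not in locuslink or b not in locuslink:
            continue                              # unmapped: small loss of proteins
        ia, ib = locuslink[a], locuslink[b]
        if ia == ib:
            continue                              # omit self-interactions
        unique.add((min(ia, ib), max(ia, ib)))    # symmetric -> canonical order
    return unique
```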

32 Protein Interaction Datasets: Normalization
• Dataset statistics after normalization (Is = interactions, Ps = proteins):

  Dataset          | Version  | Total Is (Ps)  | Self Is (Ps)  | Unique Is (Ps)
  Reactome         | 08/03/04 | 12,497 (6,257) | 160 (160)     | 12,336 (807)
  BIND             | 08/03/04 | 6,212 (5,412)  | 549 (549)     | 5,663 (4,762)
  HPRD             | 04/12/04 | 12,013 (4,122) | 3,028 (3,028) | 6,054 (2,747)
  Orthology (all)  | 03/31/04 | 71,497 (6,257) | 373 (373)     | 71,124 (6,228)
  Orthology (core) | 03/31/04 | 11,488 (3,918) | 206 (206)     | 11,282 (3,863)

33 Accuracy of manually curated interactions

  Functional Annotation Benchmark     Physical Interaction Benchmark
  Database            LLR             Database            LLR
  Reactome            3.8             Reactome            N/A
  BIND                2.9             BIND                N/A
  HPRD                2.1             Core orthology      5.0
  Core orthology      2.1             HPRD                3.7
  Non-core orthology  1.1             Non-core orthology  3.7

