Michigan, 2005 Alfonso Valencia CNB-CSIC Text Mining ISMB05 Alfonso Valencia CNB-CSIC.

Presentation transcript:

1 Text Mining, ISMB05. Alfonso Valencia, CNB-CSIC. Michigan, 2005.

2 SLIDE WINDOW APPROACH
Krallinger, Valencia, Drug Discovery Today 2005
ISMB-Biolink

3 BioLINK SIG: Linking Literature, Information and Knowledge for Biology
A Joint Meeting of The ISMB BioLINK Special Interest Group on Text Data Mining and The ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics
Christian Blaschke, Hagit Shatkay, Kevin B. Cohen, Lynette Hirschman
1. InTex: a Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text. S. T. Ahmed, D. Chidambaram, H. Davulcu, C. Baral
2. Corpus Design for Biomedical Natural Language Processing. K. B. Cohen, L. Fox, P. V. Ogren, L. Hunter
3. Unsupervised Gene/Protein Named Entity Normalization using Automatically Extracted Dictionaries. A. M. Cohen
4. Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions. A. Ramani, E. Marcotte, R. Bunescu, R. Mooney
5. MedTag: a Collection of Biomedical Annotations. L. H. Smith, L. Tanabe, T. Rindflesch, W. John Wilbur
6. A Machine Learning Approach to Acronym Generation. Y. Tsuruoka, S. Ananiadou, J. Tsujii
7. Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. B. Wellner
8. Adaptive String Similarity Metrics for Biomedical Reference Resolution. B. Wellner, J. Castaño, J. Pustejovsky
9. A Cross-Domain Application of Natural Language Processing in Biology. I. Chiu, L. H. Shu
10. Functional Annotation of Genes Using Hierarchical Text Categorization. S. Kiritchenko, S. Matwin, A. F. Famili
11. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. P. Nakov, A. Schwartz, B. Wolf, M. Hearst
12. Searching for High-Utility Text in the Biomedical Literature. H. Shatkay, A. Rzhetsky, W. J. Wilbur
13. Automatic Highlighting of Bioscience Literature. H. Wang, S. Bradshaw, M. Light
BioLINK SIG / BioOntologies in ECCB05 Madrid Sept. www.eccb05.org

4 Competitions
- BioCreAtIvE
  Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
  Task 2: GO to protein via text for a collection of human genes.
- TREC I, II
- KDD
- JNLPBA
- others
Text Mining vs. Curation
Text Mining supports curation
Curators build and maintain ontologies and databases
Text Mining profits from data from different resources: ontologies, databases

5 Text mining in a nutshell
1. Protein / gene names
   Interspecies
   Linking to DBs
2. Relations between entities
   Protein-protein
   Other entities (regulation, drugs)
   Function
3. Type of Relation
   Proteins
   Metabolic pathways

1. 80% prec/recall (BioCreative)
   Far less than that
   Essential (Bioinformatics not NLP)
2. Easy on the surface
   Best known one (accessible?)
   Dictionaries
   Very difficult (i.e. GO in BioCreative)
3. Semantic
   Summaries very difficult
   New challenge, unexplored
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
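The "80% prec/recall" figure on this slide refers to the standard precision and recall measures. As a minimal sketch of how these are computed for a gene-name extraction run, the following uses hypothetical toy sets of gene symbols; the function name and the data are assumptions for illustration and are not BioCreAtIvE results or its official scoring code.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of gene identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: gold-standard names versus names a system extracted.
gold = {"PCNA", "CDC2", "MSH2", "LBR", "TOP2A"}
predicted = {"PCNA", "CDC2", "MSH2", "LBR", "WNT2"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision penalizes spurious names (here WNT2), while recall penalizes missed ones (here TOP2A).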

6 Krallinger et al., Genome Biology 2005

7 Text mining in a nutshell
1. Protein / gene names
   1. Interspecies
   2. Linking to DBs
2. Relations
   1. Protein-protein
   2. Others (regulation, drugs)
   3. Function
3. Type of Relation
   1. Proteins
   2. Metabolic pathways
4. Concepts for groups of genes
   1. Existing
   2. Creating new ones

1. 80% prec/recall (BioCreative)
   1. Far less than that
   2. Essential (not NLP)
2. Easy on the surface
   1. Best known one (accessible?)
   2. Dictionaries
   3. Very difficult (to GO BioCreative)
3. Semantic
   1. Summaries very difficult
   2. New challenge, unexplored
4. Knowledge discovery
   1. Summaries and generalization
   2. Not yet
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005

8 Blaschke, et al., Funct. Integ. Genomics 2001
Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
Words: Dipeptidyl, Prolyl, nmr, Collagen-binding
17 genes: PCNA, CDC2, MSH2, LBR, TOP2A... (Cell cycle)
24 genes: ABCA5, CAT, ELF2, PIM1, WNT2... (Unknown)
GO codes: DNA replication, DNA metabolism, Cell Cycle control
Sentences:
PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.

9 AC Intro 1:30-1:45pm
Text Mining: Dietrich Rebholz-Schuhmann
7. High-recall Protein Entity Recognition Using a Dictionary. Kou, Cohen, Murphy. 1:45-2:10pm
9. Beyond The Clause: Extraction of Phosphorylation Information from Medline Abstracts. Narayanaswamy, Ravikumar, Vijay-Shanker. 2:10-2:35pm

10

11 Exponential Growth in Data
Figure panels: EMBL Total Entries / year; Medline Total Articles / year; Medline New Articles / year

12
OFFICIAL   62542   44.46 %
ALIAS      51749   36.79 %
PROTEIN    26363   18.74 %
The 2492 selected genes in the year 2002 were cited 140654 times
Tamames et al., 2005
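The three mention counts on this slide (OFFICIAL 62542, ALIAS 51749, PROTEIN 26363) can be checked against the stated total of 140654 citations. The sketch below is only a worked arithmetic check; the variable names are illustrative, and the computed shares should match the slide's percentages to within rounding.

```python
# Mention counts by name type, as read from the slide.
counts = {"OFFICIAL": 62542, "ALIAS": 51749, "PROTEIN": 26363}

total = sum(counts.values())
# Share of each name type as a percentage of all mentions.
shares = {kind: 100 * n / total for kind, n in counts.items()}

print("total mentions:", total)
for kind, share in shares.items():
    print(kind, f"{share:.3f} %")
```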

13 Leon et al., 2004
- 98 pathways with more than one step (information available for 73)
- 2111 individual steps.
Protein-compound links in abstracts
Total                              2111 steps   856 linked (40 %)
Bacterial chemotaxis               19           17 (89 %)
Glutathione metabolism             7            6 (85 %)
Fatty acid biosynthesis -path 1-   9            7 (78 %)
Protein-compound links in sentences
Total                              2111 steps   611 linked (29 %)
Bacterial chemotaxis               19           13 (65 %)
Two-component system               85           52 (61 %)
Citrate cycle -TCA cycle-          27           17 (63 %)
KEGG links to literature
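The slide distinguishes protein-compound links found anywhere in an abstract from links found within a single sentence. A minimal sketch of that distinction follows, assuming simple whole-word matching and a naive sentence splitter; the function names and the toy abstract are hypothetical and are not the method of Leon et al.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mentions(term, text):
    """Case-insensitive whole-word match for a single term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def link_level(protein, compound, abstract):
    """Return 'sentence', 'abstract' or None for the tightest co-occurrence."""
    if not (mentions(protein, abstract) and mentions(compound, abstract)):
        return None
    for sentence in split_sentences(abstract):
        if mentions(protein, sentence) and mentions(compound, sentence):
            return "sentence"
    return "abstract"


# Hypothetical toy abstract, written for illustration only.
abstract = "CheY is phosphorylated by CheA. Aspartate is sensed by the Tar receptor."

print(link_level("Tar", "aspartate", abstract))
print(link_level("CheA", "aspartate", abstract))
print(link_level("CheA", "glutathione", abstract))
```

Sentence-level co-occurrence is the stricter criterion, which is consistent with the slide reporting fewer linked steps in sentences than in abstracts.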

14 Evolution of gene names (figure axes: Years, Gene names)
Hoffmann, Valencia TIGs 2003
The evolution of gene names over time is a “scale free” process
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names

15 Hoffmann, Valencia, Nat Genet 2004

16

17 SOTA clustering versus significance of Geisha terms.
Oliveros, Blaschke, GIW 2000

18 SOTA and GEISHA mixed information
Blaschke, Herrero, Dopazo, Valencia 2002
Expression based clustering
Weight (expression) + Weight (text)
Term (text) based clustering
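The slide describes mixing expression and text information as "Weight (expression) + Weight (text)" but does not give a formula. One possible reading, sketched here purely as an assumption, is a weighted sum of an expression-profile distance and a term-vector distance between two genes; the function names, default weights and vectors are hypothetical, and this is not the SOTA algorithm itself.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(expr_a, expr_b, text_a, text_b, w_expr=0.5, w_text=0.5):
    """Weighted sum of expression distance and text (term) distance.

    Assumption: the weights are free parameters chosen by the user.
    w_expr=1, w_text=0 reduces to purely expression based clustering;
    w_expr=0, w_text=1 reduces to purely term based clustering.
    """
    return w_expr * euclidean(expr_a, expr_b) + w_text * euclidean(text_a, text_b)


# Hypothetical vectors: expression profiles and term-weight vectors of two genes.
expr_a, expr_b = [0.0, 0.0], [3.0, 4.0]
text_a, text_b = [1.0, 0.0], [1.0, 0.0]

print(combined_distance(expr_a, expr_b, text_a, text_b))
```

A distance of this kind could be passed to any distance-based clustering method; varying the two weights moves the result between the two extremes named on the slide.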

19

20 Stable clusters > central processes where expression and functional information agree
Unstable groups > contradictory information: “jumping” genes, divergent expression and functional classifications.
(Genes of very unstable behavior > related to insufficient information)

