Semantic Web Data Interoperation in Support of Enhanced Information Retrieval and Datamining in Proteomics Andrew Smith Thesis Defense 8/25/2006 Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner (UC Berkeley)
Outline
- Problem Description
- Enhanced information retrieval, datamining
- LinkHub – supporting system
  - Biological identifiers and their relationships
  - Semantic web: RDF graph, RDF query languages
- Cross-database queries
- Combined relational / keyword-based search
- Enhanced automated information retrieval
  - The web
  - PubMed (biomedical scientific literature)
  - Empirical performance evaluation for yeast proteins
- Related Work and Conclusions
Web Information Management and Access – Opposing Paradigms
Search engines over the web:
- Automated
- People can publish in natural languages, which is flexible
- Vast coverage (almost the whole web)
- Currently the preeminent paradigm because of the vast size of the web and its unstructured heterogeneity
- Unfortunately, only gives coarse-grained topical access, with no real cross-site interoperation / analysis
Semantic Web:
- Very fine-grained data modeling and connection
- Very precise cross-resource query / question answering supported
- Unfortunately, requires much more manual intervention; people must change how they publish (RDF, OWL, etc.), so it has limited acceptance and size
Combining Search and Semantic Web Paradigms
The two paradigms are largely independent, yet they seem to have complementary strengths and weaknesses. Key idea: these two approaches to web information management and retrieval can work together, and there are interesting, practical, and useful ways they can leverage and enhance each other.
Combining Relational and Keyword-based Search Access to Free Text Documents
Consider a query for "all documents containing information for proteins which are members of the Pfam Adenylate Kinase (ADK) family." Standard keyword-based search engines couldn't support this: relational information about documents is required, i.e. that they are related to particular proteins in the Pfam ADK family.
Using Semantic Web for Enhanced Automated Information Retrieval Basic idea: the semantic web provides detailed information about terms and their interrelationships which can be used as additional information to improve web searches for those terms (and related terms). As proof of concept, the particular, practical problem we are addressing is to find additional relevant documents for proteomics identifiers on the web or in the scientific literature.
Finding Additional Relevant Documents for Proteomics Identifiers
Why not just do a web search directly for the identifier, e.g. 'P26364'? This is unlikely to give good results:
- Conflated senses of the identifier text, e.g. product catalog codes, etc.
- There might be synonyms of the identifier; these should be searched too (the semantic web could provide them).
- Many important, relevant documents might not directly mention the identifier, e.g. pages about 'cancer pathways' that do not specifically contain a cancer-pathway-related protein identifier. We should also search for important related concepts.
- There is potentially much extra information available from the semantic web: use it!
Example Cross-database Queries
- For yeast protein interactions, find corresponding, evolutionarily related (homologous) protein interactions in worm. Requires 4 databases: UniProt, SGD, PFAM, WormBase.
- Explore pseudogenes for yeast essential genes; explore pseudogenes of human homologues of yeast essential genes. Requires 4 databases: PseudoGene, SGD, UniProt, PFAM.
Characteristics of Biological Data
Biological data, especially proteomics data, is the motivation and domain of focus:
- Vast quantities of data from high-throughput experiments (genome sequencing, structural genomics, microarray, etc.)
- Huge and growing number of biological data resources: distributed, heterogeneous, with large variance in size
- Practical domain to work in: there is a need for better interoperation
- Challenging but not overly complex
- Rich semantic relationships among data resources, but often not made explicit
Problem Description
Integration and interoperation of structured, relational data (particularly proteomics data) that is:
- Large-scale
- Widely distributed
- Independently maintained
in support of important applications:
- Enhanced information retrieval / web search
- Cross-database queries for datamining
Data Heterogeneity
Lack of standard detailed description of resources.
Data are exposed in different ways:
- Programmatic interfaces
- Web forms or pages
- FTP directory structures
Data are presented in different ways:
- Structured text (e.g., tab-delimited format and XML format)
- Free text
- Binary (e.g., images)
Classical Approaches to Interoperation: Data Warehousing and Federation
Data warehousing focuses on data translation: translate data to a common format under a unified schema, cross-reference it, and store and query it in a single machine/system.
Federation focuses on query translation: translate and distribute the parts of a query across multiple distinct, distributed databases and collate their results into a single result.
General Strategy for Interoperation
Data is vast, distributed, and independently maintained, so complete centralized data-warehousing integration is impractical. We must rely on federated, cooperative, loosely coupled solutions which allow partial and incremental progress; widely used and supported standards are necessary. The Semantic Web is an excellent fit for these needs.
Semantic Web: Resource Description Framework (RDF) Models Data as a Directed Graph with Objects and Relationships Named by URIs
(Figure: an example RDF graph and its XML serialization, with subject http://www.ncbi.nlm.nih.gov/SNP, predicates http://purl.org/dc/elements/1.1/creator and http://purl.org/dc/elements/1.1/language, and objects http://www.ncbi.nlm.nih.gov and the literal "en".)
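To make the triple structure concrete, here is a minimal sketch in Python that deliberately avoids any RDF library: the slide's example statements as plain (subject, predicate, object) tuples, with a tiny lookup over them. The `objects` helper is a hypothetical illustration, not part of any real RDF API.

```python
# The slide's example RDF graph as bare (subject, predicate, object) triples,
# with URIs represented as plain strings and "en" as a literal.
triples = [
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/creator",
     "http://www.ncbi.nlm.nih.gov"),
    ("http://www.ncbi.nlm.nih.gov/SNP",
     "http://purl.org/dc/elements/1.1/language",
     "en"),
]

def objects(subject, predicate):
    """Return all objects attached to (subject, predicate) in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://www.ncbi.nlm.nih.gov/SNP",
              "http://purl.org/dc/elements/1.1/language"))  # ['en']
```

Real deployments would use an RDF store and a query language rather than list scans, but the directed-graph data model is exactly this simple at its core.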
Name / ID Proliferation Problem
Identifiers for biological entities are a simple but key way to identify and interrelate the entities: an important "scaffold" for biological data. But there are often many synonyms for the same entity, e.g. strange legacy names such as "sonic hedgehog" (a vertebrate homologue of the fly gene "hedgehog"). Even simple syntactic variants can be cumbersome, e.g. GO:0008150 vs. GO0008150 vs. GO-8150, etc. Beyond synonyms there are many other kinds of relationship, including one-to-many mappings. Known relationships among entities are not always stored, or are stored in non-standard ways. The implicit overall data structure is an enormous, elaborate graph of relationships among biological entities.
LinkHub
The major proteomics hub UniProt performs centralized identifier mapping for large, well-known databases. But:
- Large staff, resource intensive, manual curation: a centralization bottleneck. Not viable as a complete solution.
- Just simple mappings, no relationship types.
- Many data resources not covered (e.g., small, transient, local, lab-specific, "boutique").
There is a need for a system / toolkit to enable local, collaborative integration of data, allowing people to create "mini UniProts" and connect them: LinkHub. Practically, LinkHub provides a common "links portal" into a lab's resources, also connecting them to larger resources. LinkHub is used this way for the Gerstein Lab and NESG, connecting them to the major proteomics hub UniProt.
Major / Minor Hubs and Spokes Federated Model
LinkHub acts as a local minor hub connecting groups of common resources, with a single common connection to major hubs: a more efficient organization of biological data.
Example YeastHub / LinkHub Queries
Query 1: Finding Worm 'Interologs' of Yeast Protein Interactions. For each yeast gene in an interacting pair, find the corresponding WormBase genes:
yeast gene name → UniProt Accession → Pfam accession → UniProt Accession → WormBase ID
Query 2: Exploring Pseudogene Content versus Gene Essentiality in Yeast and Humans:
yeast gene name → UniProt Accession → yeast pseudogene
yeast gene name → UniProt Accession → Pfam accession → human UniProt ID → UniProt Accession → Pseudogene LSID
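A hedged sketch of how such a chain of identifier mappings can be followed hop by hop. The mapping tables and identifier values below are illustrative toy data, not real database contents, and `follow` is a hypothetical helper for exposition, not LinkHub's actual query interface.

```python
# Toy typed-mapping graph: each identifier maps to a list of
# (type, identifier) pairs it is linked to.
mappings = {
    ("yeast_gene", "YDR226W"): [("uniprot", "P26364")],
    ("uniprot", "P26364"): [("pfam", "PF00406")],
    ("pfam", "PF00406"): [("uniprot", "Q10000")],
    ("uniprot", "Q10000"): [("wormbase", "WBGene00000001")],
}

def follow(start, type_path):
    """Return all identifiers reached from `start` by following mappings
    whose target types match `type_path`, one hop per entry."""
    frontier = [start]
    for wanted_type in type_path:
        frontier = [nxt for node in frontier
                    for nxt in mappings.get(node, [])
                    if nxt[0] == wanted_type]
    return frontier

# The Query 1 chain: yeast gene -> UniProt -> Pfam -> UniProt -> WormBase
result = follow(("yeast_gene", "YDR226W"),
                ["uniprot", "pfam", "uniprot", "wormbase"])
```

In the real system these chains are expressed as RDF graph queries; the hop-by-hop frontier expansion above is just the underlying idea made explicit.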
Combining Relational and Keyword-based Search Access to Free Text Documents
Consider again the query for "all documents containing information for proteins which are members of the Pfam Adenylate Kinase (ADK) family." Standard keyword-based search engines couldn't support this; relational information about documents is required, i.e. that they are related to proteins in the Pfam ADK family. LinkHub attaches documents to identifier nodes and supports such relational query access to them.
LinkHub Path Type Queries
View all paths in the LinkHub graph matching specific relationship types, e.g. "family views":
PDB ID → UniProt ID → Pfam family → UniProt ID → PDB ID → MolMovDB Motion
NESG ID → UniProt ID → Pfam family → UniProt ID → NESG ID
Practically used as a secondary, orthogonal interface to other databases: MolMovDB and NESG's SPINE both use LinkHub for such "family views".
LinkHub Subgraphs as Gold-standard Training Sets for Enhanced Automated Information Retrieval
The LinkHub subgraph emanating from a given identifier, together with the web pages (hyperlinks) attached to the identifiers in the subgraph, is concrete, accurate, extra information about that identifier which can be used to improve document retrieval for it. LinkHub subgraphs and their associated documents are used as a training set to build classifiers that rank documents obtained from the web or the scientific literature.
Training Set Docs are Scaled Down Based on Distance and Link-types from the Central Identifier
(Figure: an example subgraph of identifiers A–G around the central identifier, with weights on the edges such as 0.5, 0.33, 0.7, 0.8; documents attached to identifiers receive weights that fall off with distance and link type from the center, e.g. Weight 1.0, 0.75, 0.5.)
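One plausible way to implement this downweighting is sketched below. The edge weights are toy values, and the multiply-edge-weights-along-the-path rule is an assumption chosen for illustration, not necessarily the exact combination scheme used in the thesis.

```python
# Toy identifier subgraph: undirected edges with link-type-derived weights.
edges = {
    ("A", "B"): 0.5,
    ("A", "C"): 0.75,
    ("C", "D"): 0.5,
}

def doc_weight(path):
    """Weight for documents attached to the identifier at the end of `path`,
    taken as the product of edge weights along the path from the central
    identifier (the first element of `path`)."""
    w = 1.0
    for a, b in zip(path, path[1:]):
        w *= edges.get((a, b), edges.get((b, a), 1.0))
    return w

print(doc_weight(["A"]))            # 1.0  (the central identifier itself)
print(doc_weight(["A", "C"]))       # 0.75
print(doc_weight(["A", "C", "D"]))  # 0.375
```

Any monotone-decreasing scheme would serve the same purpose: documents farther from the central identifier contribute less to the classifier's word vector.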
Term Frequency-Inverse Document Frequency (TF-IDF) Word Weighting
Vector Space Model: documents are modeled as vectors of word weights, where the weights come from TF-IDF.
- TF: frequently occurring words in the query are more likely semantically meaningful.
- IDF: less frequent words in the corpus are more discriminating.
- Document frequency D: the number of docs in the corpus (of N total docs) containing a term.
- IDF = log(N/D)
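The slide's definitions can be sketched directly (raw term counts for TF and log(N/D) for IDF; the three-document corpus is a made-up toy example):

```python
import math
from collections import Counter

corpus = [
    "kinase binds atp",
    "adenylate kinase phosphotransferase",
    "yeast cell cycle",
]
docs = [doc.split() for doc in corpus]
N = len(docs)
# D[t] = number of documents containing term t (document frequency)
D = Counter(t for doc in docs for t in set(doc))

def tfidf(doc):
    """TF-IDF weight vector for one tokenized document: tf(t) * log(N / D[t])."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / D[t]) for t in tf}

weights = tfidf(docs[0])
# "kinase" appears in 2 of 3 docs -> idf = log(3/2) (common, downweighted);
# "atp" appears in 1 of 3 docs   -> idf = log(3)   (rare, upweighted).
```

Production IR systems add refinements (sublinear TF, IDF smoothing), but this is the core weighting used throughout the rest of the pipeline.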
Classifier for Document Relevance Reranking
Use standard information retrieval techniques: tokenization, stopword filtering, word stemming, TF-IDF term weighting, and cosine similarity measures. Classifier model: sum the weighted subgraph documents' word vectors, apply TF-IDF weighting, and keep the top-weighted 20% of terms. A standard cosine similarity value is used to score documents against the classifier.
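A minimal sketch of the scoring step, assuming sparse dict-based word-weight vectors; the classifier and document vectors shown are invented toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

classifier = {"kinase": 2.0, "adenylate": 1.5, "atp": 1.0}  # top-term vector
doc = {"kinase": 1.0, "atp": 1.0, "cell": 1.0}              # candidate doc
score = cosine(classifier, doc)  # higher score -> ranked higher
```

Candidate documents are ranked by this score against the identifier's classifier vector; terms outside the classifier's kept top 20% simply contribute nothing to the dot product.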
Obtaining Documents to Rank
Use major web search engines via their web APIs; for demo purposes, we used Yahoo. Perform individual, base searches for:
- The top 40 training set feature words
- The identifiers in the subgraph
Combine all results into one large result set, then rerank it using the constructed classifier. Essentially, this systematically explores the "concept space" around the identifier. The searches returning the most relevant docs on average could be called semantic signatures: key concepts related to the given identifier, succinct snippets of what the identifier is "about."
Results Example: UniProt P26364 (figure shows the query P26364, the spawned searches, and their results)
Note: a direct Yahoo search for P26364 returned very poor results. In manual inspection of the results, 17/40 clearly had nothing to do with the UniProt protein, and many of the others didn't seem too useful: large tabular dumps of identifiers, etc. The first clearly unrelated result in LinkHub's results was at position 72. LinkHub's results are arguably better.
PubMed Application
PubMed is a database of biomedical scientific literature citations covering roughly the last 50-100 years. Currently, there is no automated information retrieval over PubMed for biological-identifier-related citations. We built an app to search for related PubMed abstracts, using Swish-e to index and provide base search access to PubMed.
PubMed search for UniProt P26364
Manual annotations exist (above); only 3 and 4 are directly related. The LinkHub-based automated method ranked these 13 and 7 and returned many more relevant docs.
Empirical Performance Tests
The preceding results seemed reasonably good, but can we empirically measure performance? Use a gold-standard set of documents (a curated bibliography) known to be related to particular identifiers: gene_literature.tab from the yeast genome database (SGD).
Goals of Performance Tests
- Quantify the performance level of the procedure: is performance close to optimal, or is there lots of room for improvement?
- How can we know that adding in (downweighted) documents for related identifiers actually helps? Proof of concept: PFAM and GO are key related concepts for proteins; let's objectively see if they help.
- Quantify the performance of a particular enhancement: the pre-IDF step.
Pre-Inverse Document Frequency (pre-IDF) Step
Idea: maximally separate all proteomics identifiers' classifiers while making them specifically relevant and discriminating. Determine document frequencies for all pages of a type, e.g. all (or a sample of) UniProt pages. Perform IDF against the type's document frequencies and then again against the corpus you are searching, e.g. first UniProt, then PubMed.
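A numeric sketch of the two-stage weighting. The counts are invented toy numbers, and combining the two IDF factors multiplicatively is an assumption made here for illustration; the thesis's exact formula may differ.

```python
import math

def idf(n_docs, doc_freq):
    """Standard IDF = log(N / D)."""
    return math.log(n_docs / doc_freq)

def pre_idf_weight(tf, type_n, type_df, corpus_n, corpus_df):
    """Term weight after pre-IDF (vs. the source type, e.g. all UniProt
    pages) followed by ordinary IDF (vs. the target corpus, e.g. PubMed)."""
    return tf * idf(type_n, type_df) * idf(corpus_n, corpus_df)

# A boilerplate term appearing on 900 of 1000 UniProt pages is damped...
w_common = pre_idf_weight(tf=3, type_n=1000, type_df=900,
                          corpus_n=10**6, corpus_df=10**4)
# ...while a term rare across UniProt pages keeps a large weight.
w_rare = pre_idf_weight(tf=3, type_n=1000, type_df=10,
                        corpus_n=10**6, corpus_df=10**4)
```

The effect is the one the slide describes: terms shared by most pages of the identifier's type (boilerplate) are suppressed, so each identifier's classifier is pulled toward what distinguishes it from its peers.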
Pre-IDF Step is Generally Useful
For example, imagine wanting to find web pages highly relevant to a particular digital camera or mp3 player. Cnet has many pages about different digital cameras and mp3 players, giving document frequencies for these. Build a classifier for a particular digital camera by first doing the pre-IDF step against the document frequencies for all Cnet digital camera pages.
Experimental Protocol
- Pick a few hundred random yeast proteins from TrEMBL and Swiss-Prot separately, each with at least 20 citations and GO and PFAM relations.
- A protein's citations are its "in" group; other proteins' citations are the "mid" group; randomly selected PubMed citations (not in gene_literature.tab or UniProt) are an "out" group.
- The classifier should match the "in" – "mid" – "out" order, with the focus on "in", the most important. The degree of deviation from this order is the objective test.
Performance Measure: ROC Curves
The ROC curve shows the sensitivity / specificity tradeoff: the true positive rate vs. the false positive rate. It depicts classifier performance without regard to class distribution or error costs. The area under the curve (AUC) is a single summary measure: 1 is the maximum, 0 the minimum. AUC is the probability that a randomly chosen positive is ranked higher than a randomly chosen negative.
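That probabilistic interpretation can be computed directly on toy scores (ties counted as half, the usual convention):

```python
# Classifier scores for toy positive ("in" group) and negative documents.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

def auc(pos, neg):
    """AUC as the probability a random positive outscores a random negative."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(pos_scores, neg_scores))  # 8/9: 8 of the 9 pairs are ordered correctly
```

A perfect ranking (every "in" document above every "out" document) gives 1.0; a random ranking hovers around 0.5.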
Measuring Performance
- Use full AUC and top 5% AUC (.05 AUC): it is how well you do in the top of the rankings that really matters.
- UniProt page weight set to 1.0. Perform optimization over parameters at 0.1 granularity: PFAM and GO weights; percentage of features kept; use / don't use the pre-IDF step.
- Compare average AUC values for various parameter values; determine statistical significance with paired t-tests.
Results
- The percentage of features kept didn't really matter; use 0.4 or 0.5 for computational efficiency.
- Pre-IDF gave the largest performance boost: any trial with pre-IDF gave a better result than one without, regardless of the other parameters.
- Pre-IDF and PFAM at a small weight increased average AUC for all trials for both TrEMBL and Swiss-Prot. GO did not help (PFAM is more information rich).
- TrEMBL was helped more than Swiss-Prot. Swiss-Prot is curated, high quality, and complete, whereas TrEMBL is automated and lower quality, so this makes sense.
- All comparisons for TrEMBL were statistically significant by t-tests; for Swiss-Prot, only pre-IDF was.
Related Work: Search Engines / Google
Only a small number of search terms are entered, so there is little information to go by, and maybe millions of result documents: hard to rank well. Good ranking was a big problem for early search engines. Google (Brin and Page 1998) provided a popular solution: hyperlinks act as "votes" for importance, and web pages with the most and best votes are ranked highest.
Alternative to Millions of Result Documents: Increased Search Precision
Increase search precision so that a smaller, more manageable number of documents is returned, e.g. by using more search terms. Problem: users are lazy and won't enter too many terms. Solution: the semantic web provides the increased precision; users just select semantic web nodes for concepts, and the nodes are automatically expanded to increase search precision. This is what LinkHub does.
Very Recent Related Work: Aphinyanaphongs et al 2006
- Argues for specialized, automated filters for finding relevant documents in the huge and ever-expanding scientific literature.
- Constructed classifiers for predicting the relevance of PubMed documents for various clinical medicine themes, using state-of-the-art SVM classifiers.
- Used large, manually curated, respected bibliographies for training.
- Used text from the article title, abstract, journal name, and MeSH terms for features.
Aphinyanaphongs et al 2006 cont.
LinkHub-based search, by contrast:
- Fairly basic classifier model: word weight vectors compared with cosine similarity.
- Small training sets (UniProt, GO, PFAM pages), and fairly noisy ones too: web pages vs. focused text.
- Only abstract text used as features.
- Some gene_literature.tab citations are only generally relevant, so true performance is understated.
- Classifiers are built automatically and easily at very large scale as a natural byproduct of LinkHub.
- Yet LinkHub's 0.927 and 0.951 AUCs are better than, or negligibly smaller than, the 0.893, 0.932, and 0.966 AUCs of Aphinyanaphongs et al 2006.
Aphinyanaphongs et al 2006 and Citation Metrics
They also compared to citation-based metrics: citation count, journal impact factor, and (indirectly) Google PageRank. The SVM classifiers outperformed these, and adding citation metrics as features gave marginal improvement at best; this is surprising given Google's success with PageRank.
Aphinyanaphongs et al 2006 and Citation Metrics cont.
More generally stated, the conceivable reasons for citation are so numerous that it is unrealistic to believe that citation conveys just one semantic interpretation. Instead, citation metrics are a superimposition of a vast array of semantically distinct reasons to acknowledge an existing article. It follows that any specific set of criteria cannot be captured by a few general citation metrics; only focused filtering mechanisms, if attainable, would be able to identify articles satisfying the specific criteria in question.
Conclusion: Aphinyanaphongs et al 2006 is consistent with and supports the general approach taken by LinkHub of creating specialized filters (in the form of word weight vectors) for retrieval of documents specific to particular proteomics identifiers. Their approach is arguably state of the art for focused tasks, superior to the most commonly used search technology (Google); by extension, LinkHub-based search is also.
Publications
1. LinkHub: a Semantic Web System for Efficiently Handling Complex Graphs of Proteomics Identifier Relationships that Facilitates Cross-database Queries and Information Retrieval. Andrew K. Smith, Kei-Hoi Cheung, Kevin Y. Yip, Martin Schultz, Mark B. Gerstein. International Workshop on Semantic e-Science (SeS2006), 3rd September 2006, Beijing, China; co-located with ASWC2006. To be published in a special proceedings of BMC Bioinformatics.
2. YeastHub: a semantic web use case for integrating data in the life sciences domain. KH Cheung, KY Yip, A Smith, R Deknikker, A Masiar, M Gerstein (2005) Bioinformatics 21 Suppl 1: i85-96.
3. An XML-Based Approach to Integrating Heterogeneous Yeast Genome Data. KH Cheung, D Pan, A Smith, M Seringhaus, SM Douglas, M Gerstein. 2004 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS); pp 236-242.
4. Network security and data integrity in academia: an assessment and a proposal for large-scale archiving. A Smith, D Greenbaum, SM Douglas, M Long, M Gerstein (2005) Genome Biol 6: 119.
5. Computer security in academia: a potential roadblock to distributed annotation of the human genome. D Greenbaum, SM Douglas, A Smith, J Lim, M Fischer, M Schultz, M Gerstein (2004) Nat Biotechnol 22: 771-2.
6. Impediments to database interoperation: legal issues and security concerns. D Greenbaum, A Smith, M Gerstein (2005) Nucleic Acids Res 33: D3-4.
7. Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis. CS Goh, N Lan, SM Douglas, B Wu, N Echols, A Smith, D Milburn, GT Montelione, H Zhao, M Gerstein (2004) J Mol Biol 336: 115-30.
Acknowledgements Committee: Martin Schultz, Mark Gerstein (co-advisors), Drew McDermott, Steven Brenner Kei Cheung Michael Krauthammer Kevin Yip Yale Semantic Web Interest Group National Library of Medicine (NLM)
(Very) Brief Proteomics Overview
The Central Dogma of Biology states that the coded genetic information hard-wired into DNA (i.e. the genome) is transcribed into individual transportable cassettes composed of messenger RNA (mRNA); each mRNA cassette contains the program for synthesis of a particular protein (or a small number of proteins). Proteins are the key agents in the cell. Proteomics is the large-scale study of proteins, particularly their structures and functions. While the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment. One organism has radically different protein expression in different parts of its body, in different stages of its life cycle, and in different environmental conditions.
Proteins are modular, composed of domains or families (based on evolution)
LinkHub Data Model
LinkHub has identifier types, identifiers, mappings between identifiers (a "mapping type" attribute gives the relationship type), and resources together with the identifier types they accept (and where) in their URL templates (link_exceptions gives exceptions). LinkHub is stored in both RDF (Sesame) and SQL (MySQL), with translation between them: MySQL for robustness and efficiency in the GUI frontend, RDF to complement YeastHub (as "glue" to make direct and indirect identifier connections).
Term Selection / Weighting using Term Associations in the Corpus
There is lots of information in the originating UniProt, GO, PFAM, etc. pages and in PubMed: use it. Tune the classifier to the actual content of PubMed (word frequencies); TF-IDF does this crudely, so let's do better. Conceptually, a "good" term is one that is strongly associated in the corpus (i.e. much above the background chance association) with many other terms (which themselves have similarly strong associations) from the originating documents. The strength of association is the ratio of a term's document frequency in the search results over its background document frequency in the corpus (PubMed); > 1 means association.
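The association-strength ratio is simple to state in code; the counts below are invented toy values, not real PubMed statistics.

```python
def association_strength(df_in_results, n_results, df_in_corpus, n_corpus):
    """Ratio of a term's document frequency within a search-result set to
    its background document frequency in the whole corpus; > 1 suggests
    the term is associated with the query that produced the results."""
    return (df_in_results / n_results) / (df_in_corpus / n_corpus)

# e.g. "kinase" in 40 of 100 result docs vs. 20,000 of 1,000,000 corpus docs:
s = association_strength(40, 100, 20_000, 1_000_000)  # ratio of 0.4 to 0.02
```

Terms whose ratio stays near 1 occur at background rates and are filtered out as noise; strongly enriched terms survive into the classifier.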
Example associations in PubMed for term “browser”
Simple Algorithm for Term Selection & Weighting
- Compute the combined term frequency vector for the originating docs.
- For each term, do a search with it against PubMed and compute the combined document frequency vector for the result set. Filter out terms which aren't present at levels well above background chance.
- Compute the cosine similarity between these two vectors; larger means more relevant.
In tests I've done, this seems to do well, particularly for filtering noise terms (which almost always have a score close to 0).
Network Analysis of the Term Association Graph
We can create a directed graph of word inter-associations, with edges weighted by strength of association, and do network analysis to aid term selection / weighting:
- Find disconnected components: independent concept groups.
- Find hubs: likely central concepts in the originating documents, so good features.
- Find cliques: groups of words that are all strongly co-associated with each other; the bigger the better.
- Apply Google PageRank for term selection and weighting.
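Two of these analyses (hub detection and clique checking) sketched on a toy association graph; the terms and edges are invented, and the graph is stored as symmetric adjacency sets for simplicity even though real association graphs are directed and weighted.

```python
from itertools import combinations

# Toy term-association graph: term -> set of strongly associated terms.
graph = {
    "kinase": {"adenylate", "atp", "phosphate"},
    "adenylate": {"kinase", "atp"},
    "atp": {"kinase", "adenylate", "phosphate"},
    "phosphate": {"kinase", "atp"},
    "browser": set(),  # noise term: disconnected from the rest
}

# Hubs: terms with the most associations, likely central concepts.
hubs = sorted(graph, key=lambda t: len(graph[t]), reverse=True)

def is_clique(terms):
    """True if every pair of terms is mutually associated."""
    return all(b in graph[a] for a, b in combinations(terms, 2))

print(hubs[0])                                    # 'kinase' (degree 3)
print(is_clique({"kinase", "adenylate", "atp"}))  # True
```

Consistent with the clique observations on the following slides, a noise term like "browser" has degree 0 and belongs to no clique larger than itself.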
Example – Maximal Clique for Yeast Adenylate Kinase (UniProt P26364)
Largest cliques for "pfam", "scop", and "embl"
These are small, and do not overlap with the central largest clique: not key terms. I noticed that noise terms mostly have empty or small maximal cliques.
Google PageRank on the Term Association Graph
Actually, this doesn't work well: it seems to most highly weight relevant but more general terms (e.g. "cell", "pfam", etc.). In retrospect, this makes sense: PageRank tries to find nodes which are linked to by many other nodes and link out to many other nodes, and this is intuitively a signature of more general terms (i.e. they will have a larger, more diffuse set of associations than more focused terms). There are some results in the literature that PageRank applied to scientific citation graphs does not work well. Maybe Google really isn't God!