Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO) Kevin Heinrich Master’s Defense July 16, 2004.


Outline Problem / Goals Related Work Information Retrieval –Vector Space Model –Latent Semantic Indexing (LSI) Biological Databases SGO Use & Results

Problem Biological tools are creating vast amounts of data. Current techniques are time-consuming and expensive. Want to know phenotype (function) from genotype (structure/sequence).

Goals Develop a tool to aid researchers in finding and understanding functional gene relationships. Use information that covers whole genome, e.g. literature.

Related Work Jenssen et al. (2001) developed PubGene. –Literature network –Assigns a functional association if there is a co-occurrence of gene symbols. Wilkinson and Huberman (2004) expanded this idea to find communities of related genes. Yandell and Majoros (2002) use natural language processing techniques to identify the nature of relationships.

Related Work Almost all literature-based techniques rely on term co-occurrence. What about gene aliases? Solution: Apply a more robust technique.

Information Retrieval Vector Space Model Documents are parsed into tokens. Each token is assigned a weight, $w_{ij}$, for the $i$-th token in the $j$-th document. An $m \times n$ term-by-document matrix, $A$, is created where –Documents are $m$-dimensional vectors. –Tokens are $n$-dimensional vectors.
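
As an illustration of the term-by-document matrix described on this slide, here is a minimal sketch. The whitespace tokenizer, the function name `term_document_matrix`, and the three toy documents are assumptions for the example; the slides do not describe the actual SGO parser, and raw counts stand in for the weights.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] is the raw count of term i in document j."""
    # Naive tokenizer (assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One row per distinct token, in sorted order for reproducibility.
    terms = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # m rows (terms) by n columns (documents).
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

# Hypothetical toy documents.
docs = ["reelin binds vldlr", "vldlr binds dab1", "app cleaved by presenilin"]
terms, A = term_document_matrix(docs)
```

Each column of `A` is a document vector with one entry per term, and each row is a token vector with one entry per document.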

Information Retrieval Term Weights Term weights are the product of a local and a global component. Schemes listed: tf, idf, idf2.

Information Retrieval Term Weights (cont’d) Scheme listed: log-entropy. The goal is to give distinguishing terms more weight.
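
The weighting formulas themselves are not preserved in this transcript. The sketch below uses the log-entropy definition common in the information retrieval literature, which is an assumption about what the slide showed: local weight $\log_2(1 + f_{ij})$ and global weight $1 + \sum_j p_{ij} \log_2 p_{ij} / \log_2 n$, with $p_{ij} = f_{ij} / \sum_j f_{ij}$. The function name `log_entropy` is hypothetical.

```python
import math

def log_entropy(A):
    """Apply log-entropy weighting to a term-by-document count matrix A (list of rows)."""
    n = len(A[0])  # number of documents
    W = []
    for row in A:
        total = sum(row)
        # Sum of p * log2(p) over documents where the term occurs (always <= 0).
        entropy = 0.0
        for f in row:
            if f > 0:
                p = f / total
                entropy += p * math.log2(p)
        # Global weight: 1 for a term concentrated in one document,
        # 0 for a term spread evenly over all documents.
        if n > 1 and total > 0:
            g = 1.0 + entropy / math.log2(n)
        else:
            g = 1.0
        # Local weight: log2(1 + raw frequency).
        W.append([g * math.log2(1 + f) for f in row])
    return W

# Row 0: a term spread evenly over four documents. Row 1: a term in one document only.
W = log_entropy([[1, 1, 1, 1], [4, 0, 0, 0]])
```

The evenly spread term receives weight zero everywhere, while the concentrated term keeps its full local weight, which is the "distinguishing terms get more weight" behavior the slide describes.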

Information Retrieval Query & Similarity Queries are represented by a pseudo-document vector. Similarity is the cosine of the angle between document vectors.
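
A minimal sketch of cosine ranking, treating the query as a pseudo-document in term space. The helper names `cosine` and `rank_documents` and the tiny vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (score, document index) pairs, highest cosine first."""
    scores = [(cosine(query_vec, d), j) for j, d in enumerate(doc_vecs)]
    return sorted(scores, reverse=True)

# Three documents over a three-term vocabulary (hypothetical weights).
doc_vecs = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
# The query is a pseudo-document that contains only the first term.
query = [1, 0, 0]
ranking = rank_documents(query, doc_vecs)
```

Only the first document shares a term with the query, so it is the only one with a nonzero score. Documents that share no terms with the query always score zero in this raw term space.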

Information Retrieval Latent Semantic Indexing (LSI) LSI performs a truncated SVD on $A = U \Sigma V^T$. $U$ is the $m \times r$ matrix of eigenvectors of $A A^T$. $V^T$ is the $r \times n$ matrix of eigenvectors of $A^T A$. $\Sigma$ is the $r \times r$ diagonal matrix containing the $r$ nonnegative singular values of $A$. $r$ is the rank of $A$. A rank-$k$ approximation is given by $A_k = U_k \Sigma_k V_k^T$.

Information Retrieval LSI (cont’d) Document-to-document similarity is computed in the rank-$k$ space [equation not preserved]. Queries are projected into the low-rank approximation space [equation not preserved].
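
The two expressions on this slide are missing from the transcript. In the standard LSI formulation, which is assumed here rather than recovered from the slide, they follow from $A_k = U_k \Sigma_k V_k^T$ and the orthonormality of the columns of $U_k$:

```latex
% Document-to-document similarity in the rank-k space
A_k^T A_k = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T
          = V_k \Sigma_k^2 V_k^T
          = (V_k \Sigma_k)(V_k \Sigma_k)^T

% Projection of a query vector q (length m) into the rank-k space
\hat{q} = q^T U_k \Sigma_k^{-1}
```

The rows of $V_k \Sigma_k$ are the scaled document vectors, so document-to-document similarity reduces to inner products between those rows. A projected query $\hat{q}$ lives in the same $k$-dimensional space as the rows of $V_k$.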

Information Retrieval LSI (cont’d) Scaled document vectors can be computed once and stored for quick retrieval. The lower-dimensional space forces queries and documents to be compared in a more conceptual manner and saves storage. Choice of number of factors is an open question. End Effect: LSI can find similarities between documents that have no term co-occurrence.
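
The end effect can be seen on a toy example. The sketch below assumes NumPy is available and uses a hypothetical five-term, four-document count matrix. The query is compared to the scaled document vectors $V_k \Sigma_k$ after being mapped to $q^T U_k$, which is one common convention and not necessarily the one SGO uses.

```python
import numpy as np

# Hypothetical toy counts: rows are terms, columns are four documents.
terms = ["reelin", "reln", "dab1", "vldlr", "insulin"]
A = np.array([
    [1, 0, 0, 0],  # reelin appears only in document 0
    [0, 1, 0, 0],  # reln (an alias) appears only in document 1
    [1, 1, 1, 0],  # dab1
    [1, 1, 0, 0],  # vldlr
    [0, 0, 0, 2],  # insulin appears only in document 3
], dtype=float)

def cosine_rows(vec, rows):
    """Cosine between vec and each row of rows."""
    return (rows @ vec) / (np.linalg.norm(rows, axis=1) * np.linalg.norm(vec))

# Thin SVD, then keep the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Scaled document vectors (rows of V_k Sigma_k): computed once, then stored.
doc_vecs = V_k * s_k

# Keyword query "reelin" as a pseudo-document in term space.
q = np.zeros(len(terms))
q[terms.index("reelin")] = 1.0

raw_scores = cosine_rows(q, A.T)             # plain vector space model
lsi_scores = cosine_rows(q @ U_k, doc_vecs)  # comparison in the k-factor space
```

Document 1 never mentions "reelin", so `raw_scores[1]` is zero. In the two-factor space it scores high because it shares `dab1` and `vldlr` with document 0, while the unrelated insulin document stays near zero. With so few factors the three gene documents collapse onto one direction, which also illustrates why the choice of `k` matters.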

Information Retrieval Evaluation Measures Precision: ratio of relevant returned documents to the total number of returned documents. Recall: ratio of relevant returned documents to the total number of relevant documents. Goal is to have high precision at all levels of recall. Systems are often evaluated by average precision (AP), which is the average of 11 interpolated precision values at the recall levels 0%, 10%, ..., 100%.
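
A minimal sketch of the 11-point interpolated average precision described on this slide. It uses the usual interpolation rule (the precision at a recall level is the maximum precision at any recall at or above that level), which is assumed rather than stated on the slide. The function names and the toy ranking are hypothetical.

```python
def precision_recall_points(ranked, relevant):
    """Return (recall, precision) after each position of a ranked list."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = 0
    points = []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points

def eleven_point_ap(ranked, relevant):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranked, relevant)
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)

# Two relevant items ("a", "b") returned at ranks 1 and 3.
ap = eleven_point_ap(["a", "x", "b", "y"], {"a", "b"})
```

In this toy ranking, interpolated precision is 1.0 for recall levels 0.0 through 0.5 and 2/3 for levels 0.6 through 1.0, so `ap` is (6 + 5 × 2/3) / 11, about 0.85.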

Biological Databases MEDLINE MEDLINE (NLM) –Contains 14+ million references to journal articles with a concentration in medicine –Spans over 4,600 journals worldwide –Covers 1966 to present –~500,000 citations added annually –Each citation is manually indexed with MeSH terms.

Biological Databases PubMed PubMed –Retrieves articles from MEDLINE and other journals. –Can be queried via any combination of attributes.

Biological Databases LocusLink NCBI human-curated database Single query interface to a comprehensive directory for genes and gene reference sequences for key genomes. Provides links to related records in PubMed and other citations when applicable. Provides RefSeq Summary of gene function and links to key MEDLINE citations relevant to each gene.

Biological Databases Overview MEDLINE has a lot of information –Not all articles relate to genes –Gene terminology problem. LocusLink does not cover all relevant citations, but a representative few.

Biological Databases Gene Document Construction Concatenate titles and abstracts of MEDLINE citations cross-referenced in Human, Rat, and Mouse LocusLink entries. Sequencing abstracts are included, which adds noise. LocusLink references are not comprehensive, so recall of all relevant abstracts is not guaranteed.
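
The construction step can be sketched as follows. The two dictionaries, their layout, the placeholder IDs, and the function name `build_gene_documents` are all hypothetical; the slides do not describe how the LocusLink cross-references or MEDLINE records were stored.

```python
def build_gene_documents(gene_to_citations, citations):
    """Concatenate the titles and abstracts of each gene's cross-referenced citations.

    gene_to_citations: gene symbol -> list of citation IDs (hypothetical layout).
    citations: citation ID -> (title, abstract) (hypothetical layout).
    """
    docs = {}
    for gene, ids in gene_to_citations.items():
        parts = []
        for cid in ids:
            if cid not in citations:
                continue  # cross-reference with no retrievable record
            title, abstract = citations[cid]
            parts.append(title)
            parts.append(abstract)
        # One "gene document" per gene; empty titles or abstracts are skipped.
        docs[gene] = " ".join(p for p in parts if p)
    return docs

# Placeholder records, not real citations.
citations = {
    "c1": ("Title one.", "Abstract one."),
    "c2": ("Title two.", ""),
}
gene_to_citations = {"GENE_A": ["c1", "c2"], "GENE_B": ["c2", "c3"]}
gene_docs = build_gene_documents(gene_to_citations, citations)
```

Each resulting gene document then becomes one column of the term-by-document matrix, so two genes can be compared even when they share no citation.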

SGO Primarily uses LSI to rank genes. Enables user to specify query method –Gene query –Keyword query –Number of factors –Show latent matches Saves previous query sessions.

SGO Interface

SGO Interface (cont’d)

SGO Trees Unfortunately, ranked lists mean little to biologists. Pairwise distances can be formed into a matrix $D$, with one entry per pair of documents $i$ and $j$ computed from the similarity between them.

SGO Trees (cont’d) The Fitch-Margoliash (1967) method in PHYLIP is applied to $D$ to generate hierarchical trees. Thresholds can be applied to the self-similarity matrix to produce graphs.
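
A minimal sketch of the two matrix operations mentioned on the tree slides. Converting similarity to distance as one minus cosine is an assumption, since the transcript does not preserve the definition of $D$, and the function names and toy matrix are hypothetical. The PHYLIP tree-building step itself is not shown.

```python
def similarity_to_distance(S):
    """Turn a cosine self-similarity matrix into distances (assumed: 1 - similarity)."""
    return [[1.0 - s for s in row] for row in S]

def threshold_graph(S, labels, threshold):
    """Return edges (label_i, label_j, similarity) for pairs at or above the threshold."""
    edges = []
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: S is symmetric
            if S[i][j] >= threshold:
                edges.append((labels[i], labels[j], S[i][j]))
    return edges

# Hypothetical self-similarity matrix for three gene documents.
labels = ["g1", "g2", "g3"]
S = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
D = similarity_to_distance(S)
edges = threshold_graph(S, labels, threshold=0.5)
```

With a threshold of 0.5 only the `g1`–`g2` pair survives as an edge. Raising or lowering the threshold trades graph density against confidence in each link.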

SGO Hierarchical Tree

SGO Graph or Nodal Tree

SGO Coding Issues Web interface – must be interactive –Queries are processed on click –Document collections are parsed offline –Trees are constructed offline Storage will eventually become an issue.

Results Test Data Set A 50-gene test data set was constructed. –Alzheimer’s Disease –Cancer –Development. The Reelin signaling pathway was used as the basis for evaluation –5 primary genes (directly associated) –7 secondary genes (indirectly associated)

Results Primary AP AP for 5 primary genes –61% for 5 factors –84% for 25 factors –84% for 50 factors

Results Secondary AP AP for 12 secondary genes –53% for 5 factors –59% for 25 factors –61% for 50 factors

Results Comparison LSI is comparable to tf-idf for the 5 primary genes. LSI is far superior to tf-idf for the 12 secondary genes –PubMed co-citation identifies 2 of the 7 indirectly related genes –Abstract overlap of LocusLink citations fails to identify any indirectly related genes. tf-idf fails on many keyword queries. Tested on Gene Ontology classifications (not shown) –Similar tendencies are observed

Results Abstract Representation To simulate scaling up, decrease representation of reelin-related genes. AP of 47% on 20,856 Human LocusLink abstracts.

Results Hierarchical Tree

Conclusions SGO allows genes to be compared to each other and to keywords (functions). SGO identifies latent relationships with promising accuracy. SGO is not meant to replace existing technologies, but to assist researchers –Verify current results –Direct future exploration

Future Work Scale up to entire genome. Document construction. Incorporate structural or other information for multi-modal similarity. Test other models, e.g. NMF, QR, etc. Interactive tree building. Keep collections current.