Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts. Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry (University of Tennessee).

Presentation transcript:

Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts. Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry (University of Tennessee). Presented by J. Jiang.

Outline  Brief Overview of Biomedical Literature Mining  The Gene Clustering Problem  Latent Semantic Indexing  Experiments  Conclusions and Discussions

Biomedical Literature Mining Brief Overview  Goal: to find useful information in the large volume of biomedical literature  Tasks include: identifying relevant literature for a given gene/protein; connecting genes with diseases; grouping genes/proteins by function; reconstructing and predicting gene networks (ISMB '05 Tutorial Proposal, H. Shatkay)

Biomedical Literature Mining Brief Overview (cont.)  Approaches: IE & NLP: entities, relations, facts, etc. Many methods rely on co-occurrences of genes/proteins. IR: text categorization and summarization, etc. Hybrid: combining multiple techniques  Challenges include: no fixed nomenclature or sentence structure; indirect links; etc.

The Gene Clustering Problem  To group genes based on their functions  Previous work: co-occurrence of gene symbols to extract gene relationships; implicit textual relationships; gene clustering using functional information in annotated indices or MEDLINE abstracts

Vector Space Model for Gene Clustering  Glenisson et al., 2003  Bag-of-words, vector space model  Cosine similarity  K-medoids algorithm This paper tries to improve the vector representation of documents using LSA.
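As a concrete illustration of this baseline, here is a minimal Python sketch of bag-of-words vectors compared by cosine similarity; the five-term vocabulary and the counts are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy bag-of-words vectors over a five-term vocabulary (counts are made up).
gene_a = np.array([3., 0., 1., 2., 0.])
gene_b = np.array([2., 1., 0., 3., 0.])
print(cosine(gene_a, gene_b))  # values near 1.0 indicate similar term usage
```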

Background: LSA  First studied by Deerwester et al., Indexing by Latent Semantic Analysis, JASIS, 1990  Motivation: inaccuracy of term matching due to polysemy and synonymy  Assumption: existence of latent semantic structure (“artificial concepts”)  Dimension reduction: keep the most important dimensions. Similar to PCA.

Singular Value Decomposition  d documents, t terms (in general, t >> d)  t × d term-document matrix X = [x_ij], where x_ij denotes the frequency of term i in document j  X can be decomposed as X = T0 S0 D0^T, where the columns of T0 are the eigenvectors of X X^T and the columns of D0 are the eigenvectors of X^T X. S0 is diagonal, and S0^2 is the matrix of eigenvalues of X X^T (or X^T X).
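A small numpy sketch of this factorization; the toy term-document counts are made up, and numpy's svd returns exactly the T0, S0, D0 factors described above.

```python
import numpy as np

# Toy 4-term x 3-document count matrix (values made up for illustration).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

# Economy SVD: X = T0 @ diag(s0) @ D0.T
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
D0 = D0t.T

# Columns of T0 are eigenvectors of X @ X.T; columns of D0 of X.T @ X;
# the squared singular values s0**2 are the shared nonzero eigenvalues.
assert np.allclose(T0 @ np.diag(s0) @ D0.T, X)
```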

SVD (cont.)  The diagonal elements of S 0 are constructed to be positive and ordered in decreasing magnitude.

SVD (cont.)  The eigenvector with the largest eigenvalue represents the dimension along which the variance of the data is maximized.  Keep the k largest elements in S 0, remove other elements, and remove corresponding columns (eigenvectors) in T 0 and D 0, X can be approximated by: X  X hat = TSD.

SVD (cont.)  X hat is the best least-square-fit to X with rank k.

Illustration [Figure: 2-D data points with the first and second eigenvectors overlaid; taken from “A Tutorial on PCA” by Lindsay Smith]

LSA with SVD  Terms are represented by rows of X_hat and documents are represented by columns of X_hat in the reduced space.  Doc-to-doc similarity: X_hat^T X_hat = D S^2 D^T = (D S)(D S)^T.  A query is represented as a pseudo-document: D_q = X_q^T T S^-1, where X_q is the query vector in the original term space; D_q behaves like a row of D.  Query-to-doc similarity: (D_q S)(D S)^T.
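Putting the pieces together, a self-contained sketch of folding a query into the reduced space and scoring it against the documents; the matrix and query are toy data, not the paper's.

```python
import numpy as np

# Toy 4-term x 3-document matrix; reduce to k = 2 dimensions.
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])
k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k].T

# Documents live in the rows of D @ S.
doc_coords = D @ S

# Fold a query (a term vector in the original space) in: D_q = X_q^T T S^-1.
X_q = np.array([1., 0., 2., 0.])
D_q = X_q @ T @ np.linalg.inv(S)

# Query-to-doc similarity: compare D_q @ S against the rows of D @ S by cosine.
q = D_q @ S
sims = doc_coords @ q / (np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q))
print(sims)  # one similarity score per document
```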

Experiments  50 genes in (1) development, (2) Alzheimer Disease, and (3) Cancer Biology are selected  Gene-document: concatenation of abstracts known to be related the gene  Gene-document represented as vectors:

Experiments (cont.)  Keyword query and accession number query  Reelin signaling pathway  GO classification terms and human disease  Direct genes and indirect genes  Hierarchical Clustering
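A minimal sketch of the hierarchical clustering step with scipy; the gene coordinates are invented, and average linkage with cosine distance is an assumption, not necessarily the paper's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Gene coordinates in the reduced LSI space (rows = genes; values made up).
gene_coords = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.1, 0.9],
                        [0.2, 0.8]])

# Build the dendrogram and cut it into two clusters.
Z = linkage(gene_coords, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster assignment per gene
```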

Results

Results (cont.)

 Tried 5, 25, and 50 dimensions; 50 performed best.  Tried reducing the number of abstracts for the Reelin genes; claimed that average precision (AP) was not significantly reduced when 50% of the abstracts were removed.  Claimed that the hierarchical clustering agrees with known biological relationships.
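For reference, average precision over a ranked retrieval list is commonly computed as below; this is a generic sketch, not necessarily the paper's exact evaluation protocol.

```python
def average_precision(ranked_relevance):
    """Mean of precision@k at each rank k where a relevant item appears.

    ranked_relevance: list of 0/1 relevance flags, ordered by retrieval rank.
    """
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# Toy ranking: relevant genes retrieved at ranks 1, 3, and 4.
print(average_precision([1, 0, 1, 1, 0]))  # ~0.806
```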

Discussions  Pros Gene clustering by textual information. Applied LSA to biomedical literature. Indirect linkage can be found through latent concepts.  Cons Requires human annotation to construct gene-documents. Not applicable to new domain. Genes in the experiments are carefully chosen in 3 categories. How does the method perform in general?  Other gene clustering methods?

References  S. Deerwester et al. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41-6,  M.A. Gerolami (2004). Latent Semantic Analysis A General Tutorial Introduction.  H. Shatkay (2005). ISMB 05’ Tutorial Proposal.  H. Shatkay & R. Feldman (2004). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10-6,

The End  Questions?  Thank you!