Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. Boyack et al. (2011). PLoS ONE.

Presentation transcript:

Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
Boyack et al. (2011). PLoS ONE 6(3): e18029

Motivation
Compare different similarity measures
Make use of a biomedical data set
Process a large corpus

Procedures
1. Define a corpus of documents
2. Extract and pre-process the relevant textual information from the corpus
3. Calculate pairwise document-document similarities using nine different similarity approaches
4. Create similarity matrices keeping only the top-n similarities per document
5. Cluster the documents based on this similarity matrix
6. Assess each cluster solution using coherence and concentration metrics

Data
To build a corpus with titles, abstracts, MeSH terms, and reference lists
Matched and combined data from the MEDLINE and Scopus (Elsevier) databases
The resulting set was then limited to those documents published from that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies
resulting in a corpus comprised of 2,153,769 unique scientific documents
Base matrix: word-document co-occurrence matrix

Methods

tf-idf
The tf–idf weight (term frequency–inverse document frequency)
A statistical measure used to evaluate how important a word is to a document in a collection or corpus
The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
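
The weighting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the common textbook variant (relative term frequency times the natural log of N/df); the slides do not say which tf-idf variant the study used, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical): three already-tokenized "abstracts".
docs = [
    ["gene", "expression", "gene"],
    ["protein", "expression"],
    ["gene", "protein", "binding"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the third toy document, "binding" receives a higher weight than "gene" because it occurs in only one document of the corpus, which is the offsetting effect the slide describes.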

tf-idf

LSA Latent semantic analysis

LSA

BM25
Okapi BM25
A ranking function that is widely used by search engines to rank matching documents according to their relevance to a query
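
As a rough sketch of how such a ranking function scores documents against a query, the following Python uses the widely cited Okapi BM25 form. The parameter values `k1=1.2` and `b=0.75` are common defaults and the `+ 1` inside the logarithm is a frequently used smoothing that keeps the idf non-negative; none of these choices are taken from the slides, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (assumed variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    ["gene", "expression", "gene"],
    ["protein", "expression"],
    ["gene", "protein", "binding"],
]
print(bm25_scores(["gene", "binding"], docs))
```

To use a query-oriented function like this as a document-document similarity, one document's terms can play the role of the query; how the study adapted BM25 is not described on the slide.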

BM25

SOM
Self-organizing map
A form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data
SOM may be considered a nonlinear generalization of principal component analysis (PCA).

SOM
1. Randomize the map's nodes' weight vectors
2. Grab an input vector
3. Traverse each node in the map
   1. Use the Euclidean distance formula to find the similarity between the input vector and the map's node's weight vector
   2. Track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector
   1. Wv(t + 1) = Wv(t) + Θ(t)α(t)(D(t) - Wv(t))
5. Increase t and repeat from 2 while t < λ
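
The training loop above can be sketched as follows. This is an illustrative pure-Python version, not the implementation used in the study: the linear decay of the learning rate α(t), the Gaussian neighbourhood function Θ(t), and the names `train_som`, `find_bmu`, `alpha0`, and `n_iter` (standing in for λ) are all assumptions.

```python
import math
import random

def find_bmu(weights, vector):
    """Step 3: return the grid position whose weight vector is closest (Euclidean)."""
    return min(weights, key=lambda node: math.dist(weights[node], vector))

def train_som(data, rows, cols, n_iter, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: randomize the weight vector of every node on the rows x cols grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):            # Step 5: repeat while t < lambda
        frac = t / n_iter
        alpha = alpha0 * (1.0 - frac)            # assumed linear decay of alpha(t)
        sigma = sigma0 * (1.0 - frac) + 1e-3     # assumed shrinking neighbourhood radius
        d = rng.choice(data)           # Step 2: grab an input vector D(t)
        bmu = find_bmu(weights, d)     # Step 3: best matching unit
        for node, w in weights.items():  # Step 4: pull neighbours toward the input
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            theta = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Theta(t), Gaussian
            for i in range(dim):
                # Wv(t + 1) = Wv(t) + Theta(t) * alpha(t) * (D(t) - Wv(t))
                w[i] += theta * alpha * (d[i] - w[i])
    return weights

data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=2, n_iter=200)
print(find_bmu(som, [0.05, 0.0]), find_bmu(som, [0.95, 1.0]))
```

After training, each document vector is assigned to the grid position of its BMU, which gives the low-dimensional layout the previous slide refers to.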

Topic modeling
Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000 and T = 2000 topics.
Dirichlet prior hyperparameter settings of β = 0.01 and α = 0.05N/(D·T) were used, where N is the total number of word tokens, D is the number of documents and T is the number of topics.
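
To make the second hyperparameter setting concrete, the short sketch below evaluates 0.05N/(D·T) at the three topic resolutions. D is the corpus size given on the Data slide; the token count N is not stated in the slides, so the value used here is purely hypothetical, as is the function name `dirichlet_alpha`.

```python
def dirichlet_alpha(n_tokens, n_docs, n_topics, scale=0.05):
    """Evaluate scale * N / (D * T), the prior setting quoted on the slide."""
    return scale * n_tokens / (n_docs * n_topics)

N_DOCS = 2_153_769          # corpus size from the Data slide
N_TOKENS = 200_000_000      # hypothetical token count, NOT from the slides

for n_topics in (500, 1000, 2000):
    print(n_topics, dirichlet_alpha(N_TOKENS, N_DOCS, n_topics))
```

Because N/D is the average document length, the setting amounts to 0.05 times the average number of tokens per document, divided by the number of topics, so the prior shrinks as the topic resolution grows.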

Topic modeling

PMRA
The PMRA ranking measure is used to calculate ‘Related Articles’ in the PubMed interface
The de facto standard
Proxy

Similarity filtering
Reduce matrix size
Generate a top-n similarity file from each of the larger similarity matrices
n = 15; each document thus contributes between 5 and 15 edges to the similarity file
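
The filtering step can be illustrated with a small Python sketch that keeps only the n strongest similarities per document. It assumes a fixed cap of n per document; the rule that produces between 5 and 15 edges for each document is not spelled out on the slide and is not reproduced here. The name `top_n_edges` and the nested-dict input format are hypothetical.

```python
import heapq

def top_n_edges(similarities, n=15):
    """Keep the n highest-scoring neighbours of every document.

    `similarities` maps doc -> {other_doc: score}; the result is a flat
    list of (doc, other_doc, score) edges, strongest first per document.
    """
    edges = []
    for doc, row in similarities.items():
        best = heapq.nlargest(n, row.items(), key=lambda item: item[1])
        edges.extend((doc, other, score) for other, score in best)
    return edges

sims = {
    "a": {"b": 0.9, "c": 0.1, "d": 0.5},
    "b": {"a": 0.9},
}
for edge in top_n_edges(sims, n=2):
    print(edge)
```

The resulting edge list is the kind of weighted-edge input a graph layout or clustering step can consume, at a small fraction of the size of the full similarity matrix.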

Clustering
DrL (now called OpenOrd)
A graph layout algorithm that calculates an (x,y) position for each document in a collection using an input set of weighted edges

Evaluation Textual coherence (Jensen-Shannon divergence)
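
For reference, the Jensen-Shannon divergence named on this slide can be computed as below. This sketch only shows the divergence between two word-probability distributions; how the study turns it into a per-cluster coherence score is not described on the slide. The base-2 logarithm (which bounds the value between 0 and 1) and the function names are assumptions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions over a shared three-word vocabulary.
doc_dist = [0.7, 0.2, 0.1]
cluster_dist = [0.5, 0.3, 0.2]
print(js_divergence(doc_dist, cluster_dist))
```

Identical distributions give a divergence of 0, and distributions with no words in common give 1, so lower values indicate documents that are textually closer to their cluster.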

Evaluation Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study

Results

Results (cont.)