Self Organization of a Massive Document Collection
Advisor: Dr. Hsu
Graduate: Sheng-Hsuan Wang
Author: Teuvo Kohonen et al.


Outline Motivation Objective Introduction Self-Organizing Map Statistical Models of Documents Rapid Construction of Large Document Maps The Document Map of All Electronic Patent Abstracts Conclusion Personal opinion

Motivation To improve the WEBSOM and to organize vast document collections according to textual similarities.

Objective The main goal has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data.

Introduction From simple searches to browsing of self-organized data collections. Scope of this work: the WEBSOM. Dimensionality reduction alternatives: latent semantic indexing (LSI), clustering of words into semantic categories, or a random projection method.

Self Organizing Map The original SOM algorithm.
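As an illustration, a minimal sketch of the original incremental SOM update rule, m_i(t+1) = m_i(t) + h_ci(t)[x(t) - m_i(t)]; the learning-rate and neighborhood schedules below are assumptions for the sketch, not taken from the paper:

```python
import numpy as np

def som_step(models, x, winner_idx, grid, t, n_steps):
    """One step of the original (incremental) SOM:
    m_i(t+1) = m_i(t) + h_ci(t) * (x - m_i(t)),
    with a Gaussian neighborhood h_ci shrinking over time
    (assumed linear decay schedules)."""
    alpha = 0.5 * (1.0 - t / n_steps)              # decaying learning rate
    sigma = 2.0 * (1.0 - t / n_steps) + 0.5        # decaying neighborhood radius
    d2 = ((grid - grid[winner_idx]) ** 2).sum(1)   # squared grid distance to winner
    h = alpha * np.exp(-d2 / (2 * sigma ** 2))     # neighborhood function h_ci(t)
    return models + h[:, None] * (x - models)
```

In a full run, the winner index c is found as the unit whose model vector is closest to x(t), and this step is repeated over the data while alpha and sigma shrink.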

Self Organizing Map Batch-map SOM: to accelerate the computation of the SOM.

Self Organizing Map Let Vi be the set of all x(t) that have mi as their closest model; Vi is called the Voronoi set of unit i. The number of samples x(t) falling into Vi is denoted ni.
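The Voronoi sets make the batch-map update easy to sketch: every sample is assigned to the Voronoi set of its nearest model, and each model is then replaced by a neighborhood-weighted mean over the sets. A minimal illustration (the Gaussian neighborhood with fixed sigma is an assumption of the sketch):

```python
import numpy as np

def batch_som_epoch(models, X, grid, sigma):
    """One batch-map epoch: assign every x(t) to the Voronoi set V_i of its
    nearest model m_i, then set each model to the neighborhood-weighted
    mean of the Voronoi-set sums."""
    # Winner (closest model) for every sample -> Voronoi sets V_i
    d = ((X[:, None, :] - models[None, :, :]) ** 2).sum(-1)
    winners = d.argmin(1)
    n_units = len(models)
    counts = np.bincount(winners, minlength=n_units)   # n_i = |V_i|
    sums = np.zeros_like(models)
    np.add.at(sums, winners, X)                        # sum of x(t) falling in V_i
    # Neighborhood smoothing over the map grid
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    h = np.exp(-g2 / (2 * sigma ** 2))                 # h_ji (Gaussian, assumed)
    num = h @ sums
    den = (h @ counts)[:, None]
    return np.where(den > 0, num / den, models)
```

Because only set sums and counts are needed, the whole data set can be swept once per epoch without per-sample model updates, which is what makes the batch map fast.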

Statistical Models of Documents Histograms are formed over word clusters using self-organizing semantic maps; this system was called the WEBSOM. An overview of the WEBSOM2 system follows.

Statistical Models of Documents A. The Primitive Vector-Space Model Inverse document frequency (IDF). Shannon entropy. B. Latent Semantic Indexing (LSI) Singular-value decomposition (SVD).
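A minimal sketch of IDF weighting in the primitive vector-space model (function names are illustrative; idf(w) = log(N / df(w)) is the standard form):

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency: idf(w) = log(N / df(w)),
    where df(w) = number of documents containing word w."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each word once per document
    return {w: math.log(N / df[w]) for w in df}

def doc_vector(doc, idf):
    """Weighted word histogram: term frequency times IDF."""
    tf = Counter(doc)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}
```

Words that occur in every document get weight zero, while rare words are emphasized; the entropy-based weighting used later in the paper serves the same discriminative purpose.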

Statistical Models of Documents C. Randomly Projected Histograms The original document vector is multiplied by a rectangular random matrix R to obtain the projections. D. Histograms on the Word Category Map The original version of the WEBSOM; the new method is random projection of the word histograms.
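A minimal sketch of the dense random projection x' = Rx (the Gaussian entries and the chosen dimensions are illustrative assumptions; pairwise similarities are approximately preserved):

```python
import numpy as np

def random_projection(X, target_dim, seed=0):
    """Project high-dimensional word histograms down to target_dim with a
    rectangular random matrix R: x' = R x. Scaling by 1/sqrt(target_dim)
    approximately preserves vector norms and inner products."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    R = rng.standard_normal((target_dim, n)) / np.sqrt(target_dim)
    return X @ R.T
```

Unlike LSI, no decomposition of the data is needed; R is generated independently of the documents, which is what makes the method cheap.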

Statistical Models of Documents E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments A subset of patents was drawn from the whole corpus of abstracts, an equal number from each of the 21 subsections, using a restricted set of words or word forms, with the full 1344-dimensional histograms as document vectors.

Statistical Models of Documents F. Construction of Random Projections of Word Histograms by Pointers Two simplified random matrices were tried: one obtained by thresholding (entries +1 or -1) and one sparse (entries 1 and 0).

Statistical Models of Documents With a hash table and pointers, the computing time was about 20% of that of the usual matrix-product method. The computational complexity of the random projection with pointers is far lower than that of LSI, which requires computing a singular-value decomposition.
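A sketch of the pointer idea: each word holds a few random pointers into the low-dimensional vector, so projecting a weighted histogram touches only the words that actually occur. The paper uses five pointers per word; here the hash table is a plain dict and the pointer count k is a parameter:

```python
import numpy as np

def make_pointers(vocab, dim, k=5, seed=0):
    """Hash table of random pointers: each word gets k random positions in
    the dim-dimensional projected vector (equivalent to a sparse 0/1
    random matrix stored column-wise)."""
    rng = np.random.default_rng(seed)
    return {w: rng.integers(0, dim, size=k) for w in vocab}

def project_histogram(weighted_counts, pointers, dim):
    """x' = R x computed via pointers: for each occurring word, add its
    weighted count at the k pointed components. Cost is proportional to
    (distinct words in the document) * k, not to the vocabulary size."""
    x = np.zeros(dim)
    for w, c in weighted_counts.items():
        np.add.at(x, pointers[w], c)   # np.add.at accumulates duplicate pointers
    return x
```

This avoids ever materializing the full random matrix, which is where the reported speedup over the matrix-product method comes from.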

Rapid Construction of Large Document Maps A. Fast Distance Computation Tabulate the indexes of the nonzero components of each input vector, so that Euclidean distances between the sparse input vectors and the models can be computed over the nonzero components only. We must use low-dimensional models.
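A sketch of the fast winner search for a sparse input: expanding ||x - m||^2 = ||x||^2 + ||m||^2 - 2 x.m shows that, since ||x||^2 is constant over units, only the nonzero components of x enter the dot products (precomputing the squared model norms is assumed):

```python
import numpy as np

def winner_sparse(nz_idx, nz_val, models, model_sqnorms):
    """Winner search for a sparse input given by its nonzero indices and
    values: argmin_i ||x - m_i||^2 = argmin_i (||m_i||^2 - 2 x.m_i).
    Only the tabulated nonzero components of x are touched."""
    dots = models[:, nz_idx] @ nz_val          # x.m_i over nonzero components
    return int(np.argmin(model_sqnorms - 2.0 * dots))
```

For document histograms with few distinct words relative to the model dimensionality, this cuts the per-sample cost of each winner search substantially.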

Rapid Construction of Large Document Maps B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones Increasing the number of nodes of the SOM during its construction. The new idea is to estimate good initial values for the model vectors of a very large map on the basis of asymptotic values of the model vectors of a much smaller map.

Rapid Construction of Large Document Maps dense sparse

Rapid Construction of Large Document Maps C. Rapid Fine-Tuning of the Large Maps 1) Addressing Old Winners: because the map changes only gradually, the new winner for each document is searched near the unit that won previously (the same idea as in LAB). 2) Initialization of the Pointers: the size of the maps is increased stepwise during learning, using formula (10). The winner is the map unit for which the inner product with the data vector is the largest.
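A sketch of the addressing-old-winners trick: search the new winner only among map units close, on the grid, to the previous winner (the radius and grid layout are illustrative assumptions):

```python
import numpy as np

def winner_near_old(x, models, grid, old_winner, radius=2.0):
    """Fine-tuning shortcut: since the map changes little between
    iterations, restrict the winner search to units whose grid position
    lies within `radius` of the unit that won for this document before."""
    d2_grid = ((grid - grid[old_winner]) ** 2).sum(1)
    candidates = np.flatnonzero(d2_grid <= radius ** 2)
    d2 = ((models[candidates] - x) ** 2).sum(1)
    return int(candidates[d2.argmin()])
```

In practice the old winner index is stored per document; an occasional full search can be run to recover if a document has drifted far from its old neighborhood.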

Rapid Construction of Large Document Maps

3) Parallelized Batch Map Algorithm: The winner search can be implemented as a parallel process. 4) Saving Memory by Reducing Representation Accuracy: Sufficient accuracy can be maintained during the computation even with reduced numerical precision.

Rapid Construction of Large Document Maps D. Performance Evaluation of the New Methods 1) Numerical Comparison with the Traditional SOM Algorithm: Two performance indexes to measure the quality of the maps: average quantization error and classification accuracy. Experiments: two sets of maps.

Rapid Construction of Large Document Maps 2) Comparison of the Computational Complexity: The total cost has three terms: one stems from the computation of the small map, one results from the VQ step (6) of the batch-map algorithm, and one refers to the estimation of the pointers. (N: data samples; M: map units; d: dimensionality.)

The Document Map of All Electronic Patent Abstracts A. Preprocessing We first extracted the titles and the texts for further processing and removed nontextual information. Mathematical symbols and numbers were converted into special dummy symbols. From the many different words the corpus contained, a set of common words was removed, and the remaining vocabulary was used. Finally, we omitted the abstracts in which fewer than five words remained.

The Document Map of All Electronic Patent Abstracts B. Formation of Statistical Models For the final dimensionality we selected 500, and five random pointers were used for each word. The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.

The Document Map of All Electronic Patent Abstracts The weight is a measure of the unevenness of the distribution of the word over the subsections. The weights were calculated as follows: let P_g(w) be the probability of a randomly chosen instance of the word w occurring in subsection g, and N_g the number of subsections. The Shannon entropy is H(w) = -sum_g P_g(w) ln P_g(w), and the weight is W(w) = H_max - H(w), where H_max = ln(N_g).
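A minimal sketch of this entropy weighting, assuming natural logarithms and the form W(w) = H_max - H(w) described above:

```python
import math

def entropy_weight(counts_per_subsection):
    """Shannon-entropy weight of a word from its occurrence counts in the
    N_g subsections: H(w) = -sum_g P_g ln P_g, W(w) = ln(N_g) - H(w).
    Uneven distributions (low entropy) receive high weight."""
    total = sum(counts_per_subsection)
    n_g = len(counts_per_subsection)
    h = -sum((c / total) * math.log(c / total)
             for c in counts_per_subsection if c > 0)
    return math.log(n_g) - h
```

A word spread evenly over all subsections gets weight zero, while a word confined to a single subsection gets the maximum weight ln(N_g), matching the stated goal of emphasizing discriminative words.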

The Document Map of All Electronic Patent Abstracts C. Formation of the Document Map 500-dimensional document vectors were used. The map was enlarged in stages: twice sixteenfold and once ninefold. Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch-map iteration cycles.

The Document Map of All Electronic Patent Abstracts D. Results When each map node was labeled according to the majority subsection among the documents in the node, the resulting classification accuracy was 64%.

Conclusion In this paper the emphasis has been on the scalability of the methods to very large text collections. Contributions: a map larger than our previous one; a new method of forming statistical models of documents; several new fast computing methods.

Personal Opinion Could the SOM be applied within a specific knowledge domain, e.g., IR or …?