Self Organization of a Massive Document Collection

Self Organization of a Massive Document Collection. Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojarvi, Jukka Honkela, Vesa Paatero, Antti Saarela

Their Goal: Organize large document collections according to textual similarities, and create a useful, search-engine-like tool for searching and exploring large document collections.

Their Solution: Self-organizing maps. Groups similar documents together; interactive and easily interpreted; facilitates data mining.

Self-organizing maps: An unsupervised-learning neural network that maps multidimensional data onto a two-dimensional grid; geometric relations of the image points indicate similarity.

Self-organizing map algorithm: Neurons are arranged in a two-dimensional grid, and each neuron has a vector of weights (example: R, G, B values).

Self-organizing map algorithm (cont.): Initialize the weights. For each input, a “winner” is chosen from the set of neurons; the “winner” is the neuron most similar to the input by Euclidean distance: sqrt((r1 – r2)² + (g1 – g2)² + (b1 – b2)² + …)

Self-organizing map algorithm (cont.): Learning takes place after each input:
n_i(t + 1) = n_i(t) + h_{c(x),i}(t) * [x(t) – n_i(t)]
where n_i(t) is the weight vector of neuron i at regression step t, x(t) is the input vector, c(x) is the index of the “winning” neuron, and h_{c(x),i} is the neighborhood function / smoothing kernel (e.g., Gaussian or “Mexican hat”).
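
To make the winner search and the update rule above concrete, here is a minimal NumPy sketch; the grid size, learning-rate schedule, and neighborhood-radius schedule are illustrative assumptions, not the paper's settings.

```python
# Minimal self-organizing map sketch: winner search + update rule
# n_i(t+1) = n_i(t) + h_{c(x),i}(t) * [x(t) - n_i(t)] with a Gaussian kernel.
import numpy as np

class SOM:
    def __init__(self, rows, cols, dim, rng=None):
        self.rows, self.cols, self.dim = rows, cols, dim
        self.rng = rng or np.random.default_rng(0)
        self.weights = self.rng.random((rows * cols, dim))   # random initialization
        # Each neuron's (row, col) position on the 2-D grid.
        self.coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

    def winner(self, x):
        # The "winner" is the neuron whose weight vector is closest to x (Euclidean distance).
        return int(np.argmin(np.linalg.norm(self.weights - x, axis=1)))

    def update(self, x, t, n_steps, sigma0=3.0, alpha0=0.5):
        c = self.winner(x)
        frac = t / n_steps
        sigma = sigma0 * (1.0 - frac) + 1e-3            # shrinking neighborhood radius
        alpha = alpha0 * (1.0 - frac)                   # decaying learning rate
        grid_dist2 = np.sum((self.coords - self.coords[c]) ** 2, axis=1)
        h = alpha * np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # Gaussian neighborhood function
        self.weights += h[:, None] * (x - self.weights)
        return c
```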

Self-organizing map example: 6 shades of red, green, and blue used as input; 500 iterations.
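
A usage sketch matching this example, reusing the SOM class above (reading the slide as six shades of each colour; the exact shades and grid size are assumptions):

```python
# Train on 6 shades each of red, green, and blue for 500 iterations.
import numpy as np

shades = np.linspace(0.3, 1.0, 6)
colors = np.array([(s, 0, 0) for s in shades] +
                  [(0, s, 0) for s in shades] +
                  [(0, 0, s) for s in shades])

som = SOM(rows=20, cols=20, dim=3)
rng = np.random.default_rng(1)
n_steps = 500
for t in range(n_steps):
    som.update(colors[rng.integers(len(colors))], t, n_steps)

# som.weights.reshape(20, 20, 3) can now be displayed as an image:
# similar colours end up in neighbouring grid cells.
```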

The Scope of This Work: Organizing massive document collections using a self-organizing map, and investigating how well self-organizing maps scale up.

Original Implementation: WEBSOM (1996). Classified ~5,000 documents using a self-organizing map over “histogram vectors”: weight vectors based on a collection of words whose vocabulary and dimensionality were manually controlled.

Problem: Massive document collections require histogram vectors of very large dimensionality; the aim here is to classify ~7,000,000 patent abstracts.

Goals: Reduce the dimensionality of the histogram vectors, research shortcut algorithms to improve computation time, and maintain classification accuracy.

Histogram Vector: Each component of the vector corresponds to the frequency of occurrence of a particular word; words are associated with weights that reflect their power of discrimination between topics.
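
A minimal sketch of building such vectors; the IDF-style weighting used here is an assumption standing in for the paper's discriminative word weights.

```python
# Build a (vocabulary x documents) matrix of weighted word frequencies.
import numpy as np
from collections import Counter

def histogram_vectors(docs, vocab):
    """docs: list of token lists; vocab: words kept after preprocessing."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(docs):
        for w, n in Counter(tokens).items():
            if w in index:
                X[index[w], j] = n          # frequency of word w in document j
    # Weight each word by how well it discriminates (IDF-style stand-in).
    df = np.count_nonzero(X, axis=1)
    idf = np.log(len(docs) / np.maximum(df, 1))
    return X * idf[:, None]
```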

Reducing Dimensionality: Find a suitable lower-dimensional representation of the words that still classifies the document collection accurately: randomly projected histograms.

Randomly Projected Histograms: Take the original d-dimensional data X and project it onto a k-dimensional (k << d) subspace through the origin, using a random k x d matrix R whose columns are normally distributed random vectors normalized to unit length:
R_{k x d} X_{d x N} => new matrix X'_{k x N}
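
A small sketch of the projection as described; the dimensions below are placeholders (the 500-from-43,222 reduction quoted later would correspond to k = 500, d = 43,222).

```python
import numpy as np

def random_projection_matrix(k, d, rng=None):
    # k x d matrix R; every column is a normally distributed vector of unit length.
    rng = rng or np.random.default_rng(0)
    R = rng.standard_normal((k, d))
    return R / np.linalg.norm(R, axis=0, keepdims=True)

# If X is the d x N matrix of histogram vectors (one document per column),
# the projected data is simply:  X_proj = R @ X   (shape k x N)
```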

Random Projection Formula

Why Does This Work? Johnson-Lindenstrauss lemma: “If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved.” If the original distances or similarities are themselves suspect, there is little reason to preserve them completely.

In Other Words: The similarity of a pair of projected vectors is, on average, the same as the similarity of the corresponding pair of original vectors, where similarity is measured by the dot product of the two vectors.
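
A quick, hedged numeric check of this claim, reusing random_projection_matrix from the sketch above (sizes are illustrative, not the paper's):

```python
# Dot products before and after projection agree up to errors of order 1/sqrt(k).
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 5000, 500, 40
X = rng.random((d, n))                     # n toy "histogram" vectors
R = random_projection_matrix(k, d, rng)
P = R @ X                                  # projected vectors, k x n

orig = X.T @ X                             # pairwise dot products, original space
proj = P.T @ P                             # pairwise dot products, projected space
rel_err = np.abs(proj - orig) / np.abs(orig)
print(np.median(rel_err))                  # typically a few per cent for k = 500
```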

Why Is This Important? We can improve computation time by reducing the histogram vector’s dimensionality

Loss in Accuracy

Optimizing the Random Matrix: Simplify the projection matrix R in order to speed up computations by making its entries zeros and ones (a small, fixed number of ones in each column). Store permanent address pointers from every location of the input vector to all locations of the projected vector for which the corresponding element of R is equal to one.
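
A sketch of the pointer trick, assuming R has been simplified to a 0/1 matrix with a small fixed number of ones per column (5 per column, as quoted for the largest map later in the talk); the histogram format here is a hypothetical {word index: weighted count} dictionary.

```python
import numpy as np

def make_pointers(d, k, ones_per_column=5, rng=None):
    rng = rng or np.random.default_rng(0)
    # pointers[j] = the rows of the projected vector that word j contributes to,
    # i.e. the positions where column j of R contains a one.
    return [rng.choice(k, size=ones_per_column, replace=False) for _ in range(d)]

def project_sparse(hist, pointers, k):
    """hist: {word_index: weighted count} for one document."""
    y = np.zeros(k)
    for j, w in hist.items():
        y[pointers[j]] += w    # follow stored pointers instead of a full matrix multiply
    return y
```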

So… Using randomly projected histograms, we can reduce the dimensionality of the histogram vectors; using pointer optimization, we can reduce the computing time of that operation.

Map Construction: The self-organizing map algorithm is capable of organizing even a randomly initialized map, but convergence can be sped up if the map is initialized closer to its final state.

Map Initialization: Estimate larger maps from the asymptotic (converged) weight values of a much smaller map, interpolating/extrapolating to determine rough initial values for the larger map.
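
A minimal sketch of such an initialization, using bilinear interpolation of a converged small map's weights onto a larger grid; this is an assumed stand-in for the paper's exact estimation formula.

```python
import numpy as np

def enlarge_map(small_weights, small_rows, small_cols, big_rows, big_cols):
    """Interpolate a (small_rows*small_cols, dim) weight matrix onto a bigger grid."""
    W = small_weights.reshape(small_rows, small_cols, -1)
    out = np.zeros((big_rows, big_cols, W.shape[2]))
    for r in range(big_rows):
        for c in range(big_cols):
            # Position of this big-map neuron expressed in small-map coordinates.
            fr = r * (small_rows - 1) / (big_rows - 1)
            fc = c * (small_cols - 1) / (big_cols - 1)
            r0, c0 = int(fr), int(fc)
            r1, c1 = min(r0 + 1, small_rows - 1), min(c0 + 1, small_cols - 1)
            ar, ac = fr - r0, fc - c0
            out[r, c] = ((1 - ar) * (1 - ac) * W[r0, c0] + (1 - ar) * ac * W[r0, c1] +
                         ar * (1 - ac) * W[r1, c0] + ar * ac * W[r1, c1])
    return out.reshape(big_rows * big_cols, -1)
```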

Optimizing Map Convergence: Once the self-organized map is smoothly ordered, though not yet asymptotically stable, we can restrict the search for new winners to neurons in the vicinity of the old one. This is significantly faster than performing an exhaustive winner search over the entire map. A full search for the winner can be performed intermittently to ensure the matches are still globally best.
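
A sketch of the restricted winner search, reusing the SOM class from earlier (the search radius is an assumption):

```python
import numpy as np

def local_winner(som, x, prev_winner, radius=3):
    # Only consider neurons within `radius` grid cells of the previous winner.
    pr, pc = som.coords[prev_winner]
    near = np.where((np.abs(som.coords[:, 0] - pr) <= radius) &
                    (np.abs(som.coords[:, 1] - pc) <= radius))[0]
    d = np.linalg.norm(som.weights[near] - x, axis=1)
    return int(near[np.argmin(d)])

# Every so often, fall back to som.winner(x) (a full search over the map) to
# confirm that the local match is still the global best.
```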

Final Process: (1) Preprocess the text; (2) construct a histogram vector for each input; (3) reduce dimensionality by random projection; (4) initialize a small self-organizing map; (5) train the small map; (6) estimate a larger map based on the smaller one; (7) repeat the last two steps until the desired map size is reached.

Performance Evaluation: Reduced dimensionality, pointer optimization, non-random initialization of the map, optimized map convergence, and multiprocessor parallelism.

Largest Map So Far: 6,840,568 patent abstracts written in English; a self-organizing map composed of 1,002,240 neurons; 500-dimensional histogram vectors (reduced from 43,222); 5 ones in each column of the random matrix.

What It Looks Like

Conclusions: Self-organizing maps can be optimized to map massive document collections without losing much in classification accuracy.

Questions?