
1 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to e-mail database visualization and exploration Presenter: Wu, Jia-Hao Authors: Andreas Nurnberger, Marcin Detyniecki ASC (2005)

2 Intelligent Database Systems Lab N.Y.U.S.T. I. M. 2 Outline Motivation Objective Methodology Experiments Conclusion Personal Comments

3 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Motivation The number of e-mails to handle keeps increasing.  LISTSERV alone sends 30 million messages per day to approximately 190,000 mailing lists.  The total number of mailing-list messages is estimated at 36.5 billion per year. The problem of classifying e-mails is particularly difficult.  E-mails contain irrelevant information in the form of signatures.  E-mails are very rich in made-up words, slang and abbreviations, for instance e-mail smileys or "lol".  Quoted parts of preceding e-mails may partly cover different topics.

4 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Objective Provide an intuitive visual profile of the considered mailing lists.  The user can easily scan for e-mails that are similar in content. Offer an intuitive navigation tool, where similar e-mails are located close to each other.  The tool imports messages from a mailing list and groups these e-mails based on a similarity measure.

5 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Methodology 1 - SOM A neural network that clusters high-dimensional data vectors according to a similarity measure.  The clusters are arranged in a low-dimensional topology (grid structure), typically a two-dimensional arrangement of squares or hexagons.  Objects assigned to the same cluster are similar to each other, as in every cluster analysis.  Objects of nearby clusters are expected to be more similar than objects in more distant clusters. (The usual training rule is recalled below.)
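The slide does not state the training rule explicitly; as a reminder, and under the assumption that the paper follows the usual Kohonen formulation, the standard SOM weight update is:

```latex
% Standard Kohonen update (assumed, not quoted from the slide):
% \alpha(t) is a decaying learning rate, h_{c,i}(t) the neighbourhood
% function centred on the best-matching unit c of input x(t).
w_i(t+1) = w_i(t) + \alpha(t)\, h_{c,i}(t)\, \bigl( x(t) - w_i(t) \bigr)
```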

6 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Methodology 1 - SOM advantages & disadvantages Advantages:  Intuitive visualization.  Good exploration possibilities. Disadvantage:  The size and shape of the map have to be defined in advance.  Possible solutions:  Train the map several times and compute the classification error.  Add empty cells to which no document is assigned, so that the growing process can be stopped.  Use the growing SOM method.

7 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Methodology 2 - Externally Growing SOM The main alterations:  Use a hexagonal map structure.  Restrict the algorithm to adding new units around the external units of the map.  When the accumulated error of a unit exceeds a specified threshold value, the algorithm resolves this by adding a new unit close to the external unit that achieved the highest error.

8 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Methodology 2 - A learning method for GSOM 1: Predefine the initial grid size (usually 2 x 2 units are used). 2: Initialize the assigned vectors with randomly selected values and reset the error values e_i for every unit i. 3: Train the map using all input patterns for a fixed number of iterations. 4: Identify the unit with the largest accumulated error. 5: If this error does not exceed a threshold value, stop growing and continue with step 9. 6: Identify the external unit k with the largest accumulated error. 7: Add a new unit next to unit k. 8: Continue with step 3. 9: Continue training of the map for a fixed number of iterations, reducing the learning rate during training. (A minimal code sketch of this loop follows.)
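A minimal sketch of the growing loop above. It simplifies the paper's hexagonal grid to a square one, uses a Gaussian neighbourhood, and omits the final fine-tuning phase (step 9); all function names and default parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_gsom(data, init_shape=(2, 2), growth_threshold=5.0,
               iters_per_phase=200, max_units=64, seed=0):
    """Sketch of the externally growing SOM loop (steps 1-8 of the slide)."""
    rng = np.random.default_rng(seed)

    # Steps 1-2: predefined initial grid, randomly initialised weight vectors.
    positions = [(r, c) for r in range(init_shape[0]) for c in range(init_shape[1])]
    weights = [rng.uniform(data.min(0), data.max(0)) for _ in positions]

    def is_external(i):
        # A unit is "external" if at least one grid neighbour position is still free.
        r, c = positions[i]
        return any(p not in positions for p in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])

    while True:
        errors = np.zeros(len(positions))                      # step 2: reset error values e_i

        # Step 3: train the map on the input patterns for a fixed number of iterations.
        for t in range(iters_per_phase):
            lr = 0.5 * (1.0 - t / iters_per_phase) + 0.01      # decaying learning rate
            sigma = 1.5 * (1.0 - t / iters_per_phase) + 0.3    # shrinking neighbourhood width
            x = data[rng.integers(len(data))]
            dists = [np.linalg.norm(x - w) for w in weights]
            bmu = int(np.argmin(dists))                        # best-matching unit
            errors[bmu] += dists[bmu]                          # accumulate quantisation error
            br, bc = positions[bmu]
            for i, (r, c) in enumerate(positions):
                h = np.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                weights[i] = weights[i] + lr * h * (x - weights[i])

        # Steps 4-5: stop growing when the largest accumulated error is small enough.
        if errors.max() <= growth_threshold or len(positions) >= max_units:
            break

        # Steps 6-7: add a new unit next to the external unit k with the largest error.
        k = max((i for i in range(len(positions)) if is_external(i)), key=lambda i: errors[i])
        r, c = positions[k]
        for p in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if p not in positions:
                positions.append(p)
                weights.append(weights[k].copy())              # initialise the new unit near unit k
                break
        # Step 8: continue with step 3 (next pass of the while loop).

    # Step 9 (final training with a reduced learning rate) is omitted in this sketch.
    return positions, weights

# Toy run, loosely mirroring slide 9: six random Gaussian clusters in 3-d space.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    centres = rng.uniform(-5, 5, size=(6, 3))
    data = np.vstack([c + rng.normal(0, 0.3, size=(150, 3)) for c in centres])
    positions, weights = train_gsom(data)
    print(f"final map size: {len(positions)} units")
```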

9 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments Example:  A data set consisting of 1000 feature vectors, defining 6 randomly generated clusters in the 3-dimensional data space, was used (A).  An initial map was trained, as depicted in (D).

10 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.)  Add 150 randomly generated data points of class 4 (B).  The resulting map is depicted in (E).

11 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.)  Create a new class of 150 data points at the center, in between the 6 classes (C).  The resulting map is depicted in (F).

12 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) Document representation:  Each document i is described by a numerical feature vector D_i = (x_1, …, x_t).  The query vector can be compared to each document, and a result list is obtained by ordering the documents according to the computed similarity.  The simplest approach is to use binary term vectors:  1 → the corresponding word is used in the document.  0 → the word is not used in the document.  Term weighting schemes improve the performance:  large weights are assigned to words that are used frequently. (A small sketch follows.)
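A minimal sketch of this retrieval scheme: binary term vectors, a weighted alternative, and ranking by similarity to a query. The slide only says "term weighting schemes", so tf-idf and cosine similarity are illustrative choices here, and all names and the toy data are made up.

```python
import math
from collections import Counter

def binary_vector(tokens, vocabulary):
    """1 if the word is used in the document, 0 otherwise (the scheme on slide 12)."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tfidf_vectors(docs, vocabulary):
    """Illustrative tf-idf weighting; the exact scheme used in the paper may differ."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                     # term frequency
        vectors.append([tf[t] * math.log(n / df[t]) if t in df else 0.0
                        for t in vocabulary])
    return vectors

def cosine(a, b):
    """Similarity used to order documents with respect to a query vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy example (documents and query are made up for illustration).
docs = [["fuzzy", "neuro", "mail"], ["genetic", "mail"], ["fuzzy", "genetic"]]
vocabulary = sorted({t for d in docs for t in d})
doc_vectors = tfidf_vectors(docs, vocabulary)
query = binary_vector(["fuzzy"], vocabulary)
# Result list: documents ordered by their similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, doc_vectors[i]), reverse=True)
print(ranking)   # [2, 0, 1]
```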

13 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) The slide gives three formulas (shown only as images):  the importance of a word in a specific document of the considered collection,  the similarity S of two documents,  the entropy computed for each word k in the vocabulary. (Assumed standard forms are given below.)
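Since the formulas themselves are not in the transcript, the following are the standard formulations these captions most plausibly refer to (tf-idf weight, cosine similarity, and word entropy over the collection), given here as an assumption rather than a quotation from the paper:

```latex
% Assumed standard formulations (the slide's actual formulas are images).
% Importance (weight) of word k in document i, with tf_{ik} the term frequency,
% N the number of documents and n_k the number of documents containing word k:
w_{ik} = \mathit{tf}_{ik} \cdot \log\frac{N}{n_k}

% Similarity S of two documents (cosine of their weight vectors):
S(D_i, D_j) = \frac{\sum_k w_{ik}\, w_{jk}}{\sqrt{\sum_k w_{ik}^2}\; \sqrt{\sum_k w_{jk}^2}}

% Entropy of word k over the n documents of the collection,
% with p_{jk} = \mathit{tf}_{jk} / \sum_l \mathit{tf}_{lk}:
W_k = 1 + \frac{1}{\log n} \sum_{j=1}^{n} p_{jk} \log p_{jk}
```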

14 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) A document is described based on a ‘statistical fingerprint’

15 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) The authors created two simple artificial data sets, each consisting of 500 feature vectors.  The documents deal with the topics fuzzy, neuro and genetic, plus five arbitrary keywords.  Data set (A): each document contains just one of the topic keywords; the five remaining keywords are arbitrarily chosen.  Data set (B): each document contains exactly two of the topic keywords. (A generation sketch follows.)
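A minimal sketch of how such artificial data sets could be generated from this description. The keyword counts and the 500-vector size come from the slide; the names of the five arbitrary keywords and the way they are switched on are assumptions.

```python
import numpy as np

TOPICS = ["fuzzy", "neuro", "genetic"]
NOISE = [f"kw{i}" for i in range(1, 6)]        # the five arbitrary keywords (names assumed)
VOCAB = TOPICS + NOISE

def make_data_set(n_docs=500, n_topic_keywords=1, seed=0):
    """Generate binary keyword vectors as described on the slide:
    n_topic_keywords = 1 gives data set (A), 2 gives data set (B)."""
    rng = np.random.default_rng(seed)
    data = np.zeros((n_docs, len(VOCAB)), dtype=int)
    for row in data:
        for t in rng.choice(len(TOPICS), size=n_topic_keywords, replace=False):
            row[t] = 1                                              # topic keyword(s)
        row[len(TOPICS):] = rng.integers(0, 2, size=len(NOISE))     # arbitrary keywords
    return data

data_a = make_data_set(500, n_topic_keywords=1)   # data set (A)
data_b = make_data_set(500, n_topic_keywords=2)   # data set (B)
print(data_a.shape, data_b.shape)                 # (500, 8) (500, 8)
```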

16 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) A SOM was learned using data set (A). In a second run, data set (B) was added to the training data.

17 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) The capabilities of the tool:  Keyword search and visualization using the maps  the distribution of keyword search results can be visualized by coloring the map cells (see the sketch below).  Navigating in the document map  the user can inspect a node's neighbouring cells.  Content-based searching  the document map can be reused for content-based queries.  Global and user-profile visualization  the user can select the cells he or she wants to search.  Visualizing changes  changes to the document map are recorded over time.
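A minimal sketch of the "colour cells by keyword hits" idea from the first capability: each map cell gets a shading value proportional to how many of its documents contain the search keyword. The data structures here are assumptions for illustration, not the data model of the paper's tool.

```python
from collections import Counter

def keyword_hit_shading(cell_of_doc, doc_keywords, keyword):
    """Return a shading value in [0, 1] per map cell, proportional to the
    number of documents in that cell containing the keyword."""
    hits = Counter()
    for doc_id, cell in cell_of_doc.items():
        if keyword in doc_keywords[doc_id]:
            hits[cell] += 1
    peak = max(hits.values(), default=0)
    return {cell: (hits[cell] / peak if peak else 0.0)
            for cell in set(cell_of_doc.values())}

# Toy example: three documents assigned to two cells of the map.
doc_keywords = {0: {"fuzzy", "som"}, 1: {"genetic"}, 2: {"fuzzy", "mail"}}
cell_of_doc = {0: (0, 0), 1: (0, 1), 2: (0, 0)}
print(keyword_hit_shading(cell_of_doc, doc_keywords, "fuzzy"))   # cell (0, 0) -> 1.0, cell (0, 1) -> 0.0
```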

18 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) Content based searching

19 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments (Cont.) The application interface

20 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusion Advantages:  The approach can be applied to text documents, especially in combination with iterative keyword search.  GSOMs are able to adapt their size and structure to the data.  Adding new cells only around the border of the growing map does not affect the ability of the map to learn any type of data. Problem:  Very short e-mails are often not correctly classified. Future work:  The use of non-text documents (e.g. images) and the integration of user feedback.

21 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Personal Comments Advantage  A faster exploration method. Drawback  … Application  Information retrieval.

