Semantic Wordfication of Document Collections Presenter: Yingyu Wu.

Presentation transcript:

Semantic Wordfication of Document Collections Presenter: Yingyu Wu

Outline: Introduction; ProjCloud Technique; Results and Comparisons; Discussion and Limitations; Conclusion

Introduction Word Cloud

Two issues with word clouds: (1) Existing methods do not yet provide an intuitive visual representation that links the words in the layout to the documents they are meant to represent. (2) The construction of word clouds inside general polygons while preserving the semantic relationships between words.

Contributions: A novel word cloud-based visualization technique named ProjCloud. (1) It combines multidimensional projection and word clouds, which makes it possible to visualize the similarity among documents together with their corresponding word clouds and extends the exploratory capabilities of word clouds. (2) A new approach for building word clouds inside polygons while still preserving the semantic relationships among keywords. (3) A mechanism based on spectral sorting that arranges words according to their semantic relationships and highlights the most important words in the cloud.

ProjCloud Technique Overview of the sequence of steps

Steps: (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP). (2) Points in the visual space are clustered, with each cluster represented by a polygon. Two versions exist: automatic and user-interactive. (3) Keywords (the most frequent words) are extracted, and their relevance is computed to guide the semantics-preserving placement of words. (4) In the scaling step, keywords are sized based on their relevance and on the area of the containing polygon. (5) The optimization algorithm runs to generate the word cloud.
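Step (3) can be illustrated with a minimal sketch. It assumes whitespace tokenization and an optional stopword set, neither of which is specified in the slides; the function name `top_keywords` and its parameters are hypothetical.

```python
from collections import Counter

def top_keywords(cluster_docs, n_keywords=10, stopwords=frozenset()):
    """Return the most frequent words among the documents of one cluster.

    Tokenization by lowercasing and splitting on whitespace is an assumption
    made for this sketch, not the presenters' preprocessing.
    """
    counts = Counter(
        token
        for doc in cluster_docs
        for token in doc.lower().split()
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(n_keywords)]

# Hypothetical cluster of three short documents.
docs = ["word cloud layout", "word cloud", "word"]
keywords = top_keywords(docs, n_keywords=2)
```

A real pipeline would likely also apply stemming and stopword removal before counting.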

Keyword Relevance and Semantic Relation Let M be the document × term frequency matrix, and let C be the covariance matrix obtained from M. Build a graph G in which each node corresponds to a keyword and an edge e_ij connects two keywords (W_i and W_j) if and only if the covariance C_ij is among the k largest ones. Assuming that edge e_ij has weight C_ij, the Fiedler vector assigns a scalar value a_i to each keyword so as to minimize the covariance-weighted sum of squared differences C_ij (a_i − a_j)^2 over the edges of G. If C_ij is large, W_i and W_j therefore receive similar values, so closely related keywords end up with nearby values.
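The graph construction and Fiedler vector described above can be sketched with NumPy. This is an illustration under stated assumptions, not the presenters' implementation: the function name `fiedler_values` is hypothetical, the kept covariances are assumed positive, and the resulting graph is assumed connected (otherwise the second eigenvector is not meaningful).

```python
import numpy as np

def fiedler_values(C, k):
    """Return one scalar per keyword from the Fiedler vector of the keyword graph.

    C is a symmetric keyword-by-keyword covariance matrix; k is the number of
    largest covariances kept as edges. Assumes the kept covariances are
    positive and the resulting graph is connected.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    rows, cols = np.triu_indices(n, k=1)        # each unordered keyword pair once
    keep = np.argsort(C[rows, cols])[-k:]       # positions of the k largest covariances
    W = np.zeros((n, n))
    W[rows[keep], cols[keep]] = C[rows[keep], cols[keep]]
    W = W + W.T                                 # symmetric weighted adjacency matrix
    L = np.diag(W.sum(axis=1)) - W              # weighted graph Laplacian
    _, eigenvectors = np.linalg.eigh(L)         # eigenvalues come back in ascending order
    return eigenvectors[:, 1]                   # vector of the second-smallest eigenvalue

# Toy covariance matrix: keywords 0-1 and 2-3 are strongly related.
C = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])
a = fiedler_values(C, k=3)
order = np.argsort(a)   # keyword indices sorted by Fiedler value
```

Starting from the frequency matrix M, the covariance matrix could be obtained with `np.cov(M, rowvar=False)`, which treats each column (term) as a variable. The sign of an eigenvector is arbitrary, so the ordering may come out reversed; strongly related keywords stay adjacent either way.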

The most relevant keyword: Let Cijmax be the largest covariance in C, and let W_i and W_j be the corresponding words. The most relevant keyword is W_i if the average covariance between W_i and W_k (k = 1, 2, 3, ..., n) is larger than the average covariance of W_j; otherwise it is W_j. Once the most relevant keyword (W_r) is known, the keywords are sorted in increasing order according to [expression missing from the transcript]. In ProjCloud, the order given by the Fiedler vector dictates the position of words in the cloud.
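The selection rule for the most relevant keyword can be sketched as follows. The function name `most_relevant_keyword` is hypothetical, the largest covariance is assumed to be taken over pairs of distinct keywords, and the tie-breaking behaviour (returning W_j when the averages are equal) is an assumption, since the slides do not cover that case. The subsequent sorting step is left out because its expression is not preserved in the transcript.

```python
import numpy as np

def most_relevant_keyword(C):
    """Return the index r of the most relevant keyword W_r.

    Finds the pair (i, j) with the largest off-diagonal covariance, then
    returns whichever of the two has the larger average covariance with all
    keywords (k = 1..n). Ties go to j, which is an assumption of this sketch.
    """
    C = np.asarray(C, dtype=float)
    off_diagonal = C.copy()
    np.fill_diagonal(off_diagonal, -np.inf)      # ignore each keyword's own variance
    i, j = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)
    average = C.mean(axis=1)                     # average covariance per keyword
    return int(i) if average[i] > average[j] else int(j)

# Toy covariance matrix: the largest covariance (0.9) first occurs for pair (0, 1).
C = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])
r = most_relevant_keyword(C)
```

In this toy matrix keyword 1 has the higher average covariance of the pair (0.55 versus 0.5), so it is selected.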

Sizing keywords (1) A bounding box is computed for each keyword. (2) The size of each keyword is set to a scaled value that fits in the interval [fmin, fmax] = [12, 50]. (3) If the sum of the areas of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the keyword areas exceeds the area of P.
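The sizing loop can be sketched as below. Several details are assumptions made only for illustration: relevance is mapped linearly onto [fmin, fmax], a bounding box is approximated as 0.6 × font size per character wide and one font size tall, and fmax grows by a fixed `step`. The slides give only the interval [12, 50] and the stopping rule.

```python
def size_keywords(words, relevance, polygon_area, fmin=12.0, fmax=50.0, step=1.0):
    """Scale font sizes into [fmin, fmax], raising fmax until the boxes fill the polygon.

    Returns the final font sizes and the final fmax. The bounding-box model
    (0.6 * size per character wide, size tall) is a rough assumption.
    """
    if not words or any(len(w) == 0 for w in words):
        raise ValueError("words must be a non-empty list of non-empty strings")
    lo, hi = min(relevance), max(relevance)

    def font_sizes(current_fmax):
        if hi == lo:                              # all keywords equally relevant
            return [current_fmax for _ in relevance]
        return [fmin + (r - lo) / (hi - lo) * (current_fmax - fmin) for r in relevance]

    def total_area(sizes):
        # Approximate bounding-box area: width (0.6 * s * chars) times height (s).
        return sum(0.6 * s * len(w) * s for w, s in zip(words, sizes))

    sizes = font_sizes(fmax)
    while total_area(sizes) < polygon_area:       # boxes still smaller than polygon P
        fmax += step
        sizes = font_sizes(fmax)
    return sizes, fmax

# Two keywords inside a polygon of area 10000 (arbitrary units).
sizes, final_fmax = size_keywords(["cloud", "word"], [1.0, 0.0], polygon_area=10000)
```

With real text rendering, the bounding boxes would come from font metrics rather than the character-count approximation used here.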

The optimization Problem

Results

Comparisons

Discussion and Limitations ProjCloud depends heavily on the clustering process. If the clustering performs poorly, the word cloud becomes very hard to fit and read. Empty space remains between clusters.

Conclusion

Thank you