Clustering C.Watters CS6403.

Presentation transcript:

Clustering C.Watters CS6403

Clustering
- What
- Why
- How
- Results

Clustering
- Assign items to groups based on some calculation of the degree of likeness between items
- Groups are not known beforehand
- Uses multivariate analysis techniques
- Feature set determination is critical

Example
- News data
- Sports, world news, entertainment, etc.
- Short items, items with photos, items with names

Why
- Improve efficiency of retrieval
- Improve effectiveness of retrieval
- Ranking of retrieved results
- Visualization of results
  - Kohonen maps and SOM (self-organizing maps)
- Discovery of content
- Discovery of relationships

How
- Put items into groups so that members have a high degree of association within the group AND a low degree of association with items in other groups
- Association for IR documents?
- Feature set?
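The grouping criterion above can be checked numerically. The following is a minimal sketch, assuming documents are represented as sets of terms and that Jaccard similarity stands in for "association"; the function names, sample documents, and labels are illustrative and do not come from the slides.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two term sets (one possible association measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_within_and_between(docs, labels):
    """Average similarity of same-group pairs and of different-group pairs."""
    within, between = [], []
    for i, j in combinations(range(len(docs)), 2):
        score = jaccard(docs[i], docs[j])
        if labels[i] == labels[j]:
            within.append(score)
        else:
            between.append(score)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within), mean(between)

# Hypothetical news items and a hypothetical grouping (sports vs. politics).
docs = [
    {"hockey", "goal", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
    {"election", "vote", "poll"},
]
labels = [0, 0, 1, 1]
within, between = mean_within_and_between(docs, labels)
```

A grouping that satisfies the criterion should give a `within` value clearly larger than `between`; here the two groups share no terms, so `between` is zero.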

Feature Sets for IR Clustering
- Term occurrences
- Citations
- Names
- Structure (tags)
- Co-occurrences (thesaurus construction)
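Two of these feature sets are easy to illustrate. The sketch below assumes whitespace tokenization and lowercasing, which is a simplification (no stemming or stop-word removal); the helper names and sample texts are hypothetical.

```python
from collections import Counter
from itertools import combinations

def term_features(text):
    """Binary term-occurrence features: the set of lowercased tokens."""
    return set(text.lower().split())

def cooccurrence_counts(docs):
    """Count, for each pair of terms, the number of documents containing both."""
    pairs = Counter()
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs

texts = ["Hockey team wins", "Team trade rumours", "Hockey team trade"]
docs = [term_features(t) for t in texts]
pairs = cooccurrence_counts(docs)
```

Term pairs with high counts, such as `("hockey", "team")` here, are the kind of evidence a co-occurrence-based thesaurus could be built from.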

Problems
- Choosing the best feature set
- Choosing the similarity measure
- Evaluation of results
- Updates
- Searching clusters

Measures of Similarity
- Need to quantify the degree of association of an item with others
- Generally want a measure that is normalized by document vector length
- Not clear that weighted document terms are better than binary ones in clustering

General Measures
- Dice coefficient
- Jaccard coefficient
- Cosine coefficient

Dice Coefficient
- Binary weights
- C = terms in common, A = terms in i, B = terms in j
- Dice = 2C / (A + B)

Jaccard Coefficient
- Binary weights
- C = terms in common, A = terms in i, B = terms in j
- Jaccard = C / (A + B - C)

Cosine Coefficient
- Binary weights
- C = terms in common, A = terms in i, B = terms in j
- Cosine = C / sqrt(A x B)
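The three coefficients can be computed directly from the counts C, A, and B. This is a minimal sketch using the standard binary-weight forms, assuming each document is a Python set of terms; the function names and the two sample documents are illustrative.

```python
import math

def overlap_counts(doc_i, doc_j):
    """Return (C, A, B): terms in common, terms in i, terms in j."""
    return len(doc_i & doc_j), len(doc_i), len(doc_j)

def dice(doc_i, doc_j):
    # Dice = 2C / (A + B)
    c, a, b = overlap_counts(doc_i, doc_j)
    return 2 * c / (a + b) if (a + b) else 0.0

def jaccard(doc_i, doc_j):
    # Jaccard = C / (A + B - C), i.e. intersection over union
    c, a, b = overlap_counts(doc_i, doc_j)
    union = a + b - c
    return c / union if union else 0.0

def cosine(doc_i, doc_j):
    # Cosine = C / sqrt(A * B) for binary term weights
    c, a, b = overlap_counts(doc_i, doc_j)
    return c / math.sqrt(a * b) if a and b else 0.0

d1 = {"hockey", "goal", "team", "win"}   # A = 4
d2 = {"hockey", "team", "trade"}         # B = 3, C = 2
scores = (dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```

All three return 0 for documents with no terms in common and 1 for identical non-empty documents; they differ in how strongly they penalize differences in document length.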

Now what?
- Need to be able to compare any doc to any other doc
- Need?
Doc-Doc Similarity Matrix:
  11 12 13 14 15
  21 22 23 24 25
  31 32 33 34 35
  41 42 43 44 45
  51 52 53 54 55
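One reading of the "Need?" prompt is that, when the similarity measure is symmetric, only the entries above the diagonal have to be computed. The sketch below works under that assumption, with Jaccard as the measure and a self-similarity of 1.0 on the diagonal; the names and sample documents are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_matrix(docs, sim=jaccard):
    """Build the full N x N matrix from N(N-1)/2 similarity calculations."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # assumption: a document is fully similar to itself
    for i, j in combinations(range(n), 2):
        # Compute each unordered pair once and mirror it across the diagonal.
        matrix[i][j] = matrix[j][i] = sim(docs[i], docs[j])
    return matrix

docs = [
    {"hockey", "goal", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
    {"election", "vote", "poll"},
    {"film", "award"},
]
matrix = similarity_matrix(docs)
```

For the 5 documents here, 10 similarity calculations fill all 25 cells.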

Generating Similarity Matrix
- Use inverted file
- Documents with no terms in common do not need similarity calculation
- Generally generate only one row at a time as needed
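The inverted-file approach above can be sketched as follows. The example assumes documents are stored as a dict from document id to term set and uses the Dice coefficient; the function names and data are illustrative. Only documents that share at least one term with the target are ever touched.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].add(doc_id)
    return inverted

def similarity_row(doc_id, docs, inverted):
    """One row of the matrix: Dice similarity to every doc sharing a term."""
    common = defaultdict(int)  # other doc id -> number of terms in common (C)
    for term in docs[doc_id]:
        for other in inverted[term]:
            if other != doc_id:
                common[other] += 1
    a = len(docs[doc_id])
    return {other: 2 * c / (a + len(docs[other])) for other, c in common.items()}

docs = {
    1: {"hockey", "team"},
    2: {"team", "trade"},
    3: {"election", "vote"},
}
inverted = build_inverted_file(docs)
row = similarity_row(1, docs, inverted)
```

Document 3 shares no terms with document 1, so it never appears in `row` and no calculation is done for it.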

Algorithms
- Problem: sort N things into M groups, where M is in [1, N]
- Choice of algorithm determines M and group membership

General Classes of Algorithms
- Hierarchical
  - Nested groups
  - Pairwise connections made
- Non-hierarchical
  - No overlap
  - Centroid
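The two classes can be contrasted with one small example of each. This is a sketch under stated assumptions, not the specific algorithms the course used: the hierarchical case is single-link agglomerative clustering, and the non-hierarchical case is a single-pass method in which the first document of each cluster stands in for its centroid. Jaccard similarity, the threshold value, and all names are illustrative choices.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def single_link_merges(docs):
    """Hierarchical: repeatedly merge the two most similar clusters.

    Cluster similarity is the best pairwise document similarity (single link).
    The returned merge list defines the nested groups.
    """
    clusters = [frozenset([i]) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = max(jaccard(docs[i], docs[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), s))
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges

def single_pass(docs, threshold):
    """Non-hierarchical: each doc joins its most similar cluster or starts one.

    Assumption for brevity: the seed document represents the cluster centroid.
    """
    seeds, clusters = [], []
    for i, doc in enumerate(docs):
        scores = [jaccard(doc, docs[s]) for s in seeds]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(i)
        else:
            seeds.append(i)
            clusters.append([i])
    return clusters

docs = [
    {"hockey", "goal", "team"},
    {"hockey", "team", "trade"},
    {"election", "vote"},
    {"election", "vote", "poll"},
]
merges = single_link_merges(docs)
flat = single_pass(docs, threshold=0.4)
```

The hierarchical version yields a full merge history that can be cut at any level, while the single-pass version yields one flat, non-overlapping partition whose number of groups depends on the threshold and on document order.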

Evaluation of results
- Was the method appropriate for the data set?
- Do the clusters represent the data well?
- Are the docs in the right cluster?

How to test?
- Overlap test
- Run a known query set and evaluate against known results
- Randomly select docs and judge relevance to group members
- Examine distribution of docs in groups
- Density test = term occurrences / (docs x unique terms)
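The density test can be sketched as below. The slide does not say how "term occurrences" is counted; this example assumes it means the number of (document, term) pairs, that is, the filled cells of a binary document-term matrix, which keeps the result between 0 and 1. The function name and sample data are hypothetical.

```python
def density(docs):
    """Term occurrences divided by (number of docs x number of unique terms).

    Assumption: each doc is a set of terms, so an occurrence is one
    (document, term) pair rather than one token.
    """
    vocabulary = set().union(*docs) if docs else set()
    if not docs or not vocabulary:
        return 0.0
    occurrences = sum(len(terms) for terms in docs)
    return occurrences / (len(docs) * len(vocabulary))

docs = [{"a", "b"}, {"b", "c"}, {"c"}]
value = density(docs)  # 5 occurrences / (3 docs * 3 unique terms)
```

A value near 1 means most documents contain most of the vocabulary; a value near 0 means the collection is sparse, with little term sharing for clustering to exploit.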

Concepts to keep in mind
- Cluster hypothesis
- Nearest neighbour
- Centroid

Cluster Hypothesis
- Associations between documents are related to the relevance of documents to queries (Van Rijsbergen, 1979)

Nearest Neighbour
- Find the document most similar to the given one
- This one is most likely closely related
- Works with terms, citations, & clusters
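A nearest-neighbour lookup is a single scan over the collection. The sketch below assumes feature sets (terms here, though citations would work the same way) and Jaccard similarity; the names and documents are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbour(target_id, docs, sim=jaccard):
    """Return (id, score) of the document most similar to the target."""
    best_id, best_score = None, -1.0
    for doc_id, features in docs.items():
        if doc_id == target_id:
            continue  # a document is not its own neighbour
        score = sim(docs[target_id], features)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score

docs = {
    "d1": {"hockey", "goal", "team"},
    "d2": {"hockey", "team", "trade"},
    "d3": {"election", "vote"},
}
neighbour = nearest_neighbour("d1", docs)
```

On ties, this version keeps the first document encountered, which is an arbitrary choice.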

Centroids
- Representative of a cluster
- May be a document from that cluster
- May be a composite of doc features from that cluster
- Why:
  - query-centroid calculations
  - higher level representations of data set
  - build ontologies and thesauri
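A composite centroid and a query-centroid calculation might look like the following. This sketch assumes binary term sets, defines the centroid weight of a term as the fraction of cluster members containing it, and scores a query against the centroid with cosine similarity; these choices and all names are assumptions rather than the slides' definitions.

```python
import math
from collections import Counter

def composite_centroid(cluster):
    """Map each term to the fraction of cluster documents that contain it."""
    counts = Counter(term for doc in cluster for term in doc)
    return {term: n / len(cluster) for term, n in counts.items()}

def query_centroid_score(query, centroid):
    """Cosine similarity between a binary query and a weighted centroid."""
    dot = sum(centroid.get(term, 0.0) for term in query)
    norm = math.sqrt(len(query)) * math.sqrt(sum(w * w for w in centroid.values()))
    return dot / norm if norm else 0.0

cluster = [
    {"hockey", "goal", "team"},
    {"hockey", "team", "trade"},
]
centroid = composite_centroid(cluster)
score = query_centroid_score({"hockey", "trade"}, centroid)
```

Comparing a query with one centroid per cluster, instead of with every document, is what makes cluster-based retrieval cheaper than a full scan.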

Visualization of Clusters
- Kohonen maps
- Star maps
- SOM (self-organizing maps)
- Etc.

Samples

Cluster Map

Starfield
