2007.02.08 - SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

2007.02.08 - SLIDE 1 - IS 240 – Spring 2007
Prof. Ray Larson
University of California, Berkeley
School of Information
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
Principles of Information Retrieval
Lecture 8: Clustering
Some slides in this lecture were originally created by Prof. Marti Hearst

2007.02.08 - SLIDE 2 - IS 240 – Spring 2007
Mini-TREC
Need to make groups
–Today
Systems
–SMART (not recommended…) ftp://ftp.cs.cornell.edu/pub/smart
–MG (We have a special version if interested) http://www.mds.rmit.edu.au/mg/welcome.html
–Cheshire II & 3
  II = ftp://cheshire.berkeley.edu/pub/cheshire & http://cheshire.berkeley.edu
  3 = http://cheshire3.sourceforge.org
–Zprise (Older search system from NIST) http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
–IRF (new Java-based IR framework from NIST) http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
–Lemur http://www-2.cs.cmu.edu/~lemur
–Lucene (Java-based Text search engine) http://jakarta.apache.org/lucene/docs/index.html
–Others?? (See http://searchtools.com )

2007.02.08 - SLIDE 3 - IS 240 – Spring 2007
Mini-TREC
Proposed Schedule
–February 15: Database and previous queries
–February 27: Report on system acquisition and setup
–March 8: New queries for testing…
–April 19: Results due
–April 24 or 26: Results and system rankings
–May 8: Group reports and discussion

2007.02.08 - SLIDE 4 - IS 240 – Spring 2007
Review: IR Models
Set Theoretic Models
–Boolean
–Fuzzy
–Extended Boolean
Vector Models (Algebraic)
Probabilistic Models (probabilistic)

2007.02.08 - SLIDE 5 - IS 240 – Spring 2007
Similarity Measures
Simple matching (coordination level match)
Dice’s Coefficient
Jaccard’s Coefficient
Cosine Coefficient
Overlap Coefficient
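
The formulas for these measures did not survive in the transcript. As a minimal sketch, assuming documents and queries are represented as plain sets of terms (binary weights), the commonly used set-based forms of the five coefficients can be written as follows. The function names and the sample term sets are illustrative only, and both sets are assumed to be non-empty.

```python
import math

def simple_matching(x, y):
    """Coordination level match: the number of shared terms."""
    return len(x & y)

def dice(x, y):
    """Twice the shared terms over the sum of the set sizes."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Shared terms over the size of the union."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """Shared terms over the geometric mean of the set sizes."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    """Shared terms over the size of the smaller set."""
    return len(x & y) / min(len(x), len(y))

# Hypothetical term sets, for illustration only.
doc = {"information", "retrieval", "clustering", "methods"}
query = {"clustering", "methods", "evaluation"}

scores = {
    "simple": simple_matching(doc, query),
    "dice": dice(doc, query),
    "jaccard": jaccard(doc, query),
    "cosine": cosine(doc, query),
    "overlap": overlap(doc, query),
}
```

All five share the same numerator (the number of terms in common) and differ only in how they normalize for set size.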

2007.02.08 - SLIDE 6 - IS 240 – Spring 2007
Documents in Vector Space
[Figure: documents D1 to D11 plotted in a three-dimensional term space with axes t1, t2 and t3]

2007.02.08 - SLIDE 7 - IS 240 – Spring 2007
Vector Space with Term Weights and Cosine Matching
$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \ldots;\ d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}};\ q_{i2}, w_{q_{i2}};\ \ldots;\ q_{it}, w_{q_{it}})$
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
[Figure: Q, D1 and D2 plotted against axes Term A and Term B, each scaled from 0 to 1.0]
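
As a worked example, the cosine match between the query and the two documents on this slide can be computed directly from the weights given. This is a small sketch; the helper name `cosine_similarity` is an assumption made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Weights taken from the slide.
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_similarity(Q, D1)  # about 0.73
sim_d2 = cosine_similarity(Q, D2)  # about 0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1 has the larger weight on one of the terms.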

2007.02.08 - SLIDE 8 - IS 240 – Spring 2007
Term Weights in SMART
In SMART, weights are decomposed into three factors:

2007.02.08 - SLIDE 9 - IS 240 – Spring 2007
SMART Freq Components
Binary
maxnorm
augmented
log

2007.02.08 - SLIDE 10 - IS 240 – Spring 2007
Collection Weighting in SMART
Inverse
squared
probabilistic
frequency

2007.02.08 - SLIDE 11 - IS 240 – Spring 2007
Term Normalization in SMART
sum
cosine
fourth
max

2007.02.08 - SLIDE 12 - IS 240 – Spring 2007
Problems with Vector Space
There is no real theoretical basis for the assumption of a term space
–it is more for visualization than having any real basis
–most similarity measures work about the same regardless of model
Terms are not really orthogonal dimensions
–Terms are not independent of all other terms

2007.02.08 - SLIDE 13 - IS 240 – Spring 2007
Today
Clustering
Automatic Classification
Cluster-enhanced search

2007.02.08 - SLIDE 14 - IS 240 – Spring 2007
Overview
Introduction to Automatic Classification and Clustering
Classification of Classification Methods
Classification Clusters and Information Retrieval in Cheshire II
DARPA Unfamiliar Metadata Project

2007.02.08 - SLIDE 15 - IS 240 – Spring 2007
Classification
The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
In document classification the items are grouped together because they are likely to be wanted together
–For example, items about the same topic.

2007.02.08 - SLIDE 16 - IS 240 – Spring 2007
Automatic Indexing and Classification
Automatic indexing typically means simply deriving keywords from a document and providing access to all of those words.
More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document.
Automatic classification attempts to automatically group similar documents using either:
–A fully automatic clustering method.
–An established classification scheme and set of documents already indexed by that scheme.

2007.02.08 - SLIDE 17 - IS 240 – Spring 2007
Background and Origins
Early suggestion by Fairthorne
–“The Mathematics of Classification”
Early experiments by Maron (1961) and Borko and Bernick (1963)
Work in Numerical Taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s).
Early IR clustering work was more concerned with efficiency issues than semantic issues.

2007.02.08 - SLIDE 18 - IS 240 – Spring 2007
Document Space has High Dimensionality
What happens beyond three dimensions?
Similarity still has to do with how many tokens are shared in common.
More terms -> harder to understand which subsets of words are shared among similar documents.
One approach to handling high dimensionality: Clustering

2007.02.08 - SLIDE 19 - IS 240 – Spring 2007
Vector Space Visualization

2007.02.08 - SLIDE 20 - IS 240 – Spring 2007
Cluster Hypothesis
The basic notion behind the use of classification and clustering methods:
“Closely associated documents tend to be relevant to the same requests.”
–C.J. van Rijsbergen

2007.02.08 - SLIDE 21 - IS 240 – Spring 2007
Classification of Classification Methods
Class Structure
–Intellectually Formulated
  Manual assignment (e.g. Library classification)
  Automatic assignment (e.g. Cheshire Classification Mapping)
–Automatically derived from collection of items
  Hierarchic Clustering Methods (e.g. Single Link)
  Agglomerative Clustering Methods (e.g. Dattola)
  Hybrid Methods (e.g. Query Clustering)

2007.02.08 - SLIDE 22 - IS 240 – Spring 2007
Classification of Classification Methods
Relationship between properties and classes
–monothetic
–polythetic
Relation between objects and classes
–exclusive
–overlapping
Relation between classes and classes
–ordered
–unordered
Adapted from Sparck Jones

2007.02.08 - SLIDE 23 - IS 240 – Spring 2007
Properties and Classes
Monothetic
–Class defined by a set of properties that are both necessary and sufficient for membership in the class
Polythetic
–Class defined by a set of properties such that to be a member of the class an individual must have some number (usually large) of those properties, a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.
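
The two definitions can be contrasted with a small membership check. This is only a sketch: the property letters, the four sample individuals, and the cutoff `min_shared=3` are made up for illustration and are not taken from the slides.

```python
def is_monothetic_member(item_props, defining_props):
    """Monothetic: the item must have every defining property."""
    return defining_props <= item_props

def is_polythetic_member(item_props, class_props, min_shared):
    """Polythetic: the item must have at least min_shared class properties."""
    return len(item_props & class_props) >= min_shared

# Hypothetical class described by four properties.
class_props = {"A", "B", "C", "D"}

# Hypothetical individuals; each has three of the four properties.
items = {
    1: {"A", "B", "C"},
    2: {"A", "B", "D"},
    3: {"A", "C", "D"},
    4: {"B", "C", "D"},
}

polythetic = [i for i, p in items.items()
              if is_polythetic_member(p, class_props, min_shared=3)]
monothetic = [i for i, p in items.items()
              if is_monothetic_member(p, class_props)]
```

With this data every individual qualifies under the polythetic rule, while none qualifies under the monothetic rule because no individual has all four properties.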

2007.02.08 - SLIDE 24 - IS 240 – Spring 2007
Monothetic vs. Polythetic
[Table: individuals 1 to 8 (rows) against properties A to H (columns), with three properties marked “+” for each individual; one group of rows is labeled Polythetic and the other Monothetic]
Adapted from van Rijsbergen, ‘79

2007.02.08 - SLIDE 25 - IS 240 – Spring 2007
Exclusive Vs. Overlapping
Exclusive: an item belongs to exactly one class
Overlapping: items can belong to many classes, sometimes with a “membership weight”

2007.02.08 - SLIDE 26 - IS 240 – Spring 2007
Ordered Vs. Unordered
Ordered classes have some sort of structure imposed on them
–Hierarchies are typical of ordered classes
Unordered classes have no imposed precedence or structure and each class is considered on the same “level”
–Typical in agglomerative methods

2007.02.08 - SLIDE 27 - IS 240 – Spring 2007
Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
[Figure: documents plotted against axes Term 1 and Term 2]

2007.02.08 - SLIDE 28 - IS 240 – Spring 2007
Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
[Figure: documents plotted against axes Term 1 and Term 2]

2007.02.08 - SLIDE 29 - IS 240 – Spring 2007
Text Clustering
Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others

2007.02.08 - SLIDE 30 - IS 240 – Spring 2007
Coefficients of Association
Simple
Dice’s coefficient
Jaccard’s coefficient
Cosine coefficient
Overlap coefficient

2007.02.08 - SLIDE 31 - IS 240 – Spring 2007
Pair-wise Document Similarity
How to compute document similarity?

2007.02.08 - SLIDE 32 - IS 240 – Spring 2007
Pair-wise Document Similarity
(no normalization for simplicity)

2007.02.08 - SLIDE 33 - IS 240 – Spring 2007
Pair-wise Document Similarity
cosine normalization

2007.02.08 - SLIDE 34 - IS 240 – Spring 2007
Document/Document Matrix
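
The worked examples on the pair-wise similarity slides were images and are not in the transcript. As a rough sketch of the idea, assuming each document is a mapping from terms to raw term frequencies, the unnormalized similarity can be taken as the inner product, the normalized one as the cosine, and the document/document matrix as the table of all pair-wise values. The three sample documents and all function names are hypothetical.

```python
import math

def inner_product(a, b):
    """Unnormalized similarity: sum of weight products over shared terms."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())

def cosine(a, b):
    """Inner product divided by the product of the vector lengths."""
    return inner_product(a, b) / math.sqrt(
        inner_product(a, a) * inner_product(b, b)
    )

def doc_doc_matrix(docs, sim):
    """Similarity of every document pair, keyed by (name, name)."""
    names = sorted(docs)
    return {(i, j): sim(docs[i], docs[j]) for i in names for j in names}

# Hypothetical term-frequency vectors.
docs = {
    "D1": {"star": 2, "galaxy": 1},
    "D2": {"star": 1, "film": 3},
    "D3": {"star": 4, "galaxy": 2},
}

raw = doc_doc_matrix(docs, inner_product)
norm = doc_doc_matrix(docs, cosine)
```

Here D3 is simply D1 with doubled frequencies: the raw inner product rewards the longer document, while cosine normalization gives the pair a similarity of 1.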

2007.02.08 - SLIDE 35 - IS 240 – Spring 2007
Clustering Methods
Hierarchical
Agglomerative
Hybrid
Automatic Class Assignment

2007.02.08 - SLIDE 36 - IS 240 – Spring 2007
Hierarchic Agglomerative Clustering
Basic method:
1) Calculate all of the interdocument similarity coefficients
2) Assign each document to its own cluster
3) Fuse the most similar pair of current clusters
4) Update the similarity matrix by deleting the rows for the fused clusters and calculating entries for the row and column representing the new cluster (centroid)
5) Return to step 3 if there is more than one cluster left
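
The five steps above can be sketched in a few lines. This is a simplified illustration rather than an efficient implementation: it assumes cosine similarity between cluster centroids, and it recomputes the similarities on every pass instead of updating a stored matrix in place. The document vectors and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise mean of the member vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def agglomerate(docs):
    """docs maps a name to a vector. Returns the merges in order as
    (members_a, members_b, similarity) tuples."""
    # Step 2: every document starts in its own cluster.
    clusters = {(name,): [vec] for name, vec in docs.items()}
    merges = []
    # Step 5: keep going while more than one cluster is left.
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Steps 1 and 4: similarities between the current cluster centroids.
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                s = cosine(centroid(clusters[keys[i]]),
                           centroid(clusters[keys[j]]))
                if best is None or s > best[0]:
                    best = (s, keys[i], keys[j])
        s, a, b = best
        # Step 3: fuse the most similar pair into one cluster.
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, s))
    return merges

# Hypothetical two-term document vectors.
docs = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.2, 0.8]}
merges = agglomerate(docs)
```

The sequence of merges is the hierarchy: read from first to last it gives the tree from the leaves up to the single root cluster.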

2007.02.08 - SLIDE 37 - IS 240 – Spring 2007
Hierarchic Agglomerative Clustering
[Figure: clustering of items A, B, C, D, E, F, G, H, I]

2007.02.08 - SLIDE 40 - IS 240 – Spring 2007
Hierarchical Methods
Single Link Dissimilarity Matrix
2  .4
3  .4  .2
4  .3  .3  .3
5  .1  .4  .4  .1
    1   2   3   4
Hierarchical methods: Polythetic, Usually Exclusive, Ordered
Clusters are order-independent

2007.02.08 - SLIDE 41 - IS 240 – Spring 2007
Threshold = .1
Single Link Dissimilarity Matrix
2  .4
3  .4  .2
4  .3  .3  .3
5  .1  .4  .4  .1
    1   2   3   4
Thresholded matrix
2  0
3  0  0
4  0  0  0
5  1  0  0  1
   1  2  3  4
[Figure: graph over items 1 to 5 with a link for each 1 entry]

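2007.02.08 - SLIDE 42 - IS 240 – Spring 2007
Threshold = .2
Single Link Dissimilarity Matrix
2  .4
3  .4  .2
4  .3  .3  .3
5  .1  .4  .4  .1
    1   2   3   4
Thresholded matrix
2  0
3  0  1
4  0  0  0
5  1  0  0  1
   1  2  3  4
[Figure: graph over items 1 to 5 with a link for each 1 entry]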

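2007.02.08 - SLIDE 43 - IS 240 – Spring 2007
Threshold = .3
Single Link Dissimilarity Matrix
2  .4
3  .4  .2
4  .3  .3  .3
5  .1  .4  .4  .1
    1   2   3   4
Thresholded matrix
2  0
3  0  1
4  1  1  1
5  1  0  0  1
   1  2  3  4
[Figure: graph over items 1 to 5 with a link for each 1 entry]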
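
The three threshold slides can be reproduced with a short sketch: link every pair whose dissimilarity is at or below the threshold, then read the single-link clusters off as the connected components of the resulting graph. The dissimilarity values are the ones on the slides; the function name and the union-find bookkeeping are illustrative choices.

```python
# Lower-triangular dissimilarities from the slides, keyed by item pair.
DISSIM = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}
ITEMS = [1, 2, 3, 4, 5]

def single_link_clusters(dissim, items, threshold):
    """Link pairs with dissimilarity <= threshold and return the
    connected components as sorted lists."""
    parent = {i: i for i in items}

    def find(i):
        # Follow parent pointers to the component root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), d in dissim.items():
        # Small tolerance so that values equal to the threshold are linked.
        if d <= threshold + 1e-9:
            parent[find(a)] = find(b)

    groups = {}
    for i in items:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

at_01 = single_link_clusters(DISSIM, ITEMS, 0.1)  # [[1, 4, 5], [2], [3]]
at_02 = single_link_clusters(DISSIM, ITEMS, 0.2)  # [[1, 4, 5], [2, 3]]
at_03 = single_link_clusters(DISSIM, ITEMS, 0.3)  # [[1, 2, 3, 4, 5]]
```

Raising the threshold only ever adds links, so the clusters at each level nest inside those of the next level, which is what makes the result a hierarchy.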

2007.02.08 - SLIDE 44 - IS 240 – Spring 2007
K-means & Rocchio Clustering
Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered
Clusters are order-dependent.
Rocchio’s method
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)

2007.02.08 - SLIDE 45 - IS 240 – Spring 2007
K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
–take a small sample
–group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
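
Steps 3 and 4 can be sketched as a loop that alternates assignment and centroid updates. Two simplifications are assumed here: the K seed centers are passed in directly rather than found by agglomerative clustering of a sample (step 2), and "nearest" is taken to mean smallest Euclidean distance. The sample vectors, seeds, and function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(docs, seeds, max_iter=20):
    """docs: list of vectors; seeds: K starting centers.
    Returns (assignments, centers)."""
    centers = [list(c) for c in seeds]  # copy so the caller's seeds are kept
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign each document to its nearest center.
        new_assignments = [
            min(range(len(centers)), key=lambda k: euclidean(d, centers[k]))
            for d in docs
        ]
        # Step 4: stop once no document changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each center to the centroid of its current members.
        for k in range(len(centers)):
            members = [d for d, a in zip(docs, assignments) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean(members)
    return assignments, centers

# Hypothetical two-term document vectors and seed centers.
docs = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
assignments, centers = k_means(docs, seeds=[[0.0, 0.0], [6.0, 6.0]])
```

The outcome depends on the seeds, which is one reason the slide suggests deriving them from a bottom-up grouping of a small sample rather than choosing them arbitrarily.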

2007.02.08 - SLIDE 46 - IS 240 – Spring 2007
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”

2007.02.08 - SLIDE 47 - IS 240 – Spring 2007
S/G Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated

2007.02.08 - SLIDE 52 - IS 240 – Spring 2007
Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
“Project” these onto a 2D graphical representation:

2007.02.08 - SLIDE 53 - IS 240 – Spring 2007
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)

2007.02.08 - SLIDE 54 - IS 240 – Spring 2007
Clustering Multi-Dimensional Document Space
(image from Wise et al 95)

2007.02.08 - SLIDE 55 - IS 240 – Spring 2007
Concept “Landscapes”
[Figure: landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
(e.g., Lin, Chen, Wise et al.)
Too many concepts, or too coarse
Single concept per document
No titles
Browsing without search

2007.02.08 - SLIDE 56 - IS 240 – Spring 2007
Clustering
Advantages:
–See some main themes
Disadvantage:
–Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and facets?

2007.02.08 - SLIDE 57 - IS 240 – Spring 2007
Automatic Class Assignment
[Figure: a document submitted to a search engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered
Clusters are order-independent, usually based on an intellectually derived scheme
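
The four steps can be sketched with a toy "search engine" that ranks class pseudo-documents by cosine similarity to the words of the incoming document. Everything concrete here is an assumption made for illustration: the whitespace tokenizer, the cosine ranking, the class labels, and the pseudo-document texts. A real system would use its own retrieval model.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign_classes(doc_text, pseudo_docs, threshold, n):
    """Return up to n class labels whose pseudo-documents score at or
    above threshold, best first."""
    # Step 2: use the document contents as the query.
    doc_vec = Counter(doc_text.lower().split())
    # Step 3: rank every class pseudo-document against it.
    scored = sorted(
        ((cosine(doc_vec, Counter(text.lower().split())), label)
         for label, text in pseudo_docs.items()),
        reverse=True,
    )
    # Step 4: keep the top n classes that pass the threshold.
    return [label for score, label in scored if score >= threshold][:n]

# Step 1: hypothetical pseudo-documents for three classes.
pseudo_docs = {
    "astronomy": "star galaxy telescope planet orbit",
    "film": "star actor movie director film",
    "botany": "plant flower leaf seed",
}
doc = "the star and its galaxy seen by telescope"

overlapping = assign_classes(doc, pseudo_docs, threshold=0.1, n=3)
exclusive = assign_classes(doc, pseudo_docs, threshold=0.1, n=1)
```

Setting `n=1` gives the exclusive variant (top-ranked category only); a larger `n` with a threshold gives the overlapping variant.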

2007.02.08 - SLIDE 58 - IS 240 – Spring 2007
Automatic Categorization in Cheshire II
Cheshire supports a method we call “classification clustering” that relies on having a set of records that have been indexed using some controlled vocabulary.
Classification clustering has the following steps…

2007.02.08 - SLIDE 59 - IS 240 – Spring 2007
Cheshire II - Cluster Generation
Define basis for clustering records.
–Select field (i.e., the controlled vocabulary terms) to form the basis of the cluster.
–Evidence Fields to use as contents of the pseudo-documents. (E.g. the titles or other topical parts)
During indexing, cluster keys are generated with basis and evidence from each record.
Cluster keys are sorted and merged on basis, and pseudo-documents are created for each unique basis element containing all evidence fields.
Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
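
The merge step described above can be illustrated with a short sketch. This is not Cheshire II code and does not use its API: records are assumed to be plain dictionaries, and the field names (`lc_class`, `title`, `subject`) and sample records are invented for the example.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Group records by their basis value and concatenate the evidence
    fields of each group into one pseudo-document per basis value."""
    merged = defaultdict(list)
    for record in records:
        basis = record.get(basis_field)
        if not basis:
            continue  # records without a basis value form no cluster key
        for field in evidence_fields:
            value = record.get(field)
            if value:
                # One cluster key: this basis paired with this evidence.
                merged[basis].append(value)
    # Sort on basis and merge all evidence for each unique basis element.
    return {basis: " ".join(evidence)
            for basis, evidence in sorted(merged.items())}

# Hypothetical records with a class number as basis.
records = [
    {"lc_class": "QB", "title": "Stars and galaxies", "subject": "Astronomy"},
    {"lc_class": "QB", "title": "Stellar evolution"},
    {"lc_class": "PN", "title": "Film stars"},
]

pseudo = build_pseudo_documents(records, "lc_class", ["title", "subject"])
```

Each resulting pseudo-document can then be indexed like an ordinary document, so that a query matches classes through the vocabulary of the records assigned to them.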

2007.02.08 - SLIDE 60 - IS 240 – Spring 2007
Cheshire II - Two-Stage Retrieval
Using the LC Classification System
–Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (e.g., subject headings, titles, etc.)
–Permits searching by any term in the class
–Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first.
–User selects classes to feed back for the “second stage” search of documents.
Can be used with any classified/indexed collection.

2007.02.08 - SLIDE 61 - IS 240 – Spring 2007
Cheshire II Demo
Examples from the:
–SciMentor (BioSearch) project
  Journal of Biological Chemistry and MEDLINE data
–CHESTER (EconLit)
  Journal of Economic Literature subjects
–Unfamiliar Metadata & TIDES Projects
  Basis for clusters is a normalized Library of Congress Class Number
  Evidence is provided by terms from record titles (and subject headings for the “all languages” set)
  Five different training sets (Russian, German, French, Spanish, and All Languages)
  Testing cross-language retrieval and classification
–4W Project Search
