9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley.

9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture authors: Marti Hearst & Ray Larson & Warren Sack

9/13/2001Information Organization and Retrieval Last Time Content Analysis: – Transformation of raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Zipf distribution –Word co-occurrences non-independent

Information Organization and Retrieval Document Processing Steps

9/13/2001Information Organization and Retrieval Zipf Distribution Rank = order of words’ frequency of occurrence The product of the frequency of words (f) and their rank (r) is approximately constant

9/13/2001Information Organization and Retrieval Zipf Distribution The Important Points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently

Information Organization and Retrieval Zipf Distribution (Same curve on linear and log scale)

9/13/2001Information Organization and Retrieval Statistical Independence Two events x and y are statistically independent if the product of their probability of their happening individually equals their probability of happening together.

9/13/2001Information Organization and Retrieval Today Document Vectors Inverted Files Vector Space Model Term Weighting Clustering

9/13/2001Information Organization and Retrieval Document Vectors Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is like an array of (floating point) numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse

9/13/2001Information Organization and Retrieval Document Vectors One location for each word. novagalaxy heath’wood filmroledietfur 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3 ABCDEFGHIABCDEFGHI “Nova” occurs 10 times in text A “Galaxy” occurs 5 times in text A “Heat” occurs 3 times in text A (Blank means 0 occurrences.)

9/13/2001Information Organization and Retrieval Document Vectors One location for each word. novagalaxy heath’wood filmroledietfur 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3 ABCDEFGHIABCDEFGHI “Hollywood” occurs 7 times in text I “Film” occurs 5 times in text I “Diet” occurs 1 time in text I “Fur” occurs 3 times in text I

9/13/2001Information Organization and Retrieval Document Vectors novagalaxy heath’wood filmroledietfur 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3 ABCDEFGHIABCDEFGHI Document ids

9/13/2001Information Organization and Retrieval We Can Plot the Vectors Star Diet Doc about astronomy Doc about movie stars Doc about mammal behavior

Information Organization and Retrieval Documents in 3D Space

9/13/2001Information Organization and Retrieval Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies Text documents are transformed into vectors –Pre-processing includes tokenization, stemming, collocations/phrases –Documents occupy multi-dimensional space.

Information need Index Pre-process Parse Collections Rank Query text input How is the index constructed?

9/13/2001Information Organization and Retrieval Inverted Index This is the primary data structure for text indexes Main Idea: –Invert documents into a big index Basic steps: –Make a “dictionary” of all the tokens in the collection –For each token, list all the docs it occurs in. –Do a few things to reduce redundancy in the data structure

9/13/2001Information Organization and Retrieval Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

9/13/2001Information Organization and Retrieval How Are Inverted Files Created Documents are parsed to extract tokens. These are saved with the Document ID. Now is the time for all good men to come to the aid of their country Doc 1 It was a dark and stormy night in the country manor. The time was past midnight Doc 2

9/13/2001Information Organization and Retrieval How Inverted Files are Created After all documents have been parsed the inverted file is sorted alphabetically.

9/13/2001Information Organization and Retrieval How Inverted Files are Created Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

9/13/2001Information Organization and Retrieval How Inverted Files are Created Then the file can be split into –A Dictionary file and –A Postings file

9/13/2001Information Organization and Retrieval How Inverted Files are Created Dictionary Postings

9/13/2001Information Organization and Retrieval Inverted indexes Permit fast search for individual terms For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms

9/13/2001Information Organization and Retrieval How Inverted Files are Used Dictionary Postings Boolean Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query.

9/13/2001Information Organization and Retrieval Vector Space Model Documents are represented as vectors in term space –Terms are usually stems –Documents represented by binary vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents This makes partial matching possible

9/13/2001Information Organization and Retrieval Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.

9/13/2001Information Organization and Retrieval Vector Space Documents and Queries D1D1 D2D2 D3D3 D4D4 D5D5 D6D6 D7D7 D8D8 D9D9 D 10 D 11 t2t2 t3t3 t1t1 Boolean term combinations Q is a query – also represented as a vector

9/13/2001Information Organization and Retrieval Documents in Vector Space t1t1 t2t2 t3t3 D1D1 D2D2 D 10 D3D3 D9D9 D4D4 D7D7 D8D8 D5D5 D 11 D6D6

9/13/2001Information Organization and Retrieval Assigning Weights to Terms Binary Weights Raw term frequency tf x idf –Recall the Zipf distribution –Want to weight terms highly if they are frequent in relevant documents … BUT infrequent in the collection as a whole

9/13/2001Information Organization and Retrieval Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

9/13/2001Information Organization and Retrieval Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

9/13/2001Information Organization and Retrieval Assigning Weights tf x idf measure: –term frequency (tf) –inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution Goal: assign a tf * idf weight to each term in each document

9/13/2001Information Organization and Retrieval tf x idf

9/13/2001Information Organization and Retrieval Inverse Document Frequency IDF provides high values for rare words and low values for common words For a collection of 10000 documents

9/13/2001Information Organization and Retrieval Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient

9/13/2001Information Organization and Retrieval Computing Similarity Scores -- Preview 1.0 0.8 0.6 0.8 0.4 0.60.41.00.2

9/13/2001Information Organization and Retrieval Document Space has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common. More terms -> harder to understand which subsets of words are shared among similar documents. Next time we will look in detail at ranking methods One approach to handling high dimensionality:Clustering

9/13/2001Information Organization and Retrieval Vector Space Visualization

9/13/2001Information Organization and Retrieval Text Clustering Finds overall similarities among groups of documents Finds overall similarities among groups of tokens Picks out some themes, ignores others

9/13/2001Information Organization and Retrieval Text Clustering Clustering is “The art of finding groups in data.” Kaufmann and Rousseeuw, Finding Groups in Data, 1990 Term 1 Term 2

9/13/2001Information Organization and Retrieval Text Clustering Term 1 Term 2 Clustering is “The art of finding groups in data.” Kaufmann and Rousseeuw, Finding Groups in Data, 1990

9/13/2001Information Organization and Retrieval Types of Clustering Hierarchical vs. Flat Hard vs.Soft vs. Disjunctive (set vs. uncertain vs. multiple assignment)

9/13/2001Information Organization and Retrieval Pair-wise Document Similarity novagalaxy heath’wood filmroledietfur 1 3 1 5 2 2 1 5 4 1 ABCDABCD How to compute document similarity?

9/13/2001Information Organization and Retrieval Pair-wise Document Similarity (no normalization for simplicity) novagalaxy heath’wood filmroledietfur 1 3 1 5 2 2 1 5 4 1 ABCDABCD

9/13/2001Information Organization and Retrieval Pair-wise Document Similarity (cosine normalization)

9/13/2001Information Organization and Retrieval Document/Document Matrix

9/13/2001Information Organization and Retrieval Hierarchical Clustering (Agglomerative) ABCDEFGHIABCDEFGHI

9/13/2001Information Organization and Retrieval

9/13/2001Information Organization and Retrieval Types of Hierarchical Clustering Top-down vs. Bottom-up O(n**2) vs. O(n**3) Single-link vs. complete-link (local coherence vs. global coherence)

9/13/2001Information Organization and Retrieval Flat Clustering K-Means –Hard –O(n) EM (soft version of K-Means)

9/13/2001Information Organization and Retrieval K-Means Clustering 1 Create a pair-wise similarity measure 2 Find K centers 3 Assign each document to nearest center, forming new clusters 4 Repeat 3 as necessary

9/13/2001Information Organization and Retrieval Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93 Hearst & Pedersen 95 Cluster sets of documents into general “themes”, like a table of contents Display the contents of the clusters by showing topical terms and typical titles User chooses subsets of the clusters and re- clusters the documents within Resulting new groups have different “themes”

9/13/2001Information Organization and Retrieval S/G Example: query on “star” Encyclopedia text 14 sports 8 symbols47 film, tv 68 film, tv (p) 7 music 97 astrophysics 67 astronomy(p)12 stellar phenomena 10 flora/fauna 49 galaxies, stars 29 constellations 7 miscelleneous Clustering and re-clustering is entirely automated

9/13/2001Information Organization and Retrieval Another use of clustering Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. “Project” these onto a 2D graphical representation:

9/13/2001Information Organization and Retrieval Clustering Multi-Dimensional Document Space Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow “Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995

9/13/2001Information Organization and Retrieval Clustering Multi-Dimensional Document Space Wise et al., 1995

9/13/2001Information Organization and Retrieval Concept “Landscapes” Browsing without search Pharmocology Anatomy Legal Disease Hospitals (e.g., Xia Lin, “Visualization for the Document Space,” 1992) Based on Kohonen feature maps; See http://websom.hut.fi/websom/

9/13/2001Information Organization and Retrieval More examples of information visualization Stuart Card, Jock Mackinlay, Ben Schneiderman (eds.) Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999) Martin Dodge, www.cybergeography.org

9/13/2001Information Organization and Retrieval Clustering Advantages: –See some main themes Disadvantage: –Many ways documents could group together are hidden Thinking point: what is the relationship to classification systems and faceted queries? e.g., f1: (osteoporosis OR ‘bone loss’) f2: (drugs OR pharmaceuticals) f3: (prevention OR cure)

9/13/2001Information Organization and Retrieval More information on content analysis and clustering Christopher Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999) Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)

9/13/2001Information Organization and Retrieval Next Time Vector Space Ranking Probabilistic Models and Ranking

9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley.

Similar presentations

Presentation on theme: "9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley.

Similar presentations

Presentation on theme: "9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley."— Presentation transcript:

Similar presentations

About project

Feedback