
1 9/18/2001 Information Organization and Retrieval
Vector Representation, Term Weights and Clustering (continued)
Ray Larson & Warren Sack
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson & Warren Sack

2 Last Time
Document Vectors
Inverted Files
Vector Space Model
Term Weighting
Clustering

3 Document Vectors
[Table: weights of the terms nova, galaxy, heat, h’wood, film, role, diet, fur in documents A through I (document ids)]

4 We Can Plot the Vectors
[Figure: documents plotted on Star and Diet axes: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

5 Inverted Index
This is the primary data structure for text indexes
Main Idea:
–Invert documents into a big index
Basic steps:
–Make a “dictionary” of all the tokens in the collection
–For each token, list all the docs it occurs in
–Do a few things to reduce redundancy in the data structure
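The first two basic steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the whitespace tokenizer, the function name `build_inverted_index`, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for token in text.lower().split():
            index[token].add(doc_id)
    # Using a set per token stores each doc id only once, however often the token repeats.
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical toy collection.
docs = {
    "A": "nova galaxy heat",
    "B": "film role hollywood",
    "C": "galaxy film",
}
index = build_inverted_index(docs)
print(index["galaxy"])
```

Looking up a token is then a single dictionary access, which is why this structure suits text retrieval.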

6 Inverted Indexes
We have seen “Vector files” conceptually.
An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
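A small sketch of that inversion, assuming a toy vector file with made-up weights. The names `vector_file` and `invert` are hypothetical, and skipping zero weights is a choice made here to keep the result compact.

```python
# Vector file: one row per document, one column per term (toy weights, assumed).
terms = ["nova", "galaxy", "film"]
vector_file = {
    "A": [10, 5, 0],
    "B": [0, 0, 8],
    "C": [0, 7, 9],
}

def invert(vector_file, terms):
    """Turn document rows into term rows: term -> {doc_id: weight}, nonzero weights only."""
    inverted = {term: {} for term in terms}
    for doc_id, weights in vector_file.items():
        for term, weight in zip(terms, weights):
            if weight:
                inverted[term][doc_id] = weight
    return inverted

inverted_file = invert(vector_file, terms)
print(inverted_file["galaxy"])
```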

7 Vector Space Model
Documents are represented as vectors in term space
–Terms are usually stems
–Documents represented by binary vectors of terms
Queries represented the same as documents
Query and Document weights are based on length and direction of their vector
A vector distance measure between the query and documents is used to rank retrieved documents
This makes partial matching possible
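As a rough illustration of ranking with partial matching, the sketch below scores binary document vectors against a binary query vector. The cosine measure is used as one possible choice; the term order, the three toy documents, and the function names are assumptions for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, doc_vectors):
    """Return (score, doc_id) pairs, best match first."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)

# Assumed term order: star, diet, film. Binary vectors: 1 = term present.
doc_vectors = {
    "astronomy": [1, 0, 0],
    "movie_stars": [1, 0, 1],
    "mammals": [0, 1, 0],
}
query = [1, 0, 1]  # a query containing "star" and "film"
ranking = rank(query, doc_vectors)
for score, doc_id in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The astronomy document shares only one of the two query terms yet still receives a nonzero score, which is the partial matching a strict Boolean AND would not allow.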

8 Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.

9 Assigning Weights
tf x idf measure:
–term frequency (tf)
–inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
Goal: assign a tf * idf weight to each term in each document
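The slide does not fix a formula, so the sketch below assumes one common variant: weight = tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: tf * log(N / df)}."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical toy collection: "star" occurs in every document.
docs = {
    "A": ["star", "star", "galaxy"],
    "B": ["star", "film"],
    "C": ["star", "diet"],
}
weights = tf_idf(docs)
print(weights["A"])
```

Under this variant a term that appears in every document gets weight zero regardless of how often it occurs, while a term confined to one document is weighted up.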

10 Similarity Measures
Simple matching (coordination level match)
Dice’s Coefficient
Jaccard’s Coefficient
Cosine Coefficient
Overlap Coefficient
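For binary term vectors these measures can be written over sets of terms. The sketch below uses the usual set-based definitions; the slide lists only the names, so the exact forms, the function names, and the example sets are assumptions.

```python
import math

def simple_matching(x, y):
    """Coordination level match: number of shared terms."""
    return len(x & y)

def dice(x, y):
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0

def jaccard(x, y):
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def cosine(x, y):
    denom = math.sqrt(len(x) * len(y))
    return len(x & y) / denom if denom else 0.0

def overlap(x, y):
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0

# Hypothetical document and query term sets sharing one term.
doc = {"star", "galaxy", "nova"}
query = {"star", "film"}
scores = {
    "simple": simple_matching(doc, query),
    "dice": dice(doc, query),
    "jaccard": jaccard(doc, query),
    "cosine": cosine(doc, query),
    "overlap": overlap(doc, query),
}
print(scores)
```

All five share the same numerator (the number of common terms) and differ only in how they normalize for the sizes of the two sets.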

11 Computing Similarity Scores
[Figure: plot with axis ticks from 0.2 to 1.0]

12 Text Clustering
Clustering is “The art of finding groups in data.”
Kaufman and Rousseeuw, Finding Groups in Data, 1990
[Figure: documents plotted on Term 1 and Term 2 axes]


14 Types of Clustering
Hierarchical vs. Flat
Hard vs. Soft vs. Disjunctive (set vs. uncertain vs. multiple assignment)


16 Flat Clustering
K-Means
–Hard
–O(n)
EM (soft version of K-Means)

17 K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers
3. Assign each document to nearest center, forming new clusters
4. Repeat 3 as necessary
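The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance plays the role of the pair-wise measure, the first K points seed the centers, and each repeat recomputes every center as the mean of its members (the standard K-means update, which the slide folds into "Repeat"). The function names and sample points are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(points, k, max_iters=100):
    # Assumption: the first k points serve as initial centers.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: stop repeating once assignments no longer change.
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical 2D document vectors forming two obvious groups.
points = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [8.8, 9.2]]
assignment, centers = k_means(points, k=2)
print(assignment)
```

Each point belongs to exactly one cluster, which is the "hard" assignment noted for K-means, and each pass touches every point once per center.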


20 Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”

21 Scatter/Gather Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated


25 Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
“Project” these onto a 2D graphical representation:

26 Clustering Multi-Dimensional Document Space
Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow, “Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995

27 Clustering Multi-Dimensional Document Space
Wise et al., 1995

28 Concept “Landscapes”
Browsing without search
[Figure: landscape regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
(e.g., Xia Lin, “Visualization for the Document Space,” 1992)
Based on Kohonen feature maps; see http://websom.hut.fi/websom/

29 More examples of information visualization
Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
Martin Dodge, www.cybergeography.org

30 Clustering
Advantages:
–See some main themes
Disadvantage:
–Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and faceted queries?
e.g.,
f1: (osteoporosis OR ‘bone loss’)
f2: (drugs OR pharmaceuticals)
f3: (prevention OR cure)
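To make the faceted query concrete, the sketch below treats each facet as an OR over its phrases and combines the facets with AND. Combining with AND is an assumption (the slide only lists the facets), as are the substring matching, the function name `matches`, and the sample documents.

```python
# The three facets from the slide, each a set of alternative phrases.
facets = [
    {"osteoporosis", "bone loss"},
    {"drugs", "pharmaceuticals"},
    {"prevention", "cure"},
]

def matches(text, facets):
    """True if the text contains at least one phrase from every facet (AND of ORs)."""
    text = text.lower()
    # Assumption: plain substring search stands in for proper term matching.
    return all(any(phrase in text for phrase in facet) for facet in facets)

# Hypothetical document titles.
docs = {
    "d1": "New drugs for the prevention of osteoporosis",
    "d2": "Bone loss in astronauts",
    "d3": "Pharmaceuticals that may cure bone loss",
}
results = {doc_id: matches(text, facets) for doc_id, text in docs.items()}
print(results)
```

Unlike a clustering, where the groupings emerge from the data, here the user names each dimension explicitly and every match must satisfy all of them.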

31 More information on content analysis and clustering
Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)

32 And now on to…
Vector Space Ranking
Probabilistic Models and Ranking

