SIMS 202, Marti Hearst Content Analysis Prof. Marti Hearst SIMS 202, Lecture 15.


1 SIMS 202, Marti Hearst Content Analysis Prof. Marti Hearst SIMS 202, Lecture 15

2 UCB SIMS 202 Review
Content Analysis:
  Transformation of raw text into more computationally useful forms
Words in text collections exhibit interesting statistical properties
  Zipf distribution
  Word co-occurrences are non-independent
Text documents are transformed to vectors
  Pre-processing
  Vectors represent multi-dimensional space

3 UCB SIMS 202 Zipf Distribution
Rank = position of a word when words are ordered by frequency of occurrence
The product of a word’s frequency (f) and its rank (r) is approximately constant: f × r ≈ C
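
The rank-frequency relationship can be checked on any token list. The following is a minimal sketch, not part of the original slides; the function name `rank_frequency` and the toy sentence are illustrative assumptions, and a text this small will not show the near-constant product that a large collection does.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy text (hypothetical); a real check of Zipf's law needs a large collection.
text = "the cat sat on the mat and the dog sat on the log"
rows = rank_frequency(text.split())
for row in rows:
    print(row)
```

On a large corpus, the last column (rank times frequency) should stay within roughly the same order of magnitude across the middle ranks.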

4 UCB SIMS 202 Consequences of Zipf
There are always a few very frequent tokens that are not good discriminators.
  Called “stop words” in IR
  Usually correspond to the linguistic notion of “closed-class” words
    English examples: to, from, on, and, the, ...
    Grammatical classes that don’t take on new members.
There are always a large number of tokens that occur only once (or nearly so) and can mess up algorithms.
Medium-frequency words are the most descriptive

5 UCB SIMS 202 Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.

6 UCB SIMS 202 Statistical Independence vs. Dependence
How likely is token W to appear, given that we’ve seen token V?
Non-independence implies that tokens that co-occur may be related in some meaningful way.
Very simple corpus-processing algorithms can produce meaningful results.
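
One common way to quantify departure from independence is pointwise mutual information, which compares the observed joint probability of two tokens with the product of their individual probabilities. The slide text does not say which measure the lecture uses, so the sketch below is an assumption; the function name `pmi` and the counts are hypothetical.

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information in bits: log2(P(x,y) / (P(x) * P(y))).

    count_xy: number of times x and y co-occur
    count_x, count_y: number of times each token occurs
    n: total number of observations
    """
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: independent tokens score near 0, associated tokens score above 0.
print(pmi(10, 100, 100, 1000))
print(pmi(50, 100, 100, 1000))
```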

7 UCB SIMS 202 Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

8 UCB SIMS 202 Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

9 UCB SIMS 202 Computing Co-occurrence
Compute for a window of words
(Figure labels: window positions w1, w11, w21 over the token sequence a b c d e f g h i j k l m n o p)
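
Window-based counting can be sketched as follows. This is an illustration under stated assumptions, not the lecture's procedure: the name `cooccurrence_counts`, the default window size, and the choice to count unordered pairs of distinct tokens are all hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct tokens that fall inside the same window.

    Each token is paired with the (window - 1) tokens that follow it.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

pairs = cooccurrence_counts("a b c d".split(), window=2)
print(pairs)
```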

10 UCB SIMS 202 Document Vectors
Documents are represented as “bags of words”
Represented as vectors when used computationally
  A vector is like an array of floating-point numbers
  Has direction and magnitude
  Each vector holds a place for every term in the collection
  Therefore, most vectors are sparse
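
A bag of words maps naturally onto a sparse mapping from term to weight, and can be expanded to a dense array with one slot per vocabulary term. The sketch below assumes raw counts as weights and whitespace tokenization; the two documents and the helper names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Map each term in the text to its count; absent terms are implicitly zero."""
    return Counter(text.lower().split())

def to_dense(bag, vocabulary):
    """Expand a sparse bag into a float array with one slot per vocabulary term."""
    return [float(bag.get(term, 0)) for term in vocabulary]

# Hypothetical two-document collection.
docs = {"A": "nova galaxy galaxy heat", "B": "film role role diet"}
bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}
vocabulary = sorted(set().union(*bags.values()))
dense = {doc_id: to_dense(bag, vocabulary) for doc_id, bag in bags.items()}
print(vocabulary)
print(dense)
```

Even in this tiny collection, half the slots of each dense vector are zero, which is why the sparse form is usually the one stored.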

11 UCB SIMS 202 Document Vectors
Terms: nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids: A B C D E F G H I
Nonzero term weights, one row per line (the term column of each weight is not preserved in this transcript):
1.0 0.5 0.3
0.5 1.0
1.0 0.8 0.7
0.9 1.0 0.5
1.0 1.0
0.9 1.0
0.5 0.7 0.9
0.6 1.0 0.3 0.2 0.8
0.7 0.5 0.1 0.3

12 UCB SIMS 202 Topics for Today
Multiple-dimensionality of Document Space
Automatic Methods for
  Clustering
  Creating Thesaurus Terms
Review and Sample Questions for Midterm

13 UCB SIMS 202 Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.

14 UCB SIMS 202 Document Similarity
(Figure labels: terms diet, hot, star, fur; counts 10, 3, 5, 4)
Numbers represent how many documents share the indicated subset of terms.
How to represent similarity among five terms? Six?

15 UCB SIMS 202 Document Space has High Dimensionality
What happens beyond three dimensions?
Similarity still has to do with how many tokens are shared.
More terms -> harder to understand which subsets of words are shared among similar documents.
One approach to handling high dimensionality: Clustering

16 UCB SIMS 202 Text Clustering
Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others

17 UCB SIMS 202 Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
(Figure axes: Term 1, Term 2)

18 UCB SIMS 202 Text Clustering
Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
(Figure axes: Term 1, Term 2)

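19 UCB SIMS 202 Pair-wise Document Similarity
Terms: nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids: A B C D
Nonzero term counts, one row per line (the term column of each count is not preserved in this transcript):
1 3 1
5 2
2 1 5
4 1
How to compute document similarity?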

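20 UCB SIMS 202 Pair-wise Document Similarity (no normalization for simplicity)
Terms: nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids: A B C D
Nonzero term counts, one row per line (the term column of each count is not preserved in this transcript):
1 3 1
5 2
2 1 5
4 1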
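
Without normalization, a natural similarity score is the inner product of the two term vectors: multiply the weights of each shared term and add up the results. The formula itself did not survive in this transcript, so the sketch below is an assumption, and the term-to-count assignments in the example documents are hypothetical rather than read from the slide.

```python
def dot_similarity(doc_a, doc_b):
    """Sum of products of weights over the terms both documents contain."""
    # Iterate over the smaller document; only shared terms contribute.
    if len(doc_b) < len(doc_a):
        doc_a, doc_b = doc_b, doc_a
    return sum(weight * doc_b[term] for term, weight in doc_a.items() if term in doc_b)

# Hypothetical sparse documents (term -> count).
doc_a = {"nova": 1, "galaxy": 3, "heat": 1}
doc_b = {"nova": 5, "galaxy": 2}
doc_c = {"film": 1, "role": 5}

print(dot_similarity(doc_a, doc_b))  # 1*5 + 3*2 = 11
print(dot_similarity(doc_a, doc_c))  # no shared terms, so 0
```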

21 UCB SIMS 202 Pair-wise Document Similarity (cosine normalization)
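
Cosine normalization divides the inner product by the lengths of both vectors, so long documents are not favored merely for having larger counts. A minimal sketch, assuming sparse term-to-weight dictionaries; the function name and example documents are hypothetical.

```python
import math

def cosine_similarity(doc_a, doc_b):
    """Inner product of two sparse vectors divided by the product of their lengths."""
    dot = sum(weight * doc_b.get(term, 0.0) for term, weight in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents: `longer` is `short` with every count multiplied by 10.
short = {"nova": 1, "galaxy": 3}
longer = {"nova": 10, "galaxy": 30}
other = {"film": 2}

print(cosine_similarity(short, longer))
print(cosine_similarity(short, other))
```

Scaling every count in a document by the same factor leaves its cosine similarity to any other document unchanged, which is the point of the normalization.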

22 UCB SIMS 202 Document/Document Matrix

23 UCB SIMS 202 Agglomerative Clustering
(Figure labels: A B C D E F G H I)

24 UCB SIMS 202 Agglomerative Clustering
(Figure labels: A B C D E F G H I)

25 UCB SIMS 202 Agglomerative Clustering
(Figure labels: A B C D E F G H I)
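
Agglomerative clustering starts with every document in its own cluster and repeatedly merges the two most similar clusters. The slides show only figures here, so the sketch below makes its own assumptions: it uses single-link similarity (the closest pair of members), stops when `k` clusters remain, and takes a hypothetical precomputed similarity table.

```python
from itertools import combinations

def single_link(cluster_a, cluster_b, sim):
    """Similarity of two clusters = similarity of their most similar pair of members."""
    return max(sim[frozenset((a, b))] for a in cluster_a for b in cluster_b)

def agglomerate(items, sim, k):
    """Merge bottom-up until only k clusters remain."""
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > k:
        # Pick the pair of clusters with the highest single-link similarity.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], sim))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical pair-wise similarities for four documents.
sim = {
    frozenset("AB"): 0.9, frozenset("CD"): 0.8,
    frozenset("AC"): 0.1, frozenset("AD"): 0.1,
    frozenset("BC"): 0.1, frozenset("BD"): 0.1,
}
result = agglomerate("ABCD", sim, k=2)
print(result)
```

Recording the order of merges, rather than stopping at `k`, would give the full bottom-up hierarchy.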

26 UCB SIMS 202 K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
    take a small sample
    group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
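
Steps 3 and 4 can be sketched as a loop that assigns documents to their nearest center and stops when assignments no longer change. Several details are assumptions not stated on the slide: the initial centers are passed in (standing in for the result of step 2), "nearest" is measured by squared Euclidean distance on dense vectors, and each center is recomputed as the mean of its members between passes.

```python
def nearest(vector, centers):
    """Index of the center with the smallest squared Euclidean distance to vector."""
    distances = [sum((x - c) ** 2 for x, c in zip(vector, center)) for center in centers]
    return distances.index(min(distances))

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(vectors, initial_centers, max_rounds=10):
    centers = [list(c) for c in initial_centers]  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [nearest(v, centers) for v in vectors]
        if new_assignment == assignment:
            break  # no document changed cluster, so stop repeating
        assignment = new_assignment
        for i in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = mean(members)
    return assignment, centers

# Hypothetical 2-D document vectors and seed centers.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, centers = k_means(vectors, [[0, 0], [10, 10]])
print(assignment, centers)
```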

27 UCB SIMS 202 Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”

28 UCB SIMS 202 S/G Example: query on “star”
Encyclopedia text
Cluster sizes and themes:
  14 sports
  8 symbols
  47 film, tv
  68 film, tv (p)
  7 music
  97 astrophysics
  67 astronomy (p)
  12 stellar phenomena
  10 flora/fauna
  49 galaxies, stars
  29 constellations
  7 miscellaneous
Clustering and re-clustering is entirely automated


32 UCB SIMS 202 Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
“Project” these onto a 2D graphical representation:

33 UCB SIMS 202 Clustering Multi-Dimensional Document Space (image from Wise et al 95)

34 UCB SIMS 202 Clustering Multi-Dimensional Document Space (image from Wise et al 95)

35 UCB SIMS 202 Concept “Landscapes”
(Figure labels: Pharmacology, Anatomy, Legal, Disease, Hospitals)
(e.g., Lin, Chen, Wise et al.)
Too many concepts, or too coarse
Single concept per document
No titles
Browsing without search

36 UCB SIMS 202 Clustering
Advantages:
  See some main themes
Disadvantage:
  Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and facets?

