1 Advanced Multimedia Text Clustering Tamara Berg

2 Reminder - Classification: given some labeled training documents, determine the best label for a test (query) document.

3–6 What if we don’t have labeled data? We can’t do classification. What can we do?
– Clustering: the assignment of objects into groups (called clusters) so that objects in the same cluster are more similar to each other than to objects in different clusters.
– Often similarity is assessed according to a distance measure.
– Clustering is a common technique for statistical data analysis, used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

9 Any of the similarity metrics we talked about before can be used (e.g., SSD or the angle between vectors).

10 Document Clustering Clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar.

11 [figure slide. Source: Hinrich Schutze]

16 [figure slide: examples of clustering in the wild – Google News story clusters, Flickr clusters]

17 [figure slide. Source: Hinrich Schutze]

18 How to cluster documents

19 Reminder - Vector Space Model
– Documents are represented as vectors in term space.
– A vector distance/similarity measure between two documents is used to compare documents.
Slide from Mitch Marcus
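A minimal illustrative sketch (not from the lecture) of the "angle between vectors" measure mentioned on slide 9: cosine similarity over term-count vectors. The vectors for A and G below are the ones given on slide 24.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)

# Term-count vectors for documents A and G (from slide 24):
A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
print(cosine_similarity(A, G))  # ~0.49
```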

20 Document Vectors: one location for each word.

      nova  galaxy  heat  h'wood  film  role  diet  fur
  A    10     5      3
  B     5    10
  C          10      8      7
  D                         9     10     5
  E                                      10    10
  F                               9     10
  G     5            7                    9
  H                               6     10     2     8
  I                               7      5     1     3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A. (Blank means 0 occurrences.) Slide from Mitch Marcus

21 Document Vectors: the same term-by-document matrix as above; the letters A–I are document ids. Slide from Mitch Marcus

22 TF x IDF Calculation: each document A is represented as a vector of term weights [w1, w2, w3, …, wn]. Slide from Mitch Marcus
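The TF x IDF formula on this slide is an image lost in the transcript; the sketch below shows one common formulation (an assumption, since several variants exist: raw vs. length-normalized tf, smoothed idf, etc.).

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """One common TF x IDF variant.

    tf  = term frequency, normalized by document length
    idf = log(N / df): rare terms get large weights, ubiquitous terms ~0
    """
    tf = term_count / doc_length
    idf = math.log(n_docs / doc_freq)
    return tf * idf

# Hypothetical numbers: a term occurring 10 times in an 18-term document,
# appearing in 3 of the 9 documents in the collection.
print(tf_idf(10, 18, 9, 3))  # ~0.61
```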

23 Features: a document A can likewise be represented by any feature vector [f1, f2, f3, …, fn]. Define whatever features you like:
– Length of the longest string of CAPs
– Number of $'s
– Useful words for the task
– …

24 Similarity between documents
A = [10 5 3 0 0 0 0 0]
G = [5 0 7 0 0 9 0 0]
E = [0 0 0 0 0 10 10 0]
Using the Sum of Squared Distances (SSD):
SSD(A,G) = ? SSD(A,E) = ? SSD(G,E) = ?
Which pair of documents is the most similar?
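Working the quiz out (a minimal sketch; under SSD, smaller means more similar):

```python
def ssd(a, b):
    """Sum of squared distances (squared Euclidean distance) between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

print(ssd(A, G))  # 147
print(ssd(A, E))  # 334
print(ssd(G, E))  # 175
# A and G are the most similar pair (smallest SSD).
```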

25 [figure slide. Source: Hinrich Schutze]

26 [figure slide. source: Dan Klein]

27–28 K-means clustering
We want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

D = \sum_{k=1}^{K} \sum_{x_i \in \text{cluster } k} \| x_i - m_k \|^2

source: Svetlana Lazebnik
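A minimal NumPy sketch of the algorithm (not the lecture's own code; initializing centers at k randomly chosen data points is one common choice). The rss_history list records the objective at each iteration, so the monotone decrease used in the convergence proof below can be checked directly.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid recomputation.

    X: (n, d) array of points. Returns (assignments, centers, rss_history).
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init: k random points
    rss_history = []
    for _ in range(n_iter):
        # Assignment step: send each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        assign = d2.argmin(axis=1)
        rss_history.append(d2[np.arange(len(X)), assign].sum())
        # Recomputation step: move each center to the mean of its cluster.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # fixed point: assignments and centers no longer change
        centers = new_centers
    return assign, centers, rss_history

# Example with the slide-24 document vectors:
X = np.array([[10, 5, 3, 0, 0, 0, 0, 0],
              [5, 0, 7, 0, 0, 9, 0, 0],
              [0, 0, 0, 0, 0, 10, 10, 0]], dtype=float)
assign, centers, rss = kmeans(X, k=2)
```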

40 [figure slide. source: Dan Klein]

43–49 Convergence of K-Means
K-means converges to a fixed point in a finite number of iterations. Proof sketch:
– The sum of squared distances (RSS) decreases during reassignment (because each vector is moved to a closer centroid).
– RSS decreases during recomputation (because the new centroid is the mean of its cluster, which minimizes the within-cluster RSS).
– Thus: we must reach a fixed point.
But we don’t know how long convergence will take! If we don’t care about a few docs switching back and forth, convergence is usually fast (< 10–20 iterations). Source: Hinrich Schutze
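The recomputation step deserves one line the slides leave implicit: for a fixed cluster $\omega$, the within-cluster sum $\sum_{x \in \omega} \|x - m\|^2$ is minimized over $m$ by the cluster mean, since setting the gradient $-2\sum_{x \in \omega}(x - m)$ to zero gives

$m = \mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x,$

so replacing each centroid by its cluster mean can only lower (or preserve) the RSS.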

50 [figure slide. source: Dan Klein]

52 [figure slide. Source: Hinrich Schutze]

54 Hierarchical clustering strategies
Agglomerative clustering:
– Start with each point in a separate cluster.
– At each iteration, merge two of the “closest” clusters.
Divisive clustering:
– Start with all points grouped into a single cluster.
– At each iteration, split the “largest” cluster.
source: Svetlana Lazebnik
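A minimal sketch of the agglomerative strategy (single linkage is chosen here for concreteness; the slide leaves "closest" unspecified). It takes any pairwise distance function, e.g. the ssd function from the slide-24 sketch.

```python
def agglomerative(points, target_k, dist):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    Single linkage: cluster distance = distance of the closest pair of members.
    """
    clusters = [[p] for p in points]  # start with each point in its own cluster
    while len(clusters) > target_k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# With the slide-24 vectors, A and G merge first (smallest SSD, 147):
# agglomerative([A, G, E], target_k=2, dist=ssd)  ->  [[A, G], [E]]
```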

55 [figure slide. source: Dan Klein]

57 Divisive Clustering
– Top-down (instead of bottom-up, as in agglomerative clustering).
– Start with all docs in one big cluster.
– Then recursively split clusters.
– Eventually each node forms a cluster on its own.
Source: Hinrich Schutze

58 Flat or hierarchical clustering?
– For high efficiency, use flat clustering (e.g., k-means).
– For deterministic results, use hierarchical clustering.
– When a hierarchical structure is desired, use a hierarchical algorithm.
– Hierarchical clustering can also be applied when K cannot be determined in advance (you can start without knowing K).
Source: Hinrich Schutze

59 For Thursday: read Chapter 6 of the textbook.

